mmap_dataset.py

Tags: source, data, pytorch, performance

File Path: src/data/mmap_dataset.py

Purpose: High-performance PyTorch Dataset backed by numpy.memmap, allowing training on datasets larger than RAM.

Overview

Instead of loading thousands of small files (the lazy approach) or holding the whole dataset in RAM, this class maps a single giant binary file (train_X.mmap) into virtual memory. The OS then pages data in and out of RAM as needed.
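As a rough sketch of how the flat file and its metadata might be produced (the actual preprocessing script is not part of this module; the feature dimension and the "y" key inside y.npz are assumptions):

    import numpy as np

    # Illustrative only: flatten variable-length samples into one contiguous
    # float32 file, plus the metadata arrays the dataset expects.
    samples = [np.random.rand(np.random.randint(20, 60), 75).astype("float32")
               for _ in range(10)]                 # hypothetical (frames, features) samples
    labels = np.arange(10)

    lens = np.array([s.shape[0] for s in samples])
    flat = np.concatenate(samples, axis=0)         # shape: (sum(lens), 75)

    mm = np.memmap("train_X.mmap", dtype="float32", mode="w+", shape=flat.shape)
    mm[:] = flat
    mm.flush()

    np.save("X_shape.npy", np.array(flat.shape))
    np.save("X_map_samples_lens.npy", lens)
    np.savez("y.npz", y=labels)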

Class MmapKArSLDataset

Inherits: torch.utils.data.Dataset

__init__

Logic:

  1. Loads metadata:

    • X_shape.npy: The overall shape of the giant flat array.
    • y.npz: Labels array.
    • X_map_samples_lens.npy: Length of each sample within the giant array.
  2. Memmap: Creates a read-only view (mode="r") of the data.

    self.X = np.memmap(data_path, dtype="float32", mode="r", shape=X_shape)
  3. Offset Calculation: Pre-computes the start index (X_offsets) of every sample so that random access is O(1) (see the sketch after this list).
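A minimal sketch of this initialization logic, assuming the file names above, a "y" key inside y.npz, and that the sampler/augmentor are passed in by the caller:

    import numpy as np
    import torch
    from torch.utils.data import Dataset

    class MmapKArSLDataset(Dataset):
        def __init__(self, data_dir, sampler=None, augmentor=None):
            # 1. Metadata: overall array shape, labels, per-sample lengths.
            X_shape = tuple(int(d) for d in np.load(f"{data_dir}/X_shape.npy"))
            self.y = np.load(f"{data_dir}/y.npz")["y"]          # "y" key is an assumption
            self.lens = np.load(f"{data_dir}/X_map_samples_lens.npy")

            # 2. Read-only memory-mapped view of the flat data file.
            self.X = np.memmap(f"{data_dir}/train_X.mmap",
                               dtype="float32", mode="r", shape=X_shape)

            # 3. Start offset of every sample -> O(1) random access.
            self.X_offsets = np.concatenate(([0], np.cumsum(self.lens)[:-1]))

            # How the sampler/augmentor are attached is an assumption.
            self.tsn_sampler = sampler      # e.g. a TSNSampler instance
            self.augmentor = augmentor      # e.g. a DataAugmentor instance

        def __len__(self):
            return len(self.lens)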

__getitem__(index)

Logic:

  1. Uses index to look up start_offset and length.
  2. Slices the memmap (a zero-copy operation): raw = self.X[start:start+len].
  3. Applies TSNSampler to obtain a fixed-size clip.
  4. Applies DataAugmentor (see the sketch after this list).
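Continuing the sketch from __init__; TSNSampler and DataAugmentor are the project's own helpers, so how they are invoked here is an assumption:

        def __getitem__(self, index):
            # 1. Locate the sample inside the flat array.
            start = self.X_offsets[index]
            length = self.lens[index]

            # 2. Zero-copy slice of the memmap; the OS reads pages lazily.
            raw = self.X[start:start + length]

            # 3./4. Fixed-size temporal sampling, then augmentation (both optional here).
            out = self.tsn_sampler(raw) if self.tsn_sampler is not None else raw
            out = self.augmentor(out) if self.augmentor is not None else out

            # Copy into a tensor so the read-only memmap buffer is not reused downstream.
            return torch.tensor(np.asarray(out), dtype=torch.float32), int(self.y[index])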

Performance Note

TIP

This is the recommended dataset for training on high-performance clusters or machines with fast NVMe SSDs. It significantly increases GPU utilization by removing CPU and I/O bottlenecks.
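For example, a typical training-side setup might look like the following (batch size, worker count, and the data directory are illustrative):

    from torch.utils.data import DataLoader

    dataset = MmapKArSLDataset("data/processed")    # directory path is illustrative
    loader = DataLoader(dataset,
                        batch_size=64,
                        shuffle=True,
                        num_workers=8,      # memmap views are cheap to use from worker processes
                        pin_memory=True)    # speeds up host-to-GPU copies

    for clips, labels in loader:
        ...  # forward/backward pass on the GPU goes here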

Depends On:

Used By: