
Data Forager

Enabling random access to large datasets on disk for PyTorch training and other use cases.

Why Data Forager?

When training on large datasets (billions of tokens), you face a dilemma:

Option 1: Load into memory

  • Works for small datasets
  • Doesn't scale — a 1B-token corpus needs gigabytes of RAM for the raw text alone

Option 2: Streaming / Iterable datasets

  • Scales to any size
  • But: no true random shuffling (only buffer-based approximation)
  • More complex handling: can't use len(dataset), unclear epoch boundaries, custom resumption logic needed
  • Can't use standard PyTorch DataLoader(shuffle=True)

Why shuffling matters: True random shuffling reduces gradient variance, prevents learning dataset ordering artifacts, and is especially important when mixing multiple data sources.

Data Forager's solution: Build a compact byte-offset index that enables O(1) random access to any sample via seek(). Your training code stays simple — large datasets work exactly like small ones:

from torch.utils.data import DataLoader
from data_forager.datasets.jsonl import JsonlDataset

# Same code for 1K samples or 1B samples
dataset = JsonlDataset.create_from_index_on_filesystem('./data')
loader = DataLoader(dataset, batch_size=32, shuffle=True)  # True random shuffling!

for batch in loader:
    ...

No special iteration logic, no buffer management, no epoch hacks.

Quick Start

Use Case 1: Random Access to JSONL Files

from data_forager.indexers.jsonl_indexer import create_default_jsonl_indexer
from data_forager.datasets.jsonl import JsonlDataset
from torch.utils.data import DataLoader

# One-time indexing (run once, reuse forever)
indexer = create_default_jsonl_indexer('./data')
indexer()
# Creates: ./data/index/file_location.txt, ./data/index/sample_locations.bin

# Training: random access with standard DataLoader
dataset = JsonlDataset.create_from_index_on_filesystem('./data')
loader = DataLoader(dataset, batch_size=32, shuffle=True, collate_fn=list)  # keep each batch as a list of dicts

for batch in loader:
    # batch is a list of dicts (parsed JSON objects)
    texts = [sample['text'] for sample in batch]
    ...

Use Case 2: Tokenized Samples for Language Model Training

from data_forager.indexers.tokenization_indexer import create_tokenize_and_index_jsonl_text_func
from data_forager.datasets.tokens import TokensDataset
from torch.utils.data import DataLoader
from transformers import AutoTokenizer
import numpy as np

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B")

# One-time: tokenize JSONL text and create fixed-length samples
indexer = create_tokenize_and_index_jsonl_text_func(
    input_base_path='./corpus',
    tokenizer_func=tokenizer.encode,
    eos_idx=tokenizer.eos_token_id,
    sample_size=1024,  # Fixed context length
)
indexer()
# Creates: ./corpus/tokenized-samples/*.bin, ./corpus/index/*

# Training: fixed-length token sequences ready for NTP
dataset = TokensDataset.create_from_index_on_filesystem(
    './corpus',
    token_dtype=np.uint16,
)
loader = DataLoader(dataset, batch_size=8, shuffle=True)

for batch in loader:
    # batch shape: (8, 1024) — ready for next-token prediction
    input_ids = batch[:, :-1]
    labels = batch[:, 1:]
    ...
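
From here a training step is the standard shifted cross-entropy. A minimal sketch, assuming model is any PyTorch module that maps (batch, seq_len) token IDs to logits of shape (batch, seq_len, vocab_size):

import torch.nn.functional as F

for batch in loader:
    batch = batch.long()       # token IDs as int64 for embedding lookup
    input_ids = batch[:, :-1]  # (8, 1023)
    labels = batch[:, 1:]      # (8, 1023): inputs shifted left by one
    logits = model(input_ids)  # assumed shape: (8, 1023, vocab_size)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
    loss.backward()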

How It Works

Data Forager uses a two-phase approach:

Phase 1: Indexing (One-Time)

Scan through your data files and record the byte offset of each sample:

sample_locations.bin:
┌─────────────┬─────────────┬───────────┐
│ file_index  │ byte_offset │ num_bytes │
│   uint64    │   uint64    │  uint64   │
├─────────────┼─────────────┼───────────┤
│     0       │      0      │    156    │  ← Sample 0: file 0, bytes 0-155
│     0       │    156      │    203    │  ← Sample 1: file 0, bytes 156-358
│     1       │      0      │    189    │  ← Sample 2: file 1, bytes 0-188
│    ...      │    ...      │    ...    │
└─────────────┴─────────────┴───────────┘

Memory footprint: 24 bytes per sample. A 1M-sample dataset needs only ~24 MB for the index.
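
Building the index is a single sequential pass per file. A minimal sketch of the idea for one JSONL file, assuming native byte order and the record layout shown above (the file name below is made up, and the library's indexers additionally handle multiple files, recursion, and the index folder layout for you):

import numpy as np

# Three uint64 fields per record, 24 bytes total
record_dtype = np.dtype([
    ('file_index', np.uint64),
    ('byte_offset', np.uint64),
    ('num_bytes', np.uint64),
])

def scan_jsonl(path, file_index):
    # Record (file_index, byte_offset, num_bytes) for every line
    records, offset = [], 0
    with open(path, 'rb') as f:
        for line in f:  # one sample per line, newline included
            records.append((file_index, offset, len(line)))
            offset += len(line)
    return np.array(records, dtype=record_dtype)

# Illustrative only; the real index lives under ./data/index/
scan_jsonl('./data/part0.jsonl', 0).tofile('sample_locations.bin')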

Phase 2: Random Access (Training)

When you request dataset[idx]:

  1. Look up (file_index, byte_offset, num_bytes) from the index
  2. seek() to that position in the file
  3. read() exactly num_bytes
  4. Parse and return the sample

This is O(1) regardless of dataset size — no scanning, no loading everything into memory.
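
In code, a lookup is just a seek and a bounded read. A simplified sketch of the same four steps, not the library's actual implementation (in practice file handles are opened once via initialize() and reused):

import json

def get_sample(idx, locations, file_paths):
    # 1. Look up the sample's location in the index
    file_index, byte_offset, num_bytes = locations[idx]
    # 2. + 3. Seek to the offset and read exactly num_bytes
    with open(file_paths[int(file_index)], 'rb') as f:
        f.seek(int(byte_offset))
        raw = f.read(int(num_bytes))
    # 4. Parse and return the sample (JSONL case)
    return json.loads(raw)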

Note: Linux will cache frequently accessed data in the page cache when sufficient RAM is available, further improving performance.

Components

Index Stores

IndexStoreInterface — Protocol defining how indices are stored and loaded.
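
As a rough illustration only, such a protocol might take the following shape; the method names and signatures here are hypothetical, not the library's actual API:

import numpy as np
from typing import Protocol, Sequence, Tuple

class IndexStoreSketch(Protocol):
    # Hypothetical contract: persist and retrieve the data file list
    # plus the (file_index, byte_offset, num_bytes) records
    def save(self, file_paths: Sequence[str], sample_locations: np.ndarray) -> None: ...
    def load(self) -> Tuple[Sequence[str], np.ndarray]: ...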

IndexStore (filesystem-based) — Default implementation storing indices as files:

  • file_location.txt — List of data file paths
  • sample_locations.bin — Binary array of (file_index, byte_offset, num_bytes) tuples

from data_forager.index_stores.fs_based import IndexStore

# Used internally by indexers; rarely needed directly
store = IndexStore(base_path='./data', index_data_folder='index')

Datasets

All datasets implement __len__ and __getitem__, making them compatible with PyTorch DataLoader.

Dataset — Abstract base class providing:

  • create_from_index_on_filesystem(base_path) — Load index and create dataset
  • initialize() — Open file handles (called automatically on first access)
  • Random access via dataset[idx] or dataset[start:stop:step]
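
For example (whether a slice returns a list or some other sequence type is assumed here, not confirmed):

from data_forager.datasets.jsonl import JsonlDataset

dataset = JsonlDataset.create_from_index_on_filesystem('./data')
print(len(dataset))       # total number of indexed samples
first = dataset[0]        # single sample by index
subset = dataset[0:10:2]  # every second sample among the first ten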

JsonlDataset — Returns parsed JSON dicts:

from data_forager.datasets.jsonl import JsonlDataset

dataset = JsonlDataset.create_from_index_on_filesystem('./data')
sample = dataset[0]  # Returns: {'text': '...', 'source': '...', ...}

TokensDataset — Returns numpy arrays of token IDs:

from data_forager.datasets.tokens import TokensDataset
import numpy as np

dataset = TokensDataset.create_from_index_on_filesystem(
    './corpus',
    token_dtype=np.uint16,
)
sample = dataset[0]  # Returns: np.array([1534, 892, 2041, ...], dtype=uint16)

Indexers

FileTextLinesIndexer — Base indexer for line-based text files. Scans files and records byte offsets for each line.

create_default_jsonl_indexer(input_base_path) — Creates an indexer for JSONL files:

from data_forager.indexers.jsonl_indexer import create_default_jsonl_indexer

indexer = create_default_jsonl_indexer('./data')
indexer()  # Indexes all .jsonl files recursively

create_tokenize_and_index_jsonl_text_func(...) — Creates an indexer that:

  1. Reads JSONL files
  2. Extracts text (default: sample['text'])
  3. Tokenizes using your tokenizer
  4. Packs into fixed-length samples with EOS separation (see the sketch below)
  5. Stores as binary files and builds index

from data_forager.indexers.tokenization_indexer import create_tokenize_and_index_jsonl_text_func

indexer = create_tokenize_and_index_jsonl_text_func(
    input_base_path='./corpus',
    tokenizer_func=tokenizer.encode,  # Your tokenizer
    eos_idx=tokenizer.eos_token_id,  # EOS token ID
    sample_size=1024,  # Fixed context length (None for variable)
    token_dtype=np.uint16,  # Token storage dtype
)
indexer()
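
Step 4, the packing, can be pictured as concatenating all documents into one EOS-separated token stream and slicing it into sample_size chunks. A simplified sketch of that idea, not the library's actual code:

import numpy as np

def pack_tokens(token_lists, eos_idx, sample_size, token_dtype=np.uint16):
    # Concatenate all documents into one stream, EOS after each document
    stream = []
    for tokens in token_lists:
        stream.extend(tokens)
        stream.append(eos_idx)
    stream = np.asarray(stream, dtype=token_dtype)
    # Slice into fixed-length samples; the ragged tail is dropped here
    n = len(stream) // sample_size
    return stream[:n * sample_size].reshape(n, sample_size)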

Installation

pip install data-forager

Or install from source:

git clone https://github.com/visionscaper/data-forager.git
cd data-forager
pip install -e .

Requirements

  • Python >= 3.9
  • numpy
  • tqdm
  • basics (visionscaper-pybase)

License

MIT License
