
Dextro: Dataset Indexing for Blazing Fast Random Access

Dextro is a streamlined indexing toolkit for large, multi-file text datasets. It enables O(1) random access to any dataset sample through memory mapping, eliminating the need to preload data. The toolkit is aimed at researchers and developers working with extensive language datasets, offering a significant gain in processing and training flexibility without altering the original data format.

Motivation

The ongoing revolution in artificial intelligence, particularly in large language models (LLMs), relies heavily on extensive language datasets. However, these datasets often come in simple, non-indexed formats such as JSON Lines, which complicates data handling: fast access requires loading the entire dataset into RAM, streaming is limited to sequential reads, and the lack of an index constrains processing and training flexibility.

Dextro addresses these challenges by enabling the efficient indexing of large, multi-file datasets without altering the original data. The index tracks the start and end positions of each sample within its source file, along with optional metadata for enhanced filtering capabilities. Through memory mapping, Dextro achieves O(1) random access to any record across multiple files, significantly improving data handling efficiency.
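
To make the mechanism concrete, the following is a simplified sketch of index-based random access in general (not Dextro's internal code): given a sample's source file and byte offsets, a memory map lets the operating system page in only the touched range.

import json
import mmap

# Simplified illustration of byte-range lookup via an index entry
# (filename, start, end); only the sliced range is read from disk.
def read_sample(path: str, start: int, end: int) -> dict:
    with open(path, 'rb') as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            raw = mm[start:end]  # constant-time slice into the mapped file
        finally:
            mm.close()
    return json.loads(raw)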

Getting Started

Installation

Install Dextro easily via pip:

pip install dextro

Install with all dependencies
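
The exact extras group is not documented here; assuming a hypothetical all extra, pip's extras syntax would look like this:

pip install 'dextro[all]'  # 'all' is a hypothetical extra name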

Index Your Dataset

Dextro works with datasets in JSON Lines format, split across multiple files. To index such a dataset, organize your files as follows:

dataset/
    part001.jsonl
    part002.jsonl
    ...
    part999.jsonl

Example content (dataset/part001.jsonl):

{"text": "first item", ...}
{"text": "second item", ...}

Run the following command to index your dataset, creating an index.parquet file in the dataset folder:

dextro create-index dataset/

The index file stores the filename and the start and end positions of each sample, enabling efficient random access.
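
Because the index is a plain Parquet file, it can also be inspected with any Parquet reader. A quick look using polars (the column names are inferred from the description above and may differ):

import polars as pl

# Load the generated index and preview its records; expect one row per
# sample with the source filename and its start/end positions.
index = pl.read_parquet('dataset/index.parquet')
print(index.head())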

Accessing Indexed Datasets

Dextro integrates with PyTorch's Dataset class, allowing for easy loading of indexed datasets. Here's how to sequentially iterate through your dataset:

from tqdm import tqdm
from dextro.dataset import IndexedDataset

dataset = IndexedDataset(data_root='dataset/')

for text in tqdm(dataset):
    pass

To demonstrate random access with shuffling, you can use a DataLoader as follows:

from torch.utils.data import DataLoader

loader = DataLoader(dataset, batch_size=128, shuffle=True)

for batch in tqdm(loader):
    pass

Dextro's memory mapping ensures that only the accessed data is loaded into memory, optimizing resource usage.
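
Since IndexedDataset follows PyTorch's map-style Dataset protocol, individual samples can also be fetched directly by position; a minimal usage sketch:

# Direct random access: only the bytes of the requested record are read.
print(len(dataset))   # total number of indexed samples across all files
sample = dataset[42]  # fetch a single sample by position
print(sample)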

Performance

Thanks to its minimal overhead and efficient data access, Dextro can process large NLP datasets at speeds close to those of reading directly from SSDs. This capability makes it possible to navigate through terabytes of data within minutes, even on consumer-grade storage.

Comparison to 🤗 Datasets

The 🤗 Datasets library also features memory-mapped loading of partitioned datasets. However, as of February 2024, it lacks the capability for random access, and shuffled iteration across a dataset is confined to the limits of an item buffer. Moreover, 🤗 Datasets does not offer the functionality to pre-filter data through a lightweight dataset index.

Advanced Features

Index Enrichers

Dextro supports enrichers to augment index records with additional information, such as metadata derived from the source data or advanced operations like language detection. You can specify enrichers during indexing for enhanced functionality:

dextro create-index dataset/ --enrichers=detect_language

Data Filtering

Dextro allows for advanced data filtering directly on the index, facilitating efficient data selection without explicit loading:

import polars as pl
from dextro.dataset import IndexedDataset

# Example filter: Select texts within a specific character length range
# This assumes that the `TextLength` enricher has been used during indexing
dataset = IndexedDataset(
    data_root='dataset/',
    index_filter=(256 <= pl.col('meta_text_length')) & (pl.col('meta_text_length') <= 1024)
)
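
Filters can also reference metadata produced by enrichers. For example, assuming the detect_language enricher from above writes its result to a meta_language column (the column name is an assumption based on the meta_ prefix convention shown above), an English-only subset could be selected like this:

import polars as pl
from dextro.dataset import IndexedDataset

# Hypothetical filter on enricher metadata; 'meta_language' is an
# assumed column name, not confirmed by the documentation above.
dataset = IndexedDataset(
    data_root='dataset/',
    index_filter=(pl.col('meta_language') == 'en'),
)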

Non-Language Datasets

Dextro can, in principle, work with any data modality, as it doesn't make assumptions about the data representation.

Other Data Formats

By default, Dextro assumes that the dataset is in JSON Lines format. Other formats are supported via the load_fn option of the FileIndexer class; however, records currently have to be separated by line breaks.
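
As a rough illustration (the FileIndexer import path and the exact load_fn signature are assumptions and should be checked against the source), a custom loader for tab-separated records might look like:

# Hypothetical sketch: assumes load_fn receives one raw line and returns
# the decoded record; the import path below is a guess.
# from dextro.indexing import FileIndexer

def load_tsv_line(line: bytes) -> dict:
    record_id, text = line.decode('utf-8').rstrip('\n').split('\t', 1)
    return {'id': record_id, 'text': text}

# indexer = FileIndexer(load_fn=load_tsv_line)  # passed via the load_fn option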

Development

Install Dev Dependencies

poetry install --all-extras --with=dev

Run Tests

pytest tests

Autoformat

ruff format .

Why "Dextro"?

The name "Dextro" is inspired by dextrose, a historic term for glucose and associated with fast energy delivery. This name reflects the toolkit's aim to provide fast, efficient processing and low overhead for dataset handling, mirroring the quick energy boost dextrose is known for.

Dextro is designed to be the optimal solution for managing and accessing large language datasets, enabling rapid and flexible data handling to support the advancement of AI and machine learning research.

