Lazy-loading HF Datasets sourced from AWS S3 buckets and a chunking text document tokenizer.

Python Module for Huggingface Datasets from S3

This Python module provides tools to work seamlessly with data stored in Amazon S3 buckets, specifically designed for creating Huggingface datasets.Dataset instances. It includes two primary components: S3Dataset for creating datasets from S3 objects, and a generator utility for lazily tokenizing text data, facilitating domain adaptation for language models.

Features

  • S3Dataset: An adapter for creating lazily-loaded Huggingface datasets from S3 bucket contents. Supports filtering by key prefixes, explicit key lists, or dataset identifiers for flexible dataset creation.
  • TextDS2TokensGenerator: A utility for generating tokens from dataset items on-the-fly, reducing startup latency and network traffic during the preparation of large datasets for training or fine-tuning language models.

Installation

Ensure you have Python 3.6+ installed. Install the module and its dependencies via pip:

pip install s3datasets

# or from source
# pip install "git+https://github.com/stevemadere/s3-datasets.git@v2.1.1"

Quick Start

  1. Configure AWS Credentials: Set up your AWS credentials via environment variables (e.g. AWS_PROFILE) or AWS configuration files to access your S3 buckets.

  2. Create an S3Dataset Instance: Initialize with a bucket name and selection criteria (prefix, key list, or dataset ID).

from s3datasets import S3TextDataset

# Example: Create dataset from all objects with a specific prefix
s3_dataset: S3TextDataset = S3TextDataset(bucket_name="my_bucket", prefix="my_data/")
  3. Convert to Huggingface Dataset: Use the to_full_dataset() method to obtain a dataset instance, ready for use with Huggingface's datasets library.

import datasets

my_hf_dataset: datasets.Dataset = s3_dataset.to_full_dataset()

  4. Lazy Tokenization: Use TextDS2TokensGenerator to prepare your data for model training without downloading all of the data before training begins.

import datasets
from tokengenerators import TextDS2TokensGenerator

# tokenizer is a Huggingface tokenizer that you have already loaded
generator: TextDS2TokensGenerator = TextDS2TokensGenerator(my_hf_dataset, tokenizer, chunk_len=2048, min_stride=64, max_waste=64)
training_ds: datasets.IterableDataset = datasets.IterableDataset.from_generator(generator)
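
Step 1 above relies on the standard AWS credential lookup. As a minimal sketch, the helper below reports which of the common environment-variable mechanisms is set before you construct a dataset. The name describe_aws_credential_source is hypothetical and not part of s3datasets. It only inspects AWS_PROFILE, AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, so credentials that come from configuration files or other providers are not detected.

```python
import os
from typing import Mapping, Optional


def describe_aws_credential_source(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Hypothetical helper: name the environment-based credential source, if any."""
    profile = env.get("AWS_PROFILE")
    if profile:
        return f"profile '{profile}'"
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        return "static access keys"
    # Nothing set in the environment; the AWS SDK may still find
    # credentials in its configuration files or elsewhere.
    return None


source = describe_aws_credential_source()
print(source or "no credentials found in environment variables")
```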

Usage

S3Dataset

S3Dataset facilitates the creation of datasets from S3. It supports several ways of selecting bucket objects and loads data lazily to reduce network traffic and startup delays. It can also pre-process the content of the S3 objects, interpreting them as binary data, text, or JSON that is automatically decoded into objects.
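
To make the three interpretations concrete, here is a minimal sketch of decoding the raw bytes of one S3 object. The function decode_s3_body and its kind argument are hypothetical illustrations, not the module's API; the module itself selects the interpretation through the class you instantiate.

```python
import json
from typing import Any


def decode_s3_body(raw: bytes, kind: str) -> Any:
    """Hypothetical helper: interpret raw object bytes as binary, text, or JSON."""
    if kind == "binary":
        return raw                              # bytes passed through unchanged
    if kind == "text":
        return raw.decode("utf-8")              # utf-8 text
    if kind == "json":
        return json.loads(raw.decode("utf-8"))  # decoded into Python objects
    raise ValueError(f"unknown kind: {kind!r}")


text_value = decode_s3_body("héllo".encode("utf-8"), "text")
json_value = decode_s3_body(b'{"title": "doc", "pages": 3}', "json")
```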

Initialization

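# S3Dataset and S3JSONDataset are assumed to be importable from s3datasets, like S3TextDataset
from s3datasets import S3Dataset, S3TextDataset, S3JSONDataset

# Direct specification with a list of keys
s3dataset = S3Dataset(bucket_name="my_bucket", key_list=["file1.txt", "file2.txt"])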

# Using a dataset ID pointing to a JSON-encoded list of keys
s3dataset = S3Dataset(bucket_name="my_bucket", dataset_id="path/to/key_list.json")

# For either of the examples above, use the subclass S3TextDataset or S3JSONDataset to get utf-8 text or objects respectively

s3textdataset = S3TextDataset(bucket_name="my_bucket", key_list=["textdocs/file1.txt", "textdocs/file2.txt"])
s3jsondataset = S3JSONDataset(bucket_name="my_bucket", key_list=["objects/o1.json", "objects/o2.json"])
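
The dataset_id form above points at a JSON-encoded list of keys. As a rough sketch, assuming that file is a flat JSON array of key strings, such a document could be produced as follows. The helper encode_key_list is hypothetical, and uploading the result to the bucket is left to your usual S3 tooling.

```python
import json
from typing import List


def encode_key_list(keys: List[str]) -> str:
    """Hypothetical helper: serialize S3 object keys as a JSON array of strings."""
    for key in keys:
        if not isinstance(key, str):
            raise TypeError(f"S3 keys must be strings, got {type(key).__name__}")
    return json.dumps(keys, indent=2)


key_list_json = encode_key_list(["textdocs/file1.txt", "textdocs/file2.txt"])
print(key_list_json)
```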

Converting to huggingface dataset

import datasets
from typing import Any

t_dataset: datasets.Dataset = s3textdataset.to_full_dataset()
some_text: str = t_dataset[0]['text']

o_dataset: datasets.Dataset = s3jsondataset.to_full_dataset()
an_object: Any = o_dataset[0]['obj']

TextDS2TokensGenerator

The TextDS2TokensGenerator is a tool for converting a dataset of text documents into tokenized chunks of fixed length suitable for LLM training with minimal training startup delay. It was designed to ease the process of domain-adapting a large language model (LLM) from a corpus of documents of varying sizes but typically exceeding the context window used for training.

It can be used with Dataset.from_generator() to lazily build a dataset of token sequences of a specified fixed length, loading and tokenizing data just in time as the trainer processes it.
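
The lazy pattern described here can be illustrated with plain Python. The sketch below is not the TextDS2TokensGenerator implementation: LazyChunkGenerator and toy_tokenize are hypothetical stand-ins, and the chunks do not overlap. It only shows that a callable returning a generator tokenizes each document when the consumer reaches it, not up front.

```python
from typing import Callable, Dict, Iterable, Iterator, List


class LazyChunkGenerator:
    """Hypothetical stand-in for a lazy tokenizing generator (not the library class)."""

    def __init__(self, docs: Iterable[str], tokenize: Callable[[str], List[int]], chunk_len: int):
        self.docs = docs
        self.tokenize = tokenize
        self.chunk_len = chunk_len

    def __call__(self) -> Iterator[Dict[str, List[int]]]:
        for text in self.docs:
            ids = self.tokenize(text)  # runs only when this document is reached
            # Simplification: non-overlapping chunks, incomplete tail dropped
            for start in range(0, len(ids) - self.chunk_len + 1, self.chunk_len):
                yield {"input_ids": ids[start:start + self.chunk_len]}


tokenized_docs: List[str] = []


def toy_tokenize(text: str) -> List[int]:
    """Toy tokenizer that records which documents have been processed."""
    tokenized_docs.append(text)
    return [ord(ch) for ch in text]


gen = LazyChunkGenerator(["abcdefgh", "ijklmnop"], toy_tokenize, chunk_len=4)
chunks = gen()
first_chunk = next(chunks)
docs_seen_after_first_chunk = list(tokenized_docs)
```

After the first chunk is drawn, only the first document has been tokenized; the second is untouched until the consumer asks for more.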

Key Features

  • Efficient Tokenization: Generates token sequences lazily, saving significant memory when working with large datasets.
  • Flexible Document Handling: Capable of slicing long documents into fixed-length chunks with configurable overlap and waste thresholds, ensuring comprehensive coverage of the text data.
  • Versatile Dataset Compatibility: Works seamlessly with both IterableDataset and regular Dataset instances from the Huggingface datasets library.
  • Adaptive Stride: Tokenizes text documents with "adaptive stride" ensuring the longest continuous context possible to maximize real learning.

For example, when producing tokenized chunks of 4k tokens, a document that tokenizes to 4k+min_stride+1 tokens is split into two chunks of exactly 4k tokens, one anchored at the beginning and the other anchored at the end, with whatever overlap is necessary to achieve that. The same holds for any larger document up to 8k-min_stride tokens. Once the tokenized length exceeds 8k-min_stride, the number of chunks increases to 3, with substantial stride (overlap) between the chunks. In this way, the chunks are always 4k long for efficient training, and all text is seen by the model with sufficient prefix context (as defined by min_stride).
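
The arithmetic in this example can be sketched in a few lines. This is an illustration of the idea and not the module's actual algorithm: the function chunk_starts is hypothetical, it assumes that a tail of at most max_waste tokens is discarded, that documents shorter than chunk_len are skipped, and that chunk starts are spread evenly between the beginning and the end of the document.

```python
import math
from typing import List


def chunk_starts(n_tokens: int, chunk_len: int, min_stride: int, max_waste: int) -> List[int]:
    """Hypothetical sketch: start offsets of fixed-length chunks over one document."""
    if n_tokens < chunk_len:
        return []  # assumption: too-short documents are skipped
    if n_tokens - chunk_len <= max_waste:
        return [0]  # assumption: a small tail is discarded as waste
    # Smallest chunk count k with k*chunk_len - (k-1)*min_stride >= n_tokens
    step = chunk_len - min_stride
    k = 1 + math.ceil((n_tokens - chunk_len) / step)
    last_start = n_tokens - chunk_len  # final chunk is anchored at the end
    # Spread the starts evenly from 0 to last_start using integer arithmetic
    return [(i * last_start) // (k - 1) for i in range(k)]


two_chunks = chunk_starts(4096 + 64 + 1, chunk_len=4096, min_stride=64, max_waste=64)
still_two = chunk_starts(8192 - 64, chunk_len=4096, min_stride=64, max_waste=64)
three_chunks = chunk_starts(8192 - 64 + 1, chunk_len=4096, min_stride=64, max_waste=64)
```

Under these assumptions, a document of 8k-min_stride tokens still fits in two chunks that overlap by exactly min_stride, and one more token pushes the count to three.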

Usage

To utilize the TextDS2TokensGenerator, initialize it with your dataset, tokenizer, and configuration for chunk length, stride, and waste. Here is a basic example:

from tokengenerators import TextDS2TokensGenerator
from transformers import AutoTokenizer
from datasets import Dataset, load_dataset

# Load your dataset (select a single split so that a Dataset, not a DatasetDict, is returned)
dataset = load_dataset('path_to_your_dataset', split='train')

# Initialize your tokenizer
tokenizer = AutoTokenizer.from_pretrained('your_preferred_model')

# Create the TextDS2TokensGenerator instance
generator = TextDS2TokensGenerator(
    source_dataset=dataset,
    base_tokenizer=tokenizer,
    text_field_name="text",  # The field in your dataset containing text documents
    chunk_len=4096,          # Desired token sequence length
    min_stride=64,           # Minimum stride between chunks
    max_waste=64,            # Maximum allowed tokens to waste per chunk
    include_all_keys=False   # Whether to include all other keys from the original dataset items
)

# Generate the tokenized dataset
tokenized_dataset = Dataset.from_generator(generator)

Customization and Advanced Usage

  • Chunk Length (chunk_len): Adjust this parameter to match the input size expected by your model.
  • Stride (min_stride): Set the minimum overlap between consecutive chunks so that every token is trained on with sufficient preceding context.
  • Waste (max_waste): Fine-tune the balance between coverage and efficiency by specifying the maximum number of tokens that can be disregarded (never seen during training) at the end of a document.
  • Inclusion of Original Keys (include_all_keys): Optionally include all key-value pairs from the original dataset items in the tokenized output, excluding the text itself. (This is mostly used for debugging and profiling)

Testing

  1. Copy the file example.env to .env and customize its contents.
  2. Run pytest.


Contributing

Contributions are welcome! Please submit pull requests or open issues to suggest improvements or report bugs.

License

Apache 2.0
