lm-datasets

lm-datasets is a collection of datasets for language model training, including scripts for downloading, preprocessing, and sampling.

Installation

pip install lm-datasets

Usage

To download and extract the plain text of one or more datasets, run the following command:

python -m lm_datasets.extract_plaintext $DATASET_ID $OUTPUT_DIR
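
For scripted runs over several datasets, the command above can be wrapped in a few lines of Python. This is a minimal sketch, not part of lm-datasets: the helper names `build_extract_command` and `extract_all` are hypothetical, and because this README does not say whether several IDs can be passed in a single call, the sketch simply invokes the documented command once per dataset ID.

```python
import subprocess
import sys
from typing import List


def build_extract_command(dataset_id: str, output_dir: str) -> List[str]:
    """Return the argv list for the documented extract_plaintext command."""
    # Uses the current interpreter so the call runs in the active environment.
    return [sys.executable, "-m", "lm_datasets.extract_plaintext", dataset_id, output_dir]


def extract_all(dataset_ids: List[str], output_dir: str) -> None:
    """Run the extraction once per dataset ID, stopping at the first failure."""
    for dataset_id in dataset_ids:
        # check=True raises CalledProcessError if the command exits non-zero.
        subprocess.run(build_extract_command(dataset_id, output_dir), check=True)
```

Calling `extract_all([...], "./output")` with real dataset IDs (see "Available datasets" for how to list them) would then run the extractions one after another.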

By default, output is saved as JSONL files. To change the output format or compression, use the --output_format and --output_compression arguments:

python -m lm_datasets.extract_plaintext $DATASET_ID $OUTPUT_DIR --output_format parquet  --output_compression zstd
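
The default JSONL output can be read back with the Python standard library alone. The sketch below makes several assumptions that this README does not confirm: that the output files carry a `.jsonl` suffix, sit directly in the output directory, and are uncompressed. It makes no assumption about the fields inside each record, and the function name `iter_jsonl_records` is hypothetical.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_jsonl_records(output_dir: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of every *.jsonl file."""
    # Assumption: files end in ".jsonl" and are stored uncompressed.
    for path in sorted(Path(output_dir).glob("*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:  # skip blank lines
                    yield json.loads(line)
```

Iterating lazily keeps memory use flat even for large extractions; inspect the keys of the first record to learn the actual schema before relying on any field name.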

Available datasets

A list or table of all available datasets can be printed with the following command:

python -m lm_datasets.print_stats --print_output md

Dataset viewer

We provide a web-based application, built with Streamlit, to browse all datasets and their text content. To start the app, run the following command:

streamlit run viewer/app.py $RAW_DATASETS_DIR $PROCESSED_DATASET_DIR

Development & Contributions

Setup environment

git clone git@github.com:malteos/lm-datasets.git
cd lm-datasets

conda create -n lm-datasets python=3.10
conda activate lm-datasets

pip install -r requirements.txt

Install the pre-commit hooks

This repository uses git hooks to validate code quality and formatting.

pre-commit install
git config --bool flake8.strict true  # Makes the commit fail if flake8 reports an error

To run the hooks:

pre-commit run --all-files

Testing

The tests can be executed with:

pytest --doctest-modules --cov-report term --cov=lm_datasets

License

Apache 2.0
