lm-datasets
lm-datasets is a collection of datasets for language model training, including scripts for downloading, preprocessing, and sampling.
Installation
pip install lm-datasets
Usage
To download one or more datasets and extract their plain text, run the following command:
python -m lm_datasets.extract_plaintext $DATASET_ID $OUTPUT_DIR
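To process several datasets from a script, the command above can be wrapped in a small helper. This is a minimal sketch, not part of lm-datasets: the function names and the dataset IDs are hypothetical, and it assumes one invocation per dataset ID, since the syntax for passing multiple IDs in a single call is not described here.

```python
import subprocess
import sys


def build_extract_command(dataset_id, output_dir):
    """Return the argument list for one extract_plaintext invocation."""
    return [
        sys.executable,
        "-m",
        "lm_datasets.extract_plaintext",
        dataset_id,
        output_dir,
    ]


def extract_all(dataset_ids, output_dir, dry_run=False):
    """Run one extraction per dataset ID and return the commands used.

    With dry_run=True the commands are only built, not executed.
    """
    commands = [build_extract_command(d, output_dir) for d in dataset_ids]
    if not dry_run:
        for cmd in commands:
            # check=True raises CalledProcessError if the extraction fails
            subprocess.run(cmd, check=True)
    return commands


# Hypothetical dataset IDs; replace them with IDs from the list of available datasets.
for cmd in extract_all(["dataset_a", "dataset_b"], "output", dry_run=True):
    print(" ".join(cmd))
```

Using sys.executable keeps the subprocess in the same Python environment in which lm-datasets was installed.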
By default, output is saved as JSONL files. To change the output format, use the --output_format argument, optionally together with --output_compression, as shown below:
python -m lm_datasets.extract_plaintext $DATASET_ID $OUTPUT_DIR --output_format parquet --output_compression zstd
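The default JSONL output can be read back with the standard library alone. The following is a rough sketch under stated assumptions: the files are assumed to sit directly in the output directory with a .jsonl extension, and the record schema is not documented here, so the code yields each decoded line as-is without relying on any field name.

```python
import json
from pathlib import Path


def iter_jsonl(path):
    """Yield one decoded JSON value per non-empty line of a JSONL file."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def iter_output_dir(output_dir, pattern="*.jsonl"):
    """Yield records from every matching file in output_dir, in sorted file order.

    The "*.jsonl" pattern is an assumption about how output files are named.
    """
    for file_path in sorted(Path(output_dir).glob(pattern)):
        yield from iter_jsonl(file_path)
```

Because both functions are generators, records are streamed one at a time, which avoids loading a large dataset into memory.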
Available datasets
A list or table of all available datasets can be printed with the following command:
python -m lm_datasets.print_stats --print_output md
Dataset viewer
We provide a web-based Streamlit application for browsing all datasets and their text content. To start the app, run the following command:
streamlit run viewer/app.py $RAW_DATASETS_DIR $PROCESSED_DATASET_DIR
Development & Contributions
Setup environment
git clone git@github.com:malteos/lm-datasets.git
cd lm-datasets
conda create -n lm-datasets python=3.10
conda activate lm-datasets
pip install -r requirements.txt
Install the pre-commit hooks
This repository uses git hooks to validate code quality and formatting.
pre-commit install
git config --bool flake8.strict true # Makes the commit fail if flake8 reports an error
To run the hooks:
pre-commit run --all-files
Testing
The tests can be executed with:
pytest --doctest-modules --cov-report term --cov=lm_datasets
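Because of the --doctest-modules flag, pytest also collects the examples embedded in module docstrings and runs them as tests. A minimal sketch of such a docstring test follows; count_words is a hypothetical helper used only for illustration and is not part of lm_datasets.

```python
def count_words(text):
    """Count the whitespace-separated tokens in a string.

    >>> count_words("language model datasets")
    3
    >>> count_words("")
    0
    """
    return len(text.split())
```

Each line starting with >>> is executed, and the line below it is compared with the actual result.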
License
Apache 2.0