
lm-datasets

lm-datasets is a collection of datasets for language model training, including scripts for downloading, preprocessing, and sampling.

The documentation is available here.

Quick start

Installation

Install the lm-datasets package with pip:

pip install lm-datasets

To keep the default installation minimal, lm-datasets ships several features as optional dependencies. For example, if you want text extraction support for all available datasets, run:

pip install lm-datasets[datasets]

Download and text extraction

To download one or more datasets and extract their plain text, run the following command:

lm_datasets extract_text $DATASET_ID $OUTPUT_DIR
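
If you want to script the extraction of several datasets, one option is to invoke the command once per dataset ID, as in the following Python sketch. The IDs used here are taken from the "Token count by source" table below and are assumed, not confirmed, to be valid values for $DATASET_ID:

import subprocess

# Assumption: the source names listed under "Token count by source" below
# are valid dataset IDs for extract_text.
dataset_ids = ["wikinews", "wikiquote", "wikivoyage"]
output_dir = "./extracted"  # corresponds to $OUTPUT_DIR

for dataset_id in dataset_ids:
    # Call the documented CLI once per dataset and fail fast on errors.
    subprocess.run(
        ["lm_datasets", "extract_text", dataset_id, output_dir],
        check=True,
    )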

By default, the output is saved as JSONL files. To change the output format, use the --output_format argument as below:

lm_datasets extract_text $DATASET_ID $OUTPUT_DIR --output_format parquet --output_compression zstd
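
The extracted JSONL files can be inspected with the Python standard library alone. A minimal sketch, assuming one JSON object per line with the extracted text stored under a "text" key (the exact schema is an assumption and may differ):

import json
from pathlib import Path

output_dir = Path("./extracted")  # the $OUTPUT_DIR used above

# Iterate over all JSONL files produced by extract_text.
for path in output_dir.glob("*.jsonl"):
    with path.open(encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            # Assumption: the plain text is stored under a "text" key.
            text = record.get("text", "")
            print(path.name, len(text))

Parquet output can be read analogously, e.g. with pandas.read_parquet, which can handle zstd-compressed files when pyarrow is installed.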

Available datasets

A list of all available datasets and their statistics can be printed with the following command:

lm_datasets print_stats --print_output md

Token count by language

Language  Tokens
bg        53 B
ca        5 B
code      250 B
cs        128 B
da        34 B
de        795 B
el        108 B
en        6 T
es        674 B
et        15 B
eu        696 M
fi        55 B
fr        655 B
ga        767 M
gl        70 M
hr        8 B
hu        179 B
it        386 B
lt        24 B
lv        14 B
mt        4 B
nl        238 B
nn        307 M
no        9 B
pl        223 B
pt        187 B
ro        77 B
sh        2 M
sk        47 B
sl        11 B
sr        10 B
sv        89 B
uk        47 B

Token count by source

Source                            Tokens
academic_slovene_kas              1 B
bgnc_admin_eur                    79 M
bgnc_news_corpus                  18 M
brwac                             3 B
bulgarian_news                    283 M
bulnc                             567 M
cabernet                          712 M
cc_gigafida                       127 M
colossal_oscar                    208 B
croatian_news_engri               695 M
curlicat                          410 M
danewsroom                        472 M
danish_gigaword                   1 B
dewac                             2 B
dialogstudio                      0
dk_clarin                         441 M
enc2021                           0
estonian_reference_corpus         175 M
eurlex                            121 B
euscrawl                          423 M
ga_bilingual_legistation          4 M
ga_universal_dependencies         3 M
greek_legal_code                  45 M
greek_web_corpus                  3 B
hrwac                             1 B
itwac                             2 B
korpus_malti                      366 M
legal_mc4                         29 B
macocu                            23 B
marcell_legislative_subcorpus_v2  31 M
norwegian_cc                      5 B
openlegaldata                     10 B
oscar                             9 T
oscar_opengptx                    245 B
parlamento_pt                     819 M
pes2o                             42 B
pl_nkjp                           1 M
pl_parliamentary_corpus           671 M
proof_pile                        8 B
redpajama                         46 B
seimas_lt_en                      48 k
sk_court_decisions                11 B
sk_laws                           45 M
slwac_web                         1 B
sonar                             500 M
sonar_new_media                   36 M
spanish_legal                     3 B
srpkor                            0
starcoder                         250 B
state_related_latvian_web         1 M
styria_news                       409 M
sv_gigaword                       1 B
syn_v9                            5 B
uk_laws                           579 M
wiki                              12 B
wikibooks                         353 M
wikihow                           2 M
wikinews                          79 M
wikiquote                         268 M
wikisource                        2 B
wikivoyage                        132 M
ylenews                           0

Dataset viewer

We provide a web-based Streamlit application for browsing all datasets and their text content. To start the app, clone this repository, install the dependencies, and run the following command:

# clone is needed since streamlit does not support apps from modules yet
git clone https://github.com/malteos/lm-datasets.git

streamlit run src/lm_datasets/viewer/app.py -- \
    --raw_datasets_dir=$RAW_DATASETS_DIR \
    --output_dir=$PROCESSED_DATASET_DIR

Development & Contributions

Setup environment

To set up your local development environment, we recommend using conda and cloning the repository. The repository also includes settings and launch scripts for VSCode.

git clone git@github.com:malteos/lm-datasets.git
cd lm-datasets

conda create -n lm-datasets python=3.10
conda activate lm-datasets

pip install -r requirements.txt

Alternatively, you can install the Python package directly from the dev branch:

pip install git+https://github.com/malteos/lm-datasets.git@dev

Install the pre-commit hooks

This repository uses git hooks to validate code quality and formatting.

pre-commit install
git config --bool flake8.strict true  # Makes the commit fail if flake8 reports an error

To run the hooks:

pre-commit run --all-files

Testing

The tests can be executed with:

pytest --doctest-modules --cov-report term --cov=lm_datasets

Acknowledgements

The work on the lm-datasets software is partially funded by the German Federal Ministry for Economic Affairs and Climate Action (BMWK) through the project OpenGPT-X (project no. 68GX21007D).

License

Apache 2.0

(Please note that the individual datasets are released under their own licenses.)

