
lm-datasets


lm-datasets is a collection of datasets for language model training, including scripts for downloading, preprocessing, and sampling.

The documentation is available here.

Quick start

Installation

Install the lm-datasets package with pip:

pip install lm-datasets

To keep the default installation minimal, lm-datasets declares optional dependencies for some use cases. For example, to enable text extraction for all available datasets, run:

pip install lm-datasets[datasets]

Download and text extraction

To download one or more datasets and extract their plain text, run the following command:

lm_datasets extract_text $DATASET_ID $OUTPUT_DIR

By default, output is saved as JSONL files, i.e., one JSON record per line.
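The extracted records can be inspected with Python's standard json module. A minimal sketch, assuming a hypothetical output file name (the actual names depend on the dataset ID and on lm-datasets' naming scheme):

import json

# Hypothetical path; actual file names depend on the dataset ID
# and on lm-datasets' naming scheme.
with open("output/wiki.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)  # one JSON object per line
        print(record)
        break  # inspect only the first record

To change the output format, use the --output_format argument as shown below: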

lm_datasets extract_text $DATASET_ID $OUTPUT_DIR --output_format parquet --output_compression zstd
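Parquet files written this way can be read back with pandas (which requires pyarrow or fastparquet). Since zstd is applied as a Parquet-internal compression codec, no separate decompression step is needed. The file name below is again only an assumption:

import pandas as pd

# Hypothetical path; actual file names depend on the dataset ID.
df = pd.read_parquet("output/wiki.parquet")  # zstd is decompressed transparently
print(df.head())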

Available datasets

A table of all available datasets can be printed with the following command:

lm_datasets print_stats --print_output md

Token count by language

Language Tokens
bg 53 B
ca 5 B
code 250 B
cs 128 B
da 34 B
de 795 B
el 108 B
en 6 T
es 674 B
et 15 B
eu 696 M
fi 55 B
fr 655 B
ga 767 M
gl 70 M
hr 8 B
hu 179 B
it 386 B
lt 24 B
lv 14 B
mt 4 B
nl 238 B
nn 307 M
no 9 B
pl 223 B
pt 187 B
ro 77 B
sh 2 M
sk 47 B
sl 11 B
sr 10 B
sv 89 B
uk 47 B

Token count by source

Source Tokens
academic_slovene_kas 1 B
bgnc_admin_eur 79 M
bgnc_news_corpus 18 M
brwac 3 B
bulgarian_news 283 M
bulnc 567 M
cabernet 712 M
cc_gigafida 127 M
colossal_oscar 208 B
croatian_news_engri 695 M
curlicat 410 M
danewsroom 472 M
danish_gigaword 1 B
dewac 2 B
dialogstudio 0
dk_clarin 441 M
enc2021 0
estonian_reference_corpus 175 M
eurlex 121 B
euscrawl 423 M
ga_bilingual_legistation 4 M
ga_universal_dependencies 3 M
greek_legal_code 45 M
greek_web_corpus 3 B
hrwac 1 B
itwac 2 B
korpus_malti 366 M
legal_mc4 29 B
macocu 23 B
marcell_legislative_subcorpus_v2 31 M
norwegian_cc 5 B
openlegaldata 10 B
oscar 9 T
oscar_opengptx 245 B
parlamento_pt 819 M
pes2o 42 B
pl_nkjp 1 M
pl_parliamentary_corpus 671 M
proof_pile 8 B
redpajama 46 B
seimas_lt_en 48 k
sk_court_decisions 11 B
sk_laws 45 M
slwac_web 1 B
sonar 500 M
sonar_new_media 36 M
spanish_legal 3 B
srpkor 0
starcoder 250 B
state_related_latvian_web 1 M
styria_news 409 M
sv_gigaword 1 B
syn_v9 5 B
uk_laws 579 M
wiki 12 B
wikibooks 353 M
wikihow 2 M
wikinews 79 M
wikiquote 268 M
wikisource 2 B
wikivoyage 132 M
ylenews 0

Dataset viewer

We provide a web-based application built with Streamlit for browsing all datasets and their text content. To start the app, first clone this repository, install its dependencies, and run the following command:

# Cloning is needed since Streamlit does not yet support running apps from installed modules
git clone https://github.com/malteos/lm-datasets.git

streamlit run src/lm_datasets/viewer/app.py -- \
    --raw_datasets_dir=$RAW_DATASETS_DIR \
    --output_dir=$PROCESSED_DATASET_DIR

Development & Contributions

Setup environment

To set up your local development environment, we recommend conda and cloning the repository. The repository also includes settings and launch scripts for VSCode.

git clone git@github.com:malteos/lm-datasets.git
cd lm-datasets

conda create -n lm-datasets python=3.10
conda activate lm-datasets

pip install -r requirements.txt

Alternatively, you can install the Python package directly from the dev branch:

pip install git+https://github.com/malteos/lm-datasets.git@dev

Install the pre-commit hooks

This repository uses git hooks to validate code quality and formatting.

pre-commit install
git config --bool flake8.strict true  # Makes the commit fail if flake8 reports an error

To run the hooks:

pre-commit run --all-files

Testing

The tests can be executed with:

pytest --doctest-modules --cov-report term --cov=lm_datasets

Acknowledgements

The work on the lm-datasets software is partially funded by the German Federal Ministry for Economic Affairs and Climate Action (BMWK) through the project OpenGPT-X (project no. 68GX21007D).

License

Apache 2.0

(Please note that the individual datasets are released under their own licenses.)
