Python3 library for converting between various LLM dataset formats.
Project description
The llm-dataset-converter converts between various dataset formats for large language models (LLMs). Filters can also be supplied, e.g., to clean up the data.
Dataset formats:

* pairs: alpaca (r/w), csv (r/w), jsonl (r/w), parquet (r/w), tsv (r/w)
* pretrain: csv (r/w), jsonl (r/w), parquet (r/w), tsv (r/w), txt (r/w)
* translation: csv (r/w), jsonl (r/w), parquet (r/w), tsv (r/w), txt (r/w)

Compression formats:

* bzip
* gzip
* xz
* zstd
Examples:
Simple conversion with logging info:
llm-convert \
  from-alpaca \
    -l INFO \
    --input ./alpaca_data_cleaned.json \
  to-csv-pr \
    -l INFO \
    --output alpaca_data_cleaned.csv
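To make the effect of such a conversion concrete, here is a minimal standard-library sketch of turning Alpaca-style JSON into CSV. It does not use or reproduce the llm-dataset-converter code: the function name `alpaca_to_csv` is hypothetical, the field names follow the public Alpaca dataset layout, and the CSV header chosen here is an assumption that may differ from what `to-csv-pr` writes.

```python
import csv
import io
import json

# Field names of the public Alpaca dataset layout. Using them as the CSV
# header is an assumption made for this sketch.
ALPACA_FIELDS = ["instruction", "input", "output"]


def alpaca_to_csv(alpaca_json: str) -> str:
    """Convert an Alpaca-style JSON array into CSV text (illustrative only)."""
    records = json.loads(alpaca_json)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=ALPACA_FIELDS, lineterminator="\n")
    writer.writeheader()
    for record in records:
        # Missing fields become empty strings; unknown keys are ignored.
        writer.writerow({name: record.get(name, "") for name in ALPACA_FIELDS})
    return buffer.getvalue()


sample = json.dumps([
    {"instruction": "Add the numbers.", "input": "1, 2", "output": "3"},
    {"instruction": "Say hello.", "input": "", "output": "Hello!"},
])
csv_text = alpaca_to_csv(sample)
print(csv_text)
```

Values containing a comma are quoted by the `csv` module, so the record structure survives the round trip through a spreadsheet or another CSV reader.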
Automatic decompression/compression (based on file extension):
llm-convert \
  from-alpaca \
    --input ./alpaca_data_cleaned.json.xz \
  to-csv-pr \
    --output alpaca_data_cleaned.csv.gz
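The idea of picking a (de)compressor from the file extension can be sketched with the Python standard library. This is an illustration under stated assumptions, not the library's implementation: `open_by_extension` and the `OPENERS` mapping are hypothetical names, the `.bz2` extension for bzip is assumed, and zstd is left out because it typically requires a third-party package.

```python
import bz2
import gzip
import lzma
import os
import tempfile

# Hypothetical extension-to-opener mapping (assumption: bzip files end in .bz2).
OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}


def open_by_extension(path: str, mode: str = "rt"):
    """Open a text file, transparently (de)compressing based on its extension."""
    extension = os.path.splitext(path)[1].lower()
    opener = OPENERS.get(extension, open)  # fall back to a plain file
    return opener(path, mode, encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "data.json.xz")
    target = os.path.join(tmp, "data.csv.gz")

    # Create a small xz-compressed input file.
    with open_by_extension(source, "wt") as fp:
        fp.write("hello\n")

    # Read the xz file and write its content gzip-compressed.
    with open_by_extension(source) as src, open_by_extension(target, "wt") as dst:
        dst.write(src.read())

    # Verify with gzip directly that the target really is gzip-compressed.
    with gzip.open(target, "rt", encoding="utf-8") as fp:
        round_trip = fp.read()
```

Because the choice depends only on the file name, the same reader and writer code works for compressed and uncompressed files alike.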
Filtering:
llm-convert \
  -l INFO \
  from-alpaca \
    -l INFO \
    --input alpaca_data_cleaned.json \
  keyword \
    -l INFO \
    --keyword function \
    --location any \
    --action keep \
  to-alpaca \
    -l INFO \
    --output alpaca_data_cleaned-filtered.json
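One plausible reading of the keyword filter above is "keep only records that mention the keyword anywhere". The sketch below illustrates that reading in plain Python. It is not the library's code: the function `keyword_filter`, the case-insensitive substring matching, the `discard` action, and the treatment of `location` as either `any` or a single field name are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Iterator

# Fields of an Alpaca-style pair record.
FIELDS = ("instruction", "input", "output")


def keyword_filter(
    records: Iterable[Dict[str, str]],
    keywords: Iterable[str],
    location: str = "any",
    action: str = "keep",
) -> Iterator[Dict[str, str]]:
    """Yield records depending on whether any keyword occurs in them.

    Assumptions: matching is a case-insensitive substring test, location is
    "any" or one field name, and action is "keep" or "discard".
    """
    wanted = {keyword.lower() for keyword in keywords}
    fields = FIELDS if location == "any" else (location,)
    for record in records:
        text = " ".join(record.get(field, "") for field in fields).lower()
        found = any(keyword in text for keyword in wanted)
        # keep: pass matching records; discard: pass non-matching records.
        if found == (action == "keep"):
            yield record


records = [
    {"instruction": "Write a function that adds two numbers.",
     "input": "", "output": "def add(a, b): return a + b"},
    {"instruction": "Name a color.", "input": "", "output": "Blue"},
]
kept = list(keyword_filter(records, ["function"]))
discarded = list(keyword_filter(records, ["function"], action="discard"))
```

Writing the filter as a generator means records are processed one at a time, which suits large datasets that should not be held in memory at once.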
Changelog
0.0.2 (2023-10-31)
added text-stats filter
stream writers accept iterable of data records now as well to improve throughput
text_utils.apply_max_length now uses simple whitespace splitting instead of searching for nearest word boundary to break a line, which results in a massive speed improvement
fix: text_utils.remove_patterns no longer multiplies the generated lines when using more than one pattern
added remove_patterns filter
pretrain and translation text writers now buffer records by default (-b, --buffer_size) in order to improve throughput
jsonlines writers for pair, pretrain and translation data are now stream writers
0.0.1 (2023-10-26)
initial release
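The 0.0.2 entry about text_utils.apply_max_length describes breaking lines via simple whitespace splitting. The following sketch shows what such an approach can look like. It is a hypothetical reconstruction, not the library's code: the name `wrap_by_whitespace`, its signature, and the choice to leave overlong words on a line of their own are assumptions.

```python
from typing import List


def wrap_by_whitespace(line: str, max_length: int) -> List[str]:
    """Greedily pack whitespace-separated words into lines of at most
    max_length characters. A single word longer than max_length is kept
    on its own line rather than being split."""
    lines: List[str] = []
    current: List[str] = []
    current_length = 0
    for word in line.split():
        # One extra character for the joining space, except for the first word.
        extra = len(word) + (1 if current else 0)
        if current and current_length + extra > max_length:
            lines.append(" ".join(current))
            current = [word]
            current_length = len(word)
        else:
            current.append(word)
            current_length += extra
    if current:
        lines.append(" ".join(current))
    return lines


wrapped = wrap_by_whitespace("the quick brown fox jumps", 10)
print(wrapped)
```

The text is split once and each word is visited once, so the cost grows linearly with the input length, which is in line with the speed improvement the changelog mentions.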
Download files
Source Distribution
File details
Details for the file llm-dataset-converter-0.0.2.tar.gz.
File metadata
- Download URL: llm-dataset-converter-0.0.2.tar.gz
- Upload date:
- Size: 66.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.8.10
File hashes
Algorithm | Hash digest
---|---
SHA256 | 0a75f8c1bef12aa90f9184184c3534a35590c82626cb62ce9c9b8985073366f0
MD5 | 81b61722b63e646fe2aa8a1789cbaa83
BLAKE2b-256 | 564e899366eddf224716ec63b66e285e58ee36161d63aca3b6e58cd972aa11d4