Python3 library for converting between various LLM dataset formats.
Project description
The llm-dataset-converter converts between various dataset formats for large language models (LLMs). Filters can also be supplied, e.g., for cleaning up the data.
Dataset formats:
pairs: alpaca (r/w), csv (r/w), jsonl (r/w), parquet (r/w), tsv (r/w)
pretrain: csv (r/w), jsonl (r/w), parquet (r/w), tsv (r/w), txt (r/w)
translation: csv (r/w), jsonl (r/w), parquet (r/w), tsv (r/w), txt (r/w)
Compression formats:
bzip
gzip
xz
zstd
Examples:
Simple conversion with logging info:
llm-convert \
  from-alpaca \
    -l INFO \
    --input ./alpaca_data_cleaned.json \
  to-csv-pr \
    -l INFO \
    --output alpaca_data_cleaned.csv
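For readers who want to see what such a conversion amounts to, the following is a minimal standalone sketch in plain Python. It does not use the llm-dataset-converter API; the function name alpaca_to_csv is hypothetical, and the CSV column names are an assumption based on the usual Alpaca fields (instruction, input, output), so the tool's actual output may differ.

```python
import csv
import io
import json

# Field names commonly found in Alpaca-style JSON files (assumed column layout).
ALPACA_FIELDS = ["instruction", "input", "output"]


def alpaca_to_csv(alpaca_json: str) -> str:
    """Convert an Alpaca-style JSON array into CSV text with a header row."""
    records = json.loads(alpaca_json)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=ALPACA_FIELDS, lineterminator="\n")
    writer.writeheader()
    for record in records:
        # Missing fields become empty cells; unexpected extra keys are ignored.
        writer.writerow({name: record.get(name, "") for name in ALPACA_FIELDS})
    return buffer.getvalue()


sample = json.dumps([
    {"instruction": "Add the numbers.", "input": "1, 2", "output": "3"},
    {"instruction": "Say hello.", "input": "", "output": "Hello!"},
])
print(alpaca_to_csv(sample))
```

Fields that contain the delimiter are quoted by the csv module, so the second column of the first record is written as "1, 2".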
Automatic decompression/compression (based on file extension):
llm-convert \
  from-alpaca \
    --input ./alpaca_data_cleaned.json.xz \
  to-csv-pr \
    --output alpaca_data_cleaned.csv.gz
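Choosing a codec from the file extension can be sketched with the Python standard library alone. This is an illustration of the idea, not the library's implementation: the helper open_text and the OPENERS table are hypothetical names, and zstd is left out to keep the sketch free of extra dependencies.

```python
import bz2
import gzip
import lzma
import os
import tempfile

# Maps a file extension to the stdlib function that opens that format.
# zstd is omitted here so that the sketch has no extra dependencies.
OPENERS = {
    ".bz2": bz2.open,
    ".gz": gzip.open,
    ".xz": lzma.open,
}


def open_text(path: str, mode: str = "r"):
    """Open a text file, (de)compressing transparently based on its extension."""
    extension = os.path.splitext(path)[1].lower()
    # Unknown extensions fall back to the built-in open (no compression).
    opener = OPENERS.get(extension, open)
    # The "t" suffix makes the compressed openers work with str instead of bytes.
    return opener(path, mode + "t", encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    for name in ("data.csv", "data.csv.gz", "data.csv.xz", "data.csv.bz2"):
        path = os.path.join(tmp, name)
        with open_text(path, "w") as handle:
            handle.write("instruction,output\n")
        with open_text(path) as handle:
            assert handle.read() == "instruction,output\n"
```

Because reader and writer resolve their codec independently, input and output can use different compression formats, as in the command above.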
Filtering:
llm-convert \
  -l INFO \
  from-alpaca \
    -l INFO \
    --input alpaca_data_cleaned.json \
  keyword \
    -l INFO \
    --keyword function \
    --location any \
    --action keep \
  to-alpaca \
    -l INFO \
    --output alpaca_data_cleaned-filtered.json
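The effect of a keyword filter on pairs data can be approximated by the sketch below. It is not the library's code: the function keyword_filter is hypothetical, the parameter names merely mirror the --keyword, --location and --action flags from the command above, and several details are assumptions (case-insensitive substring matching, a "discard" action as the opposite of "keep", and a field name as an alternative to "any").

```python
from typing import Dict, Iterable, Iterator, List

Record = Dict[str, str]


def keyword_filter(
    records: Iterable[Record],
    keywords: List[str],
    location: str = "any",
    action: str = "keep",
) -> Iterator[Record]:
    """Yield records depending on whether a keyword occurs in them.

    location: "any" searches all fields, otherwise the named field (assumption).
    action: "keep" yields matching records, "discard" the others (assumption).
    """
    wanted = [keyword.lower() for keyword in keywords]
    for record in records:
        if location == "any":
            text = " ".join(record.values())
        else:
            text = record.get(location, "")
        # Assumed matching rule: case-insensitive substring search.
        found = any(keyword in text.lower() for keyword in wanted)
        if found == (action == "keep"):
            yield record


data = [
    {"instruction": "Write a function that adds two numbers.", "input": "", "output": "def add(a, b): return a + b"},
    {"instruction": "Name a primary color.", "input": "", "output": "Red"},
]
kept = list(keyword_filter(data, ["function"], location="any", action="keep"))
print(len(kept))
```

Writing the filter as a generator lets it sit between a reader and a writer without holding the whole dataset in memory.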
Changelog
0.0.3 (2023-11-10)
added the record-window filter
added the llm-registry tool for querying the registry from the command-line
added the replace_patterns method to ldc.text_utils module
added the replace-patterns filter
added -p/--pretty-print flag to to-alpaca writer
added pairs-to-llama2 and llama2-to-pairs filters (since llama2 has the instruction as part of the string, it is treated as pretrain data)
added to-llama2-format filter for pretrain records (no [INST]…[/INST] block)
now requiring seppl>=0.0.8 in order to raise Exceptions when encountering unknown arguments
0.0.2 (2023-10-31)
added text-stats filter
stream writers now also accept an iterable of data records in order to improve throughput
text_utils.apply_max_length now uses simple whitespace splitting instead of searching for nearest word boundary to break a line, which results in a massive speed improvement
fix: text_utils.remove_patterns no longer multiplies the generated lines when using more than one pattern
added remove-patterns filter
pretrain and translation text writers now buffer records by default (-b, --buffer_size) in order to improve throughput
jsonlines writers for pair, pretrain and translation data are now stream writers
0.0.1 (2023-10-26)
initial release
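The 0.0.2 entry above mentions that text_utils.apply_max_length switched to simple whitespace splitting. The sketch below shows one way such splitting can work; it is not the library's implementation, the function name split_by_whitespace is hypothetical, and keeping over-long words whole is an assumption.

```python
from typing import List


def split_by_whitespace(line: str, max_length: int) -> List[str]:
    """Break a line into chunks of at most max_length characters at whitespace."""
    chunks: List[str] = []
    current = ""
    for word in line.split():
        candidate = word if not current else current + " " + word
        if len(candidate) <= max_length or not current:
            # Either the word fits, or it is a single over-long word that
            # cannot be split at whitespace and is therefore kept whole.
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


print(split_by_whitespace("the quick brown fox jumps", 10))
# → ['the quick', 'brown fox', 'jumps']
```

Each word is visited once, so the cost grows linearly with the length of the line.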
Download files
Source Distribution
Hashes for llm-dataset-converter-0.0.3.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 1a8b672cf98562ced9a36e29bc83d3dbffdba948e4028a7f8f23296dc98464a1
MD5 | 98ba93f12187a137b613745a5dabdee7
BLAKE2b-256 | e6afeeeea076c484798aacece884f31f094c4ca10c97bafd26eee7b25d739090