
Python3 library for converting between various LLM dataset formats.

Project description

The llm-dataset-converter library converts between various dataset formats for large language models (LLMs). Filters can be supplied as well, e.g., for cleaning up the data.

Dataset formats:

  • pairs: alpaca (r/w), csv (r/w), jsonl (r/w), parquet (r/w), tsv (r/w)

  • pretrain: csv (r/w), jsonl (r/w), parquet (r/w), tsv (r/w), txt (r/w)

  • translation: csv (r/w), jsonl (r/w), parquet (r/w), tsv (r/w), txt (r/w)

Compression formats:

  • bzip

  • gzip

  • xz

  • zstd
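Compression is keyed off the file extension (see the second example below). As a rough illustration of how such extension-based (de)compression can work, here is a minimal sketch using only the Python standard library; the function name and extension map are illustrative, not the library's actual code, and zstd would additionally require a third-party package such as zstandard:

```python
import bz2
import gzip
import lzma

# Illustrative map of extensions to stdlib open functions.
# (zstd is omitted: it needs a third-party package.)
OPENERS = {
    ".bz2": bz2.open,
    ".gz": gzip.open,
    ".xz": lzma.open,
}

def smart_open(path, mode="rt"):
    """Open a file, transparently (de)compressing based on its extension."""
    for ext, opener in OPENERS.items():
        if path.endswith(ext):
            return opener(path, mode)
    return open(path, mode)
```

With such a helper, writing to `data.jsonl.gz` produces gzip output and reading it back decompresses automatically, which is the behavior the converter exposes via its input/output file names.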

Examples:

Simple conversion with logging info:

llm-convert \
  from-alpaca \
    -l INFO \
    --input ./alpaca_data_cleaned.json \
  to-csv-pr \
    -l INFO \
    --output alpaca_data_cleaned.csv

Automatic decompression/compression (based on file extension):

llm-convert \
  from-alpaca \
    --input ./alpaca_data_cleaned.json.xz \
  to-csv-pr \
    --output alpaca_data_cleaned.csv.gz

Filtering:

llm-convert \
  -l INFO \
  from-alpaca \
    -l INFO \
    --input alpaca_data_cleaned.json \
  keyword \
    -l INFO \
    --keyword function \
    --location any \
    --action keep \
  to-alpaca \
    -l INFO \
    --output alpaca_data_cleaned-filtered.json
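The keyword filter above passes through only records containing the keyword (here in any field, with --action keep). A hedged Python sketch of that keep/discard logic, using illustrative function and field names rather than the library's actual API:

```python
def keyword_filter(records, keyword, location="any", action="keep"):
    """Yield records depending on whether `keyword` occurs in their fields.

    `records` are dicts (e.g. {"instruction": ..., "input": ..., "output": ...});
    `location` is "any" or a specific field name; `action` is "keep"
    (pass matching records through) or "discard" (drop them).
    """
    keyword = keyword.lower()
    for record in records:
        if location == "any":
            fields = record.values()
        else:
            fields = [record.get(location, "")]
        matched = any(keyword in str(value).lower() for value in fields)
        if matched == (action == "keep"):
            yield record
```

Because the filter is a generator over a stream of records, it can sit between a reader and a writer without loading the whole dataset into memory.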

Changelog

0.0.2 (2023-10-31)

  • added text-stats filter

  • stream writers now also accept an iterable of data records, improving throughput

  • text_utils.apply_max_length now uses simple whitespace splitting instead of searching for the nearest word boundary when breaking a line, which results in a massive speed improvement

  • fix: text_utils.remove_patterns no longer multiplies the generated lines when using more than one pattern

  • added remove_patterns filter

  • pretrain and translation text writers now buffer records by default (-b, --buffer_size) in order to improve throughput

  • jsonlines writers for pair, pretrain and translation data are now stream writers

0.0.1 (2023-10-26)

  • initial release



Download files

Download the file for your platform.

Source Distribution

llm-dataset-converter-0.0.2.tar.gz (66.8 kB view details)


File details

Details for the file llm-dataset-converter-0.0.2.tar.gz.

File metadata

  • Download URL: llm-dataset-converter-0.0.2.tar.gz
  • Upload date:
  • Size: 66.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.8.10

File hashes

Hashes for llm-dataset-converter-0.0.2.tar.gz
Algorithm    Hash digest
SHA256       0a75f8c1bef12aa90f9184184c3534a35590c82626cb62ce9c9b8985073366f0
MD5          81b61722b63e646fe2aa8a1789cbaa83
BLAKE2b-256  564e899366eddf224716ec63b66e285e58ee36161d63aca3b6e58cd972aa11d4
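To check a downloaded archive against the published digests, the standard library's hashlib suffices. A small sketch (the file name in the commented check is the one listed above):

```python
import hashlib

def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# expected = "0a75f8c1bef12aa90f9184184c3534a35590c82626cb62ce9c9b8985073366f0"
# assert sha256_of("llm-dataset-converter-0.0.2.tar.gz") == expected
```

Reading in fixed-size chunks keeps memory use constant regardless of archive size.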

