mldatasets

swiss army knife of scripts for transforming and processing datasets for machine learning

conversion

Currently, mldataforge provides space- and time-efficient conversions between JSONL (with or without compression), MosaicML Dataset (MDS format), and Parquet. The implementations process individual samples or small batches of samples at a time and use multiple cores where possible. This makes mldataforge well suited to transforming TB-scale datasets on data processing nodes with many cores.
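
To illustrate why processing small batches keeps memory use bounded, here is a minimal standard-library sketch that streams a JSONL file into a (possibly gzip-compressed) JSONL file one batch at a time. It is not mldataforge's implementation: the function names `read_jsonl`, `batched`, and `convert_jsonl` are hypothetical, and detecting gzip by the `.gz` suffix is an assumption made for this example.

```python
import gzip
import json
from itertools import islice
from pathlib import Path
from typing import Iterable, Iterator


def read_jsonl(path: Path) -> Iterator[dict]:
    """Yield one sample per non-empty line; gzip is assumed for a .gz suffix."""
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


def batched(samples: Iterable, batch_size: int) -> Iterator[list]:
    """Group a stream of samples into lists of at most batch_size items."""
    it = iter(samples)
    while batch := list(islice(it, batch_size)):
        yield batch


def convert_jsonl(src: Path, dst: Path, batch_size: int = 65536) -> int:
    """Stream src into dst batch by batch; return the number of samples written."""
    opener = gzip.open if dst.suffix == ".gz" else open
    count = 0
    with opener(dst, "wt", encoding="utf-8") as out:
        for batch in batched(read_jsonl(src), batch_size):
            # Only one batch is held in memory at any time.
            out.write("".join(json.dumps(sample) + "\n" for sample in batch))
            count += len(batch)
    return count
```

Because every stage is a generator, peak memory depends on the batch size rather than on the size of the dataset.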

splitting

Currently, mldataforge provides space- and time-efficient splitting of JSONL (with or without compression). The implementations handle splitting by individual samples or small batches of samples and make efficient use of multi-core architectures where possible. The splitting function can also take an already split dataset and re-split it with a different granularity.
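
The idea of re-splitting can be sketched as follows: treat the existing parts as one continuous stream of samples and cut that stream again at a new granularity. This is an illustration only, not mldataforge's code. The names `iter_samples` and `split_jsonl`, the `part-0000.jsonl` naming scheme, and splitting by sample count are all assumptions, and compression is left out to keep the sketch short.

```python
from pathlib import Path
from typing import Iterable, Iterator


def iter_samples(paths: Iterable[Path]) -> Iterator[str]:
    """Yield non-empty JSONL lines from several files as one continuous stream."""
    for path in paths:
        with open(path, "rt", encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    yield line.rstrip("\n")


def split_jsonl(paths: Iterable[Path], out_dir: Path, samples_per_file: int) -> list[Path]:
    """Write the stream into numbered parts of at most samples_per_file lines each."""
    out_dir.mkdir(parents=True, exist_ok=True)
    parts: list[Path] = []
    out = None
    try:
        for i, line in enumerate(iter_samples(paths)):
            if i % samples_per_file == 0:
                # Start a new part every samples_per_file samples.
                if out is not None:
                    out.close()
                parts.append(out_dir / f"part-{len(parts):04d}.jsonl")
                out = open(parts[-1], "wt", encoding="utf-8")
            out.write(line + "\n")
    finally:
        if out is not None:
            out.close()
    return parts
```

Passing the parts returned by one call as the input of a second call with a different `samples_per_file` re-splits the dataset while preserving sample order.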

installation and general usage

pip install mldataforge
python -m mldataforge --help

usage example: converting MosaicML Dataset (MDS) to Parquet format

Usage: python -m mldataforge convert mds parquet [OPTIONS] OUTPUT_FILE
                                                 MDS_DIRECTORIES...

Options:
  --compression [snappy|gzip|zstd]
                                  Compress the output file (default: snappy).
  --overwrite                     Overwrite existing path.
  --yes                           Assume yes to all prompts. Use with caution
                                  as it will remove files or even entire
                                  directories without confirmation.
  --batch-size INTEGER            Batch size for loading data and writing
                                  files (default: 65536).
  --no-bulk                       Use a custom space and time-efficient bulk
                                  reader (only gzip and no compression).
  --help                          Show this message and exit.
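
When the conversion is driven from a Python script, the command line above can be assembled as an argument list. The helper below is a hypothetical convenience wrapper, not part of mldataforge. It only uses the subcommand and options shown in the help text, and the file and directory names in the usage note are placeholders.

```python
import sys
from typing import Optional, Sequence


def mds_to_parquet_cmd(
    output_file: str,
    mds_directories: Sequence[str],
    compression: Optional[str] = None,
    batch_size: Optional[int] = None,
    overwrite: bool = False,
) -> list[str]:
    """Assemble the arguments for `python -m mldataforge convert mds parquet`."""
    cmd = [sys.executable, "-m", "mldataforge", "convert", "mds", "parquet"]
    if compression is not None:
        cmd += ["--compression", compression]  # snappy, gzip or zstd per the help text
    if batch_size is not None:
        cmd += ["--batch-size", str(batch_size)]
    if overwrite:
        cmd.append("--overwrite")
    # Positional arguments follow the options: OUTPUT_FILE MDS_DIRECTORIES...
    return cmd + [output_file, *mds_directories]
```

For example, `mds_to_parquet_cmd("train.parquet", ["mds/train"], compression="zstd")` returns a list that can be passed to `subprocess.run(..., check=True)` once mldataforge is installed.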


Source Distribution

mldataforge-0.2.9.tar.gz (18.0 kB)

Uploaded Source

File details

Details for the file mldataforge-0.2.9.tar.gz.

File metadata

  • Download URL: mldataforge-0.2.9.tar.gz
  • Size: 18.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for mldataforge-0.2.9.tar.gz
Algorithm Hash digest
SHA256 c0afecff1336fc4ee9f102fc406d787d65b4e6700318cf55a01484cf80071458
MD5 b6afff80d59077bc272c3233ddfbdd4a
BLAKE2b-256 8fa9f40bc0ca6bc76f515be6ae5d19ab5a9dfa4f8803de7d53b65ef6c5f8d1e7
