convmerge

Fetch, normalize, and convert heterogeneous chat / instruct datasets into a single LLM training format (JSONL).

Repository: github.com/snowmuffin/convmerge
Status: pre-1.0; APIs and CLI may change between minor versions.

Install

pip install convmerge                    # core: convert, normalize, dedupe, turns
pip install "convmerge[fetch]"           # + YAML manifest fetcher (GitHub)
pip install "convmerge[fetch-hf]"        # + HuggingFace entries (adds ``datasets``)
pip install "convmerge[fetch-all]"       # all fetch-related extras
pip install "convmerge[parquet]"         # + parquet streaming input

Or from a clone:

git clone https://github.com/snowmuffin/convmerge.git
cd convmerge
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev,fetch-all,parquet]"

The four use cases

1. fetch — pull raw data from HF + GitHub via a YAML manifest

# manifest.yaml
version: 1
defaults: { output_root: ./raw, resume: true }
auth:     { hf_token_env: HF_TOKEN, github_token_env: GITHUB_TOKEN }
datasets:
  - { name: alpaca-ko, hf: MarkrAI/KoCommercial-Dataset, split: train }
  - { name: orca-raw,
      url: https://raw.githubusercontent.com/org/repo/main/data/train.jsonl }
  - { name: repo-tree,
      url: https://github.com/org/example-repo, ext: [".jsonl"] }
  - { name: big-lfs,
      url: https://github.com/org/big-lfs-repo, mode: clone, lfs: true }

convmerge fetch manifest.yaml -o ./raw
# or one-shot shortcuts:
convmerge fetch hf://org/dataset -o ./raw --split train
convmerge fetch https://github.com/org/repo -o ./raw --ext .jsonl

Tokens resolve in the order CLI flag → file → env var, and are redacted from logs. See docs/fetch.md for the full manifest schema.
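
For example, with the manifest above, tokens can come from the environment variables named in its auth block (values below are placeholders):

export HF_TOKEN=hf_xxxxxxxx          # read via auth.hf_token_env; redacted from logs
export GITHUB_TOKEN=ghp_xxxxxxxx     # read via auth.github_token_env
convmerge fetch manifest.yaml -o ./raw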

2. normalize — reshape parquet / messy JSON into clean JSONL

convmerge normalize -i ./raw -o ./jsonl

Handles parquet (streamed via pyarrow), top-level JSON arrays, concatenated single-line JSON ({...}{...}{...}), and already-valid JSONL. A directory input is walked recursively and mirrored under the output directory.
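
As a sketch (file names and contents are hypothetical), a raw file of concatenated single-line JSON:

# ./raw/messy.json
{"q": "hi", "a": "hello"}{"q": "bye", "a": "goodbye"}

# ./jsonl/messy.jsonl after normalize: one object per line
{"q": "hi", "a": "hello"}
{"q": "bye", "a": "goodbye"}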

3. convert — adapter + emitter pipeline

convmerge convert -i ./jsonl/alpaca.jsonl -o ./train/alpaca.messages.jsonl \
  --from alpaca --format messages

convmerge convert -i ./jsonl/mixed.jsonl -o ./train/mixed.messages.jsonl \
  --from auto --format messages         # auto-detecting chat adapter

Adapters: alpaca, sharegpt, chat (alias auto).
Emitters: messages, alpaca.
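
Roughly, the alpaca adapter plus the messages emitter maps the first record below to the second (contents are hypothetical, and exactly how instruction and input are joined may differ; see docs/format.md):

{"instruction": "Translate to French.", "input": "Good morning", "output": "Bonjour"}
{"messages": [{"role": "user", "content": "Translate to French.\n\nGood morning"}, {"role": "assistant", "content": "Bonjour"}]}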

4. dedupe / turns — final cleanup + train/eval split hook

convmerge dedupe -i ./train/mixed.messages.jsonl -o ./train/mixed.dedup.jsonl
convmerge turns  -i ./train/mixed.dedup.jsonl \
  --single-out ./train/single.jsonl \
  --multi-out  ./train/multi.jsonl
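
Assuming the messages format above, turns routes records with a single user/assistant exchange to --single-out and longer conversations to --multi-out (records are hypothetical):

# → ./train/single.jsonl
{"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello!"}]}

# → ./train/multi.jsonl
{"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello!"}, {"role": "user", "content": "And in French?"}, {"role": "assistant", "content": "Bonjour !"}]}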

See docs/format.md for adapter / emitter schemas and docs/fetch.md for manifest details.

Development

See CONTRIBUTING.md. CI runs Ruff and pytest on Python 3.10–3.12.

ruff check src tests
pytest -q

PyPI release (maintainers)

Releases run from .github/workflows/publish.yml on pushing a v* tag. Publishing authenticates via the PYPI_API_TOKEN GitHub Actions secret (a PyPI API token scoped to the convmerge project).

  1. Create an API token on pypi.org scoped to convmerge.
  2. In the GitHub repo, Settings → Secrets and variables → Actions → New repository secret, add PYPI_API_TOKEN with the token value.
  3. Tag and push: git tag v0.2.0 && git push origin v0.2.0.
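
A minimal sketch of such a workflow, assuming the standard pypa/gh-action-pypi-publish action (the actual publish.yml may differ; only the tag trigger and secret name are taken from the description above):

# .github/workflows/publish.yml (sketch)
name: publish
on:
  push:
    tags: ["v*"]
jobs:
  pypi:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.12" }
      - run: pip install build && python -m build      # sdist + wheel into dist/
      - uses: pypa/gh-action-pypi-publish@release/v1
        with:
          password: ${{ secrets.PYPI_API_TOKEN }}      # repo secret from step 2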

Changelog

CHANGELOG.md

License

MIT
