convmerge
Fetch, normalize, and convert heterogeneous chat / instruct datasets into a single LLM training format (JSONL).
Repository: github.com/snowmuffin/convmerge
Status: pre-1.0; APIs and CLI may change between minor versions.
Install
pip install convmerge # core: convert, normalize, dedupe, turns
pip install "convmerge[fetch]" # + YAML manifest fetcher (GitHub)
pip install "convmerge[fetch-hf]" # + HuggingFace entries (adds the datasets package)
pip install "convmerge[fetch-all]" # all fetch-related extras
pip install "convmerge[parquet]" # + parquet streaming input
Or from a clone:
git clone https://github.com/snowmuffin/convmerge.git
cd convmerge
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev,fetch-all,parquet]"
The four use cases
1. fetch — pull raw data from HF + GitHub via a YAML manifest
# manifest.yaml
version: 1
defaults: { output_root: ./raw, resume: true }
auth: { hf_token_env: HF_TOKEN, github_token_env: GITHUB_TOKEN }
datasets:
  - { name: alpaca-ko, hf: MarkrAI/KoCommercial-Dataset, split: train }
  - { name: orca-raw,
      url: https://raw.githubusercontent.com/org/repo/main/data/train.jsonl }
  - { name: repo-tree,
      url: https://github.com/org/example-repo, ext: [".jsonl"] }
  - { name: big-lfs,
      url: https://github.com/org/big-lfs-repo, mode: clone, lfs: true }
convmerge fetch manifest.yaml -o ./raw
# or one-shot shortcuts:
convmerge fetch hf://org/dataset -o ./raw --split train
convmerge fetch https://github.com/org/repo -o ./raw --ext .jsonl
Tokens are resolved in the order CLI flag → file → env var, and are redacted from logs. See docs/fetch.md for the full schema.
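The token precedence can be sketched roughly as follows (the function and parameter names here are illustrative, not convmerge's actual API):

```python
import os

def resolve_token(cli_value=None, token_file=None, env_var=None):
    """Return the first token found: CLI flag, then file, then env var."""
    if cli_value:                        # 1. an explicit CLI flag wins
        return cli_value
    if token_file and os.path.isfile(token_file):
        with open(token_file) as f:      # 2. a token stored in a file
            return f.read().strip()
    if env_var:                          # 3. fall back to the environment
        return os.environ.get(env_var)
    return None
```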
2. normalize — reshape parquet / messy JSON into clean JSONL
convmerge normalize -i ./raw -o ./jsonl
Handles parquet (streamed via pyarrow), top-level JSON arrays, concatenated
single-line JSON ({...}{...}{...}), and already-valid JSONL. A directory
input is walked recursively and mirrored under the output directory.
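The trickiest of these inputs is concatenated single-line JSON, which plain json.loads rejects. One way to handle it (a sketch, not convmerge's internal parser) is to walk the string with json.JSONDecoder.raw_decode:

```python
import json

def iter_json_objects(text: str):
    """Yield each object from concatenated JSON like {...}{...}{...}."""
    decoder = json.JSONDecoder()
    idx, n = 0, len(text)
    while idx < n:
        while idx < n and text[idx].isspace():  # skip separators
            idx += 1
        if idx >= n:
            break
        obj, idx = decoder.raw_decode(text, idx)  # parse one object, advance
        yield obj

records = list(iter_json_objects('{"a": 1}{"a": 2} {"a": 3}'))
# -> [{'a': 1}, {'a': 2}, {'a': 3}]
```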
3. convert — adapter + emitter pipeline
convmerge convert -i ./jsonl/alpaca.jsonl -o ./train/alpaca.messages.jsonl \
--from alpaca --format messages
convmerge convert -i ./jsonl/mixed.jsonl -o ./train/mixed.messages.jsonl \
--from auto --format messages # auto-detecting chat adapter
Adapters: alpaca, sharegpt, chat (alias auto).
Emitters: messages, alpaca.
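To give a sense of what the alpaca adapter plus messages emitter does per record, here is a hypothetical sketch (not convmerge's internal code; see docs/format.md for the real schemas):

```python
def alpaca_to_messages(row: dict) -> dict:
    """Map one Alpaca-style record (instruction/input/output)
    to a chat-messages record."""
    user = row["instruction"]
    if row.get("input"):  # optional context goes below the instruction
        user += "\n\n" + row["input"]
    return {"messages": [
        {"role": "user", "content": user},
        {"role": "assistant", "content": row["output"]},
    ]}

rec = alpaca_to_messages({"instruction": "Translate to French.",
                          "input": "Hello", "output": "Bonjour"})
```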
4. dedupe / turns — final cleanup + train/eval split hook
convmerge dedupe -i ./train/mixed.messages.jsonl -o ./train/mixed.dedup.jsonl
convmerge turns -i ./train/mixed.dedup.jsonl \
--single-out ./train/single.jsonl \
--multi-out ./train/multi.jsonl
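Conceptually, these two passes amount to exact-match dedup on the messages payload and a partition by user-turn count. A minimal sketch under those assumptions (convmerge's own heuristics may differ):

```python
import hashlib
import json

def dedupe(records):
    """Yield records whose messages payload has not been seen before."""
    seen = set()
    for rec in records:
        key = hashlib.sha256(
            json.dumps(rec["messages"], sort_keys=True).encode()
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            yield rec

def split_by_turns(records):
    """Partition records into single-turn and multi-turn conversations."""
    single, multi = [], []
    for rec in records:
        user_turns = sum(1 for m in rec["messages"] if m["role"] == "user")
        (single if user_turns <= 1 else multi).append(rec)
    return single, multi
```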
See docs/format.md for adapter / emitter schemas and docs/fetch.md for manifest details.
Development
See CONTRIBUTING.md. CI runs Ruff + pytest on Python 3.10 – 3.12.
ruff check src tests
pytest -q
PyPI release (maintainers)
Releases run from .github/workflows/publish.yml
on pushing a v* tag. Publishing authenticates via the PYPI_API_TOKEN
GitHub Actions secret (a PyPI API token scoped to the convmerge project).
- Create an API token on pypi.org scoped to convmerge.
- In the GitHub repo, go to Settings → Secrets and variables → Actions → New repository secret, and add PYPI_API_TOKEN with the token value.
- Tag and push: git tag v0.2.0 && git push origin v0.2.0
Changelog
License
MIT