convmerge
Fetch, normalize, and convert heterogeneous chat / instruct datasets into a single LLM training format.
convmerge is a data-preparation library for supervised fine-tuning (SFT)
datasets. It fetches, normalizes, and merges heterogeneous chat / instruct
sources into a single newline-delimited JSON Lines layout that training code
can consume directly.
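For orientation, a single line of the merged output in the messages layout looks like this (a made-up sample, not drawn from any real dataset):

{"messages": [{"role": "user", "content": "Explain JSONL in one sentence."}, {"role": "assistant", "content": "JSONL stores one JSON object per line of a text file."}]}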
It is intentionally scoped to the step before the training loop: no model loading, no inference, no labeling, no training orchestration. See Out of scope below.
Repository: github.com/snowmuffin/convmerge
Status: pre-1.0; APIs and CLI may change between minor versions until 1.0.
Install
pip install convmerge # core: convert, normalize, dedupe, turns
pip install "convmerge[fetch]" # + YAML manifest fetcher (GitHub)
pip install "convmerge[fetch-hf]" # + HuggingFace entries (adds the datasets dependency)
pip install "convmerge[fetch-all]" # all fetch-related extras
pip install "convmerge[parquet]" # + parquet streaming input
Or from a clone:
git clone https://github.com/snowmuffin/convmerge.git
cd convmerge
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev,fetch-all,parquet]"
The four use cases
1. fetch — pull raw data from HF + GitHub via a YAML manifest
HuggingFace entries delegate to
datasets.load_dataset(...).to_json(...), i.e. the output is a JSONL dump of the selected split. GitHub entries support a single raw URL, recursive Trees API fetch with an extension filter, or git clone (with optional git lfs pull). fetch is a reproducible downloader, not a mirror of HuggingFace's Arrow cache.
# manifest.yaml
version: 1
defaults: { output_root: ./raw, resume: true }
auth: { hf_token_env: HF_TOKEN, github_token_env: GITHUB_TOKEN }
datasets:
- { name: alpaca-ko, hf: MarkrAI/KoCommercial-Dataset, split: train }
- { name: orca-raw,
url: https://raw.githubusercontent.com/org/repo/main/data/train.jsonl }
- { name: repo-tree,
url: https://github.com/org/example-repo, ext: [".jsonl"] }
- { name: big-lfs,
url: https://github.com/org/big-lfs-repo, mode: clone, lfs: true }
convmerge fetch manifest.yaml -o ./raw
# or one-shot shortcuts:
convmerge fetch hf://org/dataset -o ./raw --split train
convmerge fetch https://github.com/org/repo -o ./raw --ext .jsonl
Tokens resolve in order CLI flag → file → env var, and are redacted from logs. See docs/fetch.md for the full schema.
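For a single HuggingFace entry, the fetch step is roughly equivalent to the following Python sketch (per the delegation described above; the output path here is illustrative, not convmerge's guaranteed layout):

from datasets import load_dataset

# Download the selected split and dump it as JSON Lines
# (to_json writes line-delimited JSON by default).
ds = load_dataset("MarkrAI/KoCommercial-Dataset", split="train")
ds.to_json("./raw/alpaca-ko/train.jsonl")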
2. normalize — reshape parquet / messy JSON into clean JSONL
convmerge normalize -i ./raw -o ./jsonl
Handles parquet (streamed via pyarrow), top-level JSON arrays, concatenated
single-line JSON ({...}{...}{...}), and already-valid JSONL. A directory
input is walked recursively and mirrored under the output directory.
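The concatenated single-line case is the least obvious one. A minimal sketch of how such a stream can be split using json.JSONDecoder.raw_decode (an illustration of the idea, not convmerge's actual implementation):

import json

def iter_concatenated_json(buf: str):
    # Yield each top-level object from a string like '{...}{...}{...}'.
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(buf):
        obj, pos = decoder.raw_decode(buf, pos)
        yield obj
        while pos < len(buf) and buf[pos].isspace():
            pos += 1  # tolerate whitespace between objects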
3. convert — adapter + emitter pipeline
convmerge convert -i ./jsonl/alpaca.jsonl -o ./train/alpaca.messages.jsonl \
--from alpaca --format messages
convmerge convert -i ./jsonl/mixed.jsonl -o ./train/mixed.messages.jsonl \
--from auto --format messages # auto-detecting chat adapter
Adapters: alpaca, sharegpt, chat (alias auto).
Emitters: messages, alpaca.
chat / auto is a heuristic adapter: it inspects the keys of each input record (messages, conversation(s), text, conversation_a/_b, instruction/input/output, …) and routes to the right branch with a configurable role map. For unusual schemas, pin an explicit adapter (alpaca, sharegpt) or override keys programmatically; see docs/format.md.
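As a reference for what an adapter + emitter pair produces, the alpaca-to-messages conversion amounts to roughly the following (the empty-input handling is an assumption; convmerge's own edge-case behavior may differ):

def alpaca_to_messages(row: dict) -> dict:
    # alpaca rows carry instruction / input / output fields; fold the
    # optional input into the user prompt.
    prompt = row["instruction"]
    if row.get("input"):
        prompt += "\n\n" + row["input"]
    return {"messages": [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": row["output"]},
    ]}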
4. dedupe / turns — final cleanup + train/eval split hook
convmerge dedupe -i ./train/mixed.messages.jsonl -o ./train/mixed.dedup.jsonl
convmerge turns -i ./train/mixed.dedup.jsonl \
--single-out ./train/single.jsonl \
--multi-out ./train/multi.jsonl
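A minimal sketch of content-keyed exact dedupe over JSONL (hashing a canonicalized form of each record; not necessarily the exact keying convmerge uses):

import hashlib
import json

def dedupe_jsonl(src: str, dst: str) -> None:
    seen = set()
    with open(src, encoding="utf-8") as fin, \
         open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            # Canonicalize so key-order / whitespace differences still collide.
            key = hashlib.sha256(
                json.dumps(json.loads(line), sort_keys=True).encode()
            ).hexdigest()
            if key not in seen:
                seen.add(key)
                fout.write(line)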
See docs/format.md for adapter / emitter schemas and docs/fetch.md for manifest details.
Out of scope
To keep the package lean and dependency-free at its core, convmerge does
not include — and has no plans to include — the following:
- Model loading / inference / training. No PyTorch, Transformers, vLLM, or similar runtime is imported by the core or any shipped extra.
- Automatic labeling or classification of samples (e.g. topic tagging, quality scoring, safety classification). These are left to upstream tools or private pipelines.
- RLHF / DPO / preference-dataset construction beyond passing through existing pairwise rows via the chat adapter's pairwise_mode.
- Training-job orchestration (SkyPilot, RunPod, Modal, K8s operators).
- Prompt templating / chat-template rendering for specific model families. Output JSONL uses the standard messages / alpaca shapes; downstream trainers apply their own template.
- Tokenizer-aware length filtering, packing, or curriculum scheduling. Those live in the training stack, not here.
- Scraping HTML pages or running browser automation. Structured JSON / JSONL / Parquet inputs only.
If any of these are important to your workflow, wire convmerge in as one
step of a larger pipeline rather than expecting it to grow into those areas.
Development
See CONTRIBUTING.md for the full guide — setup, local checks, code conventions, and a walkthrough for adding a new adapter / emitter. CI runs Ruff + pytest on Python 3.10 – 3.12.
pip install -e ".[dev,fetch-all,parquet]"
ruff check src tests
ruff format --check src tests
pytest -q
Participation in this project is governed by the Contributor Covenant Code of Conduct.
Good first PRs: new adapters / emitters for public dataset schemas, new fetch backends (GitLab / Zenodo / Kaggle), recipe examples under examples/, and docs improvements. Browse the good first issue label for concrete starting points.
PyPI release (maintainers)
Releases run from .github/workflows/publish.yml
on pushing a v* tag. Publishing authenticates via the PYPI_API_TOKEN
GitHub Actions secret.
- Create an API token on pypi.org.
  - If the project already exists on PyPI, scope the token to the convmerge project (principle of least privilege).
  - For the very first upload (project not yet registered), PyPI does not allow project-scoped tokens: use Entire account scope for the first release, then rotate to a project-scoped token afterwards and revoke the original.
- In the GitHub repo, Settings → Secrets and variables → Actions → New repository secret, add PYPI_API_TOKEN with the token value.
- Tag and push: git tag vX.Y.Z && git push origin vX.Y.Z.
Changelog
License
MIT