Dataset mixing & curriculum optimizer — profile, blend, schedule, and budget training data. Zero deps.
datamix
Dataset mixing & curriculum optimizer for LLM training. Profile datasets, create mix recipes, schedule curricula, allocate token budgets, and clean data — all with zero dependencies.
Training data composition is one of the most impactful decisions in LLM training, yet there's no standard tooling for it. datamix gives you a programmatic way to profile, blend, schedule, and budget your training data.
Why datamix?
| Problem | datamix Solution |
|---|---|
| "What ratio of code vs. wiki should I use?" | Temperature-scaled mixing with automatic weight computation |
| No way to profile datasets before mixing | Instant profiling — token counts, lengths, quality metrics |
| Data curriculum is done manually in configs | Programmatic scheduling — linear, cosine, step functions |
| Token budget allocation is guesswork | Automatic budget computation with overflow detection |
| Quality filtering is scattered scripts | Built-in length filter, exact/near dedup, quality scoring |
Installation
pip install datamix # zero dependencies
pip install "datamix[cli]" # + click, rich for terminal UI (quotes keep shells like zsh from expanding the brackets)
pip install "datamix[all]" # everything
Quick Start
1. Profile your datasets
from datamix import profile_dataset, profile_jsonl, compare_profiles
# From a JSONL file
wiki = profile_jsonl("data/wikipedia.jsonl")
code = profile_jsonl("data/code-python.jsonl")
print(f"{wiki.name}: {wiki.n_examples:,} examples, {wiki.size_tokens_m:.1f}M tokens")
print(f"{code.name}: {code.n_examples:,} examples, {code.size_tokens_m:.1f}M tokens")
# Compare multiple datasets
comparison = compare_profiles([wiki, code])
print(f"Total: {comparison['total_tokens']:,} tokens across {comparison['n_datasets']} datasets")
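To make the profiling step concrete, here is a minimal sketch of the kind of statistics a JSONL profiler might collect. It is not datamix's implementation: `SimpleProfile`, `sketch_profile`, the `text` key, and the whitespace token count are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass


@dataclass
class SimpleProfile:
    """Hypothetical stand-in for a dataset profile (not datamix's DatasetProfile)."""
    name: str
    n_examples: int
    n_tokens: int
    mean_length: float  # mean characters per example


def sketch_profile(name, lines, text_key="text"):
    """Profile an iterable of JSONL lines, using whitespace tokens as a rough proxy."""
    n_examples = n_tokens = n_chars = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        text = json.loads(line).get(text_key, "")
        n_examples += 1
        n_tokens += len(text.split())
        n_chars += len(text)
    mean_length = n_chars / n_examples if n_examples else 0.0
    return SimpleProfile(name, n_examples, n_tokens, mean_length)


rows = ['{"text": "the quick brown fox"}', '{"text": "hello world"}', ""]
p = sketch_profile("toy", rows)
print(f"{p.name}: {p.n_examples} examples, {p.n_tokens} tokens")
```

A real profiler would likely use the training tokenizer rather than whitespace splitting, since token counts drive every later weighting and budgeting decision.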
2. Create a mix recipe
from datamix import create_recipe, MixStrategy
recipe = create_recipe(
    [wiki, code],
    strategy=MixStrategy.TEMPERATURE,
    temperature=1.5,  # >1 = more uniform, 1 = proportional
    total_tokens=2_000_000_000,
)
for name, weight in recipe.normalized_weights.items():
    print(f" {name}: {weight:.1%}")
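The effect of the temperature can be sketched in a few lines. This assumes the common convention that each weight is proportional to the token count raised to `1 / temperature`; datamix's actual formula is not shown in this README and may differ. The function name `temperature_weights` and the token counts are hypothetical.

```python
def temperature_weights(token_counts, temperature=1.0):
    """Return normalized weights proportional to count ** (1 / temperature).

    Assumed convention, not taken from datamix's source.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = {
        name: count ** (1.0 / temperature)
        for name, count in token_counts.items()
    }
    total = sum(scaled.values())
    return {name: value / total for name, value in scaled.items()}


# Hypothetical token counts for two datasets.
counts = {"wikipedia": 900_000_000, "code": 100_000_000}
proportional = temperature_weights(counts, temperature=1.0)
flatter = temperature_weights(counts, temperature=1.5)

for name in counts:
    print(f"{name}: T=1.0 -> {proportional[name]:.1%}, T=1.5 -> {flatter[name]:.1%}")
```

Under this convention, raising the temperature shifts weight from the larger dataset to the smaller one, which is the usual way to upsample low-resource sources without hand-tuning ratios.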
3. Schedule a curriculum
from datamix import cosine_schedule, linear_schedule
# Cosine decay: primary dataset starts high, others increase
sched = cosine_schedule(
    ["wikipedia", "code", "arxiv", "books"],
    n_phases=4,
    primary="wikipedia",
    total_tokens=2_000_000_000,
)
# Get weights at any training progress point
weights_start = sched.weights_at(0.0) # {"wikipedia": 0.93, ...}
weights_mid = sched.weights_at(0.5) # {"wikipedia": 0.50, ...}
weights_end = sched.weights_at(1.0) # {"wikipedia": 0.07, ...}
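The weights in the comments above are consistent with a cosine curve that moves the primary dataset from 0.93 down to 0.07. The sketch below reproduces that shape under stated assumptions: the start and end values are read off those comments, the remaining weight is split equally among the other datasets, and the function name `cosine_weights` is hypothetical. How datamix handles `n_phases` is not modelled here.

```python
import math


def cosine_weights(names, primary, progress, start=0.93, end=0.07):
    """Cosine-decay the primary dataset's weight; split the rest equally.

    `start` and `end` are assumed values, not datamix defaults.
    """
    progress = min(max(progress, 0.0), 1.0)  # clamp to [0, 1]
    others = [n for n in names if n != primary]
    if not others:
        return {primary: 1.0}
    cos_factor = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes 1 -> 0
    primary_w = end + (start - end) * cos_factor
    share = (1.0 - primary_w) / len(others)
    weights = {n: share for n in others}
    weights[primary] = primary_w
    return weights


names = ["wikipedia", "code", "arxiv", "books"]
for progress in (0.0, 0.5, 1.0):
    w = cosine_weights(names, "wikipedia", progress)
    print(f"{progress:.1f}: wikipedia={w['wikipedia']:.2f}, code={w['code']:.2f}")
```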
4. Allocate token budgets
from datamix import compute_budget, fit_to_budget, budget_report
# From a recipe
budget = compute_budget(recipe, [wiki, code])
print(budget_report(budget))
# Or fit datasets to a fixed budget
budget = fit_to_budget([wiki, code], token_budget=1_000_000_000)
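As a rough illustration of what budget computation with overflow detection involves, the sketch below turns mix weights into per-dataset token targets and flags any dataset whose target exceeds the tokens it actually has. The function `allocate_budget`, its return shape, and the numbers are hypothetical and do not describe the `TokenBudget` type.

```python
def allocate_budget(weights, available_tokens, total_tokens):
    """Map normalized weights to token targets and flag overflows.

    A dataset overflows when its target exceeds its available tokens,
    meaning it would have to be repeated for more than one epoch.
    """
    plan = {}
    for name, weight in weights.items():
        target = int(weight * total_tokens)
        available = available_tokens[name]
        plan[name] = {
            "target_tokens": target,
            "epochs": target / available if available else float("inf"),
            "overflow": target > available,
        }
    return plan


# Hypothetical weights and dataset sizes.
weights = {"wikipedia": 0.75, "code": 0.25}
available = {"wikipedia": 2_000_000_000, "code": 100_000_000}
plan = allocate_budget(weights, available, total_tokens=1_000_000_000)

for name, row in plan.items():
    flag = "OVERFLOW" if row["overflow"] else "ok"
    print(f"{name}: {row['target_tokens']:,} tokens, {row['epochs']:.2f} epochs, {flag}")
```

Here the small code dataset would need 2.5 passes to fill its share, which is the kind of situation an overflow check is meant to surface before training starts.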
5. Clean your data
from datamix import length_filter, dedup_exact, dedup_ngram, quality_score
# Filter by length (texts is a list of strings you have already loaded)
kept, stats = length_filter(texts, min_length=50, max_length=10000)
print(f"Kept {stats['kept']}, removed {stats['removed']}")
# Remove exact duplicates
kept, stats = dedup_exact(kept)
# Remove near-duplicates (n-gram Jaccard)
kept, stats = dedup_ngram(kept, n=5, threshold=0.8)
# Score individual examples
for text in kept[:5]:
    score = quality_score(text)
    print(f" {score:.2f} {text[:60]}...")
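For readers unfamiliar with n-gram Jaccard deduplication, here is a minimal sketch of the technique. It assumes word-level n-grams and a greedy first-seen-wins pass; the helper names are hypothetical, and `dedup_ngram` may tokenize, compare, or scale differently.

```python
def word_ngrams(text, n=5):
    """Return the set of word n-grams in a text (lowercased)."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedup_near(texts, n=5, threshold=0.8):
    """Keep a text only if it is not too similar to any text kept so far."""
    kept, kept_grams = [], []
    for text in texts:
        grams = word_ngrams(text, n)
        if any(jaccard(grams, seen) >= threshold for seen in kept_grams):
            continue  # near-duplicate of an earlier text
        kept.append(text)
        kept_grams.append(grams)
    return kept, {"kept": len(kept), "removed": len(texts) - len(kept)}


base = " ".join(f"w{i}" for i in range(30))
near = base + " extra"                        # differs by one trailing word
other = " ".join(f"x{i}" for i in range(30))  # shares no n-grams with base
docs, stats = dedup_near([base, near, other, base])
print(stats)
```

This pairwise comparison is quadratic in the number of kept texts, so it only suits small corpora; large-scale pipelines usually approximate the same similarity with MinHash and locality-sensitive hashing.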
CLI
# Profile a JSONL file
datamix profile data/wiki.jsonl
# Create a mix recipe
datamix mix data/wiki.jsonl data/code.jsonl --strategy temperature --budget 2000000000
# Clean a dataset
datamix clean data/raw.jsonl --min-length 50 --dedup
Mixing Strategies
| Strategy | Description | When to Use |
|---|---|---|
| PROPORTIONAL | Weight by dataset size | Default; larger datasets get more weight |
| TEMPERATURE | Temperature-scaled proportional | Control how uniform the mix is: T>1 flattens toward equal weights, T=1 matches PROPORTIONAL |
| EQUAL | Equal weight per dataset | When all datasets are equally important |
| CUSTOM | Explicit weights | When you know the exact ratios |
Curriculum Types
| Schedule | Description |
|---|---|
| linear_schedule | Linear interpolation from start to end weights |
| cosine_schedule | Cosine decay for primary dataset, others increase |
| step_schedule | Step function with explicit phase configs |
| custom_schedule | Build from CurriculumPhase objects |
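The linear and step schedules in the table can be sketched as plain functions of training progress. The names `linear_weights` and `step_weights`, and the representation of a phase as an `(end_fraction, weights)` pair, are assumptions for illustration and are not the signatures of `linear_schedule` or `step_schedule`.

```python
def linear_weights(start, end, progress):
    """Interpolate linearly between two weight dicts, then renormalize."""
    progress = min(max(progress, 0.0), 1.0)  # clamp to [0, 1]
    names = sorted(set(start) | set(end))
    raw = {
        n: (1.0 - progress) * start.get(n, 0.0) + progress * end.get(n, 0.0)
        for n in names
    }
    total = sum(raw.values())
    return {n: v / total for n, v in raw.items()}


def step_weights(phases, progress):
    """Return the weights of the first phase whose end_fraction covers progress.

    `phases` is a list of (end_fraction, weights) pairs sorted by end_fraction.
    """
    for end_fraction, weights in phases:
        if progress <= end_fraction:
            return weights
    return phases[-1][1]  # past the last boundary: stay in the final phase


start = {"wikipedia": 0.8, "code": 0.2}
end = {"wikipedia": 0.4, "code": 0.6}
mid = linear_weights(start, end, 0.5)
print({name: round(weight, 2) for name, weight in mid.items()})

phases = [(0.5, start), (1.0, end)]
print(step_weights(phases, 0.25), step_weights(phases, 0.75))
```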
Architecture
datamix/
├── _types.py # DatasetProfile, MixRecipe, CurriculumSchedule, TokenBudget
├── profile.py # Dataset profiling from lists or JSONL files
├── mixer.py # Mix recipe creation, merging, scaling
├── curriculum.py # Linear, cosine, step, custom curriculum schedules
├── sampler.py # Temperature, proportional, stratified sampling
├── budget.py # Token budget computation and allocation
├── quality.py # Length filter, exact/near dedup, quality scoring
└── cli.py # Click CLI interface
See Also
Part of the stef41 LLM toolkit — open-source tools for every stage of the LLM lifecycle:
| Project | What it does |
|---|---|
| tokonomics | Token counting & cost management for LLM APIs |
| datacrux | Training data quality — dedup, PII, contamination |
| castwright | Synthetic instruction data generation |
| toksight | Tokenizer analysis & comparison |
| trainpulse | Training health monitoring |
| ckpt | Checkpoint inspection, diffing & merging |
| quantbench | Quantization quality analysis |
| infermark | Inference benchmarking |
| modeldiff | Behavioral regression testing |
| vibesafe | AI-generated code safety scanner |
| injectionguard | Prompt injection detection |
License
Apache 2.0
File details
Details for the file datamix-0.3.0.tar.gz.
File metadata
- Download URL: datamix-0.3.0.tar.gz
- Upload date:
- Size: 41.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d57c158fb96f6b92f40d89d6f6debe40f02fc7a0948078019b8a21d386d660cd |
| MD5 | 8aa9c109344381ff1aeec809844761d3 |
| BLAKE2b-256 | dc16fddbe2e460720bf37644c32993b065ee60521fd8a3925f7f68e0c2756405 |
File details
Details for the file datamix-0.3.0-py3-none-any.whl.
File metadata
- Download URL: datamix-0.3.0-py3-none-any.whl
- Upload date:
- Size: 26.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1d5a64cacee9cd825b6d45a992f818acdb39c168fbee0ead73b079f366ff9d77 |
| MD5 | 040c522856642f08b3a369c909a12aac |
| BLAKE2b-256 | 2a6ab44297be953f007cb3b8ff453352e98d319ebbc440a91b2a8305a02929af |
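If you download a file manually, you can check it against the SHA256 digests listed above using only the standard library. This is a generic sketch, not part of datamix; the helper name `sha256_of` is hypothetical, and the demo hashes a small temporary file rather than a real release artifact.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a throwaway file so the example runs anywhere.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"abc")
demo_digest = sha256_of(tmp.name)
os.remove(tmp.name)
print(demo_digest)
```

To verify a release file, call `sha256_of` on the downloaded path and compare the result with the digest published for that exact file name.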