Dataset mixing & curriculum optimizer — profile, blend, schedule, and budget training data. Zero deps.

datamix

Dataset mixing & curriculum optimizer for LLM training. Profile datasets, create mix recipes, schedule curricula, allocate token budgets, and clean data — all with zero dependencies.

Training data composition is one of the most impactful decisions in LLM training, yet there's no standard tooling for it. datamix gives you a programmatic way to profile, blend, schedule, and budget your training data.

[Image: datamix mix recipe and token allocation]

Why datamix?

Problem	datamix Solution
"What ratio of code vs. wiki should I use?"	Temperature-scaled mixing with automatic weight computation
No way to profile datasets before mixing	Instant profiling — token counts, lengths, quality metrics
Data curriculum is done manually in configs	Programmatic scheduling — linear, cosine, step functions
Token budget allocation is guesswork	Automatic budget computation with overflow detection
Quality filtering is scattered scripts	Built-in length filter, exact/near dedup, quality scoring

Installation

pip install datamix              # zero dependencies
pip install "datamix[cli]"       # + click, rich for terminal UI (quoted so shells like zsh do not expand the brackets)
pip install "datamix[all]"       # everything

Quick Start

1. Profile your datasets

from datamix import profile_dataset, profile_jsonl, compare_profiles

# From a JSONL file
wiki = profile_jsonl("data/wikipedia.jsonl")
code = profile_jsonl("data/code-python.jsonl")

print(f"{wiki.name}: {wiki.n_examples:,} examples, {wiki.size_tokens_m:.1f}M tokens")
print(f"{code.name}: {code.n_examples:,} examples, {code.size_tokens_m:.1f}M tokens")

# Compare multiple datasets
comparison = compare_profiles([wiki, code])
print(f"Total: {comparison['total_tokens']:,} tokens across {comparison['n_datasets']} datasets")
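
To make concrete what a profile of this kind might contain, here is a minimal, dependency-free sketch. It is not datamix's implementation: the `sketch_profile` helper, the `text` field name, and whitespace-based token counting are all assumptions made for illustration.

```python
import json
from statistics import mean


def sketch_profile(lines, text_field="text"):
    """Summarize JSONL records: example count, approximate tokens, mean length.

    Hypothetical helper, not part of datamix. Tokens are approximated by
    whitespace splitting, which usually undercounts relative to subword
    tokenizers.
    """
    token_counts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        token_counts.append(len(record[text_field].split()))
    return {
        "n_examples": len(token_counts),
        "total_tokens": sum(token_counts),
        "mean_tokens": mean(token_counts) if token_counts else 0.0,
    }


# Tiny in-memory stand-in for a JSONL file
sample = [
    '{"text": "the quick brown fox"}',
    '{"text": "hello world"}',
    "",
]
stats = sketch_profile(sample)
print(stats)
```

For a real file, the same function could be fed an open file handle, since iterating a text file yields one line at a time.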

2. Create a mix recipe

from datamix import create_recipe, MixStrategy

recipe = create_recipe(
    [wiki, code],
    strategy=MixStrategy.TEMPERATURE,
    temperature=1.5,  # >1 = more uniform, <1 = proportional
    total_tokens=2_000_000_000,
)

for name, weight in recipe.normalized_weights.items():
    print(f"  {name}: {weight:.1%}")
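
Temperature-scaled mixing is commonly defined as weighting each dataset by its size raised to the power 1/T and then normalizing. The sketch below assumes that formula; whether datamix computes weights exactly this way is not confirmed here, and `temperature_weights` is a hypothetical name.

```python
def temperature_weights(sizes, temperature=1.0):
    """Return normalized weights proportional to size ** (1 / temperature).

    Assumed formula for illustration, not taken from datamix's source.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = {name: n ** (1.0 / temperature) for name, n in sizes.items()}
    total = sum(scaled.values())
    return {name: value / total for name, value in scaled.items()}


sizes = {"wiki": 900, "code": 100}  # made-up token counts
proportional = temperature_weights(sizes, temperature=1.0)
flatter = temperature_weights(sizes, temperature=2.0)

for label, weights in [("T=1.0", proportional), ("T=2.0", flatter)]:
    print(label, {name: round(w, 3) for name, w in weights.items()})
```

Under this assumed formula, T = 1 reproduces size-proportional weights, and larger T flattens the mix toward uniform, which is why the smaller dataset gains share at T = 2.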

3. Schedule a curriculum

[Image: datamix curriculum schedule]

from datamix import cosine_schedule, linear_schedule

# Cosine decay: primary dataset starts high, others increase
sched = cosine_schedule(
    ["wikipedia", "code", "arxiv", "books"],
    n_phases=4,
    primary="wikipedia",
    total_tokens=2_000_000_000,
)

# Get weights at any training progress point
weights_start = sched.weights_at(0.0)   # {"wikipedia": 0.93, ...}
weights_mid = sched.weights_at(0.5)     # {"wikipedia": 0.50, ...}
weights_end = sched.weights_at(1.0)     # {"wikipedia": 0.07, ...}
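
One way to picture a cosine curriculum is as a half-cosine curve for the primary dataset's weight, with the remainder split evenly among the other datasets. The sketch below is an assumption-laden illustration, not datamix's code: `cosine_weights` is a hypothetical function, the `start` and `end` values are chosen only to mirror the commented values above, and the even split of the remainder is a guess. It is also continuous, whereas a schedule built with `n_phases` may be piecewise.

```python
import math


def cosine_weights(names, primary, progress, start=0.93, end=0.07):
    """Weights at a training progress point in [0, 1] (hypothetical sketch).

    The primary dataset decays from `start` to `end` along a half cosine;
    the remaining mass is shared equally by the other datasets.
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be in [0, 1]")
    others = [name for name in names if name != primary]
    if not others:
        return {primary: 1.0}
    primary_w = end + (start - end) * 0.5 * (1.0 + math.cos(math.pi * progress))
    share = (1.0 - primary_w) / len(others)
    weights = {name: share for name in others}
    weights[primary] = primary_w
    return weights


names = ["wikipedia", "code", "arxiv", "books"]
w_start = cosine_weights(names, "wikipedia", 0.0)
w_mid = cosine_weights(names, "wikipedia", 0.5)
w_end = cosine_weights(names, "wikipedia", 1.0)
print(round(w_start["wikipedia"], 2), round(w_mid["wikipedia"], 2), round(w_end["wikipedia"], 2))
```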

4. Allocate token budgets

from datamix import compute_budget, fit_to_budget, budget_report

# From a recipe
budget = compute_budget(recipe, [wiki, code])
print(budget_report(budget))

# Or fit datasets to a fixed budget
budget = fit_to_budget([wiki, code], token_budget=1_000_000_000)
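
The arithmetic behind a token budget can be sketched in a few lines. The example below assumes that "overflow" means a dataset is allocated more tokens than it actually contains, so it would have to be repeated; that reading, the `allocate_budget` name, and the returned fields are assumptions, not datamix's documented behavior.

```python
def allocate_budget(weights, available_tokens, token_budget):
    """Split token_budget by weight and flag datasets that would need repeating.

    Hypothetical helper for illustration only.
    """
    total_weight = sum(weights.values())
    report = {}
    for name, weight in weights.items():
        allocated = int(token_budget * weight / total_weight)
        report[name] = {
            "allocated": allocated,
            # epochs > 1 means the dataset is seen more than once
            "epochs": allocated / available_tokens[name],
            "overflow": allocated > available_tokens[name],
        }
    return report


weights = {"wiki": 0.75, "code": 0.25}
available = {"wiki": 1000, "code": 100}  # made-up dataset sizes in tokens
report = allocate_budget(weights, available, token_budget=800)

for name, row in report.items():
    print(name, row)
```

Here the small `code` dataset is asked for 200 tokens but holds only 100, so it is flagged and would need two passes, while `wiki` fits comfortably.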

5. Clean your data

from datamix import length_filter, dedup_exact, dedup_ngram, quality_score

# Filter by length (`texts` is assumed to be a list of strings, one per example)
kept, stats = length_filter(texts, min_length=50, max_length=10000)
print(f"Kept {stats['kept']}, removed {stats['removed']}")

# Remove exact duplicates
kept, stats = dedup_exact(kept)

# Remove near-duplicates (n-gram Jaccard)
kept, stats = dedup_ngram(kept, n=5, threshold=0.8)

# Score individual examples
for text in kept[:5]:
    score = quality_score(text)
    print(f"  {score:.2f}  {text[:60]}...")
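
Near-duplicate removal by n-gram Jaccard similarity can be sketched as follows. This is a simplified illustration rather than datamix's implementation: it assumes word-level n-grams (the library may use characters or another unit), and the helper names are hypothetical.

```python
def word_ngrams(text, n):
    """Return the set of word n-grams in text (lowercased, whitespace split)."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedup_near(texts, n=5, threshold=0.8):
    """Keep the first text of every group whose similarity reaches threshold."""
    kept, kept_grams = [], []
    for text in texts:
        grams = word_ngrams(text, n)
        if any(jaccard(grams, seen) >= threshold for seen in kept_grams):
            continue  # too similar to something already kept
        kept.append(text)
        kept_grams.append(grams)
    return kept


docs = [
    "the cat sat on the mat",
    "the cat sat on the mat today",
    "a completely different sentence here",
]
unique = dedup_near(docs, n=2, threshold=0.8)
print(unique)
```

This version compares each text against everything kept so far, so the cost grows quadratically; for large corpora, techniques such as MinHash with locality-sensitive hashing are the usual way to approximate the same comparison.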

CLI

# Profile a JSONL file
datamix profile data/wiki.jsonl

# Create a mix recipe
datamix mix data/wiki.jsonl data/code.jsonl --strategy temperature --budget 2000000000

# Clean a dataset
datamix clean data/raw.jsonl --min-length 50 --dedup

Mixing Strategies

Strategy	Description	When to Use
`PROPORTIONAL`	Weight by dataset size	Default — larger datasets get more weight
`TEMPERATURE`	Temperature-scaled proportional	Control uniformity (T>1) vs. proportional (T<1)
`EQUAL`	Equal weight per dataset	When all datasets are equally important
`CUSTOM`	Explicit weights	When you know the exact ratios

Curriculum Types

Schedule	Description
`linear_schedule`	Linear interpolation from start to end weights
`cosine_schedule`	Cosine decay for primary dataset, others increase
`step_schedule`	Step function with explicit phase configs
`custom_schedule`	Build from CurriculumPhase objects
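
As an illustration of the linear case, the sketch below interpolates between a start mix and an end mix and renormalizes the result. It is a hypothetical stand-in (`linear_weights` is not a datamix function) and assumes weights are plain name-to-float mappings.

```python
def linear_weights(start, end, progress):
    """Interpolate between two weight mappings at progress in [0, 1].

    Hypothetical sketch; datasets missing from one side are treated as 0.
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be in [0, 1]")
    names = set(start) | set(end)
    raw = {
        name: (1.0 - progress) * start.get(name, 0.0) + progress * end.get(name, 0.0)
        for name in names
    }
    total = sum(raw.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return {name: value / total for name, value in raw.items()}


start_mix = {"wiki": 0.8, "code": 0.2}
end_mix = {"wiki": 0.4, "code": 0.6}
halfway = linear_weights(start_mix, end_mix, 0.5)
print({name: round(w, 2) for name, w in sorted(halfway.items())})
```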

Architecture

datamix/
├── _types.py        # DatasetProfile, MixRecipe, CurriculumSchedule, TokenBudget
├── profile.py       # Dataset profiling from lists or JSONL files
├── mixer.py         # Mix recipe creation, merging, scaling
├── curriculum.py    # Linear, cosine, step, custom curriculum schedules
├── sampler.py       # Temperature, proportional, stratified sampling
├── budget.py        # Token budget computation and allocation
├── quality.py       # Length filter, exact/near dedup, quality scoring
└── cli.py           # Click CLI interface

See Also

Project	What it does
tokonomics	Token counting & cost management for LLM APIs
datacrux	Training data quality — dedup, PII, contamination
castwright	Synthetic instruction data generation
toksight	Tokenizer analysis & comparison
trainpulse	Training health monitoring
ckpt	Checkpoint inspection, diffing & merging
quantbench	Quantization quality analysis
infermark	Inference benchmarking
modeldiff	Behavioral regression testing
vibesafe	AI-generated code safety scanner
injectionguard	Prompt injection detection

License

Apache 2.0


