
Dataset mixing & curriculum optimizer — profile, blend, schedule, and budget training data. Zero deps.


datamix

Python 3.9+ | License: Apache 2.0

Dataset mixing & curriculum optimizer for LLM training. Profile datasets, create mix recipes, schedule curricula, allocate token budgets, and clean data — all with zero dependencies.

Training data composition is one of the most impactful decisions in LLM training, yet there's no standard tooling for it. datamix gives you a programmatic way to profile, blend, schedule, and budget your training data.


Why datamix?

| Problem | datamix solution |
| --- | --- |
| "What ratio of code vs. wiki should I use?" | Temperature-scaled mixing with automatic weight computation |
| No way to profile datasets before mixing | Instant profiling: token counts, lengths, quality metrics |
| Data curriculum is done manually in configs | Programmatic scheduling: linear, cosine, step functions |
| Token budget allocation is guesswork | Automatic budget computation with overflow detection |
| Quality filtering is scattered scripts | Built-in length filter, exact/near dedup, quality scoring |

Installation

pip install datamix              # zero dependencies
pip install "datamix[cli]"       # + click, rich for terminal UI (quotes keep zsh from expanding the brackets)
pip install "datamix[all]"       # everything

Quick Start

1. Profile your datasets

from datamix import profile_dataset, profile_jsonl, compare_profiles

# From a JSONL file
wiki = profile_jsonl("data/wikipedia.jsonl")
code = profile_jsonl("data/code-python.jsonl")

print(f"{wiki.name}: {wiki.n_examples:,} examples, {wiki.size_tokens_m:.1f}M tokens")
print(f"{code.name}: {code.n_examples:,} examples, {code.size_tokens_m:.1f}M tokens")

# Compare multiple datasets
comparison = compare_profiles([wiki, code])
print(f"Total: {comparison['total_tokens']:,} tokens across {comparison['n_datasets']} datasets")
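To make the profiling step concrete, here is a minimal standalone sketch of what a profile could contain. It is not datamix's implementation: `sketch_profile` and `read_jsonl_texts` are hypothetical helpers, the `"text"` field name is an assumption about your JSONL layout, and whitespace splitting only approximates a real tokenizer's counts.

```python
import json
from statistics import mean


def sketch_profile(texts):
    """Summarize a list of documents using whitespace tokens (an approximation)."""
    lengths = [len(t.split()) for t in texts]
    return {
        "n_examples": len(texts),
        "total_tokens": sum(lengths),
        "mean_tokens": mean(lengths) if lengths else 0.0,
        "max_tokens": max(lengths, default=0),
    }


def read_jsonl_texts(path, field="text"):
    """Yield the `field` value of each non-empty JSONL line (field name is assumed)."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)[field]


profile = sketch_profile(["the quick brown fox", "hello world"])
print(profile)
```

For a file on disk, `sketch_profile(list(read_jsonl_texts("data/wikipedia.jsonl")))` would produce the same kind of summary.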

2. Create a mix recipe

from datamix import create_recipe, MixStrategy

recipe = create_recipe(
    [wiki, code],
    strategy=MixStrategy.TEMPERATURE,
    temperature=1.5,  # >1 = flatter (more uniform), <1 = sharper (favors larger datasets)
    total_tokens=2_000_000_000,
)

for name, weight in recipe.normalized_weights.items():
    print(f"  {name}: {weight:.1%}")
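The effect of the temperature can be illustrated with a short standalone sketch. It assumes the common formulation in which each weight is proportional to the dataset size raised to `1 / temperature`; whether datamix uses exactly this formula is not stated here, and `temperature_weights` is a hypothetical helper.

```python
def temperature_weights(sizes, temperature=1.0):
    """Return normalized weights proportional to size ** (1 / temperature).

    Assumed formulation, not confirmed as datamix's own.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = {name: float(n) ** (1.0 / temperature) for name, n in sizes.items()}
    total = sum(scaled.values())
    return {name: value / total for name, value in scaled.items()}


sizes = {"wiki": 900_000_000, "code": 100_000_000}
for t in (0.5, 1.0, 1.5):
    weights = temperature_weights(sizes, t)
    print(t, {name: round(w, 3) for name, w in weights.items()})
```

Under this assumption a temperature of 1 reproduces size-proportional weights, larger values pull the weights toward equal shares, and smaller values concentrate weight on the largest dataset.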

3. Schedule a curriculum


from datamix import cosine_schedule, linear_schedule

# Cosine decay: primary dataset starts high, others increase
sched = cosine_schedule(
    ["wikipedia", "code", "arxiv", "books"],
    n_phases=4,
    primary="wikipedia",
    total_tokens=2_000_000_000,
)

# Get weights at any training progress point
weights_start = sched.weights_at(0.0)   # {"wikipedia": 0.93, ...}
weights_mid = sched.weights_at(0.5)     # {"wikipedia": 0.50, ...}
weights_end = sched.weights_at(1.0)     # {"wikipedia": 0.07, ...}
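The cosine shape can be sketched without the library. The helper below is hypothetical, not datamix's code: it assumes the primary weight follows a half-cosine between a start and an end value (0.93 and 0.07 are borrowed from the comments above purely for illustration) and that the remaining weight is split equally among the other datasets.

```python
import math


def cosine_weights(names, primary, progress, start=0.93, end=0.07):
    """Primary weight decays along a half-cosine; the others share the remainder equally."""
    others = [n for n in names if n != primary]
    if not others:
        return {primary: 1.0}
    p = min(max(progress, 0.0), 1.0)  # clamp progress to [0, 1]
    primary_w = end + (start - end) * 0.5 * (1.0 + math.cos(math.pi * p))
    share = (1.0 - primary_w) / len(others)
    weights = {n: share for n in others}
    weights[primary] = primary_w
    return weights


names = ["wikipedia", "code", "arxiv", "books"]
for p in (0.0, 0.5, 1.0):
    print(p, round(cosine_weights(names, "wikipedia", p)["wikipedia"], 2))
```

The equal split among non-primary datasets is the simplest choice; a real schedule might instead distribute the remainder in proportion to dataset size.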

4. Allocate token budgets

from datamix import compute_budget, fit_to_budget, budget_report

# From a recipe
budget = compute_budget(recipe, [wiki, code])
print(budget_report(budget))

# Or fit datasets to a fixed budget
budget = fit_to_budget([wiki, code], token_budget=1_000_000_000)
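As a rough illustration of budget allocation with overflow detection, the sketch below multiplies each normalized weight by the total budget and flags datasets that would need to be repeated. `allocate_budget` and its report fields are hypothetical and do not reflect the structure of datamix's `TokenBudget`.

```python
def allocate_budget(weights, available, total_tokens):
    """Split total_tokens by weight and flag datasets that lack enough tokens."""
    report = {}
    for name, weight in weights.items():
        requested = round(weight * total_tokens)
        report[name] = {
            "requested": requested,
            "available": available[name],
            # epochs > 1 means the dataset must be repeated to fill its share
            "epochs": requested / available[name],
            "overflow": requested > available[name],
        }
    return report


report = allocate_budget(
    weights={"wiki": 0.8, "code": 0.2},
    available={"wiki": 900_000_000, "code": 100_000_000},
    total_tokens=1_000_000_000,
)
for name, row in report.items():
    print(name, row)
```

In this example the code dataset is asked for twice as many tokens as it holds, so it is flagged as an overflow, while the wiki dataset is used for less than one epoch.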

5. Clean your data

from datamix import length_filter, dedup_exact, dedup_ngram, quality_score

# `texts` is your list of raw documents (a list of strings)
# Filter by length
kept, stats = length_filter(texts, min_length=50, max_length=10000)
print(f"Kept {stats['kept']}, removed {stats['removed']}")

# Remove exact duplicates
kept, stats = dedup_exact(kept)

# Remove near-duplicates (n-gram Jaccard)
kept, stats = dedup_ngram(kept, n=5, threshold=0.8)

# Score individual examples
for text in kept[:5]:
    score = quality_score(text)
    print(f"  {score:.2f}  {text[:60]}...")
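Near-duplicate removal by n-gram Jaccard similarity can be sketched in a few lines. This is an illustrative version, not datamix's `dedup_ngram`: it assumes word-level n-grams and a greedy pass that keeps the first of any similar pair, and its pairwise comparison is quadratic in the number of kept documents.

```python
def word_ngrams(text, n):
    """Return the set of word-level n-grams (lowercased) in text."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedup_ngram_sketch(texts, n=5, threshold=0.8):
    """Greedily keep a text unless it is too similar to one already kept."""
    kept, kept_sets = [], []
    for text in texts:
        grams = word_ngrams(text, n)
        if any(jaccard(grams, seen) >= threshold for seen in kept_sets):
            continue
        kept.append(text)
        kept_sets.append(grams)
    return kept


docs = [
    "the cat sat on the mat today",
    "the cat sat on the mat tonight",
    "completely different sentence about training data",
]
unique = dedup_ngram_sketch(docs, n=2, threshold=0.5)
print(unique)
```

For large corpora, an approximate method such as MinHash with locality-sensitive hashing is the usual way to avoid the quadratic comparison.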

CLI

# Profile a JSONL file
datamix profile data/wiki.jsonl

# Create a mix recipe
datamix mix data/wiki.jsonl data/code.jsonl --strategy temperature --budget 2000000000

# Clean a dataset
datamix clean data/raw.jsonl --min-length 50 --dedup

Mixing Strategies

| Strategy | Description | When to use |
| --- | --- | --- |
| PROPORTIONAL | Weight by dataset size | Default: larger datasets get more weight |
| TEMPERATURE | Temperature-scaled proportional | Flatten toward uniform (T>1) or sharpen toward larger datasets (T<1) |
| EQUAL | Equal weight per dataset | When all datasets are equally important |
| CUSTOM | Explicit weights | When you know the exact ratios |

Curriculum Types

| Schedule | Description |
| --- | --- |
| linear_schedule | Linear interpolation from start to end weights |
| cosine_schedule | Cosine decay for primary dataset, others increase |
| step_schedule | Step function with explicit phase configs |
| custom_schedule | Build from CurriculumPhase objects |
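For comparison with the cosine case, linear interpolation between a start and an end mix can be sketched as follows. `linear_weights` is a hypothetical helper written for illustration; it does not show the signature or behavior of datamix's `linear_schedule`.

```python
def linear_weights(start, end, progress):
    """Interpolate linearly between two weight dicts and renormalize."""
    p = min(max(progress, 0.0), 1.0)  # clamp progress to [0, 1]
    names = sorted(set(start) | set(end))
    raw = {n: (1.0 - p) * start.get(n, 0.0) + p * end.get(n, 0.0) for n in names}
    total = sum(raw.values())
    return {n: value / total for n, value in raw.items()}


start_mix = {"wiki": 0.8, "code": 0.2}
end_mix = {"wiki": 0.4, "code": 0.6}
for p in (0.0, 0.5, 1.0):
    print(p, {n: round(w, 2) for n, w in linear_weights(start_mix, end_mix, p).items()})
```

A dataset missing from one of the two mixes is treated as having weight zero at that end, so it fades in or out over the course of training.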

Architecture

datamix/
├── _types.py        # DatasetProfile, MixRecipe, CurriculumSchedule, TokenBudget
├── profile.py       # Dataset profiling from lists or JSONL files
├── mixer.py         # Mix recipe creation, merging, scaling
├── curriculum.py    # Linear, cosine, step, custom curriculum schedules
├── sampler.py       # Temperature, proportional, stratified sampling
├── budget.py        # Token budget computation and allocation
├── quality.py       # Length filter, exact/near dedup, quality scoring
└── cli.py           # Click CLI interface

See Also

Part of the stef41 LLM toolkit — open-source tools for every stage of the LLM lifecycle:

| Project | What it does |
| --- | --- |
| tokonomics | Token counting & cost management for LLM APIs |
| datacrux | Training data quality: dedup, PII, contamination |
| castwright | Synthetic instruction data generation |
| toksight | Tokenizer analysis & comparison |
| trainpulse | Training health monitoring |
| ckpt | Checkpoint inspection, diffing & merging |
| quantbench | Quantization quality analysis |
| infermark | Inference benchmarking |
| modeldiff | Behavioral regression testing |
| vibesafe | AI-generated code safety scanner |
| injectionguard | Prompt injection detection |

License

Apache 2.0
