Curriculus

Progressive curriculum learning for LLM training with fine-grained schedule control.

What is this?

Curriculus helps you gradually mix and transition between different datasets during training. Instead of throwing all your data at a model at once, you can start with simpler data (e.g., an easy dataset), smoothly transition to more complex data (e.g., a medium dataset), and finally move to the most demanding or task-specific data (e.g., a hard dataset).

The key idea: linear interpolation between sampling probabilities. You define milestones (e.g., "at 20% progress, start mixing medium in"), and the library interpolates the per-dataset probabilities between those milestones and samples each step according to the result.

Why?

Training on progressively more complex data can:

  • ✅ Improve model convergence and final performance
  • ✅ Reduce training instability and catastrophic forgetting
  • ✅ Allow precise control over when each dataset is used
  • ✅ Handle datasets of different sizes gracefully

Installation

pip install curriculus

With PyTorch support:

pip install curriculus[torch]

Quick Start

Minimal Example (Sequential Fading)

from curriculus import CurriculusIterableDataset

# Your datasets
datasets = [
    {"name": "easy", "dataset": easy_data},
    {"name": "medium", "dataset": medium_data},
    {"name": "hard", "dataset": hard_data},
]

# Auto-generates: easy -> medium -> hard
dataset = CurriculusIterableDataset(datasets)

# Use with your trainer
for sample in dataset:
    # sample comes from the appropriate dataset based on training progress
    pass

Custom Schedule

from curriculus import CurriculusIterableDataset

# Explicit schedule: define milestones and weights
schedule = [
    (0.0, {"easy": 1.0, "medium": 0.0, "hard": 0.0}),
    (0.2, {"easy": 1.0, "medium": 0.0, "hard": 0.0}),  # Warmup
    (0.4, {"easy": 0.5, "medium": 0.5, "hard": 0.0}),  # Easing
    (0.6, {"easy": 0.0, "medium": 1.0, "hard": 0.0}),  # Pure medium
    (0.8, {"easy": 0.0, "medium": 0.5, "hard": 0.5}),  # Mix
    (1.0, {"easy": 0.0, "medium": 0.0, "hard": 1.0}),  # Pure hard
]

dataset = CurriculusIterableDataset(
    datasets,
    schedule=schedule,
    total_steps=10000,
    oversampling=True,  # Repeat data if insufficient
    best_effort=True,   # Scale down gracefully if short (default)
)

for sample in dataset:
    pass

How It Works

Schedule Interpretation

A schedule is a list of (progress_percent, {dataset: weight}) tuples:

  • progress_percent: Fraction of training completed, from 0.0 (start) to 1.0 (end). Despite the name, it is not a 0 to 100 percentage.
  • weight: Probability of sampling from that dataset at this milestone

The library linearly interpolates between milestones. If you define:

  • 0%: easy=1.0
  • 100%: medium=1.0

Then at 50% progress, both have weight 0.5 (50/50 mix).
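The interpolation described above can be sketched in a few lines. This is a minimal illustration rather than the library's own code: `interpolate_weights` is a hypothetical helper, and treating a dataset that is missing from a milestone as weight 0.0 is an assumption made for the example.

```python
from bisect import bisect_right


def interpolate_weights(schedule, progress):
    """Linearly interpolate per-dataset weights at `progress` (0.0 to 1.0).

    `schedule` is a list of (progress, {name: weight}) tuples sorted by progress.
    Hypothetical helper for illustration; not part of the curriculus API.
    """
    points = [p for p, _ in schedule]
    # Clamp to the first/last milestone outside the defined range.
    if progress <= points[0]:
        return dict(schedule[0][1])
    if progress >= points[-1]:
        return dict(schedule[-1][1])
    # Find the two milestones that bracket `progress`.
    hi = bisect_right(points, progress)
    (p0, w0), (p1, w1) = schedule[hi - 1], schedule[hi]
    t = (progress - p0) / (p1 - p0)
    names = sorted(set(w0) | set(w1))
    # Assumption: a dataset absent from a milestone has weight 0.0 there.
    return {n: (1 - t) * w0.get(n, 0.0) + t * w1.get(n, 0.0) for n in names}


schedule = [(0.0, {"easy": 1.0}), (1.0, {"medium": 1.0})]
weights = interpolate_weights(schedule, 0.5)
print(weights)  # {'easy': 0.5, 'medium': 0.5}
```

Because both milestones sum to 1.0, every interpolated point between them also sums to 1.0, so the result can be used directly as a sampling distribution.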

Automatic Scale-Down (Best Effort)

If a dataset has fewer samples than the schedule demands from it:

  • best_effort=True (default): Reduces the dataset's sampling probability to make it last
  • oversampling=True: Repeats data to fulfill the schedule
  • Both False: Raises an error

Example: If medium appears in the schedule but you only have 50% of the required samples:

  • Best effort scales it down by 50%
  • Other datasets naturally expand to fill the gap
  • Training completes without crashing
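One way to picture the scale-down in this example is sketched below. It is an illustration under stated assumptions, not the library's implementation: `scale_factor` and `redistribute` are hypothetical names, and handing the freed probability to the unconstrained datasets in proportion to their original weights is one plausible reading of "expand to fill the gap".

```python
def scale_factor(available, needed):
    """Fraction of the scheduled demand that the available data can cover."""
    if needed <= 0:
        return 1.0
    return min(1.0, available / needed)


def redistribute(weights, factors):
    """Shrink constrained datasets and give the freed weight to the others.

    Hypothetical helper: the freed probability mass is split among datasets
    with factor 1.0, in proportion to their original weights (an assumption).
    """
    scaled = {n: w * factors.get(n, 1.0) for n, w in weights.items()}
    freed = sum(weights.values()) - sum(scaled.values())
    free_names = [n for n, w in weights.items() if factors.get(n, 1.0) >= 1.0 and w > 0]
    free_total = sum(weights[n] for n in free_names)
    if free_total == 0:
        return scaled  # nothing can absorb the gap; caller must handle it
    for n in free_names:
        scaled[n] += freed * weights[n] / free_total
    return scaled


# "medium" has only 50% of the samples the schedule asks for.
factor = scale_factor(available=50_000, needed=100_000)
mixed = redistribute({"easy": 0.5, "medium": 0.5}, {"medium": factor})
print(factor, mixed)  # 0.5 {'easy': 0.75, 'medium': 0.25}
```

Note that expanding the other datasets raises their own demand, so a real planner would have to re-check their budgets after redistribution.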

Dataset Sizes

Sizes are inferred automatically:

datasets = [
    {"name": "A", "dataset": my_dataset},  # len() called automatically
]

Or specified manually:

datasets = [
    {"name": "A", "dataset": huggingface_repo_id, "size": 50000},  # For streaming
]
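The two ways of supplying a size can be thought of as a simple fallback. The sketch below is an assumption about how such a lookup might behave, not the library's code; `resolve_size` is a hypothetical helper.

```python
def resolve_size(entry):
    """Return the number of samples for one dataset entry.

    Prefers an explicit "size"; otherwise falls back to len() of the dataset.
    Hypothetical helper for illustration; not part of the curriculus API.
    """
    if entry.get("size") is not None:
        return int(entry["size"])
    try:
        return len(entry["dataset"])
    except TypeError as exc:
        # Streaming datasets and generators have no len().
        raise ValueError(
            f"Dataset {entry['name']!r} has no len(); pass an explicit 'size'"
        ) from exc


in_memory = {"name": "A", "dataset": [1, 2, 3]}
streaming = {"name": "B", "dataset": (x for x in range(10)), "size": 50000}
print(resolve_size(in_memory), resolve_size(streaming))  # 3 50000
```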

Configuration Options

CurriculusIterableDataset

  • datasets: List of {"name": ..., "dataset": ...} dicts, with an optional "size" key for datasets that have no len() (e.g., streaming).
  • schedule: List of (progress, weights) tuples. If None, a sequential schedule is auto-generated.
  • total_steps: Total training steps. If None, the sum of all dataset sizes is used.
  • oversampling: If True, repeats data when a dataset is too small for the schedule. Default: False.
  • best_effort: If True, scales down a short dataset's usage instead of failing. Default: True.

Real-World Example

from datasets import load_dataset  # assuming Hugging Face Datasets
from torch.utils.data import DataLoader

from curriculus import CurriculusIterableDataset

# Step 1: Load your datasets (repo names are placeholders; the split is assumed)
easy_data = load_dataset("my_dataset/easy", split="train")
medium_data = load_dataset("my_dataset/medium", split="train")
hard_data = load_dataset("my_dataset/hard", split="train")

# Step 2: Define the curriculum
datasets = [
    {"name": "easy", "dataset": easy_data},
    {"name": "medium", "dataset": medium_data},
    {"name": "hard", "dataset": hard_data},
]

# Step 3: Create the dataset
curriculum_ds = CurriculusIterableDataset(
    datasets,
    total_steps=100_000,
    oversampling=True,  # hard data is small, so repeat
)

# Step 4: Use in training loop (model is your own training wrapper)
for batch in DataLoader(curriculum_ds, batch_size=32):
    loss = model.train_step(batch)

Advanced: Pre-flight Validation

Check your schedule without training:

from curriculus import CurriculusPlanner

planner = CurriculusPlanner(
    datasets,
    schedule=my_schedule,
    total_steps=100_000,
    oversampling=False,
    best_effort=True,
)

print(planner.get_plan_summary())
# Output:
# Total Steps: 100000
# Dataset Budget:
#   easy: OK (1000000 available)
#   medium: SCALED (50000 available, 60000 needed (0.83x))
#   hard: OK (30000 available)

Architecture

The library separates concerns:

  • CurriculusPlanner: Validates schedules, calculates sample budgets, and runs pre-flight checks
  • CurriculusIterableDataset: Implements the actual sampling at training time

This allows you to validate your configuration before training starts, catching issues early.

API Reference

CurriculusPlanner

Validates and calculates sample budgets.

planner = CurriculusPlanner(
    datasets,
    schedule=my_schedule,
    total_steps=100_000,
    oversampling=True,
    best_effort=True,
)

# Inspect
print(planner.scale_factors)  # Dict of scaling factors
print(planner.dataset_integrals)  # Area under each curve
print(planner.get_plan_summary())  # Human-readable plan
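To give a feel for what "area under each curve" and a scaling factor could mean, here is a rough sketch. It assumes demand is the trapezoid-rule area of each dataset's piecewise-linear weight curve multiplied by total_steps, and that the factor is available divided by needed, capped at 1.0. `curve_areas` is a hypothetical helper and the planner's real computation may differ.

```python
def curve_areas(schedule):
    """Area under each dataset's piecewise-linear weight curve (trapezoid rule).

    Hypothetical helper for illustration; not part of the curriculus API.
    """
    names = sorted({n for _, w in schedule for n in w})
    areas = dict.fromkeys(names, 0.0)
    for (p0, w0), (p1, w1) in zip(schedule, schedule[1:]):
        for n in names:
            areas[n] += (p1 - p0) * (w0.get(n, 0.0) + w1.get(n, 0.0)) / 2
    return areas


# Constant 40/60 mix over the whole run (illustrative numbers).
schedule = [
    (0.0, {"easy": 0.4, "medium": 0.6}),
    (1.0, {"easy": 0.4, "medium": 0.6}),
]
total_steps = 100_000
available = {"easy": 1_000_000, "medium": 50_000}

needed = {n: a * total_steps for n, a in curve_areas(schedule).items()}
factors = {n: min(1.0, available[n] / needed[n]) for n in needed}
print({n: round(f, 2) for n, f in factors.items()})  # {'easy': 1.0, 'medium': 0.83}
```

With these numbers, medium needs about 60,000 samples but has 50,000, giving a factor of roughly 0.83.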

CurriculusIterableDataset

Iterates over mixed samples.

dataset = CurriculusIterableDataset(
    datasets,
    schedule=...,
    total_steps=100_000,
)

for sample in dataset:
    # Sample is from the appropriate dataset based on progress
    pass
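Conceptually, iterating over a mixed dataset means picking a source at every step according to the current weights. The generator below is a simplified sketch of that idea and not the library's implementation: `mixed_samples` and `schedule_fn` are hypothetical names, and `itertools.cycle` stands in for oversampling so the sources never run out.

```python
import random
from itertools import cycle


def mixed_samples(sources, schedule_fn, total_steps, seed=0):
    """Yield (dataset_name, sample) pairs drawn according to a schedule.

    `schedule_fn(progress)` returns {name: weight} for progress in [0.0, 1.0].
    Hypothetical helper for illustration; not part of the curriculus API.
    """
    rng = random.Random(seed)
    iterators = {name: iter(data) for name, data in sources.items()}
    for step in range(total_steps):
        progress = step / max(total_steps - 1, 1)
        weights = schedule_fn(progress)
        # Only datasets with positive weight are eligible at this step.
        names = [n for n in weights if weights[n] > 0]
        name = rng.choices(names, weights=[weights[n] for n in names])[0]
        yield name, next(iterators[name])


# cycle() repeats each source indefinitely (a stand-in for oversampling).
sources = {"easy": cycle(["e1", "e2"]), "hard": cycle(["h1", "h2"])}
crossfade = lambda p: {"easy": 1.0 - p, "hard": p}

picked = list(mixed_samples(sources, crossfade, total_steps=10))
print(picked[0][0], picked[-1][0])  # easy hard
```

The first step is always drawn from easy and the last from hard because the other dataset has weight 0.0 there; the steps in between are random but reproducible for a fixed seed.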

generate_sequential_schedule

Auto-generates a simple crossfade schedule. This function is called by default if you don't provide a schedule, and you will rarely need to use it directly.

from curriculus import generate_sequential_schedule

schedule = generate_sequential_schedule(["dataset_A", "dataset_B", "dataset_C"])
# Result: A (100%) -> B (100%) -> C (100%)
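For intuition, a sequential crossfade can be written as evenly spaced milestones where each dataset in turn has weight 1.0. The function below is a hypothetical sketch of that shape; the milestones that `generate_sequential_schedule` actually produces may be placed differently.

```python
def sketch_sequential_schedule(names):
    """Build evenly spaced milestones, one pure dataset per milestone.

    Hypothetical illustration; not the library's generate_sequential_schedule.
    """
    if not names:
        raise ValueError("at least one dataset name is required")
    if len(names) == 1:
        # A single dataset is used at full weight for the whole run.
        return [(0.0, {names[0]: 1.0}), (1.0, {names[0]: 1.0})]
    last = len(names) - 1
    return [
        (i / last, {n: 1.0 if n == current else 0.0 for n in names})
        for i, current in enumerate(names)
    ]


sketch = sketch_sequential_schedule(["dataset_A", "dataset_B", "dataset_C"])
for progress, weights in sketch:
    print(progress, weights)
# 0.0 {'dataset_A': 1.0, 'dataset_B': 0.0, 'dataset_C': 0.0}
# 0.5 {'dataset_A': 0.0, 'dataset_B': 1.0, 'dataset_C': 0.0}
# 1.0 {'dataset_A': 0.0, 'dataset_B': 0.0, 'dataset_C': 1.0}
```

Linear interpolation between these milestones then yields the A to B to C crossfade.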

Testing

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# With coverage
pytest --cov=curriculus

# View HTML coverage report
pytest --cov=curriculus --cov-report=html
# Open htmlcov/index.html

Contributing

Contributions welcome! Please:

  1. Fork the repo
  2. Create a feature branch
  3. Add tests for your changes
  4. Ensure tests pass: pytest --cov=curriculus
  5. Run linter: ruff check --fix .
  6. Submit a pull request

License

MIT License. See LICENSE file for details.

Citation

If you use this library in research, please cite:

@software{curriculus2025,
  title={Curriculus: Progressive Curriculum Learning Datasets for LLM Training},
  author={Omar Kamali},
  year={2025},
  url={https://github.com/omarkamali/curriculus}
}

Troubleshooting

"Dataset 'X' shortage!"

The schedule demands more samples from dataset X than it contains:

  • Solution 1: Enable best_effort=True (default)
  • Solution 2: Enable oversampling=True
  • Solution 3: Increase dataset size or reduce total_steps

Weights don't sum to 1.0

The weights at every milestone must sum to 1.0, otherwise the schedule is invalid:

# ❌ Bad
schedule = [(0.0, {"A": 0.8, "B": 0.1})]  # Sum = 0.9

# ✅ Good
schedule = [(0.0, {"A": 0.8, "B": 0.2})]  # Sum = 1.0
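If you want to catch this before constructing the dataset, a small standalone check is easy to write. The sketch below is independent of the library: `check_schedule` is a hypothetical helper and the tolerance of 1e-6 is an arbitrary choice.

```python
import math


def check_schedule(schedule, tol=1e-6):
    """Return a list of human-readable problems found in a schedule.

    Hypothetical helper for illustration; not part of the curriculus API.
    """
    problems = []
    previous = None
    for progress, weights in schedule:
        if not 0.0 <= progress <= 1.0:
            problems.append(f"milestone {progress}: progress must be within [0.0, 1.0]")
        if previous is not None and progress < previous:
            problems.append(f"milestone {progress}: milestones must be in ascending order")
        total = sum(weights.values())
        if not math.isclose(total, 1.0, abs_tol=tol):
            problems.append(f"milestone {progress}: weights sum to {total:.3f}, expected 1.0")
        previous = progress
    return problems


bad = [(0.0, {"A": 0.8, "B": 0.1})]
good = [(0.0, {"A": 0.8, "B": 0.2})]
print(check_schedule(bad))   # ['milestone 0.0: weights sum to 0.900, expected 1.0']
print(check_schedule(good))  # []
```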

All samples from one dataset

Check that your schedule includes all datasets. If a dataset doesn't appear in the schedule, it's never sampled.

Questions?

Open an issue: https://github.com/omarkamali/curriculus/issues
