Curriculus
Progressive curriculum learning for LLM training with fine-grained schedule control.
What is this?
Curriculus helps you gradually mix and transition between different datasets during training. Instead of throwing all your data at a model at once, you can start with simpler data (e.g., an easy set), smoothly transition to more complex data (e.g., a medium set), and finish with the most demanding or task-specific data (e.g., a hard set).
The key idea is linear interpolation between the sampling probabilities defined at each milestone. You define milestones (e.g., "at 20%, start mixing medium in"), and the library computes the mixing weights in between and samples from each dataset accordingly.
Why?
Training on progressively more complex data can:
- ✅ Improve model convergence and final performance
- ✅ Reduce training instability and catastrophic forgetting
- ✅ Allow precise control over when each dataset is used
- ✅ Handle datasets of different sizes gracefully
Installation
pip install curriculus
With PyTorch support:
pip install "curriculus[torch]"
Quick Start
Minimal Example (Sequential Fading)
from curriculus import CurriculusIterableDataset

# Your datasets (easy_data, medium_data, hard_data are loaded elsewhere)
datasets = [
    {"name": "easy", "dataset": easy_data},
    {"name": "medium", "dataset": medium_data},
    {"name": "hard", "dataset": hard_data},
]

# Auto-generates: easy -> medium -> hard with train/test split
dataset_dict = CurriculusIterableDataset(datasets, train_ratio=0.8)

# Use with your trainer
for sample in dataset_dict["train"]:
    # sample comes from the appropriate dataset based on training progress
    pass
Custom Schedule
from curriculus import CurriculusIterableDataset

# Explicit schedule: define milestones and weights
schedule = [
    (0.0, {"easy": 1.0, "medium": 0.0, "hard": 0.0}),
    (0.2, {"easy": 1.0, "medium": 0.0, "hard": 0.0}),  # Warmup
    (0.4, {"easy": 0.5, "medium": 0.5, "hard": 0.0}),  # Easing
    (0.6, {"easy": 0.0, "medium": 1.0, "hard": 0.0}),  # Pure medium
    (0.8, {"easy": 0.0, "medium": 0.5, "hard": 0.5}),  # Mix
    (1.0, {"easy": 0.0, "medium": 0.0, "hard": 1.0}),  # Pure hard
]

dataset_dict = CurriculusIterableDataset(
    datasets,
    schedule=schedule,
    total_steps=10000,
    oversampling=True,  # Repeat data if insufficient
    best_effort=True,   # Scale down gracefully if short (default)
    train_ratio=0.9,    # 90% train, 10% test
)

# Access splits
train_data = dataset_dict["train"]
test_data = dataset_dict["test"]
How It Works
Schedule Interpretation
A schedule is a list of (progress_percent, {dataset: weight}) tuples:
- progress_percent (0.0 to 1.0): Where you are in training
- weight: Probability of sampling from that dataset at this milestone
The library linearly interpolates between milestones. If you define:
- 0%: easy=1.0
- 100%: medium=1.0
Then at 50% progress, both have weight 0.5 (50/50 mix).
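The interpolation described above can be sketched in a few lines. This is a minimal illustration, not the library's implementation: `interpolate_weights` is a hypothetical helper, and it assumes milestones are sorted by strictly increasing progress and that a dataset missing from a milestone has weight 0.0.

```python
from bisect import bisect_right


def interpolate_weights(schedule, progress):
    """Linearly interpolate dataset weights at `progress` (0.0 to 1.0).

    Hypothetical helper for illustration; assumes `schedule` is sorted by
    strictly increasing progress.
    """
    points = [p for p, _ in schedule]
    names = sorted({name for _, weights in schedule for name in weights})

    # Clamp to the first/last milestone outside the scheduled range.
    if progress <= points[0]:
        return {n: schedule[0][1].get(n, 0.0) for n in names}
    if progress >= points[-1]:
        return {n: schedule[-1][1].get(n, 0.0) for n in names}

    # Find the two milestones that bracket `progress`.
    hi = bisect_right(points, progress)
    (p0, w0), (p1, w1) = schedule[hi - 1], schedule[hi]
    t = (progress - p0) / (p1 - p0)
    return {n: (1 - t) * w0.get(n, 0.0) + t * w1.get(n, 0.0) for n in names}


schedule = [(0.0, {"easy": 1.0}), (1.0, {"medium": 1.0})]
print(interpolate_weights(schedule, 0.5))  # {'easy': 0.5, 'medium': 0.5}
```

Because both endpoints sum to 1.0, every interpolated point also sums to 1.0, so the result can be used directly as a sampling distribution.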
Automatic Scale-Down (Best Effort)
If you don't have enough data:
- best_effort=True (default): Reduces the dataset's sampling probability to make it last
- oversampling=True: Repeats data to fulfill the schedule
- Both False: Raises an error
Example: If medium appears in the schedule but you only have 50% of the required samples:
- Best effort scales it down by 50%
- Other datasets naturally expand to fill the gap
- Training completes without crashing
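One way to picture the scale-down is sketched below. The exact algorithm Curriculus uses is not specified here, so treat this as an assumption: `scale_factor` and `apply_best_effort` are hypothetical helpers in which a short dataset keeps only the fraction of its weight it can cover, and datasets that are not short absorb the freed probability in proportion to their current weight.

```python
def scale_factor(available, needed):
    """Fraction of scheduled demand a dataset can cover, capped at 1.0."""
    return 1.0 if needed <= 0 else min(1.0, available / needed)


def apply_best_effort(weights, factors):
    """Shrink short datasets; let fully stocked ones absorb the freed weight.

    Hypothetical sketch, not the library's actual implementation.
    """
    scaled = {n: w * factors.get(n, 1.0) for n, w in weights.items()}
    gap = 1.0 - sum(scaled.values())
    # Datasets that were not scaled down and are currently in use.
    roomy = {n: w for n, w in scaled.items() if factors.get(n, 1.0) >= 1.0 and w > 0}
    if gap > 0 and roomy:
        roomy_total = sum(roomy.values())
        for n, w in roomy.items():
            scaled[n] += gap * w / roomy_total
    else:
        # Nothing can absorb the gap (or there is no gap): renormalize.
        total = sum(scaled.values())
        scaled = {n: w / total for n, w in scaled.items()}
    return scaled


# medium has 50,000 samples but the schedule would need 100,000.
factors = {"medium": scale_factor(50_000, 100_000)}
print(apply_best_effort({"easy": 0.5, "medium": 0.5}, factors))
# {'easy': 0.75, 'medium': 0.25}
```

In this sketch the medium weight is halved at every point of the schedule, which halves its total consumption, while easy grows to keep the distribution summing to 1.0.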
Dataset Sizes
Sizes are inferred automatically:
datasets = [
    {"name": "A", "dataset": my_dataset},  # len() called automatically
]
Or specified manually:
datasets = [
    {"name": "A", "dataset": huggingface_repo_id, "size": 50000},  # For streaming
]
Configuration Options
CurriculusIterableDataset
- datasets: List of {"name": ..., "dataset": ...} dicts.
- schedule: List of (progress, weights) tuples. If None, auto-generates sequential schedule.
- total_steps: Total training steps. If None, sums all dataset sizes.
- oversampling: If True, repeats data when insufficient. Default: False.
- best_effort: If True, scales down dataset usage gracefully. Default: True.
- train_ratio: Fraction of total steps for train split (0.0-1.0). Default: 1.0 (train only).
- split_names: Tuple of (train_name, test_name). Default: ("train", "test").
Returns:
CurriculusIterableDatasetDict mapping of split names to iterable datasets
Real-World Example
from datasets import load_dataset
from torch.utils.data import DataLoader

from curriculus import CurriculusIterableDataset

# Step 1: Load your datasets
easy_data = load_dataset("my_dataset/easy")
medium_data = load_dataset("my_dataset/medium")
hard_data = load_dataset("my_dataset/hard")

# Step 2: Define the curriculum
datasets = [
    {"name": "easy", "dataset": easy_data},
    {"name": "medium", "dataset": medium_data},
    {"name": "hard", "dataset": hard_data},
]

# Step 3: Create dataset with 85% train split
curriculum_dict = CurriculusIterableDataset(
    datasets,
    total_steps=100_000,
    oversampling=True,
    train_ratio=0.85,
)

# Step 4: Use splits (model is your own training wrapper)
for batch in DataLoader(curriculum_dict["train"], batch_size=32):
    loss = model.train_step(batch)
for batch in DataLoader(curriculum_dict["test"], batch_size=32):
    metrics = model.eval_step(batch)
Advanced: Pre-flight Validation
Check your schedule without training:
from curriculus import CurriculusPlanner

planner = CurriculusPlanner(
    datasets,
    schedule=my_schedule,
    total_steps=100_000,
    oversampling=False,
    best_effort=True,
)
print(planner.get_plan_summary())
# Output:
# Total Steps: 100000
# Dataset Budget:
#   easy: OK (1000000 available)
#   medium: SCALED (50000 available, 60000 needed (0.83x))
#   hard: OK (30000 available)
Architecture
The library separates concerns:
- CurriculusPlanner: Validates schedules, calculates sample budgets, pre-flight checks
- CurriculusIterableDataset: Implements the actual sampling at training time
This allows you to validate your configuration before training starts, catching issues early.
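To make the sampling side concrete, here is a rough sketch of a progress-driven mixer. It is not the library's code: `mix_by_schedule` and `crossfade` are hypothetical names, progress is assumed to be the step index divided by the last step index, and `itertools.cycle` stands in for oversampling.

```python
import itertools
import random


def mix_by_schedule(sources, weights_at, total_steps, seed=0):
    """Yield (dataset_name, sample) pairs using progress-dependent weights.

    Hypothetical sketch; `weights_at(progress)` returns a {name: weight} dict.
    """
    rng = random.Random(seed)
    # cycle() repeats a source once exhausted, similar in spirit to oversampling=True.
    iterators = {name: itertools.cycle(data) for name, data in sources.items()}
    for step in range(total_steps):
        progress = step / max(total_steps - 1, 1)
        weights = weights_at(progress)
        names = list(weights)
        chosen = rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
        yield chosen, next(iterators[chosen])


def crossfade(progress):
    """Two-dataset schedule: all easy at the start, all hard at the end."""
    return {"easy": 1.0 - progress, "hard": progress}


sources = {"easy": ["e1", "e2", "e3"], "hard": ["h1", "h2"]}
samples = list(mix_by_schedule(sources, crossfade, total_steps=10))
print(samples[0], samples[-1])
```

The first draw always comes from easy and the last from hard, because the other dataset has weight 0.0 at those points; the draws in between depend on the seed.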
API Reference
CurriculusPlanner
Validates and calculates sample budgets.
planner = CurriculusPlanner(
    datasets,
    schedule=my_schedule,
    total_steps=100_000,
    oversampling=True,
    best_effort=True,
)

# Inspect
print(planner.scale_factors)       # Dict of scaling factors
print(planner.dataset_integrals)   # Area under each curve
print(planner.get_plan_summary())  # Human-readable plan
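The "area under each curve" can be worked out by hand for a piecewise-linear schedule using the trapezoid rule. The sketch below is an illustration under that assumption, not the planner's actual code; `schedule_integrals` is a hypothetical helper. Multiplying each area by total_steps gives the expected number of samples the schedule draws from that dataset.

```python
def schedule_integrals(schedule, names):
    """Area under each dataset's piecewise-linear weight curve over [0, 1].

    Hypothetical helper; assumes milestones are sorted by progress.
    """
    areas = {n: 0.0 for n in names}
    for (p0, w0), (p1, w1) in zip(schedule, schedule[1:]):
        for n in names:
            # Trapezoid between two consecutive milestones.
            areas[n] += (p1 - p0) * (w0.get(n, 0.0) + w1.get(n, 0.0)) / 2
    return areas


schedule = [
    (0.0, {"easy": 1.0, "hard": 0.0}),
    (1.0, {"easy": 0.0, "hard": 1.0}),
]
areas = schedule_integrals(schedule, ["easy", "hard"])
needed = {n: round(a * 100_000) for n, a in areas.items()}
print(areas)   # {'easy': 0.5, 'hard': 0.5}
print(needed)  # {'easy': 50000, 'hard': 50000}
```

Comparing `needed` against the available size of each dataset is what tells you whether a dataset is fine, has to be scaled down, or has to be oversampled.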
CurriculusIterableDataset
Iterates over mixed samples.
dataset_dict = CurriculusIterableDataset(
    datasets,
    schedule=...,
    total_steps=100_000,
)
for sample in dataset_dict["train"]:
    # Sample is from the appropriate dataset based on progress
    pass
generate_sequential_schedule
Auto-generates a simple crossfade schedule. This function is called by default if you don't provide a schedule, and you will rarely need to use it directly.
from curriculus import generate_sequential_schedule
schedule = generate_sequential_schedule(["dataset_A", "dataset_B", "dataset_C"])
# Result: A (100%) -> B (100%) -> C (100%)
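For intuition, a sequential crossfade could be built by placing one "pure" milestone per dataset at evenly spaced progress points. The actual output of generate_sequential_schedule may differ (for example, it may add plateaus); `sequential_schedule_sketch` below is a hypothetical stand-in, not the library function.

```python
def sequential_schedule_sketch(names):
    """Evenly spaced milestones, each giving one dataset 100% of the weight.

    Hypothetical illustration; not the library's generate_sequential_schedule.
    """
    if len(names) == 1:
        return [(0.0, {names[0]: 1.0}), (1.0, {names[0]: 1.0})]
    last = len(names) - 1
    return [
        (i / last, {n: (1.0 if n == current else 0.0) for n in names})
        for i, current in enumerate(names)
    ]


for milestone in sequential_schedule_sketch(["dataset_A", "dataset_B", "dataset_C"]):
    print(milestone)
# (0.0, {'dataset_A': 1.0, 'dataset_B': 0.0, 'dataset_C': 0.0})
# (0.5, {'dataset_A': 0.0, 'dataset_B': 1.0, 'dataset_C': 0.0})
# (1.0, {'dataset_A': 0.0, 'dataset_B': 0.0, 'dataset_C': 1.0})
```

With linear interpolation between these milestones, dataset_A fades out while dataset_B fades in over the first half of training, and the same happens between dataset_B and dataset_C over the second half.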
Testing
# Install dev dependencies
pip install -e ".[dev]"
# Run tests
pytest
# With coverage
pytest --cov=curriculus
# View HTML coverage report
pytest --cov=curriculus --cov-report=html
# Open htmlcov/index.html
Contributing
Contributions welcome! Please:
- Fork the repo
- Create a feature branch
- Add tests for your changes
- Ensure tests pass: pytest --cov=curriculus
- Run linter: ruff check --fix .
- Submit a pull request
License
MIT License. See LICENSE file for details.
Citation
If you use this library in research, please cite:
@software{curriculus2025,
title={Curriculus: Progressive Curriculum Learning Datasets for LLM Training},
author={Omar Kamali},
year={2025},
url={https://github.com/omarkamali/curriculus}
}
Troubleshooting
"Dataset 'X' shortage!"
You have more schedule demand than available data:
- Solution 1: Enable best_effort=True (default)
- Solution 2: Enable oversampling=True
- Solution 3: Increase dataset size or reduce total_steps
Weights don't sum to 1.0
Your schedule is invalid:
# ❌ Bad
schedule = [(0.0, {"A": 0.8, "B": 0.1})] # Sum = 0.9
# ✅ Good
schedule = [(0.0, {"A": 0.8, "B": 0.2})] # Sum = 1.0
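If you want to catch this before constructing anything, a small standalone check is enough. The helper below is a hypothetical sketch (`check_schedule` is not part of Curriculus); it assumes progress values should lie in [0.0, 1.0], increase strictly, and carry weights that sum to 1.0 within a small tolerance.

```python
import math


def check_schedule(schedule, tol=1e-6):
    """Return a list of human-readable problems found in a schedule.

    Hypothetical helper for illustration; an empty list means no problems.
    """
    problems = []
    previous = None
    for progress, weights in schedule:
        if not 0.0 <= progress <= 1.0:
            problems.append(f"milestone {progress}: progress must be within [0.0, 1.0]")
        if previous is not None and progress <= previous:
            problems.append(f"milestone {progress}: progress must increase")
        total = sum(weights.values())
        if not math.isclose(total, 1.0, abs_tol=tol):
            problems.append(f"milestone {progress}: weights sum to {total:.3f}, expected 1.0")
        previous = progress
    return problems


print(check_schedule([(0.0, {"A": 0.8, "B": 0.1})]))
# ['milestone 0.0: weights sum to 0.900, expected 1.0']
print(check_schedule([(0.0, {"A": 0.8, "B": 0.2})]))
# []
```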
All samples from one dataset
Check that your schedule includes all datasets. If a dataset doesn't appear in the schedule, it's never sampled.
Questions?
Open an issue: https://github.com/omarkamali/curriculus/issues
Example Notebooks
Explore end-to-end walkthroughs in the examples/ directory:
- Sequential difficulty fade – examples/01_easy_medium_hard.ipynb