SAGE Data - Unified dataset management module for SAGE benchmark suite
SAGE Data
Dataset management module for SAGE benchmark suite
Provides unified access to multiple datasets through a two-layer architecture:
- Sources: Physical datasets in sage/data/sources/ (qa_base, bbh, mmlu, gpqa, locomo, orca_dpo, agent_benchmark, agent_sft, agent_tools, etc.)
- Usages: Logical views for experiments in sage/data/usages/ (rag, libamm, neuromem, agent_eval)
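To make the two layers concrete, here is a minimal sketch of how a usage view could restrict and delegate to a source registry. It is not the package's implementation: `SourceRegistry`, `UsageView`, and the placeholder loaders are hypothetical names chosen for illustration, and only the source and usage names come from this README.

```python
from dataclasses import dataclass, field


@dataclass
class SourceRegistry:
    """Layer 1 (hypothetical): maps a source name to a factory that builds its loader."""
    factories: dict = field(default_factory=dict)

    def register(self, name, factory):
        self.factories[name] = factory

    def load(self, name):
        if name not in self.factories:
            raise KeyError(f"unknown source: {name!r}")
        return self.factories[name]()


@dataclass
class UsageView:
    """Layer 2 (hypothetical): a named view exposing only a subset of sources."""
    name: str
    registry: SourceRegistry
    allowed: tuple

    def load(self, source_name):
        if source_name not in self.allowed:
            raise KeyError(f"{source_name!r} is not part of usage {self.name!r}")
        return self.registry.load(source_name)


registry = SourceRegistry()
registry.register("qa_base", lambda: ["q1", "q2"])  # placeholder loader, not real data
registry.register("bbh", lambda: ["task_a"])        # placeholder loader, not real data

# The "rag" view only exposes the sources it was given.
rag = UsageView(name="rag", registry=registry, allowed=("qa_base",))
rag_queries = rag.load("qa_base")
```

The point of the split is that experiments depend on a usage name rather than on where a dataset physically lives, so sources can be reorganized without touching experiment code.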
Quick Start
Installation
# Run the quickstart script (recommended)
./quickstart.sh
# Or install manually
pip install -e .
# Install with optional dependencies
pip install -e ".[all]" # All datasets
pip install -e ".[datasets]" # Hugging Face datasets
pip install -e ".[alignment]" # DPO/alignment tools
pip install -e ".[agent]" # Agent datasets
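After installing with extras, it can be useful to confirm that an optional dependency is actually importable before loading an on-demand dataset. The sketch below uses only the standard library. The mapping from extra name to module name is an assumption: this README only indicates that the datasets extra pulls in Hugging Face datasets, so check the project's packaging metadata for the others.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


# Assumed mapping from extra name to one representative import name.
# Only the "datasets" entry is suggested by this README.
ASSUMED_EXTRA_MODULES = {"datasets": "datasets"}

availability = {
    extra: is_installed(module)
    for extra, module in ASSUMED_EXTRA_MODULES.items()
}
print(availability)
```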
Basic Usage
from sage.data import DataManager
manager = DataManager.get_instance()
# Access datasets by logical usage profile
rag = manager.get_by_usage("rag")
qa_loader = rag.load("qa_base") # already instantiated
queries = qa_loader.load_queries()
# Or fetch a specific data source directly
bbh_loader = manager.get_by_source("bbh")
tasks = bbh_loader.get_task_names()
Available Datasets
| Dataset | Description | Download Required | Storage |
|---|---|---|---|
| qa_base | Question-Answering with knowledge base | No (included) | Local files |
| locomo | Long-context memory benchmark | Yes (python -m locomo.download) | Local files (2.68MB) |
| bbh | BIG-Bench Hard reasoning tasks | No (included) | Local JSON files |
| mmlu | Massive Multitask Language Understanding | Optional (python -m mmlu.download --all-subjects) | On-demand or Local (~160MB) |
| gpqa | Graduate-Level Question Answering | Auto (Hugging Face) | On-demand (~5MB cached) |
| orca_dpo | Preference pairs for alignment/DPO | Auto (Hugging Face) | On-demand (varies) |
| agent_benchmark | Agent evaluation tasks | No (included) | Local JSON files |
| agent_sft | Agent supervised fine-tuning conversations | No (included) | Local JSON files |
| agent_tools | Agent tool catalog and schemas | No (included) | Local JSON files |
See examples/ for detailed usage examples.
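When planning an offline run, it helps to know which datasets work without any download step. The following sketch transcribes the "Download Required" column of the table above into plain Python data and filters it. The `DATASETS` list and `bundled` helper are illustrative only and are not part of the package.

```python
# Rows transcribed from the table above: (dataset, download requirement).
DATASETS = [
    ("qa_base", "no"),
    ("locomo", "yes"),
    ("bbh", "no"),
    ("mmlu", "optional"),
    ("gpqa", "auto"),
    ("orca_dpo", "auto"),
    ("agent_benchmark", "no"),
    ("agent_sft", "no"),
    ("agent_tools", "no"),
]


def bundled(datasets):
    """Return the names of datasets that ship with the package."""
    return [name for name, requirement in datasets if requirement == "no"]


offline_ready = bundled(DATASETS)
```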
Project Structure
sage/data/
├── sources/                 # Physical dataset loaders
│   ├── qa_base/             # Q&A with knowledge base
│   ├── bbh/                 # BIG-Bench Hard tasks
│   ├── mmlu/                # MMLU benchmark
│   ├── gpqa/                # Graduate-level Q&A
│   ├── locomo/              # Long-context memory
│   ├── orca_dpo/            # DPO preference pairs
│   ├── agent_benchmark/     # Agent evaluation tasks
│   ├── agent_sft/           # Agent SFT conversations
│   └── agent_tools/         # Agent tool catalog
└── usages/                  # Logical views and profiles
    ├── rag/                 # RAG experiments
    ├── libamm/              # LibAMM benchmarks
    ├── neuromem/            # Neuromem experiments
    └── agent_eval/          # Agent evaluation profiles
See docs/ARCHITECTURE.md for detailed design documentation.
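Because each source and usage is a directory, a checkout can be inspected with a few lines of standard-library code. This is a sketch for exploring the layout, not how the package discovers its datasets; `list_packages` is a hypothetical helper.

```python
from pathlib import Path


def list_packages(root: Path) -> list[str]:
    """Return sorted names of immediate subdirectories, skipping private or hidden ones."""
    if not root.is_dir():
        return []
    return sorted(
        entry.name
        for entry in root.iterdir()
        if entry.is_dir() and not entry.name.startswith(("_", "."))
    )


# Run from the repository root; returns an empty list if the path is absent.
sources = list_packages(Path("sage/data/sources"))
usages = list_packages(Path("sage/data/usages"))
```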
Examples
python examples/qa_examples.py # QA dataset usage
python examples/locomo_examples.py # LoCoMo dataset usage
python examples/bbh_examples.py # BBH dataset usage
python examples/mmlu_examples.py # MMLU dataset usage
python examples/gpqa_examples.py # GPQA dataset usage
python examples/orca_dpo_examples.py # Orca DPO dataset usage
python examples/integration_example.py # Cross-dataset integration
License
MIT License - see LICENSE file.
Links
- Repository: https://github.com/intellistream/sageData
- Issues: https://github.com/intellistream/sageData/issues
Common Issues
Q: Where's the LoCoMo data?
A: Run python -m locomo.download to download it (2.68MB from Hugging Face).
Q: How to download MMLU for offline use?
A: Run python -m mmlu.download --all-subjects to download all subjects (~160MB).
Q: GPQA access error?
A: You need to accept the dataset terms on Hugging Face: https://huggingface.co/datasets/Idavidrein/gpqa
Q: How to use Orca DPO for alignment research?
A: Get the loader with DataManager.get_instance().get_by_source("orca_dpo"), then use format_for_dpo() to prepare data for training.
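For orientation, DPO training typically consumes records with a prompt, a preferred ("chosen") response, and a dispreferred ("rejected") response. The sketch below shows one way such a record could be assembled from a raw preference row. It does not reproduce what format_for_dpo() returns: `to_dpo_record` is a hypothetical helper, and the input field names (system, question, chosen, rejected) are an assumption about the raw data.

```python
def to_dpo_record(row: dict) -> dict:
    """Map one preference row to a prompt/chosen/rejected triple.

    Assumes the row has "question", "chosen", and "rejected" keys and an
    optional "system" key; these names are illustrative, not confirmed.
    """
    system = row.get("system", "").strip()
    question = row["question"].strip()
    # Prepend the system message only when one is present.
    prompt = f"{system}\n\n{question}" if system else question
    return {"prompt": prompt, "chosen": row["chosen"], "rejected": row["rejected"]}


example = to_dpo_record(
    {
        "system": "You are a helpful assistant.",
        "question": "What is 2 + 2?",
        "chosen": "4",
        "rejected": "5",
    }
)
```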
Version: 0.2.1.0 | Last Updated: January 2026