SAGE Data - Unified dataset management module for SAGE benchmark suite
# SAGE Data

**Dataset management module for the SAGE benchmark suite**

Provides unified access to multiple datasets through a two-layer architecture:

- **Sources**: Physical datasets in `sage/data/sources/` (qa_base, bbh, mmlu, gpqa, locomo, orca_dpo, agent_benchmark, etc.)
- **Usages**: Logical views for experiments, documented in `docs/usages/` (rag, agent_eval, etc.)
## Quick Start

### Installation

```bash
# Run the quickstart script (recommended)
./quickstart.sh

# Or install manually
pip install -e .

# Install with optional dependencies
pip install -e ".[all]"        # All datasets
pip install -e ".[datasets]"   # Hugging Face datasets
pip install -e ".[alignment]"  # DPO/alignment tools
```
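Because some datasets depend on optional extras, it can help to check for an optional dependency before relying on it. The following is a minimal sketch, not part of the package: `has_module` is a hypothetical helper, and the module name `datasets` is an assumption about what the `[datasets]` extra installs.

```python
from importlib.util import find_spec


def has_module(name: str) -> bool:
    """Return True if a top-level module can be found in this environment."""
    return find_spec(name) is not None


# "datasets" is an assumed module name for the Hugging Face extra.
if has_module("datasets"):
    print("Hugging Face datasets support is available")
else:
    print('Missing optional dependency; try: pip install -e ".[datasets]"')
```

The check only looks the module up and does not import it, so it is cheap to run at startup.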
### Basic Usage

```python
from sage.data import DataManager

manager = DataManager.get_instance()

# Access datasets by logical usage profile
rag = manager.get_by_usage("rag")
qa_loader = rag.load("qa_base")  # already instantiated
queries = qa_loader.load_queries()

# Or fetch a specific data source directly
bbh_loader = manager.get_by_source("bbh")
tasks = bbh_loader.get_task_names()

# Access Orca DPO for alignment research
from sage.data.sources.orca_dpo import OrcaDPODataLoader

dpo_loader = OrcaDPODataLoader()
examples = dpo_loader.load_data(split="train")
```
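To make the two-layer idea more concrete, here is a self-contained toy model of a source layer with usage views on top of it. It is a sketch under stated assumptions, not the `sage.data` implementation: `ToyLoader`, `ToyUsage`, `ToyManager`, `register_source`, and `register_usage` are hypothetical names, and only the method names `get_by_source`, `get_by_usage`, and `load` mirror the calls shown in this README.

```python
from dataclasses import dataclass, field


@dataclass
class ToyLoader:
    """Hypothetical stand-in for a dataset loader (not a sage.data class)."""
    name: str
    records: list = field(default_factory=list)


class ToyUsage:
    """A logical view that exposes only the sources one experiment needs."""

    def __init__(self, sources: dict, allowed: list):
        self._sources = sources
        self._allowed = set(allowed)

    def load(self, name: str) -> ToyLoader:
        if name not in self._allowed:
            raise KeyError(f"{name!r} is not part of this usage profile")
        return self._sources[name]


class ToyManager:
    """Source layer (physical loaders) plus usage layer (named views)."""

    def __init__(self):
        self._sources = {}
        self._usages = {}

    def register_source(self, loader: ToyLoader) -> None:
        self._sources[loader.name] = loader

    def register_usage(self, usage_name: str, source_names: list) -> None:
        self._usages[usage_name] = list(source_names)

    def get_by_source(self, name: str) -> ToyLoader:
        return self._sources[name]

    def get_by_usage(self, usage_name: str) -> ToyUsage:
        return ToyUsage(self._sources, self._usages[usage_name])


# Made-up records, for illustration only.
manager = ToyManager()
manager.register_source(ToyLoader("qa_base", ["q1", "q2"]))
manager.register_source(ToyLoader("bbh", ["t1"]))
manager.register_usage("rag", ["qa_base"])

rag = manager.get_by_usage("rag")
print(rag.load("qa_base").records)
print(manager.get_by_source("bbh").name)
```

In this toy version a usage view shares loader objects with the source layer, so the same dataset is never instantiated twice.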
## Available Datasets

| Dataset | Description | Download Required | Storage | Location |
|---------|-------------|-------------------|---------|----------|
| qa_base | Question-Answering with knowledge base | ❌ No (included) | Local files | `sage/data/sources/qa_base/` |
| locomo | Long-context memory benchmark | ✅ Yes (`python -m locomo.download`) | Local (2.68MB) | `sage/data/sources/locomo/` |
| bbh | BIG-Bench Hard reasoning tasks | ❌ No (included) | Local JSON | `sage/data/sources/bbh/` |
| mmlu | Massive Multitask Language Understanding | 📥 Optional | On-demand/Local (~160MB) | `sage/data/sources/mmlu/` |
| gpqa | Graduate-Level Question Answering | ✅ Auto (HF) | On-demand (~5MB) | `sage/data/sources/gpqa/` |
| orca_dpo | Preference pairs for alignment/DPO | ✅ Auto (HF) | On-demand (varies) | `sage/data/sources/orca_dpo/` |
| agent_benchmark | Agent evaluation tasks | ❌ No (included) | Local files | `sage/data/sources/agent_benchmark/` |
| agent_tools | Tool catalog for agents | ❌ No (included) | Local files | `sage/data/sources/agent_tools/` |
| agent_sft | SFT conversation data | ❌ No (included) | Local files | `sage/data/sources/agent_sft/` |
## Examples

See `examples/` for detailed usage examples:

```bash
python examples/qa_examples.py           # QA dataset usage
python examples/locomo_examples.py       # LoCoMo dataset usage
python examples/bbh_examples.py          # BBH dataset usage
python examples/mmlu_examples.py         # MMLU dataset usage
python examples/gpqa_examples.py         # GPQA dataset usage
python examples/orca_dpo_examples.py     # Orca DPO dataset usage
python examples/integration_example.py   # Cross-dataset integration
```
## Project Structure

```
sageData/
├── quickstart.sh               # Quick setup script
├── .pre-commit-config.yaml     # Code quality hooks
├── pyproject.toml              # Package configuration
├── README.md                   # This file
├── LICENSE                     # MIT license
├── docs/                       # Documentation
│   ├── ARCHITECTURE.md         # System architecture
│   └── usages/                 # Usage profiles (rag, agent_eval, etc.)
├── sage/                       # Main package
│   └── data/
│       ├── manager.py          # DataManager singleton
│       └── sources/            # All dataset sources
│           ├── orca_dpo/       # DPO preference data
│           ├── agent_benchmark/
│           ├── agent_tools/
│           └── ...
├── examples/                   # Usage examples
└── tests/                      # Test suite
```
## Development

### Setup Development Environment

```bash
# Run quickstart with development dependencies
./quickstart.sh

# Or manually install dev dependencies
pip install pytest pytest-cov black flake8 isort mypy pre-commit
pre-commit install
```

### Run Tests

```bash
pytest tests/
pytest tests/ -v --cov=sage
```

### Code Quality

Pre-commit hooks run automatically on `git commit`:

- `ruff check`: Code linting (replaces flake8, isort, pyupgrade)
- `ruff format`: Code formatting (replaces black)
- `mypy`: Type checking

Run manually:

```bash
pre-commit run --all-files
```
## Documentation

- **Architecture**: See `docs/ARCHITECTURE.md` for system design
- **Usage Profiles**: See `docs/usages/` for experiment configurations
- **API Reference**: Use `help(DataManager)` in Python

## License

MIT License - see LICENSE file.

## Links

- Repository: https://github.com/intellistream/sageData
- Issues: https://github.com/intellistream/sageData/issues
## Common Issues

**Q: Where's the LoCoMo data?**
A: Run `python -m locomo.download` to download it (2.68MB from Hugging Face).

**Q: How do I download MMLU for offline use?**
A: Run `python -m mmlu.download --all-subjects` to download all subjects (~160MB).

**Q: Why do I get a GPQA access error?**
A: You need to accept the dataset terms on Hugging Face: https://huggingface.co/datasets/Idavidrein/gpqa

**Q: How do I use Orca DPO for alignment research?**
A: Import the loader from `sage.data.sources.orca_dpo` and use `format_for_dpo()` to prepare training data.

**Q: Where did the root-level docs go?**
A: All documentation is now in the `docs/` directory for better organization.
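For the Orca DPO question above, it may help to see the shape of data that DPO training commonly expects: a prompt plus a preferred and a rejected response. The sketch below is illustrative only and does not describe what `format_for_dpo()` actually returns. The function `to_dpo_record`, the input keys `system`, `question`, `chosen`, and `rejected`, and the sample rows are all assumptions.

```python
def to_dpo_record(example: dict) -> dict:
    """Map one raw preference example to prompt/chosen/rejected strings.

    Assumes hypothetical input keys: 'system' (optional), 'question',
    'chosen', and 'rejected'.
    """
    system = example.get("system", "").strip()
    question = example["question"].strip()
    # Prepend the system message only when one is present.
    prompt = f"{system}\n\n{question}" if system else question
    return {
        "prompt": prompt,
        "chosen": example["chosen"].strip(),
        "rejected": example["rejected"].strip(),
    }


# Made-up rows, for illustration only.
raw = [
    {
        "system": "You are a helpful assistant.",
        "question": "What is 2 + 2?",
        "chosen": "4",
        "rejected": "5",
    },
    {"question": "Name a prime number.", "chosen": "7", "rejected": "8"},
]

records = [to_dpo_record(ex) for ex in raw]
print(records[1])
```

Check the loader's own output with `help()` before training, since its field names may differ from the ones assumed here.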
## Recent Changes (v0.2.0)

- ✅ Added `quickstart.sh` for easy setup
- ✅ Added `.pre-commit-config.yaml` for code quality
- ✅ Moved `orca_dpo` to `sage/data/sources/`
- ✅ Moved documentation to `docs/` directory
- ✅ Moved usage profiles to `docs/usages/`
- ✅ Improved project structure and organization

---

**Version**: 0.2.0 | **Last Updated**: January 2026