SAGE Data - Unified data loaders for memory benchmark datasets (LongMemEval, LoCoMo, MemAgentBench, etc.)
SAGE Data
Dataset management module for SAGE benchmark suite
Provides unified access to multiple datasets through a two-layer architecture:
- Sources: Physical datasets (qa_base, bbh, mmlu, gpqa, locomo, orca_dpo)
- Usages: Logical views for experiments (rag, libamm, neuromem, agent_eval)
Quick Start
from sage.data import DataManager
manager = DataManager.get_instance()
# Access datasets by logical usage profile
rag = manager.get_by_usage("rag")
qa_loader = rag.load("qa_base")  # returns an already-instantiated loader
queries = qa_loader.load_queries()
# Or fetch a specific data source directly
bbh_loader = manager.get_by_source("bbh")
tasks = bbh_loader.get_task_names()
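The two-layer split described above can be pictured as a registry of sources plus named usage views that select from it. The sketch below is not SAGE Data's implementation: `SourceRegistry`, `UsageView`, and the toy loader factories are hypothetical names made up for illustration, and which sources belong to which usage is assumed here. Only the names `qa_base`, `bbh`, and `rag` come from the lists above.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SourceRegistry:
    """Layer 1 (hypothetical): maps a source name to a loader factory."""
    factories: Dict[str, Callable[[], object]] = field(default_factory=dict)

    def register(self, name: str, factory: Callable[[], object]) -> None:
        self.factories[name] = factory

    def get(self, name: str) -> object:
        if name not in self.factories:
            raise KeyError(f"unknown source: {name!r}")
        # Instantiate the loader only when it is requested.
        return self.factories[name]()


@dataclass
class UsageView:
    """Layer 2 (hypothetical): a named subset of sources for one experiment."""
    name: str
    sources: List[str]
    registry: SourceRegistry

    def load(self, source: str) -> object:
        if source not in self.sources:
            raise KeyError(f"{source!r} is not part of usage {self.name!r}")
        return self.registry.get(source)


# Toy loaders: plain dicts stand in for real loader objects.
registry = SourceRegistry()
registry.register("qa_base", lambda: {"source": "qa_base"})
registry.register("bbh", lambda: {"source": "bbh"})

# Assumed membership: this toy "rag" view exposes only qa_base.
rag = UsageView(name="rag", sources=["qa_base"], registry=registry)
loader = rag.load("qa_base")
```

Keeping the physical sources in one registry and letting each usage hold only a list of names means a dataset can appear in several experiment views without being duplicated.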
Available Datasets
| Dataset | Description | Download Required | Storage |
|---|---|---|---|
| qa_base | Question-Answering with knowledge base | ❌ No (included) | Local files |
| locomo | Long-context memory benchmark | ✅ Yes (python -m locomo.download) | Local files (2.68MB) |
| bbh | BIG-Bench Hard reasoning tasks | ❌ No (included) | Local JSON files |
| mmlu | Massive Multitask Language Understanding | 📥 Optional (python -m mmlu.download --all-subjects) | On-demand or Local (~160MB) |
| gpqa | Graduate-Level Question Answering | ✅ Auto (Hugging Face) | On-demand (~5MB cached) |
| orca_dpo | Preference pairs for alignment/DPO | ✅ Auto (Hugging Face) | On-demand (varies) |
See examples/ for detailed usage examples.
📖 Examples
python examples/qa_examples.py # QA dataset usage
python examples/locomo_examples.py # LoCoMo dataset usage
python examples/bbh_examples.py # BBH dataset usage
python examples/mmlu_examples.py # MMLU dataset usage
python examples/gpqa_examples.py # GPQA dataset usage
python examples/orca_dpo_examples.py # Orca DPO dataset usage
python examples/integration_example.py # Cross-dataset integration
License
MIT License - see LICENSE file.
🔗 Links
- Repository: https://github.com/intellistream/sageData
- Issues: https://github.com/intellistream/sageData/issues
❓ Common Issues
Q: Where's the LoCoMo data?
A: Run python -m locomo.download to download it (2.68MB from Hugging Face).
Q: How do I download MMLU for offline use?
A: Run python -m mmlu.download --all-subjects to download all subjects (~160MB).
Q: Why do I get a GPQA access error?
A: You need to accept the dataset terms on Hugging Face: https://huggingface.co/datasets/Idavidrein/gpqa
Q: How do I use Orca DPO for alignment research?
A: Get the loader with DataManager.get_instance().get_by_source("orca_dpo"), then call format_for_dpo() to prepare data for training.
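DPO training data is commonly arranged as prompt/chosen/rejected triples. The sketch below shows that shape on made-up records. It does not call SAGE Data, and the helper `to_dpo_triples` and the input field names (`question`, `chosen`, `rejected`) are assumptions for illustration, not the documented output of format_for_dpo().

```python
from typing import Dict, Iterable, List


def to_dpo_triples(records: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Hypothetical helper: map raw preference records to DPO-style triples."""
    triples = []
    for rec in records:
        # A pair with identical responses carries no preference signal.
        if rec["chosen"] == rec["rejected"]:
            continue
        triples.append(
            {
                "prompt": rec["question"],
                "chosen": rec["chosen"],
                "rejected": rec["rejected"],
            }
        )
    return triples


# Made-up sample records, not taken from the real dataset.
sample = [
    {"question": "What is 2 + 2?", "chosen": "4", "rejected": "5"},
    {"question": "Name a prime number.", "chosen": "7", "rejected": "7"},
]
triples = to_dpo_triples(sample)
```

Check the real loader's output before training; its field names and filtering may differ from this sketch.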
Version: 0.1.0 | Last Updated: December 2025