# DataDecide

A Python library for downloading, processing, and analyzing DataDecide datasets from Hugging Face, focusing on language model evaluation results.
## Features
- Data Pipeline: A multi-stage pipeline that downloads raw data from Hugging Face, processes it, and enriches it with additional details.
- Easy Data Access: A simple interface to load and access various dataframes, including raw data, parsed data, and aggregated results.
- Advanced Filtering: Multiple filter types, including perplexity (`ppl`), OLMES metrics (`olmes`), and training steps (`max_steps`), with composable combinations.
- Scripting Utilities: Powerful parameter and data selection with an `"all"` keyword, exclusion lists, and intelligent validation for reproducible analysis scripts.
- Native Plotting: Production-ready scaling analysis plots using the dr_plotter integration.
## Getting Started

### Installation

To install the necessary dependencies, run:
```bash
uv sync
source .venv/bin/activate
```

To install the optional dr_plotter extras:

```bash
uv sync --all-extras

# To update to the GitHub version of dr_plotter
uv lock --upgrade-package dr_plotter
```
## Usage

The main entry point to the library is the `DataDecide` class.

### Basic Usage
```python
from datadec import DataDecide

# Initialize the DataDecide class; this runs the data processing pipeline
dd = DataDecide(data_dir="./data")

# Access the full evaluation dataframe
full_eval_df = dd.full_eval

# Example of easy indexing
indexed_df = dd.easy_index_df(
    df_name="full_eval",
    data="C4",
    params="10M",
    seed=0,
)
print(indexed_df.head())
```
### Advanced Filtering

```python
# Filter data with multiple criteria
filtered_df = dd.get_filtered_df(
    filter_types=["ppl", "max_steps"],  # Remove NaN perplexity + apply step limits
    min_params="150M",                  # Only models 150M and larger
    verbose=True,                       # Show filtering progress
)

# Filter by specific combinations only
olmes_only_df = dd.get_filtered_df(
    filter_types=["olmes"],  # Keep only rows with OLMES metrics
    return_means=False,      # Get individual seed results
)
```
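Conceptually, `filter_types` composes independent row filters that are ANDed together. Here is a minimal sketch of that pattern on a pandas DataFrame; the column names and filter logic below are illustrative assumptions, not DataDecide's actual internals:

```python
import numpy as np
import pandas as pd

# Illustrative data; these column names are hypothetical
df = pd.DataFrame({
    "params": ["10M", "150M", "1B"],
    "ppl": [12.3, np.nan, 8.1],
    "step": [500, 2000, 9000],
})

# Each named filter maps a DataFrame to a boolean mask
FILTERS = {
    "ppl": lambda d: d["ppl"].notna(),         # drop rows without perplexity
    "max_steps": lambda d: d["step"] <= 5000,  # drop rows past a step limit
}

def get_filtered(d: pd.DataFrame, filter_types: list[str]) -> pd.DataFrame:
    mask = pd.Series(True, index=d.index)
    for name in filter_types:
        mask &= FILTERS[name](d)  # AND-compose the selected filters
    return d[mask]

print(get_filtered(df, ["ppl", "max_steps"]))  # keeps only the "10M" row
```

Keeping each filter as a separate mask function makes new filter types cheap to add and lets callers mix and match them per analysis.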
### Scripting Utilities

```python
from datadec.script_utils import select_params, select_data

# Flexible parameter selection
params = select_params(["150M", "1B"])                     # Specific models
all_params = select_params("all")                          # All available (sorted)
large_models = select_params("all", exclude=["4M", "6M"])  # All except the smallest

# Data recipe selection
data_recipes = select_data(["C4", "Dolma1.7"])     # Specific datasets
limited_data = select_data("all", exclude=["C4"])  # All except C4

print(f"Selected {len(params)} models: {params}")
print(f"Selected {len(data_recipes)} datasets: {data_recipes}")
```
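The `"all"` keyword plus exclusion lists follow a common selection pattern: expand, validate, then subtract. A rough sketch of the idea, using a hypothetical catalog of model sizes (not DataDecide's actual implementation or parameter list):

```python
AVAILABLE_PARAMS = ["4M", "6M", "10M", "150M", "1B"]  # hypothetical catalog

def select(spec, available, exclude=None):
    # "all" expands to the full catalog; an explicit list is validated
    chosen = list(available) if spec == "all" else list(spec)
    unknown = [x for x in chosen if x not in available]
    if unknown:
        raise ValueError(f"Unknown selections: {unknown}")
    excluded = set(exclude or [])
    return [x for x in chosen if x not in excluded]

print(select("all", AVAILABLE_PARAMS, exclude=["4M", "6M"]))
# ['10M', '150M', '1B']
```

Validating explicit selections up front keeps analysis scripts reproducible: a typo fails loudly instead of silently selecting nothing.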
### Plotting

Generate scaling analysis plots using the native dr_plotter integration:

```bash
# Run the production plotting system
python scripts/plot_scaling_analysis.py

# Generates 7 different plot configurations in plots/test_plotting/
```
The `notebooks/explore_data.py` file provides a more detailed example of how to use the library.
## Data
This library uses the following Hugging Face datasets:

- `allenai/DataDecide-ppl-results`: Perplexity evaluation results.
- `allenai/DataDecide-eval-results`: Downstream task evaluation results.
The data processing pipeline downloads these datasets and stores them in the `data_dir` specified at `DataDecide` initialization. It then filters, parses, and merges them, and pulls in external information about hyperparameters and other training settings.
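The enrichment step amounts to joining evaluation rows against a per-model table of training settings. A minimal sketch of that kind of merge with synthetic data; the column names and values here are illustrative, not the pipeline's actual schema:

```python
import pandas as pd

# Synthetic evaluation results (illustrative columns)
evals = pd.DataFrame({
    "params": ["10M", "10M", "150M"],
    "step": [1000, 2000, 1000],
    "ppl": [14.2, 12.9, 9.8],
})

# Synthetic per-model hyperparameter table
hpms = pd.DataFrame({
    "params": ["10M", "150M"],
    "lr": [3e-3, 1e-3],
    "batch_size": [256, 512],
})

# Enrich each evaluation row with its model's training settings;
# validate= guards against accidental duplicate keys in the hpms table
full = evals.merge(hpms, on="params", how="left", validate="many_to_one")
print(full)
```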
## Project Structure

```
├── src/datadec/                  # Main library code
│   ├── data.py                   # Main DataDecide class
│   ├── df_utils.py               # DataFrame utilities and filtering
│   ├── script_utils.py           # Parameter/data selection utilities
│   └── ...                       # Pipeline, parsing, constants, etc.
├── scripts/                      # Utilities and analysis scripts
│   ├── plot_scaling_analysis.py  # Production plotting system
│   └── legacy_deprecated/        # Archived legacy code
├── docs/                         # Documentation and reports
│   ├── processes/                # Templates and guides
│   └── reports/                  # Project documentation
├── plots/                        # Generated visualizations
└── notebooks/                    # Analysis notebooks
```
## Development

See `docs/processes/reporting_guide.md` for project documentation standards and `CLAUDE.md` for development setup.