A library to download and process DataDecide datasets from Hugging Face.

DataDecide

DataDecide is a Python library for downloading, processing, and analyzing machine learning experiment data, with a focus on language model evaluation results.

Features

  • Data Pipeline: A multi-stage pipeline that downloads raw data from Hugging Face, processes it, and enriches it with additional details.
  • Easy Data Access: A simple interface to load and access various dataframes, including raw data, parsed data, and aggregated results.
  • Advanced Filtering: Multiple filter types including perplexity (ppl), OLMES metrics (olmes), and training steps (max_steps) with composable combinations.
  • Scripting Utilities: Flexible parameter and data selection with an "all" keyword, exclusion lists, and validation, for reproducible analysis scripts.
  • Native Plotting: Production-ready scaling analysis plots using dr_plotter integration.

Getting Started

Installation

To install the necessary dependencies, run:

uv sync
source .venv/bin/activate

To install the optional dr_plotter dependency:

uv sync --all-extras

# To update to the latest GitHub version
uv lock --upgrade-package dr_plotter

Usage

The main entry point to the library is the DataDecide class. Here's how to use it:

Basic Usage

from datadec import DataDecide

# Initialize the DataDecide class, which will run the data processing pipeline
dd = DataDecide(data_dir="./data")

# Access the full evaluation dataframe
full_eval_df = dd.full_eval

# Example of easy indexing
indexed_df = dd.easy_index_df(
    df_name="full_eval",
    data="C4",
    params="10M",
    seed=0,
)

print(indexed_df.head())
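Because full_eval is a plain pandas DataFrame, ordinary pandas indexing works alongside easy_index_df. A minimal sketch on a toy frame, assuming columns named params, data, seed, and ppl (the real schema may differ):

```python
import pandas as pd

# Toy stand-in for dd.full_eval; the real frame has many more columns.
full_eval_df = pd.DataFrame(
    {
        "params": ["10M", "10M", "150M", "150M"],
        "data": ["C4", "Dolma1.7", "C4", "C4"],
        "seed": [0, 1, 0, 1],
        "ppl": [32.1, 31.8, 18.4, 18.6],
    }
)

# Rough equivalent of easy_index_df(data="C4", params="10M", seed=0).
subset = full_eval_df[
    (full_eval_df["data"] == "C4")
    & (full_eval_df["params"] == "10M")
    & (full_eval_df["seed"] == 0)
]
print(subset)
```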

Advanced Filtering

# Filter data with multiple criteria
filtered_df = dd.get_filtered_df(
    filter_types=["ppl", "max_steps"],  # Remove NaN perplexity + apply step limits
    min_params="150M",                  # Only models 150M and larger
    verbose=True                        # Show filtering progress
)

# Filter by specific combinations only
olmes_only_df = dd.get_filtered_df(
    filter_types=["olmes"],            # Keep only rows with OLMES metrics
    return_means=False                 # Get individual seed results
)
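The ppl and max_steps filters compose: each pass narrows the frame further. The same composition can be sketched in plain pandas (the column names ppl and step, and the cutoff value, are assumptions, not the library's actual schema or defaults):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "params": ["150M", "150M", "1B", "1B"],
        "step": [1000, 5000, 1000, 9000],
        "ppl": [20.0, None, 15.0, 14.0],
    }
)

# "ppl" filter: drop rows with missing perplexity.
df = df.dropna(subset=["ppl"])

# "max_steps" filter: keep rows at or below a step limit.
MAX_STEPS = 8000  # illustrative cutoff
df = df[df["step"] <= MAX_STEPS]

print(df)
```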

Scripting Utilities

from datadec.script_utils import select_params, select_data

# Flexible parameter selection
params = select_params(["150M", "1B"])                    # Specific models
all_params = select_params("all")                          # All available (sorted)  
large_models = select_params("all", exclude=["4M", "6M"]) # All except smallest

# Data recipe selection  
data_recipes = select_data(["C4", "Dolma1.7"])           # Specific datasets
limited_data = select_data("all", exclude=["C4"])         # All except C4

print(f"Selected {len(params)} models: {params}")
print(f"Selected {len(data_recipes)} datasets: {data_recipes}")
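One way a selector with these semantics can behave is sketched below; this is an illustration of the "all"/exclude/validation behavior, not the library's actual implementation, and the available-values list is made up:

```python
AVAILABLE_PARAMS = ["4M", "6M", "10M", "150M", "1B"]  # illustrative, in size order

def select(spec, available, exclude=None):
    # "all" expands to every available value; otherwise validate each entry.
    if spec == "all":
        chosen = list(available)
    else:
        unknown = [s for s in spec if s not in available]
        if unknown:
            raise ValueError(f"Unknown values: {unknown}")
        chosen = list(spec)
    # Apply the exclusion list after expansion.
    excluded = set(exclude or [])
    return [c for c in chosen if c not in excluded]

print(select("all", AVAILABLE_PARAMS, exclude=["4M", "6M"]))  # → ['10M', '150M', '1B']
```

Validating explicit selections while expanding "all" keeps analysis scripts reproducible: a typo fails loudly instead of silently selecting nothing.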

Plotting

Generate scaling analysis plots using the native dr_plotter integration:

# Run the production plotting system
python scripts/plot_scaling_analysis.py

# Generates 7 different plot configurations in plots/test_plotting/

The notebooks/explore_data.py file provides a more detailed example of how to use the library.

Data

This library builds on the DataDecide datasets hosted on Hugging Face.

The data processing pipeline downloads these datasets and stores them in the data_dir specified during DataDecide initialization, then filters, parses, and merges them, pulling in external information about hyperparameters and other training settings.
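The download-then-cache pattern described above can be sketched as follows; the stage name, file layout, and toy data are assumptions for illustration, not the library's actual pipeline:

```python
from pathlib import Path

import pandas as pd

def run_pipeline(data_dir: str) -> pd.DataFrame:
    path = Path(data_dir)
    path.mkdir(parents=True, exist_ok=True)
    cache = path / "full_eval.csv"  # hypothetical cache file name
    # Reuse the processed frame if a previous run already built it.
    if cache.exists():
        return pd.read_csv(cache)
    # Stand-in for the Hugging Face download; the real pipeline pulls raw
    # evaluation tables and merges in hyperparameter metadata.
    raw = pd.DataFrame({"params": ["10M", "150M"], "ppl": [32.1, 18.4]})
    processed = raw.dropna()
    processed.to_csv(cache, index=False)
    return processed
```

On a second call with the same data_dir, the cached file is read back instead of being rebuilt, which is why DataDecide initialization is cheap after the first run.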

Project Structure

├── src/datadec/           # Main library code
│   ├── data.py           # Main DataDecide class
│   ├── df_utils.py       # DataFrame utilities and filtering
│   ├── script_utils.py   # Parameter/data selection utilities
│   └── ...              # Pipeline, parsing, constants, etc.
├── scripts/               # Utilities and analysis scripts
│   ├── plot_scaling_analysis.py  # Production plotting system
│   └── legacy_deprecated/ # Archived legacy code
├── docs/                  # Documentation and reports
│   ├── processes/         # Templates and guides
│   └── reports/          # Project documentation
├── plots/                 # Generated visualizations
└── notebooks/            # Analysis notebooks

Development

See docs/processes/reporting_guide.md for project documentation standards and CLAUDE.md for development setup.
