A library to download and process DataDecide datasets from Hugging Face.

DataDecide

DataDecide is a Python library for downloading, processing, and analyzing machine learning experiment data, with a focus on language model evaluation results.

Features

  • Data Pipeline: A multi-stage pipeline that downloads raw data from Hugging Face, processes it, and enriches it with additional details.
  • Easy Data Access: A simple interface to load and access various dataframes, including raw data, parsed data, and aggregated results.
  • Advanced Filtering: Multiple filter types including perplexity (ppl), OLMES metrics (olmes), and training steps (max_steps) with composable combinations.
  • Scripting Utilities: Flexible parameter and data selection with an "all" keyword, exclusion lists, and validation, for reproducible analysis scripts.
  • Native Plotting: Production-ready scaling analysis plots using dr_plotter integration.

Getting Started

Installation

To install the necessary dependencies, run:

uv sync
source .venv/bin/activate

To install the optional dr_plotter dependency:

uv sync --all-extras

# To update to the latest GitHub version
uv lock --upgrade-package dr_plotter

Usage

The main entry point to the library is the DataDecide class. Here's how to use it:

Basic Usage

from datadec import DataDecide

# Initialize the DataDecide class, which will run the data processing pipeline
dd = DataDecide(data_dir="./data")

# Access the full evaluation dataframe
full_eval_df = dd.full_eval

# Example of easy indexing
indexed_df = dd.easy_index_df(
    df_name="full_eval",
    data="C4",
    params="10M",
    seed=0,
)

print(indexed_df.head())
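Because full_eval is a plain pandas DataFrame, ordinary pandas indexing works alongside easy_index_df. A minimal sketch on a toy frame, assuming columns named params, data, seed, and ppl (the real schema may differ):

```python
import pandas as pd

# Toy stand-in for dd.full_eval; the real frame has many more columns.
full_eval_df = pd.DataFrame(
    {
        "params": ["10M", "10M", "150M", "150M"],
        "data": ["C4", "Dolma1.7", "C4", "C4"],
        "seed": [0, 1, 0, 1],
        "ppl": [32.1, 31.8, 18.4, 18.6],
    }
)

# Rough equivalent of easy_index_df(data="C4", params="10M", seed=0).
subset = full_eval_df[
    (full_eval_df["data"] == "C4")
    & (full_eval_df["params"] == "10M")
    & (full_eval_df["seed"] == 0)
]
print(subset)
```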

Advanced Filtering

# Filter data with multiple criteria
filtered_df = dd.get_filtered_df(
    filter_types=["ppl", "max_steps"],  # Remove NaN perplexity + apply step limits
    min_params="150M",                  # Only models 150M and larger
    verbose=True                        # Show filtering progress
)

# Filter by specific combinations only
olmes_only_df = dd.get_filtered_df(
    filter_types=["olmes"],            # Keep only rows with OLMES metrics
    return_means=False                 # Get individual seed results
)
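The ppl and max_steps filters compose: each pass narrows the frame further. The same composition can be sketched in plain pandas (the column names ppl and step, and the cutoff value, are assumptions, not the library's actual schema or defaults):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "params": ["150M", "150M", "1B", "1B"],
        "step": [1000, 5000, 1000, 9000],
        "ppl": [20.0, None, 15.0, 14.0],
    }
)

# "ppl" filter: drop rows with missing perplexity.
df = df.dropna(subset=["ppl"])

# "max_steps" filter: keep rows at or below a step limit.
MAX_STEPS = 8000  # illustrative cutoff
df = df[df["step"] <= MAX_STEPS]

print(df)
```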

Scripting Utilities

from datadec.script_utils import select_params, select_data

# Flexible parameter selection
params = select_params(["150M", "1B"])                    # Specific models
all_params = select_params("all")                          # All available (sorted)  
large_models = select_params("all", exclude=["4M", "6M"]) # All except smallest

# Data recipe selection  
data_recipes = select_data(["C4", "Dolma1.7"])           # Specific datasets
limited_data = select_data("all", exclude=["C4"])         # All except C4

print(f"Selected {len(params)} models: {params}")
print(f"Selected {len(data_recipes)} datasets: {data_recipes}")
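One way a selector with these semantics can behave is sketched below; this is an illustration of the "all"/exclude/validation behavior, not the library's actual implementation, and the available-values list is made up:

```python
AVAILABLE_PARAMS = ["4M", "6M", "10M", "150M", "1B"]  # illustrative, in size order

def select(spec, available, exclude=None):
    # "all" expands to every available value; otherwise validate each entry.
    if spec == "all":
        chosen = list(available)
    else:
        unknown = [s for s in spec if s not in available]
        if unknown:
            raise ValueError(f"Unknown values: {unknown}")
        chosen = list(spec)
    # Apply the exclusion list after expansion.
    excluded = set(exclude or [])
    return [c for c in chosen if c not in excluded]

print(select("all", AVAILABLE_PARAMS, exclude=["4M", "6M"]))  # → ['10M', '150M', '1B']
```

Validating explicit selections while expanding "all" keeps analysis scripts reproducible: a typo fails loudly instead of silently selecting nothing.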

Plotting

Generate scaling analysis plots using the native dr_plotter integration:

# Run the production plotting system
python scripts/plot_scaling_analysis.py

# Generates 7 different plot configurations in plots/test_plotting/

The notebooks/explore_data.py file provides a more detailed example of how to use the library.

Data

This library builds on the DataDecide datasets hosted on Hugging Face.

The data processing pipeline downloads these datasets and stores them in the data_dir specified during DataDecide initialization, then filters, parses, and merges them, pulling in external information about hyperparameters and other training settings.
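The download-then-cache pattern described above can be sketched as follows; the stage name, file layout, and toy data are assumptions for illustration, not the library's actual pipeline:

```python
from pathlib import Path

import pandas as pd

def run_pipeline(data_dir: str) -> pd.DataFrame:
    path = Path(data_dir)
    path.mkdir(parents=True, exist_ok=True)
    cache = path / "full_eval.csv"  # hypothetical cache file name
    # Reuse the processed frame if a previous run already built it.
    if cache.exists():
        return pd.read_csv(cache)
    # Stand-in for the Hugging Face download; the real pipeline pulls raw
    # evaluation tables and merges in hyperparameter metadata.
    raw = pd.DataFrame({"params": ["10M", "150M"], "ppl": [32.1, 18.4]})
    processed = raw.dropna()
    processed.to_csv(cache, index=False)
    return processed
```

On a second call with the same data_dir, the cached file is read back instead of being rebuilt, which is why DataDecide initialization is cheap after the first run.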

Project Structure

├── src/datadec/           # Main library code
│   ├── data.py           # Main DataDecide class
│   ├── df_utils.py       # DataFrame utilities and filtering
│   ├── script_utils.py   # Parameter/data selection utilities
│   └── ...              # Pipeline, parsing, constants, etc.
├── scripts/               # Utilities and analysis scripts
│   ├── plot_scaling_analysis.py  # Production plotting system
│   └── legacy_deprecated/ # Archived legacy code
├── docs/                  # Documentation and reports
│   ├── processes/         # Templates and guides
│   └── reports/          # Project documentation
├── plots/                 # Generated visualizations
└── notebooks/            # Analysis notebooks

Development

See docs/processes/reporting_guide.md for project documentation standards and CLAUDE.md for development setup.
