Build, refine and validate AI safety datasets using LLM pipelines

Dataset Foundry

A toolkit for building validated datasets.

Dataset Foundry uses the concept of data pipelines to load, generate, or validate datasets. A pipeline is a sequence of actions executed either against the dataset as a whole or against the individual items in the dataset.

For details on which actions are supported, see the actions documentation.
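Conceptually, the pipeline model can be sketched in a few lines of plain Python. Note that `run_pipeline` and `for_each` below are hypothetical names used only to illustrate the dataset-level vs. item-level distinction; they are not the library's API:

```python
from typing import Callable

# A dataset-level action transforms the whole dataset; an item-level
# action transforms one item. A pipeline is an ordered list of actions.
DatasetAction = Callable[[list[dict]], list[dict]]

def run_pipeline(dataset: list[dict], actions: list[DatasetAction]) -> list[dict]:
    for action in actions:
        dataset = action(dataset)
    return dataset

def for_each(item_action: Callable[[dict], dict]) -> DatasetAction:
    # Lift an item-level action into a dataset-level one.
    return lambda dataset: [item_action(item) for item in dataset]

# Example: flag every item, then keep only the valid ones.
pipeline = [
    for_each(lambda item: {**item, "valid": bool(item.get("code"))}),
    lambda dataset: [item for item in dataset if item["valid"]],
]

result = run_pipeline([{"code": "def f(): pass"}, {"code": ""}], pipeline)
```

Here the first action runs once per item while the second sees the whole dataset, mirroring the two kinds of actions described above.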

Installation

  1. Install the package:

    pip install dataset-foundry
    
  2. Create a .env file in the project root:

    # Provider API keys
    OPENAI_API_KEY=
    ANTHROPIC_API_KEY=
    
    # Command-line defaults
    DF_MODEL=anthropic/claude-sonnet-4-20250514
    
    # Keep full display open after finishing so you can browse the results
    # DF_NO_EXIT=true
    

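The .env format is plain KEY=VALUE lines. Most Python projects load it with python-dotenv; whether Dataset Foundry does is not stated here, so the stdlib sketch below only illustrates how such a file maps onto environment variables:

```python
import os

def load_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values

env = load_env("""
# Provider API keys
OPENAI_API_KEY=sk-example
DF_MODEL=anthropic/claude-sonnet-4-20250514
""")
os.environ.update(env)
```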
Default Settings

  • Default Model: gpt-4o-mini
  • Default Temperature: 0.7
  • Default Number of Samples: 10
  • Dataset Directory: ./datasets
  • Logs Directory: ./logs
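A common pattern for such defaults (assumed here for illustration, not verified against the source) is for the CLI to prefer a DF_-prefixed environment variable and fall back to the built-in default:

```python
import os

# Built-in defaults from the list above.
DEFAULTS = {"model": "gpt-4o-mini", "temperature": "0.7", "num_samples": "10"}

def resolve_setting(name: str) -> str:
    # DF_MODEL, DF_TEMPERATURE, etc. override the built-in defaults.
    return os.environ.get(f"DF_{name.upper()}", DEFAULTS[name])

os.environ["DF_MODEL"] = "anthropic/claude-sonnet-4-20250514"
```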

Project Structure

dataset-foundry/
├── src/
│   └── dataset_foundry/
│       ├── actions/       # Actions for processing datasets and items within those datasets
│       ├── cli/           # Command-line interface tools
│       ├── core/          # Core functionality
│       └── utils/         # Utility functions
├── datasets/              # Generated datasets
├── examples/              # Example pipelines
│   └── refactorable_code/ # Example pipelines to build a dataset of code requiring refactoring
└── logs/                  # Operation logs

Running Pipelines

Pipelines can be run from the command line using the dataset-foundry command.

dataset-foundry <pipeline_module> <dataset_name>

For example, to run the generate_spec pipeline to create specs for a dataset saved to datasets/dataset1, you would use:

dataset-foundry examples/refactorable_code/generate_spec/pipeline.py dataset1

Use dataset-foundry --help to see available arguments.

Running Examples

To generate two sample specs for a dataset named samples, you would use:

dataset-foundry examples/refactorable_code/generate_spec/pipeline.py samples --num-samples=2

To generate a set of functions and unit tests from the specs, you would use:

dataset-foundry examples/refactorable_code/generate_all_from_spec/pipeline.py samples

If some of the generated unit tests fail, you can regenerate them by running:

dataset-foundry examples/refactorable_code/regenerate_unit_tests/pipeline.py samples

Variable Substitutions

Variable substitution allows you to use variables in your prompts and in certain parameters passed to pipeline actions.

Prompt templates and certain parameters are parsed as f-strings, with the following enhancements:

  • Dotted references are supported and resolve to either dictionary keys or object attributes. For instance, {spec.name} returns the value of spec['name'] if spec is a dictionary, or the value of spec.name if spec is an object.
  • Formatters can be specified after a colon. For example, {spec:yaml} will return the spec object formatted as a YAML string. Supported formatters include: yaml, json, upper, lower.

For instance, if an item is being processed with an id of 123 and a spec dictionary with a name key of my_function, the following will save the code property of the item as a file named item_123_my_function.py:

   ...
   save_item(contents=Key("code"), filename="item_{id}_{spec.name}.py"),
   ...
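The substitution rules above can be sketched with a string.Formatter subclass. This is an illustrative re-implementation, not the library's internals (the yaml formatter is omitted to keep the sketch stdlib-only):

```python
import json
from string import Formatter

class TemplateFormatter(Formatter):
    """Sketch of dotted references plus named formatters."""

    def get_field(self, field_name, args, kwargs):
        # Resolve dotted references against dict keys or object attributes.
        parts = field_name.split(".")
        obj = kwargs[parts[0]]
        for part in parts[1:]:
            obj = obj[part] if isinstance(obj, dict) else getattr(obj, part)
        return obj, parts[0]

    def format_field(self, value, format_spec):
        # Named formatters after the colon; fall back to normal specs.
        if format_spec == "json":
            return json.dumps(value)
        if format_spec == "upper":
            return str(value).upper()
        if format_spec == "lower":
            return str(value).lower()
        return super().format_field(value, format_spec)

fmt = TemplateFormatter()
filename = fmt.format("item_{id}_{spec.name}.py", id=123, spec={"name": "my_function"})
```

With the sample item above, `filename` comes out as `item_123_my_function.py`, matching the save_item example.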

License

This project is licensed under the MIT License - see the LICENSE file for details.
