
Build, refine and validate AI safety datasets using LLM pipelines


Dataset Foundry

A toolkit for building validated datasets.

Dataset Foundry uses the concept of data pipelines to load, generate or validate datasets. A pipeline is a sequence of actions executed either against the dataset itself or the individual items in the dataset.

For details on which actions are supported, see the actions documentation.
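To illustrate the concept, a pipeline that runs a sequence of actions against a dataset, or against each item in it, can be sketched in plain Python. This is a conceptual model only, not Dataset Foundry's actual API; the class and helper names below are invented for the example:

```python
class Pipeline:
    """A pipeline is just an ordered list of actions applied to a dataset."""

    def __init__(self, actions):
        self.actions = actions

    def run(self, dataset):
        # Each action receives the dataset and returns the (possibly
        # transformed) dataset for the next action.
        for action in self.actions:
            dataset = action(dataset)
        return dataset


def for_each(item_action):
    """Lift a per-item action into a dataset-level action."""
    return lambda dataset: [item_action(item) for item in dataset]


# Example: normalize each item, then filter the dataset as a whole.
pipeline = Pipeline([
    for_each(lambda item: {**item, "text": item["text"].strip()}),
    lambda dataset: [item for item in dataset if item["text"]],
])

result = pipeline.run([{"text": " hello "}, {"text": "  "}])
```

The key design point is that dataset-level and item-level actions share one interface, so they can be freely mixed in a single sequence.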

Setup

  1. Clone the repository
  2. Create a virtual environment:
    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
    
  3. Install the package:
    pip install -e .
    
  4. Create a .env file in the project root with your OpenAI API key:
    OPENAI_API_KEY=your_api_key_here
    

Default Settings

  • Default Model: gpt-4o-mini
  • Default Temperature: 0.7
  • Default Number of Samples: 10
  • Dataset Directory: ./datasets
  • Logs Directory: ./logs

Project Structure

dataset-foundry/
├── src/
│   └── dataset_foundry/
│       ├── actions/       # Actions for processing datasets and items within those datasets
│       ├── cli/           # Command-line interface tools
│       ├── core/          # Core functionality
│       └── utils/         # Utility functions
├── datasets/              # Generated datasets
├── examples/              # Example pipelines
│   └── refactorable_code/ # Example pipelines to build a dataset of code requiring refactoring
└── logs/                  # Operation logs

Running Pipelines

Pipelines can be run from the command line using the dataset-foundry command.

dataset-foundry <pipeline_module> <dataset_name>

For example, to run the generate_spec pipeline to create specs for a dataset saved to datasets/dataset1, you would use:

dataset-foundry examples/refactorable_code/generate_spec/pipeline.py dataset1

Use dataset-foundry --help to see available arguments.

Running Examples

To generate a set of 2 specs for a dataset named samples, you would use:

dataset-foundry examples/refactorable_code/generate_spec/pipeline.py samples --num-samples=2

To generate a set of functions and unit tests from the specs, you would use:

dataset-foundry examples/refactorable_code/generate_all_from_spec/pipeline.py samples

To run the unit tests for the generated functions, you would use:

dataset-foundry examples/refactorable_code/regenerate_unit_tests/pipeline.py samples

If some of the unit tests fail, you can regenerate them by running:

dataset-foundry examples/refactorable_code/regenerate_unit_tests/pipeline.py samples

Variable Substitutions

Variable substitution allows you to use variables in your prompts and in certain parameters passed to pipeline actions.

Prompt templates and certain parameters are parsed as f-strings, with the following enhancements:

  • Dotted references are supported and resolve both dictionary keys and object attributes. For instance, {spec.name} will return the value of spec['name'] if spec is a dictionary, or the value of spec.name if spec is an object.
  • Formatters can be specified after a colon. For example, {spec:yaml} will return the spec object formatted as a YAML string. Supported formatters include: yaml, json, upper, lower.
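The behavior described above can be sketched with a small resolver. This is a conceptual illustration of how such substitutions might work, not Dataset Foundry's implementation; the yaml formatter is omitted here to keep the sketch free of a PyYAML dependency:

```python
import json
import re


def resolve(template, context):
    """Resolve {dotted.path:formatter} placeholders against a context."""
    formatters = {
        "json": lambda v: json.dumps(v),
        "upper": lambda v: str(v).upper(),
        "lower": lambda v: str(v).lower(),
    }

    def lookup(path):
        # Walk each dotted segment, trying dict keys first, then attributes.
        value = context
        for part in path.split("."):
            if isinstance(value, dict):
                value = value[part]
            else:
                value = getattr(value, part)
        return value

    def replace(match):
        path, _, fmt = match.group(1).partition(":")
        value = lookup(path)
        return formatters[fmt](value) if fmt else str(value)

    return re.sub(r"\{([^{}]+)\}", replace, template)


filename = resolve(
    "item_{id}_{spec.name}.py",
    {"id": 123, "spec": {"name": "my_function"}},
)
```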

For instance, if an item is being processed with an id of 123 and a spec dictionary with a name key of my_function, the following will save the code property of the item as a file named item_123_my_function.py:

   ...
   save_item(contents=Key("code"), filename="item_{id}_{spec.name}.py"),
   ...

License

This project is licensed under the MIT License - see the LICENSE file for details.


