Build, refine and validate AI safety datasets using LLM pipelines

Project description

Dataset Foundry

A toolkit for building validated datasets.

Dataset Foundry uses the concept of data pipelines to load, generate or validate datasets. A pipeline is a sequence of actions executed either against the dataset itself or the individual items in the dataset.
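Conceptually, a pipeline can be sketched as a list of actions applied in order. This is an illustrative sketch only; the function and action names below are hypothetical and not the actual dataset_foundry API:

```python
# Illustrative sketch of the pipeline concept (hypothetical names, not the
# dataset_foundry API): a pipeline is a sequence of actions executed against
# the dataset as a whole or against each item in it.
def uppercase_names(dataset):
    # dataset-level action: transforms every item's "name" field
    return [{**item, "name": item["name"].upper()} for item in dataset]

def add_id(dataset):
    # item-level action: assigns each item a sequential id
    return [{**item, "id": i} for i, item in enumerate(dataset)]

def run_pipeline(dataset, actions):
    # run each action in sequence, feeding its output to the next
    for action in actions:
        dataset = action(dataset)
    return dataset

items = [{"name": "alpha"}, {"name": "beta"}]
result = run_pipeline(items, [uppercase_names, add_id])
print(result)  # [{'name': 'ALPHA', 'id': 0}, {'name': 'BETA', 'id': 1}]
```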

For details on which actions are supported, see the actions documentation.

Setup

  1. Clone the repository
  2. Create a virtual environment:
    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
    
  3. Install the package:
    pip install -e .
    
  4. Create a .env file in the project root with your OpenAI API key:
    OPENAI_API_KEY=your_api_key_here
    
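The .env file uses simple KEY=value lines. Dataset Foundry loads this file itself; the parser below is only a stdlib-only illustration of the format, useful for sanity-checking a .env file by hand:

```python
# Minimal, illustrative .env parser (Dataset Foundry loads the file itself;
# this only demonstrates the KEY=value format it expects).
def parse_env(text):
    env = {}
    for line in text.splitlines():
        line = line.strip()
        # skip blank lines and comments; require a KEY=value pair
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip()
    return env

print(parse_env("OPENAI_API_KEY=your_api_key_here"))
# {'OPENAI_API_KEY': 'your_api_key_here'}
```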

Default Settings

  • Default Model: gpt-4o-mini
  • Default Temperature: 0.7
  • Default Number of Samples: 10
  • Dataset Directory: ./datasets
  • Logs Directory: ./logs

Project Structure

dataset-foundry/
├── src/
│   └── dataset_foundry/
│       ├── actions/       # Actions for processing datasets and items within those datasets
│       ├── cli/           # Command-line interface tools
│       ├── core/          # Core functionality
│       └── utils/         # Utility functions
├── datasets/              # Generated datasets
├── examples/              # Example pipelines
│   └── refactorable_code/ # Example pipelines to build a dataset of code requiring refactoring
└── logs/                  # Operation logs

Running Pipelines

Pipelines can be run from the command line using the dataset-foundry command.

dataset-foundry <pipeline_module> <dataset_name>

For example, to run the generate_spec pipeline to create specs for a dataset saved to datasets/dataset1, you would use:

dataset-foundry examples/refactorable_code/generate_spec/pipeline.py dataset1

Use dataset-foundry --help to see available arguments.

Running Examples

To generate two specs for a dataset named samples, you would use:

dataset-foundry examples/refactorable_code/generate_spec/pipeline.py samples --num-samples=2

To generate a set of functions and unit tests from the specs, you would use:

dataset-foundry examples/refactorable_code/generate_all_from_spec/pipeline.py samples

To run the unit tests for the generated functions, you would use:

dataset-foundry examples/refactorable_code/regenerate_unit_tests/pipeline.py samples

If some of the unit tests fail, you can regenerate them by running:

dataset-foundry examples/refactorable_code/regenerate_unit_tests/pipeline.py samples

Variable Substitutions

Variable substitution allows you to use variables in your prompts and in certain parameters passed into pipeline actions.

Prompt templates and certain parameters are parsed as f-strings, with the following enhancements:

  • Dotted references are supported and resolve either dictionary keys or object attributes. For instance, {spec.name} returns the value of spec['name'] if spec is a dictionary, or the value of spec.name if spec is an object.
  • Formatters can be specified after a colon. For example, {spec:yaml} will return the spec object formatted as a YAML string. Supported formatters include: yaml, json, upper, lower.
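The behavior described above can be sketched with a small resolver. This is a minimal reimplementation for illustration, not the library's actual code, and it omits the yaml formatter to stay stdlib-only:

```python
import json
import re

# Illustrative sketch (not the dataset_foundry implementation) of template
# resolution with dotted references and ":formatter" suffixes.
# The yaml formatter is omitted here because it would require PyYAML.
FORMATTERS = {
    "json": lambda v: json.dumps(v),
    "upper": lambda v: str(v).upper(),
    "lower": lambda v: str(v).lower(),
}

def resolve(template, context):
    def lookup(path):
        # walk dotted path, resolving dict keys or object attributes
        value = context
        for part in path.split("."):
            value = value[part] if isinstance(value, dict) else getattr(value, part)
        return value

    def replace(match):
        # split "path:formatter"; the formatter part is optional
        path, _, fmt = match.group(1).partition(":")
        value = lookup(path)
        return FORMATTERS[fmt](value) if fmt else str(value)

    return re.sub(r"\{([^{}]+)\}", replace, template)

spec = {"name": "my_function"}
print(resolve("item_{id}_{spec.name}.py", {"id": 123, "spec": spec}))
# item_123_my_function.py
```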

For instance, if an item is being processed with an id of 123 and a spec dictionary with a name key of my_function, the following will save the code property of the item as a file named item_123_my_function.py:

   ...
   save_item(contents=Key("code"), filename="item_{id}_{spec.name}.py"),
   ...
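For comparison, plain str.format can reproduce the {id} part of the filename above but cannot resolve dotted dictionary keys like {spec.name}; the enhanced parsing handles both. Shown for illustration only:

```python
# Plain str.format equivalent of the filename template above: the dotted
# dictionary lookup must be done by hand, which Dataset Foundry's enhanced
# parsing does automatically.
spec = {"name": "my_function"}
filename = "item_{id}_{name}.py".format(id=123, name=spec["name"])
print(filename)  # item_123_my_function.py
```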

License

This project is licensed under the MIT License - see the LICENSE file for details.

Download files

Download the file for your platform.

Source Distribution

dataset_foundry-0.2.0.tar.gz (130.5 kB)
Built Distribution


dataset_foundry-0.2.0-py3-none-any.whl (97.3 kB)
File details

Details for the file dataset_foundry-0.2.0.tar.gz.

File metadata

  • Download URL: dataset_foundry-0.2.0.tar.gz
  • Size: 130.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for dataset_foundry-0.2.0.tar.gz:

  • SHA256: 6544186854565d54e263fb54ff6a111c50c992fae61dd5bdc5e54cc5f4ad6afe
  • MD5: 8f48e1ed632de2b794f8d0ab31a3d302
  • BLAKE2b-256: f95248c657700560956ad48a2dfc4db900746616c7c5a58ee5ca54ea80b3d0d4


Provenance

The following attestation bundles were made for dataset_foundry-0.2.0.tar.gz:

Publisher: publish.yml on fastfedora/dataset-foundry

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file dataset_foundry-0.2.0-py3-none-any.whl.

File hashes

Hashes for dataset_foundry-0.2.0-py3-none-any.whl:

  • SHA256: e35229424c42883cf80e8efe2a8d39b41bb853b9655f46ca45d7c220a95d05b8
  • MD5: 64fee93566207679237ccf7687859896
  • BLAKE2b-256: 1e5e21d78ba848d24e50f813baa32facfd8bac4d11c3ce853437bfc764b9c3c3


Provenance

The following attestation bundles were made for dataset_foundry-0.2.0-py3-none-any.whl:

Publisher: publish.yml on fastfedora/dataset-foundry

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
