General framework for synthetic data generation

🎨 NeMo Data Designer

Generate high-quality synthetic datasets from scratch or using your own seed data.

Welcome!

Data Designer helps you create synthetic datasets that go beyond simple LLM prompting. Whether you need diverse statistical distributions, meaningful correlations between fields, or validated high-quality outputs, Data Designer provides a flexible framework for building production-grade synthetic data.

What can you do with Data Designer?

Generate diverse data using statistical samplers, LLMs, or existing seed datasets
Control relationships between fields with dependency-aware generation
Validate quality with built-in Python, SQL, and custom local and remote validators
Score outputs using LLM-as-a-judge for quality assessment
Iterate quickly with preview mode before full-scale generation
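To make "dependency-aware generation" concrete, here is a minimal conceptual sketch using only the standard library. It is not Data Designer's implementation. The `columns` dictionary, the `PLACEHOLDER` pattern, and the `generation_order` helper are hypothetical names introduced for illustration. The idea is that a column whose prompt references `{{ another_column }}` must be generated after that column.

```python
import re
from graphlib import TopologicalSorter

# Hypothetical column specs: name -> prompt template (None for a plain sampler).
columns = {
    "review": "Write a brief product review for a {{ product_category }} item.",
    "product_category": None,
    "review_summary": "Summarize this review: {{ review }}",
}

# Matches template references such as {{ product_category }}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def generation_order(specs):
    """Return column names ordered so each column follows the ones it references."""
    graph = {
        name: set(PLACEHOLDER.findall(template or ""))
        for name, template in specs.items()
    }
    # TopologicalSorter takes node -> predecessors and yields predecessors first.
    return list(TopologicalSorter(graph).static_order())


order = generation_order(columns)
print(order)  # ['product_category', 'review', 'review_summary']
```

Because the three hypothetical columns form a single chain, the order is unique. With independent columns, several valid orders exist.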

⚠️ Security Notice: LiteLLM Supply-Chain Incident (2026-03-24)

On March 24, 2026, malicious versions of litellm (1.82.7 and 1.82.8) were published to PyPI containing a credential stealer. The compromised packages were available for approximately five hours (10:39 – 16:00 UTC) before being removed.

The only Data Designer releases that could resolve to these versions are v0.2.2 (Dec 2025) and v0.2.3 (Jan 2026), which carried a looser litellm<2 upper bound. These are nearly three months old and have been superseded by eight subsequent releases. Both have been yanked from PyPI as a precaution. All other releases (v0.3.0 – v0.5.3) pinned litellm to >=1.73.6,<1.80.12 and were never compatible with 1.82.x. Starting with v0.5.4, litellm is no longer a dependency.
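The difference between the two constraints can be checked with a small sketch. It is a simplified comparison of plain `X.Y.Z` release strings, not a full PEP 440 implementation. The helpers `parse` and `in_range` are hypothetical, and only the upper bound quoted above is modeled for the older releases.

```python
def parse(version):
    """Turn a plain 'X.Y.Z' release string into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def in_range(version, lower=None, upper=None):
    """True if lower <= version < upper; either bound may be omitted."""
    v = parse(version)
    if lower is not None and v < parse(lower):
        return False
    if upper is not None and v >= parse(upper):
        return False
    return True


COMPROMISED = ["1.82.7", "1.82.8"]

# Loose bound carried by v0.2.2 and v0.2.3: litellm<2
loose = [in_range(v, upper="2") for v in COMPROMISED]

# Pin quoted for v0.3.0 through v0.5.3: litellm>=1.73.6,<1.80.12
pinned = [in_range(v, lower="1.73.6", upper="1.80.12") for v in COMPROMISED]

print(loose)   # [True, True]   -> the loose bound admits both bad versions
print(pinned)  # [False, False] -> the tighter pin excludes them
```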

To have been impacted through Data Designer, you would need to have had one of these two old versions explicitly pinned and run a fresh pip install or dependency-cache update that resolved litellm during the five-hour window on March 24. If you believe you may be affected, see BerriAI's incident report for remediation steps.
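As a first check, the following sketch reports whether the litellm version currently installed in your environment is one of the two compromised releases. It uses only `importlib.metadata` from the standard library. The `litellm_status` helper and its return strings are hypothetical. It inspects only the current environment, so a clean result does not rule out an install that happened during the five-hour window and was later replaced.

```python
from importlib.metadata import PackageNotFoundError, version

# The two malicious releases named in the notice above.
COMPROMISED_LITELLM = {"1.82.7", "1.82.8"}


def litellm_status(installed=None):
    """Classify the litellm version found in the current environment."""
    if installed is None:
        try:
            installed = version("litellm")
        except PackageNotFoundError:
            return "not installed"
    if installed in COMPROMISED_LITELLM:
        return "compromised"
    return "not a known-bad version"


print(litellm_status())
```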

Quick Start

1. Install

pip install data-designer

Or install from source:

git clone https://github.com/NVIDIA-NeMo/DataDesigner.git
cd DataDesigner
make install

2. Set your API key

Start with one of our default model providers: NVIDIA, OpenAI, or OpenRouter.

Grab an API key from your chosen provider(s) and set one or more of the following environment variables:

export NVIDIA_API_KEY="your-api-key-here"

export OPENAI_API_KEY="your-openai-api-key-here"

export OPENROUTER_API_KEY="your-openrouter-api-key-here"
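Before running the example in the next step, it can help to confirm that at least one key is visible to Python. This is a small standard-library sketch. The `configured_keys` helper is hypothetical and not part of Data Designer.

```python
import os

# Environment variables for the default providers listed above.
PROVIDER_KEYS = ("NVIDIA_API_KEY", "OPENAI_API_KEY", "OPENROUTER_API_KEY")


def configured_keys(env=os.environ):
    """Return the provider key variables that are set to a non-empty value."""
    return [name for name in PROVIDER_KEYS if env.get(name, "").strip()]


found = configured_keys()
if found:
    print("Found:", ", ".join(found))
else:
    print("No provider API key is set; export at least one before generating data.")
```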

3. Start generating data!

import data_designer.config as dd
from data_designer.interface import DataDesigner

# Initialize with default settings
data_designer = DataDesigner()
config_builder = dd.DataDesignerConfigBuilder()

# Add a product category
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="product_category",
        sampler_type=dd.SamplerType.CATEGORY,
        params=dd.CategorySamplerParams(
            values=["Electronics", "Clothing", "Home & Kitchen", "Books"],
        ),
    )
)

# Generate personalized customer reviews
config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="review",
        model_alias="nvidia-text",
        prompt="Write a brief product review for a {{ product_category }} item you recently purchased.",
    )
)

# Preview your dataset
preview = data_designer.preview(config_builder=config_builder)
preview.display_sample_record()

What's next?

📚 Learn more

Getting Started – Install, configure, and generate your first dataset
Tutorial Notebooks – Step-by-step interactive tutorials
Column Types – Explore samplers, LLM columns, validators, and more
Validators – Learn how to validate generated data with Python, SQL, and remote validators
Model Configuration – Configure custom models and providers
Person Sampling – Learn how to sample realistic person data with demographic attributes

🔧 Configure models via CLI

data-designer config providers # Configure model providers
data-designer config models    # Set up your model configurations
data-designer config list      # View current settings

🤖 Agent Skill

Data Designer has a skill for coding agents. Just describe the dataset you want, and your agent handles schema design, validation, and generation. While the skill should work with other coding agents that support skills, our development and testing have focused on Claude Code at this stage.

Install via skills.sh (be sure to select Claude Code as an additional agent):

npx skills add NVIDIA-NeMo/DataDesigner

After installation, type /data-designer or describe the dataset you want and the skill will kick in.

🤝 Get involved

This repository supports agent-assisted development — see CONTRIBUTING.md for the recommended workflow.

Contributing Guide – How to contribute, including agent-assisted workflows
GitHub Issues – Report bugs or make a feature request

Telemetry

Data Designer collects telemetry to help us improve the library for developers. We collect:

The names of models used
The count of input tokens
The count of output tokens

No user or device information is collected, and this data is not used to track any individual user's behavior. It is aggregated to show which models are the most popular for synthetic data generation (SDG). We will share this usage data with the community.

Specifically, the model name defined in a ModelConfig object is what is collected. In the example config below:

ModelConfig(
    alias="nv-reasoning",
    model="openai/gpt-oss-20b",
    provider="nvidia",
    inference_parameters=ChatCompletionInferenceParams(
        temperature=0.3,
        top_p=0.9,
        max_tokens=4096,
    ),
)

The value openai/gpt-oss-20b would be collected.

To disable telemetry capture, set NEMO_TELEMETRY_ENABLED=false.
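If you prefer to set the variable from Python rather than from your shell, a minimal sketch looks like this. The README does not say when the variable is read, so this sketch assumes it should be set before `data_designer` is imported.

```python
import os

# Assumption: set this before importing data_designer so the setting is
# already in place whenever the library reads it.
os.environ["NEMO_TELEMETRY_ENABLED"] = "false"

print(os.environ["NEMO_TELEMETRY_ENABLED"])  # false
```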

Top Models

This chart represents the breakdown of models used for Data Designer across all synthetic data generation jobs from 2/23/2026 to 3/23/2026.

[Chart: Top models used for synthetic data generation]

Last updated on 3/23/2026

License

Apache License 2.0 – see LICENSE for details.

Citation

If you use NeMo Data Designer in your research, please cite it using the following BibTeX entry:

@misc{nemo-data-designer,
  author = {The NeMo Data Designer Team, NVIDIA},
  title = {NeMo Data Designer: A framework for generating synthetic data from scratch or based on your own seed data},
  howpublished = {\url{https://github.com/NVIDIA-NeMo/DataDesigner}},
  year = {2025},
  note = {GitHub Repository},
}

