
General framework for synthetic data generation


🎨 NeMo Data Designer

Supported Python versions: 3.10 - 3.13

Generate high-quality synthetic datasets from scratch or using your own seed data.


Welcome!

Data Designer helps you create synthetic datasets that go beyond simple LLM prompting. Whether you need diverse statistical distributions, meaningful correlations between fields, or validated high-quality outputs, Data Designer provides a flexible framework for building production-grade synthetic data.

What can you do with Data Designer?

  • Generate diverse data using statistical samplers, LLMs, or existing seed datasets
  • Control relationships between fields with dependency-aware generation
  • Validate quality with built-in Python, SQL, and custom local and remote validators
  • Score outputs using LLM-as-a-judge for quality assessment
  • Iterate quickly with preview mode before full-scale generation
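To make "dependency-aware generation" more concrete, here is a minimal, library-independent sketch of the idea: a column whose prompt references another column through a `{{ name }}` placeholder has to be generated after that column. The helper `generation_order` and the `columns` dictionary are hypothetical illustrations built on the Python standard library only; this is not a description of how Data Designer resolves dependencies internally.

```python
import re
from graphlib import TopologicalSorter

# Matches Jinja-style placeholders such as {{ product_category }}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def generation_order(columns: dict[str, str]) -> list[str]:
    """Order column names so each column comes after the columns it references.

    `columns` maps a column name to its prompt template. Columns that do not
    depend on anything (for example sampled columns) use an empty string.
    """
    known = set(columns)
    graph = {
        name: set(PLACEHOLDER.findall(template)) & known
        for name, template in columns.items()
    }
    # static_order() yields every node after all of its predecessors.
    return list(TopologicalSorter(graph).static_order())


# Hypothetical two-column dataset: the review depends on the category.
columns = {
    "review": "Write a brief product review for a {{ product_category }} item.",
    "product_category": "",
}
print(generation_order(columns))
```

In this sketch the sampled column is always scheduled before the LLM column that mentions it, regardless of the order in which the columns were declared.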

Quick Start

1. Install

pip install data-designer

Or install from source:

git clone https://github.com/NVIDIA-NeMo/DataDesigner.git
cd DataDesigner
make install

2. Set your API key

Start with one of our default model providers: NVIDIA, OpenAI, or OpenRouter.

Get an API key from the provider(s) you want to use and set one or more of the following environment variables:

export NVIDIA_API_KEY="your-api-key-here"

export OPENAI_API_KEY="your-openai-api-key-here"

export OPENROUTER_API_KEY="your-openrouter-api-key-here"
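Before generating data, it can help to confirm that at least one of these variables is visible to your Python process. The following is a small sketch using only the standard library; `configured_key_vars` is a hypothetical helper, not part of Data Designer, and it checks only that a variable is set, not that the key is valid.

```python
import os

# Environment variable names taken from the export commands above.
PROVIDER_KEY_VARS = ("NVIDIA_API_KEY", "OPENAI_API_KEY", "OPENROUTER_API_KEY")


def configured_key_vars(environ=None) -> list[str]:
    """Return the provider key variables that are set to a non-empty value."""
    env = os.environ if environ is None else environ
    return [name for name in PROVIDER_KEY_VARS if env.get(name, "").strip()]


if __name__ == "__main__":
    found = configured_key_vars()
    if found:
        print("API keys found for:", ", ".join(found))
    else:
        print("No provider API key is set; export at least one before generating data.")
```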

3. Start generating data!

import data_designer.config as dd
from data_designer.interface import DataDesigner

# Initialize with default settings
data_designer = DataDesigner()
config_builder = dd.DataDesignerConfigBuilder()

# Add a product category
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="product_category",
        sampler_type=dd.SamplerType.CATEGORY,
        params=dd.CategorySamplerParams(
            values=["Electronics", "Clothing", "Home & Kitchen", "Books"],
        ),
    )
)

# Generate personalized customer reviews
config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="review",
        model_alias="nvidia-text",
        prompt="Write a brief product review for a {{ product_category }} item you recently purchased.",
    )
)

# Preview your dataset
preview = data_designer.preview(config_builder=config_builder)
preview.display_sample_record()
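Conceptually, each record in this example pairs a sampled category with the prompt rendered for that category. The sketch below imitates that flow with the standard library only: `random.choice` stands in for the category sampler (uniform sampling is an assumption) and a small regex substitution stands in for Jinja rendering. No LLM is called, and `render_prompt` is a hypothetical helper, not part of Data Designer.

```python
import random
import re

CATEGORIES = ["Electronics", "Clothing", "Home & Kitchen", "Books"]
PROMPT = (
    "Write a brief product review for a {{ product_category }} "
    "item you recently purchased."
)


def render_prompt(template: str, record: dict[str, str]) -> str:
    """Replace each {{ name }} placeholder with the matching value from `record`."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lambda m: record[m.group(1)], template)


rng = random.Random(0)  # fixed seed so repeated runs sample the same categories
for _ in range(3):
    record = {"product_category": rng.choice(CATEGORIES)}
    # In the real pipeline this rendered prompt would be sent to the model.
    print(render_prompt(PROMPT, record))
```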

What's next?

🔧 Configure models via CLI

data-designer config providers # Configure model providers
data-designer config models    # Set up your model configurations
data-designer config list      # View current settings

Telemetry

Data Designer collects telemetry to help us improve the library for developers. We collect:

  • The names of models used
  • The count of input tokens
  • The count of output tokens

No user or device information is collected. This data is not used to track individual user behavior; it is used in aggregate to see which models are most popular for synthetic data generation (SDG). We will share this usage data with the community.

Specifically, the model name defined in a ModelConfig object is what is collected. Consider the example config below:

ModelConfig(
    alias="nv-reasoning",
    model="openai/gpt-oss-20b",
    provider="nvidia",
    inference_parameters=ChatCompletionInferenceParams(
        temperature=0.3,
        top_p=0.9,
        max_tokens=4096,
    ),
)

The value openai/gpt-oss-20b would be collected.

To disable telemetry capture, set NEMO_TELEMETRY_ENABLED=false.
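If you prefer to opt out from Python rather than from the shell, one option is to set the variable at the top of your script. This sketch assumes the setting may be read when the library is imported or first used, which is not specified here, so it sets the variable before any Data Designer import.

```python
import os

# Opt out of telemetry for this process. Set this before importing
# data_designer, in case the setting is read at import time (an assumption).
os.environ["NEMO_TELEMETRY_ENABLED"] = "false"

print(os.environ["NEMO_TELEMETRY_ENABLED"])
```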

Top Models

This chart represents the breakdown of models used for Data Designer across all synthetic data generation jobs from 2/23/2026 to 3/23/2026.

[Chart: Top models used for synthetic data generation]

Last updated on 3/23/2026


License

Apache License 2.0 – see LICENSE for details.


Citation

If you use NeMo Data Designer in your research, please cite it using the following BibTeX entry:

@misc{nemo-data-designer,
  author = {The NeMo Data Designer Team, NVIDIA},
  title = {NeMo Data Designer: A framework for generating synthetic data from scratch or based on your own seed data},
  howpublished = {\url{https://github.com/NVIDIA-NeMo/DataDesigner}},
  year = {2025},
  note = {GitHub Repository},
}
