General framework for synthetic data generation

🎨 NeMo Data Designer

Supports Python 3.10 - 3.13.

Generate high-quality synthetic datasets from scratch or using your own seed data.


Welcome!

Data Designer helps you create synthetic datasets that go beyond simple LLM prompting. Whether you need diverse statistical distributions, meaningful correlations between fields, or validated high-quality outputs, Data Designer provides a flexible framework for building production-grade synthetic data.

What can you do with Data Designer?

  • Generate diverse data using statistical samplers, LLMs, or existing seed datasets
  • Control relationships between fields with dependency-aware generation
  • Validate quality with built-in Python and SQL validators, or with custom local and remote validators
  • Score outputs using LLM-as-a-judge for quality assessment
  • Iterate quickly with preview mode before full-scale generation
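To make "dependency-aware generation" more concrete, here is a minimal sketch of the general idea: a column whose prompt references another column has to be generated after it. This is not Data Designer's implementation. The `generation_order` helper and the example column names (`rating`, `summary`) are hypothetical, and the sketch assumes dependencies are expressed only through `{{ name }}` placeholders.

```python
import re
from graphlib import TopologicalSorter

# Matches Jinja-style placeholders such as {{ product_category }}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def generation_order(columns: dict[str, str]) -> list[str]:
    """Order columns so each one comes after every column it references.

    `columns` maps a column name to its prompt template. Columns that
    reference nothing (for example sampler columns) use an empty string.
    This helper is hypothetical and only illustrates the idea.
    """
    graph = {
        name: {ref for ref in PLACEHOLDER.findall(template) if ref in columns}
        for name, template in columns.items()
    }
    # TopologicalSorter expects {node: predecessors} and yields predecessors first.
    return list(TopologicalSorter(graph).static_order())


columns = {
    "summary": "Summarize this review: {{ review }}",
    "review": "Write a review for a {{ product_category }} item rated {{ rating }}.",
    "product_category": "",
    "rating": "",
}
order = generation_order(columns)
print(order)
```

The order of independent columns (here `product_category` and `rating`) is not significant; only the "referenced column first" constraint is guaranteed.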

Quick Start

1. Install

pip install data-designer

Or install from source:

git clone https://github.com/NVIDIA-NeMo/DataDesigner.git
cd DataDesigner
make install

2. Set your API key

Get your API key from build.nvidia.com or OpenAI:

export NVIDIA_API_KEY="your-api-key-here"
# Or use OpenAI
export OPENAI_API_KEY="your-openai-api-key-here"
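Before running the next step, it can help to confirm that at least one key is visible to Python. The sketch below uses only the standard library and the two variable names from the commands above; `configured_keys` is a hypothetical helper, not part of Data Designer.

```python
import os

# Variable names taken from the export commands above.
KEY_VARS = ("NVIDIA_API_KEY", "OPENAI_API_KEY")


def configured_keys(env=None):
    """Return the names of the API-key variables that are set and non-empty.

    Hypothetical helper for illustration; pass a dict to check something
    other than the current process environment.
    """
    env = os.environ if env is None else env
    return [name for name in KEY_VARS if env.get(name, "").strip()]


found = configured_keys()
print("API keys found:", ", ".join(found) if found else "none")
```

If this prints `none`, re-run the `export` command in the same shell session you use to start Python.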

3. Generate your first dataset

from data_designer.essentials import (
    CategorySamplerParams,
    DataDesigner,
    DataDesignerConfigBuilder,
    LLMTextColumnConfig,
    SamplerColumnConfig,
    SamplerType,
)

# Initialize with default settings
data_designer = DataDesigner()
config_builder = DataDesignerConfigBuilder()

# Add a product category
config_builder.add_column(
    SamplerColumnConfig(
        name="product_category",
        sampler_type=SamplerType.CATEGORY,
        params=CategorySamplerParams(
            values=["Electronics", "Clothing", "Home & Kitchen", "Books"],
        ),
    )
)

# Generate a customer review for each sampled product category
config_builder.add_column(
    LLMTextColumnConfig(
        name="review",
        model_alias="nvidia-text",
        prompt="""Write a brief product review for a {{ product_category }} item you recently purchased.""",
    )
)

# Preview your dataset
preview = data_designer.preview(config_builder=config_builder)
preview.display_sample_record()

That's it! You've generated a preview of your first dataset.
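To see how the two columns relate, here is a rough standard-library sketch of what happens conceptually before any model is called: a category is sampled for each record, then substituted into the `{{ product_category }}` placeholder. It is a stand-in, not Data Designer's code. Uniform sampling, the regex-based `render` function, and the `build_prompts` helper are all assumptions made for illustration, and no LLM request is sent.

```python
import random
import re

# Values and prompt copied from the Quick Start example above.
CATEGORIES = ["Electronics", "Clothing", "Home & Kitchen", "Books"]
PROMPT = "Write a brief product review for a {{ product_category }} item you recently purchased."

# Matches placeholders such as {{ product_category }}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template: str, record: dict) -> str:
    """Replace each {{ name }} placeholder with the matching value from `record`."""
    return PLACEHOLDER.sub(lambda match: str(record[match.group(1)]), template)


def build_prompts(num_records: int, seed: int = 0) -> list[str]:
    """Sample one category per record (uniformly, as an assumption) and fill the prompt."""
    rng = random.Random(seed)
    records = [{"product_category": rng.choice(CATEGORIES)} for _ in range(num_records)]
    return [render(PROMPT, record) for record in records]


prompts = build_prompts(3)
for prompt in prompts:
    print(prompt)
```

Each printed line is the kind of text a model would receive for one record; the `review` column would then hold the model's response.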


What's next?

📚 Learn more

🔧 Configure models via CLI

data-designer config providers # Configure model providers
data-designer config models    # Set up your model configurations
data-designer config list      # View current settings

🤝 Get involved


License

Apache License 2.0 – see LICENSE for details.


Citation

If you use NeMo Data Designer in your research, please cite it using the following BibTeX entry:

@misc{nemo-data-designer,
  author = {The NeMo Data Designer Team},
  title = {NeMo Data Designer: A framework for generating synthetic data from scratch or based on your own seed data},
  howpublished = {\url{https://github.com/NVIDIA-NeMo/DataDesigner}},
  year = {2025},
  note = {GitHub Repository},
}
