
General framework for synthetic data generation


🎨 NeMo Data Designer

Requires Python 3.10 or later.

Generate high-quality synthetic datasets from scratch or using your own seed data.


Welcome!

Data Designer helps you create synthetic datasets that go beyond simple LLM prompting. Whether you need diverse statistical distributions, meaningful correlations between fields, or validated high-quality outputs, Data Designer provides a flexible framework for building production-grade synthetic data.

What can you do with Data Designer?

  • Generate diverse data using statistical samplers, LLMs, or existing seed datasets
  • Control relationships between fields with dependency-aware generation
  • Validate quality with built-in Python and SQL validators, plus custom local and remote validators
  • Score outputs using LLM-as-a-judge for quality assessment
  • Iterate quickly with preview mode before full-scale generation
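To make "dependency-aware generation" concrete, here is a minimal, library-independent sketch of the general idea: columns whose prompts reference other columns must be generated after the columns they reference. This is not Data Designer's implementation; the `columns` dictionary, the `customer_name` and `summary` columns, and the helper functions are hypothetical and exist only for illustration.

```python
import re
from graphlib import TopologicalSorter

# Hypothetical column specs: name -> prompt template, or None for a sampled column.
columns = {
    "review": "Write a review of a {{ product_category }} item for {{ customer_name }}.",
    "product_category": None,
    "customer_name": None,
    "summary": "Summarize this review: {{ review }}",
}

# Matches Jinja-style placeholders such as {{ product_category }}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def dependencies(template):
    """Return the set of column names referenced by a template."""
    return set(PLACEHOLDER.findall(template)) if template else set()


def generation_order(specs):
    """Order columns so that every column comes after the columns it references."""
    graph = {name: dependencies(template) for name, template in specs.items()}
    return list(TopologicalSorter(graph).static_order())


order = generation_order(columns)
print(order)
```

In this sketch, `product_category` and `customer_name` always precede `review`, which in turn precedes `summary`. The relative order of the two independent sampled columns is not significant.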

Quick Start

1. Install

pip install data-designer

Or install from source:

git clone https://github.com/NVIDIA-NeMo/DataDesigner.git
cd DataDesigner
make install

2. Set your API key

Get your API key from build.nvidia.com or OpenAI:

export NVIDIA_API_KEY="your-api-key-here"
# Or use OpenAI
export OPENAI_API_KEY="your-openai-api-key-here"
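
If generation fails with an authentication error, it can help to confirm that the key is actually visible to Python. The following is a small sketch that only inspects the two environment variables named above; the `configured_keys` helper is hypothetical and not part of Data Designer.

```python
import os

# Variable names taken from the export commands above.
API_KEY_VARS = ("NVIDIA_API_KEY", "OPENAI_API_KEY")


def configured_keys(environ=os.environ):
    """Return the names of API key variables that are set and non-empty."""
    return [name for name in API_KEY_VARS if environ.get(name, "").strip()]


found = configured_keys()
if found:
    print("API key variables set:", ", ".join(found))
else:
    print("No API key found; export NVIDIA_API_KEY or OPENAI_API_KEY first.")
```

The check reports only variable names, never the key values, so its output is safe to paste into a bug report.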

3. Generate your first dataset

from data_designer.essentials import (
    CategorySamplerParams,
    DataDesigner,
    DataDesignerConfigBuilder,
    LLMTextColumnConfig,
    SamplerColumnConfig,
    SamplerType,
)

# Initialize with default settings
data_designer = DataDesigner()
config_builder = DataDesignerConfigBuilder()

# Add a product category
config_builder.add_column(
    SamplerColumnConfig(
        name="product_category",
        sampler_type=SamplerType.CATEGORY,
        params=CategorySamplerParams(
            values=["Electronics", "Clothing", "Home & Kitchen", "Books"],
        ),
    )
)

# Generate a customer review for each sampled product category
config_builder.add_column(
    LLMTextColumnConfig(
        name="review",
        model_alias="nvidia-text",
        prompt="""Write a brief product review for a {{ product_category }} item you recently purchased.""",
    )
)

# Preview your dataset
preview = data_designer.preview(config_builder=config_builder)
preview.display_sample_record()

That's it! You've generated a preview of your first dataset.
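
The `{{ product_category }}` placeholder in the prompt is filled in per record with the value sampled for that column. The sketch below imitates that substitution with the standard library only, so it runs without an API key. It is an illustration of the idea, not how Data Designer renders prompts internally; the `render` helper is hypothetical.

```python
import random
import re

CATEGORIES = ["Electronics", "Clothing", "Home & Kitchen", "Books"]
PROMPT = "Write a brief product review for a {{ product_category }} item you recently purchased."


def render(template, row):
    """Replace each {{ name }} placeholder with the matching value from row."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lambda m: str(row[m.group(1)]), template)


rng = random.Random(0)  # seeded so repeated runs sample the same categories
rows = [{"product_category": rng.choice(CATEGORIES)} for _ in range(3)]
prompts = [render(PROMPT, row) for row in rows]
for prompt in prompts:
    print(prompt)
```

Each printed line is the prompt that would be sent to the model for one record, with a concrete category in place of the placeholder.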


What's next?

🔧 Configure models via CLI

data-designer config providers # Configure model providers
data-designer config models    # Set up your model configurations
data-designer config list      # View current settings

License

Apache License 2.0 – see LICENSE for details.
