
Project description

synthetic-data-gen

Generate synthetic training data for ML pipelines — Q&A pairs, classification examples, tabular data, and instruction-following datasets.


Install

pip install synthetic-data-gen

Requires the ANTHROPIC_API_KEY environment variable to be set.
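One way to fail fast when the key is missing, rather than erroring mid-generation (the `require_api_key` helper below is illustrative, not part of the package):

```python
import os

def require_api_key(env=None):
    """Raise early if ANTHROPIC_API_KEY is not set (illustrative helper)."""
    env = os.environ if env is None else env
    if not env.get("ANTHROPIC_API_KEY"):
        raise RuntimeError("Set ANTHROPIC_API_KEY before using synthetic-data-gen")
    return env["ANTHROPIC_API_KEY"]
```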

Quick start

from synth_data import SynthDataGen

gen = SynthDataGen()

# Q&A pairs from your corpus
qa = gen.qa_pairs(context="The UK AI Safety Institute was founded in 2023...", n=10)
qa.save("qa_train.jsonl")

# Classification examples
examples = gen.classification(
    labels=["compliant", "non_compliant", "requires_review"],
    domain="UK GDPR data processing records",
    n=60,
)
examples.save("gdpr_train.csv", format="csv")

# Instruction-following dataset
dataset = gen.instructions(
    task_description="Summarise UK government policy documents",
    n=30,
)
print(dataset.to_alpaca())  # Alpaca fine-tuning format

# Tabular synthetic data
employees = gen.tabular(
    columns=["name", "department", "grade", "salary"],
    schema={"grade": "one of: EO, HEO, SEO, G7, G6", "salary": "integer 25000-120000"},
    domain="UK civil service",
    n=100,
)
employees.save("workforce.csv", format="csv")
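The `schema` hints above are free-text instructions to the model, so generated rows may occasionally drift outside the stated constraints. A minimal validation pass before training, assuming the tabular example above (the sample rows and `validate_row` helper are hypothetical, not package API):

```python
import csv
import io

GRADES = {"EO", "HEO", "SEO", "G7", "G6"}

def validate_row(row):
    """Check one generated record against the schema hints (illustrative)."""
    ok_grade = row["grade"] in GRADES
    ok_salary = 25000 <= int(row["salary"]) <= 120000
    return ok_grade and ok_salary

# Hypothetical sample of what workforce.csv might contain.
sample = (
    "name,department,grade,salary\n"
    "A. Prasad,HMRC,HEO,34000\n"
    "J. Okafor,DWP,G7,62000\n"
)
rows = list(csv.DictReader(io.StringIO(sample)))
clean = [r for r in rows if validate_row(r)]
```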

Export formats

dataset.to_json()    # pretty-printed JSON
dataset.to_jsonl()   # one object per line (HuggingFace format)
dataset.to_csv()     # CSV with headers
dataset.to_alpaca()  # Alpaca instruction-tuning format
dataset.save("file.jsonl", format="jsonl")
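Because `to_jsonl()` emits one JSON object per line, the output loads back with the standard library alone. A sketch, assuming a hypothetical Q&A record shape (the field names here are illustrative):

```python
import json

# Hypothetical two-record payload, as to_jsonl() would emit it.
payload = (
    '{"question": "When was the UK AI Safety Institute founded?", "answer": "2023"}\n'
    '{"question": "How many Q&A pairs were requested?", "answer": "10"}\n'
)

# One json.loads per non-empty line recovers the records.
records = [json.loads(line) for line in payload.splitlines() if line.strip()]
print(len(records))  # → 2
```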

Linda Oraegbunam | LinkedIn | GitHub

Download files

Download the file for your platform.

Source Distribution

synthetic_dataset_gen-1.0.0.tar.gz (10.0 kB)

Uploaded Source

Built Distribution


synthetic_dataset_gen-1.0.0-py3-none-any.whl (8.8 kB)

Uploaded Python 3

File details

Details for the file synthetic_dataset_gen-1.0.0.tar.gz.

File metadata

  • Download URL: synthetic_dataset_gen-1.0.0.tar.gz
  • Size: 10.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for synthetic_dataset_gen-1.0.0.tar.gz:

  • SHA256: 22e330d76215925a98c523de8b07aca5bbb53603d192089d271ff9bc3f7289ab
  • MD5: 679d3ca53dfeb4ce3963bc800d4f3d3c
  • BLAKE2b-256: ec9153a5cb576d07de1eee901f7a0c01ce9f84fc7748b04984be65871581ed45

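To check a downloaded archive against the published SHA256 digest above, the standard library suffices; a minimal sketch (the file path in the comment is the sdist named above):

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Hex SHA256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()

# After downloading, compare against the published digest, e.g.:
# sha256_hex(open("synthetic_dataset_gen-1.0.0.tar.gz", "rb").read()) == "22e330d7..."
digest = sha256_hex(b"hello")
```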

Provenance

The following attestation bundles were made for synthetic_dataset_gen-1.0.0.tar.gz:

Publisher: publish.yml on obielin/synthetic-data-gen

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file synthetic_dataset_gen-1.0.0-py3-none-any.whl.

File metadata

File hashes

Hashes for synthetic_dataset_gen-1.0.0-py3-none-any.whl:

  • SHA256: df43decaa682d22b64aec15cb941b291760abec2af6af973fa1d1649451fb5d8
  • MD5: ee7504aec66d92a6436ba1f3249f27a4
  • BLAKE2b-256: e7098f7076bd435bf6ffee4f9c3ae9e8174fa91807a12d37ebfb608b3882287b


Provenance

The following attestation bundles were made for synthetic_dataset_gen-1.0.0-py3-none-any.whl:

Publisher: publish.yml on obielin/synthetic-data-gen

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
