Context-aware synthetic data generation for coherent Pydantic domain models.

Verisim

Context-aware synthetic data for Python.

The name comes from "verisimilitude," meaning "the appearance of being real."

Verisim generates whole, coherent Pydantic domain objects instead of unrelated random fields. A generated person can have a name, username, email, phone, address, job, company, bio, website, and social profiles that all make sense together.

Project status: early prototype. The current package includes the core engine, Pydantic models, a lite data pack, examples, and full test coverage. Large global data packs and AI prose adapters are extension points, not finished product features yet.

Why Verisim Exists

Libraries like Faker are excellent at generating individual fake values. The problem starts when those values need to belong to the same fictional person, company, or dataset.

Typical generated records often look fake because each field is created in isolation:

  • the name and username do not belong together,
  • the bio has nothing to do with the job,
  • the phone number does not match the country,
  • the website domain is unrelated to the person or company,
  • every social profile reuses the same handle,
  • the address may look well formatted without being geographically coherent.

Here is what that looks like in practice:

Faker: plausible fields, isolated from each other

from faker import Faker

fake = Faker("en_US")

person = {
    "name": fake.name(),
    "username": fake.user_name(),
    "email": fake.email(),
    "phone": fake.phone_number(),
    "address": fake.address(),
    "job": fake.job(),
    "company": fake.company(),
    "bio": fake.sentence(),
    "website": fake.url(),
}

{
  "name": "Maya Rao",
  "username": "thomas77",
  "email": "melissa.watson@example.net",
  "phone": "+1-202-555-0188",
  "address": "4896 James Station\nPhoenix, AZ 85004",
  "job": "Marine scientist",
  "company": "Northstar Medical Group",
  "bio": "Writes about fintech compliance.",
  "website": "https://miller-johnson.example.org/"
}

Each value is believable alone. Together, it is a person whose name, login, inbox, job, company, bio, and website all point in different directions.

Verisim: one generated person, shared context

from verisim import PersonRecord, Verisim

v = Verisim(locale="en_US", seed=123)
record = v.generate(PersonRecord)

person = {
    "name": record.person.name,
    "username": record.person.username,
    "email": record.contact.email,
    "phone": record.contact.phone.e164,
    "address": (
        f"{record.address.city}, "
        f"{record.address.region_code} "
        f"{record.address.postal_code}"
    ),
    "job": record.job.title,
    "company": record.company.name,
    "bio": record.bio,
    "website": record.website.url,
}

{
  "name": "Brooke Garcia",
  "username": "brooke.garcia",
  "email": "brooke.garcia@kindred-medical-group.example.invalid",
  "phone": "+14155550000",
  "address": "San Francisco, CA 94107",
  "job": "Product Manager",
  "company": "Kindred Medical Group",
  "bio": "Brooke Garcia works as a Product Manager at Kindred Medical Group...",
  "website": "https://brooke.garcia.example.invalid"
}

The same facts carry through the record: name to username, email, website, city-aware contact data, company, job, and bio.

Verisim treats fake data as a domain modeling problem. It generates an aggregate record through a dependency-aware context graph, so later fields can use facts from earlier fields. Address generation knows about country, region, city, and postal code. Contact generation knows about the address country. Social generation knows about the person, job, and company. Bio generation knows about the job and industry. Company records carry their own scale, legal form, departments, leadership, domains, and email pattern, and those facts propagate when generating people for that company.
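
To make "later fields can use facts from earlier fields" concrete, here is a minimal standalone sketch that does not use Verisim at all. The helper names make_address and make_phone are hypothetical, and the ten-digit national number is a simplification rather than a valid numbering plan. The only point is that the phone step reads the country chosen by the address step instead of choosing again.

```python
import random

# ITU calling codes for the countries named in this README.
CALLING_CODES = {"US": "1", "CA": "1", "GB": "44", "AU": "61", "IN": "91", "DE": "49"}


def make_address(rng: random.Random) -> dict[str, str]:
    """Earlier step: pick a country (a stand-in for full address generation)."""
    return {"country_code": rng.choice(sorted(CALLING_CODES))}


def make_phone(rng: random.Random, address: dict[str, str]) -> str:
    """Later step: reuse the address country instead of picking one again."""
    prefix = CALLING_CODES[address["country_code"]]
    # Ten random digits is a simplification; real national formats differ.
    national = "".join(str(rng.randint(0, 9)) for _ in range(10))
    return f"+{prefix}{national}"


rng = random.Random(123)
address = make_address(rng)
phone = make_phone(rng, address)
print(address["country_code"], phone)
```

Because both steps draw from the same seeded generator, rerunning the script prints the same pair each time.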

The result is synthetic data that is still safe and fake, but believable enough for demos, seed data, tests, prototypes, and synthetic datasets.

Install

Install from PyPI with uv:

uv add verisim

Or install with pip:

python -m pip install verisim

Install optional package tiers as they become available:

uv add "verisim[lite]"
uv add "verisim[full]"
uv add "verisim[ai]"

Development From Source

Clone the repository and install the development dependencies:

git clone https://github.com/Harshal96/verisim.git
cd verisim
uv sync --extra dev

For editable installs while working on Verisim from another local project, use a relative path to your clone:

uv add --editable ../verisim

Quickstart

from verisim import PersonRecord, Verisim

verisim = Verisim(locale="en_US", output_language="en", seed=123)
record = verisim.generate(PersonRecord)

print(record.person.name)
print(record.person.username)
print(record.contact.email)
print(record.contact.phone.e164)
print(record.address.city, record.address.region_code, record.address.postal_code)
print(record.job.title)
print(record.company.name)
print(record.bio)
print(record.model_dump_json())

Command Line Usage

Verisim also installs a Faker-inspired CLI:

verisim [OPTIONS] COMMAND [ARGS]...

Generate one coherent person record:

uv run verisim person-record --seed 123

Generate repeated records as JSON lines:

uv run verisim person-record -r 3 --locale en_US --seed 123

Generate another supported target:

uv run verisim company-record --locale en_US --indent 2

Generate a coherent dataset:

uv run verisim dataset --people 40 --companies 6 --seed 7 --indent 2

Write output to a file:

uv run verisim person-record --repeat 10 --output people.jsonl

Supported record commands are person-record, person, company-record, company, address, contact, job, socials, and website.

Example shape:

{
    "person": {
        "name": "Brooke Garcia",
        "username": "brooke.garcia"
    },
    "contact": {
        "email": "brooke.garcia@kindred-medical-group.example.invalid",
        "phone": {
            "e164": "+14155550000",
            "country_code": "US"
        }
    },
    "address": {
        "city": "San Francisco",
        "region_code": "CA",
        "postal_code": "94107",
        "country_code": "US"
    },
    "job": {
        "title": "Product Manager",
        "industry": "Healthcare Technology"
    },
    "company": {
        "name": "Kindred Medical Group",
        "industry": "Healthcare Technology"
    },
    "bio": "Brooke Garcia works as a Product Manager at Kindred Medical Group..."
}
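
Records written as JSON lines (one object per line, shaped like the example above) can be read back with the standard library alone. The sketch below uses an in-memory stand-in for a file such as people.jsonl and a hypothetical coherent() check. The rule that the email local part equals the username is an assumption drawn from this one example; it will not hold for every email pattern.

```python
import io
import json

# In-memory stand-in for a JSON-lines file; the record is trimmed to the fields used here.
SAMPLE_JSONL = io.StringIO(
    '{"person": {"name": "Brooke Garcia", "username": "brooke.garcia"}, '
    '"contact": {"email": "brooke.garcia@kindred-medical-group.example.invalid"}}\n'
)


def coherent(record: dict) -> bool:
    """Check that the email matches the username and uses a non-routable domain."""
    local, _, domain = record["contact"]["email"].partition("@")
    return local == record["person"]["username"] and domain.endswith(".example.invalid")


# One JSON document per non-empty line.
records = [json.loads(line) for line in SAMPLE_JSONL if line.strip()]
print(len(records), all(coherent(r) for r in records))
```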

Core Ideas

Model-first API

Verisim is used through Pydantic models:

from verisim import PersonRecord, Socials, Verisim

v = Verisim(seed=42)

person = v.generate(PersonRecord)
socials = v.generate(Socials, context=person)

JSON output comes from Pydantic:

payload = person.model_dump_json()

Context graph generation

Providers declare what they need and what they produce. Verisim resolves the graph, shares typed context between providers, and validates the generated result.

Address -> Contact
Person + Address -> Contact
Industry + founded_year -> CompanyRecord
CompanyRecord -> Company + Contact + Job
Person + Job + Company -> Socials
Person + Job + Company -> Bio
Person + Address + Contact + Job + Company + Socials -> PersonRecord
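
The idea of providers that declare inputs and outputs can be sketched with the standard library's graphlib module. This is not Verisim's engine: the PROVIDERS table, the resolve function, and the field names are hypothetical, and unlike a real resolver this toy runs every provider rather than only those the target needs.

```python
from graphlib import TopologicalSorter

# Hypothetical provider table: produced key -> (required keys, builder function).
PROVIDERS = {
    "person": ((), lambda ctx: {"name": "Brooke Garcia"}),
    "address": ((), lambda ctx: {"country_code": "US"}),
    "contact": (
        ("person", "address"),
        lambda ctx: {
            "email_local": ctx["person"]["name"].lower().replace(" ", "."),
            "country_code": ctx["address"]["country_code"],
        },
    ),
}


def resolve(target: str) -> dict:
    """Run providers in dependency order and share one context between them."""
    # Map each node to the set of nodes that must run before it.
    graph = {key: set(needs) for key, (needs, _) in PROVIDERS.items()}
    context: dict = {}
    for key in TopologicalSorter(graph).static_order():
        context[key] = PROVIDERS[key][1](context)
    return context[target]


contact = resolve("contact")
print(contact)
```

A topological order guarantees that person and address exist in the shared context before the contact builder reads them.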

Safe by default

Generated contact details are non-routable by default. Emails and websites use synthetic .example.invalid domains, while still preserving realistic local parts, hosts, formats, and relationships. When a person is generated with company context, their email uses the company's domain and email pattern.
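
One way such an address could be assembled is sketched below. The helpers slugify, company_domain, and company_email, and the "{first}.{last}" pattern syntax, are assumptions for illustration and not Verisim's API. The .invalid top-level domain is reserved by RFC 2606, which is what keeps the result non-routable.

```python
import re

SAFE_SUFFIX = "example.invalid"  # .invalid is reserved by RFC 2606 and never resolves


def slugify(text: str) -> str:
    """Lowercase and collapse non-alphanumeric runs into single hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def company_domain(company_name: str) -> str:
    """Build a synthetic host name from the company name."""
    return f"{slugify(company_name)}.{SAFE_SUFFIX}"


def company_email(full_name: str, company_name: str, pattern: str = "{first}.{last}") -> str:
    """Fill the company's email pattern and attach the company's domain."""
    parts = full_name.lower().split()
    local = pattern.format(first=parts[0], last=parts[-1])
    return f"{local}@{company_domain(company_name)}"


print(company_email("Brooke Garcia", "Kindred Medical Group"))
# brooke.garcia@kindred-medical-group.example.invalid
```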

Generate Related Datasets

Verisim can generate coherent datasets with people assigned to generated company records:

from verisim import DatasetSpec, Verisim

v = Verisim(seed=7)
dataset = v.dataset(
    DatasetSpec(
        companies=3,
        people_per_company={"seed": 8, "startup": 25, "mid-market": 120},
    )
)

assert dataset.people[0].company.id in {company.id for company in dataset.companies}

The dataset path uses the same context-aware providers as single-record generation, so uniqueness, email domains, job industries, company size bands, and department distribution are preserved.
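
Uniqueness across a run can be pictured as a small registry that remembers what has already been issued. The class below is a hypothetical sketch, not Verisim's registry; the numeric-suffix strategy in particular is an assumption chosen for brevity.

```python
class UniquenessRegistry:
    """Tracks values already issued in one run, per namespace."""

    def __init__(self) -> None:
        self._seen: dict[str, set[str]] = {}

    def claim(self, namespace: str, candidate: str) -> str:
        """Return candidate, or candidate plus a numeric suffix if already taken."""
        seen = self._seen.setdefault(namespace, set())
        value, counter = candidate, 1
        while value in seen:
            counter += 1
            value = f"{candidate}{counter}"
        seen.add(value)
        return value


registry = UniquenessRegistry()
print(registry.claim("username", "brooke.garcia"))  # brooke.garcia
print(registry.claim("username", "brooke.garcia"))  # brooke.garcia2
print(registry.claim("email", "brooke.garcia"))     # brooke.garcia (separate namespace)
```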

Use Existing Context

You can provide context and ask Verisim to generate the rest:

from verisim import Address, PersonRecord, Verisim

v = Verisim(seed=1)

address = Address(
    line1="19 Birch Street",
    city="Austin",
    region="Texas",
    region_code="TX",
    postal_code="78701",
    country="United States",
    country_code="US",
)

record = v.generate(PersonRecord, context={"address": address}, mode="repair")

Company context works the same way across calls:

from verisim import CompanyRecord, PersonRecord, Verisim

v = Verisim(seed=7)
company = v.generate(CompanyRecord, context={"size_band": "startup"})
employee = v.generate(PersonRecord, context={"company": company})

assert employee.contact.email.endswith(f"@{company.domain}")
assert employee.job.department in company.departments

Conflict modes:

  • strict: raise when supplied context contradicts model invariants.
  • repair: keep valid context and regenerate dependent conflicting fields.
  • explain: return diagnostics without generating a replacement record.
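
The difference between the three modes can be illustrated with a toy invariant: a phone number must start with the calling code of its country. Everything here (find_conflicts, apply_mode, the placeholder digits) is hypothetical and independent of Verisim; it only mirrors the behavior described in the list above.

```python
# Toy invariant: a phone number must start with its country's calling code.
CALLING_CODES = {"US": "+1", "GB": "+44", "IN": "+91"}


def find_conflicts(context: dict) -> list[str]:
    """Return human-readable conflicts between the phone prefix and the country."""
    expected = CALLING_CODES[context["country_code"]]
    if context["phone"].startswith(expected):
        return []
    return [f"phone {context['phone']} does not start with {expected}"]


def apply_mode(context: dict, mode: str):
    if mode not in {"strict", "repair", "explain"}:
        raise ValueError(f"unknown mode: {mode}")
    conflicts = find_conflicts(context)
    if mode == "explain":
        return conflicts  # diagnostics only, no replacement record
    if conflicts and mode == "strict":
        raise ValueError("; ".join(conflicts))
    if conflicts:  # repair: keep the country, replace the dependent phone field
        placeholder = CALLING_CODES[context["country_code"]] + "5550100"
        return {**context, "phone": placeholder}
    return context


supplied = {"country_code": "GB", "phone": "+14155550000"}
print(apply_mode(supplied, "explain"))
print(apply_mode(supplied, "repair"))
```

In this sketch repair returns a new dictionary and leaves the supplied context untouched, which makes it easy to compare the two afterwards.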

Locale And Script

Locale describes the cultural/data origin. Output language and script are separate knobs.

from verisim import PersonRecord, Verisim

v = Verisim(locale="en_IN", output_language="en", script="latin", seed=13)
record = v.generate(PersonRecord)

print(record.person.name)
print(record.address.country_code)
print(record.contact.phone.e164)

This supports Indian names in Latin script, such as Rakesh, Om, or Prakash, while keeping address and phone fields country-aware.

The lite pack includes US, UK, Canadian, Australian, Indian, and German coverage. The packaged locale codes are en_US, en_GB, en_CA, en_AU, en_IN, hi_IN, and de_DE; each includes 1,000 given names and 1,000 family names.
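
A locale code combines a language and a country, which is why seven locale codes cover six countries. The short sketch below splits the codes listed above; split_locale is a hypothetical helper, not part of Verisim.

```python
PACKAGED_LOCALES = ["en_US", "en_GB", "en_CA", "en_AU", "en_IN", "hi_IN", "de_DE"]


def split_locale(code: str) -> tuple[str, str]:
    """Split 'en_IN' into its language ('en') and country ('IN') parts."""
    language, _, country = code.partition("_")
    return language, country


# en_IN and hi_IN share a country but differ in language.
countries = sorted({split_locale(code)[1] for code in PACKAGED_LOCALES})
print(countries)  # ['AU', 'CA', 'DE', 'GB', 'IN', 'US']
```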

Country address data for US, GB, CA, AU, IN, and DE is generated from open GeoNames postal-code archives with Verisim-authored synthetic street names and suffixes. The packaged data currently contains 53 US regions, 6 UK regions, 13 Canadian regions, 8 Australian regions, 35 Indian regions, and 33 German regions, covering more than 2.8 million postal-code-to-city relationships. Canada and the UK use the GeoNames full-code archives; the standard GeoNames country ZIPs are used for the other supported countries. The source data is useful for coherent synthetic generation, not postal authority validation.
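
A postal-code-to-city relationship stays coherent when a generator picks one whole row instead of sampling each column independently. The sketch below assumes a hypothetical PostalRow structure and uses only three city, region, and postal-code combinations that already appear in this README; it does not reflect the packaged file format.

```python
import random
from typing import NamedTuple


class PostalRow(NamedTuple):
    postal_code: str
    city: str
    region_code: str


# Three rows taken from examples in this README; a real pack holds far more.
ROWS = [
    PostalRow("94107", "San Francisco", "CA"),
    PostalRow("78701", "Austin", "TX"),
    PostalRow("85004", "Phoenix", "AZ"),
]


def pick_address(rng: random.Random) -> dict[str, str]:
    """Pick one whole row so city, region, and postal code always agree."""
    return rng.choice(ROWS)._asdict()


address = pick_address(random.Random(7))
print(address)
```

Sampling the three columns separately could pair Austin with a California postal code; choosing a row rules that out by construction.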

To refresh the packaged country JSON files from GeoNames:

uv run python scripts/build_country_datasets.py --download

Current Features

  • Pydantic v2 domain models for PersonRecord, CompanyRecord, Person, Address, Contact, PhoneNumber, Job, Company, Socials, Website, and datasets.
  • Context graph provider engine.
  • Per-run uniqueness registry for IDs, usernames, emails, phones, companies, and social handles.
  • Lite data pack with US, UK, Canada, Australia, India, and Germany sample support.
  • Non-routable synthetic emails, websites, and avatar URLs.
  • Strict, repair, and explain modes for existing context.
  • Importable and runnable examples package.
  • 100% measured coverage across src/verisim and examples.

Package Shape

The package declares extras for the intended product tiers:

verisim[lite]
verisim[full]
verisim[ai]

Current state:

  • lite: implemented as the built-in data pack.
  • full: reserved for large regional/global data packs.
  • ai: reserved for optional prose-generation adapters.

The core package remains offline and deterministic. AI or external data should be opt-in, auditable, and replaceable.

Examples

Run the included examples:

uv run python -m examples.basic_person
uv run python -m examples.company_record
uv run python -m examples.context_repair
uv run python -m examples.dataset_generation

Import them from Python:

from examples import basic_person, company_record, context_repair, dataset_generation

record = basic_person.generate_example(seed=123)
company = company_record.generate_example(seed=123, size_band="startup")
diagnostics, repaired = context_repair.generate_example(seed=123)
dataset = dataset_generation.generate_example(seed=123, people=5, companies=2)

Development

See CONTRIBUTING.md for the full local development and pull request workflow.

Run tests:

uv run --extra dev python -B -m pytest -q

Format and sort imports:

uv run --extra dev autoflake src examples tests
uv run --extra dev isort src examples tests
uv run --extra dev black src examples tests

Lint:

uv run --extra dev ruff check src examples tests

Check formatting and cleanup without rewriting files:

uv run --extra dev autoflake --check src examples tests
uv run --extra dev isort --check-only src examples tests
uv run --extra dev black --check src examples tests
uv run --extra dev ruff check src examples tests

Run the 100% per-file coverage gate:

uv run --extra dev python -B -m coverage run -m pytest -q
uv run --extra dev python -B -m coverage report --fail-under=100

Roadmap

TBD.

License

See LICENSE.
