A standalone, schema-based data generator and bulk ingestion utility for MongoDB

Project description

mongo-synth: MongoDB Schema-Based Data Generator & Ingester

mongo-synth is a standalone Python utility and command-line tool designed to generate high-fidelity, deterministic synthetic datasets from JSON Schemas (or Pydantic models) and seed them directly into MongoDB collections at scale.

Whether you are performing database index optimization, latency stress testing, schema validation, or writing integration tests, mongo-synth allows you to rapidly populate mock databases with realistic data, statistical distributions, and edge-case anomalies.

Key Features

🧬 JSON Schema Synthesis: Translates arbitrary JSON Schema specifications (Draft 2020-12) into deterministic property-based generation strategies using hypothesis-jsonschema.
🍃 Native BSON Type Mapping: Supports MongoDB-specific types (ObjectId, ISODate, Decimal128, BinData) via custom "bsonType" schema annotations.
📊 Statistical Value Profiling: Inject real-world data properties by defining relative probability weights for specific fields (e.g., a status field that is 80% active and 20% inactive).
⚡ High-Performance Bulk Ingestion: Iterates over generated streams and inserts them in configurable batch chunks via PyMongo's unordered insert_many for maximum throughput.
🚨 Anomaly & Schema Drift Injection: Test system resilience under fire by injecting whitespace key anomalies, mixed-type arrays, extreme nesting depths, emojis, or string type impersonations.
🔒 Production Safety Lock: Protects production environments by comparing the target connection string against a configured live database URI and blocking execution on a match.
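
The Production Safety Lock listed above can be pictured as a host comparison between two connection strings. The following is a minimal sketch using only the standard library; the function names (uri_hosts, assert_not_live) and the matching rule (any shared host blocks the run) are assumptions made for illustration, not mongo-synth's documented behavior.

```python
from urllib.parse import urlsplit

DEFAULT_PORT = "27017"  # default port for plain mongodb:// URIs


def uri_hosts(uri: str) -> frozenset[str]:
    """Return the lowercased host entries named in a MongoDB URI."""
    parts = urlsplit(uri)
    # Drop any "user:password@" prefix, then split the comma-separated seed list.
    host_list = parts.netloc.rpartition("@")[2]
    hosts = set()
    for entry in host_list.split(","):
        entry = entry.strip().lower()
        if not entry:
            continue
        # mongodb+srv URIs carry no port; plain URIs default to 27017.
        # (IPv6 literals are not handled in this sketch.)
        if parts.scheme == "mongodb" and ":" not in entry:
            entry = f"{entry}:{DEFAULT_PORT}"
        hosts.add(entry)
    return frozenset(hosts)


def assert_not_live(target_uri: str, live_source_uri: str) -> None:
    """Raise if the target shares any host with the protected live URI."""
    overlap = uri_hosts(target_uri) & uri_hosts(live_source_uri)
    if overlap:
        raise RuntimeError(f"Refusing to write to live hosts: {sorted(overlap)}")


# A local target does not match the protected cluster, so this passes silently.
assert_not_live("mongodb://localhost:27017", "mongodb+srv://prod-cluster")
```

Comparing parsed hosts rather than raw strings means that differences in credentials, letter case, or a trailing database name do not let a production URI slip through.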

Installation

pip install mongo-synth

Or, from a local checkout of the source repository:

pip install .

Quick Start

1. CLI Usage

Generate and ingest 10,000 orders into a local database using a schema:

mongo-synth \
  --schema path/to/order_schema.json \
  --uri mongodb://localhost:27017 \
  --db testing_db \
  --collection orders \
  --count 10000 \
  --clear

2. Python API Usage

from pymongo import MongoClient
from mongo_synth.generators import JsonSchemaGenerator
from mongo_synth.ingestion import DataIngester

# 1. Define your blueprint and schema
blueprint = {
    "schema": {
        "type": "object",
        "properties": {
            "_id": {"type": "string", "bsonType": "objectId"},
            "device_id": {"type": "string"},
            "status": {"type": "string", "enum": ["online", "offline"]},
            "timestamp": {"type": "string", "bsonType": "date"}
        },
        "required": ["device_id", "status"]
    },
    "metadata": {
        "profile": {
            "status": {"online": 0.9, "offline": 0.1} # 90% online, 10% offline
        }
    }
}

# 2. Generate synthetic data
generator = JsonSchemaGenerator(blueprint, documents_per_collection=5000, seed=42)
documents = generator.generate_batch()

# 3. Bulk ingest into MongoDB
client = MongoClient("mongodb://localhost:27017")
collection = client["iot_db"]["devices"]

ingester = DataIngester(
    target_collection=collection,
    target_uri="mongodb://localhost:27017",
    batch_size=1000,
    live_source_uri="mongodb+srv://prod-cluster" # Safety guardrail
)

inserted_count = ingester.ingest(documents)
print(f"Successfully seeded {inserted_count} documents.")
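
The profile entry in the blueprint above assigns relative weights to the values of status. As a rough illustration of what seeded, weighted sampling looks like, here is a sketch built on the standard library's random.choices; the helper name sample_weighted is hypothetical, and mongo-synth's own sampling code may work differently.

```python
import random
from collections import Counter


def sample_weighted(weights: dict[str, float], n: int, seed: int) -> list[str]:
    """Draw n values according to relative weights, reproducibly."""
    rng = random.Random(seed)  # a private RNG keeps repeated runs identical
    values = list(weights)
    return rng.choices(values, weights=[weights[v] for v in values], k=n)


profile = {"online": 0.9, "offline": 0.1}
draws = sample_weighted(profile, n=5000, seed=42)

# Roughly 90% of the draws should be "online".
print(Counter(draws))
```

Using a dedicated random.Random instance rather than the module-level functions keeps the result independent of any other code that touches the global random state.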

🔒 Sensitive Data Generation & Honeytoken Leak Verification

mongo-synth supports generating dynamic, high-fidelity Personally Identifiable Information (PII) and credentials (passwords, API keys) that can be seeded into MongoDB collections.

This feature is disabled by default to ensure clean testing, but can be enabled on-demand.

Why this feature exists

Organizations need to periodically audit their staging, development, and production environments for compliance (GDPR, HIPAA, PCI-DSS) and security leaks. Rather than using real production data (which introduces security risks and privacy compliance violations), security teams use honeytokens: realistic, synthetic records that act as tripwires.

If any generated honeytoken value (such as an API key or password) is detected in system logs, external search indexes, code repositories, or public paste sites, that is a high-confidence indicator of a data breach.

Real-World Customer Use Cases

Compliance Audit & Data Redaction: Verify that system logging frameworks, crash reporting tools, or APMs (Application Performance Monitors) correctly redact or mask sensitive PII (like Social Security Numbers or Credit Cards) before storing them in logs.
Leak Detection & Alerting (Honeytokens): Seed databases with custom API keys and passwords. Configure downstream monitoring tools (like SIEMs, Splunk, or DLP scanners) to watch for these exact values. If a value appears outside the database, alert security teams immediately.
Accidental Production Writes Identification: Use the --run-id option to prefix all sensitive values. If a value prefix is seen in logs, you can identify exactly which pipeline run or branch was responsible.
Unique Index Integrity Testing: Test that unique indexes, constraints, and ingestion pipelines handle large volumes of high-cardinality values gracefully during bulk writes.

How to Use

1. Schema-Driven Generation

Annotate any string property in your JSON Schema with "sensitiveType". Supported types: name, email, phone, ssn, credit_card, address, password, api_key.

{
  "type": "object",
  "properties": {
    "username": {"type": "string"},
    "personal_email": {"type": "string", "sensitiveType": "email"},
    "api_token": {"type": "string", "sensitiveType": "api_key"}
  },
  "required": ["username", "personal_email", "api_token"]
}

During generation, these fields are populated using Faker for PII and Python's cryptographically secure secrets module for credentials.
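
To make the annotation mechanism concrete, the sketch below walks a schema, collects every property that carries "sensitiveType", and generates credential values with the secrets module. The helper names (find_sensitive_fields, make_credential) and the token lengths are assumptions for illustration; PII types are left out because they would need Faker.

```python
import secrets


def find_sensitive_fields(schema: dict, prefix: str = "") -> dict[str, str]:
    """Map dotted property paths to their declared sensitiveType."""
    found = {}
    for name, spec in schema.get("properties", {}).items():
        path = f"{prefix}{name}"
        if "sensitiveType" in spec:
            found[path] = spec["sensitiveType"]
        # Recurse into nested object properties.
        if spec.get("type") == "object":
            found.update(find_sensitive_fields(spec, prefix=f"{path}."))
    return found


def make_credential(sensitive_type: str) -> str:
    """Generate a credential from a cryptographically secure source."""
    if sensitive_type == "api_key":
        return secrets.token_hex(16)  # 32 hex characters
    if sensitive_type == "password":
        return secrets.token_urlsafe(12)
    raise ValueError(f"not a credential type: {sensitive_type}")


schema = {
    "type": "object",
    "properties": {
        "username": {"type": "string"},
        "personal_email": {"type": "string", "sensitiveType": "email"},
        "api_token": {"type": "string", "sensitiveType": "api_key"},
    },
}

fields = find_sensitive_fields(schema)
print(fields)
print(make_credential(fields["api_token"]))
```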

2. Automatic CLI-Driven Injection (`--inject-sensitive`)

To automatically append a set of standard sensitive fields (including nested personal_info, billing, and credentials sub-documents) to every document generated, use the --inject-sensitive flag:

mongo-synth generate \
  --schema path/to/schema.json \
  --count 1000 \
  --inject-sensitive

3. Canary Run Tagging (`--run-id`) & Localization (`--sensitive-locale`)

Canary Prefixing: Prefix generated values with a custom ID (e.g., pipeline run number or environment name) to trace the origin of a leak:

mongo-synth generate \
  --schema path/to/schema.json \
  --count 1000 \
  --inject-sensitive \
  --run-id dev_stage_pipeline_94

This prefixes names, emails, and passwords with dev_stage_pipeline_94_ and embeds the run ID in API keys, for example key_live_dev_stage_pipeline_94_....

PII Localization: Specify a locale (default en_US) to generate localized synthetic names, addresses, and phone formats (e.g., de_DE, fr_FR, en_GB):

mongo-synth generate \
  --schema path/to/schema.json \
  --count 1000 \
  --inject-sensitive \
  --sensitive-locale de_DE

4. Leak Verifiers Export (`--verifier-output`)

Export the list of all generated sensitive values to a structured JSON file to act as the leak audit checklist:

mongo-synth generate \
  --schema path/to/schema.json \
  --count 100 \
  --inject-sensitive \
  --run-id audit_run_1 \
  --verifier-output verifier_checklist.json

Example verifier_checklist.json:

[
  {
    "type": "email",
    "value": "audit_run_1_john.doe@example.com"
  },
  {
    "type": "api_key",
    "value": "key_live_audit_run_1_f8b2c4d9a..."
  }
]
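
A checklist in this shape can be matched against log output with a plain substring search. The sketch below assumes the list-of-objects layout shown above; the function name find_leaks and the sample values are made up for illustration.

```python
import json


def find_leaks(verifiers: list[dict], text: str) -> list[dict]:
    """Return every checklist entry whose value appears verbatim in text."""
    return [entry for entry in verifiers if entry["value"] in text]


# In practice the checklist would be loaded from the exported file, e.g.:
#     with open("verifier_checklist.json", encoding="utf-8") as fh:
#         verifiers = json.load(fh)
verifiers = json.loads("""
[
  {"type": "email", "value": "audit_run_1_jane.roe@example.com"},
  {"type": "api_key", "value": "key_live_audit_run_1_0123456789abcdef"}
]
""")

log_text = "WARN auth failed token=key_live_audit_run_1_0123456789abcdef ip=10.0.0.5"

for leak in find_leaks(verifiers, log_text):
    print(f"LEAK: {leak['type']} value found in logs")
```

For large log volumes a streaming scan or a multi-pattern matcher would be a better fit than loading the whole text into memory; the exact-match check itself stays the same.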

Ingestion Robustness, Safety & Performance Options

When generating and inserting millions of mock documents, several database-level constraints, schema constraints, and payload limits must be handled safely:

Ordered Ingestion (--ordered): By default, mongo-synth performs unordered bulk writes (ordered=False) to maximize write speed and ignore duplicate key violations (MongoDB error codes 11000/11001) dynamically. If you require sequential database insertions where the exact order of documents matters (e.g., time-series data or referenced keys), use the --ordered flag. This will enforce sequential inserts and immediately halt execution on the first error.
Client-Side Schema Validation Dry Run (--dry-run): In large-scale runs, sending invalid BSON structures or documents that violate schema constraints to MongoDB is slow and requires manual database cleanup. Use the --dry-run flag to generate mock documents and validate them locally with jsonschema, without connecting or writing to MongoDB.
Dynamic Batch Resizing (Network Safeties): By default, mongo-synth bulk-inserts documents in batches of 5000. If your schema generates extremely large records (e.g., deeply nested subdocuments, large arrays, or long texts), a large batch can exceed MongoDB's 16MB BSON payload limit. The ingestion pipeline samples the BSON size of generated documents during initial ingestion and scales the batch size down as needed to keep each batch under a safe 12MB limit.
Parallel Ingestion Workers (--workers): Python's single-threaded execution can limit both CPU generation throughput and I/O write speed. Use the --workers W flag (with W > 1) to generate and ingest documents concurrently in W isolated processes. When running in parallel:
- The total count is split evenly across workers.
- If a master seed is provided, seeds are offset (master_seed + worker_index) to guarantee that workers generate distinct datasets.
- Target collection clearing (--clear) is coordinated once by the parent process.
- Leak verifiers are collected and merged across all workers.

Any structural errors (such as MongoDB Schema Document Validation failures, error code 121) are re-raised immediately to prevent silent configuration or constraint validation bugs.
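
The split between tolerated and fatal errors can be expressed as a filter over the details that PyMongo attaches to a BulkWriteError. The sketch below works on that dictionary directly so it runs without a database; the helper name fatal_write_errors is hypothetical, and the surrounding try/except is shown only in comments.

```python
DUPLICATE_KEY_CODES = {11000, 11001}


def fatal_write_errors(details: dict) -> list[dict]:
    """Return the write errors that are not duplicate-key violations."""
    return [
        err
        for err in details.get("writeErrors", [])
        if err.get("code") not in DUPLICATE_KEY_CODES
    ]


# Typical use around an unordered bulk insert (requires pymongo):
#     from pymongo.errors import BulkWriteError
#     try:
#         collection.insert_many(batch, ordered=False)
#     except BulkWriteError as exc:
#         if fatal_write_errors(exc.details):
#             raise  # e.g. code 121, document failed validation

sample_details = {
    "writeErrors": [
        {"index": 0, "code": 11000, "errmsg": "E11000 duplicate key error"},
        {"index": 3, "code": 121, "errmsg": "Document failed validation"},
    ],
    "nInserted": 8,
}

for err in fatal_write_errors(sample_details):
    print(f"fatal error at index {err['index']}: code {err['code']}")
```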

Project details

Release history

- 1.1.0 (this version): Jun 4, 2026
- 1.0.2: Jun 2, 2026
- 1.0.1: Jun 2, 2026
- 1.0.0: Jun 2, 2026

Download files

Source Distribution

mongo_synth-1.1.0.tar.gz (32.2 kB), uploaded Jun 4, 2026, Source

Built Distribution

mongo_synth-1.1.0-py3-none-any.whl (33.1 kB), uploaded Jun 4, 2026, Python 3

File details

Details for the file mongo_synth-1.1.0.tar.gz.

File metadata

Download URL: mongo_synth-1.1.0.tar.gz
Upload date: Jun 4, 2026
Size: 32.2 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for mongo_synth-1.1.0.tar.gz
Algorithm	Hash digest
SHA256	`da2cdc00bdedaf7779907a3e46cc271181f68007277ec399239fbf350a972d0a`
MD5	`07c05a102d9d5d0c22b7b2246622688f`
BLAKE2b-256	`766af2c9a2c4266e3ab74b61aaa330246f59e520a6f47c67bacb55519f90909b`


Provenance

The following attestation bundles were made for mongo_synth-1.1.0.tar.gz:

Publisher: publish.yml on JMartynov/mongo-synth

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: mongo_synth-1.1.0.tar.gz
- Subject digest: da2cdc00bdedaf7779907a3e46cc271181f68007277ec399239fbf350a972d0a
- Sigstore transparency entry: 1720585180
- Sigstore integration time: Jun 4, 2026
Source repository:
- Permalink: JMartynov/mongo-synth@f46c8c21473c00392732d502f244141063017ed1
- Branch / Tag: refs/tags/v1.1.0
- Owner: https://github.com/JMartynov
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@f46c8c21473c00392732d502f244141063017ed1
- Trigger Event: push

File details

Details for the file mongo_synth-1.1.0-py3-none-any.whl.

File metadata

Download URL: mongo_synth-1.1.0-py3-none-any.whl
Upload date: Jun 4, 2026
Size: 33.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for mongo_synth-1.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`30db703af7b53bfae57119039a685dccbbcdc03f6f4cb9203da58e194978e03a`
MD5	`015c612d195aba6b597bba23ca982701`
BLAKE2b-256	`5069e4f62de10bec403cf9e3b5fbe611c70f2cb7810519fc7fec9add460a090d`


Provenance

The following attestation bundles were made for mongo_synth-1.1.0-py3-none-any.whl:

Publisher: publish.yml on JMartynov/mongo-synth

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: mongo_synth-1.1.0-py3-none-any.whl
- Subject digest: 30db703af7b53bfae57119039a685dccbbcdc03f6f4cb9203da58e194978e03a
- Sigstore transparency entry: 1720585333
- Sigstore integration time: Jun 4, 2026
Source repository:
- Permalink: JMartynov/mongo-synth@f46c8c21473c00392732d502f244141063017ed1
- Branch / Tag: refs/tags/v1.1.0
- Owner: https://github.com/JMartynov
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@f46c8c21473c00392732d502f244141063017ed1
- Trigger Event: push
