Multi-domain, schema-aware synthetic data generator for Microsoft Fabric

Spindle by SQLLocks

"Synthea is to MITRE as Spindle is to SQLLocks"

Spindle is a multi-domain, schema-aware synthetic data generator for Microsoft Fabric. It generates statistically realistic, relationally correct datasets — think normalized 3NF schemas with proper FK integrity, Pareto order distributions, seasonal temporal patterns, and real US addresses with lat/lng coordinates ready for Power BI maps.

pip install sqllocks-spindle

Quick Start

from sqllocks_spindle import Spindle, RetailDomain

spindle = Spindle()
result = spindle.generate(
    domain=RetailDomain(),
    scale="small",
    seed=42
)

print(result)
# GenerationResult(9 tables, 21,300 total rows, 0.3s)

# Access any table as a pandas DataFrame
customers = result["customer"]
orders    = result["order"]
addresses = result["address"]

# Check referential integrity
errors = result.verify_integrity()
assert errors == []

# Print a generation summary
print(result.summary())

Domains

Spindle ships 12 production-ready domains — each with calibrated distribution profiles, referential integrity enforcement, and 20+ passing tests:

| Domain | Tables | Description |
|---|---|---|
| Retail | 9 | Customers, products, orders, returns (3NF normalized) |
| Healthcare | 9 | Patients, encounters, diagnoses, claims (3NF normalized) |
| Financial | 10 | Branches, accounts, transactions, loans, fraud detection |
| Supply Chain | 10 | Warehouses, suppliers, POs, inventory, shipments |
| IoT | 8 | Devices, sensors, readings, alerts, maintenance |
| HR | 9 | Employees, departments, compensation, performance |
| Insurance | 9 | Agents, policies, claims, underwriting, payments |
| Marketing | 10 | Campaigns, contacts, leads, opportunities, conversions |
| Education | 9 | Students, courses, enrollments, grades, financial aid |
| Real Estate | 9 | Agents, listings, offers, transactions, inspections |
| Manufacturing | 9 | Production lines, work orders, quality control, equipment |
| Telecom | 9 | Subscribers, service lines, usage records, billing, churn |

The real-world sources behind each domain's distribution profile are documented in METHODOLOGY.md.


Retail Domain

The built-in RetailDomain generates a fully normalized retail schema:

| Table | Rows (small scale) | Description |
|---|---|---|
| customer | 1,000 | Individual customers with loyalty tiers |
| address | 1,500 | Shipping/billing addresses with real US lat/lng |
| product_category | 50 | 3-level hierarchy (dept → category → subcategory) |
| product | 500 | SKUs with correlated cost/price |
| store | 150 | Physical and online stores |
| promotion | 200 | Discount campaigns |
| order | 5,000 | Order headers with Pareto customer distribution |
| order_line | ~12,500 | Line items with discount_percent |
| return | ~850 | Returns with dates derived from order dates |

Scale presets (by customer count): small (1K), medium (50K), large (500K), xlarge (5M)


Healthcare Domain

The HealthcareDomain models clinical encounters, claims, and medications:

| Table | Rows (small scale) | Description |
|---|---|---|
| provider | 200 | Physicians, NPs, PAs with credentials |
| facility | 50 | Hospitals, clinics, urgent care centers |
| patient | 1,000 | Patient demographics and insurance |
| encounter | 5,000 | Office visits, ED, inpatient, telehealth |
| diagnosis | ~9,000 | ICD-10 codes linked to encounters |
| procedure | ~6,000 | CPT procedures with charges |
| medication | ~4,500 | Prescriptions with dosage and supply |
| claim | ~4,750 | Insurance claims with status |
| claim_line | ~11,875 | Claim line items with copays and adjustments |

All distributions calibrated from CMS, CDC, AAMC, KFF, and BLS data — see METHODOLOGY.md.

What makes it realistic

  • Pareto orders — 20% of customers place 80% of orders (max_per_parent=50 hard cap)
  • Seasonal patterns — November/December peaks, Friday/Saturday peaks, bimodal hour distribution
  • Real addresses — 40,977 US ZIP codes from GeoNames (CC-BY-4.0): city, state, ZIP, lat, lng. Works directly in Power BI map visuals.
  • Correlated cost/price — product cost is always 30–70% of unit price
  • Proper hierarchy — product categories form a real 3-level tree
  • Business rules enforced — return dates always after order dates, order dates after signup dates
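
The Pareto bullet above can be illustrated with a small standard-library sketch. It is not Spindle's implementation: the function name `assign_orders`, the `alpha` value, and the sampling approach are assumptions made for illustration. Only the `max_per_parent=50` cap comes from the list above. The shape parameter 1.16 is the classic value for an 80/20 split.

```python
import random
from collections import Counter


def assign_orders(customer_ids, n_orders, max_per_parent=50, alpha=1.16, seed=42):
    """Assign each order to a customer using heavy-tailed weights and a hard cap.

    Hypothetical helper: illustrates the idea, not Spindle's actual code.
    """
    if n_orders > len(customer_ids) * max_per_parent:
        raise ValueError("n_orders exceeds the capacity allowed by max_per_parent")
    rng = random.Random(seed)
    # One Pareto-distributed weight per customer: a few customers get large weights.
    weights = {cid: rng.paretovariate(alpha) for cid in customer_ids}
    counts = Counter()
    assignments = []
    for _ in range(n_orders):
        # Customers that have reached the cap are no longer eligible.
        eligible = [cid for cid in customer_ids if counts[cid] < max_per_parent]
        chosen = rng.choices(eligible, weights=[weights[c] for c in eligible], k=1)[0]
        counts[chosen] += 1
        assignments.append(chosen)
    return assignments


orders = assign_orders(customer_ids=list(range(1, 201)), n_orders=1000)
per_customer = Counter(orders)
top_20_percent = sorted(per_customer.values(), reverse=True)[:40]  # 40 = 20% of 200
print(f"max orders per customer: {max(per_customer.values())}")
print(f"share from top 20% of customers: {sum(top_20_percent) / len(orders):.0%}")
```

The cap matters because a pure Pareto draw occasionally produces one customer with an implausibly large share of all orders; capping trades a little of the tail for more believable per-customer totals.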

Address data for Power BI

addr = result["address"]
print(addr[["city", "state", "zip_code", "lat", "lng"]].head())
#            city state zip_code        lat         lng
# 0        Reform    AL    35481  33.314928  -88.042923
# 1       Chinook    MT    59523  48.487741 -109.261678

Drop the lat/lng columns directly into a Power BI map visual — no geocoding required.


Generation Strategies

Spindle supports 20 column-level strategies:

| Strategy | Description |
|---|---|
| sequence | Auto-incrementing integer PKs |
| uuid | UUID v4 alternative PKs |
| faker | Faker library providers (names, emails, etc.) |
| weighted_enum | Weighted random selection from a set of values |
| distribution | Statistical distributions: uniform, normal, log_normal, pareto, zipf, geometric, bernoulli, bimodal |
| temporal | Time-aware dates: uniform or seasonal with day/month/hour profiles |
| formula | Computed from other columns: quantity * unit_price * (1 - discount_percent / 100) |
| derived | Derived from another column with a transformation: return_date = order_date + N days |
| correlated | Mathematically related to another column: cost = unit_price * 0.30–0.70 |
| conditional | Conditional on another column's value |
| lifecycle | Phase-based status values (introduced / active / discontinued) |
| foreign_key | FK references with uniform, Pareto, or Zipf distribution |
| lookup | Copy value from parent table via FK |
| reference_data | Pick from bundled JSON datasets |
| pattern | Formatted strings: Store #{seq:4} |
| computed | Aggregated from child table (e.g., order_total = sum of line_totals) |
| self_referencing | FK to same table for hierarchy columns |
| self_ref_field | Read level info stashed by self_referencing |
| record_sample | Sample complete records from a reference dataset (anchor) |
| record_field | Read a field from a previously sampled record (correlated derived columns) |
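
To make three of these strategies concrete, here is a minimal sketch of the arithmetic behind formula, correlated, and derived as the table describes them. The function names and the `max_days` parameter are hypothetical and are not part of Spindle's API; only the expressions themselves come from the table.

```python
import random
from datetime import date, timedelta

rng = random.Random(42)


def correlated_cost(unit_price, low=0.30, high=0.70):
    """correlated: cost is a random fraction (30-70%) of the unit price."""
    return round(unit_price * rng.uniform(low, high), 2)


def line_total(quantity, unit_price, discount_percent):
    """formula: quantity * unit_price * (1 - discount_percent / 100)."""
    return round(quantity * unit_price * (1 - discount_percent / 100), 2)


def return_date(order_date, max_days=30):
    """derived: return_date = order_date + N days, with N >= 1 (max_days is assumed)."""
    return order_date + timedelta(days=rng.randint(1, max_days))


price = 20.00
print(correlated_cost(price))               # somewhere between 6.00 and 14.00
print(line_total(3, price, 10))             # 54.0
print(return_date(date(2024, 11, 29)))      # always after the order date
```

Because each value is computed from a column generated earlier, business rules such as "cost below price" or "return after order" hold by construction rather than by filtering out bad rows afterwards.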

Distribution Profiles

Every domain ships with a default profile calibrated from real-world data. You can override any distribution weight:

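# HealthcareDomain is assumed to be importable from the package root, like RetailDomain
from sqllocks_spindle import RetailDomain, HealthcareDomain

# Override specific distributions
domain = RetailDomain(overrides={
    "customer.loyalty_tier": {"Basic": 0.40, "Silver": 0.30, "Gold": 0.20, "Platinum": 0.10},
    "order.status": {"completed": 0.85, "shipped": 0.05, "processing": 0.02, "cancelled": 0.03, "returned": 0.05},
})

# Check what's available
print(domain.available_profiles)   # ['default']
print(domain.profile_name)         # 'default'

# Use a named profile. The name must match a profile file for that domain;
# "medicare" is illustrative, since only 'default' is listed as available above.
domain = HealthcareDomain(profile="medicare")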

Profile files live in domains/<name>/profiles/ and follow the same JSON schema as default.json. See METHODOLOGY.md for the full list of distribution keys and their real-world sources.
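
Before passing override weights to a domain, it can help to check them yourself. The sketch below is a standalone helper, not part of Spindle: `validate_weights` and `sample_enum` are hypothetical names, and whether Spindle validates or normalizes weights internally is not stated here. It checks that a weight map is non-negative and sums to 1.0, then draws values from it the way a weighted_enum column might.

```python
import math
import random
from collections import Counter


def validate_weights(weights, tol=1e-9):
    """Raise ValueError unless all weights are non-negative and sum to 1.0."""
    negative = [k for k, w in weights.items() if w < 0]
    if negative:
        raise ValueError(f"negative weights for: {negative}")
    total = sum(weights.values())
    if not math.isclose(total, 1.0, abs_tol=tol):
        raise ValueError(f"weights sum to {total}, expected 1.0")
    return weights


def sample_enum(weights, n, seed=42):
    """Draw n values with probability proportional to their weights."""
    rng = random.Random(seed)
    return rng.choices(list(weights), weights=list(weights.values()), k=n)


loyalty = validate_weights({"Basic": 0.40, "Silver": 0.30, "Gold": 0.20, "Platinum": 0.10})
status = validate_weights(
    {"completed": 0.85, "shipped": 0.05, "processing": 0.02, "cancelled": 0.03, "returned": 0.05}
)
print(Counter(sample_enum(loyalty, 10_000)).most_common())
```

A check like this catches the most common override mistake, editing one weight without rebalancing the others, before it silently skews a generated dataset.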


Custom Schemas

from sqllocks_spindle import Spindle

spindle = Spindle()
result = spindle.generate(
    schema="path/to/my_schema.spindle.json",
    scale_overrides={"customer": 10000, "order": 100000},
    seed=42
)

Schemas are defined in .spindle.json files. See PHASE-0-SPEC.md for the full schema definition format.


CLI

# Generate retail data at small scale
spindle generate retail --scale small --seed 42 --output ./output/

# Dry run — show what would be generated without generating
spindle generate retail --scale medium --dry-run

# Generate healthcare data as Parquet
spindle generate healthcare --scale small --format parquet --output ./data

# Stream retail orders to a file (fast mode)
spindle stream retail --table order --max-events 5000 --sink file --output events.jsonl

# Stream with real-time rate limiting and a burst window
spindle stream retail --table order --rate 100 --realtime --burst 30:60:10 --sink console

# Export to star schema (dim_* + fact_* tables as CSV)
spindle to-star retail --scale small --output ./star/

# Export to CDM folder (model.json + entity CSV files)
spindle to-cdm retail --scale small --output ./cdm/

# Describe a domain's schema and active profile
spindle describe retail

# List available domains and profiles
spindle list

# Validate a schema file
spindle validate my_schema.spindle.json
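
For intuition about what a rate such as --rate 100 means under the Poisson inter-arrivals listed in the roadmap, the following sketch generates event timestamps with exponentially distributed gaps. It is a generic illustration of a Poisson process, not the SpindleStreamer code; the function name is hypothetical, and the --burst syntax is deliberately not modeled because its semantics are not described here.

```python
import random


def poisson_arrival_times(rate_per_sec, n_events, seed=42):
    """Return n_events timestamps (seconds from start) for a Poisson process.

    Gaps between consecutive events are exponential with mean 1 / rate_per_sec.
    """
    rng = random.Random(seed)
    t = 0.0
    times = []
    for _ in range(n_events):
        t += rng.expovariate(rate_per_sec)
        times.append(t)
    return times


times = poisson_arrival_times(rate_per_sec=100, n_events=5000)
print(f"5000 events span about {times[-1]:.1f} s (expected around 50 s)")
print(f"mean gap: {times[-1] / len(times) * 1000:.2f} ms")
```

Unlike a fixed 10 ms interval, exponential gaps produce natural clusters and lulls, which is what makes a synthetic stream look like real traffic to downstream consumers.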

Scenario Packs

Scenario Packs are YAML-driven simulation blueprints that describe how Spindle data should be delivered — file drop cadence, stream topology, failure injection, validation gates, and Fabric target paths. Spindle ships 44 built-in packs across 11 industry verticals.

Pack types

| Pack ID | Kind | What it simulates |
|---|---|---|
| fd_daily_batch | file_drop | Daily partitioned file drop to OneLake landing zone (Parquet/CSV) |
| fd_schema_drift | file_drop | Daily batch with progressive schema drift injection |
| st_realtime_events | stream | Real-time event stream to Fabric Eventstream (200 events/sec, burst, anomalies) |
| hy_stream_plus_microbatch | hybrid | Lambda-style: stream to Eventhouse + micro-batch files to Lakehouse every 15 min |

Available verticals

financial · healthcare · retail · manufacturing · iot · telecom · supply_chain · hr · marketing · insurance · real_estate

Quick start

from sqllocks_spindle import RetailDomain
from sqllocks_spindle.packs.loader import PackLoader
from sqllocks_spindle.packs.runner import PackRunner

# Load a built-in pack
loader = PackLoader()
pack = loader.load_builtin("retail", "fd_daily_batch")

# Run it end-to-end
result = PackRunner().run(pack, domain=RetailDomain(), scale="small", seed=42)
print(result.summary())
# Pack Run: SUCCESS
#   Pack:    fd_daily_batch
#   Domain:  retail
#   Scale:   small
#   Elapsed: 0.4s
#   Files:   5
#   Events:  0
#   Validation gates:
#     schema_conformance: PASS
#     referential_integrity: PASS

List all built-in packs

for p in PackLoader().list_builtin():
    print(f"{p['domain']:15} {p['pack_id']}")

Load a custom pack

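# MyDomain stands in for your own domain class; PackLoader and PackRunner
# are imported as in the quick start above.
pack = PackLoader().load("path/to/my_custom_pack.yaml")
result = PackRunner().run(pack, domain=MyDomain(), scale="medium")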

Pack YAML structure

pack_version: 1
id: fd_daily_batch
kind: file_drop          # file_drop | stream | hybrid
domain: retail
description: Daily batch landing zone drop for Retail domain.
fabric_targets:
  lakehouse_files_root: Files/landing/retail
file_drop:
  cadence: daily
  formats: [parquet, csv]
  entities: [customer, order, order_line, product, store]
  manifest:
    enabled: true
    name: manifest_{dt}.json
  done_flag:
    enabled: true
    name: done_{dt}.flag
  lateness:
    enabled: true
    probability: 0.10
    max_days_late: 3
  duplicates:
    enabled: true
    probability: 0.02
failure_injection:
  enabled: true
  corrupt_file_probability: 0.01
  schema_drift:
    enabled: false
    mode: additive         # additive | additive_then_breaking | breaking_only
validation:
  required_gates: [schema_conformance, referential_integrity]
  quarantine_folder: Files/quarantine/retail
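
One plausible reading of the lateness and duplicates settings above can be sketched as follows. This is an assumption about their meaning, not a description of how PackRunner behaves: each daily partition arrives late with probability 0.10 by 1 to 3 days, and is delivered a second time with probability 0.02. The function and field names are hypothetical.

```python
import random
from datetime import date, timedelta


def plan_deliveries(partition_dates, late_probability=0.10, max_days_late=3,
                    duplicate_probability=0.02, seed=42):
    """Return one delivery record per partition, plus occasional duplicates."""
    rng = random.Random(seed)
    plan = []
    for dt in partition_dates:
        # A late partition is delayed by 1..max_days_late days; others arrive on time.
        days_late = rng.randint(1, max_days_late) if rng.random() < late_probability else 0
        delivery = {"partition": dt, "arrives": dt + timedelta(days=days_late), "duplicate": False}
        plan.append(delivery)
        # A duplicate is a second copy of the same partition with the same arrival date.
        if rng.random() < duplicate_probability:
            plan.append({**delivery, "duplicate": True})
    return plan


dates = [date(2024, 1, 1) + timedelta(days=i) for i in range(30)]
for entry in plan_deliveries(dates):
    if entry["arrives"] != entry["partition"] or entry["duplicate"]:
        print(entry)  # only the late or duplicated deliveries
```

Planning deliveries up front, with a fixed seed, keeps a simulated month of file drops reproducible, so a pipeline bug triggered by a late or duplicated file can be replayed exactly.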

Validation gates

| Gate | What it checks |
|---|---|
| schema_conformance | All expected tables generated with correct columns |
| referential_integrity | No orphaned FK references |
| row_count | Every table has at least 1 row |
| null_check | Non-nullable columns contain no nulls |
| uniqueness | Primary key columns are unique |
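
The checks behind three of these gates are simple enough to sketch with plain Python rows. The functions below are hypothetical stand-ins, not Spindle's gate implementations, and they assume a nullable FK is not counted as an orphan.

```python
from collections import Counter


def check_uniqueness(rows, pk):
    """uniqueness: return primary key values that appear more than once."""
    counts = Counter(row[pk] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


def check_referential_integrity(child_rows, fk, parent_rows, pk):
    """referential_integrity: return child FK values with no matching parent row."""
    parent_keys = {row[pk] for row in parent_rows}
    return sorted({row[fk] for row in child_rows
                   if row[fk] is not None and row[fk] not in parent_keys})


def check_nulls(rows, non_nullable):
    """null_check: return (row_index, column) pairs where a required value is None."""
    return [(i, col) for i, row in enumerate(rows)
            for col in non_nullable if row.get(col) is None]


customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"order_id": 10, "customer_id": 1, "status": "completed"},
    {"order_id": 11, "customer_id": 3, "status": "shipped"},  # orphan: no customer 3
    {"order_id": 11, "customer_id": 2, "status": None},       # duplicate PK, null status
]

print(check_uniqueness(orders, "order_id"))                                       # [11]
print(check_referential_integrity(orders, "customer_id", customers, "customer_id"))  # [3]
print(check_nulls(orders, ["status"]))                                            # [(2, 'status')]
```

Returning the offending values instead of a bare pass/fail makes a failed gate actionable: the same list can drive a quarantine step or a log message.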

Development

# Create virtual environment
python3.13 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Regenerate bundled address data from GeoNames
python scripts/build_address_data.py

Third-Party Data

Address data (city, state, ZIP, lat, lng) is sourced from GeoNames under Creative Commons Attribution 4.0 International (CC-BY-4.0). See LICENSE-NOTICES.md.


License

MIT — see LICENSE

Roadmap

  • Phase 0 ✅ Core engine, 21 strategies, Retail + Healthcare domains, calibrated profiles, 103 tests
  • Phase 1 ✅ Fabric Lakehouse writer, CSV/Parquet/Delta/JSONL/Excel output, CLI
  • Phase 2 ✅ Streaming engine — SpindleStreamer, Poisson inter-arrivals, token-bucket rate limiting, AnomalyRegistry (point/contextual/collective), Event Hub + Kafka sinks, spindle stream CLI
  • Phase 3 ✅ Domain expansion — 10 new domains (12 total), 409 tests, shared reference data
  • Phase 4 ✅ spindle-forge MCP server — TypeScript bridge with spindle_list_domains, spindle_describe_domain, spindle_generate
  • Phase 5 ✅ PyPI packaging, GitHub Actions CI/CD, sample notebooks
  • Phase 6 ✅ Star schema output, CDM folder export, fabric_demo + warehouse scale presets, spindle to-star / spindle to-cdm CLI
  • Phase 7 ✅ Chaos engine, simulation layer (file-drop / stream / hybrid), 44 built-in scenario packs, GSL spec parser, run manifests, validation gates + quarantine, CompositeDomain, SharedEntityRegistry, EventEnvelope, Fabric integration (OneLakePaths, LakehouseFilesWriter, EventstreamClient), MCP bridge — 989 tests
