Opinionated framework for generating synthetic CRM and GTM datasets from simulated commercial worlds

leadforge

Created by Shay Palachy Affek.

leadforge generates narrative-grounded synthetic revenue datasets — starting with lead scoring — designed for teaching, portfolio projects, and research. Rather than sampling rows from a distribution, it simulates a commercial world: a specific company, selling a specific product, to a specific kind of buyer, and renders realistic CRM-style outputs from that world.

Docs: leadforge-dev.github.io/leadforge · Dataset: HuggingFace · Kaggle: Intro · Intermediate · Advanced


What Makes leadforge Different

  • World-first generation: datasets are rendered from simulated companies, products, buyers, activities, opportunities, and outcomes.
  • Relational CRM shape: output includes normalized tables plus task-ready train/validation/test splits for lead scoring.
  • Pedagogical realism: snapshot discipline, redaction modes, leakage traps, calibration issues, and difficulty tiers are deliberate teaching material.
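To make "snapshot discipline" and "leakage traps" concrete, here is a minimal sketch. It is not leadforge's internal code: the `Event` record, its fields, and both helper functions are hypothetical and only illustrate the idea that a feature may use events up to the snapshot day and nothing later.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    lead_id: str
    day: int   # days since lead creation (hypothetical convention)
    kind: str  # e.g. "touch" or "conversion"


def touch_count_at_snapshot(events, lead_id, snapshot_day):
    """Count touches for one lead using only events at or before the snapshot."""
    return sum(
        1
        for e in events
        if e.lead_id == lead_id and e.kind == "touch" and e.day <= snapshot_day
    )


def leaky_touch_count(events, lead_id):
    """The leakage trap: counts every touch, including post-snapshot ones."""
    return sum(1 for e in events if e.lead_id == lead_id and e.kind == "touch")


events = [
    Event("lead_1", 2, "touch"),
    Event("lead_1", 10, "touch"),
    Event("lead_1", 40, "touch"),       # happens after the snapshot
    Event("lead_1", 45, "conversion"),  # outcome, must never feed a feature
]

safe = touch_count_at_snapshot(events, "lead_1", snapshot_day=30)
leaky = leaky_touch_count(events, "lead_1")
print(safe, leaky)
```

The two counts differ because the leaky version sees a touch that only exists after the snapshot, which is exactly the kind of feature a learner is meant to spot and reject.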

Installation

Requires Python 3.11+.

pip install leadforge

Or install directly from GitHub:

pip install git+https://github.com/leadforge-dev/leadforge.git

For development:

git clone https://github.com/leadforge-dev/leadforge.git
cd leadforge
pip install -e ".[dev]"
pre-commit install

Quickstart

CLI

# List available recipes
leadforge list-recipes

# Generate a dataset bundle
leadforge generate \
  --recipe b2b_saas_procurement_v1 \
  --seed 42 \
  --mode student_public \
  --difficulty intermediate \
  --n-leads 5000 \
  --out ./out/demo_bundle

# Inspect bundle metadata
leadforge inspect ./out/demo_bundle

# Or pipe the manifest into jq
leadforge inspect ./out/demo_bundle --json | jq .snapshot_day

# Validate bundle integrity
leadforge validate ./out/demo_bundle

Python API

from leadforge.api import Generator

gen = Generator.from_recipe(
    "b2b_saas_procurement_v1",
    seed=42,
    exposure_mode="student_public",
)
bundle = gen.generate(n_leads=5000, difficulty="intermediate")
bundle.save("./out/demo_bundle")

Generated Data Preview

A generated bundle looks like CRM and GTM data, not a generic tabular benchmark. This compact slice comes from the intermediate lead-scoring bundle:

| split | industry | region | employee_band | lead_source | touch_count | session_count | opportunity_created | expected_acv | converted_within_90_days |
|---|---|---|---|---|---|---|---|---|---|
| train | logistics | UK | 200-499 | inbound_marketing | 0 | 0 | False | 66,699 | False |
| train | logistics | UK | 500-999 | inbound_marketing | 5 | 2 | False | 58,372 | False |
| train | logistics | US | 200-499 | partner_referral | 9 | 3 | True | 15,462 | False |
| train | healthcare_non_clinical | US | 200-499 | inbound_marketing | 5 | 1 | True | 30,490 | False |
| train | manufacturing | US | 1000-1999 | sdr_outbound | missing | 1 | True | 42,999 | False |

The full bundle also includes accounts, contacts, leads, touches, sessions, sales activities, opportunities, feature dictionaries, manifests, and model-ready Parquet task splits.
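The preview hints at two cleaning steps a learner may run into: `touch_count` can be missing, and `expected_acv` is displayed with thousands separators. The sketch below assumes the values arrive as plain strings exactly as shown in the preview; the Parquet files may well store typed values, in which case this parsing is unnecessary. The helper names are illustrative.

```python
# Three columns copied from the preview rows above:
# (lead_source, touch_count, expected_acv), kept as displayed strings.
preview = [
    ("inbound_marketing", "0", "66,699"),
    ("inbound_marketing", "5", "58,372"),
    ("partner_referral", "9", "15,462"),
    ("inbound_marketing", "5", "30,490"),
    ("sdr_outbound", "missing", "42,999"),
]


def parse_count(text):
    """Return an int, or None when the value is recorded as missing."""
    return None if text == "missing" else int(text)


def parse_acv(text):
    """Strip thousands separators from a displayed amount."""
    return int(text.replace(",", ""))


touches = [parse_count(t) for _, t, _ in preview]
acvs = [parse_acv(a) for _, _, a in preview]

# Average over non-missing rows only, and track how much is missing.
observed = [t for t in touches if t is not None]
mean_touches = sum(observed) / len(observed)
missing_rate = touches.count(None) / len(touches)

print(touches, mean_touches, missing_rate, max(acvs))
```

Keeping the missing value as `None` rather than coercing it to 0 preserves the distinction between "no touches recorded" and "touch count unknown", which the first and last preview rows illustrate.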


Exposure Modes

Control how much of the simulation's ground truth is exposed in the output bundle:

| Mode | Purpose | Includes |
|---|---|---|
| student_public | Teaching / portfolio use | Tables, features, task splits, dataset card |
| research_instructor | Full truth for instructors / researchers | All of the above + hidden graph, world spec, latent registry, mechanism summary |

Set via --mode on the CLI or exposure_mode= in the Python API.


Difficulty Profiles

Each recipe ships with difficulty profiles that control the signal-to-noise ratio of the generated data:

| Profile | Description |
|---|---|
| intro | Strong signal, low noise; good for first-time learners |
| intermediate | Moderate signal, realistic noise |
| advanced | Weak signal, high noise; challenges experienced practitioners |

Set via --difficulty on the CLI or difficulty= in generate().
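As a rough intuition for what a signal-to-noise profile does to a modelling task, the toy simulation below makes a label depend on one observed feature plus hidden noise, then measures how well the feature ranks positives above negatives. This is an illustration only: the noise scales, the threshold, and the single-feature world are assumptions made for the sketch and say nothing about how leadforge implements its profiles.

```python
import random


def auc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def simulate(noise_scale, n=1000, seed=42):
    """Toy world: the label depends on one observed feature plus hidden noise."""
    rng = random.Random(seed)
    feature = [rng.gauss(0, 1) for _ in range(n)]
    labels = [x + noise_scale * rng.gauss(0, 1) > 1.0 for x in feature]
    return auc(feature, labels)


# Hypothetical noise scales, chosen only to illustrate the ordering of the tiers.
profiles = {"intro": 0.3, "intermediate": 1.0, "advanced": 3.0}
results = {name: simulate(scale) for name, scale in profiles.items()}

for name, value in results.items():
    print(f"{name:>12}: AUC = {value:.3f}")
```

The more hidden noise enters the label, the less the observed feature can explain it, so the achievable ranking quality drops from the first tier to the last.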


Output Bundle

bundle_root/
  manifest.json            # provenance, row counts, file hashes
  dataset_card.md          # human-readable dataset documentation
  feature_dictionary.csv   # feature names, types, descriptions
  tables/                  # 9 relational Parquet tables
  tasks/
    converted_within_90_days/
      train.parquet
      valid.parquet
      test.parquet
      task_manifest.json
  metadata/                # (research_instructor only) hidden graph, world spec, latents
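A quick way to sanity-check a bundle directory against the tree above is to test for the listed paths. The sketch below is a layout check only and is no substitute for `leadforge validate`; the `missing_paths` helper is hypothetical, and the demo runs on empty stand-in files in a temporary directory rather than on a real bundle.

```python
import tempfile
from pathlib import Path

TASK = "converted_within_90_days"

# Paths taken from the tree above. The contents of tables/ are not listed
# there, so only the directory itself is checked.
COMMON = [
    "manifest.json",
    "dataset_card.md",
    "feature_dictionary.csv",
    "tables",
    f"tasks/{TASK}/train.parquet",
    f"tasks/{TASK}/valid.parquet",
    f"tasks/{TASK}/test.parquet",
    f"tasks/{TASK}/task_manifest.json",
]
INSTRUCTOR_ONLY = ["metadata"]


def missing_paths(bundle_root, mode="student_public"):
    """Return the expected relative paths that do not exist under bundle_root."""
    expected = COMMON + (INSTRUCTOR_ONLY if mode == "research_instructor" else [])
    root = Path(bundle_root)
    return [rel for rel in expected if not (root / rel).exists()]


# Demo on an empty stand-in bundle built in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "tables").mkdir()
    (root / "tasks" / TASK).mkdir(parents=True)
    for rel in COMMON:
        if "." in Path(rel).name:  # create files, skip directories
            (root / rel).touch()
    student_missing = missing_paths(root, "student_public")
    instructor_missing = missing_paths(root, "research_instructor")

print(student_missing, instructor_missing)
```

The stand-in satisfies the `student_public` layout but reports `metadata` as missing under `research_instructor`, mirroring the note in the tree that this directory is only written in that mode.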

Key Design Principles

  • Deterministic: same (recipe, seed, version) → identical output.
  • Relational-first: 9 normalized tables; flat ML exports are derived.
  • No external APIs: core generation never requires network access.
  • Simulation-driven labels: converted_within_90_days emerges from simulated events, not sampled directly.
  • Leakage-safe: no feature uses events after the snapshot anchor.
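The determinism principle can be spot-checked by generating the same recipe and seed twice and comparing the two output directories. A minimal sketch, assuming a byte-level comparison is appropriate: `bundle_digest` and `write_stand_in` are hypothetical helpers, and the demo uses tiny fake bundles instead of real `leadforge generate` output.

```python
import hashlib
import tempfile
from pathlib import Path


def bundle_digest(bundle_root):
    """SHA-256 over every file's relative path and bytes, visited in sorted order."""
    root = Path(bundle_root)
    h = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        h.update(path.relative_to(root).as_posix().encode("utf-8"))
        h.update(b"\0")
        h.update(path.read_bytes())
        h.update(b"\0")
    return h.hexdigest()


def write_stand_in(root, payload):
    """Write a tiny fake bundle; a real one would come from `leadforge generate`."""
    (root / "tasks").mkdir(parents=True)
    (root / "manifest.json").write_text('{"seed": 42}', encoding="utf-8")
    (root / "tasks" / "train.csv").write_text(payload, encoding="utf-8")


with tempfile.TemporaryDirectory() as a, tempfile.TemporaryDirectory() as b, \
        tempfile.TemporaryDirectory() as c:
    write_stand_in(Path(a), "lead_id,label\n1,0\n")
    write_stand_in(Path(b), "lead_id,label\n1,0\n")
    write_stand_in(Path(c), "lead_id,label\n1,1\n")
    same = bundle_digest(a) == bundle_digest(b)
    different = bundle_digest(a) != bundle_digest(c)

print(same, different)
```

Whether a real bundle is byte-identical across runs (for example, if the manifest records a timestamp) is not stated here, so a digest mismatch is best treated as a prompt to compare individual files rather than as proof of non-determinism.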

Documentation

Full documentation: leadforge-dev.github.io/leadforge

Development

pip install -e ".[dev]"
pytest                        # run all tests (~800)
ruff check .                  # lint
ruff format .                 # format
mypy leadforge/               # type check
pre-commit run --all-files    # full pre-commit suite

License

MIT. See LICENSE.


Credits

Created by Shay Palachy Affek.
