leadforge
Opinionated framework for generating synthetic CRM and GTM datasets from simulated commercial worlds.
Created by Shay Palachy Affek.
leadforge generates narrative-grounded synthetic revenue datasets — starting with lead scoring — designed for teaching, portfolio projects, and research. Rather than sampling rows from a distribution, it simulates a commercial world: a specific company, selling a specific product, to a specific kind of buyer, and renders realistic CRM-style outputs from that world.
Docs: leadforge-dev.github.io/leadforge · Dataset: HuggingFace · Kaggle: Intro · Intermediate · Advanced
What Makes leadforge Different
- World-first generation: datasets are rendered from simulated companies, products, buyers, activities, opportunities, and outcomes.
- Relational CRM shape: output includes normalized tables plus task-ready train/validation/test splits for lead scoring.
- Pedagogical realism: snapshot discipline, redaction modes, leakage traps, calibration issues, and difficulty tiers are deliberate teaching material.
Installation
Requires Python 3.11+.
pip install leadforge
Or install directly from GitHub:
pip install git+https://github.com/leadforge-dev/leadforge.git
For development:
git clone https://github.com/leadforge-dev/leadforge.git
cd leadforge
pip install -e ".[dev]"
pre-commit install
Quickstart
CLI
# List available recipes
leadforge list-recipes
# Generate a dataset bundle
leadforge generate \
  --recipe b2b_saas_procurement_v1 \
  --seed 42 \
  --mode student_public \
  --difficulty intermediate \
  --n-leads 5000 \
  --out ./out/demo_bundle
# Inspect bundle metadata
leadforge inspect ./out/demo_bundle
# Or pipe the manifest into jq
leadforge inspect ./out/demo_bundle --json | jq .snapshot_day
# Validate bundle integrity
leadforge validate ./out/demo_bundle
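If you drive the CLI from a script, it can help to build the argument list in one place. The sketch below only assembles the `leadforge generate` flags shown above; `build_generate_command` is a hypothetical helper written for this example, not part of leadforge.

```python
import shlex


def build_generate_command(recipe, seed, mode, difficulty, n_leads, out_dir):
    """Return the argv list for `leadforge generate` (hypothetical helper)."""
    return [
        "leadforge", "generate",
        "--recipe", recipe,
        "--seed", str(seed),
        "--mode", mode,
        "--difficulty", difficulty,
        "--n-leads", str(n_leads),
        "--out", str(out_dir),
    ]


cmd = build_generate_command(
    "b2b_saas_procurement_v1", 42, "student_public",
    "intermediate", 5000, "./out/demo_bundle",
)
# Print the command as a single shell-safe string for logging.
print(shlex.join(cmd))

# To actually run it (requires leadforge to be installed and on PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the list to `subprocess.run` avoids shell quoting issues when the output path contains spaces.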
Python API
from leadforge.api import Generator
gen = Generator.from_recipe(
    "b2b_saas_procurement_v1",
    seed=42,
    exposure_mode="student_public",
)
bundle = gen.generate(n_leads=5000, difficulty="intermediate")
bundle.save("./out/demo_bundle")
Generated Data Preview
A generated bundle looks like CRM and GTM data, not a generic tabular benchmark. This compact slice comes from the intermediate lead-scoring bundle:
| split | industry | region | employee_band | lead_source | touch_count | session_count | opportunity_created | expected_acv | converted_within_90_days |
|---|---|---|---|---|---|---|---|---|---|
| train | logistics | UK | 200-499 | inbound_marketing | 0 | 0 | False | 66,699 | False |
| train | logistics | UK | 500-999 | inbound_marketing | 5 | 2 | False | 58,372 | False |
| train | logistics | US | 200-499 | partner_referral | 9 | 3 | True | 15,462 | False |
| train | healthcare_non_clinical | US | 200-499 | inbound_marketing | 5 | 1 | True | 30,490 | False |
| train | manufacturing | US | 1000-1999 | sdr_outbound | missing | 1 | True | 42,999 | False |
The full bundle also includes accounts, contacts, leads, touches, sessions, sales activities, opportunities, feature dictionaries, manifests, and model-ready Parquet task splits.
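The preview includes a missing `touch_count`, so downstream code needs an explicit missing-value policy. The sketch below works on three columns copied from the five preview rows as rendered strings. This is an assumption made for illustration: the Parquet files presumably store typed values and nulls rather than the text "missing". `parse_int` is a hypothetical helper.

```python
# Cells copied from the preview table above, kept as rendered strings.
preview_rows = [
    {"lead_source": "inbound_marketing", "touch_count": "0", "expected_acv": "66,699"},
    {"lead_source": "inbound_marketing", "touch_count": "5", "expected_acv": "58,372"},
    {"lead_source": "partner_referral", "touch_count": "9", "expected_acv": "15,462"},
    {"lead_source": "inbound_marketing", "touch_count": "5", "expected_acv": "30,490"},
    {"lead_source": "sdr_outbound", "touch_count": "missing", "expected_acv": "42,999"},
]


def parse_int(raw):
    """Convert a rendered cell to int; map the 'missing' marker to None."""
    if raw == "missing":
        return None
    return int(raw.replace(",", ""))


touches = [parse_int(row["touch_count"]) for row in preview_rows]

# Average only over observed values instead of silently treating missing as 0.
observed = [t for t in touches if t is not None]
mean_touches = sum(observed) / len(observed)

total_acv = sum(parse_int(row["expected_acv"]) for row in preview_rows)

print(mean_touches)  # 4.75
print(total_acv)     # 214022
```

Treating the missing cell as zero would pull the mean down to 3.8, which is the kind of quiet distortion the dataset is meant to surface.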
Exposure Modes
Control what truth is visible in the output bundle:
| Mode | Purpose | Includes |
|---|---|---|
| student_public | Teaching / portfolio use | Tables, features, task splits, dataset card |
| research_instructor | Full truth for instructors / researchers | All of the above + hidden graph, world spec, latent registry, mechanism summary |
Set via --mode on the CLI or exposure_mode= in the Python API.
Difficulty Profiles
Each recipe ships with difficulty profiles that control signal-to-noise ratio:
| Profile | Description |
|---|---|
| intro | Strong signal, low noise; good for first-time learners |
| intermediate | Moderate signal, realistic noise |
| advanced | Weak signal, high noise; challenges experienced practitioners |
Set via --difficulty on the CLI or difficulty= in generate().
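To get a feel for what a signal-to-noise knob does, here is a toy illustration that does not use leadforge at all. It adds Gaussian noise of increasing size to a single signal and measures how the feature-outcome correlation drops. The noise levels in `NOISE_SD` are made-up numbers for this sketch, not the parameters leadforge uses.

```python
import random

# Hypothetical noise levels per profile, chosen only to illustrate the idea.
NOISE_SD = {"intro": 0.3, "intermediate": 1.0, "advanced": 3.0}


def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def feature_outcome_correlation(profile, n=5000, seed=42):
    """Correlation between a unit-variance signal and signal + profile noise."""
    rng = random.Random(seed)
    signal = [rng.gauss(0, 1) for _ in range(n)]
    outcome = [s + rng.gauss(0, NOISE_SD[profile]) for s in signal]
    return pearson(signal, outcome)


correlations = {profile: feature_outcome_correlation(profile) for profile in NOISE_SD}
for profile, corr in correlations.items():
    print(f"{profile:>12}: {corr:.2f}")
```

The correlation falls as the noise grows, which is why the same model tends to score noticeably worse on a noisier profile.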
Output Bundle
bundle_root/
  manifest.json              # provenance, row counts, file hashes
  dataset_card.md            # human-readable dataset documentation
  feature_dictionary.csv     # feature names, types, descriptions
  tables/                    # 9 relational Parquet tables
  tasks/
    converted_within_90_days/
      train.parquet
      valid.parquet
      test.parquet
      task_manifest.json
  metadata/                  # (research_instructor only) hidden graph, world spec, latents
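For a quick sanity check from your own scripts, a layout test along these lines may be enough. It checks a subset of the paths listed above and expects `metadata/` only in `research_instructor` mode. `missing_paths` is a hypothetical helper and is not a substitute for `leadforge validate`; the demo builds an empty stand-in bundle in a temporary directory.

```python
import tempfile
from pathlib import Path

# A subset of the paths from the bundle layout above.
REQUIRED = [
    "manifest.json",
    "dataset_card.md",
    "feature_dictionary.csv",
    "tables",
    "tasks/converted_within_90_days/train.parquet",
    "tasks/converted_within_90_days/valid.parquet",
    "tasks/converted_within_90_days/test.parquet",
]


def missing_paths(bundle_root, mode="student_public"):
    """Return the expected relative paths that do not exist under bundle_root."""
    root = Path(bundle_root)
    expected = list(REQUIRED)
    if mode == "research_instructor":
        expected.append("metadata")  # only present in research_instructor bundles
    return [rel for rel in expected if not (root / rel).exists()]


# Demo on a fake, empty bundle skeleton (no real data is generated here).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for rel in REQUIRED:
        target = root / rel
        if target.suffix:  # entries with an extension are files
            target.parent.mkdir(parents=True, exist_ok=True)
            target.touch()
        else:  # entries without an extension are directories
            target.mkdir(parents=True, exist_ok=True)
    student_missing = missing_paths(root, "student_public")
    instructor_missing = missing_paths(root, "research_instructor")

print(student_missing)     # []
print(instructor_missing)  # ['metadata']
```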
Key Design Principles
- Deterministic: same (recipe, seed, version) → identical output.
- Relational-first: 9 normalized tables; flat ML exports are derived.
- No external APIs: core generation never requires network access.
- Simulation-driven labels: converted_within_90_days emerges from simulated events, not sampled directly.
- Leakage-safe: no feature uses events after the snapshot anchor.
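The principles above can be illustrated with a deliberately tiny simulation. This is a sketch of the general idea, not leadforge's engine: the event types, probabilities, `SNAPSHOT_DAY`, and the choice to measure the 90-day window from the snapshot are all assumptions made for this example.

```python
import random

SNAPSHOT_DAY = 30  # hypothetical snapshot anchor
HORIZON_DAYS = 90  # label window, assumed here to start at the snapshot


def simulate_lead(rng):
    """Simulate one lead's timeline as a list of (day, event_type) tuples."""
    engagement = rng.random()  # latent trait, never exposed as a feature
    events = []
    for day in range(1, SNAPSHOT_DAY + HORIZON_DAYS + 1):
        if rng.random() < 0.05 + 0.10 * engagement:
            events.append((day, "touch"))
        # In this toy world, deals can only close after the snapshot.
        if day > SNAPSHOT_DAY and rng.random() < 0.004 * (1 + 2 * engagement):
            events.append((day, "closed_won"))
            break
    return events


def render_row(events):
    """Render a flat row: features see only events up to the snapshot."""
    touch_count = sum(
        1 for day, kind in events if kind == "touch" and day <= SNAPSHOT_DAY
    )
    # The label is read off the simulated events rather than sampled directly.
    converted = any(
        kind == "closed_won" and SNAPSHOT_DAY < day <= SNAPSHOT_DAY + HORIZON_DAYS
        for day, kind in events
    )
    return {"touch_count": touch_count, "converted_within_90_days": converted}


def generate(seed, n_leads):
    """Same seed and n_leads always produce the same rows."""
    rng = random.Random(seed)
    return [render_row(simulate_lead(rng)) for _ in range(n_leads)]


rows = generate(seed=42, n_leads=200)
print(rows == generate(seed=42, n_leads=200))  # True
```

A touch on day 40 still exists in the simulated world and can influence the outcome, but it never reaches `touch_count`, because the feature is computed strictly from events on or before the snapshot.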
Development
pip install -e ".[dev]"
pytest # run all tests (~800)
ruff check . # lint
ruff format . # format
mypy leadforge/ # type check
pre-commit run --all-files # full pre-commit suite
License
MIT. See LICENSE.
Credits
Created by Shay Palachy Affek [GitHub]