leadforge

Opinionated framework for generating synthetic CRM and GTM datasets from simulated commercial worlds.

leadforge generates narrative-grounded synthetic revenue datasets — starting with lead scoring — designed for teaching, portfolio projects, and research. Rather than sampling rows from a distribution, it simulates a commercial world: a specific company, selling a specific product, to a specific kind of buyer, and renders realistic CRM-style outputs from that world.

Docs: leadforge-dev.github.io/leadforge · Dataset: HuggingFace · Kaggle: Intro · Intermediate · Advanced

What Makes LeadForge Different

World-first generation: datasets are rendered from simulated companies, products, buyers, activities, opportunities, and outcomes.
Relational CRM shape: output includes normalized tables plus task-ready train/validation/test splits for lead scoring.
Pedagogical realism: snapshot discipline, redaction modes, leakage traps, calibration issues, and difficulty tiers are deliberate teaching material.

Installation

Requires Python 3.11+.

pip install leadforge

Or install directly from GitHub:

pip install git+https://github.com/leadforge-dev/leadforge.git

For development:

git clone https://github.com/leadforge-dev/leadforge.git
cd leadforge
pip install -e ".[dev]"
pre-commit install

Quickstart

CLI

# List available recipes
leadforge list-recipes

# Generate a dataset bundle
leadforge generate \
  --recipe b2b_saas_procurement_v1 \
  --seed 42 \
  --mode student_public \
  --difficulty intermediate \
  --n-leads 5000 \
  --out ./out/demo_bundle

# Inspect bundle metadata
leadforge inspect ./out/demo_bundle

# Or pipe the manifest into jq
leadforge inspect ./out/demo_bundle --json | jq .snapshot_day

# Validate bundle integrity
leadforge validate ./out/demo_bundle
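
For scripted runs, such as sweeping all three difficulty tiers, the generate command above can be assembled from Python. This is a minimal sketch that uses only the flags shown in the CLI example; the helper name `build_generate_command` and the `./out/<tier>_bundle` naming scheme are hypothetical and not part of leadforge.

```python
import shlex


def build_generate_command(recipe, seed, mode, difficulty, n_leads, out_dir):
    """Return the argv list for one `leadforge generate` call.

    Hypothetical helper: it only mirrors the flags shown in the CLI example.
    """
    return [
        "leadforge", "generate",
        "--recipe", recipe,
        "--seed", str(seed),
        "--mode", mode,
        "--difficulty", difficulty,
        "--n-leads", str(n_leads),
        "--out", str(out_dir),
    ]


# One command per difficulty tier; the output directory naming is an assumption.
commands = [
    build_generate_command(
        "b2b_saas_procurement_v1", 42, "student_public", tier, 5000, f"./out/{tier}_bundle"
    )
    for tier in ("intro", "intermediate", "advanced")
]

for argv in commands:
    # Print a copy-pasteable shell line for each tier.
    print(shlex.join(argv))
```

With leadforge installed, each `argv` list could be passed to `subprocess.run(argv, check=True)` instead of being printed.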

Python API

from leadforge.api import Generator

gen = Generator.from_recipe(
    "b2b_saas_procurement_v1",
    seed=42,
    exposure_mode="student_public",
)
bundle = gen.generate(n_leads=5000, difficulty="intermediate")
bundle.save("./out/demo_bundle")
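
One way to check reproducibility across two saved bundles is to fingerprint each bundle directory and compare the digests. The sketch below uses only the standard library; `bundle_fingerprint` is a hypothetical helper, and it assumes that every file in a bundle (including `manifest.json`) is byte-identical across runs, which this page does not state explicitly.

```python
import hashlib
from pathlib import Path


def bundle_fingerprint(bundle_root):
    """Hash every file under bundle_root (relative path + bytes) into one digest."""
    root = Path(bundle_root)
    digest = hashlib.sha256()
    # Sort so the result does not depend on filesystem listing order.
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")  # separator between the path and the file content
        digest.update(path.read_bytes())
    return digest.hexdigest()
```

Saving two bundles generated with the same recipe, seed, and leadforge version to different directories and comparing `bundle_fingerprint(dir_a) == bundle_fingerprint(dir_b)` gives a quick yes/no answer. If the digests differ, comparing files one by one shows where the runs diverge.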

Generated Data Preview

A generated bundle looks like CRM and GTM data, not a generic tabular benchmark. This compact slice comes from the intermediate lead-scoring bundle:

split	industry	region	employee_band	lead_source	touch_count	session_count	opportunity_created	expected_acv	converted_within_90_days
train	logistics	UK	200-499	inbound_marketing	0	0	False	66,699	False
train	logistics	UK	500-999	inbound_marketing	5	2	False	58,372	False
train	logistics	US	200-499	partner_referral	9	3	True	15,462	False
train	healthcare_non_clinical	US	200-499	inbound_marketing	5	1	True	30,490	False
train	manufacturing	US	1000-1999	sdr_outbound	missing	1	True	42,999	False

The full bundle also includes accounts, contacts, leads, touches, sessions, sales activities, opportunities, feature dictionaries, manifests, and model-ready Parquet task splits.
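
The preview rows above mix clean values with a literal `missing` entry and comma-formatted numbers. The sketch below coerces those five rows into typed Python values. It works on the rendered preview text only; how missing values and numbers are encoded in the Parquet files themselves is not shown here, so the `missing` and comma handling are assumptions about the preview, not about the on-disk format.

```python
HEADER = (
    "split", "industry", "region", "employee_band", "lead_source",
    "touch_count", "session_count", "opportunity_created",
    "expected_acv", "converted_within_90_days",
)

# The five preview rows, copied as strings.
PREVIEW_ROWS = [
    ("train", "logistics", "UK", "200-499", "inbound_marketing", "0", "0", "False", "66,699", "False"),
    ("train", "logistics", "UK", "500-999", "inbound_marketing", "5", "2", "False", "58,372", "False"),
    ("train", "logistics", "US", "200-499", "partner_referral", "9", "3", "True", "15,462", "False"),
    ("train", "healthcare_non_clinical", "US", "200-499", "inbound_marketing", "5", "1", "True", "30,490", "False"),
    ("train", "manufacturing", "US", "1000-1999", "sdr_outbound", "missing", "1", "True", "42,999", "False"),
]


def parse_int(text):
    """Parse an integer that may use thousands separators; 'missing' becomes None."""
    return None if text == "missing" else int(text.replace(",", ""))


def coerce_row(raw):
    """Turn one row of strings into a dict with ints, bools, and None for gaps."""
    row = dict(zip(HEADER, raw))
    for column in ("touch_count", "session_count", "expected_acv"):
        row[column] = parse_int(row[column])
    for column in ("opportunity_created", "converted_within_90_days"):
        row[column] = row[column] == "True"
    return row


rows = [coerce_row(raw) for raw in PREVIEW_ROWS]
known_touches = [r["touch_count"] for r in rows if r["touch_count"] is not None]
print(len(rows) - len(known_touches), "row(s) with missing touch_count")
print("mean touch_count over known rows:", sum(known_touches) / len(known_touches))
```

Keeping the gap as `None` rather than filling it with zero preserves the difference between "no touches" (first row) and "touch count unknown" (last row).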

Exposure Modes

Control how much of the simulation's ground truth is visible in the output bundle:

Mode	Purpose	Includes
`student_public`	Teaching / portfolio use	Tables, features, task splits, dataset card
`research_instructor`	Full truth for instructors / researchers	All of the above + hidden graph, world spec, latent registry, mechanism summary

Set via --mode on the CLI or exposure_mode= in Generator.from_recipe().

Difficulty Profiles

Each recipe ships with difficulty profiles that control the signal-to-noise ratio of the generated data:

Profile	Description
`intro`	Strong signal, low noise — good for first-time learners
`intermediate`	Moderate signal, realistic noise
`advanced`	Weak signal, high noise — challenges experienced practitioners

Set via --difficulty on the CLI or difficulty= in generate().
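
To build intuition for what a lower signal-to-noise ratio does to a modeling task, the toy simulation below adds increasing noise to a feature whose latent value drives the label. It is an illustration only: the noise scales, the threshold, and the function names are hypothetical and are not leadforge's actual profile settings or generation mechanism.

```python
import math
import random

# Hypothetical noise scales for illustration; NOT leadforge's actual profile settings.
NOISE_SCALE = {"intro": 0.5, "intermediate": 1.5, "advanced": 4.0}


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def feature_label_correlation(profile, n=2000, seed=42):
    """Correlate a noisy observed feature with a label driven by a latent score."""
    rng = random.Random(seed)
    latent = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # The label depends only on the latent score; the feature is latent + noise.
    labels = [1.0 if score > 1.0 else 0.0 for score in latent]
    observed = [score + rng.gauss(0.0, NOISE_SCALE[profile]) for score in latent]
    return pearson(observed, labels)


for profile in NOISE_SCALE:
    print(profile, round(feature_label_correlation(profile), 3))
```

The printed correlation shrinks from `intro` to `advanced`, which is the qualitative effect the profile descriptions point to: the same underlying signal becomes harder to recover as noise grows.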

Output Bundle

bundle_root/
  manifest.json            # provenance, row counts, file hashes
  dataset_card.md          # human-readable dataset documentation
  feature_dictionary.csv   # feature names, types, descriptions
  tables/                  # 9 relational Parquet tables
  tasks/
    converted_within_90_days/
      train.parquet
      valid.parquet
      test.parquet
      task_manifest.json
  metadata/                # (research_instructor only) hidden graph, world spec, latents
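
The layout above can be turned into a quick presence check before loading a bundle. This is a minimal sketch, not a replacement for `leadforge validate`: `missing_bundle_entries` is a hypothetical helper that only tests whether the listed paths exist, and it does not verify row counts or file hashes.

```python
from pathlib import Path

# Entries taken from the layout shown above; paths are relative to the bundle root.
COMMON_ENTRIES = [
    "manifest.json",
    "dataset_card.md",
    "feature_dictionary.csv",
    "tables",
    "tasks/converted_within_90_days/train.parquet",
    "tasks/converted_within_90_days/valid.parquet",
    "tasks/converted_within_90_days/test.parquet",
    "tasks/converted_within_90_days/task_manifest.json",
]


def missing_bundle_entries(bundle_root, mode="student_public"):
    """List expected entries that are absent from bundle_root (hypothetical helper)."""
    expected = list(COMMON_ENTRIES)
    if mode == "research_instructor":
        # metadata/ is documented as present only in research_instructor bundles.
        expected.append("metadata")
    root = Path(bundle_root)
    return [entry for entry in expected if not (root / entry).exists()]
```

For example, `missing_bundle_entries("./out/demo_bundle")` returns an empty list when every expected entry is present, and the names of the absent entries otherwise.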

Key Design Principles

Deterministic: same (recipe, seed, version) → identical output.
Relational-first: 9 normalized tables; flat ML exports are derived.
No external APIs: core generation never requires network access.
Simulation-driven labels: converted_within_90_days emerges from simulated events, not sampled directly.
Leakage-safe: no feature uses events after the snapshot anchor.
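
The last two principles can be sketched with a toy event log: features count only events on or before the snapshot day, while the label is derived from events that happen afterwards. The `Event` type, the event kinds, and the choice to start the 90-day window at the snapshot day are all assumptions made for illustration; leadforge's internal event model is not described on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    lead_id: str
    day: int   # simulation day on which the event happened
    kind: str  # "touch", "session", or "conversion" (illustrative kinds)


def snapshot_features(events, lead_id, snapshot_day):
    """Count only events visible on or before the snapshot day."""
    visible = [e for e in events if e.lead_id == lead_id and e.day <= snapshot_day]
    return {
        "touch_count": sum(e.kind == "touch" for e in visible),
        "session_count": sum(e.kind == "session" for e in visible),
    }


def converted_within_window(events, lead_id, snapshot_day, window_days=90):
    """Label derived from events; assumes the window starts at the snapshot day."""
    return any(
        e.lead_id == lead_id
        and e.kind == "conversion"
        and snapshot_day < e.day <= snapshot_day + window_days
        for e in events
    )


events = [
    Event("lead-1", 3, "touch"),
    Event("lead-1", 10, "session"),
    Event("lead-1", 45, "touch"),       # after the snapshot: must not reach features
    Event("lead-1", 80, "conversion"),  # inside the 90-day window: drives the label
]
print(snapshot_features(events, "lead-1", snapshot_day=30))
print(converted_within_window(events, "lead-1", snapshot_day=30))
```

The day-45 touch is excluded from `touch_count` even though it precedes the conversion, because a model scoring the lead on day 30 could not have seen it.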

Documentation

Full documentation: leadforge-dev.github.io/leadforge

Development

pip install -e ".[dev]"
pytest                        # run all tests (~800)
ruff check .                  # lint
ruff format .                 # format
mypy leadforge/               # type check
pre-commit run --all-files    # full pre-commit suite

License

MIT. See LICENSE.

Credits

Created by Shay Palachy Affek.

Project details

Version 1.0.0, released Jun 1, 2026.

Download files

Source Distribution

leadforge-1.0.0.tar.gz (183.7 kB), uploaded Jun 1, 2026

Built Distribution

leadforge-1.0.0-py3-none-any.whl (207.3 kB), uploaded Jun 1, 2026, Python 3

File details

Details for the file leadforge-1.0.0.tar.gz.

File metadata

Download URL: leadforge-1.0.0.tar.gz
Upload date: Jun 1, 2026
Size: 183.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.11

File hashes

Hashes for leadforge-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`78f19875cbb6f2eef46e4ef26fb7397a64a786f57e871cbccaa67a4b70445f6a`
MD5	`98aa5d5ed6ddae9912f61cbdec83ae75`
BLAKE2b-256	`8144a1b70c6f67776b0e09816b5f4d75cb6b77c10d5b5a215e5e80b4c63810eb`
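
A downloaded file can be checked against the SHA256 digest listed above with the standard library. This is a minimal sketch: the function names are illustrative, and the local file path in the usage note is an assumption about where the download was saved.

```python
import hashlib
from pathlib import Path

# SHA256 digest for leadforge-1.0.0.tar.gz, copied from the table above.
PUBLISHED_SDIST_SHA256 = "78f19875cbb6f2eef46e4ef26fb7397a64a786f57e871cbccaa67a4b70445f6a"


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex):
    """Return True when the file's SHA256 equals the expected hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_published_digest("leadforge-1.0.0.tar.gz", PUBLISHED_SDIST_SHA256)` should return `True` for an intact download saved in the current directory. pip's hash-checking mode (`--require-hashes` with `--hash=sha256:...` entries in a requirements file) performs the same comparison at install time.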

File details

Details for the file leadforge-1.0.0-py3-none-any.whl.

File metadata

Download URL: leadforge-1.0.0-py3-none-any.whl
Upload date: Jun 1, 2026
Size: 207.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.11

File hashes

Hashes for leadforge-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`610c3620ccbbc16106630798644e537329672726885f4f9d4920d49b8dcccf32`
MD5	`43626e27a8c62982f19b40f27150ee6a`
BLAKE2b-256	`eb8ac50f9be26315556f18c09a29fdfd0cbd34de6eb35475fa1b0d56088e3717`
