Structured ML framework for customer churn prediction -- from exploration notebooks to production pipelines, locally or on Databricks.

Project description

Customer Retention ML Framework

A structured backbone for the messy, iterative reality of ML model development. Exploration and production deployment are treated as parts of the same process -- not separate phases -- reflecting how data science actually works: you explore, decide, build, evaluate, learn something new, and circle back.

Handles both entity-level and event-level datasets. Experiments and production can share the same tables without copying data (Delta Lake), features are served consistently across training and inference (Feast / Feature Store), and every experiment is tracked and reproducible (MLflow). Runs locally or deploys to Databricks.

Python 3.10+ | License | CI | codecov | pre-commit | Typed | MLflow | Databricks


Why This Exists

Most ML tutorials jump straight to model.fit(). Real projects fail earlier -- in data issues you didn't notice, leakage you didn't check for, or feature choices you can't explain to your stakeholders three months later. This framework tries to close that gap.

It serves two audiences:

  1. If you're learning, the notebooks walk through a realistic end-to-end process and explain the reasoning behind each step. Why does a 93-day median inter-event gap rule out short aggregation windows? Why might the model that wins validation degrade in production? The goal is to build intuition for the decisions that don't appear in textbooks.

  2. If you're experienced, you can pip install, point to a new dataset, and get an opinionated exploration scaffold. The output is loosely coupled production code (Bronze / Silver / Gold) with the provenance of every decision captured in self-contained HTML documentation -- useful when you need to explain why the pipeline does what it does.
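
One plausible reading of the inter-event gap question above: if a typical customer produces an event roughly every 93 days, a 7-day or 30-day aggregation window will be empty for most customers, so features built on it are mostly zeros. The sketch below is illustrative only. The function names and the sample data are assumptions, not part of the framework's API.

```python
from datetime import date
from statistics import median


def median_inter_event_gap_days(events_by_entity):
    """Median number of days between consecutive events, pooled across entities.

    events_by_entity maps an entity id to a list of event dates (hypothetical layout).
    """
    gaps = []
    for dates in events_by_entity.values():
        ordered = sorted(dates)
        # Gap between each event and the next one for the same entity.
        gaps.extend((later - earlier).days for earlier, later in zip(ordered, ordered[1:]))
    return median(gaps) if gaps else None


def expected_events_in_window(window_days, median_gap_days):
    """Rough number of events a typical entity produces inside one window."""
    return window_days / median_gap_days


# Made-up sample: every consecutive pair of events is 93 days apart.
events = {
    "c1": [date(2024, 1, 1), date(2024, 4, 3), date(2024, 7, 5)],
    "c2": [date(2024, 2, 1), date(2024, 5, 4)],
}

gap = median_inter_event_gap_days(events)
for window in (7, 30, 180):
    print(window, round(expected_events_in_window(window, gap), 2))
```

A window that expects fewer than one event per customer is a candidate for dropping. A window of several multiples of the median gap is more likely to carry signal.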

The approach

  • Exploration is a first-class concept. The framework records what it found in the data, what it recommends, and why -- in versioned YAML artifacts. Each downstream transformation traces back to a specific observation in a specific notebook, so nothing happens without a documented reason.
  • Experimentation is version-controlled end to end. Code and features are versioned, and so are the data observations and the actions taken on them, so all of it can be frozen in time together. Delta tables support time travel on live production datasets, so you can go back to what the data looked like when a decision was made.
  • Iteration is the default. Model feedback -- feature importances, error analysis, drift signals -- feeds back into the next exploration cycle. The framework tracks iteration lineage rather than treating each experiment as independent.
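
The idea that each transformation traces back to a recorded observation can be sketched as a pair of linked records. This is a minimal illustration, not the framework's actual artifact schema: the class names, field names, and the `F001` identifier are hypothetical, and the artifact is serialized as JSON here only to stay within the standard library (the framework itself stores YAML).

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class Finding:
    """Something observed in the data, and what it suggests doing about it."""
    finding_id: str
    notebook: str
    observation: str
    recommendation: str


@dataclass(frozen=True)
class Transformation:
    """A pipeline step that must name the finding that justifies it."""
    name: str
    because_of: str  # finding_id of the justifying Finding
    params: dict = field(default_factory=dict)


def trace(step, findings):
    """Return the finding that justifies a transformation (KeyError if none)."""
    by_id = {f.finding_id: f for f in findings}
    return by_id[step.because_of]


findings = [
    Finding(
        finding_id="F001",
        notebook="01_data_discovery.ipynb",
        observation="Median inter-event gap is 93 days",
        recommendation="Avoid short aggregation windows",
    )
]
# window_days=180 is an illustrative value, not a framework default.
step = Transformation(name="aggregate_events", because_of="F001", params={"window_days": 180})

# A single artifact that can be committed alongside the code.
artifact = json.dumps(
    {"findings": [asdict(f) for f in findings], "transformations": [asdict(step)]},
    indent=2,
)
source = trace(step, findings)
print(source.notebook, "->", step.name)
```

Because the lookup raises when a step points at a missing finding, an undocumented transformation fails loudly instead of slipping through.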

Quick Start

1. Install (local)

pip install "churnkit[ml]"

For Databricks, see the Databricks Installation guide.

2. Bootstrap notebooks into your project

churnkit-init --output ./my_project
cd my_project

3. Point to your data

Open exploration_notebooks/01_data_discovery.ipynb and set the data path:

DATA_PATH = "experiments/data/your_file.csv"   # csv, parquet, or delta
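
How the framework distinguishes the three formats is not described here. As a rough sketch of one way to do it, the hypothetical helper below relies on file suffixes for CSV and Parquet and on the `_delta_log` directory that a Delta table keeps at its root. `detect_format` is an assumed name, not a churnkit function.

```python
import tempfile
from pathlib import Path


def detect_format(path):
    """Guess the storage format of a data path: 'csv', 'parquet', or 'delta'."""
    p = Path(path)
    # A Delta table is a directory containing a _delta_log transaction log.
    if p.is_dir() and (p / "_delta_log").is_dir():
        return "delta"
    suffix = p.suffix.lower()
    if suffix == ".csv":
        return "csv"
    if suffix == ".parquet":
        return "parquet"
    raise ValueError(f"Unsupported data source: {path}")


csv_result = detect_format("experiments/data/your_file.csv")

# Build a throwaway directory that looks like a Delta table root.
with tempfile.TemporaryDirectory() as tmp:
    table = Path(tmp) / "events"
    (table / "_delta_log").mkdir(parents=True)
    delta_result = detect_format(table)

print(csv_result, delta_result)
```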

4. Run

Execute cells sequentially. The framework auto-detects column types, data granularity (entity vs event-level), text columns, and temporal patterns -- then routes you through the relevant notebooks.
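
To make the entity-level versus event-level distinction concrete, here is a deliberately simple heuristic: if every identifier appears exactly once, treat the dataset as entity-level, otherwise as event-level. This is an assumption about how such detection could work, not the framework's actual logic, and `detect_granularity` is a hypothetical name.

```python
from collections import Counter


def detect_granularity(rows, id_column):
    """Classify rows as 'entity' (one row per id) or 'event' (repeated ids)."""
    counts = Counter(row[id_column] for row in rows)
    if not counts:
        raise ValueError("Cannot detect granularity of an empty dataset")
    # One row per id means each row describes an entity, not an occurrence.
    return "entity" if max(counts.values()) == 1 else "event"


# Made-up sample data for illustration.
customers = [
    {"customer_id": "a", "tenure_months": 12},
    {"customer_id": "b", "tenure_months": 3},
]
emails = [
    {"customer_id": "a", "opened": True},
    {"customer_id": "a", "opened": False},
    {"customer_id": "b", "opened": True},
]

print(detect_granularity(customers, "customer_id"))
print(detect_granularity(emails, "customer_id"))
```

A real detector would likely also need to tolerate duplicate rows and choose the identifier column itself; this sketch takes the column as given.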

Findings, recommendations, and production pipeline specs are generated as you go.


Learn More

Detailed documentation lives in the Wiki:

Topic | Wiki Page
Installation options & environment setup | Getting Started
Databricks install & databricks_init() setup | Databricks Installation
Medallion architecture & system design | Architecture
Notebook workflow & iteration tracking | Exploration Loop
Leakage-safe temporal data preparation | Temporal Framework
Feast & Databricks feature management | Feature Store
Local execution with Feast + MLflow | Local Track
Databricks with Unity Catalog + Delta Lake | Databricks Track

Tutorials

Tutorial | What it walks through
Retail Customer Retention | Entity-level data: point-in-time snapshots, quality assessment, baseline models, and a production scoring check that reveals how distribution drift affects different model families -- browse HTML
Customer Email Engagement | Event-level data: temporal window selection driven by inter-event cadence, aggregating 83K email events into customer-level features, and tracing each decision from data observation to production pipeline -- browse HTML
Bank Customer Churn | Dataset setup instructions
Netflix Churn | Dataset setup instructions
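
The tutorials lean on two related ideas: point-in-time snapshots and aggregating events into customer-level features over a chosen window. A minimal sketch of both, assuming events arrive as `(customer_id, event_date)` pairs, is shown below. The function name and sample data are illustrative and do not come from the tutorials.

```python
from datetime import date, timedelta


def window_event_counts(events, cutoff, window_days):
    """Count events per customer inside [cutoff - window_days, cutoff).

    Events on or after the cutoff are ignored, so nothing that happened
    after the snapshot date can leak into the features.
    """
    start = cutoff - timedelta(days=window_days)
    counts = {}
    for customer_id, event_date in events:
        if start <= event_date < cutoff:
            counts[customer_id] = counts.get(customer_id, 0) + 1
    return counts


# Made-up events for two customers.
events = [
    ("a", date(2024, 1, 10)),
    ("a", date(2024, 3, 1)),
    ("a", date(2024, 6, 15)),
    ("b", date(2024, 5, 20)),
    ("b", date(2024, 7, 2)),  # after the cutoff: must not be counted
]
cutoff = date(2024, 7, 1)

features_180 = window_event_counts(events, cutoff, 180)
features_30 = window_event_counts(events, cutoff, 30)
print(features_180, features_30)
```

Customers with no events in the window are absent from the result. A downstream step would need to decide whether to fill them with zero or treat them as missing.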

Contributing

See CONTRIBUTING.md for guidelines.

License

Apache 2.0 -- See LICENSE for details.

Download files

Source Distribution

churnkit-0.77.0a15.tar.gz (686.1 kB)

Uploaded Source

Built Distribution

churnkit-0.77.0a15-py3-none-any.whl (818.0 kB)

Uploaded Python 3

File details

Details for the file churnkit-0.77.0a15.tar.gz.

File metadata

  • Download URL: churnkit-0.77.0a15.tar.gz
  • Upload date:
  • Size: 686.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for churnkit-0.77.0a15.tar.gz
Algorithm | Hash digest
SHA256 | 3be7312e8fc5b2c9062dc2dbcde2f24214d5c60e4491e25a39680da2d1b2a44b
MD5 | e9ea58444e6f73c823ba6ec1c3757518
BLAKE2b-256 | e4a4bfee92c050e4f0169cce77cbef5a2fc8d1d79f4e5a2cf96252a33028c5e5

File details

Details for the file churnkit-0.77.0a15-py3-none-any.whl.

File metadata

  • Download URL: churnkit-0.77.0a15-py3-none-any.whl
  • Upload date:
  • Size: 818.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for churnkit-0.77.0a15-py3-none-any.whl
Algorithm | Hash digest
SHA256 | 914a8c268057f36dabb41797d7605a3379fe211b1a9030966eb1ad2f47c7abf5
MD5 | 2dc943f36eb203bb55e122a5f7e9bf2f
BLAKE2b-256 | 86ab36a10f0a0d30d2399cc2eeca0b8e9fc82786cc0c7a9562e92c53e7e72ff6
