Population segmentation using CHAID decision trees with automatic predictor binning

CHAID Segmentation

A self-contained Python package for population segmentation using CHAID decision trees. You give it a KPI/target and a set of predictors; it auto-bins the continuous predictors, grows a CHAID tree, and hands back interpretable segments with an event rate (or mean), population, population share and lift — plus a static tree chart where each node is a population and each branch is a choice.

It is built for the workflow: "input a KPI, generate a tree, read off the high/low segments of the customer base." For example, for a 90+DPD target:

age < 25 AND region = Phnom Penh AND bank = ABA → 60% 90+DPD rate, 10% of population, 2.5x lift

The CHAID tree engine is bundled in this repository — there is no external CHAID dependency to install. It is based on the Rambatino/CHAID project (Apache 2.0); see Credits & License.

Features

  • Automatic binning of continuous predictors — supervised (target-based), equal-width, equal-frequency (quantile) or manual cut points, chosen per variable.
  • Binary and continuous KPIs — node rate is the event rate P(target = positive) for a binary target, or the mean for a continuous target.
  • Interpretable segments — every terminal node becomes a readable rule with rate, population, population share and lift.
  • Static visualisation (matplotlib + seaborn) — node = population, branch = choice, colour = rate.
  • Predict / score new data by re-applying the fitted bins and rules.
  • Load straight from CSV or Parquet with one call.

Installation

Requires Python 3.9+. Install from this repository:

pip install .                       # core (numpy, pandas, scipy, matplotlib, seaborn, ...)
pip install '.[segmenter-target]'   # + optbinning, for method="target" (supervised binning)
pip install '.[parquet]'            # + pyarrow, for ChaidSegmenter.from_parquet

matplotlib and seaborn are core dependencies (binning + plotting). optbinning and pyarrow are only needed for target-based binning and Parquet loading respectively, and are imported lazily with a clear error if missing.

How to use

import pandas as pd
from chaid_segmenter import ChaidSegmenter

df = pd.read_csv("loan_book.csv")     # columns: age, income, tenure, score, region, bank, dpd90

seg = ChaidSegmenter(
    target="dpd90",
    positive_class=1,                                    # binary event-rate target
    predictors={
        "age":    {"method": "target", "max_bins": 4},      # supervised (optbinning)
        "income": {"method": "equal_width", "bins": 4},     # fixed-interval bins
        "tenure": {"method": "equal_frequency", "bins": 4}, # quantile bins
        "score":  {"method": "manual", "edges": [550, 650, 750]},
        "region": {"method": "nominal"},                    # categorical, used as-is
        "bank":   {"method": "nominal"},
    },
    max_depth=3,
    min_child_node_size=0.02,         # int count, or a fraction of the dataset
    alpha_merge=0.05,
)
seg.fit(df)

seg.summary()                         # tidy DataFrame, highest rate first
seg.segments()                        # list[Segment]
seg.predict(df_new)                   # assign rows to terminal segments
seg.plot("tree.png")                  # static matplotlib/seaborn chart

Load and fit in a single call:

seg = ChaidSegmenter.from_csv("loan_book.csv", "dpd90", predictors, positive_class=1)
seg = ChaidSegmenter.from_parquet("loan_book.parquet", "dpd90", predictors, positive_class=1)

A runnable, self-contained demo lives at examples/dpd_segmentation.py.

You don't have to spell out every predictor

predictors accepts three forms — pick whichever is least effort:

# 1. Full control: a spec (or method string) per column
predictors={"age": {"method": "target", "max_bins": 4}, "region": "nominal"}

# 2. Just the column names — the method is inferred from each column's dtype
#    (numeric -> default_numeric_method, non-numeric -> nominal)
predictors=["age", "income", "region", "bank"]

# 3. Omit it entirely — auto-select every column except the target/weight
ChaidSegmenter(target="dpd90", positive_class=1).fit(df)

In full-auto mode, constant columns and high-cardinality text columns (IDs, names, free text: anything with more than max_nominal_cardinality distinct values) are skipped automatically. You can always mix inference with overrides, e.g. {"age": "auto", "score": {"method": "manual", "edges": [550, 650]}}, and inspect what was chosen via seg.resolved_predictors after fit. Inferred numeric columns use default_numeric_method, which defaults to "target"; set it to "equal_frequency" or "equal_width" if you prefer one of those.
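
The selection rules above can be sketched in plain Python. This is only an illustration under stated assumptions, not the package's code: infer_methods and its arguments are hypothetical names, columns are plain lists, and "numeric" is approximated as "every non-missing value is an int or float".

```python
def infer_methods(columns, target, weight=None, max_nominal_cardinality=20,
                  default_numeric_method="target"):
    """Hypothetical sketch of full-auto predictor selection.

    columns maps a column name to a list of values (None = missing).
    """
    chosen = {}
    for name, values in columns.items():
        if name in (target, weight):
            continue  # the target and weight columns are never predictors
        present = [v for v in values if v is not None]
        distinct = set(present)
        if len(distinct) <= 1:
            continue  # constant column: nothing to split on
        numeric = all(isinstance(v, (int, float)) and not isinstance(v, bool)
                      for v in present)
        if numeric:
            chosen[name] = default_numeric_method
        elif len(distinct) > max_nominal_cardinality:
            continue  # likely an ID or free text: skipped in full-auto mode
        else:
            chosen[name] = "nominal"
    return chosen


columns = {
    "age": [23, 41, 35, 58],
    "region": ["Phnom Penh", "Siem Reap", "Phnom Penh", "Battambang"],
    "country": ["KH", "KH", "KH", "KH"],        # constant
    "account_id": ["a1", "a2", "a3", "a4"],     # one value per row
    "dpd90": [1, 0, 0, 1],
}
methods = infer_methods(columns, target="dpd90", max_nominal_cardinality=3)
print(methods)   # {'age': 'target', 'region': 'nominal'}
```

The cardinality limit is lowered to 3 here only so that a four-row toy dataset triggers the skip.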

Expected output

seg.summary()

A pandas.DataFrame, one row per terminal segment, highest rate first:

 node_id                                                           description population population_pct  rate  lift
       2                                         age < 24.9859 AND score < 550        530           6.6% 48.3% 1.99x
       8                         bank = ABA AND age < 24.9859 AND score >= 550        188           2.4% 41.5% 1.71x
       4                         bank = ABA AND age >= 24.9859 AND score < 550      1,043          13.0% 37.9% 1.56x
       9             bank in {ACLEDA, Wing} AND age < 24.9859 AND score >= 550        410           5.1% 30.0% 1.24x
       5             bank in {ACLEDA, Wing} AND age >= 24.9859 AND score < 550      2,107          26.3% 25.1% 1.03x
      11             bank = ABA AND age in [24.9859, 66.5531) AND score >= 550      1,167          14.6% 22.4% 0.92x
      13                                       age >= 66.5531 AND score >= 550        282           3.5% 20.9% 0.86x
      12 bank in {ACLEDA, Wing} AND age in [24.9859, 66.5531) AND score >= 550      2,273          28.4% 10.7% 0.44x
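
Because the terminal segments partition the population, the population-weighted mean of their rates should reproduce the overall rate. A quick check using the figures from the table above (rates at their displayed precision, so the result is approximate):

```python
# (population, rate) per terminal segment, copied from the summary table
segments = [
    (530, 0.483), (188, 0.415), (1043, 0.379), (410, 0.300),
    (2107, 0.251), (1167, 0.224), (282, 0.209), (2273, 0.107),
]

total = sum(pop for pop, _ in segments)
overall_rate = sum(pop * rate for pop, rate in segments) / total

print(total)                    # 8000
print(round(overall_rate, 3))   # 0.243

# Lift of the top segment relative to the overall rate
top_lift = segments[0][1] / overall_rate
print(round(top_lift, 2))       # 1.99
```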

seg.segments()

Each Segment is a small object you can read off directly:

top = seg.segments()[0]
top.description     # 'age < 24.9859 AND score < 550'
top.rate            # 0.4830...   (48.3% 90+DPD rate)
top.population      # 530.0
top.population_pct  # 0.06625     (6.6% of the book)
top.lift            # 1.99        (vs the 24.3% overall rate)
top.node_id         # 2
top.rules           # [{'variable': 'age', 'label': 'age < 24.9859', 'data': [...]},
                    #  {'variable': 'score', 'label': 'score < 550', 'data': [...]}]

seg.plot("tree.png")

Each node shows its population (count + % of total) and rate; each branch is labelled with the choice that leads into it; node colour encodes the rate:

[Figure: CHAID segmentation tree]

seg.predict(df, with_rate=True)

Assigns every row to its terminal segment (and, optionally, that segment's rate):

>>> seg.predict(df_new, with_rate=True).head()
   node_id      rate
0       11  0.223650
1        4  0.378715
2       12  0.106907
3        5  0.250593
4        4  0.378715

Rows that match no segment (e.g. an unseen category at predict time) come back as <NA>.
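
The matching behaviour can be pictured with a small stand-alone sketch. The data structures and the assign function below are hypothetical and are not the package's internal representation; the two segments reuse node ids 4 and 5 from the summary table, and "Canadia" stands in for a category that was not present at fit time.

```python
# Hypothetical fitted segments: each maps a variable to the set of bin labels
# or categories it accepts.
SEGMENTS = {
    4: {"bank": {"ABA"}, "age_bin": {">= 24.9859"}, "score_bin": {"< 550"}},
    5: {"bank": {"ACLEDA", "Wing"}, "age_bin": {">= 24.9859"}, "score_bin": {"< 550"}},
}
RATES = {4: 0.378715, 5: 0.250593}


def assign(row):
    """Return the node_id of the first segment whose rules all match, else None."""
    for node_id, rules in SEGMENTS.items():
        if all(row.get(var) in allowed for var, allowed in rules.items()):
            return node_id
    return None


rows = [
    {"bank": "ABA", "age_bin": ">= 24.9859", "score_bin": "< 550"},
    {"bank": "Wing", "age_bin": ">= 24.9859", "score_bin": "< 550"},
    {"bank": "Canadia", "age_bin": ">= 24.9859", "score_bin": "< 550"},  # unseen
]
assigned = [assign(r) for r in rows]
print(assigned)                           # [4, 5, None]
print([RATES.get(n) for n in assigned])   # [0.378715, 0.250593, None]
```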

Binning methods

Each predictor's method selects how it is turned into branches:

method | Spec keys | Description
--- | --- | ---
target | max_bins | Supervised optimal binning via optbinning (monotonic event rate). Works on numeric columns (rate-ordered ranges) and on high-cardinality categoricals (groups categories into rate tiers). Needs the segmenter-target extra.
equal_width | bins | Fixed-width intervals across the value range.
equal_frequency | bins | Quantile bins of roughly equal population.
manual | edges | User-supplied interior cut points.
nominal | (none) | Categorical predictor, used as-is (no binning).
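
To make the difference between equal_width and equal_frequency concrete, here is a minimal sketch of how interior cut points could be derived for each. The function names are hypothetical and the quantile rule (nearest rank, no interpolation) is an assumption; the package may compute its edges differently.

```python
def equal_width_edges(values, bins):
    """Interior cut points that split [min, max] into `bins` equal intervals."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / bins
    return [lo + step * i for i in range(1, bins)]


def equal_frequency_edges(values, bins):
    """Interior cut points at the i/bins quantiles (nearest rank, no interpolation)."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(n * i) // bins] for i in range(1, bins)]


incomes = [200, 250, 300, 320, 350, 400, 450, 500, 900, 2000, 4000, 8000]

width_edges = equal_width_edges(incomes, 4)
freq_edges = equal_frequency_edges(incomes, 4)
print(width_edges)   # [2150.0, 4100.0, 6050.0]
print(freq_edges)    # [320, 450, 2000]
```

On skewed data like this, the equal-width edges leave ten of the twelve values in the first bin, while the quantile edges put three values in each bin.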

Continuous predictors are converted to ordinal bin codes, so only contiguous bins ever merge and every branch renders as a clean range (age < 25, in [25, 40), >= 40). Missing values become their own missing branch.
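
A rough sketch of the idea, assuming half-open [low, high) intervals and a reserved code for missing values. The helper names, the -1 code and the exact label wording are illustrative assumptions, not the package's implementation.

```python
from bisect import bisect_right


def bin_code(value, edges):
    """Ordinal bin index for value; None (missing) gets its own code of -1."""
    if value is None:
        return -1
    return bisect_right(edges, value)


def bin_label(name, code, edges):
    """Readable range for a bin code, using half-open [low, high) intervals."""
    if code == -1:
        return f"{name} is missing"
    if code == 0:
        return f"{name} < {edges[0]}"
    if code == len(edges):
        return f"{name} >= {edges[-1]}"
    return f"{name} in [{edges[code - 1]}, {edges[code]})"


edges = [25, 40]
ages = [19, 25, 39.5, 40, None]
codes = [bin_code(a, edges) for a in ages]
labels = [bin_label("age", c, edges) for c in sorted(set(codes))]
print(codes)    # [0, 1, 1, 2, -1]
print(labels)   # ['age is missing', 'age < 25', 'age in [25, 40)', 'age >= 40']
```

Because the codes are ordered, merging neighbours such as codes 1 and 2 still yields one contiguous range (age >= 25).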

A spec may be written as a bare method string when it takes no options, e.g. "region": "nominal".

High-cardinality IDs (member / institution codes)

A categorical with hundreds of distinct values — a member id, institution code, merchant id — can't be used as plain nominal (CHAID would try to merge hundreds of categories: slow and unreadable). Use target to group it by the event rate into a few risk tiers:

predictors={"MEMBER_ID": "target", "PRODUCT_TYPE": "nominal", ...}

Segments then read like MEMBER_ID in {003, 014, 019, 024, 045, 048, +11} → 40.7% — a concrete high-risk member group. This also happens automatically when you simply list such a column (its cardinality exceeds max_nominal_cardinality); in full-auto mode (predictors omitted) high-cardinality columns are dropped instead, since the tool can't tell a meaningful id from a row identifier like ACCOUNT_ID.
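
The grouping idea can be illustrated with a deliberately simple stand-in: rank categories by event rate and cut the ranking into a few tiers. This is not optbinning's algorithm (which solves an optimisation problem), and rate_tiers is a hypothetical helper written only for this sketch.

```python
from collections import defaultdict


def rate_tiers(categories, events, n_tiers=3):
    """Group categories into n_tiers by event rate (simple rank-based split)."""
    totals, hits = defaultdict(int), defaultdict(int)
    for cat, event in zip(categories, events):
        totals[cat] += 1
        hits[cat] += event
    # Sort by rate, breaking ties by category name for a stable result.
    ranked = sorted(totals, key=lambda c: (hits[c] / totals[c], c))
    tiers = defaultdict(list)
    for rank, cat in enumerate(ranked):
        tiers[rank * n_tiers // len(ranked)].append(cat)
    return dict(tiers)


member_ids = ["003", "003", "014", "014", "019", "019", "024", "024"]
dpd90 = [1, 1, 1, 0, 0, 0, 1, 0]
tiers = rate_tiers(member_ids, dpd90, n_tiers=2)
print(tiers)   # {0: ['019', '014'], 1: ['024', '003']}
```

With only two rows per member the rates are very noisy; a real grouping needs enough observations per category for the tiers to be meaningful.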

Numeric-looking ids. Whether a column is binned as a number or grouped as a category is decided from its dtype. An id like 001..200 read from a CSV becomes int64, so it would be binned into meaningless ranges (MEMBER_ID < 100). Either cast it to text — df["MEMBER_ID"] = df["MEMBER_ID"].astype(str) (or read with pd.read_csv(..., dtype={"MEMBER_ID": str})) — or force grouping with {"method": "target", "categorical": True}.

Targets

  • Binary — pass positive_class (the event value, e.g. 1). Node rate is P(target == positive_class) and lift is rate / overall_rate.
  • Continuous — omit positive_class. Node rate is the mean of the target and lift is mean / overall_mean.
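
The two definitions above amount to the following arithmetic. The helper names are hypothetical and the data is made up for illustration.

```python
def rate(values, positive_class=None):
    """Event rate for a binary target, or the mean for a continuous one."""
    if positive_class is not None:
        return sum(v == positive_class for v in values) / len(values)
    return sum(values) / len(values)


def lift(segment_values, all_values, positive_class=None):
    """Segment rate divided by the overall rate."""
    return rate(segment_values, positive_class) / rate(all_values, positive_class)


# Binary target: 1 marks a 90+DPD account; the segment is the first four rows.
book = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
binary_lift = lift(book[:4], book, positive_class=1)
print(round(binary_lift, 2))       # 2.5  (0.5 segment rate vs 0.2 overall)

# Continuous target: lift compares the segment mean with the overall mean.
balances = [100, 300, 200, 400]
continuous_lift = lift([300, 400], balances)
print(round(continuous_lift, 2))   # 1.4  (350 vs 250)
```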

API reference

ChaidSegmenter(...)

Parameter | Default | Description
--- | --- | ---
target | (required) | Name of the KPI/target column.
predictors | None | A {column: spec} dict, a list of column names (methods inferred from dtype), or None for full auto-select. See Binning methods and the predictor forms described above.
positive_class | None | Event value for a binary target; None ⇒ continuous target.
default_numeric_method | "target" | Binning method for auto-inferred numeric predictors (target / equal_frequency / equal_width).
default_bins | 5 | Bin count for auto-inferred numeric predictors.
max_nominal_cardinality | 20 | In full-auto mode, non-numeric columns with more distinct values are skipped.
max_depth | 3 | Maximum tree depth.
min_child_node_size | 30 | Minimum observations per child. Values in (0, 1) are treated as fractions of the dataset.
min_parent_node_size | None | Minimum observations to split a node; defaults to min_child_node_size. Fractions supported.
alpha_merge | 0.05 | Significance threshold for merging predictor categories.
split_threshold | 0 | Surrogate-split threshold (passed through to the tree engine).
max_splits | None | Maximum number of children per split.
weight | None | Optional weight column; populations and rates use weighted sums.
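
As an illustration of the count-or-fraction convention used by min_child_node_size and min_parent_node_size, a fraction could be resolved into a row count as below. resolve_min_size is a hypothetical helper, and rounding up is an assumption; the package may round differently.

```python
import math


def resolve_min_size(value, n_rows):
    """Turn a count-or-fraction setting into an absolute row count."""
    if 0 < value < 1:
        return math.ceil(value * n_rows)   # rounding up is an assumption
    return int(value)


print(resolve_min_size(0.02, 8000))   # 160
print(resolve_min_size(30, 8000))     # 30
```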

Methods

  • fit(df) — fit the binners and grow the tree from a pandas.DataFrame.
  • segments(sort_by_rate=True) — list of Segment objects.
  • summary() — segments as a tidy DataFrame.
  • segment_rates gives {node_id: rate} for the terminal nodes.
  • predict(df, with_rate=False) — terminal node_id per row (optionally with rate).
  • plot(path=None, **kwargs) — render the tree; returns the matplotlib figure and writes to path if given. Accepts figsize, cmap, dpi, font-size overrides.
  • ChaidSegmenter.from_csv(path, target, predictors, *, read_csv_kwargs=None, **kwargs) and from_parquet(...) — load a file, construct and fit in one call.

Segment

node_id, description, rate, population, population_pct, lift, and a structured rules list of {variable, label, data}.
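
For readers who want a mental model of the object, the fields could be mirrored by a small dataclass. SegmentSketch is a hypothetical stand-in, not the package's class; deriving description by joining rule labels with " AND " is an assumption based on the example output shown earlier, and the data key of each rule is omitted here.

```python
from dataclasses import dataclass, field


@dataclass
class SegmentSketch:
    """Hypothetical stand-in mirroring the documented Segment fields."""
    node_id: int
    rate: float
    population: float
    population_pct: float
    lift: float
    rules: list = field(default_factory=list)

    @property
    def description(self):
        # Join the per-variable labels the way the example description reads.
        return " AND ".join(rule["label"] for rule in self.rules)


top = SegmentSketch(
    node_id=2, rate=0.483, population=530.0, population_pct=0.06625, lift=1.99,
    rules=[{"variable": "age", "label": "age < 24.9859"},
           {"variable": "score", "label": "score < 550"}],
)
print(top.description)   # age < 24.9859 AND score < 550
```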

How it works

ChaidSegmenter fits a Binner per continuous predictor and feeds the resulting integer bin codes (as ordinal columns) plus the nominal predictors into a bundled CHAID tree engine, which splits on the predictor most strongly associated with the target (chi-squared for categorical targets, Bartlett's/Levene's test for continuous targets). Because the bins enter as contiguous ordinal codes, merged groups always describe a single, readable range. Terminal nodes are then translated back into rate/population/lift segments.
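
To give a feel for the chi-squared test behind a split on a binary target, here is a self-contained sketch for the simplest case: a 2x2 table with one degree of freedom and no continuity correction. The counts are illustrative and chi2_2x2 is a hypothetical helper; the bundled engine handles larger tables and may apply different corrections.

```python
import math


def chi2_2x2(table):
    """Pearson chi-squared statistic and p-value for a 2x2 table (1 d.o.f.)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row = (a + b, c + d)
    col = (a + c, b + d)
    stat = 0.0
    for i, observed_row in enumerate(table):
        for j, observed in enumerate(observed_row):
            expected = row[i] * col[j] / n
            stat += (observed - expected) ** 2 / expected
    # For one degree of freedom the chi-squared tail is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Rows = two bins of a predictor, columns = (event, no event).
stat, p = chi2_2x2([[30, 70], [10, 90]])
print(stat)       # 12.5
print(p < 0.05)   # True
```

A small p-value means the two bins differ in event rate, so they stay separate; a p-value above alpha_merge would make them candidates for merging.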

Low-level tree engine

The underlying CHAID tree implementation is bundled and importable as CHAID (from CHAID import Tree) for advanced use — building a tree by hand, exporting classification_rules(), treelib conversion, etc. The segmentation API above is the recommended entry point for the KPI-segmentation workflow.

Testing

pip install -e '.[segmenter-target,parquet,test]'
pytest tests/

Credits & License

This project is built on top of CHAID by Mark Ramotowski, Richard Fitzgerald and contributors. The CHAID/ package in this repository is that upstream implementation, bundled unmodified as the underlying tree engine. The chaid_segmenter/ package — automatic binning, KPI segmentation and the matplotlib/seaborn visualisation — is an original addition.

Distributed under the Apache License 2.0. See LICENSE.txt for the full license text and NOTICE for attribution.
