Population segmentation using CHAID decision trees with automatic predictor binning

CHAID Segmentation

A self-contained Python package for population segmentation using CHAID decision trees. You give it a KPI/target and a set of predictors; it auto-bins the continuous predictors, grows a CHAID tree, and hands back interpretable segments with an event rate (or mean), population, population share and lift — plus a static tree chart where each node is a population and each branch is a choice.

It is built for the workflow: "input a KPI, generate a tree, read off the high/low segments of the customer base." For example, for a 90+DPD target:

age < 25 AND region = Phnom Penh AND bank = ABA → 60% 90+DPD rate, 10% of population, 2.5x lift

The CHAID tree engine is bundled in this repository — there is no external CHAID dependency to install. It is based on the Rambatino/CHAID project (Apache 2.0); see Credits & License.

Features

Automatic binning of continuous predictors — supervised (target-based), equal-width, equal-frequency (quantile) or manual cut points, chosen per variable.
Binary and continuous KPIs — node rate is the event rate P(target = positive) for a binary target, or the mean for a continuous target.
Interpretable segments — every terminal node becomes a readable rule with rate, population, population share and lift.
Static visualisation (matplotlib + seaborn) — node = population, branch = choice, colour = rate.
Predict / score new data by re-applying the fitted bins and rules.
Load straight from CSV or Parquet with one call.

Installation

Requires Python 3.9+. Install from this repository:

pip install .                       # core (numpy, pandas, scipy, matplotlib, seaborn, ...)
pip install '.[segmenter-target]'   # + optbinning, for method="target" (supervised binning)
pip install '.[parquet]'            # + pyarrow, for ChaidSegmenter.from_parquet

matplotlib and seaborn are core dependencies (binning + plotting). optbinning and pyarrow are only needed for target-based binning and Parquet loading respectively, and are imported lazily with a clear error if missing.

How to use

import pandas as pd
from chaid_segmenter import ChaidSegmenter

df = pd.read_csv("loan_book.csv")     # columns: age, income, tenure, score, region, bank, dpd90

seg = ChaidSegmenter(
    target="dpd90",
    positive_class=1,                                    # binary event-rate target
    predictors={
        "age":    {"method": "target", "max_bins": 4},      # supervised (optbinning)
        "income": {"method": "equal_width", "bins": 4},     # fixed-interval bins
        "tenure": {"method": "equal_frequency", "bins": 4}, # quantile bins
        "score":  {"method": "manual", "edges": [550, 650, 750]},
        "region": {"method": "nominal"},                    # categorical, used as-is
        "bank":   {"method": "nominal"},
    },
    max_depth=3,
    min_child_node_size=0.02,         # int count, or a fraction of the dataset
    alpha_merge=0.05,
)
seg.fit(df)

seg.summary()                         # tidy DataFrame, highest rate first
seg.segments()                        # list[Segment]
seg.predict(df_new)                   # df_new: new rows with the same predictor columns; returns each row's terminal segment
seg.plot("tree.png")                  # static matplotlib/seaborn chart

Load and fit in a single call:

seg = ChaidSegmenter.from_csv("loan_book.csv", "dpd90", predictors, positive_class=1)
seg = ChaidSegmenter.from_parquet("loan_book.parquet", "dpd90", predictors, positive_class=1)

A runnable, self-contained demo lives at examples/dpd_segmentation.py.

You don't have to spell out every predictor

predictors accepts three forms — pick whichever is least effort:

# 1. Full control: a spec (or method string) per column
predictors={"age": {"method": "target", "max_bins": 4}, "region": "nominal"}

# 2. Just the column names — the method is inferred from each column's dtype
#    (numeric -> default_numeric_method, non-numeric -> nominal)
predictors=["age", "income", "region", "bank"]

# 3. Omit it entirely — auto-select every column except the target/weight
ChaidSegmenter(target="dpd90", positive_class=1).fit(df)

In full-auto mode, constant columns and high-cardinality text columns (IDs, names, free text: anything with more than max_nominal_cardinality distinct values) are skipped automatically. You can always mix inference with overrides, e.g. {"age": "auto", "score": {"method": "manual", "edges": [550, 650]}}, and inspect what was chosen via seg.resolved_predictors after fit. Inferred numeric columns use default_numeric_method, which defaults to "target"; set it to "equal_frequency" or "equal_width" if you prefer unsupervised binning.
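
The inference rules described above can be sketched in plain Python. This is an illustration only, not the package's code: `infer_methods` is a hypothetical helper, and a `{name: list_of_values}` dict stands in for the DataFrame.

```python
from numbers import Number


def infer_methods(columns, target, default_numeric_method="target",
                  max_nominal_cardinality=20):
    """Map each usable column to an inferred method (full-auto mode sketch).

    `columns` is a {name: list_of_values} dict standing in for a DataFrame.
    """
    methods = {}
    for name, values in columns.items():
        if name == target:
            continue  # the target is never a predictor
        present = [v for v in values if v is not None]
        distinct = set(present)
        if len(distinct) <= 1:
            continue  # constant column: nothing to split on
        is_numeric = all(
            isinstance(v, Number) and not isinstance(v, bool) for v in present
        )
        if is_numeric:
            methods[name] = default_numeric_method
        elif len(distinct) > max_nominal_cardinality:
            continue  # high-cardinality text (IDs, names): skipped
        else:
            methods[name] = "nominal"
    return methods


columns = {
    "age": [23, 41, 35, 58],
    "region": ["Phnom Penh", "Siem Reap", "Phnom Penh", "Battambang"],
    "country": ["KH", "KH", "KH", "KH"],        # constant
    "account_id": ["A1", "A2", "A3", "A4"],     # unique per row
    "dpd90": [1, 0, 0, 1],
}
chosen = infer_methods(columns, target="dpd90", max_nominal_cardinality=3)
print(chosen)  # {'age': 'target', 'region': 'nominal'}
```

The cutoff is lowered to 3 here only so that the four-row toy data triggers the high-cardinality rule.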

Expected output

`seg.summary()`

A pandas.DataFrame, one row per terminal segment, highest rate first:

 node_id                                                           description population population_pct  rate  lift
       2                                         age < 24.9859 AND score < 550        530           6.6% 48.3% 1.99x
       8                         bank = ABA AND age < 24.9859 AND score >= 550        188           2.4% 41.5% 1.71x
       4                         bank = ABA AND age >= 24.9859 AND score < 550      1,043          13.0% 37.9% 1.56x
       9             bank in {ACLEDA, Wing} AND age < 24.9859 AND score >= 550        410           5.1% 30.0% 1.24x
       5             bank in {ACLEDA, Wing} AND age >= 24.9859 AND score < 550      2,107          26.3% 25.1% 1.03x
      11             bank = ABA AND age in [24.9859, 66.5531) AND score >= 550      1,167          14.6% 22.4% 0.92x
      13                                       age >= 66.5531 AND score >= 550        282           3.5% 20.9% 0.86x
      12 bank in {ACLEDA, Wing} AND age in [24.9859, 66.5531) AND score >= 550      2,273          28.4% 10.7% 0.44x
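
The columns of this table are tied together by simple arithmetic, which the following sketch checks using the populations and (rounded) rates copied from the rows above. The overall rate is recovered as the population-weighted average of the segment rates; because the printed rates are rounded to one decimal, the recomputed lifts may differ from the table in the last digit.

```python
# (population, rate) pairs copied from the summary table above
segments = [
    (530, 0.483), (188, 0.415), (1043, 0.379), (410, 0.300),
    (2107, 0.251), (1167, 0.224), (282, 0.209), (2273, 0.107),
]

total_population = sum(pop for pop, _ in segments)

# Overall rate = population-weighted average of the segment rates
overall_rate = sum(pop * rate for pop, rate in segments) / total_population

for pop, rate in segments:
    share = pop / total_population   # population_pct
    lift = rate / overall_rate       # lift relative to the whole book
    print(f"{pop:>6,} {share:6.1%} {rate:6.1%} {lift:.2f}x")

print(f"total={total_population:,} overall_rate={overall_rate:.1%}")
```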

`seg.segments()`

Each Segment is a small object you can read off directly:

top = seg.segments()[0]
top.description     # 'age < 24.9859 AND score < 550'
top.rate            # 0.4830...   (48.3% 90+DPD rate)
top.population      # 530.0
top.population_pct  # 0.06625     (6.6% of the book)
top.lift            # 1.99        (vs the 24.3% overall rate)
top.node_id         # 2
top.rules           # [{'variable': 'age', 'label': 'age < 24.9859', 'data': [...]},
                    #  {'variable': 'score', 'label': 'score < 550', 'data': [...]}]

`seg.plot("tree.png")`

Each node shows its population (count + % of total) and rate; each branch is labelled with the choice that leads into it; node colour encodes the rate:

[Figure: CHAID segmentation tree]

`seg.predict(df, with_rate=True)`

Assigns every row to its terminal segment (and, optionally, that segment's rate):

>>> seg.predict(df_new, with_rate=True).head()
   node_id      rate
0       11  0.223650
1        4  0.378715
2       12  0.106907
3        5  0.250593
4        4  0.378715

Rows that match no segment (e.g. an unseen category at predict time) come back as <NA>.
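
Conceptually, scoring is a rule lookup: a row belongs to the segment whose rules it satisfies, and to none if no rule set matches. The sketch below illustrates that idea with three of the segments from the summary table. The rule representation and `assign_segment` are hypothetical, and `None` stands in for the `<NA>` the package returns.

```python
# Hypothetical rule representation: node_id -> list of (variable, predicate) pairs
SEGMENT_RULES = {
    2: [("age", lambda v: v < 24.9859), ("score", lambda v: v < 550)],
    8: [("bank", lambda v: v == "ABA"),
        ("age", lambda v: v < 24.9859),
        ("score", lambda v: v >= 550)],
    9: [("bank", lambda v: v in {"ACLEDA", "Wing"}),
        ("age", lambda v: v < 24.9859),
        ("score", lambda v: v >= 550)],
}


def assign_segment(row, segment_rules):
    """Return the node_id whose rules all match the row, or None if none do."""
    for node_id, rules in segment_rules.items():
        if all(predicate(row[variable]) for variable, predicate in rules):
            return node_id
    return None


rows = [
    {"age": 22, "score": 500, "bank": "ABA"},
    {"age": 22, "score": 600, "bank": "ABA"},
    {"age": 22, "score": 600, "bank": "Canadia"},  # bank not seen in any rule
]
assigned = [assign_segment(row, SEGMENT_RULES) for row in rows]
print(assigned)  # [2, 8, None]
```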

Binning methods

Each predictor's method selects how it is turned into branches:

`method`	Spec keys	Description
`target`	`max_bins`	Supervised optimal binning via optbinning — monotonic event rate. Works on numeric columns (rate-ordered ranges) and on high-cardinality categoricals (groups categories into rate tiers). Needs the `segmenter-target` extra.
`equal_width`	`bins`	Fixed-width intervals across the value range.
`equal_frequency`	`bins`	Quantile bins of roughly equal population.
`manual`	`edges`	User-supplied interior cut points.
`nominal`	—	Categorical predictor, used as-is (no binning).

Continuous predictors are converted to ordinal bin codes, so only contiguous bins ever merge and every branch renders as a clean range (age < 25, in [25, 40), >= 40). Missing values become their own missing branch.
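
To make the ordinal-code idea concrete, here is a minimal stdlib sketch of the three unsupervised methods. It is a simplification under stated assumptions, not the package's Binner: the function names are hypothetical, bins are left-closed as in the labels above, and the equal-frequency edges are taken from the sorted values without the interpolation a real quantile routine would use.

```python
from bisect import bisect_right


def equal_width_edges(values, bins):
    """Interior cut points that split [min, max] into equally wide intervals."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / bins
    return [lo + step * i for i in range(1, bins)]


def equal_frequency_edges(values, bins):
    """Interior cut points taken from the sorted values (no interpolation)."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(n * i) // bins] for i in range(1, bins)]


def bin_code(value, edges):
    """Ordinal code of the left-closed bin containing value; None if missing."""
    if value is None:
        return None
    return bisect_right(edges, value)


def bin_label(name, code, edges):
    """Render a bin code as a readable range."""
    if code is None:
        return f"{name} is missing"
    if code == 0:
        return f"{name} < {edges[0]}"
    if code == len(edges):
        return f"{name} >= {edges[-1]}"
    return f"{name} in [{edges[code - 1]}, {edges[code]})"


manual_edges = [550, 650, 750]  # "manual": user-supplied interior cut points
labels = [bin_label("score", bin_code(s, manual_edges), manual_edges)
          for s in (549, 550, 800, None)]
print(labels)
print(equal_width_edges([0, 10, 20, 40], bins=4))
print(equal_frequency_edges([1, 2, 3, 4, 5, 6, 7, 8], bins=4))
```

Because each value maps to an integer code that follows the numeric order, merging neighbouring codes always yields one contiguous range.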

A spec may be written as a bare method string when it takes no options, e.g. "region": "nominal".

High-cardinality IDs (member / institution codes)

A categorical with hundreds of distinct values — a member id, institution code, merchant id — can't be used as plain nominal (CHAID would try to merge hundreds of categories: slow and unreadable). Use target to group it by the event rate into a few risk tiers:

predictors={"MEMBER_ID": "target", "PRODUCT_TYPE": "nominal", ...}

Segments then read like MEMBER_ID in {003, 014, 019, 024, 045, 048, +11} → 40.7% — a concrete high-risk member group. This also happens automatically when you simply list such a column (its cardinality exceeds max_nominal_cardinality); in full-auto mode (predictors omitted) high-cardinality columns are dropped instead, since the tool can't tell a meaningful id from a row identifier like ACCOUNT_ID.
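
The idea of grouping categories by event rate can be illustrated with a deliberately simple stand-in. The sketch below ranks categories by their observed event rate and cuts the ranking into equally sized tiers; optbinning solves an optimisation problem instead, so treat `rate_tiers` as a hypothetical illustration of the concept rather than the method the package uses.

```python
from collections import defaultdict


def rate_tiers(categories, events, n_tiers=3):
    """Assign each category a tier (0 = lowest event rate) by ranked rate."""
    counts = defaultdict(lambda: [0, 0])  # category -> [event count, row count]
    for category, event in zip(categories, events):
        counts[category][0] += event
        counts[category][1] += 1
    # Rank by event rate; break ties by category name for determinism
    ranked = sorted(counts, key=lambda c: (counts[c][0] / counts[c][1], c))
    return {
        category: position * n_tiers // len(ranked)
        for position, category in enumerate(ranked)
    }


members = ["003", "003", "014", "014", "019", "019",
           "024", "024", "045", "045", "048", "048"]
events = [1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1]

tiers = rate_tiers(members, events, n_tiers=3)
high_risk = sorted(m for m, tier in tiers.items() if tier == 2)
print(high_risk)  # ['003', '045']
```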

Numeric-looking ids. Whether a column is binned as a number or grouped as a category is decided from its dtype. An id like 001..200 read from a CSV becomes int64, so it would be binned into meaningless ranges (MEMBER_ID < 100). Either cast it to text — df["MEMBER_ID"] = df["MEMBER_ID"].astype(str) (or read with pd.read_csv(..., dtype={"MEMBER_ID": str})) — or force grouping with {"method": "target", "categorical": True}.

Targets

Binary — pass positive_class (the event value, e.g. 1). Node rate is P(target == positive_class) and lift is rate / overall_rate.
Continuous — omit positive_class. Node rate is the mean of the target and lift is mean / overall_mean.
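
Both definitions reduce to a few lines of arithmetic. The sketch below is a hedged illustration with hypothetical helper names (`node_rate`, `lift`); the optional `weights` argument reflects the weighted-sum behaviour described for the weight parameter.

```python
def node_rate(values, positive_class=None, weights=None):
    """Event rate (binary target) or mean (continuous target) of a node."""
    if weights is None:
        weights = [1.0] * len(values)
    if positive_class is not None:
        # Binary target: turn each value into a 0/1 event indicator
        values = [1.0 if v == positive_class else 0.0 for v in values]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def lift(node_values, all_values, positive_class=None):
    """Node rate divided by the overall rate."""
    return (node_rate(node_values, positive_class)
            / node_rate(all_values, positive_class))


# Binary target: 2 events in 10 rows overall, 1 event in a 2-row node
all_flags = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
node_flags = all_flags[:2]
binary_lift = lift(node_flags, all_flags, positive_class=1)

# Continuous target: lift is the node mean over the overall mean
all_balances = [100, 200, 300, 400]
node_balances = [300, 400]
continuous_lift = lift(node_balances, all_balances)

print(binary_lift, continuous_lift)
```

Read in reverse, the same formula says that a segment with a 60% rate and 2.5x lift implies an overall rate of 0.60 / 2.5 = 24%.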

API reference

`ChaidSegmenter(...)`

Parameter	Default	Description
`target`	—	Name of the KPI/target column.
`predictors`	`None`	A `{column: spec}` dict, a list of column names (methods inferred from dtype), or `None` for full auto-select. See "Binning methods" and "You don't have to spell out every predictor".
`positive_class`	`None`	Event value for a binary target; `None` ⇒ continuous target.
`default_numeric_method`	`"target"`	Binning method for auto-inferred numeric predictors (`target` / `equal_frequency` / `equal_width`).
`default_bins`	`5`	Bin count for auto-inferred numeric predictors.
`max_nominal_cardinality`	`20`	Cardinality cutoff for non-numeric columns. In full-auto mode, columns with more distinct values are skipped; columns listed by name are grouped with `target` instead.
`max_depth`	`3`	Maximum tree depth.
`min_child_node_size`	`30`	Minimum observations per child. Values in `(0, 1)` are treated as fractions of the dataset.
`min_parent_node_size`	`None`	Minimum observations to split a node; defaults to `min_child_node_size`. Fractions supported.
`alpha_merge`	`0.05`	Significance threshold for merging predictor categories.
`split_threshold`	`0`	Surrogate-split threshold (passed through to the tree engine).
`max_splits`	`None`	Maximum number of children per split.
`weight`	`None`	Optional weight column; populations and rates use weighted sums.
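
As an illustration of the fraction convention for `min_child_node_size` and `min_parent_node_size`, the sketch below converts a setting into an absolute row count. `resolve_min_size` is a hypothetical helper, and rounding up is an assumption; the package may round differently.

```python
import math


def resolve_min_size(value, n_rows):
    """Turn a node-size setting into a row count.

    Values strictly between 0 and 1 are read as a fraction of the dataset
    (rounded up here, which is an assumption); other values are row counts.
    """
    if 0 < value < 1:
        return max(1, math.ceil(value * n_rows))
    return int(value)


print(resolve_min_size(0.02, 8000))  # 160
print(resolve_min_size(30, 8000))    # 30
```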

Methods

fit(df) — fit the binners and grow the tree from a pandas.DataFrame.
segments(sort_by_rate=True) — list of Segment objects.
summary() — segments as a tidy DataFrame.
segment_rates — {node_id: rate} for the terminal nodes.
predict(df, with_rate=False) — terminal node_id per row (optionally with rate).
plot(path=None, **kwargs) — render the tree; returns the matplotlib figure and writes to path if given. Accepts figsize, cmap, dpi, font-size overrides.
ChaidSegmenter.from_csv(path, target, predictors, *, read_csv_kwargs=None, **kwargs) and from_parquet(...) — load a file, construct and fit in one call.

`Segment`

node_id, description, rate, population, population_pct, lift, and a structured rules list of {variable, label, data}.

How it works

ChaidSegmenter fits a Binner per continuous predictor and feeds the resulting integer bin codes (as ordinal columns) plus the nominal predictors into a bundled CHAID tree engine, which splits on the predictor most strongly associated with the target (chi-squared for categorical targets, Bartlett's/Levene's test for continuous targets). Because the bins enter as contiguous ordinal codes, merged groups always describe a single, readable range. Terminal nodes are then translated back into rate/population/lift segments.
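
For a binary target, the "most strongly associated" choice can be illustrated with a Pearson chi-squared test on a 2x2 table of branch versus outcome. This sketch uses invented counts and a hypothetical helper, and it leaves out what the bundled engine adds on top, such as category merging and any p-value adjustment; it only shows how two candidate predictors would be compared.

```python
import math


def chi_squared_2x2(table):
    """Pearson chi-squared statistic and p-value for a 2x2 contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # Survival function of chi-squared with 1 degree of freedom
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Rows: the two branches of a candidate split; columns: (event, non-event).
# Counts are invented for illustration.
candidates = {
    "age":  [[50, 50], [50, 350]],   # age < 25 vs age >= 25
    "bank": [[25, 75], [75, 325]],   # ABA vs other banks
}
results = {name: chi_squared_2x2(table) for name, table in candidates.items()}
best = min(results, key=lambda name: results[name][1])  # smallest p-value wins

for name, (stat, p_value) in results.items():
    print(f"{name}: chi2={stat:.4f} p={p_value:.3g}")
print("split on:", best)
```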

Low-level tree engine

The underlying CHAID tree implementation is bundled and importable as CHAID (from CHAID import Tree) for advanced use — building a tree by hand, exporting classification_rules(), treelib conversion, etc. The segmentation API above is the recommended entry point for the KPI-segmentation workflow.

Testing

pip install -e '.[segmenter-target,parquet,test]'
pytest tests/

Credits & License

This project is built on top of CHAID by Mark Ramotowski, Richard Fitzgerald and contributors. The CHAID/ package in this repository is that upstream implementation, bundled unmodified as the underlying tree engine. The chaid_segmenter/ package — automatic binning, KPI segmentation and the matplotlib/seaborn visualisation — is an original addition.

Distributed under the Apache License 2.0. See LICENSE.txt for the full license text and NOTICE for attribution.

Release history

0.1.3: Jul 3, 2026
0.1.2: Jun 26, 2026
0.1.1: Jun 26, 2026 (this version)
0.1.0: Jun 26, 2026

Download files

Source Distribution

chaid_segmenter-0.1.1.tar.gz (62.8 kB view details)

Uploaded Jun 26, 2026 Source

Built Distribution

chaid_segmenter-0.1.1-py3-none-any.whl (43.8 kB view details)

Uploaded Jun 26, 2026 Python 3

File details

Details for the file chaid_segmenter-0.1.1.tar.gz.

File metadata

Download URL: chaid_segmenter-0.1.1.tar.gz
Upload date: Jun 26, 2026
Size: 62.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for chaid_segmenter-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`22e8d319b2114b1ca9f2d0d0bd3584dc5a8cbae45422abbc72fc49c5ef53e96e`
MD5	`25e054e76fef5e64543cb8dd81c3f6ae`
BLAKE2b-256	`ed7c984e622b3680cefba62df342bfb4a37a229076f32196c27606df42edc0e9`

File details

Details for the file chaid_segmenter-0.1.1-py3-none-any.whl.

File metadata

Download URL: chaid_segmenter-0.1.1-py3-none-any.whl
Upload date: Jun 26, 2026
Size: 43.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for chaid_segmenter-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b0337b8aaf03c10efb389f7c75b56981ef0f891fd9b3e6f1fab058c7aeb30631`
MD5	`165bff81334f43e2e6b016075b943e43`
BLAKE2b-256	`cc28968336467c4d31f7a91ca008c474c660c4f7c08c169e50087eda57e783ec`
