Automatic Discretization of Features with Optimal Target Association

Project description

AutoCarver in one loop: discretize, rank groupings, carve

AutoCarver automates supervised feature discretization (binning) to maximize statistical association with your target — using Tschuprow's T or Cramér's V — and validates the chosen bins against a held-out dev set. It supports binary classification, multiclass classification, and regression, and is widely used for credit scoring, fraud detection, and risk modeling.
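
To make the two association measures concrete, here is a minimal pure-Python sketch of their textbook definitions on a bins-by-target contingency table: Cramér's V = sqrt(chi2 / (n * min(r - 1, c - 1))) and Tschuprow's T = sqrt(chi2 / (n * sqrt((r - 1) * (c - 1)))). It is an illustration only, not AutoCarver's implementation; the function names and the toy table are assumptions made for this example.

```python
from math import sqrt


def chi2_statistic(table):
    """Pearson chi-squared statistic and total count of a contingency table.

    `table` is a list of rows (one per bin), each a list of counts per target class.
    Assumes every row and every column has a non-zero total.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n


def cramers_v(table):
    chi2, n = chi2_statistic(table)
    r, c = len(table), len(table[0])
    return sqrt(chi2 / (n * min(r - 1, c - 1)))


def tschuprows_t(table):
    chi2, n = chi2_statistic(table)
    r, c = len(table), len(table[0])
    return sqrt(chi2 / (n * sqrt((r - 1) * (c - 1))))


# Toy example: 3 bins (rows) against a binary target (columns).
table = [[30, 10], [20, 20], [10, 30]]
print(round(cramers_v(table), 3))     # 0.408
print(round(tschuprows_t(table), 3))  # 0.343
```

With a binary target, Tschuprow's T divides by sqrt(r - 1) where Cramér's V divides by 1, so T is never larger than V and drops as the number of bins grows unless the extra bins add enough association.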

🆕 What's New

🤖 LLM & MCP integration. AutoCarver now ships a local Model Context Protocol server: point an MCP-aware assistant (VS Code Copilot, Claude Desktop, Cursor, …) at a data file and let it qualify the columns and carve them against your target through tool calls. The server runs fully on your machine — your dataset is never sent to AutoCarver or any external service (only your own LLM provider sees what the assistant shares). Carving quality depends on the LLM, so have a human confirm the feature definitions before production use. See the LLM & MCP guide.

pip install "autocarver[mcp]"

Install

pip install autocarver

Quick Start

Binary classification on the Titanic dataset:

import pandas as pd
from sklearn.model_selection import train_test_split

from AutoCarver import BinaryCarver, Features

# 1. Load data
url = "https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv"
data = pd.read_csv(url)
target = "Survived"

# 2. Train / dev split, stratified on the target
train, dev = train_test_split(data, test_size=0.33, random_state=42, stratify=data[target])

# 3. Declare features by type
features = Features(
    categoricals=["Sex"],
    numericals=["Age", "Fare", "Siblings/Spouses Aboard", "Parents/Children Aboard"],
    ordinals={"Pclass": ["1", "2", "3"]},
)

# 4. Fit the carver (dev set drives the robustness checks)
carver = BinaryCarver(features=features, min_freq=0.05, max_n_mod=5)
train_processed = carver.fit_transform(train, train[target], X_dev=dev, y_dev=dev[target])
dev_processed = carver.transform(dev)

# 5. Inspect the carved buckets, target rate, and association
print(carver.summary)

# 6. Persist for later use
carver.save("titanic_carver.json")
# carver = BinaryCarver.load("titanic_carver.json")

For multiclass classification use MulticlassCarver; for regression use ContinuousCarver — the API is identical. To pre-select features by target association and inter-feature redundancy, pipe the carved output through ClassificationSelector or RegressionSelector.
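
The idea of pre-selecting by target association and inter-feature redundancy can be sketched in plain Python. This is a conceptual illustration, not the ClassificationSelector API: `select_features` and its `max_redundancy` threshold are hypothetical names, and using Cramér's V for both relevance and redundancy is an assumption made for this example.

```python
from collections import Counter
from math import sqrt


def cramers_v(xs, ys):
    """Cramér's V between two equally long sequences of labels."""
    n = len(xs)
    x_counts, y_counts = Counter(xs), Counter(ys)
    if len(x_counts) < 2 or len(y_counts) < 2:
        return 0.0  # a constant column carries no association
    pair_counts = Counter(zip(xs, ys))
    chi2 = 0.0
    for x, nx in x_counts.items():
        for y, ny in y_counts.items():
            expected = nx * ny / n
            chi2 += (pair_counts[(x, y)] - expected) ** 2 / expected
    return sqrt(chi2 / (n * (min(len(x_counts), len(y_counts)) - 1)))


def select_features(columns, target, max_redundancy=0.8):
    """Rank features by association with the target, then keep a feature
    only if it is not too associated with any feature already kept."""
    ranked = sorted(columns, key=lambda name: cramers_v(columns[name], target), reverse=True)
    kept = []
    for name in ranked:
        if all(cramers_v(columns[name], columns[other]) < max_redundancy for other in kept):
            kept.append(name)
    return kept


# Toy data: "a_copy" duplicates the information in "a"; "b" is weaker but distinct.
target = [0, 0, 0, 0, 1, 1, 1, 1]
columns = {
    "a": ["lo", "lo", "lo", "lo", "hi", "hi", "hi", "hi"],
    "a_copy": ["x", "x", "x", "x", "y", "y", "y", "y"],
    "b": ["u", "u", "v", "v", "u", "v", "v", "v"],
}
kept = select_features(columns, target)
print(kept)  # ['a', 'b']
```

The greedy pass keeps the strongest feature of each redundant cluster, which is why "a_copy" is dropped even though it is as strongly associated with the target as "a".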

Why AutoCarver?

  • Optimal supervised binning: exhaustive search over admissible bin combinations maximizes Tschuprow's T (default) or Cramér's V. For fixed min_freq, max_n_mod and metric, no other combination scores higher.
  • Robust to data drift: every candidate bin combination is validated on a dev set, rejecting any whose target rates flip or whose buckets fall below min_freq.
  • First-class ordinal features: OrdinalDiscretizer enforces your declared modality order, so under-represented levels are merged with their nearest neighbour instead of being collapsed by frequency.
  • Inspect what was carved: features.summary and features.history give you the bin definitions, per-bin target rate / frequency, and the full carving trace right off the fitted carver.
  • Interpretable buckets: human-readable boundaries you can audit, document, and ship to a scorecard.
  • Dimensionality reduction: groups under-represented modalities and caps bins per feature (max_n_mod), which is especially useful before one-hot encoding.
  • Feature pre-selection: ClassificationSelector / RegressionSelector rank features by target association and filter on inter-feature correlation.
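
The first two points in the list above (exhaustive search, then dev-set validation) can be sketched for a single ordered feature and a binary target. This is a simplified illustration under stated assumptions, not AutoCarver's code: the function names are hypothetical, groups are restricted to consecutive levels, min_freq is assumed to be positive, and "target rates flip" is read as "the ranking of groups by target rate differs between train and dev".

```python
from itertools import combinations
from math import sqrt


def merge(counts, cuts):
    """Sum per-level [negatives, positives] counts into consecutive groups split at `cuts`."""
    bounds = [0, *cuts, len(counts)]
    return [
        [sum(col) for col in zip(*counts[start:stop])]
        for start, stop in zip(bounds, bounds[1:])
    ]


def tschuprows_t(table):
    """Tschuprow's T of a groups-by-class table (both classes assumed present)."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    chi2 = sum(
        (obs - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
        for i, r in enumerate(table)
        for j, obs in enumerate(r)
    )
    return sqrt(chi2 / (n * sqrt((len(rows) - 1) * (len(cols) - 1))))


def frequent_enough(table, min_freq):
    n = sum(map(sum, table))
    return all(sum(row) / n >= min_freq for row in table)


def rate_order(table):
    """Indices of the groups sorted by target rate (share of positives)."""
    rates = [row[1] / sum(row) for row in table]
    return sorted(range(len(rates)), key=rates.__getitem__)


def carve(train, dev, min_freq=0.05, max_n_mod=5):
    # 1. Enumerate every admissible grouping of consecutive levels on train.
    candidates = []
    for n_groups in range(2, min(max_n_mod, len(train)) + 1):
        for cuts in combinations(range(1, len(train)), n_groups - 1):
            table = merge(train, cuts)
            if frequent_enough(table, min_freq):
                candidates.append((tschuprows_t(table), cuts))
    # 2. Best association first; the first grouping that survives the dev check wins.
    for score, cuts in sorted(candidates, reverse=True):
        dev_table = merge(dev, cuts)
        if frequent_enough(dev_table, min_freq) and rate_order(dev_table) == rate_order(merge(train, cuts)):
            return cuts, score
    return None


# Toy data: five ordered levels, each given as [negatives, positives].
train = [[40, 10], [35, 15], [30, 20], [12, 38], [10, 40]]
dev = [[20, 5], [16, 9], [17, 8], [6, 19], [5, 20]]
best = carve(train, dev)
print(best)
```

On this toy data the winning grouping is levels 0 to 2 against levels 3 to 4 (cuts `(3,)`). If the dev target rates run in the opposite direction to train, every candidate is rejected and `carve` returns `None`.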

How does it compare?

| Criterion | AutoCarver | optbinning | sklearn KBinsDiscretizer |
|---|---|---|---|
| Supervised (uses y) | yes | yes | no |
| Algorithm | exhaustive search over admissible combinations | mixed-integer program (CBC) | quantile / uniform / k-means |
| Optimality for given min_freq / max_n_mod / metric | guaranteed: best of every admissible combination | provably optimal under MIP constraints | n/a (no target objective) |
| Target types | binary, multiclass, continuous | binary, multiclass, continuous | n/a |
| Numeric and categorical and ordinal in one fit | yes | one binner per feature | numeric only |
| Ordinal features with enforced order | yes: OrdinalDiscretizer preserves your declared order | via user_splits workaround (loses ordering) | no |
| NaN handled as its own modality | yes | yes | no (raises) |
| Held-out dev-set robustness check | yes (built into fit) | no (script CV yourself) | no |
| Per-bin stats + carving history after fit | features.summary, features.history | binning_table | no |
| JSON round-trip persistence | yes (carver.save("...json")) | via pickle | via pickle |
| sklearn Pipeline compatible | yes | yes | yes |
| Feature pre-selection helpers | ClassificationSelector, RegressionSelector | no | no |

Side-by-side runnable snippets and a "when to pick which" guide live on the comparison page.
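
The "ordinal features with enforced order" criterion comes down to merging a rare level only with an adjacent one. A minimal sketch of that idea follows; it is not OrdinalDiscretizer's implementation. The function name is hypothetical, and choosing the less frequent of the two adjacent neighbours is an assumption made for this example.

```python
def merge_rare_levels(ordered_counts, min_freq=0.05):
    """Merge under-represented levels with an adjacent level, keeping the declared order.

    `ordered_counts` is a list of (label, count) pairs in the declared order.
    Returns a list of (labels_in_group, count) pairs.
    """
    total = sum(count for _, count in ordered_counts)
    groups = [([label], count) for label, count in ordered_counts]
    while len(groups) > 1:
        # Find the rarest group; stop once every group is frequent enough.
        idx = min(range(len(groups)), key=lambda i: groups[i][1])
        if groups[idx][1] / total >= min_freq:
            break
        # Only adjacent groups are eligible, so the order is never broken.
        neighbours = [i for i in (idx - 1, idx + 1) if 0 <= i < len(groups)]
        other = min(neighbours, key=lambda i: groups[i][1])
        lo, hi = sorted((idx, other))
        merged = (groups[lo][0] + groups[hi][0], groups[lo][1] + groups[hi][1])
        groups[lo:hi + 1] = [merged]
    return groups


levels = [("low", 500), ("medium", 20), ("high", 300), ("very high", 180)]
merged_levels = merge_rare_levels(levels)
print(merged_levels)
# [(['low'], 500), (['medium', 'high'], 320), (['very high'], 180)]
```

A frequency-only grouping could have pooled "medium" with any other rare label regardless of position; restricting merges to neighbours keeps every resulting bucket a contiguous range of the declared order.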

Documentation

Full reference, tutorials, and end-to-end notebook examples on ReadTheDocs.

Download files

Source Distribution

autocarver-7.3.4.tar.gz (130.7 kB)

Uploaded Source

Built Distribution

autocarver-7.3.4-py3-none-any.whl (175.4 kB)

Uploaded Python 3

File details

Details for the file autocarver-7.3.4.tar.gz.

File metadata

  • Download URL: autocarver-7.3.4.tar.gz
  • Upload date:
  • Size: 130.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for autocarver-7.3.4.tar.gz
Algorithm Hash digest
SHA256 dc53e37b864f05d021917ba66c6d923c643d6f068e3311a23d57d7feab301325
MD5 2a7f4b3c2dc4a04c8e177815aaf13012
BLAKE2b-256 29874c64c0beaae0539b8e6653359046721b647f8ec7a21a9a016d356c4091b1

Provenance

The following attestation bundles were made for autocarver-7.3.4.tar.gz:

Publisher: release.yml on mdefrance/AutoCarver

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file autocarver-7.3.4-py3-none-any.whl.

File metadata

  • Download URL: autocarver-7.3.4-py3-none-any.whl
  • Upload date:
  • Size: 175.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for autocarver-7.3.4-py3-none-any.whl
Algorithm Hash digest
SHA256 5f0994c1726e63e1e9c9016925997f884df4b6d3581f0c17fb9fcc45b701fe81
MD5 19bc6e5be093f2c224a2df57200cb6be
BLAKE2b-256 e2622590009d98e303a394f060387126b0f311c1f56a5c9d76bbad7075e1cb9f

Provenance

The following attestation bundles were made for autocarver-7.3.4-py3-none-any.whl:

Publisher: release.yml on mdefrance/AutoCarver
