catstat

Unified CPU/GPU statistical categorical encoding: leakage-safe target encoding generalized to arbitrary statistics, behind one scikit-learn-compatible API.

Runs on CPU (pandas/numpy) today. The GPU path (cuDF/CuPy) is parity-validated (CPU/GPU allclose) but not yet faster than CPU up to ~1M rows, so backend="auto" resolves to CPU; explicit backend="gpu" is available for device-resident pipelines and larger data. See docs/roadmap.md and docs/known_issues.md (KI-020).

Install

pip install catstat

Optional extras: catstat[gpu] (RAPIDS cuDF/CuPy, CUDA 12), catstat[polars] (output="polars"), catstat[docs] (API-reference build), catstat[dev] (tests + lint + build).

Quickstart

from catstat import TargetEncoder, CountEncoder, FrequencyEncoder

enc = TargetEncoder(cols="auto", stats=["mean"], smooth="auto", cv=5, random_state=42)
X_train_enc = enc.fit_transform(X_train, y_train)   # out-of-fold (leakage-safe)
X_test_enc  = enc.transform(X_test)                 # full-data encodings for new data

Why catstat

scikit-learn's TargetEncoder is CPU-only and mean-only; cuML's is GPU-only (locked to RAPIDS, few statistics); category_encoders has no internal cross-fitting, which risks target leakage. catstat is the union: one API, CPU today and GPU when it pays off, generalized statistics, always leakage-safe.

What it encodes

Three encoders over a shared core: TargetEncoder (supervised, cross-fitted) and the unsupervised CountEncoder / FrequencyEncoder. TargetEncoder(stats=[...]) selects the statistics to emit:

| stats= entry | smoothing | target | GPU | column infix |
| --- | --- | --- | --- | --- |
| "mean" | m-estimate (fixed) / empirical-Bayes (smooth="auto") | regression / binary / multiclass | | te_mean |
| "count" | | unsupervised | | count |
| "frequency" | | unsupervised | | freq |
| "var", "std" | — (global fallback) | regression | | te_var, te_std |
| "median", "min", "max" | — (global fallback) | regression | | te_median / te_min / te_max |
| "skew" | — (global fallback) | regression | CPU only | te_skew |
| ("name", callable): custom (quantiles, IQR, …) | — (global fallback) | regression | CPU only | name |
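
To make the unsupervised rows concrete, here is a minimal, dependency-free sketch of what count and frequency encoding compute. It is not catstat's implementation: the function names and the `unknown` default are hypothetical and chosen only for illustration.

```python
from collections import Counter


def count_encode(train_values, new_values, unknown=0):
    """Map each category to the number of times it occurs in train_values."""
    counts = Counter(train_values)
    return [counts.get(v, unknown) for v in new_values]


def frequency_encode(train_values, new_values, unknown=0.0):
    """Map each category to its relative frequency in train_values."""
    counts = Counter(train_values)
    total = len(train_values)
    return [counts[v] / total if v in counts else unknown for v in new_values]


colors = ["red", "blue", "red", "green", "red", "blue", "red", "green"]
print(count_encode(colors, ["red", "blue", "purple"]))       # [4, 2, 0]
print(frequency_encode(colors, ["red", "green", "purple"]))  # [0.5, 0.25, 0.0]
```

Neither statistic looks at the target, which is why no cross-fitting is needed for them.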

Smoothing honesty: only mean/probability statistics are smoothed. Count and frequency are not smoothed. Order and shape statistics are never blended with a prior: below min_samples_category (or where the statistic is undefined) they fall back to the global statistic. stats=["quantile"] raises an error with a hint to pass a custom callable such as ("q90", lambda v: np.quantile(v, 0.9)).
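
The two behaviours described above can be sketched in plain Python. This is an illustration under stated assumptions, not catstat's code: `m_estimate_mean` uses the textbook m-estimate formula with a hypothetical weight `m`, and `stat_with_fallback` is a hypothetical helper showing how a small category can receive the global statistic instead of its own.

```python
from statistics import mean, median


def m_estimate_mean(targets, global_mean, m=10.0):
    """Shrink a category mean toward the global mean with prior weight m."""
    n = len(targets)
    return (sum(targets) + m * global_mean) / (n + m)


def stat_with_fallback(groups, stat, min_samples_category=3):
    """Apply stat per category; categories with too few rows get the global value."""
    all_values = [v for values in groups.values() for v in values]
    global_stat = stat(all_values)
    return {
        cat: stat(values) if len(values) >= min_samples_category else global_stat
        for cat, values in groups.items()
    }


groups = {"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0]}
global_mean = mean(v for values in groups.values() for v in values)  # 4.0

# The single-row category "b" is pulled strongly toward the global mean.
print(m_estimate_mean(groups["b"], global_mean, m=4.0))  # 5.2
print(m_estimate_mean(groups["a"], global_mean, m=4.0))  # 3.25

# The median is not blended: "b" is below the threshold, so it gets the global median.
print(stat_with_fallback(groups, median))  # {'a': 2.5, 'b': 3.0}
```

The contrast is the point: the mean moves continuously toward the prior as a category shrinks, while an order statistic switches wholesale to the global value.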

Other knobs: scheme ∈ {kfold, loo, ordered} (cross-fitting for the mean; loo/ordered are mean-only), multi_feature_mode ∈ {independent, combination} (joint group-by), handle_unknown / handle_missing ∈ {value, return_nan, error}, backend ∈ {auto, cpu, gpu}, and output ∈ {auto, numpy, pandas, polars}.
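
As a rough illustration of the loo scheme, here is a minimal sketch of leave-one-out mean encoding: each row is encoded with the mean target of the other rows in its category. It is not taken from catstat. The function name is hypothetical, smoothing is omitted, and using the global mean for singleton categories is an assumption.

```python
from collections import defaultdict


def loo_mean_encode(categories, targets):
    """Leave-one-out mean: each row's encoding excludes its own target."""
    global_mean = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    encoded = []
    for c, y in zip(categories, targets):
        if counts[c] > 1:
            encoded.append((sums[c] - y) / (counts[c] - 1))
        else:
            # Singleton: nothing is left once the row itself is removed.
            encoded.append(global_mean)
    return encoded


cats = ["a", "a", "a", "b"]
ys = [1.0, 0.0, 1.0, 1.0]
print(loo_mean_encode(cats, ys))  # [0.5, 1.0, 0.5, 0.75]
```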

Leakage-safe by design

  • fit_transform(X, y) is out-of-fold: each fold is encoded from its complement, then the encoder refits on the full data for later transform of new rows. fit(X, y).transform(X) on the training set is the leaky path and is documented as such.
  • smooth="auto" variance is computed per fold. Fold assignment depends only on random_state: catstat owns fold assignment, so CPU and GPU produce the same encodings (asserted with allclose).
  • Deterministic given random_state.
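
The out-of-fold idea in the list above can be sketched without any dependencies. This is a simplified illustration, not catstat's implementation: the helper names are hypothetical, only the unsmoothed mean is shown, and falling back to the complement's mean for categories unseen in a fold is an assumption.

```python
import random
from collections import defaultdict


def assign_folds(n_rows, cv=5, random_state=42):
    """Shuffle row indices with a seeded RNG and deal them into cv folds."""
    order = list(range(n_rows))
    random.Random(random_state).shuffle(order)
    folds = [0] * n_rows
    for position, row in enumerate(order):
        folds[row] = position % cv
    return folds


def category_means(categories, targets):
    """Plain per-category mean of the target."""
    sums, counts = defaultdict(float), defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    return {c: sums[c] / counts[c] for c in sums}


def oof_mean_encode(categories, targets, cv=5, random_state=42):
    """Encode each fold from its complement; also return the full-data mapping."""
    if cv < 2 or len(categories) < cv:
        raise ValueError("need cv >= 2 and at least cv rows")
    folds = assign_folds(len(categories), cv, random_state)
    encoded = [None] * len(categories)
    for k in range(cv):
        train = [i for i, f in enumerate(folds) if f != k]
        fit_y = [targets[i] for i in train]
        means = category_means([categories[i] for i in train], fit_y)
        fallback = sum(fit_y) / len(fit_y)  # category unseen in the complement
        for i, f in enumerate(folds):
            if f == k:
                encoded[i] = means.get(categories[i], fallback)
    # Mapping fitted on all rows, intended for new data only.
    full_map = category_means(categories, targets)
    return encoded, full_map


cats = ["a"] * 5 + ["b"] * 4 + ["z"]
ys = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 100.0]
encoded, full_map = oof_mean_encode(cats, ys)

# The lone "z" row never sees its own target of 100.0 out-of-fold,
# whereas the full-data mapping would hand it straight back.
print(encoded[-1], full_map["z"])
```

Applying `full_map` to the training rows themselves corresponds to the leaky path: the singleton category would be encoded with its own target value.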

scikit-learn compatibility

The encoders derive from BaseEstimator / TransformerMixin, work in Pipeline and ColumnTransformer, and support set_output(transform="pandas"|"polars") and get_feature_names_out. The supported subset of sklearn.utils.estimator_checks.check_estimator is documented and tested (see docs/known_issues.md, KI-012).

API reference

Rendered API docs: https://matapanino.github.io/catstat/ (built with pdoc; see scripts/build_docs.sh).

Develop

pip install -e ".[dev]"
bash scripts/check.sh        # ruff + pytest + examples (the green gate)
PYTHONPATH=src python3 -m pytest tests/ -q
PYTHONPATH=src python3 -m benchmarks.run_benchmarks --size small --backend cpu --reps 5 \
    --out benchmarks/results/run.json

See CLAUDE.md for the development rules and docs/ for the design.

License

MIT
