Scikit-learn compatible Profile Correspondence Encoder for high-cardinality categorical features
profile-correspondence-encoder
profile-correspondence-encoder provides a scikit-learn compatible PCEEncoder for high-cardinality categorical data.
It is inspired by MCA/correspondence analysis, but fits on a sparse category co-occurrence graph and outputs dense latent vectors.
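To make the idea concrete, here is a minimal sketch of a correspondence-analysis style embedding computed from a category co-occurrence matrix. It is not the package's implementation: the function name `cooccurrence_embedding`, the tokenization of each (column, value) pair, and the dense SVD are all assumptions chosen for illustration. A real implementation working at scale would presumably keep the matrix sparse and use a truncated solver.

```python
import numpy as np
import pandas as pd
from scipy import sparse


def cooccurrence_embedding(df, n_components=2):
    """Illustrative only: CA-style coordinates for (column, value) tokens."""
    # Give every (column, value) pair its own token id; missing values are skipped.
    token_ids = {}
    rows, cols = [], []
    for i, record in enumerate(df.itertuples(index=False)):
        for col, value in zip(df.columns, record):
            if pd.isna(value):
                continue
            j = token_ids.setdefault((col, value), len(token_ids))
            rows.append(i)
            cols.append(j)

    # Sparse indicator matrix Z (rows x tokens), then co-occurrence counts B = Z^T Z.
    Z = sparse.csr_matrix(
        (np.ones(len(rows)), (rows, cols)), shape=(len(df), len(token_ids))
    )
    B = (Z.T @ Z).toarray()  # densified only because this toy example is tiny

    # Correspondence-analysis standardization: S = D^-1/2 (P - r r^T) D^-1/2.
    P = B / B.sum()
    r = P.sum(axis=1)
    S = (P - np.outer(r, r)) / np.sqrt(np.outer(r, r))

    # Principal coordinates on the leading components.
    U, sigma, _ = np.linalg.svd(S)
    coords = (U[:, :n_components] * sigma[:n_components]) / np.sqrt(r)[:, None]
    return token_ids, coords


df = pd.DataFrame({
    "city": ["nyc", "nyc", "la", "la", "sf"],
    "state": ["ny", "ny", "ca", "ca", "ca"],
})
tokens, coords = cooccurrence_embedding(df, n_components=2)
print(coords.shape)
```

Tokens that always co-occur with the same partners (here the city "nyc" and the state "ny") have identical co-occurrence profiles and therefore land on the same coordinates, which is the property that lets a fixed number of components stand in for many one-hot columns.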
Install
pip install profile-correspondence-encoder
Why Use It
- Unsupervised: the target is never used, so there is no target leakage (fit on the training split only to keep test-set statistics out of the encoding)
- pandas-friendly
- No one-hot explosion in output dimensionality
- Scalable sparse fit over category graph
- Canonical normalization + optional aliases
- Frequency-based shrinkage for rare categories (instead of hard bucketing)
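As a rough illustration of what canonical normalization with aliases could look like, the sketch below lowercases a label, unifies separators, and then applies an alias map. The helper `canonicalize` and its exact rules are hypothetical and are not taken from the package; only the alias dictionary and the sample values come from the Quick Start example.

```python
import re


def canonicalize(value, aliases=None):
    """Hypothetical helper: lowercase, unify separators, then apply aliases."""
    if value is None:
        return None
    # Collapse runs of whitespace, hyphens, and underscores into a single space.
    text = re.sub(r"[\s\-_]+", " ", str(value).strip().lower())
    # Map known aliases onto their canonical label; otherwise keep the text.
    return (aliases or {}).get(text, text)


city_aliases = {"ny": "new york", "la": "los angeles"}
raw = ["New York", "NY", "new-york", "Los Angeles", "LA", None]
canonical = [canonicalize(v, city_aliases) for v in raw]
print(canonical)
# ['new york', 'new york', 'new york', 'los angeles', 'los angeles', None]
```

Under these assumed rules, the three spellings of New York collapse into one category before any co-occurrence counts are taken.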
Quick Start
import pandas as pd
from profile_correspondence_encoder import PCEEncoder

df = pd.DataFrame({
    "city": ["New York", "NY", "new-york", "Los Angeles", "LA", None],
    "state": ["New York", "New York", "New York", "California", "California", None],
    "segment": ["A", "A", "A", "B", "B", "B"],
})

encoder = PCEEncoder(
    columns=["city", "state", "segment"],
    n_components=3,
    min_frequency=2,
    aliases={"city": {"ny": "new york", "la": "los angeles"}},
    output_format="pandas",
)

X_enc = encoder.fit_transform(df)
print(X_enc.head())
print(encoder.get_metadata().head())
Scikit-learn Pipeline Example (Titanic)
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from profile_correspondence_encoder import PCEEncoder

# Public Titanic dataset via OpenML
data = fetch_openml(name="titanic", version=1, as_frame=True)
X = data.data
# The target holds the labels "0"/"1"; convert it to integers
y = (data.target == "1").astype(int)

cat_cols = ["sex", "embarked", "pclass"]
num_cols = ["age", "fare", "sibsp", "parch"]

preprocess = ColumnTransformer(
    transformers=[
        (
            "cat",
            # Impute missing categories first, then encode them with PCE
            Pipeline(
                steps=[
                    ("imputer", SimpleImputer(strategy="most_frequent")),
                    (
                        "pce",
                        PCEEncoder(
                            columns=cat_cols,
                            n_components=4,
                            min_frequency=10,
                            output_format="numpy",
                        ),
                    ),
                ]
            ),
            cat_cols,
        ),
        (
            "num",
            Pipeline(steps=[("imputer", SimpleImputer(strategy="median"))]),
            num_cols,
        ),
    ],
    # Always return a dense array
    sparse_threshold=0.0,
)

model = Pipeline(
    steps=[
        ("prep", preprocess),
        ("clf", LogisticRegression(max_iter=2000)),
    ]
)

# Stratified split so both sets keep the same survival rate
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# The encoder is fit on the training split only, inside the pipeline
model.fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]
print("ROC-AUC:", roc_auc_score(y_test, proba))
Rare Category Shrinkage
Instead of mapping every rare category to the exact same vector, PCEEncoder shrinks seen-rare categories according to their fit frequency:
- count >= min_frequency: direct category vector
- 0 < count < min_frequency: smooth interpolation between __UNK__ and __RARE__
- unseen category: __UNK__
This keeps behavior stable while preserving some frequency signal among rare labels.
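The three cases above can be sketched as a small function. The exact interpolation used by the package is not documented here, so the linear weight `count / min_frequency` and the direction of the blend (moving from __UNK__ toward __RARE__ as the count grows) are assumptions, and `shrunk_vector` is a hypothetical name.

```python
import numpy as np


def shrunk_vector(count, min_frequency, category_vec, rare_vec, unk_vec):
    """Hypothetical sketch of frequency-based shrinkage for one category."""
    if count <= 0:
        # Never seen during fit: fall back to the __UNK__ vector.
        return unk_vec
    if count >= min_frequency:
        # Frequent enough: use the category's own vector.
        return category_vec
    # Seen but rare: blend __UNK__ and __RARE__ (assumed linear weight).
    weight = count / min_frequency
    return (1.0 - weight) * unk_vec + weight * rare_vec


unk = np.array([0.0, 0.0])
rare = np.array([1.0, 2.0])
own = np.array([5.0, -3.0])

print(shrunk_vector(0, 10, own, rare, unk))   # unseen: __UNK__
print(shrunk_vector(5, 10, own, rare, unk))   # rare: halfway between the two
print(shrunk_vector(10, 10, own, rare, unk))  # frequent: own vector
```

With a blend like this, two rare labels seen 2 and 8 times get different vectors, so some frequency signal survives instead of both collapsing into a single bucket.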
Development
pip install -e ".[dev]"
pytest
python -m build