
A collection of sklearn transformers to encode categorical variables as numeric


Categorical Encoding Methods


A set of scikit-learn-style transformers for encoding categorical
variables as numeric values using a range of techniques.

Important Links

Documentation: [](

Encoding Methods

* Backward Difference Contrast [2][3]
* BaseN [6]
* Binary [5]
* Hashing [1]
* Helmert Contrast [2][3]
* James-Stein Estimator [9]
* LeaveOneOut [4]
* M-estimator [7]
* Ordinal [2][3]
* One-Hot [2][3]
* Polynomial Contrast [2][3]
* Sum Contrast [2][3]
* Target Encoding [7]
* Weight of Evidence [8]
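
To give a feel for what the simplest methods in this list produce, here is a minimal pure-Python sketch of ordinal and one-hot encoding. The function names `ordinal_encode` and `one_hot_encode` are hypothetical helpers written for this illustration and are not part of the package; the package's own encoders may order categories and name output columns differently.

```python
def ordinal_encode(values):
    """Map each distinct category to an integer, in order of first appearance."""
    mapping = {}
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping) + 1  # codes start at 1 (an assumption of this sketch)
    return [mapping[v] for v in values], mapping


def one_hot_encode(values):
    """Return one 0/1 indicator column per distinct category."""
    _, mapping = ordinal_encode(values)
    categories = list(mapping)  # dicts preserve insertion order (Python 3.7+)
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories


colors = ["red", "green", "red", "blue"]
ordinal_codes, ordinal_mapping = ordinal_encode(colors)
one_hot_rows, one_hot_columns = one_hot_encode(colors)
```

Ordinal encoding keeps a single column but imposes an arbitrary order on the categories, while one-hot encoding avoids that order at the cost of one column per category.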


Installation

The package requires: `numpy`, `statsmodels`, and `scipy`.

To install the package, execute:

$ python setup.py install

or

pip install category_encoders

or

conda install -c conda-forge category_encoders

To install the development version, you may use:

pip install --upgrade git+


Usage

All of the encoders are fully compatible sklearn transformers, so they can be used in pipelines or in your existing scripts. Supported input formats include numpy arrays and pandas dataframes. If the `cols` parameter isn't passed, all columns with object or pandas categorical data type will be encoded. Please see the docs for transformer-specific configuration options.

There are two types of encoders: unsupervised and supervised. An unsupervised example:

from category_encoders import BinaryEncoder
import pandas as pd
# note: load_boston was removed in scikit-learn 1.2, so this needs an older release
from sklearn.datasets import load_boston

# prepare some data
bunch = load_boston()
y = bunch.target
X = pd.DataFrame(bunch.data, columns=bunch.feature_names)

# use binary encoding to encode two categorical features
enc = BinaryEncoder(cols=['CHAS', 'RAD']).fit(X)

# transform the dataset
numeric_dataset = enc.transform(X)
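
As a rough illustration of the idea behind binary encoding, the following pure-Python sketch assigns each category an integer code and then spreads the binary digits of that code across columns. `binary_encode` is a hypothetical helper written for this example, not the package's implementation; the real encoder's column layout and its handling of unknown or missing values may differ.

```python
def binary_encode(values):
    """Encode categories as the binary digits of their ordinal codes."""
    if not values:
        return []
    # Step 1: ordinal codes, starting at 1, in order of first appearance
    mapping = {}
    for v in values:
        mapping.setdefault(v, len(mapping) + 1)
    # Step 2: number of bits needed to represent the largest code
    width = max(mapping.values()).bit_length()
    # Step 3: one column per bit, most significant bit first
    return [
        [int(bit) for bit in format(mapping[v], "0{}b".format(width))]
        for v in values
    ]


rad = [1, 2, 3, 4, 5, 2]  # made-up sample values, five distinct categories
binary_rows = binary_encode(rad)
```

Five categories fit in three columns here, whereas one-hot encoding would need five; the gap widens as the number of categories grows.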

And a supervised example:

from category_encoders import TargetEncoder
import pandas as pd
# note: load_boston was removed in scikit-learn 1.2, so this needs an older release
from sklearn.datasets import load_boston

# prepare some data, split into training and test rows
bunch = load_boston()
y_train = bunch.target[0:250]
y_test = bunch.target[250:506]
X_train = pd.DataFrame(bunch.data[0:250], columns=bunch.feature_names)
X_test = pd.DataFrame(bunch.data[250:506], columns=bunch.feature_names)

# use target encoding to encode two categorical features
enc = TargetEncoder(cols=['CHAS', 'RAD']).fit(X_train, y_train)

# transform the datasets
training_numeric_dataset = enc.transform(X_train, y_train)
testing_numeric_dataset = enc.transform(X_test)
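
The reason a supervised encoder is fitted on training data only can be seen in a simplified sketch of target encoding: each category is replaced by the mean target observed for it during fitting. The helpers `fit_target_means` and `apply_target_means` are hypothetical names for this illustration. This version deliberately omits the smoothing described in reference [7] and falls back to the global mean for unseen categories, which is an assumption of the sketch rather than a description of the package's behavior.

```python
from collections import defaultdict


def fit_target_means(categories, targets):
    """Learn the mean target per category, plus the global mean."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    means = {c: sums[c] / counts[c] for c in sums}
    global_mean = sum(targets) / len(targets)
    return means, global_mean


def apply_target_means(categories, means, global_mean):
    """Replace each category by its learned mean; unseen ones get the global mean."""
    return [means.get(c, global_mean) for c in categories]


# made-up training data
train_cats = ["a", "b", "a", "b", "a"]
train_y = [1.0, 0.0, 1.0, 1.0, 0.0]

means, global_mean = fit_target_means(train_cats, train_y)
# "c" never appeared during fitting, so it falls back to the global mean
encoded_test = apply_target_means(["a", "b", "c"], means, global_mean)
```

Because the learned means depend on `train_y`, the test targets are never needed at transform time, which mirrors the `enc.transform(X_test)` call above.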

Additional examples and benchmarks can be found in the `examples` directory.


Contributing

Category encoders is under active development. If you'd like to be involved, we'd love to have you. Check out the file
or open an issue on the GitHub project to get started.


References

1. Kilian Weinberger; Anirban Dasgupta; John Langford; Alex Smola; Josh Attenberg (2009). Feature Hashing for Large Scale Multitask Learning. Proc. ICML.
2. Contrast Coding Systems for categorical variables. UCLA: Statistical Consulting Group. From
3. Gregory Carey (2003). Coding Categorical Variables. From
4. Strategies to encode categorical variables with many categories. From
5. Beyond One-Hot: an exploration of categorical variables. From
6. BaseN Encoding and Grid Search in categorical variables. From
7. Daniele Micci-Barreca (2001). A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems. SIGKDD Explor. Newsl. 3, 1. From
8. Weight of Evidence (WOE) and Information Value Explained. From
9. Empirical Bayes for multiple sample sizes. From


Files for category-encoders, version 2.0.0
* category_encoders-2.0.0-py2.py3-none-any.whl (87.8 kB): wheel, Python version py2.py3
* category_encoders-2.0.0.tar.gz (49.8 kB): source distribution
