A collection of sklearn transformers to encode categorical variables as numeric

Categorical Encoding Methods
============================

[![Travis Status](https://travis-ci.org/scikit-learn-contrib/categorical-encoding.svg?branch=master)](https://travis-ci.org/scikit-learn-contrib/categorical-encoding)
[![Coveralls Status](https://coveralls.io/repos/scikit-learn-contrib/categorical-encoding/badge.svg?branch=master&service=github)](https://coveralls.io/r/scikit-learn-contrib/categorical-encoding)
[![CircleCI Status](https://circleci.com/gh/scikit-learn-contrib/categorical-encoding.svg?style=shield&circle-token=:circle-token)](https://circleci.com/gh/scikit-learn-contrib/categorical-encoding/tree/master)
[![DOI](https://zenodo.org/badge/47077067.svg)](https://zenodo.org/badge/latestdoi/47077067)

A set of scikit-learn-style transformers for encoding categorical
variables as numeric features using a range of techniques.

Important Links
---------------

Documentation: [http://contrib.scikit-learn.org/categorical-encoding/](http://contrib.scikit-learn.org/categorical-encoding/)

Encoding Methods
----------------

* Backward Difference Contrast [2][3]
* BaseN [6]
* Binary [5]
* Hashing [1]
* Helmert Contrast [2][3]
* James-Stein Estimator [9]
* LeaveOneOut [4]
* M-estimator [7]
* Ordinal [2][3]
* One-Hot [2][3]
* Polynomial Contrast [2][3]
* Sum Contrast [2][3]
* Target Encoding [7]
* Weight of Evidence [8]

Installation
------------

The package requires: `numpy`, `statsmodels`, and `scipy`.

To install the package, execute:

```shell
python setup.py install
```

or

```shell
pip install category_encoders
```

or

```shell
conda install -c conda-forge category_encoders
```

To install the development version, you may use:

```shell
pip install --upgrade git+https://github.com/scikit-learn-contrib/categorical-encoding
```

Usage
-----

All of the encoders are fully compatible scikit-learn transformers, so they can be used in pipelines or in your existing scripts. Supported input formats include numpy arrays and pandas dataframes. If the `cols` parameter isn't passed, all columns with object or pandas categorical data type will be encoded. Please see the docs for transformer-specific configuration options.
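To make the fit/transform pattern and the column-selection default concrete, here is a minimal pure-Python sketch. It is not part of `category_encoders`: the class `ToyOrdinalEncoder`, its row-of-dicts input, and the choice of `-1` for unseen categories are all assumptions made for illustration. With no `cols` given, it encodes every column that holds string values, loosely mirroring the "object or categorical" default described above.

```python
class ToyOrdinalEncoder:
    """Hypothetical stand-in for the fit/transform pattern (not the library's code)."""

    def __init__(self, cols=None):
        self.cols = cols

    def fit(self, rows, y=None):
        # With no explicit cols, select every column whose values are strings.
        if self.cols is None:
            cols = [k for k, v in rows[0].items() if isinstance(v, str)]
        else:
            cols = list(self.cols)
        self.cols_ = cols
        self.mapping_ = {}
        for col in cols:
            seen = {}
            for row in rows:
                # Ordinals start at 1, in order of first appearance.
                seen.setdefault(row[col], len(seen) + 1)
            self.mapping_[col] = seen
        return self

    def transform(self, rows):
        # Unseen categories map to -1 (an arbitrary choice for this sketch).
        return [
            {k: (self.mapping_[k].get(v, -1) if k in self.mapping_ else v)
             for k, v in row.items()}
            for row in rows
        ]


train = [{"city": "Oslo", "size": 3}, {"city": "Lima", "size": 5}]
enc = ToyOrdinalEncoder().fit(train)
encoded = enc.transform(train + [{"city": "Rome", "size": 1}])
print(encoded)
```

The mapping is learned once in `fit` and reused in `transform`, which is what lets the same fitted encoder be applied to both training and held-out data.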

Examples
--------
There are two types of encoders: unsupervised and supervised. An unsupervised example:
```python
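from category_encoders import BinaryEncoder
import pandas as pd
# note: load_boston was removed in scikit-learn 1.2; this example needs an older release
from sklearn.datasets import load_boston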

# prepare some data
bunch = load_boston()
y = bunch.target
X = pd.DataFrame(bunch.data, columns=bunch.feature_names)

# use binary encoding to encode two categorical features
enc = BinaryEncoder(cols=['CHAS', 'RAD']).fit(X)

# transform the dataset
numeric_dataset = enc.transform(X)
```
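As a rough illustration of the idea behind binary encoding, the following pure-Python sketch maps each category to an integer ordinal and then spreads that ordinal's binary digits across separate columns. This is a conceptual sketch only, not the `BinaryEncoder` implementation: the function name `binary_encode` and the ordering of categories by first appearance are assumptions, and the library may order categories, name columns, and handle unknown values differently.

```python
def binary_encode(values):
    """Illustrative binary encoding: category -> ordinal -> binary digits."""
    # Assign ordinals starting at 1 in order of first appearance (assumption);
    # 0 is left unused, e.g. as a slot for unseen categories.
    ordinals = {}
    for v in values:
        if v not in ordinals:
            ordinals[v] = len(ordinals) + 1
    # Number of binary digits needed to represent the largest ordinal.
    n_bits = max(1, len(ordinals).bit_length())
    # Emit one row per input value, most significant bit first.
    return [
        [(ordinals[v] >> shift) & 1 for shift in range(n_bits - 1, -1, -1)]
        for v in values
    ]


rows = binary_encode(["red", "green", "blue", "green", "yellow"])
for row in rows:
    print(row)
```

The number of output columns grows with the logarithm of the number of categories, whereas one-hot encoding needs one column per category.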

And a supervised example:
```python
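from category_encoders import TargetEncoder
import pandas as pd
# note: load_boston was removed in scikit-learn 1.2; this example needs an older release
from sklearn.datasets import load_boston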

# prepare some data
bunch = load_boston()
y_train = bunch.target[0:250]
y_test = bunch.target[250:506]
X_train = pd.DataFrame(bunch.data[0:250], columns=bunch.feature_names)
X_test = pd.DataFrame(bunch.data[250:506], columns=bunch.feature_names)

# use target encoding to encode two categorical features
enc = TargetEncoder(cols=['CHAS', 'RAD']).fit(X_train, y_train)

# transform the datasets
training_numeric_dataset = enc.transform(X_train, y_train)
testing_numeric_dataset = enc.transform(X_test)
```
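Conceptually, a supervised encoder such as target encoding replaces each category with a statistic of the target observed for that category in the training data. The sketch below is a simplified illustration and not the `TargetEncoder` algorithm: it uses the plain per-category mean and falls back to the global mean for categories unseen during fitting, whereas the scheme in reference [7] also blends the category mean with the global mean. The function names `fit_target_means` and `apply_target_means` are hypothetical.

```python
from collections import defaultdict


def fit_target_means(categories, targets):
    """Learn the mean target per category, plus the global mean."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    global_mean = sum(targets) / len(targets)
    means = {c: sums[c] / counts[c] for c in sums}
    return means, global_mean


def apply_target_means(categories, means, global_mean):
    """Replace each category with its learned mean; unseen ones get the global mean."""
    return [means.get(c, global_mean) for c in categories]


train_cats = ["a", "b", "a", "b", "c"]
train_y = [1.0, 3.0, 2.0, 5.0, 4.0]
means, global_mean = fit_target_means(train_cats, train_y)

# "z" never appeared in training, so it falls back to the global mean.
encoded_test = apply_target_means(["a", "c", "z"], means, global_mean)
print(encoded_test)
```

Because the statistics come from the training targets only, the test set is encoded without ever looking at its own labels.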

Additional examples and benchmarks can be found in the `examples` directory.

Contributing
------------

Category encoders is under active development. If you'd like to be involved, we'd love to have you. Check out the CONTRIBUTING.md file
or open an issue on the GitHub project to get started.

References:
-----------

1. Kilian Weinberger; Anirban Dasgupta; John Langford; Alex Smola; Josh Attenberg (2009). Feature Hashing for Large Scale Multitask Learning. Proc. ICML.
2. Contrast Coding Systems for categorical variables. UCLA: Statistical Consulting Group. From https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/.
3. Gregory Carey (2003). Coding Categorical Variables. From http://psych.colorado.edu/~carey/Courses/PSYC5741/handouts/Coding%20Categorical%20Variables%202006-03-03.pdf
4. Strategies to encode categorical variables with many categories. From https://www.kaggle.com/c/caterpillar-tube-pricing/discussion/15748#143154.
5. Beyond One-Hot: an exploration of categorical variables. From http://www.willmcginnis.com/2015/11/29/beyond-one-hot-an-exploration-of-categorical-variables/
6. BaseN Encoding and Grid Search in categorical variables. From http://www.willmcginnis.com/2016/12/18/basen-encoding-grid-search-category_encoders/
7. Daniele Micci-Barreca (2001). A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems. SIGKDD Explor. Newsl. 3, 1. From http://dx.doi.org/10.1145/507533.507538
8. Weight of Evidence (WOE) and Information Value Explained. From https://www.listendata.com/2015/03/weight-of-evidence-woe-and-information.html
9. Empirical Bayes for multiple sample sizes. From http://chris-said.io/2017/05/03/empirical-bayes-for-multiple-sample-sizes/
