
Machine learning with dirty categories.

Project description

dirty_cat is a Python module for machine learning on dirty categorical variables.


For a detailed description of the problem of encoding dirty categorical data, see Similarity encoding for learning with dirty categorical variables [1] and Encoding high-cardinality string categorical variables [2].
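
To make the idea concrete, here is a minimal, dependency-free sketch of similarity encoding with a 3-gram similarity. It does not use dirty_cat's own API. The helper names ngrams, ngram_similarity, and similarity_encode are hypothetical, and the Jaccard-style formula on sets of character n-grams is an assumption made for illustration that may differ in detail from what the library computes.

```python
def ngrams(string, n=3):
    """Return the set of character n-grams of a lowercased, space-padded string."""
    padded = " " + string.lower() + " "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two strings, in [0, 1]."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a and not grams_b:
        # Two strings with no n-grams at all are treated as identical.
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def similarity_encode(values, categories, n=3):
    """Encode each value as its vector of similarities to the reference categories."""
    return [[ngram_similarity(v, c, n) for c in categories] for v in values]


# Illustrative data (not from the dirty_cat documentation).
categories = ["police officer", "fire fighter", "teacher"]
dirty_values = ["Police Officer", "police oficer", "teachr"]

encoded = similarity_encode(dirty_values, categories)
for value, row in zip(dirty_values, encoded):
    print(value, [round(x, 2) for x in row])
```

Unlike one-hot encoding, a misspelled entry such as "police oficer" still gets a high score in the "police officer" column instead of being treated as an unrelated category.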



dirty_cat requires:

  • Python (>= 3.6)
  • NumPy (>= 1.8.2)
  • SciPy (>= 1.0.1)
  • scikit-learn (>= 0.20.0)

Optional dependency:

  • python-Levenshtein for faster edit distances (not used for the n-gram distance)

User installation

If you already have a working installation of NumPy and SciPy, the easiest way to install dirty_cat is with pip:

pip install -U --user dirty_cat

Other implementations


[1] Patricio Cerda, Gaël Varoquaux, Balázs Kégl. Similarity encoding for learning with dirty categorical variables. 2018. Machine Learning journal, Springer.
[2] Patricio Cerda, Gaël Varoquaux. Encoding high-cardinality string categorical variables. 2020. IEEE Transactions on Knowledge & Data Engineering.

Download files

Files for dirty-cat, version 0.1.0

  • dirty_cat-0.1.0-py3-none-any.whl (112.0 kB): Wheel, Python version py3
  • dirty_cat-0.1.0.tar.gz (98.1 kB): Source
