Skip to main content

Machine learning with dirty categories.

Project description

dirty_cat is a Python module for machine-learning on dirty categorical variables.

Website: https://dirty-cat.github.io/

For a detailed description of the problem of encoding dirty categorical data, see Similarity encoding for learning with dirty categorical variables [1].

Installation

Dependencies

dirty_cat requires:

  • Python (>= 3.5)
  • NumPy (>= 1.8.2)
  • SciPy (>= 1.0.1)
  • scikit-learn (>= 0.20.0)

Optional dependency:

  • python-Levenshtein for faster edit distances (not used for the n-gram distance)

User installation

If you already have a working installation of NumPy and SciPy, the easiest way to install dirty_cat is using pip

pip install -U --user dirty_cat

Other implementations

References

[1]Patricio Cerda, Gaël Varoquaux, Balázs Kégl. Similarity encoding for learning with dirty categorical variables. 2018, Machine Learning journal, Springer.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Filename, size & hash SHA256 hash help File type Python version Upload date
dirty_cat-0.0.5-py2.py3-none-any.whl (91.2 kB) Copy SHA256 hash SHA256 Wheel 3.6
dirty_cat-0.0.5.tar.gz (80.0 kB) Copy SHA256 hash SHA256 Source None

Supported by

Elastic Elastic Search Pingdom Pingdom Monitoring Google Google BigQuery Sentry Sentry Error logging AWS AWS Cloud computing DataDog DataDog Monitoring Fastly Fastly CDN SignalFx SignalFx Supporter DigiCert DigiCert EV certificate StatusPage StatusPage Status page