smart_dummy

A smart and easy alternative to pandas.get_dummies.

Description

smart_dummy uses a language model (spaCy) in combination with KMeans clustering to sort the values of a categorical variable into logical groups. The user specifies how many groups to output, which results in fewer columns than pandas.get_dummies(). For example, a column might hold 900 unique categories; smart_dummy lets you cluster those categories together and get back, say, only 5 columns instead of the 900 that pandas.get_dummies() would produce.
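The idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not smart_dummy's actual implementation: the helper name cluster_dummies, its parameters, and the toy 2-D vectors are all hypothetical, and the embedding step is passed in as a function so the sketch runs without downloading a spaCy model.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans


def cluster_dummies(values, embed, n_groups, prefix="category"):
    """Hypothetical helper: one-hot encode cluster membership, not raw values."""
    values = pd.Series(values)
    uniques = values.unique()
    # One vector per distinct category value.
    vectors = np.vstack([embed(str(u)) for u in uniques])
    # Group the vectors; which cluster gets which number is arbitrary.
    labels = KMeans(n_clusters=n_groups, n_init=10, random_state=0).fit_predict(vectors)
    label_of = dict(zip(uniques, labels))
    # Replace each value by its cluster label, then one-hot encode the labels.
    return pd.get_dummies(values.map(label_of), prefix=prefix, dtype=int)


# Toy 2-D vectors stand in for real word embeddings (an assumption for this
# sketch). With spaCy installed, embed could instead return nlp(text).vector
# after nlp = spacy.load("en_core_web_md").
toy_vectors = {
    "cat": [0.0, 0.0], "dog": [0.1, 0.1],
    "flower": [5.0, 0.0], "tree": [5.1, 0.1],
    "man": [0.0, 5.0], "woman": [0.1, 5.1],
}


def embed(text):
    return np.array(toy_vectors[text])


words = list(toy_vectors)
result = cluster_dummies(words, embed, 3).set_index(pd.Index(words))
print(result)
```

Each row has exactly one 1, and values whose vectors lie close together share a column. Six distinct values collapse into three columns here; the same mechanism would collapse 900 values into 5.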

Getting Started

Installing

pip install smart_dummy

Example Use

import pandas as pd
from smart_dummy import get_dummies

test_input = pd.DataFrame(['cat', 'dog', 'flower', 'tree', 'man', 'woman'], columns=['category'])
result = get_dummies(test_input['category'], 3).set_index(test_input['category'])
print(result)

This will print:

        category_0  category_1  category_2
cat              0           0           1
dog              0           0           1
flower           1           0           0
tree             1           0           0
man              0           1           0
woman            0           1           0
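
For comparison, here is what plain pandas.get_dummies() does with the same six-value column. The sketch uses only pandas; the variable name baseline is chosen for illustration.

```python
import pandas as pd

test_input = pd.DataFrame(
    ['cat', 'dog', 'flower', 'tree', 'man', 'woman'], columns=['category']
)

# Plain one-hot encoding: one column per distinct value.
baseline = pd.get_dummies(test_input['category'])
print(baseline.shape)  # (6, 6)
```

Six distinct values produce six columns, where the smart_dummy call above was asked for three.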

Authors

Muriel Grobler (muriel.grobler@gmail.com)

Emma Zhang (emma.lzhang@gmail.com)

License

This project is licensed under the MIT License; see the LICENSE file for details.

Acknowledgments

Many thanks to Arrive Logistics for allowing us to open-source this work. Please consider them as your future employer - it's a great place to work!
