A smart and easy alternative to pandas.get_dummies()
smart_dummy
Description
smart_dummy uses a language model (spaCy) in combination with KMeans clustering to sort the values of a categorical variable into groups of related values. You specify how many groups to output, which results in fewer columns than pandas.get_dummies() produces. For example, a column with 900 unique industry categories can be clustered into (for example) only 5 columns instead of the 900 that pandas.get_dummies() would return.
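The idea of embedding each category, clustering the embeddings, and one-hot encoding the cluster labels can be sketched in plain Python. This is only an illustration under stated assumptions, not smart_dummy's actual implementation: the 2-D vectors in `TOY_VECTORS` are made up and stand in for spaCy word vectors, and the small `kmeans` helper (with a deterministic farthest-point initialisation) stands in for a library KMeans. The names `TOY_VECTORS`, `sq_dist`, `kmeans`, and `cluster_dummies` are hypothetical.

```python
# Made-up 2-D stand-ins for word vectors (assumption for illustration only);
# a real pipeline would obtain vectors from a language model such as spaCy.
TOY_VECTORS = {
    "cat": (0.9, 0.1), "dog": (1.0, 0.2),
    "flower": (0.1, 0.9), "tree": (0.2, 1.0),
    "man": (-0.9, -0.8), "woman": (-1.0, -0.9),
}


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=20):
    """Minimal Lloyd's algorithm; returns one cluster label per point."""
    # Deterministic farthest-point initialisation: start from the first
    # point, then repeatedly add the point farthest from all chosen centers.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))

    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels


def cluster_dummies(categories, vectors, k):
    """Map each category to a one-hot row over k cluster columns."""
    labels = kmeans([vectors[c] for c in categories], k)
    return {c: [int(lab == j) for j in range(k)] for c, lab in zip(categories, labels)}


dummies = cluster_dummies(list(TOY_VECTORS), TOY_VECTORS, 3)
for category, row in dummies.items():
    print(category, row)
```

With these toy vectors, the two animals, the two plants, and the two people each end up sharing a column, so six categories are encoded with three columns. Which column index a group receives is arbitrary.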
Getting Started
Installing
pip install smart_dummy
Example Use
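import pandas as pd
from smart_dummy import get_dummies
# Six example values in a single categorical column.
test_input = pd.DataFrame(['cat', 'dog', 'flower', 'tree', 'man', 'woman'], columns=['category'])
# Cluster the values into 3 groups, then index each row by its original value.
result = get_dummies(test_input['category'], 3).set_index(test_input['category'])
print(result)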
This prints:
        category_0  category_1  category_2
cat              0           0           1
dog              0           0           1
flower           1           0           0
tree             1           0           0
man              0           1           0
woman            0           1           0
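For comparison, here is a small sketch of what standard one-hot encoding returns for the same six values. It uses only pandas.get_dummies() (not smart_dummy); the variable name `baseline` is chosen here for illustration.

```python
import pandas as pd

test_input = pd.DataFrame(
    ['cat', 'dog', 'flower', 'tree', 'man', 'woman'], columns=['category']
)

# Standard one-hot encoding: one indicator column per distinct value.
baseline = pd.get_dummies(test_input['category'], prefix='category')

print(baseline.shape)  # (6, 6): six rows, one column per unique value
print(list(baseline.columns))
```

Six distinct values produce six columns here, whereas the clustered encoding above needs only three.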
Authors
Muriel Grobler (muriel.grobler@gmail.com)
Emma Zhang (emma.lzhang@gmail.com)
License
This project is licensed under the MIT License; see the LICENSE file for details.
Acknowledgments
Many thanks to Arrive Logistics for allowing us to open-source this work. Please consider them as your future employer - it's a great place to work!