Pandas categorical profiling. Generates an HTML profile report for a categorical dataset. Also provides several handy functions.
pandas-cat
pandas-cat is a categorical profiling library for Pandas.
The name pandas-cat is an abbreviation of PANDAS-CATegorical profiling. The package provides a profile of categorical attributes as well as (optional) adjustments of the dataset, e.g. estimating whether a variable is numeric and ordering its categories with respect to the numbers they contain.
pandas-cat in more detail
The package creates an HTML profile of a categorical dataset. It supports both ordinal (ordered) and nominal categories. Moreover, it addresses typical issues with categorical data, especially ordered data as it is commonly delivered: categories that are de facto numbers, or numbers with some qualifier, and that should be treated as ordered.
For example, in the Accidents dataset, the categories of the attribute Hit Objects in can be ordered as:
- unordered: 0.0 10.0 7.0 11.0 4.0 2.0 8.0 1.0 9.0 6.0 5.0 12.0 nan
- ordered (lexicographically, as text): 0.0 1.0 10.0 11.0 12.0 2.0 4.0 5.0 6.0 7.0 8.0 9.0 nan
- as the analyst wishes (what the package does): 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0 nan
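The difference between the second and third ordering above can be reproduced in plain Python. This is a minimal sketch, not the package's implementation: the labels are copied from the unordered example above (without nan), and the variable names are illustrative only.

```python
# Category labels as they often arrive from a CSV file: text, not numbers.
# (Copied from the unordered example above; names here are illustrative.)
labels = ["0.0", "10.0", "7.0", "11.0", "4.0", "2.0", "8.0",
          "1.0", "9.0", "6.0", "5.0", "12.0"]

# Sorting the text directly compares character by character,
# so "10.0" lands before "2.0".
as_text = sorted(labels)

# Sorting by the numeric value gives the order an analyst expects.
as_numbers = sorted(labels, key=float)

print(as_text)
print(as_numbers)
```

The first list matches the "ordered" line above, while the second one runs from the smallest to the largest number.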
Typical issues (categories that are not plain numbers):
- categories are intervals (like 75-100, 101-200)
- categories carry additional information (e.g. Over 75, 60+, <18, Under 16)
- an n/a category is explicitly coded and gets sorted among the data values
Therefore, this library provides profiling as well as somewhat automatic data preparation.
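One way to handle such labels is to sort by the first number found in each label, with a small tie-breaker for qualifiers such as "Under" or "+". The following is a minimal sketch under that assumption, not the algorithm pandas-cat uses; the function `label_sort_key` and the sample labels are hypothetical.

```python
import re

def label_sort_key(label):
    """Hypothetical sort key: first number in the label, then a qualifier offset.

    Labels without any number (e.g. "n/a") sort last.
    """
    text = str(label).strip()
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        return (1, 0.0, 0, text)
    value = float(match.group())
    lowered = text.lower()
    # "<18" / "Under 16" sit just below the number, "60+" / "Over 75" just above.
    if lowered.startswith(("<", "under")):
        offset = -1
    elif lowered.startswith((">", "over")) or lowered.endswith("+"):
        offset = 1
    else:
        offset = 0
    return (0, value, offset, text)

# Illustrative labels, not taken from the Accidents dataset.
labels = ["101-200", "n/a", "Over 200", "75-100", "Under 75"]
ordered = sorted(labels, key=label_sort_key)
print(ordered)
```

With these sample labels, the interval and qualifier categories come out in ascending order and the n/a category is placed at the end.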
Currently, there are two methods in place:
- profile -- profiles a dataset, its categories and their correlations
- prepare -- prepares a dataset, tries to understand label names (if they are numbers) and sorts them
Installation
You can install the package using pip:
pip install pandas-cat
Usage
Using this package is simple. Sample code follows (it uses the Accidents dataset, which is based on a Kaggle dataset):
import pandas as pd
from pandas_cat import pandas_cat

# Read the dataset. You can also download it and set the path to a local file.
df = pd.read_csv('https://petrmasa.com/pandas-cat/data/accidents.zip', encoding='cp1250', sep='\t')

# Use only selected columns.
df = df[['Driver_Age_Band', 'Driver_IMD', 'Sex', 'Journey']]

# The longer demo report uses this set of columns instead of the first one:
# df = df[['Driver_Age_Band','Driver_IMD','Sex','Journey','Hit_Objects_in','Hit_Objects_off','Casualties','Severity','Area','Vehicle_Age','Road_Type','Speed_limit','Light','Vehicle_Location','Vehicle_Type']]

# For profiling, use the following code.
pandas_cat.profile(df=df, dataset_name="Accidents", opts={"auto_prepare": True})

# To only adjust the dataset, use the following code.
df = pandas_cat.prepare(df)
Data and sample reports
Sample reports are here - basic and longer. Note that these reports have been generated with the code above.
The dataset is downloaded from the web each time you run the code. If you prefer, you can download the sample dataset here and store it locally.