Pandas categorical utils - lightweight automatic category ordering, missing category detection, profiling - attribute frequencies, correlations, category correlations.
pandas-cat
pandas-cat is a library of categorical utilities for Pandas.
The name pandas-cat is an abbreviation of PANDAS-CATegorical utils. This package provides:
- automatic ordering for ordinal variables - a lightweight module for converting string categories to ordered ones where possible (based on numbers inside the text, such as "Over 25")
- advanced missing value detection - detection of typical missing-data encodings (typical = encodings that we have manually identified in more than 100 datasets)
- categorical data profiling - a profile of categorical attributes
pandas-cat in more detail
Ordinal data ordering
This package tries to convert strings to ordered categories. For example, for Vehicle_Age in the Accidents dataset:
ORIGINAL (unordered) : 1 6 5 4 9 14 >20 10 8 15 12 11 16-20 3 13 2 7
ALPHABETICALLY ORDERED (strings do not allow numeric ordering) : >20 1 10 11 12 13 14 15 16-20 2 3 4 5 6 7 8 9
AS ANALYST WISHES (package does) : 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16-20 >20
Typical issues (the numbers are not stored as numbers):
- categories are intervals (like 75-100, 101-200)
- categories carry additional information (e.g. Over 75, 60+, <18, Under 16)
- the data contain n/a or another explicitly coded string category that gets sorted together with the other values
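The ordering idea described above can be sketched in plain Pandas. This is a minimal illustration, not the package's actual algorithm: the helper names `numeric_sort_key` and `to_ordered_categorical` are hypothetical, and the rule used here (sort by the first number found in the label, nudge open-ended labels, put labels without numbers last) is an assumption made for the example.

```python
import re

import pandas as pd


def numeric_sort_key(label):
    """Hypothetical sort key: order labels by the first number they contain."""
    text = str(label).strip()
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        # Labels without any number (e.g. "n/a") are placed last.
        return (1, 0.0, 0, text)
    value = float(match.group())
    lowered = text.lower()
    # Open-ended labels are nudged so "<18" sorts before "18",
    # and ">20", "60+" or "Over 75" sort after the plain number.
    if lowered.startswith(("<", "under", "below")):
        nudge = -1
    elif lowered.startswith((">", "over", "above")) or lowered.endswith("+"):
        nudge = 1
    else:
        nudge = 0
    return (0, value, nudge, text)


def to_ordered_categorical(series):
    """Convert a string column to an ordered categorical using the key above."""
    labels = sorted(series.dropna().unique(), key=numeric_sort_key)
    return series.astype(pd.CategoricalDtype(categories=labels, ordered=True))


# Vehicle_Age values in their original (unordered) form.
raw = "1 6 5 4 9 14 >20 10 8 15 12 11 16-20 3 13 2 7".split()
vehicle_age = to_ordered_categorical(pd.Series(raw))
print(list(vehicle_age.cat.categories))
```

With the Vehicle_Age values above, the printed categories follow the "as analyst wishes" order, with 16-20 and >20 at the end.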
Missing detection and replacement
Missing values are encoded in many ways (N/A, #N/A, NA, invalid, NO INFO, NOT RATED, NOT GIVEN, NOT DEFINED, undefined, ...). We manually went through more than 100 datasets with missing values, identified the typical encodings, and added them to this library as automatic missing-value detection and replacement.
The built-in list can be easily extended (by passing additional strings as parameters). Also, when a string should be preserved, the package can be instructed to keep it.
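To make the extend-and-preserve behaviour concrete, here is a small sketch in plain Pandas. It does not reproduce the package's implementation: `EXAMPLE_MISSING` is only an illustrative subset taken from the encodings listed above, and `replace_missing` with its `na_values` and `na_ignore` arguments is a hypothetical helper written for this example.

```python
import pandas as pd

# Illustrative subset only; the package's built-in list is longer.
EXAMPLE_MISSING = {
    "n/a", "#n/a", "na", "invalid", "no info",
    "not rated", "not given", "not defined", "undefined",
}


def replace_missing(series, na_values=(), na_ignore=()):
    """Hypothetical helper: replace known missing-value strings with pd.NA."""
    tokens = EXAMPLE_MISSING | {s.lower() for s in na_values}
    # Strings listed in na_ignore are kept as regular categories.
    tokens -= {s.lower() for s in na_ignore}
    is_missing = series.map(
        lambda v: isinstance(v, str) and v.strip().lower() in tokens
    ).astype(bool)
    return series.mask(is_missing, pd.NA)


ratings = pd.Series(["Good", "N/A", "NA", "not rated", "MyNull", "Fair"])
cleaned = replace_missing(ratings, na_values=["MyNull"], na_ignore=["NA"])
print(cleaned.isna().tolist())
```

In this sketch "MyNull" is treated as missing because it was added, while "NA" survives as a regular value because it was explicitly preserved.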
Profiling
The package creates an HTML profile of a categorical dataset. It supports both ordinal (ordered) and nominal categories. Currently, two templates are available:
- standard - standard template with embedded charts
- interactive - interactive template with dynamically generated charts
The report contains:
- an overview - basic information about the dataset, its memory consumption and category names
- profiles - profiles of attributes - charts with frequencies of categories (all of them, not just the top n as is typical for general-purpose profiling packages)
- correlations - correlations between attributes and values
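The kind of numbers behind such a report can be approximated in plain Pandas. The sketch below is only an illustration under stated assumptions: the description above does not say which correlation measure the report uses, so Cramér's V is swapped in here as one common association measure for categorical attributes, and the helper names and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def category_frequencies(series):
    """Counts for every category (including missing), not just the top n."""
    return series.value_counts(dropna=False, sort=False)


def cramers_v(a, b):
    """Cramér's V between two categorical columns (assumed measure)."""
    table = pd.crosstab(a, b).to_numpy(dtype=float)
    k = min(table.shape) - 1
    if k <= 0:
        # A column with a single category carries no association.
        return 0.0
    n = table.sum()
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    return float(np.sqrt(chi2 / (n * k)))


# Toy data, invented for the example.
df = pd.DataFrame({
    "Sex": ["M", "F", "M", "F"],
    "Group": ["a", "b", "a", "b"],
    "Other": ["x", "x", "y", "y"],
})
print(category_frequencies(df["Sex"]))
print(cramers_v(df["Sex"], df["Group"]))
print(cramers_v(df["Sex"], df["Other"]))
```

In the toy data, Sex fully determines Group and is independent of Other, so the two calls give the extremes of the measure.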
Installing and using the package
You can install the package using
pip install pandas-cat
To load your dataset into a Pandas DataFrame, you can use the read_csv() method for CSV files or the read_excel() method for Excel files. Both methods support a parameter called keep_default_na, which you can set to False. This prevents Pandas from detecting missing values, as pandas-cat offers a much more comprehensive detection system, including all the values Pandas detects. For faster report generation, you can select specific columns for analysis by filtering them directly in Pandas.
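The effect of keep_default_na can be seen on a tiny in-memory CSV. The data below is invented for the example; only standard Pandas calls are used.

```python
import io

import pandas as pd

# A tiny, made-up CSV in which one value is the literal string "NA".
csv_text = "Age_Band,Rating\n16-20,NA\n>20,Good\n"

# Default behaviour: Pandas converts "NA" to a missing value while reading.
default_df = pd.read_csv(io.StringIO(csv_text))

# With keep_default_na=False the original string is preserved,
# so missing-value detection can be done later in one place.
raw_df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(default_df["Rating"].isna().tolist())
print(raw_df["Rating"].tolist())

# Restricting the analysis to selected columns is ordinary Pandas filtering.
subset = raw_df[["Rating"]]
```

The first read reports the "NA" cell as missing, while the second keeps it as the string "NA".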
Sample Code
import pandas as pd
from pandas_cat import pandas_cat
# Read dataset. You can download it and set up a path to the local file.
df = pd.read_csv('https://petrmasa.com/pandas-cat/data/accidents.zip',
                 encoding='cp1250', sep='\t')
# Use only selected columns
df = df[['Driver_Age_Band', 'Driver_IMD', 'Sex', 'Journey']]
# Generate a profile report with the default template
pandas_cat.profile(df=df, dataset_name="Accidents", opts={"auto_prepare": True})
For a longer demo report, use this set of columns instead of the first one:
df = df[['Driver_Age_Band','Driver_IMD','Sex','Journey','Hit_Objects_in','Hit_Objects_off','Casualties','Severity','Area','Vehicle_Age','Road_Type','Speed_limit','Light','Vehicle_Location','Vehicle_Type']]
To generate an interactive report, set the template to interactive:
pandas_cat.profile(df=df, dataset_name="Accidents", template="interactive", opts={"auto_prepare": True})
For advanced customization, use additional options:
pandas_cat.profile(
    df=df,
    dataset_name="Accidents",
    template="interactive",
    opts={
        "auto_prepare": True,
        "cat_limit": 60,                  # Maximum number of categories for profiling
        "na_values": ["MyNA", "MyNull"],  # Custom missing values
        "na_ignore": ["NA"],              # Exclude specific values from missing detection
        "keep_default_na": True           # Use the built-in list of default missing values
    }
)
To only prepare the dataset, without generating a report:
df = pandas_cat.prepare(df)
Data and sample reports
Sample reports are here
The dataset is downloaded from the web each time you run the code. If you prefer, you can download the sample dataset here and store it locally.
Credits
Petr Masa - Base package, basic data preparation
Jan Nejedly - Interactive report, handling missing values