Pandas categorical utils - lightweight automatic category ordering, missing category detection, profiling - attribute frequencies, correlations, category correlations.

pandas-cat

pandas-cat is a library of categorical utilities for Pandas.

The name pandas-cat is an abbreviation of PANDAS-CATegorical utils. This package provides

  • automatic ordering for ordinal variables - a lightweight module for converting string categories to ordered ones where possible (based on numbers inside the text, like "Over 25")
  • advanced missing value detection - detection of typical missing-data encodings (typical = encodings that we manually identified in more than 100 datasets)
  • categorical data profiling - profile for categorical attributes

The pandas-cat in more detail

Ordinal data ordering

This package tries to convert strings to ordered categories. For example, for Vehicle_Age in the Accidents dataset:

ORIGINAL (unordered)                                           : 1 6 5 4 9 14 >20 10 8 15 12 11 16-20 3 13 2 7   
ALPHABETICALLY ORDERED (strings do not allow numeric ordering) : >20 1 10 11 12 13 14 15 16-20 2 3 4 5 6 7 8 9 
AS ANALYST WISHES (package does)                               : 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16-20 >20

Typical issues (the numbers are not stored as numbers):

  • categories are intervals (like 75-100, 101-200)
  • categories carry additional information (e.g. Over 75, 60+, <18, Under 16)
  • n/a or another string category is explicitly coded in the data and gets sorted among the other values
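
The idea of ordering labels by the numbers they contain can be sketched in plain Pandas. This is only an illustration under simple assumptions, not the algorithm pandas-cat actually uses: the helper sort_key is hypothetical, it looks only at the first number in each label, and it treats prefixes such as "<", "Under", ">", "Over" and the suffix "+" as tie-breakers.

```python
import re

import pandas as pd


def sort_key(label: str):
    """Hypothetical helper: order a label by the first number it contains."""
    match = re.search(r"\d+(?:\.\d+)?", label)
    if match is None:
        # Labels without any number (e.g. "n/a") are placed last.
        return (float("inf"), 0, label)
    number = float(match.group())
    text = label.strip().lower()
    if text.startswith(("<", "under")):
        bump = -1  # "<18" sorts before "18"
    elif text.startswith((">", "over")) or text.endswith("+"):
        bump = 1   # ">20" sorts after "20"
    else:
        bump = 0
    return (number, bump, label)


# Vehicle_Age values in their original (unordered) form
values = ["1", "6", "5", "4", "9", "14", ">20", "10", "8", "15",
          "12", "11", "16-20", "3", "13", "2", "7"]

ordered_labels = sorted(set(values), key=sort_key)
vehicle_age = pd.Series(
    pd.Categorical(values, categories=ordered_labels, ordered=True)
)
print(ordered_labels)
```

With the ordered categorical in place, comparisons and sorting follow the analyst's order rather than the alphabetical one.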

Missing detection and replacement

Missing values are typically encoded in many ways (N/A, #N/A, NA, invalid, NO INFO, NOT RATED, NOT GIVEN, NOT DEFINED, undefined, ...). We manually went through more than 100 datasets with missing values, identified the typical encodings, and added them to this library as automatic missing-value detection and replacement.

The built-in list can be easily extended (by passing additional strings as parameters). Also, when a string should be preserved, the package can be instructed to keep it.
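
The following is a minimal sketch of how such detection with an extendable list might look in plain Pandas. It is not the library's implementation: the set DEFAULT_MISSING contains only the examples named above (not the real built-in list), and the function replace_missing with its extra and keep parameters is hypothetical.

```python
import pandas as pd

# Illustrative tokens only, taken from the examples in the text;
# this is NOT the built-in list shipped with pandas-cat.
DEFAULT_MISSING = {
    "n/a", "#n/a", "na", "invalid", "no info", "not rated",
    "not given", "not defined", "undefined",
}


def replace_missing(series: pd.Series, extra=(), keep=()) -> pd.Series:
    """Hypothetical helper: replace missing-value tokens with NaN."""
    tokens = DEFAULT_MISSING | {t.lower() for t in extra}
    tokens -= {t.lower() for t in keep}  # strings the caller wants preserved
    # Compare case-insensitively and ignore surrounding whitespace.
    normalized = series.astype("string").str.strip().str.lower()
    return series.mask(normalized.isin(list(tokens)))


s = pd.Series(["NA", "Male", "not given", "MyNull", "Female"])
cleaned = replace_missing(s, extra=["MyNull"], keep=["NA"])
print(cleaned)
```

Here "not given" and the custom token "MyNull" become missing, while "NA" is preserved because it was passed in keep.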

Profiling

The package creates an HTML profile of a categorical dataset. It supports both ordinal (ordered) and nominal categories. Currently, there are two templates available

  • standard - standard template with embedded charts
  • interactive - interactive template with dynamically generated charts

The report contains:

  • an overview - basic information about the dataset, its memory consumption and category names
  • profiles - profiles of attributes - charts with frequencies of all categories (not only the top n, as is typical for universal profiling packages)
  • correlations - correlations between attributes and values
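
To give a rough idea of the numbers behind an attribute profile, the frequencies of all categories in their defined order can be computed with plain Pandas. This is only a sketch of the underlying idea, not the code the report uses, and the sample values are made up for illustration.

```python
import pandas as pd

# Made-up sample values with an explicit category order
age = pd.Series(
    pd.Categorical(
        ["16-20", "1", ">20", "1", "2", "1"],
        categories=["1", "2", "16-20", ">20"],
        ordered=True,
    ),
    name="Vehicle_Age",
)

# Count every category and list the counts in category order,
# not in order of frequency.
counts = age.value_counts().sort_index()
shares = counts / counts.sum()
print(counts)
```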

Installation and using the package

You can install the package using

pip install pandas-cat

To load your dataset into a Pandas DataFrame, you can use the read_csv() method for CSV files or the read_excel() method for Excel files. Both methods support a parameter called keep_default_na, which you can set to False. This prevents Pandas from detecting missing values, as pandas-cat offers a much more comprehensive detection system, including all the values Pandas detects. For faster report generation, you can select specific columns for analysis by filtering them directly in Pandas.
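
The effect of keep_default_na can be checked on a small in-memory CSV. The sample data below is made up for illustration; read_csv and its keep_default_na parameter are standard Pandas.

```python
import io

import pandas as pd

# Made-up sample data; "NA" is one of the strings Pandas treats as missing by default
csv_text = "Sex,Journey\nMale,NA\nFemale,Commuting\n"

# Default behaviour: Pandas converts "NA" to NaN while reading.
default = pd.read_csv(io.StringIO(csv_text))

# With keep_default_na=False the string "NA" is kept as-is,
# so missing detection can be left to a later step.
raw = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

# Selecting columns up front keeps later processing small.
subset = raw[["Journey"]]
print(raw["Journey"].tolist())
```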

Sample Code

import pandas as pd
from pandas_cat import pandas_cat

# Read dataset. You can download it and set up a path to the local file.
df = pd.read_csv('https://petrmasa.com/pandas-cat/data/accidents.zip',
                 encoding='cp1250', sep='\t')

# Use only selected columns
df = df[['Driver_Age_Band', 'Driver_IMD', 'Sex', 'Journey']]

# Generate a profile report with the default template
pandas_cat.profile(df=df, dataset_name="Accidents", opts={"auto_prepare": True})

For a longer demo report, use this set of columns instead of the first one

df = df[['Driver_Age_Band','Driver_IMD','Sex','Journey','Hit_Objects_in','Hit_Objects_off','Casualties','Severity','Area','Vehicle_Age','Road_Type','Speed_limit','Light','Vehicle_Location','Vehicle_Type']]

To generate an interactive report, set the template to interactive

pandas_cat.profile(df=df, dataset_name="Accidents", template="interactive", opts={"auto_prepare": True})

For advanced customization, use additional options

pandas_cat.profile(
    df=df,
    dataset_name="Accidents",
    template="interactive",
    opts={
        "auto_prepare": True,
        "cat_limit": 60,  # Maximum categories for profiling
        "na_values": ["MyNA", "MyNull"],  # Custom missing values
        "na_ignore": ["NA"],  # Exclude specific values from missing detection
        "keep_default_na": True  # Use the built-in list of default missing values
    }
)

To adjust the dataset only, without generating a report

df = pandas_cat.prepare(df)

Data and sample reports

Sample reports are here

The dataset is downloaded from the web (each time you run the code). If you want, you can download sample dataset here and store it locally.

Credits

Petr Masa - Base package, basic data preparation

Jan Nejedly - Interactive report, handling missing values
