Profile pandas DataFrames and generate self-contained HTML reports with categorical and continuous column analysis.

pandas-cat

pandas-cat (PANDAS-CATegorical profiling) is a library for profiling categorical datasets and preparing them for analysis. It generates HTML reports with category distributions, correlations, and missing-value summaries, and automatically reorders numeric-like categories into their natural order.

Datasets typically mix categorical and continuous variables, and the package profiles both. Variable types are detected automatically. Because pandas-cat focuses on categorical profiling, the default preparation engine converts numeric columns with few distinct values (<= cat_limit, default 20) to ordered categoricals, so a 0/1 flag or a 1–5 rating scale gets a frequency bar chart rather than a histogram. Numeric columns with more distinct values are left as continuous. To override this behaviour, pass auto_prepare=False or cast the column to float before calling profile().
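The cat_limit rule described above can be approximated with plain pandas. This is only a sketch of the documented behaviour, not the package's implementation; the function name sketch_prepare_numeric and the sample columns are made up for illustration.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def sketch_prepare_numeric(df: pd.DataFrame, cat_limit: int = 20) -> pd.DataFrame:
    """Illustrative only: mimic the documented cat_limit rule with plain pandas."""
    out = df.copy()
    for col in out.columns:
        s = out[col]
        if is_numeric_dtype(s) and s.nunique(dropna=True) <= cat_limit:
            # Few distinct values: make an ordered categorical, levels sorted numerically.
            levels = sorted(s.dropna().unique())
            out[col] = pd.Categorical(s, categories=levels, ordered=True)
        # Any other column is left untouched (numeric ones stay continuous).
    return out


# Hypothetical data: two low-cardinality columns and one continuous column.
df = pd.DataFrame({
    "flag": [0, 1, 1, 0, 1] * 10,                 # 2 distinct values
    "rating": [1, 2, 3, 4, 5] * 10,               # 5 distinct values
    "speed": [i * 1.5 for i in range(50)],        # 50 distinct values
})
prepared = sketch_prepare_numeric(df)
print(prepared.dtypes)
```

With the default limit of 20, flag and rating come back as ordered categoricals while speed keeps its float dtype.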

Pass any DataFrame and get a self-contained HTML report in one call:

import pandas_cat
pandas_cat.profile(df, dataset_name="Road accidents")

The report gives you:

  • Bar charts — frequency counts and percentages for every categorical column.
  • Histograms — distribution for every numeric column.
  • Correlations — between all variables and between categorical values.
  • Missing-value summary — sentinel detection and gap counts per column.
  • Memory breakdown — usage by column.

Two preparation helpers keep the data clean before profiling (you can also use them separately):

  • prepare(df) detects numeric-like categories and converts them to ordered CategoricalDtype so charts and correlations respect natural order.

    Without prepare(), pandas sorts categories alphabetically — a common trap:

    # Alphabetical (wrong) — pandas default
    16–20, 21–25, 26–35, 36–45, 46–55, 56–65, 66–75, 6–10, 76+, Under 6
    

    After prepare(), the natural numeric order is restored:

    # Natural order (correct) — after prepare()
    Under 6, 6–10, 16–20, 21–25, 26–35, 36–45, 46–55, 56–65, 66–75, 76+
    
  • handle_missing_values(df) replaces 75+ sentinel strings ("Unknown", "N/A", "–", "Missing", …) with pd.NA so they are counted as missing rather than treated as valid categories.
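To make the ordering trap concrete, here is a minimal standard-library sketch of a numeric-aware sort key. It is an assumption about how such ordering could be derived, not the logic prepare() actually uses; natural_key is a hypothetical helper, and the "Under N" adjustment is one simple convention among several.

```python
import re


def natural_key(label: str) -> tuple:
    """Sort key: first number found in the label; 'Under N' sorts just below N."""
    match = re.search(r"\d+", label)
    if match is None:
        return (float("inf"), label)      # labels without digits go last
    value = float(match.group())
    if label.lower().startswith("under"):
        value -= 0.5                      # so "Under 6" precedes "6–10"
    return (value, label)


bands = ["16–20", "21–25", "26–35", "6–10", "76+", "Under 6"]
print(sorted(bands))                      # plain string sort
print(sorted(bands, key=natural_key))     # numeric-aware sort
```

The plain sort compares characters, so "16–20" lands before "6–10"; the keyed sort compares the extracted numbers instead.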

Installation

pip install pandas-cat

Quick start

import pandas as pd
import pandas_cat

df = pd.read_csv('data.csv')
pandas_cat.profile(df=df, dataset_name="My dataset")
# generates report/report.html

Continuous variables

Numeric columns are profiled out of the box. The built-in preparation engine (auto_prepare=True) keeps numeric columns with more than cat_limit distinct values as continuous, so they are profiled with histograms rather than excluded. Low-cardinality numeric columns are converted to ordered categoricals instead.

df = pd.read_csv('https://petrmasa.com/pandas-cat/data/accidents.zip',
                 encoding='cp1250', sep='\t')

pandas_cat.profile(df=df, dataset_name="Accidents", out_html="accidents_full.html")
# Columns like Driver_Age, Hour, Engine_Capacity are profiled with histograms.
# String-encoded categories like Driver_Age_Band are ordered correctly.

Categorical data with numeric-looking values

Many real datasets store ordered categories as strings: "0-10", "Over 75", "60+". Alphabetical sorting places "Over 75" before "Under 5". pandas-cat fixes this automatically:

df = pd.read_csv('https://petrmasa.com/pandas-cat/data/accidents.zip',
                 encoding='cp1250', sep='\t')
df = df[['Driver_Age_Band', 'Driver_IMD', 'Sex', 'Journey']]

pandas_cat.profile(df=df, dataset_name="Accidents")

auto_prepare=True (default) converts Driver_Age_Band to an ordered pandas.Categorical sorted by the extracted numeric values before profiling.
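For readers unfamiliar with ordered categoricals, the following sketch shows what the conversion buys you in plain pandas. The band labels and their order are hypothetical and hard-coded here; pandas-cat derives the order itself, and this example does not call it.

```python
import pandas as pd

# Hypothetical age bands, listed by hand in their natural order.
natural_order = ["Under 6", "6-10", "16-20", "76+"]
s = pd.Series(["76+", "6-10", "Under 6", "16-20", "6-10"])

# As plain strings, sorting is lexicographic.
as_text = s.sort_values().tolist()

# As an ordered categorical, sorting and comparisons follow natural_order.
band_dtype = pd.CategoricalDtype(categories=natural_order, ordered=True)
as_cat = s.astype(band_dtype)
ordered = as_cat.sort_values().tolist()
counts = as_cat.value_counts(sort=False)       # index follows the category order
younger = int((as_cat < "16-20").sum())        # rows in bands below "16-20"

print(as_text)
print(ordered)
print(counts)
```

Because the order lives in the dtype, anything built on the column (sorting, frequency tables, rank-based statistics) sees the bands in the intended sequence.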

Report templates

# Default — static HTML with SVG charts
pandas_cat.profile(df=df, dataset_name="Accidents")

# Modern — same content, refreshed visual style
pandas_cat.profile(df=df, dataset_name="Accidents", template="modern")

# Interactive — three correlation metrics (Cramér's V, Spearman, Theil's U),
# per-category crosstabs, raw data driven
pandas_cat.profile(df=df, dataset_name="Accidents", template="interactive")

All options

pandas_cat.profile(
    df=df,
    dataset_name="Accidents",
    out_html="report.html",   # written to report/<out_html>
    opts={
        "auto_prepare":    True,         # convert numeric-string categories
        "cat_limit":       20,           # numeric columns with <= cat_limit distinct values become ordered categoricals
        "na_values":       ["MyNA"],     # extra missing-value sentinels
        "na_ignore":       ["NA"],       # built-in sentinels to keep as-is
        "keep_default_na": True,         # False = use only na_values
    }
)

Data preparation only

# Built-in engine (default) — preserves high-cardinality numeric columns as continuous
df = pandas_cat.prepare(df)

# CleverMiner engine — opt in explicitly
df = pandas_cat.prepare(df, auto_data_prep="CLM")

Missing-value handling only

df, detected, counts = pandas_cat.handle_missing_values(
    df,
    na_values=["TBD"],
    na_ignore=["-"],
)

75+ built-in sentinel strings are detected by default (NA, N/A, NULL, None, Unknown, Missing, …).
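The way na_values, na_ignore, and keep_default_na combine can be sketched with plain pandas. This is an illustration under stated assumptions, not the library's code: the six-item DEFAULT_SENTINELS set stands in for the real built-in list, sketch_handle_missing is a hypothetical name, and the shape of the second and third return values is a guess at what detected and counts might hold.

```python
import pandas as pd

# Tiny stand-in for the built-in list (the real one is described as 75+ strings).
DEFAULT_SENTINELS = {"NA", "N/A", "NULL", "None", "Unknown", "Missing"}


def sketch_handle_missing(df, na_values=(), na_ignore=(), keep_default_na=True):
    """Illustrative only: replace sentinel strings with pd.NA and count the hits."""
    # Built-ins minus the ignored ones, or nothing if defaults are switched off.
    base = (DEFAULT_SENTINELS - set(na_ignore)) if keep_default_na else set()
    sentinels = base | set(na_values)
    out = df.copy()
    counts = {}
    for col in out.columns:
        hits = out[col].isin(sentinels)
        if hits.any():
            counts[col] = int(hits.sum())
            out[col] = out[col].mask(hits, pd.NA)
    return out, sentinels, counts


# Hypothetical data: "TBD" is an extra sentinel, "NA" is a value we want to keep.
df = pd.DataFrame({
    "sex": ["Male", "Unknown", "Female", "N/A"],
    "journey": ["Commute", "TBD", "NA", "School run"],
})
clean, used, counts = sketch_handle_missing(df, na_values=["TBD"], na_ignore=["NA"])
print(counts)
print(clean)
```

In this sketch "NA" survives in the journey column because it was listed in na_ignore, while "TBD" is treated as missing because it was added through na_values.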

Credits

Petr Masa: base package, data preparation, maintenance, and development

Jan Nejedly: interactive report (first version), missing-value handling (first version)
