Profile pandas DataFrames and generate self-contained HTML reports with categorical and continuous column analysis.

pandas-cat

pandas-cat (PANDAS-CATegorical profiling) is a library for profiling categorical datasets and preparing them for analysis. It generates HTML reports with category distributions, correlations, and missing-value summaries, and automatically reorders numeric-like categories into their natural order.

Datasets typically mix categorical and continuous variables, and the package can profile both. Variable types are detected automatically. Because pandas-cat focuses on categorical profiling, the default preparation engine converts numeric columns with few distinct values (<= cat_limit, default 20) to ordered categoricals, so a 0/1 flag or a 1–5 rating scale gets a frequency bar chart rather than a histogram. Numeric columns with more distinct values are left as continuous. This behaviour can be overridden by passing auto_prepare=False or by casting the column to float before calling profile().
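
The low-cardinality rule can be illustrated with plain pandas. The sketch below is not pandas-cat's implementation: the helper name to_ordered_if_low_cardinality and the sample data are hypothetical, and it assumes only the rule as described above (numeric columns with at most cat_limit distinct values become ordered categoricals, the rest stay numeric).

```python
import pandas as pd


def to_ordered_if_low_cardinality(df: pd.DataFrame, cat_limit: int = 20) -> pd.DataFrame:
    """Hypothetical helper mimicking the described rule; not part of pandas-cat."""
    out = df.copy()
    for col in out.select_dtypes(include="number").columns:
        values = sorted(out[col].dropna().unique())
        if len(values) <= cat_limit:
            # Few distinct values: treat as an ordered categorical (bar chart).
            dtype = pd.CategoricalDtype(categories=values, ordered=True)
            out[col] = out[col].astype(dtype)
        # Otherwise the column is left numeric (histogram).
    return out


# Made-up sample data; cat_limit is lowered to 4 so the effect is visible.
df = pd.DataFrame({
    "flag": [0, 1, 1, 0, 1],
    "rating": [1, 5, 3, 3, 2],
    "speed": [31.5, 48.2, 52.9, 47.1, 60.3],
})
prepared = to_ordered_if_low_cardinality(df, cat_limit=4)
print(prepared.dtypes)
```

Here flag and rating have at most four distinct values and become ordered categoricals, while speed has five and stays float.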

Pass any DataFrame and get a self-contained HTML report in one call:

```python
import pandas_cat
pandas_cat.profile(df, dataset_name="Road accidents")
```

The report gives you:

Bar charts — frequency counts and percentages for every categorical column.
Histograms — distribution for every numeric column.
Correlations — between all variables and between categorical values.
Missing-value summary — sentinel detection and gap counts per column.
Memory breakdown — usage by column.
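
To show what a per-column memory breakdown involves, the following sketch uses the standard pandas DataFrame.memory_usage call. How pandas-cat computes its own breakdown is not stated here, so treat this as an assumption; the sample data is made up.

```python
import pandas as pd

# Made-up sample data for illustration only.
df = pd.DataFrame({
    "band": ["16-20", "21-25", "16-20"],
    "age": [18, 23, 19],
})

# deep=True also counts the Python string objects, not just the pointers.
usage = df.memory_usage(index=False, deep=True)  # bytes per column
share = (usage / usage.sum() * 100).round(1)     # percentage of the total

summary = pd.DataFrame({"bytes": usage, "percent": share})
print(summary)
```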

Two preparation helpers keep the data clean before profiling (you can also use them separately):

prepare(df) detects numeric-like categories and converts them to ordered CategoricalDtype so charts and correlations respect natural order.

Without prepare(), pandas sorts categories alphabetically — a common trap:
```
# Alphabetical (wrong) — pandas default
16–20, 21–25, 26–35, 36–45, 46–55, 56–65, 66–75, 6–10, 76+, Under 6
```
After prepare(), the natural numeric order is restored:
```
# Natural order (correct) — after prepare()
Under 6, 6–10, 16–20, 21–25, 26–35, 36–45, 46–55, 56–65, 66–75, 76+
```
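
One way to obtain such an ordering is to sort labels by the first number they contain, with a tie-breaker for open-ended bands. This is a minimal sketch under that assumption, not the logic prepare() actually uses; the function name natural_key and its prefix rules are hypothetical.

```python
import re


def natural_key(label: str) -> tuple[float, int]:
    """Hypothetical sort key: first number in the label, then an open-end rank."""
    match = re.search(r"\d+", label)
    # Labels without any number sort last.
    number = float(match.group()) if match else float("inf")
    lowered = label.strip().lower()
    if lowered.startswith(("under", "less", "<")):
        rank = -1  # "Under 6" precedes "6–10"
    elif lowered.startswith(("over", "more", ">")) or lowered.endswith("+"):
        rank = 1   # "76+" follows any range starting at 76
    else:
        rank = 0
    return (number, rank)


bands = ["16–20", "21–25", "26–35", "36–45", "46–55",
         "56–65", "66–75", "6–10", "76+", "Under 6"]
ordered = sorted(bands, key=natural_key)
print(ordered)
```

The resulting list could then be passed as the categories of pd.CategoricalDtype(categories=ordered, ordered=True) so that charts follow the same order.
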
handle_missing_values(df) replaces 75+ sentinel strings ("Unknown", "N/A", "–", "Missing", …) with pd.NA so they are counted as missing rather than treated as valid categories.

Installation

```shell
pip install pandas-cat
```

Quick start

```python
import pandas as pd
import pandas_cat

df = pd.read_csv('data.csv')
pandas_cat.profile(df=df, dataset_name="My dataset")
# generates report/report.html
```

Continuous variables

Numeric columns with many distinct values are profiled as continuous out of the box. The built-in preparation engine (auto_prepare=True) keeps numeric columns whose number of unique values exceeds cat_limit as continuous, so they are profiled with histograms rather than excluded. Low-cardinality numeric columns are converted to ordered categoricals instead.

```python
df = pd.read_csv('https://petrmasa.com/pandas-cat/data/accidents.zip',
                 encoding='cp1250', sep='\t')

pandas_cat.profile(df=df, dataset_name="Accidents", out_html="accidents_full.html")
# Columns like Driver_Age, Hour, Engine_Capacity are profiled with histograms.
# String-encoded categories like Driver_Age_Band are ordered correctly.
```

Categorical data with numeric-looking values

Many real datasets store ordered categories as strings: "0-10", "Over 75", "60+". Alphabetical sorting puts them in the wrong order, for example "Over 75" before "Under 5". pandas-cat fixes this automatically:

```python
df = pd.read_csv('https://petrmasa.com/pandas-cat/data/accidents.zip',
                 encoding='cp1250', sep='\t')
df = df[['Driver_Age_Band', 'Driver_IMD', 'Sex', 'Journey']]

pandas_cat.profile(df=df, dataset_name="Accidents")
```

auto_prepare=True (default) converts Driver_Age_Band to an ordered pandas.Categorical sorted by the extracted numeric values before profiling.

Report templates

```python
# Default: static HTML with SVG charts
pandas_cat.profile(df=df, dataset_name="Accidents")

# Modern: same content, refreshed visual style
pandas_cat.profile(df=df, dataset_name="Accidents", template="modern")

# Interactive: three correlation metrics (Cramér's V, Spearman, Theil's U),
# per-category crosstabs, driven by the raw data
pandas_cat.profile(df=df, dataset_name="Accidents", template="interactive")
```
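
Of the three metrics named for the interactive template, Cramér's V is the simplest to show. The sketch below uses the textbook definition V = sqrt(chi2 / (n * (min(r, c) - 1))) without bias correction. Whether pandas-cat applies a correction or a different estimator is not stated, so this is illustrative only; the function name cramers_v is hypothetical.

```python
from collections import Counter
from math import sqrt


def cramers_v(xs, ys):
    """Uncorrected Cramér's V for two equally long sequences of category labels."""
    n = len(xs)
    joint = Counter(zip(xs, ys))   # observed counts per (x, y) cell
    row = Counter(xs)              # marginal counts of x
    col = Counter(ys)              # marginal counts of y
    chi2 = 0.0
    for r, r_count in row.items():
        for c, c_count in col.items():
            expected = r_count * c_count / n
            observed = joint.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(row), len(col)) - 1
    # A variable with a single category carries no association information.
    return sqrt(chi2 / (n * k)) if k > 0 else 0.0


perfect = cramers_v(["a", "a", "b", "b"], ["x", "x", "y", "y"])
independent = cramers_v(["a", "a", "b", "b"], ["x", "y", "x", "y"])
print(perfect, independent)
```

The first pair is perfectly associated and the second is independent, so the two values sit at the ends of the 0 to 1 range.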

All options

```python
pandas_cat.profile(
    df=df,
    dataset_name="Accidents",
    out_html="report.html",   # written to report/<out_html>
    opts={
        "auto_prepare":    True,         # convert numeric-string categories
        "cat_limit":       20,           # numeric columns with <= cat_limit distinct values become ordered categoricals
        "na_values":       ["MyNA"],     # extra missing-value sentinels
        "na_ignore":       ["NA"],       # built-in sentinels to keep as-is
        "keep_default_na": True,         # False = use only na_values
    }
)
```

Data preparation only

```python
# Built-in engine (default): preserves high-cardinality numeric columns as continuous
df = pandas_cat.prepare(df)

# CleverMiner engine: opt in explicitly
df = pandas_cat.prepare(df, auto_data_prep="CLM")
```

Missing-value handling only

```python
df, detected, counts = pandas_cat.handle_missing_values(
    df,
    na_values=["TBD"],
    na_ignore=["-"],
)
```

75+ built-in sentinel strings are detected by default (NA, N/A, NULL, None, Unknown, Missing, …).
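
The idea behind sentinel replacement can be sketched in plain pandas. Everything here is an assumption for illustration: the helper replace_sentinels is hypothetical, its sentinel set is a small made-up subset rather than the library's built-in list, its matching is case-insensitive by choice, and its return value differs from that of handle_missing_values.

```python
import pandas as pd

# Illustrative subset only; not the library's built-in list.
SENTINELS = {"unknown", "n/a", "na", "null", "none", "missing", "-"}


def replace_sentinels(df, na_values=(), na_ignore=()):
    """Hypothetical helper: replace sentinel strings with pd.NA, count hits per column."""
    extra = {v.lower() for v in na_values}
    ignored = {v.lower() for v in na_ignore}
    active = (SENTINELS | extra) - ignored
    out = df.copy()
    counts = {}
    for col in out.columns:
        # Only string values can match; numbers and existing NaN are skipped.
        hits = out[col].map(lambda v: isinstance(v, str) and v.strip().lower() in active)
        n_hits = int(hits.sum())
        if n_hits:
            out[col] = out[col].mask(hits, pd.NA)
            counts[col] = n_hits
    return out, counts


# Made-up sample data.
df = pd.DataFrame({
    "sex": ["M", "Unknown", "F", "n/a"],
    "journey": ["Work", "TBD", "-", "School"],
    "age": [30, 41, 25, 58],
})
cleaned, counts = replace_sentinels(df, na_values=["TBD"], na_ignore=["-"])
print(counts)
print(cleaned.isna().sum())
```

In this sample "TBD" is treated as missing because it was added through na_values, while "-" survives as a valid category because it was listed in na_ignore.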

Sample reports

Credits

Petr Masa — base package, data preparation, maintaining the package and development

Jan Nejedly — interactive report (first version), missing-value handling (first version)

Project details

Release history

0.1.5 (this version): May 31, 2026
0.1.4: Apr 9, 2026
0.1.3: Jan 1, 2025
0.1.2: Dec 27, 2023
0.1.1: Jun 25, 2023
0.1.0: Apr 19, 2023

Download files

Source Distribution

pandas_cat-0.1.5.tar.gz (59.1 kB)

Uploaded May 31, 2026 Source

File details

Details for the file pandas_cat-0.1.5.tar.gz.

File metadata

Download URL: pandas_cat-0.1.5.tar.gz
Upload date: May 31, 2026
Size: 59.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for pandas_cat-0.1.5.tar.gz
Algorithm	Hash digest
SHA256	`430fb1ecea2ca7aeb2158bfd78d5305bff7b21b6df610d366aed9ba4fcb677b0`
MD5	`0b9aaf081949125286921f74c1870f14`
BLAKE2b-256	`ae10add308d15ffed887da065c815028fca88d8dacd43a6ac25996ee67d707da`
