Profile pandas DataFrames and generate self-contained HTML reports with categorical and continuous column analysis.
pandas-cat
pandas-cat (PANDAS-CATegorical profiling) is a library for profiling categorical datasets and preparing them for analysis. It generates HTML reports with category distributions, correlations, and missing-value summaries, and automatically reorders numeric-like categories into their natural order.
Datasets typically mix categorical and continuous variables, and the package profiles both.
Variable types are detected automatically. Because pandas-cat focuses on categorical profiling, the default
preparation engine converts numeric columns with few distinct values (<= cat_limit, default 20) to ordered categoricals,
so a 0/1 flag or a 1–5 rating scale gets a frequency bar chart rather than a histogram. Numeric columns with more distinct
values are left as continuous. To override this behaviour, set auto_prepare=False
or cast the column to float before calling profile().
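The cat_limit rule described above can be sketched in plain Python. This is only an illustration of the decision, not pandas-cat's implementation; the function name classify_numeric_column and the sample columns are hypothetical.

```python
def classify_numeric_column(values, cat_limit=20):
    """Return 'categorical' or 'continuous' for a list of numbers.

    Hypothetical helper: mirrors the documented rule that numeric columns
    with at most cat_limit distinct values are treated as categorical.
    """
    distinct = {v for v in values if v is not None}  # ignore missing entries
    return "categorical" if len(distinct) <= cat_limit else "continuous"


flag = [0, 1, 1, 0, 1]                        # 2 distinct values
rating = [1, 2, 3, 4, 5, 3, 2]                # 5 distinct values
engine_capacity = list(range(900, 3000, 50))  # 42 distinct values

print(classify_numeric_column(flag))             # few values: bar chart
print(classify_numeric_column(rating))           # few values: bar chart
print(classify_numeric_column(engine_capacity))  # many values: histogram
```

Raising cat_limit above the number of distinct values moves a column back to the categorical side, which is why the threshold is worth checking for columns such as hours or months.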
Pass any DataFrame and get a self-contained HTML report in one call:
import pandas_cat
pandas_cat.profile(df, dataset_name="Road accidents")
The report gives you:
- Bar charts — frequency counts and percentages for every categorical column.
- Histograms — distribution for every numeric column.
- Correlations — between all variables and between categorical values.
- Missing-value summary — sentinel detection and gap counts per column.
- Memory breakdown — usage by column.
Two preparation helpers clean the data before profiling; both can also be used on their own:
- prepare(df) detects numeric-like categories and converts them to ordered CategoricalDtype so charts and correlations respect natural order. Without prepare(), pandas sorts categories alphabetically, a common trap:
  # Alphabetical (wrong): pandas default
  16–20, 21–25, 26–35, 36–45, 46–55, 56–65, 66–75, 6–10, 76+, Under 6
  After prepare(), the natural numeric order is restored:
  # Natural order (correct): after prepare()
  Under 6, 6–10, 16–20, 21–25, 26–35, 36–45, 46–55, 56–65, 66–75, 76+
- handle_missing_values(df) replaces 75+ sentinel strings ("Unknown", "N/A", "–", "Missing", …) with pd.NA so they are counted as missing rather than treated as valid categories.
Installation
pip install pandas-cat
Quick start
import pandas as pd
import pandas_cat
df = pd.read_csv('data.csv')
pandas_cat.profile(df=df, dataset_name="My dataset")
# generates report/report.html
Continuous variables
The built-in preparation engine (auto_prepare=True) keeps numeric columns with
more than cat_limit unique values as continuous and profiles them with
histograms rather than excluding them. Low-cardinality numeric columns are
converted to ordered categoricals instead.
df = pd.read_csv('https://petrmasa.com/pandas-cat/data/accidents.zip',
encoding='cp1250', sep='\t')
pandas_cat.profile(df=df, dataset_name="Accidents", out_html="accidents_full.html")
# Columns like Driver_Age, Hour, Engine_Capacity are profiled with histograms.
# String-encoded categories like Driver_Age_Band are ordered correctly.
Categorical data with numeric-looking values
Many real datasets store ordered categories as strings: "0-10", "Over 75",
"60+". Alphabetical sorting misorders them, for example placing "Over 75"
before "Under 5". pandas-cat fixes this automatically:
df = pd.read_csv('https://petrmasa.com/pandas-cat/data/accidents.zip',
encoding='cp1250', sep='\t')
df = df[['Driver_Age_Band', 'Driver_IMD', 'Sex', 'Journey']]
pandas_cat.profile(df=df, dataset_name="Accidents")
With auto_prepare=True (the default), Driver_Age_Band is converted before
profiling to an ordered pandas.Categorical whose categories are sorted by
their extracted numeric values.
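Sorting by extracted numeric values can be illustrated with a small sort key in plain Python. This is a minimal sketch of the idea under stated assumptions, not the package's algorithm: natural_key is a hypothetical helper, and treating "Under"/"<" as sorting before and "Over"/"+" as sorting after a plain range with the same number is an assumption made for this example.

```python
import re


def natural_key(label):
    """Sort key: first number found in the label, then an open-end marker."""
    match = re.search(r"\d+", label)
    # Labels without any digits are pushed to the end.
    number = int(match.group()) if match else float("inf")
    lowered = label.lower()
    if lowered.startswith(("under", "<")):
        rank = -1   # "Under 6" sorts before "6–10"
    elif lowered.startswith("over") or label.rstrip().endswith("+"):
        rank = 1    # "76+" sorts after any plain band starting at 76
    else:
        rank = 0
    return (number, rank)


bands = ["16–20", "21–25", "26–35", "36–45", "46–55",
         "56–65", "66–75", "6–10", "76+", "Under 6"]

alphabetical = sorted(bands)               # plain string comparison
natural = sorted(bands, key=natural_key)   # numeric-aware comparison

print(alphabetical)
print(natural)
```

A list ordered this way is the kind of sequence that could be passed as the categories of an ordered categorical column, so that charts follow the intended order.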
Report templates
# Default — static HTML with SVG charts
pandas_cat.profile(df=df, dataset_name="Accidents")
# Modern — same content, refreshed visual style
pandas_cat.profile(df=df, dataset_name="Accidents", template="modern")
# Interactive — three correlation metrics (Cramér's V, Spearman, Theil's U),
# per-category crosstabs, raw data driven
pandas_cat.profile(df=df, dataset_name="Accidents", template="interactive")
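Of the three metrics named for the interactive template, Cramér's V is the easiest to show in a few lines. The sketch below uses the textbook formula V = sqrt(chi2 / (n * (min(r, c) - 1))) in plain Python; cramers_v is a hypothetical function, and whether the report applies a bias correction or other adjustments is not stated in this README.

```python
from collections import Counter
from math import sqrt


def cramers_v(xs, ys):
    """Uncorrected Cramér's V between two equally long categorical sequences."""
    if len(xs) != len(ys):
        raise ValueError("both sequences must have the same length")
    n = len(xs)
    row_totals = Counter(xs)
    col_totals = Counter(ys)
    observed = Counter(zip(xs, ys))  # contingency table as a sparse counter
    k = min(len(row_totals), len(col_totals)) - 1
    if n == 0 or k == 0:
        return 0.0  # association is undefined for a constant column; report 0
    chi2 = 0.0
    for r, r_total in row_totals.items():
        for c, c_total in col_totals.items():
            expected = r_total * c_total / n
            chi2 += (observed[(r, c)] - expected) ** 2 / expected
    return sqrt(chi2 / (n * k))


sex = ["M", "M", "F", "F"]
perfectly_linked = ["car", "car", "bike", "bike"]
unrelated = ["car", "bike", "car", "bike"]

print(cramers_v(sex, perfectly_linked))  # perfect association
print(cramers_v(sex, unrelated))         # no association
```

Values range from 0 (no association) to 1 (one column fully determines the other), which makes the metric convenient for a categorical correlation heatmap.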
All options
pandas_cat.profile(
df=df,
dataset_name="Accidents",
out_html="report.html", # written to report/<out_html>
opts={
"auto_prepare": True, # convert numeric-string categories
"cat_limit": 20, # max categories before column is excluded
"na_values": ["MyNA"], # extra missing-value sentinels
"na_ignore": ["NA"], # built-in sentinels to keep as-is
"keep_default_na": True, # False = use only na_values
}
)
Data preparation only
# Built-in engine (default) — preserves high-cardinality numeric columns as continuous
df = pandas_cat.prepare(df)
# CleverMiner engine — opt in explicitly
df = pandas_cat.prepare(df, auto_data_prep="CLM")
Missing-value handling only
df, detected, counts = pandas_cat.handle_missing_values(
df,
na_values=["TBD"],
na_ignore=["-"],
)
75+ built-in sentinel strings are detected by default (NA, N/A, NULL, None, Unknown, Missing, …).
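How na_values, na_ignore, and keep_default_na might combine can be sketched in plain Python. Everything here is illustrative: replace_sentinels and the shortened DEFAULT_SENTINELS set are hypothetical, exact matching after stripping whitespace is an assumption, and None stands in for pd.NA. The real built-in list and matching rules live in pandas-cat itself.

```python
from collections import Counter

# Illustrative subset only; the package documents 75+ built-in sentinels.
DEFAULT_SENTINELS = {"NA", "N/A", "NULL", "None", "Unknown", "Missing", "–"}


def replace_sentinels(values, na_values=(), na_ignore=(), keep_default_na=True):
    """Replace sentinel strings with None and count what was replaced."""
    sentinels = set(na_values)
    if keep_default_na:
        # Built-in sentinels apply unless explicitly ignored.
        sentinels |= DEFAULT_SENTINELS - set(na_ignore)
    counts = Counter()
    cleaned = []
    for value in values:
        if isinstance(value, str) and value.strip() in sentinels:
            counts[value.strip()] += 1
            cleaned.append(None)
        else:
            cleaned.append(value)
    return cleaned, dict(counts)


raw = ["Male", "Unknown", "TBD", "NA", "Female", "N/A"]
cleaned, counts = replace_sentinels(raw, na_values=["TBD"], na_ignore=["NA"])

print(cleaned)
print(counts)
```

In this sketch "NA" survives because it is listed in na_ignore, while the custom "TBD" marker is treated as missing alongside the built-in sentinels.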
Sample reports
- Short report — default template
- Full report — all columns, continuous variables
- Modern template
- Interactive report
Credits
Petr Masa — base package, data preparation, maintaining the package and development
Jan Nejedly — interactive report (first version), missing-value handling (first version)