A library for tracking pandas operations and generating Mermaid flowcharts

pandas-flowchart 📊

A Python library that integrates with pandas to automatically track data transformation operations and render them as visual flowcharts in Mermaid or HTML.

Features

Automatic Operation Tracking: Intercepts common pandas operations (merge, filter, assign, drop, groupby, etc.)
Structured Metadata Recording: Captures operation details, row counts, and custom statistics
Visual Flowcharts: Generates Mermaid diagrams with color-coded operation boxes
Variable Monitoring: Tracks unique counts and other statistics for selected variables across the pipeline
Mini-Histograms: ASCII sparkline histograms for numeric variables
Multiple Output Formats: Export to Markdown, HTML, or raw Mermaid syntax
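
To make the mini-histogram idea concrete, here is a minimal standard-library sketch that buckets values into equal-width bins and draws one block character per bin. The function name `sparkline_histogram` and the eight-level block scale are assumptions for illustration; the library's own rendering may differ.

```python
# Illustrative only: not pandas-flowchart's implementation.
BLOCKS = "▁▂▃▄▅▆▇█"

def sparkline_histogram(values, bins=8):
    """Bucket numeric values into equal-width bins and draw one block per bin."""
    if not values:
        return ""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 when all values are equal
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # clamp the maximum into the last bin
        counts[idx] += 1
    peak = max(counts)
    # Scale each count to one of the eight block heights.
    return "".join(BLOCKS[round(c / peak * (len(BLOCKS) - 1))] for c in counts)

ages = [18, 22, 25, 31, 34, 35, 36, 41, 47, 52, 58, 63, 70, 85]
print(sparkline_histogram(ages))
```

Empty bins are drawn with the lowest block, so the string always has exactly `bins` characters.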

Example: Healthcare Data Pipeline

This example tracks a realistic analytics workflow for a medical provider: loading patient/exam records, merging them on patient_id, filtering for active adults, deriving age groups, deduplicating visits, and then branching off into staged summaries before rendering the final Mermaid diagram shown below.

[Figure: Healthcare Data Pipeline flowchart]

Installation

pip install pandas-flowchart

Or install from source:

git clone https://github.com/yourusername/pandas-flowchart.git
cd pandas-flowchart
pip install -e .

Quick Start

import pandas as pd
import pandas_flow

# Setup the tracker with variables to monitor
flow = pandas_flow.setup(
    track_row_count=True,
    track_variables={
        "patient_id": "n_unique",
        "exam_date": "n_unique",
    },
    stats_variable="age",
    stats_types=["min", "max", "mean", "std", "histogram"],
)

# Your pandas operations are automatically tracked
patients = pd.read_csv("patients.csv")
exams = pd.read_csv("exams.csv")

# Merge datasets
combined = patients.merge(exams, on="patient_id", how="inner")

# Filter adults
adults = combined.query("age >= 18")

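# Add calculated columns
# include_lowest=True keeps age 18 in the first bin; by default pd.cut
# excludes the left edge, so 18-year-olds would get a missing age_group
adults = adults.assign(
    age_group=lambda x: pd.cut(x["age"], bins=[18, 30, 50, 70, 100], include_lowest=True)
)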

# Remove duplicates
clean_data = adults.drop_duplicates(subset=["patient_id", "exam_date"])

# Generate the flowchart
flow.render("pipeline_flowchart.md")

This generates a Mermaid flowchart that shows, for each operation:

Operation type and description
Input/output row counts
Tracked variable statistics
Distribution histograms

Detailed Usage

Setting Up the Tracker

import pandas_flow

flow = pandas_flow.setup(
    # Track row counts after each operation
    track_row_count=True,
  
    # Variables to monitor (name -> stat_type)
    # stat_type can be: "n_total", "n_non_null", "n_unique"
    track_variables={
        "user_id": "n_unique",
        "transaction_date": "n_unique",
        "product_category": "n_unique",
    },
  
    # Variable for detailed statistics
    stats_variable="amount",
  
    # Which stats to compute for stats_variable
    stats_types=["min", "max", "mean", "std", "top3_freq", "histogram"],
  
    # Auto-intercept pandas operations (default: True)
    auto_intercept=True,
  
    # Visual theme: "default", "dark", or "light"
    theme="default",
)
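
The three `stat_type` names are not defined further in this README. The sketch below shows one plausible reading, computed over a plain list: `n_total` counts every row, `n_non_null` skips missing values, and `n_unique` counts distinct non-missing values (as pandas' `nunique` does by default). The helper names `is_missing` and `column_stat` are hypothetical.

```python
import math

def is_missing(value):
    """Treat None and float NaN as missing, roughly as pandas does."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def column_stat(values, stat_type):
    """Compute one tracked statistic for a column given as a plain list."""
    present = [v for v in values if not is_missing(v)]
    if stat_type == "n_total":
        return len(values)           # all rows, missing or not
    if stat_type == "n_non_null":
        return len(present)          # rows with a value
    if stat_type == "n_unique":
        return len(set(present))     # distinct non-missing values
    raise ValueError(f"unknown stat_type: {stat_type!r}")

user_ids = ["u1", "u2", "u2", None, "u3"]
for stat in ("n_total", "n_non_null", "n_unique"):
    print(stat, column_stat(user_ids, stat))
```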

Tracked Operations

The library automatically intercepts these pandas operations:

Category	Operations
Data Loading	`read_csv`, `read_excel`, `read_parquet`, `read_json`
Filtering	`query`, `loc`, `iloc`, boolean indexing
Joins	`merge`, `join`
Column Operations	`assign`, `drop`, `rename`
Concatenation	`concat`
Groupby	`groupby` + `agg`/`transform`
Reshape	`pivot`, `pivot_table`, `melt`
Cleaning	`drop_duplicates`, `dropna`, `fillna`
Sorting	`sort_values`, `sort_index`
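
This README does not describe how interception works internally. A common technique for this kind of tracking is to replace a method with a wrapper that calls the original and records what happened. The sketch below shows that technique on a toy `Table` class rather than on pandas itself; `Table`, `intercept`, and the `events` list are all hypothetical names.

```python
import functools

events = []  # hypothetical event log; the real tracker's storage is not shown here

class Table:
    """Tiny stand-in for a DataFrame: a list of row dicts."""
    def __init__(self, rows):
        self.rows = list(rows)

    def drop_duplicates(self):
        seen, unique = set(), []
        for row in self.rows:
            key = tuple(sorted(row.items()))
            if key not in seen:
                seen.add(key)
                unique.append(row)
        return Table(unique)

def intercept(cls, method_name):
    """Replace cls.method_name with a wrapper that logs row counts."""
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        result = original(self, *args, **kwargs)
        events.append(
            {"op": method_name, "rows_in": len(self.rows), "rows_out": len(result.rows)}
        )
        return result

    setattr(cls, method_name, wrapper)
    return original  # keep a handle so the patch can be undone later

intercept(Table, "drop_duplicates")
t = Table([{"id": 1}, {"id": 1}, {"id": 2}])
deduped = t.drop_duplicates()
print(events)
```

Because the wrapper is installed on the class, calling code stays unchanged, which is the property the library describes as non-invasive.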

Manual Tracking

If an operation is not captured automatically (boolean indexing, for example, may be missed), record it manually:

from pandas_flow.interceptors import track_filter

# Before filtering
original_df = df.copy()

# Filter with boolean indexing
df = df[df["status"] == "active"]

# Manually track the operation
track_filter(flow, original_df, df, 'status == "active"')

Or use the decorator pattern:

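# OperationType must be imported before use; this README does not show its
# import path (events.py, which holds the event types, is a plausible location).
# custom_transform stands for your own function that takes and returns a DataFrame.
@flow.track("Custom Processing", OperationType.CUSTOM)
def process_data(df):
    # Your custom logic
    return df.pipe(custom_transform)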

result = process_data(input_df)
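
For readers unfamiliar with parameterized decorators, the following sketch shows how a `track(name, type)` method could wrap a function and record row counts before and after the call. `MiniTracker` is a hypothetical stand-in written for this example, it operates on lists of dicts instead of DataFrames, and it is not the library's `FlowTracker`.

```python
import functools

class MiniTracker:
    """Hypothetical stand-in for FlowTracker, just enough to show the decorator."""
    def __init__(self):
        self.events = []

    def track(self, name, op_type="custom"):
        def decorator(func):
            @functools.wraps(func)  # preserve the wrapped function's name and docstring
            def wrapper(rows, *args, **kwargs):
                result = func(rows, *args, **kwargs)
                self.events.append(
                    {"name": name, "type": op_type,
                     "rows_in": len(rows), "rows_out": len(result)}
                )
                return result
            return wrapper
        return decorator

mini = MiniTracker()

@mini.track("Keep active rows")
def keep_active(rows):
    return [r for r in rows if r["status"] == "active"]

kept = keep_active([{"status": "active"}, {"status": "closed"}, {"status": "active"}])
print(mini.events)
```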

Generating Output

# Markdown with Mermaid code block
flow.render("pipeline.md")

# Standalone HTML page (interactive)
flow.render("pipeline.html")

# Raw Mermaid syntax
flow.render("pipeline.mmd")

# Get Mermaid code as string
mermaid_code = flow.get_mermaid(
    title="My Data Pipeline",
    direction="TB",  # TB, LR, BT, RL
    include_legend=False,
    include_stats=True,
)
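
To give a sense of what raw Mermaid syntax looks like, here is a minimal sketch that builds a linear flowchart string from a list of steps. The function `to_mermaid` and its `n0`, `n1`, ... node naming are assumptions for illustration and are unrelated to the library's renderer; labels containing double quotes would need escaping.

```python
def to_mermaid(steps, direction="TB"):
    """Build a linear Mermaid flowchart from (label, row_count) pairs."""
    if direction not in {"TB", "LR", "BT", "RL"}:
        raise ValueError(f"unsupported direction: {direction}")
    lines = [f"flowchart {direction}"]
    for i, (label, rows) in enumerate(steps):
        # Each node shows the operation label and its output row count.
        lines.append(f'    n{i}["{label}<br/>{rows} rows"]')
        if i > 0:
            lines.append(f"    n{i - 1} --> n{i}")  # connect to the previous step
    return "\n".join(lines)

diagram = to_mermaid([("read_csv", 1000), ("query", 750), ("drop_duplicates", 740)], "LR")
print(diagram)
```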

Context Manager Usage

with pandas_flow.setup(track_variables={"id": "n_unique"}) as flow:
    df = pd.read_csv("data.csv")
    df = df.query("active == True")
    df = df.drop_duplicates()
  
    flow.render("output.md")
# Interceptors are automatically removed after the context
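
The cleanup behaviour noted in the last comment can be pictured with a small context manager that swaps a method in on entry and restores the original on exit, even if the body raises. This is a sketch of the general pattern using a toy `Frame` class; `patched` and `logging_head` are hypothetical names, and the library's own mechanism is not documented here.

```python
from contextlib import contextmanager

class Frame:
    """Toy stand-in for a DataFrame."""
    def head(self):
        return "first rows"

@contextmanager
def patched(cls, name, replacement):
    """Temporarily swap cls.name, restoring the original even on error."""
    original = getattr(cls, name)
    setattr(cls, name, replacement)
    try:
        yield
    finally:
        setattr(cls, name, original)

calls = []

def logging_head(self):
    calls.append("head")
    return "first rows"

with patched(Frame, "head", logging_head):
    Frame().head()   # logged while the patch is active

Frame().head()       # not logged: the original method is back
print(calls)
```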

Output Example

Box Contents

Each operation box includes:

Operation name (bold header)
Description (what the operation does)
Input DataFrames with source filename and dimensions
Output DataFrame dimensions
Row change indicator (↑ increase / ↓ decrease with percentage)
Tracked variable statistics
Distribution histogram (ASCII sparkline or embedded image with x-axis)
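
As a worked example of the row change indicator, the sketch below computes the arrow and percentage from input and output row counts. The exact formatting (one decimal place, `=` for no change, `n/a` for an empty input) is an assumption, not the library's documented output.

```python
def row_change(rows_in, rows_out):
    """Format the change in row count as an arrow plus a percentage."""
    if rows_in == 0:
        return "n/a"  # a percentage is undefined without a baseline
    if rows_out == rows_in:
        return "= 0.0%"
    arrow = "↑" if rows_out > rows_in else "↓"
    pct = abs(rows_out - rows_in) / rows_in * 100
    return f"{arrow} {pct:.1f}%"

print(row_change(1000, 750))  # a filter that removes a quarter of the rows
print(row_change(200, 250))   # a join that adds rows
```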

Color Scheme

Operations are color-coded by type, using pastel (less saturated) colors:

Operation Type	Color
Data Loading	Soft Gray (#9ca3af)
Filtering	Soft Blue (#7cb3d9)
Joins	Soft Green (#6dc993)
Column Creation	Soft Orange (#f0a86e)
Drop Operations	Soft Red (#e8918a)
Groupby	Soft Purple (#b99ad1)
Concatenation	Soft Teal (#6bc4ce)
Reshape	Soft Pink (#f5a3c7)
Sorting	Soft Yellow (#f5d76e)
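
In Mermaid, node colors like these are typically applied with `classDef` statements. The sketch below turns the palette from the table into `classDef` lines; the class names (`data_loading`, `filtering`, ...) and the stroke and text colors are assumptions chosen for this example, and only the fill colors come from the table.

```python
# Fill colors copied from the table above; keys are hypothetical class names.
PALETTE = {
    "data_loading": "#9ca3af",
    "filtering": "#7cb3d9",
    "joins": "#6dc993",
    "column_creation": "#f0a86e",
    "drop_operations": "#e8918a",
    "groupby": "#b99ad1",
    "concatenation": "#6bc4ce",
    "reshape": "#f5a3c7",
    "sorting": "#f5d76e",
}

def class_defs(palette):
    """Return one Mermaid classDef line per operation type."""
    return [
        f"    classDef {name} fill:{color},stroke:#555,color:#111"
        for name, color in palette.items()
    ]

for line in class_defs(PALETTE):
    print(line)
```

A node is then assigned a class with a line such as `class n1 filtering` inside the same diagram.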

API Reference

`pandas_flow.setup()`

Main entry point to create and activate a FlowTracker.

Parameters:

track_row_count (bool): Track row counts after each operation. Default: True
track_variables (dict): Map of variable names to stat types. Default: None
stats_variable (str): Variable for detailed statistics. Default: None
stats_types (list): Statistics to compute. Default: ["min", "max", "mean", "std", "top3_freq", "histogram"]
auto_intercept (bool): Auto-intercept pandas operations. Default: True
theme (str): Color theme. Options: "default", "dark", "light"

Returns: FlowTracker instance
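
The numeric entries in `stats_types` map naturally onto standard summary statistics. The sketch below computes them with the standard library, assuming `std` means the sample standard deviation (the pandas default) and `top3_freq` means the three most frequent values with their counts. `describe` is a hypothetical helper, and `histogram` is left out because it is a rendering rather than a single value.

```python
import statistics
from collections import Counter

def describe(values, stats_types):
    """Compute the requested summary statistics for a list of numbers."""
    calculators = {
        "min": min,
        "max": max,
        "mean": statistics.fmean,
        "std": statistics.stdev,  # sample standard deviation (divides by n - 1)
        "top3_freq": lambda v: Counter(v).most_common(3),
    }
    unknown = [name for name in stats_types if name not in calculators]
    if unknown:
        raise ValueError(f"unsupported stats: {unknown}")
    return {name: calculators[name](values) for name in stats_types}

amounts = [10, 20, 20, 30]
summary = describe(amounts, ["min", "max", "mean", "std", "top3_freq"])
print(summary)
```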

`FlowTracker.render()`

Render the flowchart to a file.

Parameters:

output_path (str): Output file path (.md, .html, or .mmd)
title (str): Diagram title. Default: "Data Flow Pipeline"
direction (str): Flow direction. Options: "TB", "LR", "BT", "RL"
include_legend (bool): Include color legend. Default: False
include_stats (bool): Include statistics in boxes. Default: True
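
Since the output format follows from the file extension, the dispatch can be sketched as below. This assumes `.md` output wraps the diagram in a fenced `mermaid` block and `.html` output places it in a `<pre class="mermaid">` element; a real HTML page would also need to load the Mermaid JavaScript library. `wrap_for_path` is a hypothetical helper, not part of the library's API.

```python
from pathlib import Path

FENCE = "`" * 3  # Markdown code fence, built programmatically

def wrap_for_path(mermaid_code, output_path):
    """Wrap Mermaid source according to the output file's extension."""
    suffix = Path(output_path).suffix.lower()
    if suffix == ".mmd":
        return mermaid_code + "\n"                      # raw Mermaid syntax
    if suffix == ".md":
        return f"{FENCE}mermaid\n{mermaid_code}\n{FENCE}\n"  # fenced block
    if suffix == ".html":
        return f'<pre class="mermaid">\n{mermaid_code}\n</pre>\n'
    raise ValueError(f"unsupported extension: {suffix!r}")

print(wrap_for_path("flowchart TB\n    a --> b", "pipeline.md"))
```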

`FlowTracker.get_mermaid()`

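Return the Mermaid code as a string without saving it to a file. Accepts title, direction, include_legend, and include_stats, as in the Generating Output example.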

`FlowTracker.summary()`

Get a text summary of all recorded operations.

`FlowTracker.clear()`

Clear all recorded events.

Architecture

pandas_flow/
├── __init__.py          # Public API exports
├── tracker.py           # FlowTracker central class
├── events.py            # Event types and metadata classes
├── interceptors.py      # Pandas operation interceptors
├── stats.py             # Statistics calculator
├── visualization.py     # ASCII art utilities
└── mermaid_renderer.py  # Mermaid diagram generator

Design Principles

Non-invasive: Intercepts operations without modifying your code
Configurable: Track only what you need
Extensible: Easy to add custom operations
Performant: Minimal overhead during data processing

Advanced Features

Multiple DataFrames

The library correctly handles pipelines with multiple DataFrames:

df1 = pd.read_csv("sales.csv")
df2 = pd.read_csv("products.csv")
df3 = pd.read_csv("customers.csv")

# Multiple merges are tracked with proper connections
result = df1.merge(df2, on="product_id").merge(df3, on="customer_id")

Chained Operations

Method chaining is fully supported:

result = (
    pd.read_csv("data.csv")
    .query("status == 'active'")
    .drop_duplicates(subset=["id"])
    .assign(processed=True)
    .sort_values("date")
)

Export to PNG

For PNG export, install the optional dependency:

pip install "pandas-flowchart[png]"

Then use the Mermaid CLI or a Mermaid renderer service.
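
As one way to drive the Mermaid CLI from Python, the sketch below builds an `mmdc` command for a `.mmd` file and only executes it on request. It assumes `mmdc` (from the `@mermaid-js/mermaid-cli` npm package) is installed and on `PATH`; `export_png` is a hypothetical helper, not part of pandas-flowchart.

```python
import shutil
import subprocess

def export_png(source, target, run=False):
    """Build (and optionally run) the Mermaid CLI command to convert a diagram."""
    cmd = ["mmdc", "-i", source, "-o", target]
    if run:
        if shutil.which("mmdc") is None:
            raise FileNotFoundError("mmdc is not on PATH")
        subprocess.run(cmd, check=True)  # raises if the conversion fails
    return cmd

# Dry run: show the command without executing it.
print(" ".join(export_png("pipeline.mmd", "pipeline.png")))
```

Pass `run=True` once `pipeline.mmd` has been written with `flow.render("pipeline.mmd")`.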

Contributing

Contributions are welcome! Please feel free to submit issues and pull requests.

License

MIT License - see LICENSE file for details.

Release history

Version	Release date
0.3.1	Dec 22, 2025
0.2.2	Dec 12, 2025
0.2.1	Dec 12, 2025
0.2.0	Dec 12, 2025
0.1.4	Dec 12, 2025
0.1.3	Dec 12, 2025
0.1.2 (this version)	Dec 12, 2025
0.1.1	Dec 12, 2025
0.1.0	Dec 12, 2025

Download files

Source Distribution

pandas_flowchart-0.1.2.tar.gz (806.0 kB)

Uploaded Dec 12, 2025 Source

Built Distribution

pandas_flowchart-0.1.2-py3-none-any.whl (681.4 kB)

Uploaded Dec 12, 2025 Python 3

File details

Details for the file pandas_flowchart-0.1.2.tar.gz.

File metadata

Download URL: pandas_flowchart-0.1.2.tar.gz
Upload date: Dec 12, 2025
Size: 806.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.11

File hashes

Hashes for pandas_flowchart-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`5e1a68e225f62d87c07eb1fb230b003f1d9b25cb79ba46fcd82151eb4345df76`
MD5	`da5231bb93f5e707c50cc905b9b52006`
BLAKE2b-256	`cd1704dad7b2f180e26abbe436561e8a2b1624be5eea9c3b38e48c25e84d6731`

File details

Details for the file pandas_flowchart-0.1.2-py3-none-any.whl.

File metadata

Download URL: pandas_flowchart-0.1.2-py3-none-any.whl
Upload date: Dec 12, 2025
Size: 681.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.11

File hashes

Hashes for pandas_flowchart-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`5bc6b1858fe8fee7f281d093a6e892e30a572dac501588509832f300de4bc7c1`
MD5	`236ef9fdb00f520e72fd289f52141888`
BLAKE2b-256	`fddc6ce9e50eddab880112856e98fc835ca5e6ff79d8324d1c3ba66c85382dcf`
