A library for tracking pandas operations and generating Mermaid flowcharts

pandas-flowchart 📊

A Python library that integrates with pandas to automatically track data transformation operations and generate visual flowcharts using Mermaid diagrams.

Features

  • Automatic Operation Tracking: Intercepts common pandas operations (merge, filter, assign, drop, groupby, etc.)
  • Structured Metadata Recording: Captures operation details, row counts, and custom statistics
  • Visual Flowcharts: Generates Mermaid diagrams with color-coded operation boxes
  • Variable Monitoring: Track specific variables' unique counts and statistics across the pipeline
  • Mini-Histograms: ASCII sparkline histograms for numeric variables
  • Multiple Output Formats: Export to Markdown, HTML, or raw Mermaid syntax
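To make the mini-histogram idea concrete, here is a minimal sketch of how an ASCII sparkline can be built from a numeric column. It is not the library's implementation; the function name `sparkline_histogram` and the `bins` parameter are assumptions chosen for illustration.

```python
import math

BARS = "▁▂▃▄▅▆▇█"  # eight bar heights, lowest to highest


def sparkline_histogram(values, bins=8):
    """Return a one-line histogram of numeric values (hypothetical helper)."""
    # Drop missing values (None and NaN) before binning
    data = [v for v in values if v is not None and not math.isnan(v)]
    if not data:
        return ""
    lo, hi = min(data), max(data)
    if lo == hi:
        # All values identical: a single full-height bar
        return BARS[-1]
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in data:
        # Clamp so the maximum value falls into the last bin
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    peak = max(counts)
    # Scale each count to one of the eight bar heights
    return "".join(BARS[round(c / peak * (len(BARS) - 1))] for c in counts)


print(sparkline_histogram([0, 0, 0, 0, 10], bins=2))  # █▃
```

Each bar is scaled relative to the fullest bin, so the sparkline shows the shape of the distribution rather than absolute counts.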

Installation

pip install pandas-flowchart

Or install from source:

git clone https://github.com/yourusername/pandas-flowchart.git
cd pandas-flowchart
pip install -e .

Quick Start

import pandas as pd
import pandas_flow

# Setup the tracker with variables to monitor
flow = pandas_flow.setup(
    track_row_count=True,
    track_variables={
        "patient_id": "n_unique",
        "exam_date": "n_unique",
    },
    stats_variable="age",
    stats_types=["min", "max", "mean", "std", "histogram"],
)

# Your pandas operations are automatically tracked
patients = pd.read_csv("patients.csv")
exams = pd.read_csv("exams.csv")

# Merge datasets
combined = patients.merge(exams, on="patient_id", how="inner")

# Filter adults
adults = combined.query("age >= 18")

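# Add calculated columns
# include_lowest=True keeps age 18 in the first bin (pd.cut excludes the left edge by default)
adults = adults.assign(
    age_group=lambda x: pd.cut(x["age"], bins=[18, 30, 50, 70, 100], include_lowest=True)
)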

# Remove duplicates
clean_data = adults.drop_duplicates(subset=["patient_id", "exam_date"])

# Generate the flowchart
flow.render("pipeline_flowchart.md")

This writes a Mermaid flowchart to pipeline_flowchart.md, showing each operation with:

  • Operation type and description
  • Input/output row counts
  • Tracked variable statistics
  • Distribution histograms

Detailed Usage

Setting Up the Tracker

import pandas_flow

flow = pandas_flow.setup(
    # Track row counts after each operation
    track_row_count=True,
    
    # Variables to monitor (name -> stat_type)
    # stat_type can be: "n_total", "n_non_null", "n_unique"
    track_variables={
        "user_id": "n_unique",
        "transaction_date": "n_unique",
        "product_category": "n_unique",
    },
    
    # Variable for detailed statistics
    stats_variable="amount",
    
    # Which stats to compute for stats_variable
    stats_types=["min", "max", "mean", "std", "top3_freq", "histogram"],
    
    # Auto-intercept pandas operations (default: True)
    auto_intercept=True,
    
    # Visual theme: "default", "dark", or "light"
    theme="default",
)
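The three stat_type names can be read as ordinary pandas counts. The sketch below shows one plausible interpretation using plain pandas. The helper `column_stat` is hypothetical, and the library may treat missing values differently.

```python
import pandas as pd


def column_stat(series: pd.Series, stat_type: str) -> int:
    """Compute a count-style statistic for one column (hypothetical helper)."""
    if stat_type == "n_total":
        return len(series)            # all rows, including missing values
    if stat_type == "n_non_null":
        return int(series.count())    # rows whose value is not NaN/None
    if stat_type == "n_unique":
        return int(series.nunique())  # distinct non-missing values
    raise ValueError(f"unknown stat_type: {stat_type!r}")


user_id = pd.Series([1, 2, 2, None, 3])
stats = {name: column_stat(user_id, name)
         for name in ("n_total", "n_non_null", "n_unique")}
print(stats)  # {'n_total': 5, 'n_non_null': 4, 'n_unique': 3}
```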

Tracked Operations

The library automatically intercepts these pandas operations:

| Category | Operations |
|---|---|
| Data Loading | read_csv, read_excel, read_parquet, read_json |
| Filtering | query, loc, iloc, boolean indexing |
| Joins | merge, join |
| Column Operations | assign, drop, rename |
| Concatenation | concat |
| Groupby | groupby + agg/transform |
| Reshape | pivot, pivot_table, melt |
| Cleaning | drop_duplicates, dropna, fillna |
| Sorting | sort_values, sort_index |
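Interception of this kind is usually done by wrapping the pandas method and recording metadata around the real call. The following is a minimal sketch of that technique for DataFrame.query only. The `events` list, the record fields, and `install_query_interceptor` are assumptions for illustration and do not reflect the library's internals.

```python
import functools

import pandas as pd

events = []  # hypothetical event log


def install_query_interceptor():
    """Wrap DataFrame.query so each call appends row counts to `events`."""
    original = pd.DataFrame.query

    @functools.wraps(original)
    def wrapper(self, expr, *args, **kwargs):
        # query resolves @local variables from the caller's stack frame;
        # bump `level` by one so the wrapper's own frame is skipped
        kwargs["level"] = kwargs.get("level", 0) + 1
        result = original(self, expr, *args, **kwargs)
        # With inplace=True, query returns None, so only record DataFrame results
        if isinstance(result, pd.DataFrame):
            events.append({"op": "query", "expr": expr,
                           "rows_in": len(self), "rows_out": len(result)})
        return result

    pd.DataFrame.query = wrapper

    def uninstall():
        pd.DataFrame.query = original

    return uninstall


uninstall = install_query_interceptor()
df = pd.DataFrame({"age": [12, 25, 40, 17]})
adults = df.query("age >= 18")
uninstall()  # restore the original method
```

Returning an `uninstall` callable keeps the patch reversible, which matters because the wrapper affects every DataFrame in the process while it is installed.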

Manual Tracking

If an operation is not picked up automatically (boolean indexing is a common case), record it manually:

from pandas_flow.interceptors import track_filter

# Before filtering
original_df = df.copy()

# Filter with boolean indexing
df = df[df["status"] == "active"]

# Manually track the operation
track_filter(flow, original_df, df, 'status == "active"')

Or use the decorator pattern:

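# OperationType must already be imported; its import path is not shown in this README
@flow.track("Custom Processing", OperationType.CUSTOM)
def process_data(df):
    # custom_transform stands for your own DataFrame -> DataFrame function
    return df.pipe(custom_transform)

result = process_data(input_df)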
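A tracking decorator of this shape can be sketched in a few lines. `MiniTracker` below is a hypothetical stand-in, not the library's FlowTracker. Its `track` method takes only a name and records row counts with len(), which works for DataFrames and plain lists alike.

```python
import functools


class MiniTracker:
    """Hypothetical stand-in for a tracker object."""

    def __init__(self):
        self.events = []

    def track(self, name):
        def decorator(func):
            @functools.wraps(func)
            def wrapper(data, *args, **kwargs):
                result = func(data, *args, **kwargs)
                # Record sizes before and after the wrapped step
                self.events.append(
                    {"name": name, "rows_in": len(data), "rows_out": len(result)}
                )
                return result
            return wrapper
        return decorator


tracker = MiniTracker()


@tracker.track("Keep active rows")
def keep_active(rows):
    return [r for r in rows if r["status"] == "active"]


kept = keep_active([{"status": "active"}, {"status": "closed"}])
print(tracker.events)
```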

Generating Output

# Markdown with Mermaid code block
flow.render("pipeline.md")

# Standalone HTML page (interactive)
flow.render("pipeline.html")

# Raw Mermaid syntax
flow.render("pipeline.mmd")

# Get Mermaid code as string
mermaid_code = flow.get_mermaid(
    title="My Data Pipeline",
    direction="TB",  # TB, LR, BT, RL
    include_legend=False,
    include_stats=True,
)

Context Manager Usage

with pandas_flow.setup(track_variables={"id": "n_unique"}) as flow:
    df = pd.read_csv("data.csv")
    df = df.query("active == True")
    df = df.drop_duplicates()
    
    flow.render("output.md")
# Interceptors are automatically removed after the context
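The cleanup on exit follows the usual patch-and-restore pattern. Here is a generic sketch using contextlib. The names `patched` and `Table` are illustrative assumptions, and the library's own context manager may be implemented differently.

```python
from contextlib import contextmanager


@contextmanager
def patched(owner, name, replacement):
    """Temporarily replace owner.<name>, restoring it even if the body raises."""
    original = getattr(owner, name)
    setattr(owner, name, replacement)
    try:
        yield original
    finally:
        setattr(owner, name, original)


class Table:
    """Toy class standing in for a DataFrame."""

    def describe(self):
        return "original"


with patched(Table, "describe", lambda self: "intercepted"):
    inside = Table().describe()
outside = Table().describe()
print(inside, outside)  # intercepted original
```

The try/finally is the important part: without it, an exception inside the with block would leave the patched method in place.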

Output Example

Mermaid Flowchart

flowchart TB
    op_1[/"<b>Read CSV</b><br/><i>Load data from patients.csv</i><br/>⬅️ 10,000 rows × 5 cols<br/>──────────────────────<br/>🔑 patient_id: 8,500 unique<br/>mean=45.30 [18.0–92.0]<br/>📊 ▁▂▄█▆▃▂▁"/]
    
    op_2[/"<b>Read CSV</b><br/><i>Load data from exams.csv</i><br/>⬅️ 25,000 rows × 8 cols"/]
    
    op_3[["<b>Merge (inner)</b><br/><i>INNER join on patient_id</i><br/>➡️ patients.csv: 10,000×5<br/>➡️ exams.csv: 25,000×8<br/>⬅️ 23,500 rows × 12 cols"]]
    
    op_4{"<b>Query</b><br/><i>Filter: age >= 18</i><br/>⬅️ 22,100 rows × 12 cols<br/>↓ -1,400 (-6.0%)"}
    
    op_1 --> op_3
    op_2 --> op_3
    op_3 -.-> op_4
    
    style op_1 fill:#9ca3af,stroke:#6b7280,color:#000000
    style op_2 fill:#9ca3af,stroke:#6b7280,color:#000000
    style op_3 fill:#6dc993,stroke:#4ca36d,color:#000000
    style op_4 fill:#7cb3d9,stroke:#5691b7,color:#000000
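The diagram above has three parts: node definitions, edges, and style lines. A minimal sketch of assembling such text is shown below. `build_flowchart` and its tuple layout are assumptions for illustration, and it emits only plain rectangular nodes and solid arrows.

```python
def build_flowchart(nodes, edges, direction="TB"):
    """Assemble Mermaid flowchart text (hypothetical helper).

    nodes: list of (node_id, label, fill, stroke) tuples
    edges: list of (source_id, target_id) tuples
    """
    lines = [f"flowchart {direction}"]
    for node_id, label, _fill, _stroke in nodes:
        # A raw double quote would end the label early; Mermaid accepts #quot; instead
        safe = label.replace('"', "#quot;")
        lines.append(f'    {node_id}["{safe}"]')
    for source, target in edges:
        lines.append(f"    {source} --> {target}")
    for node_id, _label, fill, stroke in nodes:
        lines.append(f"    style {node_id} fill:{fill},stroke:{stroke},color:#000000")
    return "\n".join(lines)


diagram = build_flowchart(
    nodes=[
        ("op_1", "<b>Read CSV</b>", "#9ca3af", "#6b7280"),
        ("op_3", "<b>Merge (inner)</b>", "#6dc993", "#4ca36d"),
    ],
    edges=[("op_1", "op_3")],
)
print(diagram)
```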

Box Contents

Each operation box includes:

  • Operation name (bold header)
  • Description (what the operation does)
  • Input DataFrames with source filename and dimensions
  • Output DataFrame dimensions
  • Row change indicator (↑ increase / ↓ decrease with percentage)
  • Tracked variable statistics
  • Distribution histogram (ASCII sparkline or embedded image with x-axis)
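The row change indicator is simple arithmetic on the row counts before and after an operation. The sketch below reproduces the "↓ -1,400 (-6.0%)" label from the Query box in the example. The function name and the formatting of the unchanged and zero-row cases are assumptions.

```python
def row_change(rows_before: int, rows_after: int) -> str:
    """Format a row-count change as an arrow, a delta, and a percentage."""
    delta = rows_after - rows_before
    if delta == 0:
        return "= 0 (0.0%)"  # assumed formatting for an unchanged row count
    arrow = "↑" if delta > 0 else "↓"
    if rows_before == 0:
        # A percentage is undefined when starting from zero rows
        return f"{arrow} {delta:+,}"
    pct = delta / rows_before * 100
    return f"{arrow} {delta:+,} ({pct:+.1f}%)"


print(row_change(23_500, 22_100))  # ↓ -1,400 (-6.0%)
```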

Color Scheme

Operations are color-coded by type (pastel/less saturated colors):

| Operation Type | Color |
|---|---|
| Data Loading | Soft Gray (#9ca3af) |
| Filtering | Soft Blue (#7cb3d9) |
| Joins | Soft Green (#6dc993) |
| Column Creation | Soft Orange (#f0a86e) |
| Drop Operations | Soft Red (#e8918a) |
| Groupby | Soft Purple (#b99ad1) |
| Concatenation | Soft Teal (#6bc4ce) |
| Reshape | Soft Pink (#f5a3c7) |
| Sorting | Soft Yellow (#f5d76e) |

API Reference

pandas_flow.setup()

Main entry point to create and activate a FlowTracker.

Parameters:

  • track_row_count (bool): Track row counts after each operation. Default: True
  • track_variables (dict): Map of variable names to stat types. Default: None
  • stats_variable (str): Variable for detailed statistics. Default: None
  • stats_types (list): Statistics to compute. Default: ["min", "max", "mean", "std", "top3_freq", "histogram"]
  • auto_intercept (bool): Auto-intercept pandas operations. Default: True
  • theme (str): Color theme. Options: "default", "dark", "light"

Returns: FlowTracker instance

FlowTracker.render()

Render the flowchart to a file.

Parameters:

  • output_path (str): Output file path (.md, .html, or .mmd)
  • title (str): Diagram title. Default: "Data Flow Pipeline"
  • direction (str): Flow direction. Options: "TB", "LR", "BT", "RL"
  • include_legend (bool): Include color legend. Default: False
  • include_stats (bool): Include statistics in boxes. Default: True
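Since the output format follows from the file extension, the dispatch can be pictured roughly as below. This is a sketch under assumptions: `wrap_for_extension` is a hypothetical name, the Markdown layout is a guess, and the HTML case is left out because it depends on how the page loads Mermaid.

```python
from pathlib import Path

FENCE = "`" * 3  # built at runtime so the backticks do not clash with surrounding markup


def wrap_for_extension(mermaid_code: str, output_path: str,
                       title: str = "Data Flow Pipeline") -> str:
    """Return the file content for a given output path (hypothetical helper)."""
    suffix = Path(output_path).suffix.lower()
    if suffix == ".mmd":
        # Raw Mermaid syntax, unchanged
        return mermaid_code + "\n"
    if suffix == ".md":
        # Markdown: a heading followed by a fenced mermaid code block
        return f"# {title}\n\n{FENCE}mermaid\n{mermaid_code}\n{FENCE}\n"
    raise ValueError(f"unsupported extension in this sketch: {suffix!r}")


print(wrap_for_extension("flowchart TB\n    a --> b", "pipeline.md"))
```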

FlowTracker.get_mermaid()

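Return the Mermaid code as a string without writing a file. Accepts title, direction, include_legend, and include_stats, as shown under Generating Output.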

FlowTracker.summary()

Get a text summary of all recorded operations.

FlowTracker.clear()

Clear all recorded events.

Architecture

pandas_flow/
├── __init__.py          # Public API exports
├── tracker.py           # FlowTracker central class
├── events.py            # Event types and metadata classes
├── interceptors.py      # Pandas operation interceptors
├── stats.py             # Statistics calculator
├── visualization.py     # ASCII art utilities
└── mermaid_renderer.py  # Mermaid diagram generator

Design Principles

  1. Non-invasive: Intercepts operations without modifying your code
  2. Configurable: Track only what you need
  3. Extensible: Easy to add custom operations
  4. Performant: Minimal overhead during data processing

Advanced Features

Multiple DataFrames

Pipelines that combine several DataFrames are tracked, with each merge connected to its input DataFrames:

df1 = pd.read_csv("sales.csv")
df2 = pd.read_csv("products.csv")
df3 = pd.read_csv("customers.csv")

# Multiple merges are tracked with proper connections
result = df1.merge(df2, on="product_id").merge(df3, on="customer_id")

Chained Operations

Method chaining is fully supported:

result = (
    pd.read_csv("data.csv")
    .query("status == 'active'")
    .drop_duplicates(subset=["id"])
    .assign(processed=True)
    .sort_values("date")
)

Export to PNG

For PNG export, install the optional dependency:

pip install "pandas-flowchart[png]"  # quotes keep shells such as zsh from expanding the brackets

Then render the .mmd output with the Mermaid CLI (mmdc) or a Mermaid rendering service.
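As one way to script the Mermaid CLI step, the sketch below shells out to mmdc (installed separately through npm as @mermaid-js/mermaid-cli) when it is on the PATH. The helper names are hypothetical, and what the [png] extra itself installs is not documented here.

```python
import shutil
import subprocess


def mermaid_cli_command(source: str, target: str) -> list[str]:
    """Build the mmdc invocation: -i is the input file, -o the output file."""
    return ["mmdc", "-i", source, "-o", target]


def export_png(source: str, target: str) -> bool:
    """Render a .mmd file to PNG; return False if mmdc is not installed."""
    if shutil.which("mmdc") is None:
        return False
    subprocess.run(mermaid_cli_command(source, target), check=True)
    return True


print(mermaid_cli_command("pipeline.mmd", "pipeline.png"))
```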

Contributing

Contributions are welcome! Please feel free to submit issues and pull requests.

License

MIT License - see LICENSE file for details.
