Advanced data pipeline debugging and profiling tools for Python

Project description

DataProbe

DataProbe is a comprehensive Python toolkit for debugging, profiling, and optimizing data pipelines. It provides powerful tools to track data lineage, identify bottlenecks, monitor memory usage, and visualize pipeline execution flow with enterprise-grade visualizations.

🎨 NEW: Enterprise-Grade Visualizations

DataProbe v1.0.0 introduces comprehensive pipeline debugging capabilities with professional-quality visualizations, intelligent optimization recommendations, advanced memory profiling, data lineage tracking, and enterprise-grade reporting.

Dashboard Features

๐Ÿข Enterprise Dashboard

  • KPI Panels: Real-time success rates, duration, memory usage
  • Pipeline Flowchart: Interactive operation flow with status indicators
  • Performance Analytics: Memory usage timelines with peak detection
  • Data Insights: Comprehensive lineage and transformation tracking
# Generate enterprise dashboard
debugger.visualize_pipeline()

๐ŸŒ 3D Pipeline Network

  • 3D Visualization: Interactive network showing operation relationships
  • Performance Mapping: Z-axis represents operation duration
  • Status Color-coding: Visual error and bottleneck identification
# Create 3D network visualization
debugger.create_3d_pipeline_visualization()

📊 Executive Reports

  • Multi-page Reports: Professional stakeholder-ready documentation
  • Performance Trends: Dual-axis charts showing duration and memory patterns
  • Optimization Recommendations: AI-powered suggestions for improvements
  • Data Quality Metrics: Comprehensive pipeline health scoring
# Generate executive report
debugger.generate_executive_report()

Color-Coded Status System

  • 🟢 Success: Operations completed without issues
  • 🟡 Warning: Performance bottlenecks detected
  • 🔴 Error: Failed operations requiring attention
  • 🟦 Info: Data flow and transformation indicators

🚀 Features

PipelineDebugger

  • ๐Ÿ” Operation Tracking : Automatically track execution time, memory usage, and data shapes for each operation
  • ๐Ÿ“Š Enterprise-Grade Visualizations : Professional dashboards, 3D networks, and executive reports
  • ๐Ÿ’พ Memory Profiling : Monitor memory usage and identify memory-intensive operations
  • ๐Ÿ”— Data Lineage : Track data transformations and column changes throughout the pipeline
  • โš ๏ธ Bottleneck Detection : Automatically identify slow operations and memory peaks
  • ๐Ÿ“ˆ Performance Reports : Generate comprehensive debugging reports with optimization suggestions
  • ๐ŸŽฏ Error Tracking : Capture and track errors with full traceback information
  • ๐ŸŒณ Nested Operations : Support for tracking nested function calls and their relationships
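
The decorator-based tracking described above, including nested operations, can be sketched in a few lines of plain Python. This is a simplified illustration of the general technique, not DataProbe's actual implementation:

```python
import time
from functools import wraps

class SimpleTracker:
    """Toy operation tracker: records (name, nesting depth, duration)."""
    def __init__(self):
        self.records = []
        self._depth = 0

    def track_operation(self, name):
        def decorator(func):
            @wraps(func)
            def wrapper(*args, **kwargs):
                self._depth += 1
                start = time.perf_counter()
                try:
                    return func(*args, **kwargs)
                finally:
                    # Inner operations finish first, so they are recorded first
                    self.records.append(
                        (name, self._depth, time.perf_counter() - start))
                    self._depth -= 1
            return wrapper
        return decorator

tracker = SimpleTracker()

@tracker.track_operation("outer")
def outer():
    return inner() + 1

@tracker.track_operation("inner")
def inner():
    return 41

outer()
# tracker.records now holds ("inner", 2, ...) then ("outer", 1, ...)
```

The nesting depth is maintained by a simple counter that is incremented on entry and decremented on exit, which is enough to reconstruct the parent-child relationships between tracked calls.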

📦 Installation

pip install dataprobe

For development installation:

git clone https://github.com/santhoshkrishnan30/dataprobe.git
cd dataprobe
pip install -e ".[dev]"

🎯 Quick Start

Basic Usage with Enhanced Visualizations

from dataprobe import PipelineDebugger
import pandas as pd

# Initialize the debugger with enhanced features
debugger = PipelineDebugger(
    name="My_ETL_Pipeline",
    track_memory=True,
    track_lineage=True
)

# Use decorators to track operations
@debugger.track_operation("Load Data")
def load_data(file_path):
    return pd.read_csv(file_path)

@debugger.track_operation("Transform Data")
def transform_data(df):
    df['new_column'] = df['value'] * 2
    return df

# Run your pipeline
df = load_data("data.csv")
df = transform_data(df)

# Generate enterprise-grade visualizations
debugger.visualize_pipeline()              # Enterprise dashboard
debugger.create_3d_pipeline_visualization() # 3D network view  
debugger.generate_executive_report()       # Executive report

# Get AI-powered optimization suggestions
suggestions = debugger.suggest_optimizations()
for suggestion in suggestions:
    print(f"💡 {suggestion['suggestion']}")

# Print summary and reports
debugger.print_summary()
report = debugger.generate_report()

Memory Profiling

import numpy as np

@debugger.profile_memory
def memory_intensive_operation():
    large_df = pd.DataFrame(np.random.randn(1000000, 50))
    result = large_df.groupby(large_df.index % 1000).mean()
    return result
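
Memory profiling of this kind can be approximated with the standard library's tracemalloc module. The decorator below is a minimal sketch of the underlying idea, not DataProbe's implementation:

```python
import tracemalloc
from functools import wraps

def profile_memory(func):
    """Record the peak memory allocated while func runs (simplified sketch)."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        tracemalloc.start()
        try:
            result = func(*args, **kwargs)
        finally:
            _, peak = tracemalloc.get_traced_memory()
            tracemalloc.stop()
        wrapper.last_peak_mb = peak / 1024 / 1024
        return result
    return wrapper

@profile_memory
def build_list():
    return list(range(1_000_000))

build_list()
print(f"peak: {build_list.last_peak_mb:.1f} MB")
```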

DataFrame Analysis

# Analyze DataFrames for potential issues
debugger.analyze_dataframe(df, name="Sales Data")
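
The kinds of issues such an analysis can surface are easy to sketch with plain pandas. The checks below (nulls, duplicate rows, memory footprint) are illustrative only and not DataProbe's actual rule set:

```python
import pandas as pd

def quick_dataframe_checks(df: pd.DataFrame) -> dict:
    """A few illustrative DataFrame health checks."""
    return {
        "rows": len(df),
        "null_counts": df.isna().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
        "memory_mb": df.memory_usage(deep=True).sum() / 1024 / 1024,
    }

df = pd.DataFrame({"value": [1.0, 2.0, None, 2.0], "id": [1, 2, 3, 2]})
report = quick_dataframe_checks(df)
print(report)  # one null in 'value', one fully duplicated row
```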

๐Ÿ“Š Example Output

Enterprise Dashboard

Professional KPI dashboard with real-time metrics, pipeline flowchart, memory analytics, and performance insights.

Pipeline Summary

Pipeline Summary: My_ETL_Pipeline
├── Execution Statistics
│   ├── Total Operations: 5
│   ├── Total Duration: 2.34s
│   └── Total Memory Used: 125.6MB
├── Bottlenecks (1)
│   └── Transform Data: 1.52s
└── Memory Peaks (1)
    └── Load Large Dataset: +85.3MB

Optimization Suggestions

💡 OPTIMIZATION RECOMMENDATIONS:

1. [PERFORMANCE] Transform Data
   Issue: Operation took 1.52s
   💡 Consider optimizing this operation or parallelizing if possible

2. [MEMORY] Load Large Dataset
   Issue: High memory usage: +85.3MB
   💡 Consider processing data in chunks or optimizing memory usage
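
The chunking recommendation above can be applied directly with pandas' built-in chunksize option, which streams a file in fixed-size pieces instead of loading it all at once. A minimal, self-contained sketch:

```python
import io
import pandas as pd

# Stand-in for a large CSV file on disk
csv_data = io.StringIO("value\n" + "\n".join(str(i) for i in range(10)))

# Process the data in chunks of 4 rows at a time
total = 0
for chunk in pd.read_csv(csv_data, chunksize=4):
    total += chunk["value"].sum()

print(total)  # 45
```

Each chunk is an ordinary DataFrame, so existing per-batch logic usually works unchanged; only the peak memory drops.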

🔧 Advanced Features

Multiple Visualization Options

# Enterprise dashboard - Professional KPI dashboard
debugger.visualize_pipeline()

# 3D network visualization - Interactive operation relationships  
debugger.create_3d_pipeline_visualization()

# Executive report - Multi-page stakeholder documentation
debugger.generate_executive_report()

Data Lineage Tracking

# Export data lineage information
lineage_json = debugger.export_lineage(format="json")

# Track column changes automatically
@debugger.track_operation("Add Features")
def add_features(df):
    df['feature_1'] = df['value'].rolling(7).mean()
    df['feature_2'] = df['value'].shift(1)
    return df
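
Column-level lineage of this sort boils down to diffing the schema before and after an operation. The helper below is a simplified illustration of that idea, not DataProbe's internals:

```python
import pandas as pd

def column_changes(before: pd.DataFrame, after: pd.DataFrame) -> dict:
    """Report which columns a transformation added or removed."""
    return {
        "added": sorted(set(after.columns) - set(before.columns)),
        "removed": sorted(set(before.columns) - set(after.columns)),
    }

df_in = pd.DataFrame({"value": [1, 2, 3]})
df_out = df_in.assign(feature_1=df_in["value"].rolling(2).mean())

print(column_changes(df_in, df_out))  # {'added': ['feature_1'], 'removed': []}
```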

Custom Metadata

@debugger.track_operation("Process Batch", batch_id=123, source="api")
def process_batch(data):
    # Keyword arguments (batch_id, source) are stored as operation
    # metadata and included in reports
    processed_data = [record for record in data if record is not None]  # example transformation
    return processed_data

Checkpoint Saving

# Auto-save is enabled by default
debugger = PipelineDebugger(name="Pipeline", auto_save=True)

# Manual checkpoint
debugger.save_checkpoint()

📈 Performance Tips

  1. Production Use: The debugger adds minimal overhead, but for production pipelines you can disable tracking entirely:
   debugger = PipelineDebugger(name="Pipeline", track_memory=False, track_lineage=False)
  2. Batch Operations: Group small operations together to reduce tracking overhead
  3. Memory Monitoring: Set appropriate memory thresholds to catch issues early:
   debugger = PipelineDebugger(name="Pipeline", memory_threshold_mb=500)

💼 Enterprise Features

  • ✅ Professional Styling: Modern design matching enterprise standards
  • ✅ Executive Ready: Suitable for stakeholder presentations
  • ✅ Performance Insights: AI-powered optimization recommendations
  • ✅ Export Options: High-resolution PNG outputs
  • ✅ Responsive Design: Scales from detailed debugging to executive overview
  • ✅ Real-time Metrics: Live performance and memory tracking

๐Ÿค Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

๐Ÿ™ Acknowledgments

  • Built with Rich for beautiful terminal output
  • Uses NetworkX for pipeline visualization
  • Enhanced with Matplotlib and Seaborn for enterprise-grade visualizations
  • Inspired by the need for better data pipeline debugging tools

📞 Support

๐Ÿ—บ๏ธ Roadmap

  • Enterprise-grade dashboard visualizations
  • 3D pipeline network views
  • Executive-level reporting capabilities
  • Support for distributed pipeline debugging
  • Integration with popular orchestration tools (Airflow, Prefect, Dagster)
  • Real-time pipeline monitoring dashboard
  • Advanced anomaly detection in data flow
  • Support for streaming data pipelines

Made with โค๏ธ by Santhosh Krishnan R


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dataprobe-1.0.0.tar.gz (29.2 kB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

dataprobe-1.0.0-py3-none-any.whl (24.0 kB)

Uploaded Python 3

File details

Details for the file dataprobe-1.0.0.tar.gz.

File metadata

  • Download URL: dataprobe-1.0.0.tar.gz
  • Upload date:
  • Size: 29.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.3

File hashes

Hashes for dataprobe-1.0.0.tar.gz

  • SHA256: 2a6ca8b0e61ecab351dcfcb0c6a3a959253a29b5041b3d96184ce2a408e69fc2
  • MD5: 2897faf711af0f58d726aa9d0b963fa0
  • BLAKE2b-256: ba1a5770c040890db153e6e4b206da98084cc178a2396cca8b38fad813bac1ed

See more details on using hashes here.

File details

Details for the file dataprobe-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: dataprobe-1.0.0-py3-none-any.whl
  • Upload date:
  • Size: 24.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.3

File hashes

Hashes for dataprobe-1.0.0-py3-none-any.whl

  • SHA256: 3235bc757c3d84ff19f0415977f44d60f0f6b090c89ed460abe0ce44e421b3da
  • MD5: feead159879dec248b6ab9c224d9ddad
  • BLAKE2b-256: 1515d32e7e38ebc64c417a4fe43bcbea017f8250af77f685bd9b1e11321eb3b5

See more details on using hashes here.
