Advanced data pipeline debugging and profiling tools for Python
DataProbe
DataProbe is a comprehensive Python toolkit for debugging, profiling, and optimizing data pipelines. It provides powerful tools to track data lineage, identify bottlenecks, monitor memory usage, and visualize pipeline execution flow with enterprise-grade visualizations.
🚨 NEW: Enterprise-Grade Visualizations
DataProbe v1.0.0 introduces comprehensive pipeline debugging capabilities with professional-quality visualizations, intelligent optimization recommendations, advanced memory profiling, data lineage tracking, and enterprise-grade reporting.
Dashboard Features
🏢 Enterprise Dashboard
- KPI Panels: Real-time success rates, duration, memory usage
- Pipeline Flowchart: Interactive operation flow with status indicators
- Performance Analytics: Memory usage timelines with peak detection
- Data Insights: Comprehensive lineage and transformation tracking
# Generate enterprise dashboard
debugger.visualize_pipeline()
🌐 3D Pipeline Network
- 3D Visualization: Interactive network showing operation relationships
- Performance Mapping: Z-axis represents operation duration
- Status Color-coding: Visual error and bottleneck identification
# Create 3D network visualization
debugger.create_3d_pipeline_visualization()
📋 Executive Reports
- Multi-page Reports: Professional stakeholder-ready documentation
- Performance Trends: Dual-axis charts showing duration and memory patterns
- Optimization Recommendations: AI-powered suggestions for improvements
- Data Quality Metrics: Comprehensive pipeline health scoring
# Generate executive report
debugger.generate_executive_report()
Color-Coded Status System
- 🟢 Success: Operations completed without issues
- 🟡 Warning: Performance bottlenecks detected
- 🔴 Error: Failed operations requiring attention
- 🔵 Info: Data flow and transformation indicators
🚀 Features
PipelineDebugger
- 🔍 Operation Tracking: Automatically track execution time, memory usage, and data shapes for each operation
- 📊 Enterprise-Grade Visualizations: Professional dashboards, 3D networks, and executive reports
- 💾 Memory Profiling: Monitor memory usage and identify memory-intensive operations
- 🔗 Data Lineage: Track data transformations and column changes throughout the pipeline
- ⚠️ Bottleneck Detection: Automatically identify slow operations and memory peaks
- 📄 Performance Reports: Generate comprehensive debugging reports with optimization suggestions
- 🎯 Error Tracking: Capture and track errors with full traceback information
- 🌳 Nested Operations: Support for tracking nested function calls and their relationships
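To make the operation-tracking idea concrete, here is a minimal, hypothetical sketch of a timing-and-memory decorator using only the standard library. It illustrates the general technique, not DataProbe's actual implementation (the real track_operation records far more, such as data shapes and lineage):

```python
import time
import tracemalloc
from functools import wraps

def track_operation(name, log):
    """Record duration, peak memory, and status of each call (sketch only)."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            tracemalloc.start()
            start = time.perf_counter()
            status = "error"
            try:
                result = func(*args, **kwargs)
                status = "success"
                return result
            finally:
                duration = time.perf_counter() - start
                _, peak = tracemalloc.get_traced_memory()
                tracemalloc.stop()
                log.append({"name": name, "status": status,
                            "duration_s": duration, "peak_mem_bytes": peak})
        return wrapper
    return decorator

log = []

@track_operation("Square numbers", log)
def square(xs):
    return [x * x for x in xs]

square([1, 2, 3])                        # returns [1, 4, 9]
print(log[0]["name"], log[0]["status"])  # Square numbers success
```

The `try/finally` ensures a log entry is written even when the wrapped function raises, which is how error tracking can coexist with timing in a single decorator.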
📦 Installation
pip install dataprobe
For development installation:
git clone https://github.com/santhoshkrishnan30/dataprobe.git
cd dataprobe
pip install -e ".[dev]"
🎯 Quick Start
Basic Usage with Enhanced Visualizations
from dataprobe import PipelineDebugger
import pandas as pd
# Initialize the debugger with enhanced features
debugger = PipelineDebugger(
    name="My_ETL_Pipeline",
    track_memory=True,
    track_lineage=True
)
# Use decorators to track operations
@debugger.track_operation("Load Data")
def load_data(file_path):
    return pd.read_csv(file_path)

@debugger.track_operation("Transform Data")
def transform_data(df):
    df['new_column'] = df['value'] * 2
    return df
# Run your pipeline
df = load_data("data.csv")
df = transform_data(df)
# Generate enterprise-grade visualizations
debugger.visualize_pipeline() # Enterprise dashboard
debugger.create_3d_pipeline_visualization() # 3D network view
debugger.generate_executive_report() # Executive report
# Get AI-powered optimization suggestions
suggestions = debugger.suggest_optimizations()
for suggestion in suggestions:
    print(f"💡 {suggestion['suggestion']}")
# Print summary and reports
debugger.print_summary()
report = debugger.generate_report()
Memory Profiling
import numpy as np

@debugger.profile_memory
def memory_intensive_operation():
    large_df = pd.DataFrame(np.random.randn(1000000, 50))
    result = large_df.groupby(large_df.index % 1000).mean()
    return result
DataFrame Analysis
# Analyze DataFrames for potential issues
debugger.analyze_dataframe(df, name="Sales Data")
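For intuition, here is a hypothetical, pandas-free sketch of the kind of checks such an analysis can run — null counts per column and duplicate rows over a list-of-dicts dataset. This is illustrative only; DataProbe's analyze_dataframe operates on real DataFrames:

```python
def analyze_rows(rows):
    """Toy dataset health check: nulls per column and duplicate rows."""
    columns = list(rows[0].keys()) if rows else []
    null_counts = {c: sum(1 for r in rows if r.get(c) is None) for c in columns}
    seen, duplicates = set(), 0
    for r in rows:
        key = tuple(sorted(r.items()))  # hashable row fingerprint
        if key in seen:
            duplicates += 1
        seen.add(key)
    return {"rows": len(rows), "null_counts": null_counts, "duplicates": duplicates}

data = [
    {"id": 1, "value": 10},
    {"id": 2, "value": None},
    {"id": 1, "value": 10},   # exact duplicate of the first row
]
print(analyze_rows(data))
# {'rows': 3, 'null_counts': {'id': 0, 'value': 1}, 'duplicates': 1}
```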
📊 Example Output
Enterprise Dashboard
Professional KPI dashboard with real-time metrics, pipeline flowchart, memory analytics, and performance insights.
Pipeline Summary
Pipeline Summary: My_ETL_Pipeline
├── Execution Statistics
│   ├── Total Operations: 5
│   ├── Total Duration: 2.34s
│   └── Total Memory Used: 125.6MB
├── Bottlenecks (1)
│   └── Transform Data: 1.52s
└── Memory Peaks (1)
    └── Load Large Dataset: +85.3MB
Optimization Suggestions
💡 OPTIMIZATION RECOMMENDATIONS:
1. [PERFORMANCE] Transform Data
   Issue: Operation took 1.52s
   💡 Consider optimizing this operation or parallelizing if possible
2. [MEMORY] Load Large Dataset
   Issue: High memory usage: +85.3MB
   💡 Consider processing data in chunks or optimizing memory usage
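The chunking suggestion above can be sketched in plain Python: stream the data in fixed-size chunks so only one chunk is resident in memory at a time. This is a generic illustration, not a DataProbe API:

```python
from itertools import islice

def iter_chunks(iterable, chunk_size):
    """Yield successive lists of at most chunk_size items."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk

# Compute a running aggregate without materializing the whole dataset;
# range(10) stands in for a large source such as pd.read_csv(..., chunksize=...).
total, count = 0, 0
for chunk in iter_chunks(range(10), 4):
    total += sum(chunk)
    count += len(chunk)

print(total / count)  # 4.5
```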
🔧 Advanced Features
Multiple Visualization Options
# Enterprise dashboard - Professional KPI dashboard
debugger.visualize_pipeline()
# 3D network visualization - Interactive operation relationships
debugger.create_3d_pipeline_visualization()
# Executive report - Multi-page stakeholder documentation
debugger.generate_executive_report()
Data Lineage Tracking
# Export data lineage information
lineage_json = debugger.export_lineage(format="json")
# Track column changes automatically
@debugger.track_operation("Add Features")
def add_features(df):
    df['feature_1'] = df['value'].rolling(7).mean()
    df['feature_2'] = df['value'].shift(1)
    return df
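Conceptually, column-change tracking reduces to diffing the schema before and after each operation. Here is a hypothetical helper showing the idea (DataProbe performs this automatically when track_lineage=True):

```python
def diff_columns(before, after):
    """Return columns added and removed between two schema snapshots."""
    before, after = set(before), set(after)
    return {"added": sorted(after - before), "removed": sorted(before - after)}

# Schema before and after an add_features-style step:
print(diff_columns(["value"], ["value", "feature_1", "feature_2"]))
# {'added': ['feature_1', 'feature_2'], 'removed': []}
```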
Custom Metadata
@debugger.track_operation("Process Batch", batch_id=123, source="api")
def process_batch(data):
    # Operation metadata is stored and included in reports
    return processed_data
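As a sketch of how such metadata might be carried along, here is a hypothetical decorator that simply stores its extra keyword arguments with each run. This is illustrative only, not DataProbe's implementation:

```python
from functools import wraps

operations_log = []  # hypothetical in-memory store for tracked runs

def track_operation(name, **metadata):
    """Sketch: attach free-form metadata to every tracked call."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            operations_log.append({"name": name, "metadata": metadata})
            return func(*args, **kwargs)
        return wrapper
    return decorator

@track_operation("Process Batch", batch_id=123, source="api")
def process_batch(data):
    return [item.upper() for item in data]

process_batch(["a", "b"])             # returns ['A', 'B']
print(operations_log[0]["metadata"])  # {'batch_id': 123, 'source': 'api'}
```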
Checkpoint Saving
# Auto-save is enabled by default
debugger = PipelineDebugger(name="Pipeline", auto_save=True)
# Manual checkpoint
debugger.save_checkpoint()
📈 Performance Tips
- Disable Tracking in Production: The debugger adds minimal overhead, but for production pipelines you can turn tracking off:
debugger = PipelineDebugger(name="Pipeline", track_memory=False, track_lineage=False)
- Batch Operations: Group small operations together to reduce tracking overhead
- Memory Monitoring: Set an appropriate memory threshold to catch issues early:
debugger = PipelineDebugger(name="Pipeline", memory_threshold_mb=500)
💼 Enterprise Features
✅ Professional Styling: Modern design matching enterprise standards
✅ Executive Ready: Suitable for stakeholder presentations
✅ Performance Insights: AI-powered optimization recommendations
✅ Export Options: High-resolution PNG outputs
✅ Responsive Design: Scales from detailed debugging to executive overview
✅ Real-time Metrics: Live performance and memory tracking
🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
- Fork the repository
- Create your feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
🙏 Acknowledgments
- Built with Rich for beautiful terminal output
- Uses NetworkX for pipeline visualization
- Enhanced with Matplotlib and Seaborn for enterprise-grade visualizations
- Inspired by the need for better data pipeline debugging tools
📞 Support
- 📧 Email: santhoshkrishnan3006@gmail.com
- 🐛 Issues: GitHub Issues
- 📖 Documentation: Read the Docs
🗺️ Roadmap
- Enterprise-grade dashboard visualizations
- 3D pipeline network views
- Executive-level reporting capabilities
- Support for distributed pipeline debugging
- Integration with popular orchestration tools (Airflow, Prefect, Dagster)
- Real-time pipeline monitoring dashboard
- Advanced anomaly detection in data flow
- Support for streaming data pipelines
Made with ❤️ by Santhosh Krishnan R
Download files
Source Distribution
Built Distribution
File details
Details for the file dataprobe-1.0.0.tar.gz.
File metadata
- Download URL: dataprobe-1.0.0.tar.gz
- Upload date:
- Size: 29.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2a6ca8b0e61ecab351dcfcb0c6a3a959253a29b5041b3d96184ce2a408e69fc2 |
| MD5 | 2897faf711af0f58d726aa9d0b963fa0 |
| BLAKE2b-256 | ba1a5770c040890db153e6e4b206da98084cc178a2396cca8b38fad813bac1ed |
File details
Details for the file dataprobe-1.0.0-py3-none-any.whl.
File metadata
- Download URL: dataprobe-1.0.0-py3-none-any.whl
- Upload date:
- Size: 24.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3235bc757c3d84ff19f0415977f44d60f0f6b090c89ed460abe0ce44e421b3da |
| MD5 | feead159879dec248b6ab9c224d9ddad |
| BLAKE2b-256 | 1515d32e7e38ebc64c417a4fe43bcbea017f8250af77f685bd9b1e11321eb3b5 |