Enterprise-grade Python data lineage tracking library with automatic pandas integration, memory optimization, and comprehensive visualization capabilities.
🚀 DataLineagePy 3.0
Enterprise-Grade Python Data Lineage Tracking
Beautiful, Powerful, and Effortless Data Lineage for Python
Track, visualize, and govern your data pipelines with zero friction.
🌟 Why DataLineagePy?
- Automatic, column-level lineage tracking for all pandas DataFrames
- Enterprise performance: memory-optimized, scalable, and production-ready
- Stunning visualizations: interactive dashboards, HTML, PNG, SVG, and more
- Plug-and-play connectors: MySQL, PostgreSQL, SQLite, and custom sources
- Security & compliance: RBAC, AES-256 encryption, audit trails
- Real-time collaboration: WebSocket server/client for team workflows
- ML/AI pipeline tracking: Full auditability for machine learning steps
- Cloud-native deployment: Docker, Kubernetes, Helm, Terraform
📋 Table of Contents
- Quick Start
- Installation
- Core Features
- Usage Guide
- Database Connectors
- Visualization & Reporting
- Performance Monitoring
- Security & Compliance
- ML/AI Pipeline Tracking
- Enterprise Deployment
- Use Cases
- Documentation
- Contributing
- License
🚀 Quick Start
pip install datalineagepy
from datalineagepy import LineageTracker, LineageDataFrame
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
tracker = LineageTracker(name="demo")
ldf = LineageDataFrame(df, name="my_df", tracker=tracker)
ldf2 = ldf.filter(ldf._df['a'] > 1)
ldf3 = ldf2.assign(c=ldf2._df['a'] + ldf2._df['b'])
tracker.visualize() # Interactive HTML dashboard
tracker.export_lineage("lineage.json")
💾 Installation
- PyPI: pip install datalineagepy
- With visualization: pip install datalineagepy[viz]
- All features: pip install datalineagepy[all]
- Conda: conda install -c conda-forge datalineagepy (coming soon)
- Docker: docker pull datalineagepy/datalineagepy:latest
See Installation Guide for advanced and enterprise setup.
📚 Core Features
- Automatic lineage tracking for pandas DataFrames
- Data validation: completeness, uniqueness, range, custom rules
- Profiling & analytics: quality scoring, missing data, correlations
- Visualization: HTML, PNG, SVG, interactive dashboards
- Performance monitoring: execution time, memory, alerts
- Security: RBAC, AES-256 encryption, audit trail
- Custom connectors: SDK for any data source
- Versioning: save, diff, rollback lineage graphs
- Collaboration: real-time editing/viewing
- ML/AI pipeline tracking: AutoMLTracker for full auditability
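The versioning feature listed above (save, diff, rollback) can be pictured with plain Python sets. The following is a minimal sketch, not DataLineagePy's implementation: it assumes a lineage graph can be reduced to a set of (source, target) edges, and `diff_edges` is a hypothetical helper introduced only for illustration.

```python
# Hypothetical helper: NOT part of DataLineagePy's API.
def diff_edges(old, new):
    """Return edges added and removed between two lineage snapshots."""
    old_set, new_set = set(old), set(new)
    return {
        "added": sorted(new_set - old_set),
        "removed": sorted(old_set - new_set),
    }

# Two made-up snapshots of the same pipeline.
v1 = [("raw", "clean"), ("clean", "report")]
v2 = [("raw", "clean"), ("clean", "features"), ("features", "report")]

changes = diff_edges(v1, v2)
print(changes)
```

A rollback in this toy model would simply mean restoring the older edge list.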
🔧 Usage Guide
1. Lineage Tracking
from datalineagepy import LineageTracker, LineageDataFrame
import pandas as pd
tracker = LineageTracker(name="my_pipeline")
df = pd.DataFrame({'x': [1,2,3], 'y': [4,5,6]})
ldf = LineageDataFrame(df, name="input", tracker=tracker)
ldf2 = ldf.assign(z=ldf._df['x'] + ldf._df['y'])
print(tracker.export_graph())
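To make "column-level lineage" concrete, here is a small standard-library sketch of the kind of record such a tracker might keep. It is an illustration under assumptions, not the library's internals: `ColumnLineage`, `record`, and `sources_of` are hypothetical names.

```python
from dataclasses import dataclass, field

@dataclass
class ColumnLineage:
    """Toy record of which input columns each output column depends on."""
    edges: dict = field(default_factory=dict)

    def record(self, output_col, input_cols):
        # Store the direct dependencies of one derived column.
        self.edges[output_col] = sorted(input_cols)

    def sources_of(self, col):
        # Walk back through recorded dependencies until reaching base columns.
        parents = self.edges.get(col)
        if parents is None:
            return {col}  # a base column is its own source
        found = set()
        for parent in parents:
            found |= self.sources_of(parent)
        return found

lineage = ColumnLineage()
lineage.record("z", ["x", "y"])  # mirrors z = x + y from the snippet above
lineage.record("w", ["z"])       # a second, made-up derived column
print(sorted(lineage.sources_of("w")))
```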
2. Data Validation
from datalineagepy.core.validation import DataValidator
validator = DataValidator(tracker)
rules = {'completeness': {'threshold': 0.9}, 'uniqueness': {'columns': ['x']}}
results = validator.validate_dataframe(ldf, rules)
print(results)
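The completeness and uniqueness rules above correspond to simple computations. The sketch below shows one plausible reading of them in plain Python; the function names, the row format, and the exact definitions are assumptions, and DataValidator may compute these differently.

```python
# Hypothetical stand-ins for the two rule types; not DataLineagePy code.
def completeness(rows, columns):
    """Fraction of non-missing cells across the given columns."""
    total = len(rows) * len(columns)
    if total == 0:
        return 1.0
    present = sum(1 for row in rows for col in columns if row.get(col) is not None)
    return present / total

def is_unique(rows, column):
    """True when no non-missing value repeats in the column."""
    values = [row[column] for row in rows if row.get(column) is not None]
    return len(values) == len(set(values))

# Made-up data with one missing cell.
rows = [{"x": 1, "y": 4}, {"x": 2, "y": None}, {"x": 3, "y": 6}]
score = completeness(rows, ["x", "y"])
print(round(score, 3), score >= 0.9, is_unique(rows, "x"))
```

With a threshold of 0.9, as in the rules dictionary above, this sample would fail the completeness check because only five of six cells are populated.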
3. Profiling & Analytics
from datalineagepy.core.analytics import DataProfiler
profiler = DataProfiler(tracker)
profile = profiler.profile_dataset(ldf, include_correlations=True)
print(profile)
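Two of the statistics a profile typically contains, the missing-data ratio and pairwise correlation, can be sketched without any dependencies. This is an illustration only; `missing_ratio` and `pearson` are hypothetical helpers, and the structure of the profile returned by DataProfiler is not shown here.

```python
from math import sqrt

def missing_ratio(values):
    """Share of entries that are None."""
    return sum(v is None for v in values) / len(values) if values else 0.0

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant column
    return cov / sqrt(var_x * var_y)

x = [1, 2, 3]
y = [4, 5, 6]
print(missing_ratio([1, None, 3, None]), pearson(x, y))
```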
4. Visualization & Reporting
from datalineagepy.visualization.graph_visualizer import GraphVisualizer
visualizer = GraphVisualizer(tracker)
visualizer.generate_html("lineage.html")
visualizer.generate_png("lineage.png")
5. Performance Monitoring
from datalineagepy.core.performance import PerformanceMonitor
monitor = PerformanceMonitor(tracker)
monitor.start_monitoring()
_ = ldf._df.sum()
monitor.stop_monitoring()
print(monitor.get_performance_summary())
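For intuition about what a start/stop monitor measures, the sketch below times a block and records its peak traced memory using only the standard library. It is a stand-in under assumptions, not how PerformanceMonitor is implemented; `measure` is a hypothetical helper.

```python
import time
import tracemalloc
from contextlib import contextmanager

@contextmanager
def measure(label, results):
    """Record wall-clock seconds and peak traced memory for a code block."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        results[label] = {"seconds": elapsed, "peak_bytes": peak}

results = {}
with measure("sum", results):
    total = sum(range(100_000))
print(total, results["sum"])
```

A context manager keeps the start and stop calls paired even if the measured block raises.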
6. Security & Compliance
from datalineagepy.security.rbac import RBACManager
rbac = RBACManager()
rbac.add_role('admin', ['read', 'write'])
rbac.add_user('alice', ['admin'])
print(rbac.check_access('alice', 'write'))
from datalineagepy.security.encryption.data_encryption import EncryptionManager
import os
os.environ['MASTER_ENCRYPTION_KEY'] = 'supersecretkey1234567890123456'  # demo value only; never hard-code a real key
enc_mgr = EncryptionManager()
secret = 'Sensitive Data'
encrypted = enc_mgr.encrypt_sensitive_data(secret)
decrypted = enc_mgr.decrypt_sensitive_data(encrypted)
print(decrypted)
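The role and user calls at the start of this section follow the usual RBAC model: users map to roles, and roles map to permissions. The toy class below sketches that model in plain Python. It borrows the method names from the snippet above for readability, but `SimpleRBAC` is a hypothetical stand-in and says nothing about how RBACManager stores or checks access.

```python
class SimpleRBAC:
    """Toy role-based access control: users -> roles -> permissions."""

    def __init__(self):
        self.role_permissions = {}
        self.user_roles = {}

    def add_role(self, role, permissions):
        self.role_permissions[role] = set(permissions)

    def add_user(self, user, roles):
        self.user_roles[user] = set(roles)

    def check_access(self, user, permission):
        # Access is granted if any of the user's roles carries the permission.
        return any(
            permission in self.role_permissions.get(role, set())
            for role in self.user_roles.get(user, set())
        )

rbac_demo = SimpleRBAC()
rbac_demo.add_role("admin", ["read", "write"])
rbac_demo.add_role("viewer", ["read"])   # made-up second role
rbac_demo.add_user("alice", ["admin"])
rbac_demo.add_user("bob", ["viewer"])    # made-up second user
print(rbac_demo.check_access("alice", "write"), rbac_demo.check_access("bob", "write"))
```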
7. Database Connectors
from datalineagepy.connectors.database.mysql_connector import MySQLConnector
from datalineagepy.core import LineageTracker
db_config = {'host': 'localhost', 'user': 'root', 'password': 'password', 'database': 'test_db'}
tracker = LineageTracker()
conn = MySQLConnector(**db_config, lineage_tracker=tracker)
conn.execute_query('SELECT * FROM test_table')
conn.close()
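To try the query-tracking idea without a running MySQL server, the sketch below wraps the standard-library sqlite3 module and records every statement it executes. It is an illustration under assumptions: `LoggingConnection` and `query_log` are hypothetical names, and the real connectors may record far more than the SQL text.

```python
import sqlite3

class LoggingConnection:
    """Toy wrapper that records each SQL statement it runs."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.query_log = []

    def execute_query(self, sql, params=()):
        # Log first so failed statements are still visible in the log.
        self.query_log.append(sql)
        cursor = self.conn.execute(sql, params)
        return cursor.fetchall()

    def close(self):
        self.conn.close()

db = LoggingConnection()
db.execute_query("CREATE TABLE test_table (id INTEGER, name TEXT)")
db.execute_query("INSERT INTO test_table VALUES (?, ?)", (1, "a"))
rows = db.execute_query("SELECT * FROM test_table")
print(rows, len(db.query_log))
db.close()
```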
8. ML/AI Pipeline Tracking
from datalineagepy import AutoMLTracker
tracker = AutoMLTracker(name='ml_pipeline')
tracker.log_step('fit', model='LogisticRegression', params={'solver': 'lbfgs'})
tracker.log_step('predict', model='LogisticRegression')
print(tracker.export_ai_ready_format())
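Conceptually, step logging amounts to an append-only list of records that can be serialized for audit. The sketch below shows that idea with the standard library. `StepLog` and its JSON layout are assumptions made for illustration; the format produced by export_ai_ready_format() is not documented here and may differ.

```python
import json
from datetime import datetime, timezone

class StepLog:
    """Toy append-only log of ML pipeline steps."""

    def __init__(self, name):
        self.name = name
        self.steps = []

    def log_step(self, step, **details):
        # Each record keeps its position, its details, and a UTC timestamp.
        self.steps.append({
            "index": len(self.steps),
            "step": step,
            "details": details,
            "logged_at": datetime.now(timezone.utc).isoformat(),
        })

    def to_json(self):
        return json.dumps({"pipeline": self.name, "steps": self.steps}, indent=2)

log = StepLog("ml_pipeline")
log.log_step("fit", model="LogisticRegression", params={"solver": "lbfgs"})
log.log_step("predict", model="LogisticRegression")
exported = json.loads(log.to_json())
print([s["step"] for s in exported["steps"]])
```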
📊 Visualization & Reporting
- Interactive HTML dashboards: tracker.visualize()
- Export formats: JSON, DOT, PNG, SVG, Excel, CSV
- Custom visualizations: Use GraphVisualizer for advanced needs
🗄️ Database Connectors
- MySQL, PostgreSQL, SQLite: Full lineage tracking for every query
- Custom connectors: Build your own with the SDK
- See Database Connectors Guide
⚡ Performance Monitoring
- Track execution time, memory, and operation stats
- Alerting: Slack, Email, custom hooks
- Production monitoring: Integrate with Prometheus, Grafana, etc.
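A custom alert hook can be as simple as a callable invoked when a metric crosses its limit. The sketch below illustrates that pattern. The function `check_thresholds`, the hook signature, and the metric names are all hypothetical; the library's actual alerting interface is not described in this README.

```python
# Hypothetical alerting pattern; not DataLineagePy's API.
def check_thresholds(metrics, limits, hooks):
    """Call every hook once per metric that exceeds its limit."""
    breaches = []
    for name, limit in limits.items():
        value = metrics.get(name)
        if value is not None and value > limit:
            breaches.append((name, value, limit))
            for hook in hooks:
                hook(name, value, limit)
    return breaches

sent = []

def record_alert(name, value, limit):
    # A real hook might post to Slack or send an email instead.
    sent.append(f"{name}={value} exceeded {limit}")

breaches = check_thresholds(
    {"seconds": 2.5, "peak_mb": 120},   # made-up measurements
    {"seconds": 1.0, "peak_mb": 512},   # made-up limits
    [record_alert],
)
print(sent)
```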
🔒 Security & Compliance
- RBAC: Role-based access control for users and actions
- AES-256 encryption: At-rest and in-transit data protection
- Audit trail: Full operation history for compliance
🤖 ML/AI Pipeline Tracking
- AutoMLTracker: Log, audit, and export every ML pipeline step
- Explainability: Export pipeline steps for downstream analysis
☁️ Enterprise Deployment
- Docker, Kubernetes, Helm, Terraform: Cloud-native ready
- Production scripts: See deploy/ for examples
💡 Use Cases
- Data science: Reproducibility, experiment tracking, Jupyter integration
- Enterprise ETL: Production pipelines, data quality, compliance
- Data governance: Impact analysis, documentation, audit trails
- ML/AI: Pipeline explainability, model audit, feature tracking
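Impact analysis, mentioned under data governance above, is essentially a forward traversal of the lineage graph: given one dataset, find everything derived from it. The sketch below shows a breadth-first version over a made-up edge list. `downstream` is a hypothetical helper, not a DataLineagePy function.

```python
from collections import deque

def downstream(edges, start):
    """Return every node reachable from `start` by following edges forward."""
    children = {}
    for src, dst in edges:
        children.setdefault(src, []).append(dst)
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

# Made-up lineage edges: (source, derived dataset).
edges = [
    ("orders", "clean_orders"),
    ("clean_orders", "revenue"),
    ("clean_orders", "churn_features"),
    ("customers", "churn_features"),
]
impacted = downstream(edges, "orders")
print(sorted(impacted))
```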
📖 Documentation
🤝 Contributing
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
📄 License
MIT License. See LICENSE for details.
Beautiful. Powerful. Effortless.