
Enterprise-grade Python data lineage tracking library with automatic pandas integration, memory optimization, and visualization capabilities.


🚀 DataLineagePy 3.0

Enterprise-Grade Python Data Lineage Tracking

Requires Python 3.8+. License: MIT.



Beautiful, Powerful, and Effortless Data Lineage for Python

Track, visualize, and govern your data pipelines with zero friction.


🌟 Why DataLineagePy?

  • Automatic, column-level lineage tracking for all pandas DataFrames
  • Enterprise performance: memory-optimized, scalable, and production-ready
  • Stunning visualizations: interactive dashboards, HTML, PNG, SVG, and more
  • Plug-and-play connectors: MySQL, PostgreSQL, SQLite, and custom sources
  • Security & compliance: RBAC, AES-256 encryption, audit trails
  • Real-time collaboration: WebSocket server/client for team workflows
  • ML/AI pipeline tracking: Full auditability for machine learning steps
  • Cloud-native deployment: Docker, Kubernetes, Helm, Terraform



🚀 Quick Start

Install from PyPI:

pip install datalineagepy

Then track a small pandas pipeline:

from datalineagepy import LineageTracker, LineageDataFrame
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
tracker = LineageTracker(name="demo")
ldf = LineageDataFrame(df, name="my_df", tracker=tracker)
ldf2 = ldf.filter(ldf._df['a'] > 1)
ldf3 = ldf2.assign(c=ldf2._df['a'] + ldf2._df['b'])
tracker.visualize()  # Interactive HTML dashboard
tracker.export_lineage("lineage.json")
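
To make the quick start concrete, here is a minimal, library-independent sketch of the kind of record a column-level lineage tracker could keep for the filter and assign steps above. The class `MiniLineage` and its methods are hypothetical illustrations, not DataLineagePy's API or internals.

```python
from collections import defaultdict


class MiniLineage:
    """Hypothetical toy tracker: maps each output column to the columns it came from."""

    def __init__(self):
        # edges[target] -> set of (operation, source), where target and source
        # are (dataset, column) pairs
        self.edges = defaultdict(set)

    def record(self, operation, target, sources):
        """Register that `target` was produced from `sources` by `operation`."""
        for source in sources:
            self.edges[target].add((operation, source))

    def upstream(self, node):
        """Return every (dataset, column) that `node` transitively depends on."""
        seen, stack = set(), [node]
        while stack:
            current = stack.pop()
            for _operation, source in self.edges.get(current, ()):
                if source not in seen:
                    seen.add(source)
                    stack.append(source)
        return seen


lineage = MiniLineage()
# The filter keeps columns a and b; b also depends on a because a drives the predicate.
lineage.record("filter", ("ldf2", "a"), [("my_df", "a")])
lineage.record("filter", ("ldf2", "b"), [("my_df", "b"), ("my_df", "a")])
# The assign step derives c from a + b.
lineage.record("assign", ("ldf3", "c"), [("ldf2", "a"), ("ldf2", "b")])

print(sorted(lineage.upstream(("ldf3", "c"))))
```

Walking the edges backwards from a column answers "where did this value come from?", which is the question lineage tracking exists to answer.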

💾 Installation

  • PyPI: pip install datalineagepy
  • With visualization: pip install datalineagepy[viz]
  • All features: pip install datalineagepy[all]
  • Conda: conda install -c conda-forge datalineagepy (coming soon)
  • Docker: docker pull datalineagepy/datalineagepy:latest

See Installation Guide for advanced and enterprise setup.


📚 Core Features

  • Automatic lineage tracking for pandas DataFrames
  • Data validation: completeness, uniqueness, range, custom rules
  • Profiling & analytics: quality scoring, missing data, correlations
  • Visualization: HTML, PNG, SVG, interactive dashboards
  • Performance monitoring: execution time, memory, alerts
  • Security: RBAC, AES-256 encryption, audit trail
  • Custom connectors: SDK for any data source
  • Versioning: save, diff, rollback lineage graphs
  • Collaboration: real-time editing/viewing
  • ML/AI pipeline tracking: AutoMLTracker for full auditability
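
The versioning feature listed above (save, diff, rollback) can be pictured as set arithmetic over graph edges. The sketch below is an assumption about how such a diff might work, written in plain Python; `diff_lineage` is a hypothetical helper, not a function exported by the library.

```python
def diff_lineage(old_edges, new_edges):
    """Compare two lineage graphs given as iterables of (source, target) edges."""
    old_edges, new_edges = set(old_edges), set(new_edges)
    return {
        "added": sorted(new_edges - old_edges),
        "removed": sorted(old_edges - new_edges),
    }


version_1 = [("my_df.a", "ldf2.a"), ("my_df.b", "ldf2.b")]
version_2 = [("my_df.a", "ldf2.a"), ("ldf2.a", "ldf3.c")]

changes = diff_lineage(version_1, version_2)
print(changes)
```

A rollback under this model would simply restore the saved edge set of the earlier version.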

🔧 Usage Guide

1. Lineage Tracking

from datalineagepy import LineageTracker, LineageDataFrame
import pandas as pd
tracker = LineageTracker(name="my_pipeline")
df = pd.DataFrame({'x': [1,2,3], 'y': [4,5,6]})
ldf = LineageDataFrame(df, name="input", tracker=tracker)
ldf2 = ldf.assign(z=ldf._df['x'] + ldf._df['y'])
print(tracker.export_graph())

2. Data Validation

from datalineagepy.core.validation import DataValidator
validator = DataValidator(tracker)
rules = {'completeness': {'threshold': 0.9}, 'uniqueness': {'columns': ['x']}}
results = validator.validate_dataframe(ldf, rules)
print(results)
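
For readers unfamiliar with the rule names, the following sketch shows one plausible reading of the completeness and uniqueness rules, using plain Python rows instead of a DataFrame. Treating `threshold` as the minimum fraction of non-missing cells is an assumption, and the helper names are hypothetical; the library's actual semantics may differ.

```python
def completeness(rows, columns):
    """Fraction of non-missing cells across the given columns."""
    total = len(rows) * len(columns)
    if total == 0:
        return 1.0
    present = sum(1 for row in rows for col in columns if row.get(col) is not None)
    return present / total


def is_unique(rows, columns):
    """True when no two rows share the same values in `columns`."""
    keys = [tuple(row.get(col) for col in columns) for row in rows]
    return len(keys) == len(set(keys))


rows = [{"x": 1, "y": 4}, {"x": 2, "y": None}, {"x": 3, "y": 6}]
score = completeness(rows, ["x", "y"])  # 5 of the 6 cells are filled
report = {
    "completeness": {"score": score, "passed": score >= 0.9},
    "uniqueness": {"passed": is_unique(rows, ["x"])},
}
print(report)
```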

3. Profiling & Analytics

from datalineagepy.core.analytics import DataProfiler
profiler = DataProfiler(tracker)
profile = profiler.profile_dataset(ldf, include_correlations=True)
print(profile)
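
As a rough illustration of what a profile with correlations contains, the sketch below computes a missing-value ratio and a Pearson correlation by hand. It is a standalone approximation with hypothetical helper names, not the output format or algorithm of `DataProfiler`.

```python
import math


def missing_ratio(values):
    """Share of entries that are None."""
    if not values:
        return 0.0
    return sum(value is None for value in values) / len(values)


def pearson(xs, ys):
    """Pearson correlation over pairs where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        return None  # correlation is undefined for a constant column
    return cov / math.sqrt(var_x * var_y)


x = [1, 2, 3]
y = [4, 5, 6]
profile = {
    "missing": {"x": missing_ratio(x), "y": missing_ratio(y)},
    "correlation_xy": pearson(x, y),
}
print(profile)
```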

4. Visualization & Reporting

from datalineagepy.visualization.graph_visualizer import GraphVisualizer
visualizer = GraphVisualizer(tracker)
visualizer.generate_html("lineage.html")
visualizer.generate_png("lineage.png")

5. Performance Monitoring

from datalineagepy.core.performance import PerformanceMonitor
monitor = PerformanceMonitor(tracker)
monitor.start_monitoring()
_ = ldf._df.sum()
monitor.stop_monitoring()
print(monitor.get_performance_summary())
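
The start/stop pattern above can be approximated with the standard library alone. The context manager below is a sketch of the idea (wall-clock time plus peak Python memory via `tracemalloc`); `measure` is a hypothetical name and this is not how `PerformanceMonitor` is implemented.

```python
import time
import tracemalloc
from contextlib import contextmanager


@contextmanager
def measure(label, results):
    """Record elapsed seconds and peak traced memory for the wrapped block."""
    tracemalloc.start()
    started = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - started
        _current, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        results[label] = {"seconds": elapsed, "peak_bytes": peak}


results = {}
with measure("sum_squares", results):
    total = sum(i * i for i in range(100_000))

print(results)
```

Note that `tracemalloc` only sees allocations made through Python's allocator, so memory held by native extensions can be under-reported.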

6. Security & Compliance

from datalineagepy.security.rbac import RBACManager
rbac = RBACManager()
rbac.add_role('admin', ['read', 'write'])
rbac.add_user('alice', ['admin'])
print(rbac.check_access('alice', 'write'))
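
The role model behind these calls can be sketched in a few lines. `MiniRBAC` below mirrors the method names used above purely for readability; it is a hypothetical toy, and the real `RBACManager` may store and evaluate permissions differently.

```python
class MiniRBAC:
    """Toy role-based access control: users hold roles, roles hold permissions."""

    def __init__(self):
        self.role_permissions = {}
        self.user_roles = {}

    def add_role(self, role, permissions):
        self.role_permissions[role] = set(permissions)

    def add_user(self, user, roles):
        unknown = [role for role in roles if role not in self.role_permissions]
        if unknown:
            raise ValueError(f"unknown roles: {unknown}")
        self.user_roles[user] = set(roles)

    def check_access(self, user, permission):
        """True if any of the user's roles grants the permission."""
        return any(
            permission in self.role_permissions[role]
            for role in self.user_roles.get(user, ())
        )


demo = MiniRBAC()
demo.add_role("admin", ["read", "write"])
demo.add_role("viewer", ["read"])
demo.add_user("alice", ["admin"])
demo.add_user("bob", ["viewer"])

print(demo.check_access("alice", "write"))
print(demo.check_access("bob", "write"))
```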

import os
from datalineagepy.security.encryption.data_encryption import EncryptionManager

# Demo value only: in real deployments, load the key from a secret store instead of hardcoding it.
os.environ['MASTER_ENCRYPTION_KEY'] = 'supersecretkey1234567890123456'
enc_mgr = EncryptionManager()
secret = 'Sensitive Data'
encrypted = enc_mgr.encrypt_sensitive_data(secret)
decrypted = enc_mgr.decrypt_sensitive_data(encrypted)
print(decrypted)
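
Since AES-256 uses a 256-bit key, one way to avoid a hand-typed secret is to generate random key material and validate its length when loading it. The sketch below assumes a hex-encoded key and uses a hypothetical variable name, `DEMO_MASTER_KEY`; how `EncryptionManager` interprets `MASTER_ENCRYPTION_KEY` is not stated here, so check its documentation before adapting this.

```python
import os
import secrets

KEY_BYTES = 32  # AES-256 uses a 256-bit (32-byte) key


def load_master_key(env_var):
    """Read a hex-encoded key from the environment and check its length."""
    raw = os.environ.get(env_var)
    if raw is None:
        raise RuntimeError(f"{env_var} is not set")
    key = bytes.fromhex(raw)  # raises ValueError if the value is not valid hex
    if len(key) != KEY_BYTES:
        raise ValueError(f"expected {KEY_BYTES} bytes, got {len(key)}")
    return key


# Hypothetical variable name, chosen so the demo does not touch real settings.
os.environ["DEMO_MASTER_KEY"] = secrets.token_hex(KEY_BYTES)
key = load_master_key("DEMO_MASTER_KEY")
print(f"loaded a {len(key) * 8}-bit key")
```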

7. Database Connectors

from datalineagepy.connectors.database.mysql_connector import MySQLConnector
from datalineagepy.core import LineageTracker
# Example credentials only: replace with your own connection settings.
db_config = {'host': 'localhost', 'user': 'root', 'password': 'password', 'database': 'test_db'}
tracker = LineageTracker()
conn = MySQLConnector(**db_config, lineage_tracker=tracker)
conn.execute_query('SELECT * FROM test_table')
conn.close()
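
To try the connector idea without a running MySQL server, the sketch below wraps the standard library's `sqlite3` module and logs each query alongside the columns it returned. `LoggingConnection` is a hypothetical stand-in that only illustrates query-level tracking; it is not one of the library's connectors.

```python
import sqlite3


class LoggingConnection:
    """Toy wrapper: runs SQL and remembers what each query produced."""

    def __init__(self, database=":memory:"):
        self._conn = sqlite3.connect(database)
        self.query_log = []

    def execute_query(self, sql, params=()):
        cursor = self._conn.execute(sql, params)
        rows = cursor.fetchall()
        # cursor.description is None for statements that return no rows
        columns = [col[0] for col in cursor.description] if cursor.description else []
        self.query_log.append({"sql": sql, "columns": columns, "row_count": len(rows)})
        return rows

    def close(self):
        self._conn.close()


conn = LoggingConnection()
conn.execute_query("CREATE TABLE test_table (id INTEGER, name TEXT)")
conn.execute_query("INSERT INTO test_table VALUES (?, ?)", (1, "alice"))
rows = conn.execute_query("SELECT * FROM test_table")
query_log = conn.query_log
conn.close()

print(rows)
print(query_log[-1])
```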

8. ML/AI Pipeline Tracking

from datalineagepy import AutoMLTracker
tracker = AutoMLTracker(name='ml_pipeline')
tracker.log_step('fit', model='LogisticRegression', params={'solver': 'lbfgs'})
tracker.log_step('predict', model='LogisticRegression')
print(tracker.export_ai_ready_format())
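
Conceptually, step logging amounts to appending structured records and serializing them. The sketch below shows that idea with the standard library; `MiniStepLog` and its JSON layout are hypothetical and do not reflect the format returned by `export_ai_ready_format()`.

```python
import json
from datetime import datetime, timezone


class MiniStepLog:
    """Toy pipeline log: an ordered list of steps with free-form details."""

    def __init__(self, name):
        self.name = name
        self.steps = []

    def log_step(self, step, **details):
        self.steps.append({
            "index": len(self.steps),
            "step": step,
            "logged_at": datetime.now(timezone.utc).isoformat(),
            **details,
        })

    def to_json(self):
        return json.dumps({"pipeline": self.name, "steps": self.steps}, indent=2)


log = MiniStepLog("ml_pipeline")
log.log_step("fit", model="LogisticRegression", params={"solver": "lbfgs"})
log.log_step("predict", model="LogisticRegression")

exported = json.loads(log.to_json())
print(log.to_json())
```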

📊 Visualization & Reporting

  • Interactive HTML dashboards: tracker.visualize()
  • Export formats: JSON, DOT, PNG, SVG, Excel, CSV
  • Custom visualizations: Use GraphVisualizer for advanced needs
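
DOT, one of the export formats listed above, is a plain-text graph language that Graphviz can render to PNG or SVG. The sketch below shows how lineage edges could be written as DOT by hand; `to_dot` is a hypothetical helper and the output is not guaranteed to match what the library emits.

```python
def _quote(name):
    """Wrap a node name in double quotes, escaping embedded quotes."""
    return '"' + str(name).replace('"', '\\"') + '"'


def to_dot(edges, graph_name="lineage"):
    """Render (source, target, operation) edges as a DOT digraph."""
    lines = [f"digraph {graph_name} {{"]
    for source, target, operation in edges:
        lines.append(f"  {_quote(source)} -> {_quote(target)} [label={_quote(operation)}];")
    lines.append("}")
    return "\n".join(lines)


edges = [("my_df", "ldf2", "filter"), ("ldf2", "ldf3", "assign")]
dot = to_dot(edges)
print(dot)
```

Saving the string to a file such as `lineage.dot` lets the Graphviz command-line tools render it, for example with `dot -Tsvg lineage.dot -o lineage.svg`.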

🗄️ Database Connectors

  • MySQL, PostgreSQL, SQLite: Full lineage tracking for every query
  • Custom connectors: Build your own with the SDK
  • See Database Connectors Guide

⚡ Performance Monitoring

  • Track execution time, memory, and operation stats
  • Alerting: Slack, Email, custom hooks
  • Production monitoring: Integrate with Prometheus, Grafana, etc.

🔒 Security & Compliance

  • RBAC: Role-based access control for users and actions
  • AES-256 encryption: At-rest and in-transit data protection
  • Audit trail: Full operation history for compliance

🤖 ML/AI Pipeline Tracking

  • AutoMLTracker: Log, audit, and export every ML pipeline step
  • Explainability: Export pipeline steps for downstream analysis

☁️ Enterprise Deployment

  • Docker, Kubernetes, Helm, Terraform: Cloud-native ready
  • Production scripts: See deploy/ for examples

💡 Use Cases

  • Data science: Reproducibility, experiment tracking, Jupyter integration
  • Enterprise ETL: Production pipelines, data quality, compliance
  • Data governance: Impact analysis, documentation, audit trails
  • ML/AI: Pipeline explainability, model audit, feature tracking
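
Impact analysis, mentioned under data governance above, boils down to a graph traversal: given a changed column, find everything downstream of it. The sketch below uses a breadth-first search over hypothetical column names; `downstream` is an illustrative helper, not a library function.

```python
from collections import deque


def downstream(edges, start):
    """Return nodes reachable from `start`, in breadth-first order."""
    children = {}
    for source, target in edges:
        children.setdefault(source, []).append(target)

    impacted, seen, queue = [], {start}, deque([start])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                impacted.append(child)
                queue.append(child)
    return impacted


edges = [
    ("raw.a", "clean.a"),
    ("clean.a", "report.total"),
    ("raw.b", "clean.b"),
]
impacted = downstream(edges, "raw.a")
print(impacted)
```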



🤝 Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.


📄 License

MIT License. See LICENSE for details.


DataLineagePy 3.0 — The new standard for Python data lineage
Beautiful. Powerful. Effortless.

Download files

  • Source distribution: datalineagepy-3.0.1.tar.gz (584.6 kB)
  • Built distribution (Python 3 wheel): datalineagepy-3.0.1-py3-none-any.whl (389.3 kB)