Enterprise-grade Python data lineage tracking library with automatic pandas integration, memory optimization, and comprehensive visualization capabilities.

Project description

🚀 DataLineagePy 3.0

Enterprise-Grade Python Data Lineage Tracking

Python 3.8+ | License: MIT | Production Ready | Performance Score | Enterprise Grade


Beautiful, Powerful, and Effortless Data Lineage for Python

Track, visualize, and govern your data pipelines with zero friction.


🌟 Why DataLineagePy?

  • Automatic, column-level lineage tracking for all pandas DataFrames
  • Enterprise performance: memory-optimized, scalable, and production-ready
  • Stunning visualizations: interactive dashboards, HTML, PNG, SVG, and more
  • Plug-and-play connectors: MySQL, PostgreSQL, SQLite, and custom sources
  • Security & compliance: RBAC, AES-256 encryption, audit trails
  • Real-time collaboration: WebSocket server/client for team workflows
  • ML/AI pipeline tracking: Full auditability for machine learning steps
  • Cloud-native deployment: Docker, Kubernetes, Helm, Terraform

🚀 Quick Start

pip install datalineagepy

from datalineagepy import LineageTracker, LineageDataFrame
import pandas as pd

# Wrap a plain DataFrame so that operations on it are recorded by the tracker
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
tracker = LineageTracker(name="demo")
ldf = LineageDataFrame(df, name="my_df", tracker=tracker)

# Each transformation returns a new tracked DataFrame
ldf2 = ldf.filter(ldf._df['a'] > 1)
ldf3 = ldf2.assign(c=ldf2._df['a'] + ldf2._df['b'])

tracker.visualize()  # Interactive HTML dashboard
tracker.export_lineage("lineage.json")  # Write the recorded lineage to a JSON file

💾 Installation

  • PyPI: pip install datalineagepy
  • With visualization: pip install datalineagepy[viz]
  • All features: pip install datalineagepy[all]
  • Conda: conda install -c conda-forge datalineagepy (coming soon)
  • Docker: docker pull datalineagepy/datalineagepy:latest

See Installation Guide for advanced and enterprise setup.


📚 Core Features

  • Automatic lineage tracking for pandas DataFrames
  • Data validation: completeness, uniqueness, range, custom rules
  • Profiling & analytics: quality scoring, missing data, correlations
  • Visualization: HTML, PNG, SVG, interactive dashboards
  • Performance monitoring: execution time, memory, alerts
  • Security: RBAC, AES-256 encryption, audit trail
  • Custom connectors: SDK for any data source
  • Versioning: save, diff, rollback lineage graphs
  • Collaboration: real-time editing/viewing
  • ML/AI pipeline tracking: AutoMLTracker for full auditability

🔧 Usage Guide

1. Lineage Tracking

from datalineagepy import LineageTracker, LineageDataFrame
import pandas as pd
tracker = LineageTracker(name="my_pipeline")
df = pd.DataFrame({'x': [1,2,3], 'y': [4,5,6]})
ldf = LineageDataFrame(df, name="input", tracker=tracker)
ldf2 = ldf.assign(z=ldf._df['x'] + ldf._df['y'])
print(tracker.export_graph())
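
To make "column-level lineage" concrete, here is a minimal, library-independent sketch of the kind of record a tracker might keep for the assign step above. The `record_step` helper and the node/edge layout are illustrative assumptions; they are not DataLineagePy's internal format, nor the output of `export_graph()`.

```python
import json

# Hypothetical in-memory lineage store: column nodes plus directed edges.
lineage = {"nodes": set(), "edges": []}

def record_step(operation, inputs, output):
    """Record that `output` was derived from each column in `inputs`."""
    lineage["nodes"].update(inputs)
    lineage["nodes"].add(output)
    for source in inputs:
        lineage["edges"].append(
            {"from": source, "to": output, "operation": operation}
        )

# z = x + y on the "input" DataFrame depends on both source columns.
record_step("assign", ["input.x", "input.y"], "input.z")

# Sets are not JSON-serializable, so sort the nodes into a list first.
exported = json.dumps(
    {"nodes": sorted(lineage["nodes"]), "edges": lineage["edges"]}, indent=2
)
print(exported)
```

Keeping one edge per source column is what allows questions such as "which inputs feed z?" to be answered later without re-running the pipeline.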

2. Data Validation

from datalineagepy.core.validation import DataValidator
validator = DataValidator(tracker)
rules = {'completeness': {'threshold': 0.9}, 'uniqueness': {'columns': ['x']}}
results = validator.validate_dataframe(ldf, rules)
print(results)
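
For readers wondering what completeness and uniqueness rules typically compute, the following is a rough standard-library sketch. It reuses the shape of the `rules` dictionary shown above, but `completeness`, `is_unique`, and `validate` are hypothetical helpers and may not match how `DataValidator` evaluates rules. With pandas, similar quantities could be obtained from `df[col].notna().mean()` and `df[col].is_unique`.

```python
def completeness(column):
    """Fraction of values in a column that are not None."""
    if not column:
        return 0.0
    return sum(value is not None for value in column) / len(column)

def is_unique(column):
    """True if no non-null value appears more than once."""
    present = [value for value in column if value is not None]
    return len(present) == len(set(present))

def validate(columns, rules):
    """Apply the (assumed) rule layout to a mapping of column name -> values."""
    results = {}
    threshold = rules.get("completeness", {}).get("threshold")
    if threshold is not None:
        results["completeness"] = {
            name: completeness(values) >= threshold
            for name, values in columns.items()
        }
    for name in rules.get("uniqueness", {}).get("columns", []):
        results.setdefault("uniqueness", {})[name] = is_unique(columns[name])
    return results

# Column y has one missing value, so it falls below the 0.9 threshold.
columns = {"x": [1, 2, 3], "y": [4, None, 6]}
rules = {"completeness": {"threshold": 0.9}, "uniqueness": {"columns": ["x"]}}
report = validate(columns, rules)
print(report)
```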

3. Profiling & Analytics

from datalineagepy.core.analytics import DataProfiler
profiler = DataProfiler(tracker)
profile = profiler.profile_dataset(ldf, include_correlations=True)
print(profile)
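
As an illustration of two of the profiling measures mentioned in this README (missing data and correlations), here is a small sketch using only the standard library. The function names are made up for this example, and the actual contents of the `profile` object returned by `DataProfiler` may differ.

```python
import math

def missing_fraction(column):
    """Share of entries in a column that are None."""
    return sum(value is None for value in column) / len(column)

def pearson(xs, ys):
    """Pearson correlation over rows where both values are present.

    Returns None when the correlation is undefined (fewer than two
    complete rows, or one of the columns is constant).
    """
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    if len(pairs) < 2:
        return None
    mean_x = sum(x for x, _ in pairs) / len(pairs)
    mean_y = sum(y for _, y in pairs) / len(pairs)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)

x = [1, 2, 3, 4]
y = [2, 4, None, 8]  # y is exactly 2 * x wherever it is present

y_missing = missing_fraction(y)
xy_corr = pearson(x, y)
print(y_missing, xy_corr)
```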

4. Visualization & Reporting

from datalineagepy.visualization.graph_visualizer import GraphVisualizer
visualizer = GraphVisualizer(tracker)
visualizer.generate_html("lineage.html")
visualizer.generate_png("lineage.png")
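
DOT is listed among the export formats, so it may help to see how a lineage graph maps onto Graphviz DOT text. The `to_dot` function below is a hypothetical helper written for this illustration, not part of `GraphVisualizer`, and it assumes node names contain no double quotes.

```python
def to_dot(edges, graph_name="lineage"):
    """Render (source, target) pairs as a left-to-right Graphviz digraph."""
    lines = [f"digraph {graph_name} {{", "  rankdir=LR;"]
    for source, target in edges:
        lines.append(f'  "{source}" -> "{target}";')
    lines.append("}")
    return "\n".join(lines)

edges = [("input.x", "input.z"), ("input.y", "input.z")]
dot_text = to_dot(edges)
print(dot_text)
```

If Graphviz is installed, saving this text to a file and running `dot -Tpng lineage.dot -o lineage.png` would render it as an image.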

5. Performance Monitoring

from datalineagepy.core.performance import PerformanceMonitor
monitor = PerformanceMonitor(tracker)
monitor.start_monitoring()
_ = ldf._df.sum()
monitor.stop_monitoring()
print(monitor.get_performance_summary())
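
The two quantities a monitor like this usually reports, execution time and memory, can be approximated with the standard library alone. The `measure` context manager below is an assumption-laden sketch for comparison purposes and says nothing about how `PerformanceMonitor` is implemented.

```python
import time
import tracemalloc
from contextlib import contextmanager

@contextmanager
def measure(label, results):
    """Store wall-clock seconds and peak traced memory for the wrapped block."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        results[label] = {"seconds": elapsed, "peak_bytes": peak}

results = {}
with measure("sum_of_squares", results):
    total = sum(i * i for i in range(100_000))

print(results)
```

Note that `tracemalloc` only sees allocations made through Python's allocator, and that this sketch is not safe to nest because the inner block would stop tracing for the outer one.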

6. Security & Compliance

from datalineagepy.security.rbac import RBACManager
rbac = RBACManager()
rbac.add_role('admin', ['read', 'write'])
rbac.add_user('alice', ['admin'])
print(rbac.check_access('alice', 'write'))
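
The role-to-permission model behind a check like the one above can be sketched in a few lines. `SimpleRBAC` is a hypothetical stand-in that mirrors the method names used in this README; it is not the library's `RBACManager`.

```python
class SimpleRBAC:
    """Users hold roles; roles hold permissions; access is denied by default."""

    def __init__(self):
        self.role_permissions = {}
        self.user_roles = {}

    def add_role(self, role, permissions):
        self.role_permissions[role] = set(permissions)

    def add_user(self, user, roles):
        self.user_roles[user] = set(roles)

    def check_access(self, user, permission):
        # A user is allowed if any of their roles grants the permission.
        return any(
            permission in self.role_permissions.get(role, set())
            for role in self.user_roles.get(user, set())
        )

demo_rbac = SimpleRBAC()
demo_rbac.add_role("admin", ["read", "write"])
demo_rbac.add_role("viewer", ["read"])
demo_rbac.add_user("alice", ["admin"])
demo_rbac.add_user("bob", ["viewer"])

print(demo_rbac.check_access("alice", "write"))
print(demo_rbac.check_access("bob", "write"))
```

Unknown users and unknown roles both resolve to an empty permission set, so the check fails closed rather than raising.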

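from datalineagepy.security.encryption.data_encryption import EncryptionManager
import os
# Demo value only. In a real deployment, set MASTER_ENCRYPTION_KEY outside the
# source code (for example via your secrets manager) and never commit it.
os.environ['MASTER_ENCRYPTION_KEY'] = 'supersecretkey1234567890123456'
enc_mgr = EncryptionManager()
secret = 'Sensitive Data'
# Round trip: encrypt the value, then decrypt it again
encrypted = enc_mgr.encrypt_sensitive_data(secret)
decrypted = enc_mgr.decrypt_sensitive_data(encrypted)
print(decrypted)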
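
AES-256 uses a 256-bit (32-byte) key, so a randomly generated key is generally preferable to a hand-typed string. The sketch below generates such a key and loads it from an environment variable. How `EncryptionManager` interprets `MASTER_ENCRYPTION_KEY` (raw text, hex, or a passphrase) is not stated in this README, so the hex encoding, the `DEMO_ENCRYPTION_KEY` variable name, and both helper functions are assumptions for illustration only.

```python
import os
import secrets

KEY_BYTES = 32  # 256 bits, the key size used by AES-256

def generate_key_hex():
    """Return a fresh random key as 64 hexadecimal characters."""
    return secrets.token_hex(KEY_BYTES)

def load_key_from_env(name):
    """Read a hex-encoded key from the environment and check its length."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set")
    key = bytes.fromhex(value)
    if len(key) != KEY_BYTES:
        raise ValueError(f"{name} must decode to {KEY_BYTES} bytes")
    return key

# Demo only: in practice the variable would be set outside the program.
os.environ["DEMO_ENCRYPTION_KEY"] = generate_key_hex()
key = load_key_from_env("DEMO_ENCRYPTION_KEY")
print(len(key))
```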

7. Database Connectors

from datalineagepy.connectors.database.mysql_connector import MySQLConnector
from datalineagepy.core import LineageTracker
# Placeholder credentials: replace with your own and avoid hard-coding passwords
db_config = {'host': 'localhost', 'user': 'root', 'password': 'password', 'database': 'test_db'}
tracker = LineageTracker()
# Passing the tracker lets the connector record lineage for the queries it runs
conn = MySQLConnector(**db_config, lineage_tracker=tracker)
conn.execute_query('SELECT * FROM test_table')
conn.close()
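
To show the general idea of a connector that records what it runs, here is a self-contained sketch built on Python's bundled `sqlite3` module with an in-memory database. `TrackedConnection` and its `query_log` attribute are hypothetical; they only imitate the `execute_query`/`close` calls used above and do not reflect how the library's connectors store lineage.

```python
import sqlite3

class TrackedConnection:
    """Thin wrapper that logs every statement it executes."""

    def __init__(self, path=":memory:"):
        self.connection = sqlite3.connect(path)
        self.query_log = []

    def execute_query(self, sql, params=()):
        cursor = self.connection.execute(sql, params)
        rows = cursor.fetchall()
        # Keep the statement text and result size as a minimal lineage record.
        self.query_log.append({"sql": sql, "rows_returned": len(rows)})
        return rows

    def close(self):
        self.connection.close()

demo_conn = TrackedConnection()
demo_conn.execute_query("CREATE TABLE test_table (id INTEGER, name TEXT)")
demo_conn.execute_query("INSERT INTO test_table VALUES (?, ?)", (1, "alice"))
rows = demo_conn.execute_query("SELECT * FROM test_table")
demo_conn.close()

print(rows)
print(demo_conn.query_log)
```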

8. ML/AI Pipeline Tracking

from datalineagepy import AutoMLTracker
tracker = AutoMLTracker(name='ml_pipeline')
tracker.log_step('fit', model='LogisticRegression', params={'solver': 'lbfgs'})
tracker.log_step('predict', model='LogisticRegression')
print(tracker.export_ai_ready_format())

📊 Visualization & Reporting

  • Interactive HTML dashboards: tracker.visualize()
  • Export formats: JSON, DOT, PNG, SVG, Excel, CSV
  • Custom visualizations: Use GraphVisualizer for advanced needs

🗄️ Database Connectors

  • MySQL, PostgreSQL, SQLite: Full lineage tracking for every query
  • Custom connectors: Build your own with the SDK
  • See Database Connectors Guide

⚡ Performance Monitoring

  • Track execution time, memory, and operation stats
  • Alerting: Slack, Email, custom hooks
  • Production monitoring: Integrate with Prometheus, Grafana, etc.

🔒 Security & Compliance

  • RBAC: Role-based access control for users and actions
  • AES-256 encryption: At-rest and in-transit data protection
  • Audit trail: Full operation history for compliance

🤖 ML/AI Pipeline Tracking

  • AutoMLTracker: Log, audit, and export every ML pipeline step
  • Explainability: Export pipeline steps for downstream analysis

☁️ Enterprise Deployment

  • Docker, Kubernetes, Helm, Terraform: Cloud-native ready
  • Production scripts: See deploy/ for examples

💡 Use Cases

  • Data science: Reproducibility, experiment tracking, Jupyter integration
  • Enterprise ETL: Production pipelines, data quality, compliance
  • Data governance: Impact analysis, documentation, audit trails
  • ML/AI: Pipeline explainability, model audit, feature tracking
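
Impact analysis, mentioned under data governance above, usually amounts to a graph traversal: given a column that is about to change, find everything derived from it. The sketch below does this with a breadth-first search over plain (source, target) pairs; the `downstream` function and the sample column names are invented for illustration and are not a DataLineagePy API.

```python
from collections import deque

def downstream(edges, start):
    """Return every node reachable from `start` by following edges forward."""
    children = {}
    for source, target in edges:
        children.setdefault(source, []).append(target)

    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

edges = [
    ("raw.a", "clean.a"),
    ("clean.a", "features.c"),
    ("raw.b", "features.c"),
    ("features.c", "model.score"),
]

affected = downstream(edges, "raw.a")
print(sorted(affected))
```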

🤝 Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.


📄 License

MIT License. See LICENSE for details.


DataLineagePy 3.0 — The new standard for Python data lineage
Beautiful. Powerful. Effortless.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

datalineagepy-3.0.2.tar.gz (584.6 kB view details)

Uploaded Source

Built Distribution

datalineagepy-3.0.2-py3-none-any.whl (389.3 kB view details)

Uploaded Python 3

File details

Details for the file datalineagepy-3.0.2.tar.gz.

File metadata

  • Download URL: datalineagepy-3.0.2.tar.gz
  • Upload date:
  • Size: 584.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.3

File hashes

Hashes for datalineagepy-3.0.2.tar.gz
Algorithm Hash digest
SHA256 94c05c1270994be303a93aac83bfb5d0ac1bda50382ff6db8f913c9750c6763d
MD5 ded5ca015caa5c3978e638dedbbade59
BLAKE2b-256 6527cbb8e21846e7d0d1aed84c4a1fb48fcc4c61d23ee244c310fb940e114651
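
A downloaded file can be checked against the SHA256 digest listed above with Python's `hashlib`. The helper names below are chosen for this sketch, and the demonstration hashes a small temporary file so that it runs without the real archive.

```python
import hashlib
import os
import tempfile

def sha256_of_file(path, chunk_size=65536):
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_sha256(path, expected_hex):
    """Compare the file's digest with an expected hex string."""
    return sha256_of_file(path) == expected_hex.strip().lower()

# Demonstration on a throwaway file with known contents.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(b"hello")
    demo_path = handle.name
try:
    demo_ok = matches_sha256(demo_path, hashlib.sha256(b"hello").hexdigest())
finally:
    os.remove(demo_path)
print(demo_ok)

# For the real archive, after downloading it to the current directory:
# matches_sha256(
#     "datalineagepy-3.0.2.tar.gz",
#     "94c05c1270994be303a93aac83bfb5d0ac1bda50382ff6db8f913c9750c6763d",
# )
```

pip can also enforce digests at install time through its hash-checking mode (`--require-hashes` with `--hash=sha256:...` entries in a requirements file).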

File details

Details for the file datalineagepy-3.0.2-py3-none-any.whl.

File metadata

  • Download URL: datalineagepy-3.0.2-py3-none-any.whl
  • Upload date:
  • Size: 389.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.3

File hashes

Hashes for datalineagepy-3.0.2-py3-none-any.whl
Algorithm Hash digest
SHA256 26bee3513169ed561139fc3c2f323bdcd1fd34bcfe824cde07a0496ab738aa0c
MD5 9edbe398dfef9028f616b838a1395c59
BLAKE2b-256 dc94f18c0ea9db6bf445d19d265489626062c39c0b08e9c72c3e7882860e15f8
