Skip to main content

Column-level SQL lineage, impact analysis, and breaking-change detection (MS SQL first)

Project description

InfoTracker

Column-level SQL lineage extraction and impact analysis for MS SQL Server

InfoTracker is a powerful command-line tool that parses T-SQL files and generates detailed column-level lineage in OpenLineage format. It supports advanced SQL Server features including table-valued functions, stored procedures, temp tables, and EXEC patterns.

Python License PyPI

🚀 Features

  • Column-level lineage - Track data flow at the column level with precise transformations
  • Advanced SQL support - T-SQL dialect with temp tables, variables, CTEs, and window functions
  • Impact analysis - Find upstream and downstream dependencies with flexible selectors
  • Wildcard matching - Support for table wildcards (schema.table.*) and column wildcards (..pattern)
  • Breaking change detection - Detect schema changes that could break downstream processes
  • Multiple output formats - Text tables or JSON for integration with other tools
  • OpenLineage compatible - Standard format for data lineage interoperability
  • dbt (compiled SQL) support - Run on compiled dbt models with --dbt
  • Rich HTML viz - Zoom/pan, column search, per‑attribute isolate (UP/DOWN/BOTH), sidebar resize and select/clear all
  • Advanced SQL objects - Table-valued functions (TVF) and dataset-returning procedures
  • Temp table tracking - Full lineage through EXEC into temp tables

📦 Installation

From PyPI (Recommended)

pip install InfoTracker

From GitHub

# Latest stable release
pip install git+https://github.com/InfoMatePL/InfoTracker.git

# Development version
git clone https://github.com/InfoMatePL/InfoTracker.git
cd InfoTracker
pip install -e .

Verify Installation

infotracker --help

⚡ Quick Start

1. Extract Lineage

# Extract lineage from SQL files
infotracker extract --sql-dir examples/warehouse/sql --out-dir build/lineage

# Extract lineage from compiled dbt models
infotracker extract --dbt --sql-dir examples/dbt_warehouse/models --out-dir build/dbt_lineage

Flags:

  • --sql-dir DIR Directory with .sql files (required)
  • --out-dir DIR Output folder for lineage artifacts (default from config or build/lineage)
  • --adapter NAME SQL dialect adapter (default from config)
  • --catalog FILE Optional YAML catalog with schemas
  • --fail-on-warn Exit non-zero if warnings occurred
  • --include PATTERN Glob include filter
  • --exclude PATTERN Glob exclude filter
  • --encoding NAME File encoding for SQL files (default: auto)
  • --dbt Enable dbt mode (compiled SQL)

2. Run Impact Analysis

# Find what feeds into a column (upstream)
infotracker impact -s "+STG.dbo.Orders.OrderID" --graph-dir build/lineage

# Find what uses a column (downstream)  
infotracker impact -s "STG.dbo.Orders.OrderID+" --graph-dir build/lineage

# Both directions
infotracker impact -s "+dbo.fct_sales.Revenue+" --graph-dir build/lineage

Flags:

  • -s, --selector TEXT Column selector; use + for direction markers (required)
  • --graph-dir DIR Folder with column_graph.json (required; produced by extract)
  • --max-depth N Traversal depth; 0 = unlimited (full lineage). Default: 0
  • --out PATH Write output to file instead of stdout
  • --format text|json Output format (set globally or per-invocation)

3. Detect Breaking Changes

# Compare two versions of your schema
infotracker diff --base build/lineage --head build/lineage_new

Flags:

  • --base DIR Folder with base artifacts (required)
  • --head DIR Folder with head artifacts (required)
  • --format text|json Output format
  • --threshold LEVEL Severity threshold: NON_BREAKING|POTENTIALLY_BREAKING|BREAKING

4. Visualize the Graph

# Generate an interactive HTML graph (lineage_viz.html) for a built graph
infotracker viz --graph-dir build/lineage

Flags:

  • --graph-dir DIR Folder with column_graph.json (required)
  • --out PATH Output HTML path (default: <graph_dir>/lineage_viz.html) Open the generated lineage_viz.html in your browser. You can click a column to highlight upstream/downstream lineage; press Enter in the search box to highlight all matches. By default, the canvas is empty. Use the left sidebar to toggle objects on (checkboxes are initially unchecked).

📖 Selector Syntax

InfoTracker supports flexible column selectors for precise impact analysis:

Selector Format Description Example
table.column Simple format (adds default dbo schema) Orders.OrderID
schema.table.column Schema-qualified format dbo.Orders.OrderID
database.schema.table.column Database-qualified format STG.dbo.Orders.OrderID
schema.table.* Table wildcard (all columns) dbo.fct_sales.*
..pattern Column wildcard (name contains pattern) ..revenue
..pattern* Column wildcard with fnmatch ..customer*

Direction Control

  • selector - downstream dependencies (default)
  • +selector - upstream sources
  • selector+ - downstream dependencies (explicit)
  • +selector+ - both upstream and downstream

💡 Examples

Basic Usage

# Extract lineage first (always run this before impact analysis)
infotracker extract --sql-dir examples/warehouse/sql --out-dir build/lineage

# Basic column lineage
infotracker impact -s "+dbo.fct_sales.Revenue" --graph-dir build/lineage        # What feeds this column?
infotracker impact -s "STG.dbo.Orders.OrderID+" --graph-dir build/lineage      # What uses this column?

Wildcard Selectors

# All columns from a specific table
infotracker impact -s "dbo.fct_sales.*" --graph-dir build/lineage
infotracker impact -s "STG.dbo.Orders.*" --graph-dir build/lineage

# Find all columns containing "revenue" (case-insensitive)
infotracker impact -s "..revenue" --graph-dir build/lineage

# Find all columns starting with "customer" 
infotracker impact -s "..customer*" --graph-dir build/lineage

Advanced SQL Objects

# Table-valued function columns (upstream)
infotracker impact -s "+dbo.fn_customer_orders_tvf.*" --graph-dir build/lineage

# Procedure dataset columns (upstream)  
infotracker impact -s "+dbo.usp_customer_metrics_dataset.*" --graph-dir build/lineage

# Temp table lineage from EXEC
infotracker impact -s "+#temp_table.*" --graph-dir build/lineage

Output Formats

# Text output (default, human-readable)
infotracker impact -s "+..revenue" --graph-dir build/lineage

# JSON output (machine-readable)
infotracker --format json impact -s "..customer*" --graph-dir build/lineage > customer_lineage.json

# Control traversal depth
infotracker impact -s "+dbo.Orders.OrderID" --max-depth 2 --graph-dir build/lineage
# Note: --max-depth defaults to 0 (unlimited / full lineage)

Breaking Change Detection

# Extract baseline
infotracker extract --sql-dir sql_v1 --out-dir build/baseline

# Extract new version  
infotracker extract --sql-dir sql_v2 --out-dir build/current

# Detect breaking changes
infotracker diff --base build/baseline --head build/current

# Filter by severity
infotracker diff --base build/baseline --head build/current --threshold BREAKING

Output Format

Impact analysis returns these columns (topologically sorted by level):

  • from - Source column (fully qualified)
  • to - Target column (fully qualified)
  • direction - upstream or downstream
  • transformation - Type of transformation (IDENTITY, ARITHMETIC, AGGREGATION, CASE_AGGREGATION, DATE_FUNCTION, WINDOW, etc.). For UX clarity, CAST and CASE are shown as expression.
  • description - Human-readable transformation description
  • level - Topological distance from the selected column (1 = direct neighbor, then 2, 3, …)

Results are automatically deduplicated and sorted topologically by level (then direction/from/to). Use --format json for machine-readable output.

New Transformation Types

The enhanced transformation taxonomy includes:

  • ARITHMETIC_AGGREGATION - Arithmetic operations combined with aggregation functions
  • COMPLEX_AGGREGATION - Multi-step calculations involving multiple aggregations
  • DATE_FUNCTION - Date/time calculations like DATEDIFF, DATEADD
  • DATE_FUNCTION_AGGREGATION - Date functions applied to aggregated results
  • CASE_AGGREGATION - CASE statements applied to aggregated results

Advanced Object Support

InfoTracker now supports advanced SQL Server objects:

Table-Valued Functions (TVF):

  • Inline TVF (RETURN AS SELECT) - Parsed directly from SELECT statement
  • Multi-statement TVF (RETURN @table TABLE) - Extracts schema from table variable definition
  • Function parameters are tracked as filter metadata (don't create columns)

Dataset-Returning Procedures:

  • Procedures ending with SELECT statement are treated as dataset sources
  • Output schema extracted from the final SELECT statement
  • Parameters tracked as filter metadata affecting lineage scope

EXEC into Temp Tables:

  • INSERT INTO #temp EXEC procedure patterns create edges from procedure columns to temp table columns
  • Temp table lineage propagates downstream to final targets
  • Supports complex workflow patterns combining functions, procedures, and temp tables

Configuration

InfoTracker follows this configuration precedence:

  1. CLI flags (highest priority) - override everything
  2. infotracker.yml config file - project defaults
  3. Built-in defaults (lowest priority) - fallback values

🔧 Configuration

Create an infotracker.yml file in your project root:

sql_dirs:
  - "sql/"
  - "models/"
out_dir: "build/lineage"
exclude_dirs: 
  - "__pycache__"
  - ".git"
severity_threshold: "POTENTIALLY_BREAKING"

Configuration Options

Setting Description Default Examples
sql_dirs Directories to scan for SQL files ["."] ["sql/", "models/"]
out_dir Output directory for lineage files "lineage" "build/artifacts"
exclude_dirs Directories to skip [] ["__pycache__", "node_modules"]
severity_threshold Breaking change detection level "NON_BREAKING" "BREAKING"

📚 Documentation

🖼 Visualization (viz)

Generate an interactive HTML to explore column-level lineage:

# After extract (column_graph.json present in the folder)
infotracker viz --graph-dir build/lineage

# Options
#   --out <path>      Output HTML path (default: <graph_dir>/lineage_viz.html)
#   --graph-dir       Folder z column_graph.json [required]

Tips:

  • Search supports table names, full IDs (namespace.schema.table), column names, and URIs. Press Enter to highlight all matches.
  • Click a column to switch into lineage mode (upstream/downstream highlight). Clicking another column clears the previous selection.
  • Right‑click a column row to open a context menu: Show upstream, Show downstream, Show both, Clear filter. In isolate mode only the path columns and edges remain visible (background clicks won’t clear; use Clear filter).
  • Left sidebar: live filter (matches tables and column names), Select All / Clear buttons, and a draggable resizer between sidebar and canvas. Sidebar toggle remembers last width.
  • Depth input in the toolbar limits neighbor layers rendered around selected tables.
  • Collapse button toggles between full column rows and compact “object‑only” view (single arrows object→object).
  • Column order in cards follows DDL/Schema order (from OpenLineage artifacts) instead of alphabetical.

🧪 Testing

# Run all tests
pytest

# Run specific test categories
pytest tests/test_parser.py     # Parser functionality
pytest tests/test_wildcard.py   # Wildcard selectors
pytest tests/test_adapter.py    # SQL dialect adapters

# Run with coverage
pytest --cov=infotracker --cov-report=html

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments


InfoTracker - Making database schema evolution safer, one column at a time. 🎯

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

infotracker-0.7.1.tar.gz (303.0 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

infotracker-0.7.1-py3-none-any.whl (201.1 kB view details)

Uploaded Python 3

File details

Details for the file infotracker-0.7.1.tar.gz.

File metadata

  • Download URL: infotracker-0.7.1.tar.gz
  • Upload date:
  • Size: 303.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.12

File hashes

Hashes for infotracker-0.7.1.tar.gz
Algorithm Hash digest
SHA256 79f5a7a57e11b6a3dc75db43432225f617c8b7c59ce6d6b9803a3b1b155f967b
MD5 81e319a6ef9747a7cf50e5a244ab53b9
BLAKE2b-256 f07c5a561ff015e4850ab2c058da2cc16234ae8ceee7fd8df7a84e8801d49fcb

See more details on using hashes here.

File details

Details for the file infotracker-0.7.1-py3-none-any.whl.

File metadata

  • Download URL: infotracker-0.7.1-py3-none-any.whl
  • Upload date:
  • Size: 201.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.12

File hashes

Hashes for infotracker-0.7.1-py3-none-any.whl
Algorithm Hash digest
SHA256 61a0dc792fe1f55292820bcba0f44acca67ff6bc5a86cd2c9cdd3269c37b0027
MD5 d901a531ebf12f38fc29be19d925d6d8
BLAKE2b-256 b202e7194efdc4faf0dd752a057e1f330febbf2c3b80ab80472c1e859f05c6a2

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page