Modern data profiling and drift detection framework

🧩 Baselinr

License: BSL 1.1 | Python 3.10+

Baselinr is a modern, source-available data profiling and drift detection framework for SQL-based data warehouses, licensed under the Business Source License 1.1. It automatically profiles datasets, stores metadata and statistics, and detects drift over time.

🚀 Features

  • Automated Profiling: Profile tables with column-level metrics (count, null %, distinct values, mean, stddev, histograms, etc.)
  • Drift Detection: Compare profiling runs to detect schema and statistical drift with configurable strategies
  • Type-Specific Thresholds: Adjust drift sensitivity based on column data type (numeric, categorical, timestamp, boolean) to reduce false positives
  • Intelligent Baseline Selection: Automatically selects optimal baseline method (last run, moving average, prior period, stable window) based on column characteristics
  • Advanced Statistical Tests: Kolmogorov-Smirnov (KS) test, Population Stability Index (PSI), Chi-square, Entropy, and more for rigorous drift detection
  • Event & Alert Hooks: Pluggable event system for real-time alerts and notifications on drift, schema changes, and profiling lifecycle events
  • Partition-Aware Profiling: Intelligent partition handling with strategies for latest, recent_n, or sample partitions
  • Adaptive Sampling: Multiple sampling methods (random, stratified, top-k) for efficient profiling of large datasets
  • Multi-Database Support: Works with PostgreSQL, Snowflake, SQLite, MySQL, BigQuery, and Redshift
  • Schema Versioning & Migrations: Built-in schema version management with migration system for safe database schema evolution
  • Metadata Querying: Powerful CLI and API for querying profiling runs, drift events, and table history
  • Dagster Integration: Built-in orchestration support with Dagster assets and schedules
  • Configuration-Driven: Simple YAML/JSON configuration for defining profiling targets
  • Historical Tracking: Store profiling results over time for trend analysis
  • CLI Interface: Comprehensive command-line interface for profiling, drift detection, querying, and schema management

📋 Requirements

  • Python 3.10+
  • One of the supported databases: PostgreSQL, Snowflake, SQLite, MySQL, BigQuery, or Redshift

🔧 Installation

Basic Installation

pip install -e .

With Snowflake Support

pip install -e ".[snowflake]"

With Dagster Integration

pip install -e ".[dagster]"

Full Installation (All Features)

pip install -e ".[all]"

📚 Documentation

All documentation lives in the docs/ directory. See docs/README.md for the complete documentation index.

🏃 Quick Start

1. Create a Configuration File

Create a config.yml file:

environment: development

source:
  type: postgres
  host: localhost
  port: 5432
  database: mydb
  username: user
  password: password
  schema: public

storage:
  connection:
    type: postgres
    host: localhost
    port: 5432
    database: mydb
    username: user
    password: password
  results_table: baselinr_results
  runs_table: baselinr_runs
  create_tables: true

profiling:
  tables:
    - table: customers
      sample_ratio: 1.0
    - table: orders
      sample_ratio: 1.0
  
  default_sample_ratio: 1.0
  compute_histograms: true
  histogram_bins: 10

2. Preview What Will Be Profiled

baselinr plan --config config.yml

This lists the tables that would be profiled without running the profiler.

3. Run Profiling

baselinr profile --config config.yml

4. Detect Drift

Drift detection compares profiling runs, so run profiling at least twice before checking for drift:

baselinr drift --config config.yml --dataset customers

5. Query Profiling Metadata

Query your profiling history and drift events:

# List recent profiling runs
baselinr query runs --config config.yml --limit 10

# Query drift events
baselinr query drift --config config.yml --table customers --days 7

# Get detailed run information
baselinr query run --config config.yml --run-id <run-id>

# View table profiling history
baselinr query table --config config.yml --table customers --days 30

6. Manage Schema Migrations

Check and apply schema migrations:

# Check schema version status
baselinr migrate status --config config.yml

# Apply migrations up to the target schema version (here, version 1)
baselinr migrate apply --config config.yml --target 1

# Validate schema integrity
baselinr migrate validate --config config.yml

🐳 Docker Development Environment

Baselinr includes a complete Docker environment for local development and testing.

Start the Environment

cd docker
docker-compose up -d

This starts the services defined in docker/docker-compose.yml.

Stop the Environment

cd docker
docker-compose down

📊 Profiling Metrics

Baselinr computes the following metrics:

All Column Types

  • count: Total number of rows
  • null_count: Number of null values
  • null_ratio: Ratio of null values (0.0 to 1.0)
  • distinct_count: Number of distinct values
  • unique_ratio: Ratio of distinct values to total (0.0 to 1.0)
  • approx_distinct_count: Approximate distinct count (database-specific)
  • data_type_inferred: Inferred data type from values (email, url, date, etc.)
  • column_stability_score: Column presence stability (0.0 to 1.0)
  • column_age_days: Days since column first appeared
  • type_consistency_score: Type consistency across runs (0.0 to 1.0)

Numeric Columns

  • min: Minimum value
  • max: Maximum value
  • mean: Average value
  • stddev: Standard deviation
  • histogram: Distribution histogram (optional)

String Columns

  • min: Lexicographic minimum
  • max: Lexicographic maximum
  • min_length: Minimum string length
  • max_length: Maximum string length
  • avg_length: Average string length

Table-Level Metrics

  • row_count_change: Change in row count from previous run
  • row_count_change_percent: Percentage change in row count
  • row_count_stability_score: Row count stability (0.0 to 1.0)
  • row_count_trend: Trend direction (increasing/stable/decreasing)
  • schema_freshness: Timestamp of last schema modification
  • schema_version: Incrementing schema version number
  • column_count_change: Net change in column count

See docs/guides/PROFILING_ENRICHMENT.md for detailed documentation on enrichment features.
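
To make the ratio metrics concrete, here is a minimal sketch that computes a few of the column-level metrics and the row-count change for in-memory values. It is an illustration only, not Baselinr's implementation (which runs against the warehouse). The function names are hypothetical, and excluding nulls from the distinct count is an assumption.

```python
from typing import Any, Dict, Optional, Sequence


def column_metrics(values: Sequence[Optional[Any]]) -> Dict[str, float]:
    """Compute count, null_count, null_ratio, distinct_count, and unique_ratio."""
    count = len(values)
    non_null = [v for v in values if v is not None]
    null_count = count - len(non_null)
    # Assumption: nulls are not counted as a distinct value.
    distinct_count = len(set(non_null))
    return {
        "count": count,
        "null_count": null_count,
        "null_ratio": null_count / count if count else 0.0,
        "distinct_count": distinct_count,
        "unique_ratio": distinct_count / count if count else 0.0,
    }


def row_count_change(previous: int, current: int) -> Dict[str, Optional[float]]:
    """Compute row_count_change and row_count_change_percent between two runs."""
    change = current - previous
    # The percentage is undefined when the previous run had no rows.
    percent = (change * 100.0) / previous if previous else None
    return {"row_count_change": change, "row_count_change_percent": percent}
```

For example, a column holding "a", "b", "a", and one null has a null_ratio of 0.25 and a unique_ratio of 0.5 under these assumptions.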

🔄 Dagster Integration

Baselinr can create Dagster assets dynamically from your configuration:

from baselinr.integrations.dagster import build_baselinr_definitions

defs = build_baselinr_definitions(
    config_path="config.yml",
    asset_prefix="baselinr",
    job_name="baselinr_profile_all",
    enable_sensor=True,  # optional
)

🎯 Use Cases

  • Data Quality Monitoring: Track data quality metrics over time
  • Schema Change Detection: Automatically detect schema changes
  • Statistical Drift Detection: Identify statistical anomalies in your data
  • Data Documentation: Generate up-to-date metadata about your datasets
  • CI/CD Integration: Fail builds when critical drift is detected

📁 Project Structure

baselinr/
├── baselinr/             # Main package
│   ├── config/           # Configuration management
│   ├── connectors/       # Database connectors
│   ├── profiling/        # Profiling engine
│   ├── storage/          # Results storage
│   ├── drift/            # Drift detection
│   ├── integrations/
│   │   └── dagster/      # Dagster assets & sensors
│   └── cli.py            # CLI interface
├── examples/             # Example configurations
│   ├── config.yml        # PostgreSQL example
│   ├── config_sqlite.yml # SQLite example
│   ├── config_mysql.yml  # MySQL example
│   ├── config_bigquery.yml # BigQuery example
│   ├── config_redshift.yml # Redshift example
│   ├── config_with_metrics.yml # Metrics example
│   ├── config_slack_alerts.yml # Slack alerts example
│   ├── dagster_repository.py
│   └── quickstart.py
├── docker/               # Docker environment
│   ├── docker-compose.yml
│   ├── Dockerfile
│   ├── init_postgres.sql
│   ├── dagster.yaml
│   └── workspace.yaml
├── setup.py
├── requirements.txt
└── README.md

🧪 Running Examples

Quick Start Example

python examples/quickstart.py

CLI Examples

# View profiling plan (dry-run)
baselinr plan --config examples/config.yml

# View plan in JSON format
baselinr plan --config examples/config.yml --output json

# View plan with verbose details
baselinr plan --config examples/config.yml --verbose

# Profile all tables in config
baselinr profile --config examples/config.yml

# Profile with output to JSON
baselinr profile --config examples/config.yml --output results.json

# Dry run (don't write to storage)
baselinr profile --config examples/config.yml --dry-run

# Detect drift
baselinr drift --config examples/config.yml --dataset customers

# Detect drift with specific runs
baselinr drift --config examples/config.yml \
  --dataset customers \
  --baseline <run-id-1> \
  --current <run-id-2>

# Fail on critical drift (useful for CI/CD)
baselinr drift --config examples/config.yml \
  --dataset customers \
  --fail-on-drift

# Use statistical tests for advanced drift detection
# (configure in config.yml: strategy: statistical)

# Query profiling runs
baselinr query runs --config examples/config.yml --limit 10

# Query drift events for a table
baselinr query drift --config examples/config.yml \
  --table customers \
  --severity high \
  --days 7

# Get detailed run information
baselinr query run --config examples/config.yml \
  --run-id <run-id> \
  --format json

# View table profiling history
baselinr query table --config examples/config.yml \
  --table customers \
  --days 30 \
  --format csv \
  --output history.csv

# Check schema migration status
baselinr migrate status --config examples/config.yml

# Apply schema migrations
baselinr migrate apply --config examples/config.yml --target 1

# Validate schema integrity
baselinr migrate validate --config examples/config.yml
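
For CI/CD pipelines driven from Python, the drift command above can be wrapped in a small script. This is a sketch under one assumption that the commands above do not confirm: that `baselinr drift --fail-on-drift` exits with a non-zero status when critical drift is found. The helper names are hypothetical.

```python
import subprocess
from typing import List


def build_drift_command(config_path: str, dataset: str, fail_on_drift: bool = True) -> List[str]:
    """Assemble the `baselinr drift` command line from the flags shown above."""
    cmd = ["baselinr", "drift", "--config", config_path, "--dataset", dataset]
    if fail_on_drift:
        cmd.append("--fail-on-drift")
    return cmd


def run_drift_check(config_path: str, dataset: str) -> int:
    """Run the drift check and return the CLI's exit code.

    Assumption: a non-zero exit code signals drift (or an error).
    """
    result = subprocess.run(build_drift_command(config_path, dataset), check=False)
    return result.returncode
```

A CI step could then call `sys.exit(run_drift_check("examples/config.yml", "customers"))` so that the build fails whenever the CLI does.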

🔍 Drift Detection

Baselinr provides multiple drift detection strategies and intelligent baseline selection:

Available Strategies

  1. Absolute Threshold (default): Simple percentage-based thresholds

    • Low: 5% change
    • Medium: 15% change
    • High: 30% change
  2. Standard Deviation: Statistical significance based on standard deviations

  3. Statistical Tests (advanced): Multiple statistical tests for rigorous detection

    • Numeric columns: KS test, PSI, Z-score
    • Categorical columns: Chi-square, Entropy, Top-K stability
    • Automatically selects appropriate tests based on column type
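
The absolute threshold strategy can be sketched in a few lines. This is an illustration using the default 5% / 15% / 30% thresholds listed above, not Baselinr's actual code. The function names are hypothetical, and treating each threshold as inclusive (`>=`) is an assumption.

```python
from typing import Optional


def percent_change(baseline: float, current: float) -> float:
    """Absolute percentage change of current relative to baseline."""
    if baseline == 0:
        # No meaningful percentage exists; treat any change as unbounded.
        return 0.0 if current == 0 else float("inf")
    return (abs(current - baseline) * 100.0) / abs(baseline)


def classify_drift(
    baseline: float,
    current: float,
    low: float = 5.0,
    medium: float = 15.0,
    high: float = 30.0,
) -> Optional[str]:
    """Return 'low', 'medium', or 'high', or None when below the low threshold."""
    change = percent_change(baseline, current)
    if change >= high:
        return "high"
    if change >= medium:
        return "medium"
    if change >= low:
        return "low"
    return None
```

With these defaults, a mean moving from 100 to 120 is a 20% change and would be classified as medium drift.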

Intelligent Baseline Selection

Baselinr automatically selects the optimal baseline for drift detection based on column characteristics:

  • Auto Selection: Automatically chooses the best baseline method per column
    • High variance columns → Moving average (smooths noise)
    • Seasonal columns → Prior period (accounts for weekly/monthly patterns)
    • Stable columns → Last run (simplest baseline)
  • Moving Average: Average of last N runs (configurable, default: 7)
  • Prior Period: Same period last week/month (handles seasonality)
  • Stable Window: Historical window with low drift (most reliable)
  • Last Run: Simple comparison to previous run (default)

Thresholds and baseline selection are fully configurable via the drift_detection configuration. See docs/guides/DRIFT_DETECTION.md for general drift detection and docs/guides/STATISTICAL_DRIFT_DETECTION.md for statistical tests.
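
As a rough illustration of the idea behind auto selection, the sketch below computes a moving-average baseline and picks between two of the methods using the coefficient of variation of a metric's history. The heuristic, the `cv_threshold` parameter, and the function names are assumptions made for this example. Baselinr's real selection logic (including seasonality detection and the stable window) is not shown here.

```python
from statistics import mean, pstdev
from typing import Sequence


def moving_average_baseline(history: Sequence[float], window: int = 7) -> float:
    """Average of the last `window` values of a metric (oldest first)."""
    if not history:
        raise ValueError("history must contain at least one run")
    return mean(history[-window:])


def choose_baseline_method(
    history: Sequence[float],
    min_runs: int = 3,
    cv_threshold: float = 0.2,  # hypothetical cut-off for "high variance"
) -> str:
    """Pick 'moving_average' for noisy histories, otherwise 'last_run'."""
    if len(history) < min_runs:
        # Too little history to judge variance; fall back to the simplest baseline.
        return "last_run"
    avg = mean(history)
    if avg == 0:
        return "last_run"
    coefficient_of_variation = pstdev(history) / abs(avg)
    return "moving_average" if coefficient_of_variation > cv_threshold else "last_run"
```

The `window` and `min_runs` defaults mirror the `moving_average: 7` and `min_runs: 3` values from the configuration reference.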

🔔 Event & Alert Hooks

Baselinr includes a pluggable event system that emits events for drift detection, schema changes, and profiling lifecycle events. You can register hooks to process these events for logging, persistence, or alerting.

Built-in Hooks

  • LoggingAlertHook: Log events to stdout
  • SQLEventHook: Persist events to any SQL database
  • SnowflakeEventHook: Persist events to Snowflake with VARIANT support

Example Configuration

hooks:
  enabled: true
  hooks:
    # Log all events
    - type: logging
      log_level: INFO
    
    # Persist to database
    - type: sql
      table_name: baselinr_events
      connection:
        type: postgres
        host: localhost
        database: monitoring
        username: user
        password: pass

Event Types

  • DataDriftDetected: Emitted when drift is detected
  • SchemaChangeDetected: Emitted when schema changes
  • ProfilingStarted: Emitted when profiling begins
  • ProfilingCompleted: Emitted when profiling completes
  • ProfilingFailed: Emitted when profiling fails

Custom Hooks

Create custom hooks by implementing the AlertHook protocol:

from baselinr.events import BaseEvent

class MyCustomHook:
    def handle_event(self, event: BaseEvent) -> None:
        # Process the event
        print(f"Event: {event.event_type}")

Configure custom hooks:

hooks:
  enabled: true
  hooks:
    - type: custom
      module: my_hooks
      class_name: MyCustomHook
      params:
        webhook_url: https://api.example.com/alerts
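
The configuration above passes a `webhook_url` under `params`. One plausible shape for a hook that uses it is sketched below, assuming that `params` are handed to the class constructor as keyword arguments (check the hooks documentation to confirm). `FakeEvent` is a stand-in so the example runs without Baselinr installed, and the hook only builds the payload instead of sending it.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class FakeEvent:
    """Stand-in for baselinr.events.BaseEvent; only event_type is modelled."""
    event_type: str


class WebhookPayloadHook:
    """Hypothetical hook that prepares a JSON payload for a webhook."""

    def __init__(self, webhook_url: str) -> None:
        # Assumption: `params` from the YAML arrive here as keyword arguments.
        self.webhook_url = webhook_url
        self.pending: List[str] = []

    def handle_event(self, event) -> None:
        # Queue a serialized payload; a real hook would POST it to webhook_url.
        payload = {"url": self.webhook_url, "event_type": event.event_type}
        self.pending.append(json.dumps(payload, sort_keys=True))


hook = WebhookPayloadHook(webhook_url="https://api.example.com/alerts")
hook.handle_event(FakeEvent(event_type="DataDriftDetected"))
```

Keeping the network call out of `handle_event` in this sketch makes the hook easy to unit test; a production hook would add delivery and error handling.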

See docs/architecture/EVENTS_AND_HOOKS.md for comprehensive documentation and examples.

🔄 Schema Versioning & Migrations

Baselinr includes a built-in schema versioning system to manage database schema evolution safely.

Migration Commands

# Check current schema version status
baselinr migrate status --config config.yml

# Apply migrations to a specific version
baselinr migrate apply --config config.yml --target 1

# Preview migrations (dry run)
baselinr migrate apply --config config.yml --target 1 --dry-run

# Validate schema integrity
baselinr migrate validate --config config.yml

How It Works

  • Schema versions are tracked in the baselinr_schema_version table
  • Migrations are applied incrementally and can be rolled back
  • The system automatically detects when your database schema is out of date
  • Migrations are idempotent and safe to run multiple times
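
The general pattern behind version-tracked, idempotent migrations can be sketched with the standard-library `sqlite3` module. This is not Baselinr's migration code: the migrations, the `example_runs` table, and the single-column layout of `baselinr_schema_version` are hypothetical and chosen only for illustration.

```python
import sqlite3
from typing import Dict

# Hypothetical migrations keyed by the version they produce.
MIGRATIONS: Dict[int, str] = {
    1: "CREATE TABLE IF NOT EXISTS example_runs (run_id TEXT PRIMARY KEY)",
    2: "ALTER TABLE example_runs ADD COLUMN status TEXT",
}


def current_version(conn: sqlite3.Connection) -> int:
    """Return the highest applied version, or 0 for a fresh database."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS baselinr_schema_version (version INTEGER NOT NULL)"
    )
    row = conn.execute("SELECT MAX(version) FROM baselinr_schema_version").fetchone()
    return row[0] or 0


def apply_migrations(conn: sqlite3.Connection, target: int) -> int:
    """Apply each pending migration in order, up to and including `target`."""
    for next_version in range(current_version(conn) + 1, target + 1):
        with conn:  # commit after each migration step
            conn.execute(MIGRATIONS[next_version])
            conn.execute(
                "INSERT INTO baselinr_schema_version (version) VALUES (?)",
                (next_version,),
            )
    return current_version(conn)
```

Because only versions above the recorded one are applied, calling `apply_migrations(conn, 1)` twice leaves the database unchanged the second time.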

🔍 Metadata Querying

Baselinr provides powerful querying capabilities to explore your profiling history and drift events.

Query Commands

# Query profiling runs with filters
baselinr query runs --config config.yml \
  --table customers \
  --status completed \
  --days 30 \
  --limit 20 \
  --format table

# Query drift events
baselinr query drift --config config.yml \
  --table customers \
  --severity high \
  --days 7 \
  --format json

# Get detailed information about a specific run
baselinr query run --config config.yml \
  --run-id abc123-def456 \
  --format json

# View table profiling history over time
baselinr query table --config config.yml \
  --table customers \
  --schema public \
  --days 90 \
  --format csv \
  --output history.csv

Output Formats

All query commands support multiple output formats:

  • table: Human-readable table format (default)
  • json: JSON format for programmatic use
  • csv: CSV format for spreadsheet analysis

🛠️ Configuration Options

Source Configuration

source:
  type: postgres | snowflake | sqlite | mysql | bigquery | redshift
  host: hostname
  port: 5432
  database: database_name
  username: user
  password: password
  schema: schema_name  # Optional
  
  # Snowflake-specific
  account: snowflake_account
  warehouse: warehouse_name
  role: role_name
  
  # SQLite-specific
  filepath: /path/to/database.db
  
  # BigQuery-specific (credentials via extra_params)
  extra_params:
    credentials_path: /path/to/service-account-key.json
    # Or use GOOGLE_APPLICATION_CREDENTIALS environment variable
  
  # MySQL-specific
  # Uses standard host/port/database/username/password
  
  # Redshift-specific
  # Uses standard host/port/database/username/password
  # Default port: 5439

Profiling Configuration

profiling:
  tables:
    - table: table_name
      schema: schema_name  # Optional
      sample_ratio: 1.0    # 0.0 to 1.0
  
  default_sample_ratio: 1.0
  max_distinct_values: 1000
  compute_histograms: true  # Enable for statistical tests
  histogram_bins: 10
  
  metrics:
    - count
    - null_count
    - null_ratio
    - distinct_count
    - unique_ratio
    - approx_distinct_count
    - min
    - max
    - mean
    - stddev
    - histogram
    - data_type_inferred

Drift Detection Configuration

drift_detection:
  # Strategy: absolute_threshold | standard_deviation | statistical
  strategy: absolute_threshold
  
  # Absolute threshold (default)
  absolute_threshold:
    low_threshold: 5.0
    medium_threshold: 15.0
    high_threshold: 30.0
  
  # Baseline auto-selection configuration
  baselines:
    strategy: auto  # auto | last_run | moving_average | prior_period | stable_window
    windows:
      moving_average: 7    # Number of runs for moving average
      prior_period: 7      # Days for prior period (1=day, 7=week, 30=month)
      min_runs: 3          # Minimum runs required for auto-selection
  
  # Statistical tests (advanced)
  # statistical:
  #   tests:
  #     - ks_test
  #     - psi
  #     - z_score
  #     - chi_square
  #     - entropy
  #     - top_k
  #   sensitivity: medium
  #   test_params:
  #     ks_test:
  #       alpha: 0.05
  #     psi:
  #       buckets: 10
  #       threshold: 0.2

🔐 Environment Variables

Baselinr supports environment variable overrides:

# Override source connection
export BASELINR_SOURCE__HOST=prod-db.example.com
export BASELINR_SOURCE__PASSWORD=secret

# Override environment
export BASELINR_ENVIRONMENT=production

# Run profiling
baselinr profile --config config.yml
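
The variable names above suggest a convention in which the `BASELINR_` prefix is stripped and a double underscore separates nesting levels. The sketch below shows how such overrides could be merged into a loaded configuration. It is an assumption about the mapping, not Baselinr's loader; the function name is hypothetical and values are kept as strings without type coercion.

```python
import copy
from typing import Any, Dict, Mapping


def apply_env_overrides(
    config: Mapping[str, Any],
    environ: Mapping[str, str],
    prefix: str = "BASELINR_",
) -> Dict[str, Any]:
    """Return a copy of config with PREFIX_SECTION__KEY variables applied."""
    result = copy.deepcopy(dict(config))
    for name, value in environ.items():
        if not name.startswith(prefix):
            continue
        # BASELINR_SOURCE__HOST -> ["source", "host"]
        path = name[len(prefix):].lower().split("__")
        node = result
        for key in path[:-1]:
            child = node.get(key)
            if not isinstance(child, dict):
                child = {}
                node[key] = child
            node = child
        node[path[-1]] = value
    return result


base = {"environment": "development", "source": {"host": "localhost", "port": 5432}}
env = {
    "BASELINR_SOURCE__HOST": "prod-db.example.com",
    "BASELINR_ENVIRONMENT": "production",
    "PATH": "/usr/bin",
}
merged = apply_env_overrides(base, env)
```

In practice `environ` would be `os.environ`; passing a plain mapping keeps the sketch easy to test.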

🧪 Development

Run Tests

pytest

Code Formatting

black baselinr/
isort baselinr/

Type Checking

mypy baselinr/

📝 License

Business Source License 1.1 - see LICENSE file for details.

This software is available under the Business Source License (BSL) 1.1, which allows free use for non-commercial purposes. Commercial use requires a license. The license will convert to Apache License 2.0 on January 1, 2028. For commercial licensing inquiries, please contact the project maintainers.

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📧 Contact

For questions and support, please open an issue on GitHub.


Baselinr - Modern data profiling made simple 🧩
