Sparvi Core

Data profiling and validation engine for modern data warehouses

Like a hawk keeping watch over your data, Sparvi monitors data pipelines, detects anomalies, tracks schema changes, and ensures data integrity with sharp precision.

Sparvi Core is a Python library for data profiling and validation in modern data warehouses. It helps data engineers and analysts maintain high-quality data by monitoring schema changes, detecting anomalies, and validating data against custom rules.

Features

Data Profiling

  • Automated Metrics: Compute essential quality metrics (null rates, duplicates, outliers) to understand your data's health at a glance
  • Schema Analysis: Detect column types, relationships, and constraints
  • Distribution Analysis: Understand the distribution of values in your data
  • Historical Comparisons: Compare current profiles with previous runs to detect changes
  • Anomaly Detection: Automatically detect anomalies in your data

Data Validation

  • Custom Validation Rules: Define and run your own validation rules
  • SQL-Based Rules: Use SQL to define validation queries
  • Default Rules Generator: Automatically generate sensible validation rules based on your data
  • Detailed Results: Get comprehensive information about validation failures
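
Rules are typically stored in a YAML file that both the CLI (`--rules`) and the Python API can load. The sketch below shows the general shape of a custom SQL-based rule; the field names here are illustrative assumptions, not the guaranteed schema — generate a file with `--generate-defaults --save-defaults` to see the exact format your installed version produces.

```yaml
# rules.yaml — field names are illustrative, not the guaranteed schema
rules:
  - name: no_null_emails            # hypothetical rule for an "employees" table
    description: Every employee must have an email address
    query: SELECT COUNT(*) FROM employees WHERE email IS NULL
    operator: equals
    expected_value: 0
```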

Installation

# Basic installation
pip install sparvi-core

# With support for Snowflake
pip install sparvi-core[snowflake]

# With support for PostgreSQL
pip install sparvi-core[postgres]

# With all extras
pip install sparvi-core[snowflake,postgres]

Quick Start

Command Line Interface

Profile a table:

# Basic profiling
sparvi profile "duckdb:///path/to/database.duckdb" employees

# Save the profile to a file
sparvi profile "postgresql://user:pass@localhost/mydatabase" customers --output profile.json

# Compare with a previous profile
sparvi profile "snowflake://user:pass@account/database/schema?warehouse=wh" orders --compare previous_profile.json

Validate a table:

# Generate and run default validations
sparvi validate "duckdb:///path/to/database.duckdb" employees --generate-defaults

# Save the default rules to a YAML file
sparvi validate "duckdb:///path/to/database.duckdb" employees --generate-defaults --save-defaults rules.yaml

# Run validations from a file
sparvi validate "postgresql://user:pass@localhost/mydatabase" customers --rules rules.yaml

# Save validation results to a file
sparvi validate "snowflake://user:pass@account/database/schema?warehouse=wh" orders --rules rules.yaml --output results.json

Python API

Profile a table:

from sparvi.profiler.profile_engine import profile_table

# Run a profile
profile = profile_table("duckdb:///path/to/database.duckdb", "employees")

# Check completeness
for column, stats in profile["completeness"].items():
    print(f"{column}: {stats['null_percentage']}% null, {stats['distinct_percentage']}% distinct")

# Check for anomalies
for anomaly in profile.get("anomalies", []):
    print(f"Anomaly: {anomaly['description']}")

# Check for schema shifts
for shift in profile.get("schema_shifts", []):
    print(f"Schema shift: {shift['description']}")

Validate a table:

from sparvi.validations.validator import run_validations, load_rules_from_file
from sparvi.validations.default_validations import get_default_validations

# Generate default validation rules
rules = get_default_validations("duckdb:///path/to/database.duckdb", "employees")

# Run the validations
results = run_validations("duckdb:///path/to/database.duckdb", rules)

# Check results
for result in results:
    status = "PASS" if result["is_valid"] else "FAIL"
    print(f"{result['rule_name']}: {status}")
    if not result["is_valid"]:
        print(f"  Expected: {result['expected_value']}, Actual: {result['actual_value']}")
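
Because validation results are plain dictionaries, they are easy to aggregate. A minimal sketch, assuming only the result keys shown above (`is_valid`, `rule_name`) and using a hand-made sample list in place of real `run_validations` output:

```python
from collections import Counter

# Sample results shaped like the dictionaries run_validations returns
# (hand-made here so the snippet is self-contained)
results = [
    {"rule_name": "no_null_emails", "is_valid": True},
    {"rule_name": "salary_positive", "is_valid": False},
    {"rule_name": "unique_ids", "is_valid": True},
]

# Tally pass/fail counts and collect the names of failing rules
summary = Counter("PASS" if r["is_valid"] else "FAIL" for r in results)
failed = [r["rule_name"] for r in results if not r["is_valid"]]

print(dict(summary))   # {'PASS': 2, 'FAIL': 1}
print(failed)          # ['salary_positive']
```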

Multi-Database Support

Sparvi Core supports multiple database engines:

  • DuckDB: Included by default, ideal for local analysis
  • PostgreSQL: Install with pip install sparvi-core[postgres]
  • Snowflake: Install with pip install sparvi-core[snowflake]

The library uses database-specific adapters to ensure that SQL queries are optimized for each database engine. This provides consistent results while taking advantage of each database's specific features.

For example, Sparvi automatically adapts:

  • Regular expression syntax
  • Date/time functions
  • Percentile calculations
  • String operations

This means you can profile and validate your data using the same API regardless of the underlying database.
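
To see why adapters matter, consider regex matching: each engine spells the predicate differently. The helper below is not Sparvi's internal API — just an illustration of the kind of translation an adapter performs, with syntax taken from each engine's documented behavior:

```python
def regex_predicate(dialect: str, column: str, pattern: str) -> str:
    """Build a dialect-appropriate regex-match predicate (illustrative only)."""
    if dialect == "postgresql":
        return f"{column} ~ '{pattern}'"                 # PostgreSQL regex operator
    if dialect == "snowflake":
        return f"REGEXP_LIKE({column}, '{pattern}')"     # Snowflake regex function
    if dialect == "duckdb":
        return f"regexp_matches({column}, '{pattern}')"  # DuckDB regex function
    raise ValueError(f"unsupported dialect: {dialect}")

print(regex_predicate("postgresql", "email", "^[^@]+@[^@]+$"))
# email ~ '^[^@]+@[^@]+$'
```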

Database Compatibility

PostgreSQL Considerations

When working with PostgreSQL, keep in mind:

  • For date difference functions, we use PostgreSQL's DATE_PART function
  • Regex pattern matching uses PostgreSQL's ~ operator
  • When using the FILTER clause, ensure you have PostgreSQL 9.4 or higher
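
The FILTER clause pattern — counting nulls without a subquery — is shared by PostgreSQL 9.4+ and recent SQLite, so it can be sketched with just the standard library (the table and data here are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER, email TEXT)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [(1, "a@x.com"), (2, None), (3, "c@x.com")])

# COUNT(*) FILTER (WHERE ...) — the same syntax PostgreSQL uses for null counts
row = conn.execute(
    "SELECT COUNT(*) AS total, "
    "COUNT(*) FILTER (WHERE email IS NULL) AS null_count "
    "FROM employees"
).fetchone()
print(row)  # (3, 1)
```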

Snowflake Considerations

When working with Snowflake, keep in mind:

  • Regex pattern matching uses Snowflake's REGEXP_LIKE function
  • String functions may behave slightly differently than in PostgreSQL or DuckDB
  • To optimize performance with large Snowflake tables, consider using warehouse sizing options

Testing Your Setup

To verify your database connection and functionality, you can use:

from sqlalchemy import create_engine
from sparvi.db.adapters import get_adapter_for_connection

# Test the connection with a simple query
engine = create_engine("your_connection_string")
adapter = get_adapter_for_connection(engine)
print(f"Connected to: {adapter.__class__.__name__}")

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

Apache License 2.0

Download files

Download the file for your platform.

Source Distribution

sparvi_core-0.4.1.tar.gz (26.7 kB)


Built Distribution


sparvi_core-0.4.1-py3-none-any.whl (24.3 kB)


File details

Details for the file sparvi_core-0.4.1.tar.gz.

File metadata

  • Download URL: sparvi_core-0.4.1.tar.gz
  • Size: 26.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.11.8

File hashes

Hashes for sparvi_core-0.4.1.tar.gz:

  • SHA256: 34e03d913400a49abcc16c4a37eb5bad3a31c76b4ac14aa0626225dd777ca85a
  • MD5: a1a14af60a581729baa3f3847a424756
  • BLAKE2b-256: f7020fa8ca17b19ee8bbcba9694ca621fe73913737b5129b77098d3386dee723


File details

Details for the file sparvi_core-0.4.1-py3-none-any.whl.

File metadata

  • Download URL: sparvi_core-0.4.1-py3-none-any.whl
  • Size: 24.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.11.8

File hashes

Hashes for sparvi_core-0.4.1-py3-none-any.whl:

  • SHA256: 909938b199c8b9b3762640b7f5e4499999027c054bf87ee630310c77d365ecc5
  • MD5: 4e8941e2ec23012bee1de116e2710f58
  • BLAKE2b-256: 9a4e3dc9f421778df2f2c81b299f643fde80f685730d051db37f8ee5d73c269d

