
Data profiling and validation engine for modern data warehouses


Sparvi Core


Like a hawk keeping watch over your data, Sparvi monitors data pipelines, detects anomalies, tracks schema changes, and ensures data integrity with sharp precision.

Sparvi Core is a Python library for data profiling and validation in modern data warehouses. It helps data engineers and analysts maintain high-quality data by monitoring schema changes, detecting anomalies, and validating data against custom rules.

Supported Data Warehouses: Snowflake, BigQuery, Redshift, PostgreSQL, DuckDB

Features

Data Profiling

  • Automated Metrics: Compute essential quality metrics (null rates, duplicates, outliers) to understand your data's health at a glance
  • Schema Analysis: Detect column types, relationships, and constraints
  • Distribution Analysis: Understand the distribution of values in your data
  • Historical Comparisons: Compare current profiles with previous runs to detect changes
  • Anomaly Detection: Automatically detect anomalies in your data

Data Validation

  • Custom Validation Rules: Define and run your own validation rules
  • SQL-Based Rules: Use SQL to define validation queries
  • Default Rules Generator: Automatically generate sensible validation rules based on your data
  • Detailed Results: Get comprehensive information about validation failures

dbt Integration

  • Upload dbt Artifacts: Send your dbt manifest and run results to the Sparvi platform
  • Unified Monitoring: See dbt tests alongside other data quality metrics
  • CI/CD Ready: Integrate into GitHub Actions, GitLab CI, Azure DevOps, etc.
  • Automatic Discovery: Finds manifest.json in ./target/ without extra configuration

Installation

# Basic installation (includes Snowflake, BigQuery, Redshift, DuckDB support)
pip install sparvi-core

# With additional PostgreSQL support
pip install sparvi-core[postgres]

# With development tools
pip install sparvi-core[dev]

Quick Start

dbt Integration

Upload your dbt artifacts to Sparvi:

# Run dbt
dbt run && dbt test

# Upload to Sparvi (requires SPARVI_API_KEY env var)
sparvi dbt upload

# Or specify API key explicitly
sparvi dbt upload --api-key=sk_live_...

Command Line Interface

Profile a table:

# Basic profiling with Snowflake
sparvi profile "snowflake://user:pass@account/database/schema?warehouse=wh" orders

# Basic profiling with BigQuery  
sparvi profile "bigquery://project-id/dataset-name" customers

# Basic profiling with Redshift
sparvi profile "redshift://user:pass@host:5439/database" employees

# Basic profiling with DuckDB
sparvi profile "duckdb:///path/to/database.duckdb" products

# Save the profile to a file
sparvi profile "postgresql://user:pass@localhost/mydatabase" customers --output profile.json

# Compare with a previous profile
sparvi profile "snowflake://user:pass@account/database/schema?warehouse=wh" orders --compare previous_profile.json

Validate a table:

# Generate and run default validations
sparvi validate "snowflake://user:pass@account/database/schema?warehouse=wh" orders --generate-defaults

# BigQuery validation with custom rules
sparvi validate "bigquery://project-id/dataset-name" customers --rules rules.yaml

# Redshift validation with output
sparvi validate "redshift://user:pass@host:5439/database" employees --rules rules.yaml --output results.json

# Save the default rules to a YAML file
sparvi validate "duckdb:///path/to/database.duckdb" products --generate-defaults --save-defaults rules.yaml

# Run validations from a file
sparvi validate "postgresql://user:pass@localhost/mydatabase" customers --rules rules.yaml

Creating Custom Validations

You can create custom validation rules in YAML format:

rules:
- name: check_orders_not_empty
  description: Ensure orders table has at least one row
  query: SELECT COUNT(*) FROM orders
  operator: greater_than
  expected_value: 0

- name: check_revenue_positive
  description: Ensure all order revenue is positive
  query: SELECT COUNT(*) FROM orders WHERE revenue <= 0
  operator: equals
  expected_value: 0

- name: check_shipping_addresses_valid
  description: Ensure shipping addresses follow a valid format
  query: >
    SELECT COUNT(*) FROM orders 
    WHERE shipping_address IS NOT NULL 
    AND shipping_address NOT LIKE '%_%,%_%'
  operator: equals
  expected_value: 0

Save these rules to a file (e.g., my_rules.yaml) and run them with:

sparvi validate "your_connection_string" orders --rules my_rules.yaml

Python API

Profile a table:

from sparvi.profiler.profile_engine import profile_table

# Run a profile on a Snowflake table
profile = profile_table("snowflake://user:pass@account/database/schema?warehouse=wh", "orders")

# Or use environment variables (recommended for production)
import os
os.environ["SNOWFLAKE_USER"] = "your_user"
os.environ["SNOWFLAKE_PASSWORD"] = "your_password"
os.environ["SNOWFLAKE_ACCOUNT"] = "your_account"
os.environ["SNOWFLAKE_DATABASE"] = "your_database"
os.environ["SNOWFLAKE_SCHEMA"] = "your_schema"
os.environ["SNOWFLAKE_WAREHOUSE"] = "your_warehouse"

# Then profile using environment variables
profile = profile_table(table="orders")  # Connection string is optional if env vars are set

# Check completeness
for column, stats in profile["completeness"].items():
    print(f"{column}: {stats['null_percentage']}% null, {stats['distinct_percentage']}% distinct")

# Check for anomalies
for anomaly in profile.get("anomalies", []):
    print(f"Anomaly: {anomaly['description']}")

# Check for schema shifts
for shift in profile.get("schema_shifts", []):
    print(f"Schema shift: {shift['description']}")

Validate a table:

from sparvi.validations.validator import run_validations, load_rules_from_file
from sparvi.validations.default_validations import get_default_validations

# Generate default validation rules for a Snowflake table
rules = get_default_validations("snowflake://user:pass@account/database/schema?warehouse=wh", "orders")

# Run the validations
results = run_validations("snowflake://user:pass@account/database/schema?warehouse=wh", rules)

# Check results
for result in results:
    status = "PASS" if result["is_valid"] else "FAIL"
    print(f"{result['rule_name']}: {status}")
    if not result["is_valid"]:
        print(f"  Expected: {result['expected_value']}, Actual: {result['actual_value']}")

Validation Framework

Sparvi Core's validation framework allows you to define, run, and manage data quality rules across your data warehouse.

Using Default Validations

The easiest way to get started with validations is to use the default rules generator:

from sparvi.validations.default_validations import get_default_validations
from sparvi.validations.validator import run_validations

# Generate default validation rules based on table structure
rules = get_default_validations("snowflake://user:pass@account/database/schema?warehouse=wh", "orders")

# Run the validations
results = run_validations("snowflake://user:pass@account/database/schema?warehouse=wh", rules)

# Export the default rules for later use
from sparvi.validations.validator import export_rules
export_rules(rules, "order_rules.yaml", format="yaml")

Default validations include checks for:

  • Empty tables
  • Primary key uniqueness
  • Missing required fields
  • Negative values in numeric columns
  • Invalid date ranges
  • Format validation for emails, phone numbers, etc.
  • Statistical outliers
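
Each generated rule uses the same structure as the custom rules shown later in this document, so you can review them before running anything. A minimal sketch (assuming the generated rules share that name / operator / expected_value dictionary shape):

from sparvi.validations.default_validations import get_default_validations

rules = get_default_validations("duckdb:///path/to/database.duckdb", "orders")
print(f"Generated {len(rules)} default rules")

# Assumption: generated rules expose the same keys as hand-written rule dictionaries
for rule in rules:
    print(f"- {rule['name']}: {rule['operator']} {rule['expected_value']}")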

Creating Custom Validations in YAML

YAML files provide a clean, readable way to define validation rules. Create a file like my_rules.yaml:

rules:
- name: check_orders_not_empty
  description: Ensure orders table has at least one row
  query: SELECT COUNT(*) FROM orders
  operator: greater_than
  expected_value: 0

- name: check_revenue_positive
  description: Ensure all order revenue is positive
  query: SELECT COUNT(*) FROM orders WHERE revenue <= 0
  operator: equals
  expected_value: 0

- name: check_high_value_orders_reviewed
  description: Ensure all high-value orders are reviewed
  query: >
    SELECT COUNT(*) FROM orders 
    WHERE total_amount > 10000 
    AND review_status IS NULL
  operator: equals
  expected_value: 0
  
- name: check_shipping_timeframe
  description: Check if shipping dates make sense
  query: >
    SELECT COUNT(*) FROM orders
    WHERE ship_date < order_date
    OR ship_date > DATEADD(day, 30, order_date)
  operator: equals
  expected_value: 0
  
- name: check_customer_distribution
  description: Make sure no single customer represents >50% of orders
  query: >
    WITH customer_counts AS (
      SELECT customer_id, COUNT(*) AS order_count,
      COUNT(*) * 100.0 / (SELECT COUNT(*) FROM orders) AS percentage
      FROM orders
      GROUP BY customer_id
    )
    SELECT COUNT(*) FROM customer_counts WHERE percentage > 50
  operator: equals
  expected_value: 0

Run these validations using the CLI:

sparvi validate "snowflake://user:pass@account/database/schema?warehouse=wh" orders --rules my_rules.yaml --output results.json

Or from Python:

from sparvi.validations.validator import load_rules_from_file, run_validations

# Load rules from YAML file
rules = load_rules_from_file("my_rules.yaml")

# Run validations
results = run_validations("your_connection_string", rules)

Creating Custom Validations in Python

You can programmatically create and run custom validation rules:

from sparvi.validations.validator import run_validations, export_rules

# Define custom validation rules
custom_rules = [
    {
        "name": "check_orders_recent_data",
        "description": "Ensure orders table has data for current month",
        "query": "SELECT COUNT(*) FROM orders WHERE order_date >= DATE_TRUNC('month', CURRENT_DATE())",
        "operator": "greater_than",
        "expected_value": 0
    },
    {
        "name": "check_high_value_orders_have_approval",
        "description": "Ensure all orders >$10,000 have manager approval",
        "query": """
            SELECT COUNT(*) FROM orders 
            WHERE total_amount > 10000 
            AND manager_approval_id IS NULL
        """,
        "operator": "equals",
        "expected_value": 0
    }
]

# Run the custom validations
results = run_validations("snowflake://user:pass@account/database/schema?warehouse=wh", custom_rules)

# Export the rules for future use
export_rules(custom_rules, "high_value_order_rules.yaml", format="yaml")

# Later, load and run the rules
from sparvi.validations.validator import load_rules_from_file
saved_rules = load_rules_from_file("high_value_order_rules.yaml")
results = run_validations("your_connection_string", saved_rules)

Available Validation Operators

Sparvi supports several comparison operators for validation rules (both verbose and symbolic formats):

  • equals or ==: Actual value must exactly match expected value
  • greater_than or >: Actual value must be greater than expected value
  • less_than or <: Actual value must be less than expected value
  • greater_than_or_equal or >=: Actual value must be greater than or equal to expected value
  • less_than_or_equal or <=: Actual value must be less than or equal to expected value
  • not_equals or !=: Actual value must not match expected value
  • between: Actual value must be between a range (provide expected_value as [min, max])
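
All of these operators use the same rule structure shown earlier; for between, pass expected_value as a two-element [min, max] list. A short sketch (table and column names are illustrative):

from sparvi.validations.validator import run_validations

operator_examples = [
    {
        "name": "check_row_count_in_range",
        "description": "Row count should fall inside an expected range",
        "query": "SELECT COUNT(*) FROM orders",
        "operator": "between",
        "expected_value": [1000, 1000000],  # [min, max] for the 'between' operator
    },
    {
        "name": "check_no_future_orders",
        "description": "No orders should be dated in the future",
        "query": "SELECT COUNT(*) FROM orders WHERE order_date > CURRENT_DATE",
        "operator": "==",  # symbolic form of 'equals'
        "expected_value": 0,
    },
]

results = run_validations("your_connection_string", operator_examples)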

Working with Validation Results

Validation results contain detailed information about each rule:

# Analyze validation results
for result in results:
    status = "PASS" if result["is_valid"] else "FAIL"
    print(f"{result['rule_name']}: {status}")
    
    if not result["is_valid"]:
        print(f"  Description: {result['description']}")
        print(f"  Expected: {result['expected_value']}, Actual: {result['actual_value']}")

# Save results to JSON for reporting
import json
with open("validation_results.json", "w") as f:
    json.dump({"results": results}, f, indent=2)

Example results JSON:

{
  "results": [
    {
      "rule_name": "check_orders_not_empty",
      "is_valid": true,
      "actual_value": 1245,
      "expected_value": 0,
      "description": "Ensure orders table has at least one row"
    },
    {
      "rule_name": "check_revenue_positive",
      "is_valid": false,
      "actual_value": 3,
      "expected_value": 0,
      "description": "Ensure all order revenue is positive"
    }
  ]
}
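
Because every result includes is_valid, the same structure can gate a CI job. A minimal sketch:

import sys

# Exit non-zero if any validation rule failed, so the pipeline stops
failures = [r for r in results if not r["is_valid"]]
if failures:
    print(f"{len(failures)} validation(s) failed")
    sys.exit(1)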

Data Profiling

Sparvi Core provides comprehensive data profiling capabilities to help you understand your data's quality, structure, and characteristics.

Basic Profiling

from sparvi.profiler.profile_engine import profile_table

# Profile a table
profile = profile_table("snowflake://user:pass@account/database/schema?warehouse=wh", "customers")

# Access profile metrics
print(f"Row count: {profile['row_count']}")
print(f"Duplicate count: {profile['duplicate_count']}")

# Column completeness
for column, stats in profile["completeness"].items():
    print(f"{column}: {stats['null_percentage']}% null, {stats['distinct_percentage']}% distinct")

Profiling Features

Sparvi's profiler collects a wide range of metrics:

  1. Basic Statistics

    • Row count
    • Duplicate row count
    • Column counts
  2. Completeness Analysis

    • NULL value percentages
    • Distinct value percentages
    • Most frequent values
  3. Numeric Column Analysis

    • Min/max values
    • Mean, median
    • Standard deviation
    • Quartiles (Q1, Q3)
    • Outlier detection
  4. Text Column Analysis

    • Min/max/average lengths
    • Pattern recognition (emails, phone numbers, etc.)
    • Common prefixes/suffixes
  5. Date Column Analysis

    • Date ranges
    • Distribution over time
    • Seasonality detection
  6. Sample Data Collection (optional)

    • Representative samples for visual inspection
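
Because the profile is a plain dictionary, an easy way to see which of these sections were computed for a given table is to list its top-level keys and then drill into the documented ones. A small sketch (section names beyond completeness, anomalies, and schema_shifts may vary):

from sparvi.profiler.profile_engine import profile_table

profile = profile_table("duckdb:///path/to/database.duckdb", "customers")

# See which sections the profiler produced for this table
print(sorted(profile.keys()))

# Drill into the documented sections
print("Rows:", profile["row_count"], "Duplicates:", profile["duplicate_count"])
for column, stats in profile["completeness"].items():
    print(f"{column}: {stats['null_percentage']}% null")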

Using Profile Output

Profile results are returned as a nested dictionary that you can explore, save, or analyze:

# Save profile results to JSON
import json
with open("customer_profile.json", "w") as f:
    json.dump(profile, f, indent=2)

# Load saved profile for comparison
with open("customer_profile.json", "r") as f:
    historical_profile = json.load(f)

# Run profile with historical comparison
new_profile = profile_table(
    "snowflake://user:pass@account/database/schema?warehouse=wh", 
    "customers",
    historical_data=historical_profile
)

# Check for anomalies detected against historical data
if new_profile.get("anomalies"):
    print(f"Found {len(new_profile['anomalies'])} anomalies vs. historical data:")
    for anomaly in new_profile["anomalies"]:
        print(f"- {anomaly['description']} (severity: {anomaly['severity']})")

Command Line Profiling

Profile tables directly from the command line:

# Basic profiling
sparvi profile "snowflake://user:pass@account/database/schema?warehouse=wh" customers

# Save profile output
sparvi profile "snowflake://user:pass@account/database/schema?warehouse=wh" customers --output customer_profile.json

# Compare with historical profile
sparvi profile "snowflake://user:pass@account/database/schema?warehouse=wh" customers --compare previous_profile.json

# Format options (detailed, default, minimal)
sparvi profile "snowflake://user:pass@account/database/schema?warehouse=wh" customers --format detailed

The CLI renders nicely formatted tables in your terminal with color-coding for issues, making it easy to spot problems quickly.

Database Compatibility

Snowflake Considerations (Recommended Database)

When working with Snowflake, Sparvi automatically:

  • Uses Snowflake's efficient SAMPLE (n ROWS) syntax for better performance
  • Optimizes session parameters with USE_CACHED_RESULT = TRUE
  • Employs Snowflake's REGEXP_LIKE function for pattern matching
  • Sets appropriate query tags for monitoring
  • Leverages Snowflake's native date functions

To connect to Snowflake, you can:

  • Provide a connection string: snowflake://user:pass@account/database/schema?warehouse=wh
  • Or use environment variables:
    SNOWFLAKE_USER=your_user
    SNOWFLAKE_PASSWORD=your_password
    SNOWFLAKE_ACCOUNT=your_account
    SNOWFLAKE_DATABASE=your_database
    SNOWFLAKE_SCHEMA=your_schema
    SNOWFLAKE_WAREHOUSE=your_warehouse
    

PostgreSQL Considerations

When working with PostgreSQL, keep in mind:

  • Date differences are computed with PostgreSQL's DATE_PART function
  • Regex pattern matching uses PostgreSQL's ~ operator (see the example after this list)
  • The FILTER clause requires PostgreSQL 9.4 or higher
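
For example, a rule aimed at PostgreSQL can use the ~ / !~ regex operators directly in its query. A sketch with illustrative table and column names:

from sparvi.validations.validator import run_validations

# Illustrative rule using PostgreSQL's regex operators (!~ means "does not match")
pg_rules = [
    {
        "name": "check_email_format",
        "description": "Customer emails should match a basic address pattern",
        "query": "SELECT COUNT(*) FROM customers WHERE email IS NOT NULL AND email !~ '^[^@]+@[^@]+\\.[^@]+$'",
        "operator": "equals",
        "expected_value": 0,
    }
]

results = run_validations("postgresql://user:pass@localhost/mydatabase", pg_rules)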

BigQuery Considerations

When working with BigQuery, Sparvi automatically:

  • Uses BigQuery's PERCENTILE_CONT for accurate percentile calculations
  • Employs REGEXP_CONTAINS for pattern matching with raw strings
  • Optimizes queries with TABLESAMPLE SYSTEM for large table sampling
  • Sets billing limits (1GB default) and location preferences (US default)
  • Uses DATE_DIFF function for date calculations

Connection string format: bigquery://project-id/dataset-name

Redshift Considerations

When working with Redshift, Sparvi automatically:

  • Uses APPROXIMATE PERCENTILE_DISC for percentile calculations
  • Employs REGEXP operator for pattern matching
  • Optimizes sampling with ORDER BY RANDOM()
  • Sets SSL connections by default for security
  • Uses DATEDIFF function for date calculations

Connection string format: redshift://user:password@host:port/database

DuckDB Considerations

DuckDB is ideal for local analysis and development:

  • No server setup required
  • Fast local processing
  • Compatible with most SQL features
  • Perfect for testing validation rules before running on production data (see the sketch below)
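
For example, you might build a throwaway DuckDB file with a tiny table and run your rules file against it before pointing it at the warehouse. A sketch (assumes the duckdb Python package and the my_rules.yaml file from the earlier examples):

import duckdb
from sparvi.validations.validator import load_rules_from_file, run_validations

# Create a small local table to exercise the rules against
con = duckdb.connect("local_test.duckdb")
con.execute("CREATE TABLE IF NOT EXISTS orders AS SELECT 1 AS id, 100.0 AS revenue, CURRENT_DATE AS order_date")
con.close()

# Run the same rules you plan to use in production
rules = load_rules_from_file("my_rules.yaml")
results = run_validations("duckdb:///local_test.duckdb", rules)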

Testing Your Setup

To verify that your database connection is set up correctly, you can use:

# Test connection using CLI
sparvi test-connection --connection "your_connection_string"

# Or use environment variables
export SNOWFLAKE_USER=your_user
export SNOWFLAKE_PASSWORD=your_password
export SNOWFLAKE_ACCOUNT=your_account
export SNOWFLAKE_DATABASE=your_database
sparvi test-connection

Environment Variable Support

Sparvi Core supports using environment variables for sensitive connection information:

# For Snowflake (recommended)
export SNOWFLAKE_USER=your_user
export SNOWFLAKE_PASSWORD=your_password
export SNOWFLAKE_ACCOUNT=your_account
export SNOWFLAKE_DATABASE=your_database
export SNOWFLAKE_SCHEMA=public
export SNOWFLAKE_WAREHOUSE=compute_wh

# Generic connection string
export DATABASE_URL="your_connection_string"

When using the CLI or Python API without specifying a connection string, Sparvi will automatically use these environment variables.

Contributing

  • Contributions are welcome! Please feel free to submit a Pull Request
  • For development setup, install with development dependencies: pip install -e ".[dev]"
  • Run tests with: pytest

License

Apache License 2.0
