
Sparvi Core


Like a hawk keeping watch over your data, Sparvi monitors data pipelines, detects anomalies, tracks schema changes, and ensures data integrity with sharp precision.

Sparvi Core is a Python library for data profiling and validation in modern data warehouses. It helps data engineers and analysts maintain high-quality data by monitoring schema changes, detecting anomalies, and validating data against custom rules.

Features

Data Profiling

  • Automated Metrics: Compute essential quality metrics (null rates, duplicates, outliers) to understand your data's health at a glance
  • Schema Analysis: Detect column types, relationships, and constraints
  • Distribution Analysis: Understand the distribution of values in your data
  • Historical Comparisons: Compare current profiles with previous runs to detect changes
  • Anomaly Detection: Automatically detect anomalies in your data

Data Validation

  • Custom Validation Rules: Define and run your own validation rules
  • SQL-Based Rules: Use SQL to define validation queries
  • Default Rules Generator: Automatically generate sensible validation rules based on your data
  • Detailed Results: Get comprehensive information about validation failures

Installation

# Basic installation
pip install sparvi-core

# With optimized support for Snowflake (recommended for Snowflake users)
pip install sparvi-core[snowflake]

# With support for PostgreSQL
pip install sparvi-core[postgres]

# With all extras
pip install sparvi-core[snowflake,postgres]

Quick Start

Command Line Interface

Profile a table:

# Basic profiling with Snowflake
sparvi profile "snowflake://user:pass@account/database/schema?warehouse=wh" orders

# Basic profiling with DuckDB
sparvi profile "duckdb:///path/to/database.duckdb" employees

# Save the profile to a file
sparvi profile "postgresql://user:pass@localhost/mydatabase" customers --output profile.json

# Compare with a previous profile
sparvi profile "snowflake://user:pass@account/database/schema?warehouse=wh" orders --compare previous_profile.json

Validate a table:

# Generate and run default validations
sparvi validate "snowflake://user:pass@account/database/schema?warehouse=wh" orders --generate-defaults

# Save the default rules to a YAML file
sparvi validate "duckdb:///path/to/database.duckdb" employees --generate-defaults --save-defaults rules.yaml

# Run validations from a file
sparvi validate "postgresql://user:pass@localhost/mydatabase" customers --rules rules.yaml

# Save validation results to a file
sparvi validate "snowflake://user:pass@account/database/schema?warehouse=wh" orders --rules rules.yaml --output results.json

Creating Custom Validations

You can create custom validation rules in YAML format:

rules:
- name: check_orders_not_empty
  description: Ensure orders table has at least one row
  query: SELECT COUNT(*) FROM orders
  operator: greater_than
  expected_value: 0

- name: check_revenue_positive
  description: Ensure all order revenue is positive
  query: SELECT COUNT(*) FROM orders WHERE revenue <= 0
  operator: equals
  expected_value: 0

- name: check_shipping_addresses_valid
  description: Ensure shipping addresses have a valid format
  query: >
    SELECT COUNT(*) FROM orders 
    WHERE shipping_address IS NOT NULL 
    AND shipping_address NOT LIKE '%_%,%_%'
  operator: equals
  expected_value: 0

Save these rules to a file (e.g., my_rules.yaml) and run them with:

sparvi validate "your_connection_string" orders --rules my_rules.yaml

Python API

Profile a table:

from sparvi.profiler.profile_engine import profile_table

# Run a profile on a Snowflake table
profile = profile_table("snowflake://user:pass@account/database/schema?warehouse=wh", "orders")

# Or use environment variables (recommended for production)
import os
os.environ["SNOWFLAKE_USER"] = "your_user"
os.environ["SNOWFLAKE_PASSWORD"] = "your_password"
os.environ["SNOWFLAKE_ACCOUNT"] = "your_account"
os.environ["SNOWFLAKE_DATABASE"] = "your_database"
os.environ["SNOWFLAKE_SCHEMA"] = "your_schema"
os.environ["SNOWFLAKE_WAREHOUSE"] = "your_warehouse"

# Then profile using environment variables
profile = profile_table(table="orders")  # Connection string is optional if env vars are set

# Check completeness
for column, stats in profile["completeness"].items():
    print(f"{column}: {stats['null_percentage']}% null, {stats['distinct_percentage']}% distinct")

# Check for anomalies
for anomaly in profile.get("anomalies", []):
    print(f"Anomaly: {anomaly['description']}")

# Check for schema shifts
for shift in profile.get("schema_shifts", []):
    print(f"Schema shift: {shift['description']}")

Validate a table:

from sparvi.validations.validator import run_validations, load_rules_from_file
from sparvi.validations.default_validations import get_default_validations

# Generate default validation rules for a Snowflake table
rules = get_default_validations("snowflake://user:pass@account/database/schema?warehouse=wh", "orders")

# Run the validations
results = run_validations("snowflake://user:pass@account/database/schema?warehouse=wh", rules)

# Check results
for result in results:
    status = "PASS" if result["is_valid"] else "FAIL"
    print(f"{result['rule_name']}: {status}")
    if not result["is_valid"]:
        print(f"  Expected: {result['expected_value']}, Actual: {result['actual_value']}")

Validation Framework

Sparvi Core's validation framework allows you to define, run, and manage data quality rules across your data warehouse.

Using Default Validations

The easiest way to get started with validations is to use the default rules generator:

from sparvi.validations.default_validations import get_default_validations
from sparvi.validations.validator import run_validations

# Generate default validation rules based on table structure
rules = get_default_validations("snowflake://user:pass@account/database/schema?warehouse=wh", "orders")

# Run the validations
results = run_validations("snowflake://user:pass@account/database/schema?warehouse=wh", rules)

# Export the default rules for later use
from sparvi.validations.validator import export_rules
export_rules(rules, "order_rules.yaml", format="yaml")

Default validations include checks for:

  • Empty tables
  • Primary key uniqueness
  • Missing required fields
  • Negative values in numeric columns
  • Invalid date ranges
  • Format validation for emails, phone numbers, etc.
  • Statistical outliers
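
For illustration, a generated primary-key uniqueness check could take roughly the following shape (a hypothetical sketch using the rule schema shown throughout this document; the rules get_default_validations actually emits may use different names and queries):

# Hypothetical sketch only: actual generated rules may differ.
example_generated_rule = {
    "name": "check_orders_id_unique",
    "description": "Ensure primary key 'id' is unique in orders",
    "query": "SELECT COUNT(*) - COUNT(DISTINCT id) FROM orders",
    "operator": "equals",
    "expected_value": 0,
}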

Creating Custom Validations in YAML

YAML files provide a clean, readable way to define validation rules. Create a file like my_rules.yaml:

rules:
- name: check_orders_not_empty
  description: Ensure orders table has at least one row
  query: SELECT COUNT(*) FROM orders
  operator: greater_than
  expected_value: 0

- name: check_revenue_positive
  description: Ensure all order revenue is positive
  query: SELECT COUNT(*) FROM orders WHERE revenue <= 0
  operator: equals
  expected_value: 0

- name: check_high_value_orders_reviewed
  description: Ensure all high-value orders are reviewed
  query: >
    SELECT COUNT(*) FROM orders 
    WHERE total_amount > 10000 
    AND review_status IS NULL
  operator: equals
  expected_value: 0
  
- name: check_shipping_timeframe
  description: Check if shipping dates make sense
  query: >
    SELECT COUNT(*) FROM orders
    WHERE ship_date < order_date
    OR ship_date > DATEADD(day, 30, order_date)
  operator: equals
  expected_value: 0
  
- name: check_customer_distribution
  description: Make sure no single customer represents >50% of orders
  query: >
    WITH customer_counts AS (
      SELECT customer_id, COUNT(*) AS order_count,
      COUNT(*) * 100.0 / (SELECT COUNT(*) FROM orders) AS percentage
      FROM orders
      GROUP BY customer_id
    )
    SELECT COUNT(*) FROM customer_counts WHERE percentage > 50
  operator: equals
  expected_value: 0

Run these validations using the CLI:

sparvi validate "snowflake://user:pass@account/database/schema?warehouse=wh" orders --rules my_rules.yaml --output results.json

Or from Python:

from sparvi.validations.validator import load_rules_from_file, run_validations

# Load rules from YAML file
rules = load_rules_from_file("my_rules.yaml")

# Run validations
results = run_validations("your_connection_string", rules)

Creating Custom Validations in Python

You can programmatically create and run custom validation rules:

from sparvi.validations.validator import run_validations, export_rules

# Define custom validation rules
custom_rules = [
    {
        "name": "check_orders_recent_data",
        "description": "Ensure orders table has data for current month",
        "query": "SELECT COUNT(*) FROM orders WHERE order_date >= DATE_TRUNC('month', CURRENT_DATE())",
        "operator": "greater_than",
        "expected_value": 0
    },
    {
        "name": "check_high_value_orders_have_approval",
        "description": "Ensure all orders >$10,000 have manager approval",
        "query": """
            SELECT COUNT(*) FROM orders 
            WHERE total_amount > 10000 
            AND manager_approval_id IS NULL
        """,
        "operator": "equals",
        "expected_value": 0
    }
]

# Run the custom validations
results = run_validations("snowflake://user:pass@account/database/schema?warehouse=wh", custom_rules)

# Export the rules for future use
export_rules(custom_rules, "high_value_order_rules.yaml", format="yaml")

# Later, load and run the rules
from sparvi.validations.validator import load_rules_from_file
saved_rules = load_rules_from_file("high_value_order_rules.yaml")
results = run_validations("your_connection_string", saved_rules)

Available Validation Operators

Sparvi supports several comparison operators for validation rules:

  • equals: Actual value must exactly match expected value
  • greater_than: Actual value must be greater than expected value
  • less_than: Actual value must be less than expected value
  • between: Actual value must fall within a range (provide expected_value as [min, max]; see the sketch below)
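
For example, a between rule that checks an aggregate stays within bounds might look like this (a sketch; the table and column names are illustrative):

# Sketch of a "between" rule: expected_value is a [min, max] pair.
avg_order_value_rule = {
    "name": "check_avg_order_value_in_range",
    "description": "Average order value should stay between 10 and 500",
    "query": "SELECT AVG(total_amount) FROM orders",
    "operator": "between",
    "expected_value": [10, 500],
}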

Working with Validation Results

Validation results contain detailed information about each rule:

# Analyze validation results
for result in results:
    status = "PASS" if result["is_valid"] else "FAIL"
    print(f"{result['rule_name']}: {status}")
    
    if not result["is_valid"]:
        print(f"  Description: {result['description']}")
        print(f"  Expected: {result['expected_value']}, Actual: {result['actual_value']}")

# Save results to JSON for reporting
import json
with open("validation_results.json", "w") as f:
    json.dump({"results": results}, f, indent=2)

Example results JSON:

{
  "results": [
    {
      "rule_name": "check_orders_not_empty",
      "is_valid": true,
      "actual_value": 1245,
      "expected_value": 0,
      "description": "Ensure orders table has at least one row"
    },
    {
      "rule_name": "check_revenue_positive",
      "is_valid": false,
      "actual_value": 3,
      "expected_value": 0,
      "description": "Ensure all order revenue is positive"
    }
  ]
}

Data Profiling

Sparvi Core provides comprehensive data profiling capabilities to help you understand your data's quality, structure, and characteristics.

Basic Profiling

from sparvi.profiler.profile_engine import profile_table

# Profile a table
profile = profile_table("snowflake://user:pass@account/database/schema?warehouse=wh", "customers")

# Access profile metrics
print(f"Row count: {profile['row_count']}")
print(f"Duplicate count: {profile['duplicate_count']}")

# Column completeness
for column, stats in profile["completeness"].items():
    print(f"{column}: {stats['null_percentage']}% null, {stats['distinct_percentage']}% distinct")

Profiling Features

Sparvi's profiler collects a wide range of metrics:

  1. Basic Statistics
    • Row count
    • Duplicate row count
    • Column counts
  2. Completeness Analysis
    • NULL value percentages
    • Distinct value percentages
    • Most frequent values
  3. Numeric Column Analysis
    • Min/max values
    • Mean, median
    • Standard deviation
    • Quartiles (Q1, Q3)
    • Outlier detection
  4. Text Column Analysis
    • Min/max/average lengths
    • Pattern recognition (emails, phone numbers, etc.)
    • Common prefixes/suffixes
  5. Date Column Analysis
    • Date ranges
    • Distribution over time
    • Seasonality detection
  6. Sample Data Collection (optional)
    • Representative samples for visual inspection
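
Because profile results come back as a nested dictionary (see the next section), a quick way to discover which of these metric groups a particular profile contains is to inspect its top-level keys:

# List the metric groups present in a profile result. Which keys
# appear beyond row_count, duplicate_count, and completeness depends
# on the table's column types and the library version.
for key, value in profile.items():
    print(f"{key}: {type(value).__name__}")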

Using Profile Output

Profile results are returned as a nested dictionary that you can explore, save, or analyze:

# Save profile results to JSON
import json
with open("customer_profile.json", "w") as f:
    json.dump(profile, f, indent=2)

# Load saved profile for comparison
with open("customer_profile.json", "r") as f:
    historical_profile = json.load(f)

# Run profile with historical comparison
new_profile = profile_table(
    "snowflake://user:pass@account/database/schema?warehouse=wh", 
    "customers",
    historical_data=historical_profile
)

# Check for anomalies detected against historical data
if new_profile.get("anomalies"):
    print(f"Found {len(new_profile['anomalies'])} anomalies vs. historical data:")
    for anomaly in new_profile["anomalies"]:
        print(f"- {anomaly['description']} (severity: {anomaly['severity']})")

Command Line Profiling

Profile tables directly from the command line:

# Basic profiling
sparvi profile "snowflake://user:pass@account/database/schema?warehouse=wh" customers

# Save profile output
sparvi profile "snowflake://user:pass@account/database/schema?warehouse=wh" customers --output customer_profile.json

# Compare with historical profile
sparvi profile "snowflake://user:pass@account/database/schema?warehouse=wh" customers --compare previous_profile.json

# Format options (detailed, default, minimal)
sparvi profile "snowflake://user:pass@account/database/schema?warehouse=wh" customers --format detailed

The CLI renders nicely formatted tables in your terminal with color-coding for issues, making it easy to spot problems quickly.

Database Compatibility

Snowflake Considerations (Recommended Database)

When working with Snowflake, Sparvi automatically:

  • Uses Snowflake's efficient SAMPLE (n ROWS) syntax for better performance
  • Optimizes session parameters with USE_CACHED_RESULT = TRUE
  • Employs Snowflake's REGEXP_LIKE function for pattern matching
  • Sets appropriate query tags for monitoring
  • Leverages Snowflake's native date functions

To connect to Snowflake, you can:

  • Provide a connection string: snowflake://user:pass@account/database/schema?warehouse=wh
  • Or use environment variables:
    SNOWFLAKE_USER=your_user
    SNOWFLAKE_PASSWORD=your_password
    SNOWFLAKE_ACCOUNT=your_account
    SNOWFLAKE_DATABASE=your_database
    SNOWFLAKE_SCHEMA=your_schema
    SNOWFLAKE_WAREHOUSE=your_warehouse
    

PostgreSQL Considerations

When working with PostgreSQL, keep in mind:

  • For date difference functions, we use PostgreSQL's DATE_PART function
  • Regex pattern matching uses PostgreSQL's ~ operator (see the example after this list)
  • When using the FILTER clause, ensure you have PostgreSQL 9.4 or higher
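
For instance, a rule that uses PostgreSQL's regex operators to flag badly formatted emails might look like this (a sketch; the pattern and the customers/email names are illustrative):

# Sketch: PostgreSQL regex matching via ~ / !~ (negated match).
email_format_rule = {
    "name": "check_email_format",
    "description": "Ensure customer emails match a basic email pattern",
    "query": (
        r"SELECT COUNT(*) FROM customers "
        r"WHERE email IS NOT NULL "
        r"AND email !~ '^[^@\s]+@[^@\s]+\.[^@\s]+$'"
    ),
    "operator": "equals",
    "expected_value": 0,
}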

DuckDB Considerations

DuckDB is ideal for local analysis and development:

  • No server setup required
  • Fast local processing
  • Compatible with most SQL features
  • Perfect for testing validation rules before running them against production data (sketched below)
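
A minimal sketch of that workflow, assuming a local DuckDB copy of the data (the file path is a placeholder):

from sparvi.validations.validator import load_rules_from_file, run_validations

# Dry-run the rules against a local DuckDB copy first...
rules = load_rules_from_file("my_rules.yaml")
results = run_validations("duckdb:///path/to/local_copy.duckdb", rules)

# ...and only run against the production warehouse once they pass.
if all(r["is_valid"] for r in results):
    results = run_validations(
        "snowflake://user:pass@account/database/schema?warehouse=wh", rules
    )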

Testing Your Setup

To verify that your database connection and setup work, run:

# Test connection using CLI
sparvi test-connection --connection "your_connection_string"

# Or use environment variables
export SNOWFLAKE_USER=your_user
export SNOWFLAKE_PASSWORD=your_password
export SNOWFLAKE_ACCOUNT=your_account
export SNOWFLAKE_DATABASE=your_database
sparvi test-connection

Environment Variable Support

Sparvi Core supports using environment variables for sensitive connection information:

# For Snowflake (recommended)
export SNOWFLAKE_USER=your_user
export SNOWFLAKE_PASSWORD=your_password
export SNOWFLAKE_ACCOUNT=your_account
export SNOWFLAKE_DATABASE=your_database
export SNOWFLAKE_SCHEMA=public
export SNOWFLAKE_WAREHOUSE=compute_wh

# Generic connection string
export DATABASE_URL="your_connection_string"

When using the CLI or Python API without specifying a connection string, Sparvi will automatically use these environment variables.
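
For example (a sketch; it assumes the variables above are exported in your shell, or set as below):

import os
from sparvi.profiler.profile_engine import profile_table

# With DATABASE_URL (or the SNOWFLAKE_* variables) set, the
# connection string argument can be omitted.
os.environ["DATABASE_URL"] = "duckdb:///path/to/database.duckdb"
profile = profile_table(table="orders")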

Contributing

  • Contributions are welcome! Please feel free to submit a Pull Request
  • For development setup, install with development dependencies: pip install -e ".[dev]"
  • Run tests with: pytest

License

Apache License 2.0
