
lokryn-pipe-audit

Data validation library for audit pipelines using Polars. Define validation contracts in TOML and validate data from local files, S3, GCS, or Azure Blob Storage.

Installation

pip install lokryn-pipe-audit

Quick Start

from lokryn_pipe_audit import load_contract, validate_dataframe, get_driver

# Load a validation contract
contract = load_contract("contracts/users.toml")

# Load data with the appropriate driver
driver = get_driver("csv")
with open("users.csv", "rb") as f:
    df = driver.load(f.read())

# Validate
outcome = validate_dataframe(df, contract)

if outcome.passed:
    print("Validation passed!")
else:
    for failure in outcome.failures:
        print(f"Failed: {failure.rule} on {failure.column}")

Contracts

Define validation rules in TOML:

[contract]
name = "users"
version = "1.0"
format = "csv"

[[columns]]
name = "email"
rules = [
    { rule = "not_null" },
    { rule = "unique" },
    { rule = "pattern", pattern = "^[\\w.-]+@[\\w.-]+\\.\\w+$" }
]

[[columns]]
name = "age"
rules = [
    { rule = "not_null" },
    { rule = "range", min = 0, max = 150 }
]

[[columns]]
name = "status"
rules = [
    { rule = "in_set", values = ["active", "inactive", "pending"] }
]
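To try a contract without any files on disk, you can validate an in-memory Polars DataFrame directly; a minimal sketch, assuming validate_dataframe accepts any polars.DataFrame as in the Quick Start:

import polars as pl
from lokryn_pipe_audit import load_contract, validate_dataframe

# Two rows that satisfy every rule in the users contract above
df = pl.DataFrame({
    "email": ["a@example.com", "b@example.com"],
    "age": [34, 29],
    "status": ["active", "pending"],
})

contract = load_contract("contracts/users.toml")
outcome = validate_dataframe(df, contract)
assert outcome.passed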

Built-in Validators

| Validator       | Description                                     | Parameters |
|-----------------|-------------------------------------------------|------------|
| not_null        | No null values                                  | -          |
| unique          | All values are unique                           | -          |
| pattern         | Values match a regex                            | pattern    |
| range           | Numeric values within a range                   | min, max   |
| in_set          | Value is in an allowed set                      | values     |
| completeness    | Percentage of non-null values above a threshold | threshold  |
| mean_between    | Column mean within a range                      | min, max   |
| row_count       | Row count within a range                        | min, max   |
| compound_unique | Unique across a combination of columns          | columns    |
| date_format     | Date strings match a format                     | format     |
| outlier_sigma   | No outliers beyond N sigma                      | sigma      |
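The column-level validators all use the rule-table syntax shown earlier. A sketch exercising a few more of them, with parameter names taken from the table (the strftime-style format string and the numeric values are illustrative assumptions):

[[columns]]
name = "signup_date"
rules = [
    { rule = "date_format", format = "%Y-%m-%d" },
    { rule = "completeness", threshold = 0.95 }
]

[[columns]]
name = "score"
rules = [
    { rule = "mean_between", min = 0.0, max = 1.0 },
    { rule = "outlier_sigma", sigma = 3.0 }
]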

Storage Connectors

Local

from lokryn_pipe_audit import LocalConnector

connector = LocalConnector()
data = await connector.fetch("/path/to/file.csv")
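fetch is a coroutine, so it has to run inside an async context. A minimal end-to-end sketch combining a connector, a driver, and a contract (asyncio.run, the paths, and the assumption that fetch returns bytes compatible with driver.load are ours):

import asyncio

from lokryn_pipe_audit import (
    LocalConnector,
    get_driver,
    load_contract,
    validate_dataframe,
)

async def audit() -> bool:
    connector = LocalConnector()
    data = await connector.fetch("/path/to/file.csv")  # raw bytes

    driver = get_driver("csv")
    df = driver.load(data)

    contract = load_contract("contracts/users.toml")
    return validate_dataframe(df, contract).passed

print(asyncio.run(audit()))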

S3

from lokryn_pipe_audit import S3Connector, load_profiles, get_profile

profiles = load_profiles("profiles.toml")
profile = get_profile(profiles, "my_s3_profile")

connector = S3Connector.from_profile_and_url(profile, "s3://bucket/data.csv")
data = await connector.fetch("s3://bucket/data.csv")

GCS

from lokryn_pipe_audit import GCSConnector, load_profiles, get_profile

profiles = load_profiles("profiles.toml")
profile = get_profile(profiles, "my_gcs_profile")

connector = GCSConnector.from_profile_and_url(profile, "gs://bucket/data.csv")
data = await connector.fetch("gs://bucket/data.csv")

Azure Blob Storage

from lokryn_pipe_audit import AzureConnector, load_profiles, get_profile

profiles = load_profiles("profiles.toml")
profile = get_profile(profiles, "my_azure_profile")

url = "https://account.blob.core.windows.net/container/blob"
connector = AzureConnector.from_profile_and_url(profile, url)
data = await connector.fetch(url)

Profiles

Configure storage credentials in profiles.toml:

[s3_profile]
provider = "s3"
region = "us-east-1"
access_key = "${AWS_ACCESS_KEY_ID}"
secret_key = "${AWS_SECRET_ACCESS_KEY}"

[gcs_profile]
provider = "gcs"
service_account_json = "${GCS_SERVICE_ACCOUNT_JSON}"

[azure_profile]
provider = "azure"
connection_string = "${AZURE_STORAGE_CONNECTION_STRING}"

Environment variables in ${VAR} format are automatically expanded.
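Expansion presumably happens when the profiles file is read, so the variables must already be set in the process environment. For illustration only (in practice the credentials come from your shell, CI secrets, or container environment):

import os
from lokryn_pipe_audit import load_profiles, get_profile

os.environ["AWS_ACCESS_KEY_ID"] = "AKIA..."     # placeholder value
os.environ["AWS_SECRET_ACCESS_KEY"] = "..."     # placeholder value

profiles = load_profiles("profiles.toml")  # ${VAR} references expanded
profile = get_profile(profiles, "s3_profile")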

File Formats

  • CSV (.csv)
  • Parquet (.parquet)
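Both formats go through the same driver interface. The docs above only show get_driver("csv"); a sketch assuming the Parquet driver key mirrors the format name:

from lokryn_pipe_audit import get_driver

driver = get_driver("parquet")  # assumed key, mirroring the contract format value

with open("data.parquet", "rb") as f:
    df = driver.load(f.read())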

License

AGPL-3.0 - See LICENSE for details.



