Skip to main content

SDK for scoring, tiering, and filtering data reliability evidence.

Project description

Data Reliability Index

CI PyPI Python Versions License GitHub Release Typed

Data Reliability Index core features infographic

Data Reliability Index is a typed Python SDK for classifying the reliability of measured, collected, derived, or user-submitted data before it is used for analysis, databases, APIs, machine learning, or scientific conclusions.

The project starts from a practical problem: many datasets contain values that are treated as facts even though their source, calibration, provenance, verification history, uncertainty, or integrity is unknown. In real workflows, measured data is often copied into databases, dashboards, research notebooks, and models without every data point being classified by reliability. That makes later analysis fragile, because high-quality measurements, partially verified records, historical observations, and weakly sourced submissions can all be mixed together as if they had the same evidential value.

Data Reliability Index is built around a simple rule: every data point should carry the evidence needed to decide whether it is safe to use. Each record should be scored, assigned a trust tier, linked to provenance and audit metadata, and filtered by explicit policy before it influences decisions. Without this kind of reliability classification, analysis can become difficult to reproduce, hard to defend, and scientifically weak because the trust level of the underlying data was never verified.

Supported Python versions: 3.9 through 3.14.

Release Status

Latest release: v0.6.1

The v0.6.1 GitHub Release includes signed source, a wheel, source distribution, and artifact provenance attestations. The package is published on PyPI as data-reliability-index.

Features

  • Pydantic models for reliability metadata and policies.
  • Scanning engine for computing reliability scores from validation evidence.
  • Tiered trust classification with scores from 0 to 100.
  • Automatic trust-tier assignment from score and verification signals.
  • Trace hash computation and verification support.
  • HMAC-SHA256 signature verification for authenticated payload integrity.
  • Policy-based acceptance checks for individual records.
  • Structured accept/reject decisions for audits and ingestion logs.
  • Evidence templates for common source types.
  • Driver-neutral database helpers for SQL, analytical, and document stores.
  • Batch and streaming row scanning for small files and large datasets.
  • Optional Pandas helpers for filtering DataFrames by reliability metadata.
  • FastAPI example for rejecting low-reliability input at ingestion time.
  • CLI for scanning JSON and JSONL records.
  • MkDocs documentation for concepts and API usage.

Installation

Install from PyPI:

pip install data-reliability-index

Optional integrations:

pip install "data-reliability-index[pandas]"
pip install "data-reliability-index[api]"
pip install "data-reliability-index[postgres]"
pip install "data-reliability-index[mysql]"
pip install "data-reliability-index[duckdb]"
pip install "data-reliability-index[mongo]"
pip install "data-reliability-index[polars]"
pip install "data-reliability-index[arrow]"

You can also install the latest GitHub Release wheel directly:

pip install https://github.com/h3pdesign/data-reliability-index/releases/download/v0.6.1/data_reliability_index-0.6.1-py3-none-any.whl

For local development from this repository:

pip install -e ".[test]"

Quick Start

from data_reliability import DataTier, ReliabilityPolicy, ReliabilityScanner, evidence_from_template

scanner = ReliabilityScanner()
evidence = evidence_from_template(
    "verified-sensor",
    calibration_version="sensor-cal-2026-06",
)

data = scanner.scan(
    {"temperature": 21.4, "unit": "celsius"},
    source_id="sensor-a",
    evidence=evidence,
)

policy = ReliabilityPolicy(
    minimum_score=90,
    maximum_tier=DataTier.TIER_2,
)

assert policy.resolve(data) == {"temperature": 21.4, "unit": "celsius"}
assert policy.assess(data.reliability).accepted is True

How the Data Reliability Index Works

The Data Reliability Index models the complete lifecycle of a data point as it moves from raw input into a trusted dataset. Rather than treating all data equally, the system evaluates, scores, and classifies every record before it becomes eligible for trusted use.

Raw data can enter from APIs, IoT devices, calibrated sensors, databases, files, and user submissions. Because these sources differ in quality and provenance, every incoming record starts as untrusted until it has been analyzed.

The scanning engine evaluates validation evidence for completeness, consistency, provenance, cryptographic verification, calibration, schema compliance, anomaly detection, duplicate detection, and metadata quality. These checks provide an objective assessment of how trustworthy each data point is.

Each scan produces:

  • A numeric reliability score from 0 to 100, representing overall confidence.
  • A standardized trust tier from TIER_1 to TIER_3, describing source and verification level.
  • A trace hash that can be used to verify whether the data changed.
  • The scoring profile name and version used for reproducibility.
  • A ReliableData wrapper containing the original value and its reliability metadata.

Together, the numeric score and trust tier provide more information than either metric alone. The score enables precise filtering and ranking, while the tier gives an immediately understandable description of the data's verification level.

Evidence Templates

Templates provide conservative starting points so users do not need to manually set every evidence value for common source types:

  • verified-sensor
  • trusted-api
  • cleaned-dataset
  • user-submission
  • historical-record
  • climate-station

Use templates as defaults and override only values you can justify:

from data_reliability import evidence_from_template

evidence = evidence_from_template(
    "trusted-api",
    cryptographic_verification=1.0,
    notes=["Provider payload signature verified at ingestion."],
)

Trust Tiers

The system classifies data into three standardized trust levels:

Tier Trust level Typical sources Intended use
TIER_1 Highest trust Cryptographically verified data, calibrated sensor measurements, direct measurements from trusted APIs Scientific, safety-critical, and mission-critical applications
TIER_2 High trust Cleaned secondary datasets, indirect measurements, partially validated or derived information Analytics, forecasting, and most production workloads
TIER_3 Moderate trust User-generated content, self-reported information, weakly verified external sources Exploratory analysis and workflows requiring additional validation

Intelligent Decision Pipeline

Based on the analysis results, data follows one of two paths:

  • Rejected data: records failing minimum reliability requirements are isolated and excluded from trusted datasets.
  • Accepted data: records meeting validation thresholds can be stored in trusted repositories while remaining fully traceable.

Every accepted record retains its reliability score, trust tier, provenance metadata, validation evidence, cryptographic integrity information, and audit context. Trust is not only calculated once; it remains transparent and reproducible throughout the data lifecycle.

Transparent Scoring

The scanner creates a numeric score by applying profile weights to the evidence dimensions. Each evidence value is normalized from 0.0 to 1.0, multiplied by the active profile's weight, and converted to a 0 to 100 score. Missing timestamps apply an additional penalty when timestamp verification is required.

Trust tiers are then assigned by explicit profile criteria:

  • TIER_1: the record must pass the Tier 1 minimum score and all required evidence thresholds.
  • TIER_2: the record must pass the Tier 2 minimum score and all required evidence thresholds.
  • TIER_3: fallback for records that do not satisfy Tier 1 or Tier 2.

The package includes three profiles:

Profile Purpose Tier behavior
default_profile() General application, API, and analytics data Tier 1 requires high score, provenance, cryptographic verification, schema compliance, and verified timestamp
scientific_profile() Research data across scientific fields Increases weight and thresholds for provenance, calibration, consistency, anomaly checks, and reproducibility signals
climate_record_profile() Weather and climate record data Emphasizes calibrated instruments, station metadata, consistency with comparison stations, anomaly checks, and provenance

The scoring model is stable within a profile. Different use cases change their acceptance thresholds or active profile, not the meaning of reliability itself.

Examples:

  • A medical research study may require TIER_1 data with a score of 99 or higher.
  • An autonomous vehicle system may accept only TIER_1 data above 95.
  • Financial risk models may require TIER_1 or high-quality TIER_2 data.
  • Product analytics might accept TIER_2 data with a threshold of 75.
  • Research prototypes may intentionally include TIER_3 data while applying lower confidence weights.

Because every dataset is evaluated using the same transparent methodology, organizations can define their own acceptance criteria without changing how reliability is measured.

Instead of asking whether a dataset can be trusted, users can inspect objective metrics: reliability score, standardized trust tier, provenance, and validation history.

You can inspect why a record received its tier:

from data_reliability import ReliabilityScanner, ValidationEvidence, climate_record_profile

scanner = ReliabilityScanner(profile=climate_record_profile())
evidence = ValidationEvidence(
    completeness=0.98,
    consistency=0.96,
    provenance=0.96,
    calibration=0.96,
    schema_compliance=0.92,
    anomaly_detection=0.96,
    metadata_quality=0.60,
)

print(scanner.tier_evaluation(evidence))

For climate records, this makes methodological uncertainty visible. A temperature record with strong provenance and calibration can still fail Tier 1 if station metadata, anomaly checks, timestamp verification, or comparison-station consistency are insufficient.

Authenticated Payloads

For trusted ingestion paths, the SDK can verify an upstream HMAC-SHA256 signature before raising cryptographic evidence:

from data_reliability import ReliabilityScanner, compute_hmac_signature

value = {"temperature": 21.4, "unit": "celsius"}
signature = compute_hmac_signature(value, "sensor-a", "shared-secret")

reliable = ReliabilityScanner().scan(
    value,
    source_id="sensor-a",
    expected_signature=signature,
    signing_secret="shared-secret",
)

Use HMAC secrets only for authenticated systems you control. For public third-party signatures, use the provider's signature scheme before passing the result into DRI evidence.

For payload integrity checks, see Integrity Checks. For value-quality checks against predefined ground truth or reference values, see Reference Comparisons. For research workflows, see Scientific Use.

Ground Truth and Reference Checks

For scientific or quality-controlled workflows, define reference values before the data is scored. The index can then show how well observed values match the predefined ground truth, tolerance, source, and method.

from data_reliability import (
    ReferenceValue,
    ReliabilityScanner,
    ValidationEvidence,
    compare_to_references,
    evidence_from_reference_comparison,
    scientific_profile,
)

record = {"temperature": 21.4, "humidity": 48.0, "unit": "metric"}

comparison = compare_to_references(
    record,
    [
        ReferenceValue(
            field="temperature",
            value=21.5,
            tolerance=0.2,
            reference_id="lab-temp-gt",
            source="validated lab baseline",
            unit="celsius",
        ),
        ReferenceValue(
            field="humidity",
            value=50.0,
            tolerance=3.0,
            reference_id="lab-humidity-gt",
            source="validated lab baseline",
            unit="percent",
        ),
    ],
)

evidence = evidence_from_reference_comparison(
    comparison,
    base=ValidationEvidence(
        provenance=1.0,
        calibration=1.0,
        schema_compliance=1.0,
        metadata_quality=1.0,
    ),
)

reliable = ReliabilityScanner(profile=scientific_profile()).scan(
    record,
    source_id="lab-sensor-a",
    evidence=evidence,
)

print(comparison.passed)
print(comparison.quality_score)
print(reliable.reliability.score)

This is different from hash verification. Hashes show whether a payload changed. Reference checks show whether values are close enough to predefined quality targets.

Database Usage

The SDK does not require a database driver. It emits plain dictionaries so the same reliability metadata can be stored in small local databases such as SQLite and DuckDB, production SQL databases such as PostgreSQL and MySQL, analytical warehouses, or document stores.

from data_reliability import ValidationEvidence, scan_row

row = {"temperature": 21.4, "unit": "celsius"}
scored_row = scan_row(
    row,
    source_id="sensor-a",
    evidence=ValidationEvidence(cryptographic_verification=1.0, calibration=1.0),
    required_fields=["temperature", "unit"],
)

assert scored_row["dri_score"] >= 90
assert scored_row["dri_tier"] == 1

For large datasets, stream rows instead of materializing everything:

from data_reliability import iter_scan_rows

for scored_row in iter_scan_rows(rows, source_id_field="id", required_fields=["temperature", "unit"]):
    write_to_database(scored_row)

Generate reliability column definitions for common SQL engines:

from data_reliability import reliability_columns_ddl

print(reliability_columns_ddl(dialect="postgres"))

For SQL tables, store the generated dri_* columns beside the source data. For document databases, store metadata_to_document(reliable.reliability) as a nested reliability object.

For rejected records, store decision_to_document(policy.assess(reliable.reliability)) in quarantine or ingestion logs so policy reasons remain auditable.

CLI Usage

dri scan records.jsonl --jsonl --source-id-field id --required-field temperature --required-field unit

The CLI emits one JSON result per scanned record with the original value, reliability metadata, and policy decision.

SDK and Security Notes

This project is intended to be used as an SDK. The core package keeps dependencies small, exposes typed Pydantic models, avoids dynamic code execution, and uses deterministic SHA-256 trace hashes to detect changed payloads. Hashes are integrity signals, not proof of source identity by themselves; use authenticated ingestion, HMAC or provider signature verification, database permissions, and private vulnerability reporting for production systems.

Pandas Filtering

import pandas as pd
from data_reliability import DataTier, ReliabilityMetadata, ReliabilityPolicy, filter_reliable_df

df = pd.DataFrame([
    {
        "value": 10,
        "reliability": ReliabilityMetadata(
            score=95,
            tier=DataTier.TIER_1,
            source_id="sensor-a",
            trace_hash="abc123",
        ),
    },
])

policy = ReliabilityPolicy(minimum_score=90, maximum_tier=DataTier.TIER_2)
trusted = filter_reliable_df(df, policy)

Documentation

The project documentation lives in docs/ and can be served locally with MkDocs:

pip install -e ".[docs]"
mkdocs serve

Start with:

The longer project rationale is available in data-reliability.md.

Development

Run the test suite:

pip install -e ".[test]"
pytest

Build package artifacts:

pip install -e ".[build]"
python -m build
python -m twine check dist/*

Run the FastAPI example:

pip install -e ".[api]"
uvicorn examples.fastapi_app:app --reload

Contributing

Contributions are welcome. See CONTRIBUTING.md for the local workflow and pull request expectations.

Security

Please report security issues privately. See SECURITY.md.

License

Licensed under the Apache License 2.0.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

data_reliability_index-0.6.1.tar.gz (34.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

data_reliability_index-0.6.1-py3-none-any.whl (26.8 kB view details)

Uploaded Python 3

File details

Details for the file data_reliability_index-0.6.1.tar.gz.

File metadata

  • Download URL: data_reliability_index-0.6.1.tar.gz
  • Upload date:
  • Size: 34.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for data_reliability_index-0.6.1.tar.gz
Algorithm Hash digest
SHA256 793c28d8d717f0382fbc4dd10fb5d18244273678c953838ff7b2fec012fb0755
MD5 6084d1291544d19a24a27f7750a0ba73
BLAKE2b-256 be92942e03101cbbc95d19a7c35a6877497e85aa06c80b74b440ed37691ed7b0

See more details on using hashes here.

Provenance

The following attestation bundles were made for data_reliability_index-0.6.1.tar.gz:

Publisher: publish.yml on h3pdesign/data-reliability-index

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file data_reliability_index-0.6.1-py3-none-any.whl.

File metadata

File hashes

Hashes for data_reliability_index-0.6.1-py3-none-any.whl
Algorithm Hash digest
SHA256 bb968d1bdfb028a851946037e9d0d13bac3b77e7c403fd9ee1e828cb13ec2efc
MD5 90f39b611e6ca3f200759e03bdb400ba
BLAKE2b-256 684d6fbb57d03f7b57a9410eca803db8dcd1211ed324525e42027856c99598c0

See more details on using hashes here.

Provenance

The following attestation bundles were made for data_reliability_index-0.6.1-py3-none-any.whl:

Publisher: publish.yml on h3pdesign/data-reliability-index

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page