The missing validation layer between unstructured data and AI systems

🛡️ DaytaShield

PyPI version Python 3.10+ License

The missing validation layer between unstructured data and AI systems.

DaytaShield validates multimodal data (PDFs, CSVs, JSON, images) before it reaches your RAG pipelines, AI agents, or analytics systems. Stop hallucinations at the source.

🚀 Quick Start

pip install daytashield

from daytashield import ValidationPipeline, SchemaValidator, FreshnessValidator

# Create a validation pipeline
pipeline = ValidationPipeline([
    SchemaValidator(schema={"type": "object", "required": ["id", "content"]}),
    FreshnessValidator(max_age="7d"),
])

# Validate your data
result = pipeline.validate({
    "id": 1,
    "content": "Hello world",
    "timestamp": "2024-01-15"
})

print(result.status)  # ValidationStatus.PASSED if every validator passes (the timestamp must be within max_age)

✨ Features

  • 📋 Schema Validation - JSON Schema + Pydantic model validation
  • 🧠 Semantic Validation - LLM-powered content validation
  • ⏰ Freshness Checks - Detect stale data before it causes problems
  • 🔒 Compliance Rules - Built-in HIPAA, GDPR, and PII detection
  • 📄 Document Processing - PDF, CSV, JSON extraction and validation
  • 🔗 LangChain Integration - Validated retrievers for RAG pipelines
  • 📊 Audit Trail - Immutable logging for compliance
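
To make the freshness idea concrete, here is a minimal standard-library sketch of what a max_age check such as "7d" could look like. It is not DaytaShield's implementation: parse_max_age and is_fresh are hypothetical helper names, and the unit suffixes and the UTC handling of naive timestamps are assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical helpers for illustration; not part of the DaytaShield API.
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days", "w": "weeks"}


def parse_max_age(spec: str) -> timedelta:
    """Turn a string like "7d" or "12h" into a timedelta (assumed format)."""
    value, unit = spec[:-1], spec[-1:]
    if unit not in _UNITS or not value.isdigit():
        raise ValueError(f"unsupported max_age: {spec!r}")
    return timedelta(**{_UNITS[unit]: int(value)})


def is_fresh(timestamp: str, max_age: str, now: Optional[datetime] = None) -> bool:
    """Return True if an ISO 8601 timestamp is no older than max_age."""
    now = now or datetime.now(timezone.utc)
    ts = datetime.fromisoformat(timestamp)
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=timezone.utc)  # assumption: naive timestamps are UTC
    return now - ts <= parse_max_age(max_age)


# A fixed reference time keeps the example deterministic.
reference = datetime(2024, 1, 20, tzinfo=timezone.utc)
print(is_fresh("2024-01-15", "7d", now=reference))  # True
print(is_fresh("2024-01-01", "7d", now=reference))  # False
```

Passing the current time as a parameter keeps the check easy to test, which matters because a record that is fresh today will be stale a week from now.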

📖 Usage

Validate Files

from daytashield import ValidationPipeline, SchemaValidator, PDFProcessor

# Create pipeline with processors.
# invoice_schema is assumed to be a JSON Schema dict defined elsewhere.
pipeline = ValidationPipeline([
    SchemaValidator(schema=invoice_schema),
])
pipeline.add_processor(".pdf", PDFProcessor())

# Validate a PDF
result = pipeline.validate_file("invoice.pdf")
if result.failed:
    for error in result.errors:
        print(f"Error: {error.message}")

Compliance Checking

from daytashield import ValidationPipeline, ComplianceValidator

# Check for HIPAA and PII violations
pipeline = ValidationPipeline([
    ComplianceValidator(rules=["hipaa", "pii"]),
])

# patient_data is the record to check, defined elsewhere.
result = pipeline.validate(patient_data)
for message in result.messages:
    print(f"{message.severity}: {message.message}")

LangChain Integration

from langchain_community.vectorstores import FAISS
from daytashield import SchemaValidator, FreshnessValidator
from daytashield.integrations.langchain import ValidatedRetriever

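# Wrap your retriever with validation.
# vectorstore is an existing FAISS vector store and doc_schema a JSON Schema dict,
# both assumed to be defined elsewhere.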
retriever = ValidatedRetriever(
    base_retriever=vectorstore.as_retriever(),
    validators=[
        SchemaValidator(schema=doc_schema),
        FreshnessValidator(max_age="7d"),
    ],
    on_fail="filter",  # Remove invalid documents
)

# Use like any LangChain retriever
docs = retriever.invoke("What is the refund policy?")

Routing Based on Validation

from daytashield import ValidationPipeline, DataRouter, RouteAction

pipeline = ValidationPipeline([...])
router = DataRouter()

result = pipeline.validate(data)
decision = router.route(result)

if decision.route.action == RouteAction.PASS:
    send_to_destination(result.data)
elif decision.route.action == RouteAction.QUARANTINE:
    quarantine_for_review(result.data, decision.reason)

🖥️ CLI

# Validate files
daytashield validate invoice.pdf --schema invoice.json

# Validate with compliance rules
daytashield validate ./data/ --rules hipaa --rules pii

# Watch directory for new files
daytashield watch ./incoming/ --rules hipaa --audit audit.jsonl

# Query audit log
daytashield audit audit.jsonl --status failed --limit 10

📦 Validators

| Validator | Description |
| --- | --- |
| SchemaValidator | JSON Schema and Pydantic validation |
| SemanticValidator | LLM-based content validation |
| FreshnessValidator | Timestamp and staleness checks |
| ComplianceValidator | HIPAA, GDPR, PII rule enforcement |
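
As a rough illustration of the kind of check a schema validator performs, the sketch below tests a record against the required list of the Quick Start schema using only the standard library. check_required is a hypothetical helper, not a DaytaShield function, and it covers only a tiny slice of JSON Schema (object type and required keys).

```python
def check_required(record, schema):
    """Return error strings for a tiny subset of JSON Schema (illustrative only)."""
    if not isinstance(record, dict):
        return [f"expected object, got {type(record).__name__}"]
    errors = []
    for field in schema.get("required", []):
        if field not in record:
            errors.append(f"missing required field: {field}")
    return errors


schema = {"type": "object", "required": ["id", "content"]}
print(check_required({"id": 1, "content": "Hello world"}, schema))  # []
print(check_required({"id": 1}, schema))  # ['missing required field: content']
```

Returning a list of messages rather than raising on the first problem lets a caller report every missing field in one pass.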

📄 Processors

| Processor | Formats | Description |
| --- | --- | --- |
| PDFProcessor | .pdf | Text extraction with pdfplumber |
| CSVProcessor | .csv, .tsv | Tabular data with pandas |
| JSONProcessor | .json, .jsonl | Structured data with orjson |

🔒 Compliance Rules

| Rule Pack | Coverage |
| --- | --- |
| hipaa | PHI detection, medical record numbers, health plan IDs |
| gdpr | Consent checking, special category data, data minimization |
| pii | SSN, credit cards, emails, phone numbers, IP addresses |
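
For a feel of what pattern-based PII detection involves, here is a small regex sketch covering three of the categories listed for the pii pack. The patterns and the find_pii helper are illustrative assumptions, not the rules DaytaShield ships. Real detection needs stricter patterns and validation such as checksum tests for card numbers.

```python
import re

# Illustrative patterns only; they will produce false positives and negatives.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def find_pii(text):
    """Return {pattern_name: [matches]} for every pattern that matched."""
    hits = {}
    for name, pattern in PII_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            hits[name] = matches
    return hits


sample = "Contact jane@example.com, SSN 123-45-6789, from 10.0.0.1"
for kind, values in find_pii(sample).items():
    print(kind, values)
```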

📊 Audit Trail

DaytaShield maintains an immutable audit log of all validation operations:

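# ValidationStatus is used below; its import from the package root is an assumption.
from daytashield import AuditTrail, ValidationPipeline, ValidationStatus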

# Enable audit logging
audit = AuditTrail("./audit.jsonl")
pipeline = ValidationPipeline([...])

result = pipeline.validate(data)
audit.log(result)

# Query the audit trail
for entry in audit.query(status=ValidationStatus.FAILED):
    print(f"Failed: {entry.source_id} at {entry.timestamp}")

# Get statistics
stats = audit.stats()
print(f"Pass rate: {stats['by_status']['passed'] / stats['total'] * 100:.1f}%")
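
The .jsonl extension suggests one JSON object per line. The sketch below shows that general pattern with the standard library: entries are only ever appended, and queries stream the file line by line. The field names mirror the example above, but append_entry and query are hypothetical helpers, and how DaytaShield actually enforces immutability is not shown here.

```python
import json
import os
import tempfile
from datetime import datetime, timezone


def append_entry(path, source_id, status):
    """Append one JSON object per line; existing lines are never rewritten."""
    entry = {
        "source_id": source_id,
        "status": status,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")


def query(path, status):
    """Yield entries whose status matches, reading the log line by line."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            entry = json.loads(line)
            if entry["status"] == status:
                yield entry


# Write to a temporary directory so the example leaves no files behind in the CWD.
log_path = os.path.join(tempfile.mkdtemp(), "audit.jsonl")
append_entry(log_path, "invoice.pdf", "failed")
append_entry(log_path, "report.csv", "passed")
failed = [entry["source_id"] for entry in query(log_path, "failed")]
print(failed)  # ['invoice.pdf']
```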

🏗️ Architecture

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Source    │────▶│  Processor  │────▶│  Validators │
│ PDF/CSV/JSON│     │  Extract    │     │  Schema     │
└─────────────┘     └─────────────┘     │  Semantic   │
                                        │  Freshness  │
                                        │  Compliance │
                                        └──────┬──────┘
                                               │
                    ┌─────────────┐     ┌──────▼──────┐
                    │   Audit     │◀────│   Router    │
                    │   Trail     │     │  Pass/Warn  │
                    └─────────────┘     │  /Fail      │
                                        └─────────────┘
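
The flow in the diagram can be sketched in a few lines of plain Python: extract data from a source, run each validator, derive a pass/warn/fail status, route on that status, and record the outcome. Everything here is a hypothetical stand-in (Status, ACTIONS, run_flow, the two toy validators, and the "review" action for warnings); it shows the shape of the flow, not DaytaShield's internals.

```python
import json
from enum import Enum


class Status(Enum):
    PASSED = "passed"
    WARNING = "warning"
    FAILED = "failed"


# Assumed mapping from status to routing action.
ACTIONS = {Status.PASSED: "pass", Status.WARNING: "review", Status.FAILED: "quarantine"}


def run_flow(raw, extract, validators):
    """Extract data, run every validator, and return (status, messages)."""
    data = extract(raw)
    # Each validator returns a list of (severity, text) pairs.
    messages = [msg for validate in validators for msg in validate(data)]
    severities = {severity for severity, _ in messages}
    if "error" in severities:
        status = Status.FAILED
    elif "warning" in severities:
        status = Status.WARNING
    else:
        status = Status.PASSED
    return status, messages


def need_id(data):
    return [] if "id" in data else [("error", "missing id")]


def non_empty_content(data):
    return [] if data.get("content") else [("warning", "content is empty")]


audit_log = []
sources = ['{"id": 1, "content": "hi"}', '{"id": 2, "content": ""}', '{"content": "x"}']
for raw in sources:
    status, messages = run_flow(raw, json.loads, [need_id, non_empty_content])
    audit_log.append({"status": status.value, "action": ACTIONS[status], "messages": messages})

for entry in audit_log:
    print(entry["status"], entry["action"])
```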

🔧 Configuration

from daytashield import ValidationPipeline, PipelineConfig

pipeline = ValidationPipeline(
    validators=[...],
    config=PipelineConfig(
        fail_fast=True,          # Stop on first failure
        include_original_data=True,  # Keep original data in result
        auto_detect_processor=True,  # Auto-select processor by extension
    ),
)
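
The comment on fail_fast says it stops on the first failure. The standalone sketch below illustrates that trade-off with two toy validators; run_validators, has_id, and has_content are hypothetical names used only for this example.

```python
def run_validators(data, validators, fail_fast=False):
    """Collect errors from each validator, optionally stopping at the first failure."""
    errors = []
    for validate in validators:
        found = validate(data)
        errors.extend(found)
        if found and fail_fast:
            break  # skip the remaining validators
    return errors


def has_id(data):
    return [] if "id" in data else ["missing id"]


def has_content(data):
    return [] if "content" in data else ["missing content"]


print(run_validators({}, [has_id, has_content], fail_fast=True))   # ['missing id']
print(run_validators({}, [has_id, has_content], fail_fast=False))  # ['missing id', 'missing content']
```

Stopping early saves work when later validators are expensive, for example an LLM call, at the cost of a less complete error report.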

🤝 Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

# Clone the repo
git clone https://github.com/daytashield/daytashield.git
cd daytashield

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run linting
ruff check src tests

📄 License

Apache 2.0 - see LICENSE for details.


Built with ❤️ for the AI community

Stop bad data at the source. Validate before you hallucinate.

Download files

Source Distribution

daytashield-0.1.1.tar.gz (43.7 kB)

Uploaded Source

Built Distribution

daytashield-0.1.1-py3-none-any.whl (59.2 kB)

Uploaded Python 3

File details

Details for the file daytashield-0.1.1.tar.gz.

File metadata

  • Download URL: daytashield-0.1.1.tar.gz
  • Upload date:
  • Size: 43.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.6

File hashes

Hashes for daytashield-0.1.1.tar.gz

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | baeacc5e5b86d6365a7a5a4b0d88830b170cfd727feebcd166cb18ded6eba9c9 |
| MD5 | e74e4f3e3f804cc3dff7336b2bc137d5 |
| BLAKE2b-256 | dd031262f4978f87d8a028cee666145d4dbf1dbadd657e70017b69b1f4620ae2 |

File details

Details for the file daytashield-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: daytashield-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 59.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.6

File hashes

Hashes for daytashield-0.1.1-py3-none-any.whl

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 668c6f912974f4f2cce77aa29817c68ee22f324a9eabd2ca5071b08bb8fa70ab |
| MD5 | 4113b3b9801718823636066012f11aea |
| BLAKE2b-256 | fbaf1a47b430e2a8bace7e5ea35577e947be474e195e7ff3b13d451319640382 |
