DaytaShield
The missing validation layer between unstructured data and AI systems.
DaytaShield validates multimodal data (PDFs, CSVs, JSON, images) before it reaches your RAG pipelines, AI agents, or analytics systems. Stop hallucinations at the source.
Quick Start
pip install daytashield
from daytashield import ValidationPipeline, SchemaValidator, FreshnessValidator

# Create a validation pipeline
pipeline = ValidationPipeline([
    SchemaValidator(schema={"type": "object", "required": ["id", "content"]}),
    FreshnessValidator(max_age="7d"),
])

# Validate your data
result = pipeline.validate({
    "id": 1,
    "content": "Hello world",
    "timestamp": "2024-01-15"
})
print(result.status)  # ValidationStatus.PASSED
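The `max_age="7d"` argument above suggests a duration string that is compared against a record's timestamp. The following is a minimal standard-library sketch of that idea, not DaytaShield's implementation. It assumes a `<number><unit>` duration format with the units `d`, `h`, and `m`, and ISO 8601 timestamps; `parse_max_age` and `is_fresh` are hypothetical helpers.

```python
import re
from datetime import datetime, timedelta, timezone

# Hypothetical unit table; DaytaShield's accepted duration syntax may differ.
_UNITS = {"d": "days", "h": "hours", "m": "minutes"}


def parse_max_age(text: str) -> timedelta:
    """Turn a string such as '7d' or '12h' into a timedelta."""
    match = re.fullmatch(r"(\d+)([dhm])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


def is_fresh(timestamp: str, max_age: str, now: datetime) -> bool:
    """Return True if the ISO 8601 timestamp is no older than max_age."""
    recorded = datetime.fromisoformat(timestamp)
    if recorded.tzinfo is None:
        # Assumption: naive timestamps are treated as UTC.
        recorded = recorded.replace(tzinfo=timezone.utc)
    return now - recorded <= parse_max_age(max_age)


now = datetime(2024, 1, 20, tzinfo=timezone.utc)
print(is_fresh("2024-01-15", "7d", now))  # True: 5 days old
print(is_fresh("2024-01-01", "7d", now))  # False: 19 days old
```

Passing `now` explicitly keeps the check deterministic, which makes staleness rules easier to test.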
Features
- Schema Validation - JSON Schema + Pydantic model validation
- Semantic Validation - LLM-powered content validation
- Freshness Checks - Detect stale data before it causes problems
- Compliance Rules - Built-in HIPAA, GDPR, and PII detection
- Document Processing - PDF, CSV, JSON extraction and validation
- LangChain Integration - Validated retrievers for RAG pipelines
- Audit Trail - Immutable logging for compliance
Usage
Validate Files
from daytashield import ValidationPipeline, SchemaValidator, PDFProcessor

# Create pipeline with processors
# (invoice_schema is a JSON Schema dict that you define elsewhere)
pipeline = ValidationPipeline([
    SchemaValidator(schema=invoice_schema),
])
pipeline.add_processor(".pdf", PDFProcessor())

# Validate a PDF
result = pipeline.validate_file("invoice.pdf")
if result.failed:
    for error in result.errors:
        print(f"Error: {error.message}")
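The example above uses `invoice_schema` without defining it. One plausible shape, following the JSON Schema dictionary style from Quick Start, is sketched here. The field names are purely illustrative, and `missing_required` is a hypothetical stand-in that checks only the `required` keys; it is not a full JSON Schema validator and not part of DaytaShield.

```python
from typing import Any

# Illustrative schema only; the field names are assumptions.
invoice_schema: dict[str, Any] = {
    "type": "object",
    "required": ["invoice_number", "total", "issued_at"],
    "properties": {
        "invoice_number": {"type": "string"},
        "total": {"type": "number"},
        "issued_at": {"type": "string"},
    },
}


def missing_required(record: dict[str, Any], schema: dict[str, Any]) -> list[str]:
    """List the required keys that are absent from the record."""
    return [key for key in schema.get("required", []) if key not in record]


# Pretend this dict came out of a PDF extraction step.
extracted = {"invoice_number": "INV-001", "total": 129.5}
print(missing_required(extracted, invoice_schema))  # ['issued_at']
```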
Compliance Checking
from daytashield import ValidationPipeline, ComplianceValidator

# Check for HIPAA and PII violations
pipeline = ValidationPipeline([
    ComplianceValidator(rules=["hipaa", "pii"]),
])
result = pipeline.validate(patient_data)
for message in result.messages:
    print(f"{message.severity}: {message.message}")
LangChain Integration
from langchain_community.vectorstores import FAISS
from daytashield import SchemaValidator, FreshnessValidator
from daytashield.integrations.langchain import ValidatedRetriever

# vectorstore is an existing vector store (for example, a FAISS index) built elsewhere;
# doc_schema is a JSON Schema dict that you define.

# Wrap your retriever with validation
retriever = ValidatedRetriever(
    base_retriever=vectorstore.as_retriever(),
    validators=[
        SchemaValidator(schema=doc_schema),
        FreshnessValidator(max_age="7d"),
    ],
    on_fail="filter",  # Remove invalid documents
)

# Use like any LangChain retriever
docs = retriever.invoke("What is the refund policy?")
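The `on_fail="filter"` comment above says that invalid documents are removed. Conceptually, that means keeping only the documents that pass every validator. The sketch below shows the idea with plain dicts and predicate functions. `filter_valid`, `has_content`, and `has_source` are hypothetical; how `ValidatedRetriever` works internally is not shown in this README.

```python
from typing import Callable, Iterable

Document = dict  # stand-in for a retrieved document, not LangChain's Document class
Check = Callable[[Document], bool]


def filter_valid(docs: Iterable[Document], checks: list[Check]) -> list[Document]:
    """Keep only the documents that pass every check."""
    return [doc for doc in docs if all(check(doc) for check in checks)]


def has_content(doc: Document) -> bool:
    return bool(doc.get("content"))


def has_source(doc: Document) -> bool:
    return "source" in doc


retrieved = [
    {"content": "Refunds are issued within 30 days.", "source": "policy.pdf"},
    {"content": "", "source": "draft.pdf"},  # fails has_content
    {"content": "Untracked note"},           # fails has_source
]
kept = filter_valid(retrieved, [has_content, has_source])
print([doc["source"] for doc in kept])  # ['policy.pdf']
```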
Routing Based on Validation
from daytashield import ValidationPipeline, DataRouter, RouteAction

pipeline = ValidationPipeline([...])
router = DataRouter()
result = pipeline.validate(data)
decision = router.route(result)
if decision.route.action == RouteAction.PASS:
    send_to_destination(result.data)
elif decision.route.action == RouteAction.QUARANTINE:
    quarantine_for_review(result.data, decision.reason)
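Routing, as used above, maps a validation outcome to an action. The sketch below shows one way to express such a policy as a lookup table. `Status`, `Action`, and `route` are hypothetical names. The README only shows the `PASS` and `QUARANTINE` actions, so the `REJECT` action and the choice to quarantine warnings are assumptions.

```python
from enum import Enum


class Status(Enum):  # hypothetical mirror of a validation status
    PASSED = "passed"
    WARNING = "warning"
    FAILED = "failed"


class Action(Enum):  # hypothetical mirror of RouteAction
    PASS = "pass"
    QUARANTINE = "quarantine"
    REJECT = "reject"


# Assumed policy: warnings go to human review, failures are rejected.
_POLICY = {
    Status.PASSED: Action.PASS,
    Status.WARNING: Action.QUARANTINE,
    Status.FAILED: Action.REJECT,
}


def route(status: Status) -> Action:
    """Look up the action for a validation status."""
    return _POLICY[status]


for status in Status:
    print(f"{status.value} -> {route(status).value}")
```

Keeping the policy in a table rather than in branching code makes it easy to audit and to swap per environment.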
CLI
# Validate files
daytashield validate invoice.pdf --schema invoice.json
# Validate with compliance rules
daytashield validate ./data/ --rules hipaa --rules pii
# Watch directory for new files
daytashield watch ./incoming/ --rules hipaa --audit audit.jsonl
# Query audit log
daytashield audit audit.jsonl --status failed --limit 10
Validators
| Validator | Description |
|---|---|
| `SchemaValidator` | JSON Schema and Pydantic validation |
| `SemanticValidator` | LLM-based content validation |
| `FreshnessValidator` | Timestamp and staleness checks |
| `ComplianceValidator` | HIPAA, GDPR, PII rule enforcement |
Processors
| Processor | Formats | Description |
|---|---|---|
| `PDFProcessor` | `.pdf` | Text extraction with pdfplumber |
| `CSVProcessor` | `.csv`, `.tsv` | Tabular data with pandas |
| `JSONProcessor` | `.json`, `.jsonl` | Structured data with orjson |
Compliance Rules
| Rule Pack | Coverage |
|---|---|
| `hipaa` | PHI detection, medical record numbers, health plan IDs |
| `gdpr` | Consent checking, special category data, data minimization |
| `pii` | SSN, credit cards, emails, phone numbers, IP addresses |
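Pattern matching is one common way to detect the kinds of identifiers the `pii` rule pack lists. The sketch below uses deliberately simplified regular expressions. They are assumptions for illustration, not DaytaShield's rules, and they will miss many real-world formats and produce false positives. `PII_PATTERNS` and `find_pii` are hypothetical names.

```python
import re

# Simplified, illustrative patterns; production detectors need far more care.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def find_pii(text: str) -> dict[str, list[str]]:
    """Return every match per PII category, omitting empty categories."""
    hits = {name: pattern.findall(text) for name, pattern in PII_PATTERNS.items()}
    return {name: found for name, found in hits.items() if found}


sample = "Contact jane.doe@example.com, SSN 123-45-6789, from host 10.0.0.7."
print(find_pii(sample))
```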
Audit Trail
DaytaShield maintains an immutable audit log of all validation operations:
# ValidationStatus is assumed to be exported at the package top level
from daytashield import AuditTrail, ValidationPipeline, ValidationStatus

# Enable audit logging
audit = AuditTrail("./audit.jsonl")
pipeline = ValidationPipeline([...])
result = pipeline.validate(data)
audit.log(result)

# Query the audit trail
for entry in audit.query(status=ValidationStatus.FAILED):
    print(f"Failed: {entry.source_id} at {entry.timestamp}")

# Get statistics
stats = audit.stats()
print(f"Pass rate: {stats['by_status']['passed'] / stats['total'] * 100:.1f}%")
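The `audit.jsonl` file name suggests one JSON object per line, appended as validations run. The sketch below shows an append-only log of that shape and a pass-rate calculation over it, using only the standard library. The record fields (`source_id`, `status`, `timestamp`) follow the names used in the example above, but the on-disk format is an assumption, and `append_entry` and `pass_rate` are hypothetical helpers. The sketch does not cover tamper-proofing.

```python
import json
import tempfile
from collections import Counter
from datetime import datetime, timezone
from pathlib import Path


def append_entry(path: Path, source_id: str, status: str) -> None:
    """Append one audit record as a single JSON line; existing lines are never rewritten."""
    entry = {
        "source_id": source_id,
        "status": status,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")


def pass_rate(path: Path) -> float:
    """Share of entries whose status is 'passed', as a percentage."""
    with path.open(encoding="utf-8") as handle:
        counts = Counter(json.loads(line)["status"] for line in handle if line.strip())
    total = sum(counts.values())
    return 100.0 * counts["passed"] / total if total else 0.0


with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "audit.jsonl"
    records = [("a.pdf", "passed"), ("b.csv", "failed"), ("c.json", "passed"), ("d.pdf", "passed")]
    for source, status in records:
        append_entry(log, source, status)
    rate = pass_rate(log)
    print(f"Pass rate: {rate:.1f}%")  # Pass rate: 75.0%
```

The zero-total guard in `pass_rate` avoids a division error on an empty log, which is worth handling in any statistics code built on the audit trail.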
Architecture
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Source    │────▶│  Processor  │────▶│ Validators  │
│ PDF/CSV/JSON│     │   Extract   │     │   Schema    │
└─────────────┘     └─────────────┘     │  Semantic   │
                                        │  Freshness  │
                                        │ Compliance  │
                                        └──────┬──────┘
                                               │
                    ┌─────────────┐     ┌──────▼──────┐
                    │    Audit    │◀────│   Router    │
                    │    Trail    │     │  Pass/Warn  │
                    └─────────────┘     │    /Fail    │
                                        └─────────────┘
Configuration
from daytashield import ValidationPipeline, PipelineConfig

pipeline = ValidationPipeline(
    validators=[...],
    config=PipelineConfig(
        fail_fast=True,  # Stop on first failure
        include_original_data=True,  # Keep original data in result
        auto_detect_processor=True,  # Auto-select processor by extension
    ),
)
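The `fail_fast` comment above says the pipeline stops on the first failure. The sketch below contrasts that with collecting every error. It is a conceptual illustration, not the library's code: `run_checks`, `require_id`, and `require_content` are hypothetical, and each check returns an error message or `None`.

```python
from typing import Callable, Optional

Check = Callable[[dict], Optional[str]]  # returns an error message, or None on success


def require_id(record: dict) -> Optional[str]:
    return None if "id" in record else "missing id"


def require_content(record: dict) -> Optional[str]:
    return None if record.get("content") else "empty content"


def run_checks(record: dict, checks: list[Check], fail_fast: bool) -> list[str]:
    """Collect error messages, stopping after the first one when fail_fast is set."""
    errors: list[str] = []
    for check in checks:
        message = check(record)
        if message is not None:
            errors.append(message)
            if fail_fast:
                break
    return errors


bad = {"content": ""}
checks = [require_id, require_content]
print(run_checks(bad, checks, fail_fast=True))   # ['missing id']
print(run_checks(bad, checks, fail_fast=False))  # ['missing id', 'empty content']
```

Stopping early saves work when later validators are expensive (for example, LLM-based checks), while collecting all errors gives a fuller report per record.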
Contributing
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
# Clone the repo
git clone https://github.com/daytashield/daytashield.git
cd daytashield
# Install dev dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Run linting
ruff check src tests
License
Apache 2.0 - see LICENSE for details.
Built with ❤️ for the AI community
Stop bad data at the source. Validate before you hallucinate.