GDPR Safe RAG
Production-grade Python toolkit for GDPR-compliant RAG (Retrieval-Augmented Generation) systems with automatic PII detection, audit trails, and compliance validation.
Features
- PII Detection: Automatic detection of personally identifiable information (PII), with support for UK, EU, and common patterns
- Redaction Strategies: Multiple redaction approaches (token, hash, category) for different use cases
- Audit Logging: Comprehensive audit trails with PostgreSQL or SQLite backends
- Compliance Checking: Built-in GDPR compliance validation with detailed reports
- Async-First: Full async/await support for modern Python applications
- Type Safe: Complete type hints with mypy strict mode compatibility
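To make the three redaction strategies named above concrete, here is a minimal standard-library sketch that applies each of them to email addresses. It does not use the package: the function `redact_emails`, the simplified regular expression, and the exact placeholder formats for the hash and category strategies are assumptions made for illustration only.

```python
import hashlib
import re

# Simplified email pattern for illustration only; a real detector needs stricter rules.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(text: str, strategy: str = "token") -> str:
    """Replace email addresses using one of three hypothetical strategies."""
    counter = 0

    def replace(match: re.Match) -> str:
        nonlocal counter
        counter += 1
        if strategy == "token":
            # Numbered placeholder; a separate mapping could allow restoration.
            return f"[EMAIL_{counter}]"
        if strategy == "hash":
            # Same input always yields the same placeholder.
            digest = hashlib.sha256(match.group().encode("utf-8")).hexdigest()
            return f"[EMAIL_{digest[:8]}]"
        if strategy == "category":
            # Only the kind of data survives.
            return "[EMAIL]"
        raise ValueError(f"unknown strategy: {strategy}")

    return EMAIL_RE.sub(replace, text)


sample = "Mail a@example.com and b@example.org"
print(redact_emails(sample, "token"))     # Mail [EMAIL_1] and [EMAIL_2]
print(redact_emails(sample, "category"))  # Mail [EMAIL] and [EMAIL]
```

In this sketch the token form suits cases where the original value may need to be restored later, the hash form lets repeated values be correlated without storing them, and the category form discards everything except the type.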
Quick Start
Installation
pip install gdpr-safe-rag
Or install from source:
git clone https://github.com/Charlescifix/gdpr-safe-rag.git
cd gdpr-safe-rag
pip install -e .
Basic PII Detection
from gdpr_safe_rag import PIIDetector
# Initialize detector with UK patterns
detector = PIIDetector(region="UK", detection_level="strict")
# Sample text with PII
text = """
Contact John Smith at john.smith@example.com or call 07700 900123.
His NHS number is 123 456 7890.
"""
# Detect PII
items = detector.detect(text)
for item in items:
    print(f"{item.type}: {item.value} (confidence: {item.confidence:.2f})")
# Redact PII
result = detector.redact(text)
print(result.redacted_text)
# Output: Contact John Smith at [EMAIL_1] or call [PHONE_1].
# His NHS number is [NHS_NUMBER_1].
# Get mapping for potential restoration
print(result.mapping)
# Output: {'[EMAIL_1]': 'john.smith@example.com', ...}
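The mapping above is described as being for potential restoration. As a sketch of what that could look like, the helper below substitutes each placeholder with its original value. It assumes the mapping is a plain dictionary from placeholder to original text, as the output comment suggests; `restore` is a hypothetical helper written for this example, and the package may offer its own mechanism.

```python
def restore(redacted_text: str, mapping: dict[str, str]) -> str:
    """Put original values back in place of their placeholder tokens."""
    restored = redacted_text
    for token, original in mapping.items():
        restored = restored.replace(token, original)
    return restored


# Mapping shape assumed from the example output above.
mapping = {
    "[EMAIL_1]": "john.smith@example.com",
    "[PHONE_1]": "07700 900123",
}
redacted = "Contact John Smith at [EMAIL_1] or call [PHONE_1]."
print(restore(redacted, mapping))
# Contact John Smith at john.smith@example.com or call 07700 900123.
```

Because the mapping holds the original personal data, it needs the same protection as the unredacted text.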
Audit Logging
import asyncio
from datetime import datetime, timedelta

from gdpr_safe_rag import AuditLogger


async def main():
    async with AuditLogger(storage_path="audit.db") as logger:
        # Log document ingestion
        await logger.log_ingestion(
            document_id="doc-001",
            user_id="admin",
            pii_detected=True,
            pii_count=5,
        )
        # Log user query
        await logger.log_query(
            user_id="user-123",
            query_text="What is the company policy?",
            retrieved_docs=["doc-001", "doc-002"],
        )
        # Log data deletion (right to erasure)
        await logger.log_deletion(
            user_id="user-456",
            resource="user-456-data",
            reason="user_request",
        )
        # Export compliance report for the last 30 days
        report = await logger.export_compliance_report(
            start_date=datetime.now() - timedelta(days=30),
            end_date=datetime.now(),
        )
        print(report)


asyncio.run(main())
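For readers who want a feel for what an audit trail stores, the following is a minimal, synchronous sketch using only the standard-library `sqlite3` module. It is not the package's backend: the table name `audit_events`, its columns, and the helpers `open_audit_db`, `log_event`, and `count_by_type` are all hypothetical and chosen for illustration.

```python
import json
import sqlite3
from datetime import datetime, timezone


def open_audit_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open a database and create a hypothetical audit table if missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS audit_events (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            event_type TEXT NOT NULL,
            user_id TEXT NOT NULL,
            details TEXT NOT NULL,
            created_at TEXT NOT NULL
        )
        """
    )
    return conn


def log_event(conn: sqlite3.Connection, event_type: str, user_id: str, **details) -> None:
    """Append one event; extra keyword arguments are stored as JSON."""
    conn.execute(
        "INSERT INTO audit_events (event_type, user_id, details, created_at) "
        "VALUES (?, ?, ?, ?)",
        (event_type, user_id, json.dumps(details), datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()


def count_by_type(conn: sqlite3.Connection) -> dict[str, int]:
    """Summarise the trail as a count of events per type."""
    rows = conn.execute(
        "SELECT event_type, COUNT(*) FROM audit_events GROUP BY event_type"
    )
    return dict(rows.fetchall())


conn = open_audit_db()
log_event(conn, "ingestion", "admin", document_id="doc-001", pii_count=5)
log_event(conn, "query", "user-123", retrieved_docs=["doc-001", "doc-002"])
log_event(conn, "deletion", "user-456", reason="user_request")
print(count_by_type(conn))
```

The sketch only ever inserts rows, which reflects the usual expectation that an audit trail is append-only.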
Compliance Checking
import asyncio
from datetime import datetime, timedelta

from gdpr_safe_rag import ComplianceChecker, AuditLogger


async def main():
    # Sample documents with metadata
    documents = [
        {
            "id": "doc-001",
            "created_at": datetime.now() - timedelta(days=30),
            "pii_detected": True,
            "pii_count": 5,
        },
    ]
    async with AuditLogger(storage_path="audit.db") as logger:
        checker = ComplianceChecker(retention_days=2555)
        report = await checker.run_all_checks(
            documents=documents,
            audit_logger=logger,
        )
        # Shows detailed compliance status with remediation suggestions
        print(report.to_text())


asyncio.run(main())
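One check that the `retention_days` parameter and the `created_at` field suggest is a retention test: is any document older than the allowed period? The sketch below shows that idea in isolation. The function `find_expired` is hypothetical, and how the package's own checks are implemented is not shown in this README.

```python
from datetime import datetime, timedelta
from typing import Optional


def find_expired(
    documents: list[dict],
    retention_days: int,
    now: Optional[datetime] = None,
) -> list[str]:
    """Return the ids of documents created before the retention cutoff."""
    now = now or datetime.now()
    cutoff = now - timedelta(days=retention_days)
    return [doc["id"] for doc in documents if doc["created_at"] < cutoff]


reference = datetime(2025, 1, 1)
documents = [
    {"id": "doc-001", "created_at": reference - timedelta(days=30)},
    {"id": "doc-old", "created_at": datetime(2017, 1, 1)},
]
print(find_expired(documents, retention_days=2555, now=reference))
# ['doc-old']
```

Passing `now` explicitly keeps the check deterministic, which makes it easier to test.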
Supported PII Types
UK Patterns
- Email addresses
- Phone numbers (mobile and landline)
- UK postcodes
- NHS numbers (with checksum validation)
- National Insurance numbers
- Credit card numbers (with Luhn validation)
- IBAN (with modulo 97 validation)
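The NHS number checksum mentioned in the list is a modulus 11 check over the first nine digits. The standalone sketch below implements that published algorithm; the function name `is_valid_nhs_number` is made up for this example and is not taken from the package's `validators.py`.

```python
def is_valid_nhs_number(value: str) -> bool:
    """Validate a 10-digit NHS number with the modulus 11 check digit."""
    compact = value.replace(" ", "").replace("-", "")
    if len(compact) != 10 or not (compact.isascii() and compact.isdigit()):
        return False
    digits = [int(ch) for ch in compact]
    # Weights 10 down to 2 are applied to the first nine digits.
    total = sum(d * w for d, w in zip(digits[:9], range(10, 1, -1)))
    check = 11 - (total % 11)
    if check == 11:
        check = 0
    # A computed check digit of 10 means the number is never valid.
    return check != 10 and check == digits[9]


print(is_valid_nhs_number("943 476 5919"))  # True
print(is_valid_nhs_number("943 476 5918"))  # False
```

A checksum of this kind filters out many random 10-digit strings, which reduces false positives compared with pattern matching alone.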
EU Patterns
- Email addresses
- Phone numbers
- Credit card numbers
- IBAN
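The Luhn and modulo 97 checks named in the UK list are standard algorithms that apply equally to the card numbers and IBANs in the EU list. The sketch below implements both with the standard library only; the names `luhn_valid` and `iban_valid` are hypothetical, and the IBAN check covers the checksum but not country-specific lengths.

```python
def luhn_valid(number: str) -> bool:
    """Check a card number with the Luhn algorithm."""
    compact = number.replace(" ", "").replace("-", "")
    if len(compact) < 2 or not (compact.isascii() and compact.isdigit()):
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for index, ch in enumerate(reversed(compact)):
        digit = int(ch)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def iban_valid(iban: str) -> bool:
    """Check an IBAN with the modulo 97 rule (checksum only)."""
    compact = iban.replace(" ", "").upper()
    if not 15 <= len(compact) <= 34:
        return False
    if not (compact.isascii() and compact.isalnum()):
        return False
    # Move the country code and check digits to the end,
    # then map letters to numbers (A=10 ... Z=35).
    rearranged = compact[4:] + compact[:4]
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1


print(luhn_valid("4111 1111 1111 1111"))          # True
print(iban_valid("GB82 WEST 1234 5698 7654 32"))  # True
```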
Configuration
Configuration can be set via environment variables or a .env file:
# Database
DATABASE_URL=postgresql+asyncpg://user:pass@localhost:5432/gdpr_rag
# PII Detection
PII_DETECTION_LEVEL=strict # strict, moderate, lenient
# Audit Settings
AUDIT_RETENTION_DAYS=2555 # ~7 years
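As an illustration of how an application might consume these variables, the sketch below reads them from the environment and validates them. The variable names come from the listing above; the `Settings` dataclass, the `load_settings` function, and the fallback defaults are assumptions for this example and do not describe how the package itself loads its configuration.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

ALLOWED_LEVELS = {"strict", "moderate", "lenient"}


@dataclass(frozen=True)
class Settings:
    database_url: Optional[str]
    pii_detection_level: str
    audit_retention_days: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build settings from a mapping, defaulting to the process environment."""
    env = os.environ if env is None else env
    # Defaults below are assumptions that mirror the sample values above.
    level = env.get("PII_DETECTION_LEVEL", "strict")
    if level not in ALLOWED_LEVELS:
        raise ValueError(f"PII_DETECTION_LEVEL must be one of {sorted(ALLOWED_LEVELS)}")
    retention = int(env.get("AUDIT_RETENTION_DAYS", "2555"))
    if retention <= 0:
        raise ValueError("AUDIT_RETENTION_DAYS must be positive")
    return Settings(
        database_url=env.get("DATABASE_URL"),
        pii_detection_level=level,
        audit_retention_days=retention,
    )


settings = load_settings({"PII_DETECTION_LEVEL": "moderate", "AUDIT_RETENTION_DAYS": "30"})
print(settings)
```

Note that `os.environ` does not read a .env file by itself; the file has to be loaded into the environment first, for example by a dotenv loader.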
Development
Setup
# Clone repository
git clone https://github.com/Charlescifix/gdpr-safe-rag.git
cd gdpr-safe-rag
# Install development dependencies
pip install -e ".[dev]"
# Start PostgreSQL for testing (optional)
docker-compose up -d postgres
Running Tests
# Run all tests
pytest tests/ -v
# Run with coverage
pytest tests/ -v --cov=gdpr_safe_rag --cov-report=html
# Run type checking
mypy gdpr_safe_rag --strict
# Run linting
ruff check gdpr_safe_rag
black --check gdpr_safe_rag
Running Examples
# PII Detection (no database needed)
python examples/basic_pii_detection.py
# Audit Logging
python examples/audit_logging_postgres.py
# Compliance Check
python examples/compliance_check.py
Architecture
gdpr_safe_rag/
├── pii_detector/ # PII detection and redaction
│ ├── detector.py # Main PIIDetector class
│ ├── patterns/ # Pattern definitions by region
│ ├── redactor.py # Redaction strategies
│ └── validators.py # Checksum validators
├── audit_logger/ # Audit trail functionality
│ ├── logger.py # Main AuditLogger class
│ ├── backends/ # Storage backends (PostgreSQL, SQLite, Memory)
│ └── exporters.py # Report export functionality
└── compliance_checker/ # GDPR compliance validation
    ├── checker.py # Main ComplianceChecker class
    └── checks/ # Individual compliance checks
License
MIT License - see LICENSE for details.
Contributing
Contributions are welcome! Please read our contributing guidelines and submit pull requests.
Support
- GitHub Issues: Report bugs or request features
- Documentation: Full documentation
File details
Details for the file gdpr_safe_rag-0.1.0.tar.gz.
File metadata
- Download URL: gdpr_safe_rag-0.1.0.tar.gz
- Upload date:
- Size: 38.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f4ed3136d211e746ae039800365e05a661d211786adb037018b9baf344b03729 |
| MD5 | 92f1ee8adff2b5c6d656d742f3e154e9 |
| BLAKE2b-256 | ba2f6a7e67568c0eba5cca6f98f3d97a675b09c9da4c73bab8e761b7904e3f6f |
File details
Details for the file gdpr_safe_rag-0.1.0-py3-none-any.whl.
File metadata
- Download URL: gdpr_safe_rag-0.1.0-py3-none-any.whl
- Upload date:
- Size: 40.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c3b16ba3d8e24a30caad1118a59f9f7b3a216a44f0565c2381e69a8ec7655ec8 |
| MD5 | edcd8e8b747b4650853c40e25d6cebea |
| BLAKE2b-256 | 1a268438f947ff2bcb355fc230c6a8559cc333ca46d320219b5ad6546314ae15 |