GDPR Safe RAG

Python 3.10+ | License: MIT

Production-grade Python toolkit for GDPR-compliant RAG (Retrieval-Augmented Generation) systems with automatic PII detection, audit trails, and compliance validation.

Features

  • PII Detection: Automatic detection of personally identifiable information (PII), with UK, EU, and common patterns
  • Redaction Strategies: Multiple redaction approaches (token, hash, category) for different use cases
  • Audit Logging: Comprehensive audit trails with PostgreSQL or SQLite backends
  • Compliance Checking: Built-in GDPR compliance validation with detailed reports
  • Async-First: Full async/await support for modern Python applications
  • Type Safe: Complete type hints with mypy strict mode compatibility
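
To make the three redaction strategies named above concrete, here is a minimal standalone sketch of how token, hash, and category redaction typically differ. It does not use the library: `redact_emails`, `EMAIL_RE`, and the exact placeholder formats are hypothetical and exist only for illustration.

```python
import hashlib
import re

# Deliberately simple email pattern, for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(text: str, strategy: str = "token") -> str:
    """Replace every email address in `text` according to `strategy`."""
    seen: dict[str, int] = {}  # value -> index, so repeated values share one token

    def replace(match: re.Match[str]) -> str:
        value = match.group(0)
        if strategy == "token":
            # Numbered placeholder; reversible if a token-to-value mapping is kept.
            index = seen.setdefault(value, len(seen) + 1)
            return f"[EMAIL_{index}]"
        if strategy == "hash":
            # Stable digest: equal values stay linkable without storing a mapping.
            digest = hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]
            return f"[EMAIL:{digest}]"
        if strategy == "category":
            # Only the category survives; neither reversible nor linkable.
            return "[EMAIL]"
        raise ValueError(f"unknown strategy: {strategy}")

    return EMAIL_RE.sub(replace, text)


sample = "Mail a@x.org and b@y.org, then a@x.org again."
print(redact_emails(sample, "token"))
print(redact_emails(sample, "hash"))
print(redact_emails(sample, "category"))
```

Roughly, token redaction suits cases where the original may need to be restored, hashing suits cases where records must stay joinable, and category redaction suits cases where neither is needed. An unsalted hash of a guessable value such as an email address can be reversed by dictionary lookup, so a keyed hash (for example HMAC) is the usual hardening step.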

Quick Start

Installation

pip install gdpr-safe-rag

Or install from source:

git clone https://github.com/Charlescifix/gdpr-safe-rag.git
cd gdpr-safe-rag
pip install -e .

Basic PII Detection

from gdpr_safe_rag import PIIDetector

# Initialize detector with UK patterns
detector = PIIDetector(region="UK", detection_level="strict")

# Sample text with PII
text = """
Contact John Smith at john.smith@example.com or call 07700 900123.
His NHS number is 123 456 7890.
"""

# Detect PII
items = detector.detect(text)
for item in items:
    print(f"{item.type}: {item.value} (confidence: {item.confidence:.2f})")

# Redact PII
result = detector.redact(text)
print(result.redacted_text)
# Output: Contact John Smith at [EMAIL_1] or call [PHONE_1].
#         His NHS number is [NHS_NUMBER_1].

# Get mapping for potential restoration
print(result.mapping)
# Output: {'[EMAIL_1]': 'john.smith@example.com', ...}
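
The mapping shown above is what makes restoration possible. As a sketch, assuming the mapping is a plain dict from placeholder token to original value (as the sample output suggests), restoring the text is a series of string replacements. The `restore` helper is hypothetical and not part of the library.

```python
def restore(redacted_text: str, mapping: dict[str, str]) -> str:
    """Put original values back in place of their placeholder tokens."""
    # Replace longer tokens first so one token can never clobber part of another.
    for token in sorted(mapping, key=len, reverse=True):
        redacted_text = redacted_text.replace(token, mapping[token])
    return redacted_text


mapping = {
    "[EMAIL_1]": "john.smith@example.com",
    "[PHONE_1]": "07700 900123",
}
print(restore("Contact [EMAIL_1] or call [PHONE_1].", mapping))
```

Because the mapping contains the original personal data, it should be stored and access-controlled at least as carefully as the source documents, and kept separate from the redacted text.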

Audit Logging

import asyncio
from datetime import datetime, timedelta

from gdpr_safe_rag import AuditLogger

async def main():
    async with AuditLogger(storage_path="audit.db") as logger:
        # Log document ingestion
        await logger.log_ingestion(
            document_id="doc-001",
            user_id="admin",
            pii_detected=True,
            pii_count=5,
        )

        # Log user query
        await logger.log_query(
            user_id="user-123",
            query_text="What is the company policy?",
            retrieved_docs=["doc-001", "doc-002"],
        )

        # Log data deletion (right to erasure)
        await logger.log_deletion(
            user_id="user-456",
            resource="user-456-data",
            reason="user_request",
        )

        # Export a compliance report covering the last 30 days
        now = datetime.now()
        report = await logger.export_compliance_report(
            start_date=now - timedelta(days=30),
            end_date=now,
        )
        print(report)

asyncio.run(main())
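
For readers who want to see what an audit trail of this kind boils down to, the following is a rough standalone sketch built on the standard library's sqlite3 module. It is not the library's schema or implementation: the table name, columns, and the `open_audit_db`, `log_event`, and `summarize` helpers are all assumptions made for illustration.

```python
import json
import sqlite3
from datetime import datetime, timedelta, timezone


def _timestamp(moment: datetime) -> str:
    # Fixed-width UTC ISO 8601 strings sort the same way as the instants they encode.
    return moment.astimezone(timezone.utc).isoformat(timespec="microseconds")


def open_audit_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a database holding a single audit_events table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS audit_events (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               event_type TEXT NOT NULL,
               user_id TEXT NOT NULL,
               created_at TEXT NOT NULL,
               details TEXT NOT NULL
           )"""
    )
    return conn


def log_event(conn: sqlite3.Connection, event_type: str, user_id: str, **details) -> None:
    """Append one event; extra keyword arguments are stored as JSON."""
    conn.execute(
        "INSERT INTO audit_events (event_type, user_id, created_at, details) "
        "VALUES (?, ?, ?, ?)",
        (event_type, user_id, _timestamp(datetime.now(timezone.utc)), json.dumps(details)),
    )
    conn.commit()


def summarize(conn: sqlite3.Connection, days: int = 30) -> dict[str, int]:
    """Count events per type over the last `days` days."""
    end = datetime.now(timezone.utc)
    start = end - timedelta(days=days)
    rows = conn.execute(
        "SELECT event_type, COUNT(*) FROM audit_events "
        "WHERE created_at BETWEEN ? AND ? GROUP BY event_type",
        (_timestamp(start), _timestamp(end)),
    )
    return dict(rows.fetchall())


conn = open_audit_db()
log_event(conn, "ingestion", "admin", document_id="doc-001", pii_count=5)
log_event(conn, "query", "user-123", retrieved_docs=["doc-001", "doc-002"])
log_event(conn, "deletion", "user-456", reason="user_request")
print(summarize(conn))
```

The sketch only ever inserts rows, which mirrors the append-only nature an audit trail usually needs. Note that it stores identifiers and counts rather than document contents.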

Compliance Checking

import asyncio
from datetime import datetime, timedelta

from gdpr_safe_rag import ComplianceChecker, AuditLogger

async def main():
    # Sample documents with metadata
    documents = [
        {
            "id": "doc-001",
            "created_at": datetime.now() - timedelta(days=30),
            "pii_detected": True,
            "pii_count": 5,
        },
    ]

    async with AuditLogger(storage_path="audit.db") as logger:
        checker = ComplianceChecker(retention_days=2555)
        report = await checker.run_all_checks(
            documents=documents,
            audit_logger=logger,
        )

        print(report.to_text())
        # Shows detailed compliance status with remediation suggestions

asyncio.run(main())

Supported PII Types

UK Patterns

  • Email addresses
  • Phone numbers (mobile and landline)
  • UK postcodes
  • NHS numbers (with checksum validation)
  • National Insurance numbers
  • Credit card numbers (with Luhn validation)
  • IBAN (with modulo 97 validation)
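
The three validations named in this list are standard public algorithms, and checking them after a pattern match is what cuts down false positives on arbitrary digit runs. The functions below sketch the textbook versions (NHS modulus 11, Luhn, and IBAN mod 97). They are written for illustration and are not taken from the library's validators.py.

```python
def _digits(value: str) -> list[int]:
    """Extract decimal digits, ignoring spaces and other separators."""
    return [int(ch) for ch in value if ch.isdecimal()]


def nhs_number_valid(number: str) -> bool:
    """Modulus 11 check used for 10-digit NHS numbers."""
    digits = _digits(number)
    if len(digits) != 10:
        return False
    # Weight the first nine digits 10 down to 2.
    total = sum(d * w for d, w in zip(digits[:9], range(10, 1, -1)))
    check = 11 - total % 11
    if check == 11:
        check = 0
    # A computed check digit of 10 means the number can never be valid.
    return check != 10 and check == digits[9]


def luhn_valid(number: str) -> bool:
    """Luhn check used for payment card numbers."""
    digits = _digits(number)
    if len(digits) < 2:
        return False
    total = 0
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:  # every second digit, counting from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def iban_valid(iban: str) -> bool:
    """Mod 97 check on an IBAN (country-specific length rules are not checked)."""
    compact = "".join(iban.split()).upper()
    if len(compact) < 5 or not (compact.isascii() and compact.isalnum()):
        return False
    rearranged = compact[4:] + compact[:4]  # move country code and check digits to the end
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)  # A=10 ... Z=35
    return int(numeric) % 97 == 1


print(nhs_number_valid("943 476 5919"))
print(luhn_valid("4111 1111 1111 1111"))
print(iban_valid("GB82 WEST 1234 5698 7654 32"))
```

All three sample values are widely published test numbers, not real identifiers.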

EU Patterns

  • Email addresses
  • Phone numbers
  • Credit card numbers
  • IBAN

Configuration

Configuration can be set via environment variables or a .env file:

# Database
DATABASE_URL=postgresql+asyncpg://user:pass@localhost:5432/gdpr_rag

# PII Detection
PII_DETECTION_LEVEL=strict  # strict, moderate, lenient

# Audit Settings
AUDIT_RETENTION_DAYS=2555  # ~7 years

Development

Setup

# Clone repository
git clone https://github.com/Charlescifix/gdpr-safe-rag.git
cd gdpr-safe-rag

# Install development dependencies
pip install -e ".[dev]"

# Start PostgreSQL for testing (optional)
docker-compose up -d postgres

Running Tests

# Run all tests
pytest tests/ -v

# Run with coverage
pytest tests/ -v --cov=gdpr_safe_rag --cov-report=html

# Run type checking
mypy gdpr_safe_rag --strict

# Run linting
ruff check gdpr_safe_rag
black --check gdpr_safe_rag

Running Examples

# PII Detection (no database needed)
python examples/basic_pii_detection.py

# Audit Logging
python examples/audit_logging_postgres.py

# Compliance Check
python examples/compliance_check.py

Architecture

gdpr_safe_rag/
├── pii_detector/        # PII detection and redaction
│   ├── detector.py      # Main PIIDetector class
│   ├── patterns/        # Pattern definitions by region
│   ├── redactor.py      # Redaction strategies
│   └── validators.py    # Checksum validators
├── audit_logger/        # Audit trail functionality
│   ├── logger.py        # Main AuditLogger class
│   ├── backends/        # Storage backends (PostgreSQL, SQLite, Memory)
│   └── exporters.py     # Report export functionality
└── compliance_checker/  # GDPR compliance validation
    ├── checker.py       # Main ComplianceChecker class
    └── checks/          # Individual compliance checks

License

MIT License - see LICENSE for details.

Contributing

Contributions are welcome! Please read our contributing guidelines and submit pull requests.
