Privacy-aware query classification and routing for RAG systems - the protected space where your knowledge stays yours
Lacuna
Protected space for data governance, lineage, and privacy-aware operations
The Problem
Organizations deploying local LLMs and data platforms face a critical challenge: How do you enable self-service data access while maintaining governance, lineage tracking, and compliance?
Current solutions require choosing between:
- Strict centralized control → Bottlenecks, slow innovation
- Complete self-service → Compliance violations, data leaks, audit failures
Lacuna solves this by creating a "protected space" where:
- Users see the governance impact of their actions in real time
- Central teams define policies as code
- Systems automatically classify and route data operations
- Complete audit trails satisfy ISO 27001/27002
- Lineage and provenance are captured automatically
The Solution
Lacuna is a policy-aware data governance engine that:
- Classifies data operations automatically using a three-layer pipeline (heuristics → embeddings → LLM)
- Enforces policies in real time with clear, actionable feedback to users
- Tracks complete lineage across transformations, joins, and exports
- Captures comprehensive provenance (who, what, when, why, how)
- Maintains ISO 27001-compliant audit logs with tamper-evident hash chains
- Integrates with existing tools (dbt, Databricks, Snowflake, OPA)
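The three-layer cascade can be pictured as a chain of classifiers, each handing off to the next only when it cannot decide. The sketch below is an illustration under stated assumptions, not Lacuna's implementation: the function names, the regex patterns, and the stubbed embedding and LLM layers are all hypothetical.

```python
import re
from typing import Callable, Optional

# Hypothetical patterns; a real deployment would load these from configuration.
SENSITIVE_PATTERNS = [
    re.compile(r"\b(ssn|passport|salary|customer)", re.IGNORECASE),
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),  # email-like strings
]
PUBLIC_PATTERNS = [re.compile(r"\b(readme|license|docs?)\b", re.IGNORECASE)]


def heuristic_layer(text: str) -> Optional[str]:
    """Layer 1: cheap regex checks. Returns None when undecided."""
    if any(p.search(text) for p in SENSITIVE_PATTERNS):
        return "PROPRIETARY"
    if any(p.search(text) for p in PUBLIC_PATTERNS):
        return "PUBLIC"
    return None


def embedding_layer(text: str) -> Optional[str]:
    """Layer 2 (stub): would compare an embedding against labeled examples."""
    return None


def llm_layer(text: str) -> Optional[str]:
    """Layer 3 (stub): would ask a local LLM; here it returns a fixed tier."""
    return "INTERNAL"


LAYERS: list[tuple[str, Callable[[str], Optional[str]]]] = [
    ("heuristics", heuristic_layer),
    ("embeddings", embedding_layer),
    ("llm", llm_layer),
]


def classify(text: str) -> tuple[str, str]:
    """Return (tier, deciding_layer), trying the cheapest layer first."""
    for name, layer in LAYERS:
        tier = layer(text)
        if tier is not None:
            return tier, name
    # Fail closed if no layer reaches a decision.
    return "PROPRIETARY", "default"


print(classify("customers.csv"))
print(classify("quarterly numbers"))
```

Ordering the layers by cost means the expensive LLM call is only reached by inputs the cheaper layers leave undecided.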
Core Use Cases
Use Case 1: Real-Time Policy Enforcement
Scenario: Data analyst attempting to export customer data
# User's notebook
import pandas as pd
customers = pd.read_csv("customers.csv")
# ✓ Lacuna detects: PII data loaded, context updated
sales = pd.read_csv("sales.csv")  # second input for the join below
analysis = customers.merge(sales, on="customer_id")
# ✓ Lacuna classifies: PII propagates through join
analysis.to_csv("~/Downloads/export.csv")
# ✗ Lacuna blocks with clear message:
"""
❌ Governance Policy Violation
Action: Export to ~/Downloads/export.csv
Reason: Cannot export PII data to unmanaged location
Classification: PROPRIETARY (inherited from customers.csv)
Tags: PII, GDPR, FINANCIAL
Alternatives:
1. Use anonymized version: analysis_anon = anonymize(analysis, ['customer_id', 'email'])
2. Save to governed location: analysis.to_csv("/governed/workspace/analysis.csv")
3. Request exception: https://governance.example.com/exception
Policy: P-2024-001 (PII Export Restrictions)
Steward: data-governance@example.com
"""
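A decision like the one in the message above can be modeled as a small function that returns allow/deny, a reason, and alternatives. This is a hedged sketch only: in Lacuna the rules are written as OPA policies, and the `Decision` class, `evaluate_export` function, and `/governed/` prefix check here are assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Assumed marker for managed storage, taken from the example path above.
GOVERNED_PREFIX = "/governed/"


@dataclass
class Decision:
    allowed: bool
    reason: str
    alternatives: list[str] = field(default_factory=list)


def evaluate_export(tags: set[str], destination: str) -> Decision:
    """Deny exports of PII-tagged data to locations outside governed storage."""
    if "PII" in tags and not destination.startswith(GOVERNED_PREFIX):
        return Decision(
            allowed=False,
            reason="Cannot export PII data to unmanaged location",
            alternatives=[
                "Anonymize the PII columns before exporting",
                f"Save to a governed location under {GOVERNED_PREFIX}",
                "Request an exception from the data steward",
            ],
        )
    return Decision(allowed=True, reason="No restricting policy matched")


decision = evaluate_export({"PII", "GDPR", "FINANCIAL"}, "~/Downloads/export.csv")
print(decision.allowed, "-", decision.reason)
```

Returning alternatives alongside the denial is what lets the user-facing message suggest a next step rather than only blocking.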
Use Case 2: Automated Lineage Tracking
Scenario: Understanding data dependencies
from lacuna import LineageTracker
# Query lineage
lineage = LineageTracker.get_lineage("analysis.csv")
print(lineage.to_graph())
"""
analysis.csv (PROPRIETARY, tags: PII, GDPR, FINANCIAL)
├─ customers.csv (PROPRIETARY, tags: PII, GDPR)
│  └─ raw.customer_master (PROPRIETARY, tags: PII)
│     └─ salesforce.contacts (PROPRIETARY, tags: PII)
└─ sales.csv (INTERNAL, tags: FINANCIAL)
   └─ raw.transactions (INTERNAL, tags: FINANCIAL)
"""
# Check downstream impact
downstream = LineageTracker.get_downstream("customers.csv")
print(f"Changing customers.csv will impact {len(downstream)} artifacts")
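Conceptually, a downstream-impact query is a traversal of the lineage graph. The sketch below shows one way to do it with a breadth-first search; how `LineageTracker` works internally is not described here, so the edge map and the standalone `get_downstream` function are hypothetical and use sample data mirroring the tree above.

```python
from collections import deque

# Assumed sample edges: each source maps to the artifacts derived from it.
DERIVED_FROM = {
    "salesforce.contacts": ["raw.customer_master"],
    "raw.customer_master": ["customers.csv"],
    "raw.transactions": ["sales.csv"],
    "customers.csv": ["analysis.csv"],
    "sales.csv": ["analysis.csv"],
}


def get_downstream(artifact: str, edges: dict[str, list[str]]) -> set[str]:
    """Collect every artifact reachable from `artifact` via derivation edges."""
    seen: set[str] = set()
    queue = deque([artifact])
    while queue:
        current = queue.popleft()
        for child in edges.get(current, []):
            if child not in seen:  # guard against revisiting shared descendants
                seen.add(child)
                queue.append(child)
    return seen


print(sorted(get_downstream("salesforce.contacts", DERIVED_FROM)))
```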
Use Case 3: ISO 27001 Audit Compliance
Scenario: Annual compliance audit
from lacuna.audit import ComplianceReporter
# Generate ISO 27001 A.9.4 report (Access Control)
report = ComplianceReporter.generate_a_9_4_report(
    start_date="2025-01-01",
    end_date="2025-12-31"
)
# Report includes:
# - All data access attempts (successful and failed)
# - Classification decisions with reasoning
# - Policy violations with user responses
# - Administrative actions
# - Complete audit trail with hash chain verification
Architecture Overview
┌─────────────────────────────────────────────────────────────┐
│ User Data Operation │
│ (read, write, join, export, transform, query) │
└───────────────────────────┬─────────────────────────────────┘
│
↓
┌─────────────────────────────────────────────────────────────┐
│ Operation Interceptor Layer │
│ • File system operations (FUSE) │
│ • Database queries (SQLAlchemy middleware) │
│ • Notebook operations (IPython magic) │
│ • dbt runs (post-hooks) │
│ • API calls (proxy) │
└───────────────────────────┬─────────────────────────────────┘
│
↓
┌─────────────────────────────────────────────────────────────┐
│ Three-Layer Classification Pipeline │
│ │
│ Layer 1: Heuristics (<1ms) │
│ ├─ Regex patterns for known sensitive terms │
│ ├─ File path analysis │
│ └─ Handles 90% of operations │
│ │
│ Layer 2: Embeddings (<10ms) │
│ ├─ Semantic similarity to known examples │
│ ├─ Pre-computed embeddings │
│ └─ Handles 8% of operations │
│ │
│ Layer 3: LLM Reasoning (<200ms) │
│ ├─ Complex context-dependent decisions │
│ ├─ Multi-source lineage inference │
│ └─ Handles 2% of operations │
└───────────────────────────┬─────────────────────────────────┘
│
↓
┌─────────────────────────────────────────────────────────────┐
│ Lineage & Provenance Engine │
│ • Track source → transformation → destination │
│ • Classify derived data (inheritance rules) │
│ • Tag propagation through operations │
│ • Business context capture │
└───────────────────────────┬─────────────────────────────────┘
│
↓
┌─────────────────────────────────────────────────────────────┐
│ Policy Engine (OPA Integration) │
│ • Evaluate operation against policies │
│ • Consider: data tier, user role, destination, purpose │
│ • Return: allow/deny + reasoning + alternatives │
└───────────────────────────┬─────────────────────────────────┘
│
↓
┌─────────────────────────────────────────────────────────────┐
│ ISO 27001 Audit Logging │
│ • Tamper-evident hash chain │
│ • Complete provenance (who, what, when, why, how) │
│ • PostgreSQL append-only storage │
│ • Real-time alerting for violations │
│ • Compliance report generation │
└───────────────────────────┬─────────────────────────────────┘
│
↓
┌─────────────────────────────────────────────────────────────┐
│ User Feedback Interface │
│ • Inline notebook warnings │
│ • IDE integration (VS Code, PyCharm) │
│ • CLI pre-execution checks │
│ • Web dashboard for compliance status │
└─────────────────────────────────────────────────────────────┘
Sensitivity Tiers
Lacuna classifies all data into three tiers:
PROPRIETARY
- Definition: Data whose disclosure would give competitors an advantage or breach confidentiality
- Examples: Customer PII, proprietary algorithms, internal pricing, strategic plans
- Routing: Local only, requires approval for export
- Retention: 7+ years for compliance
INTERNAL
- Definition: Data that should remain within organization but isn't competitively sensitive
- Examples: Internal tooling, team processes, general analytics
- Routing: Internal systems, no external sharing
- Retention: 1-3 years
PUBLIC
- Definition: Information that is or could be publicly available
- Examples: Public documentation, open-source code, published research
- Routing: No restrictions
- Retention: 1 year minimum
Key principle: Classification propagates through lineage. A derived dataset takes the most restrictive tier of its inputs, so joining PUBLIC and PROPRIETARY data yields PROPRIETARY data.
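The propagation rule can be expressed as taking the most restrictive input tier and the union of input tags. The following is a minimal sketch of that rule; `TIER_RANK` and `derive_classification` are hypothetical names, not part of Lacuna's API.

```python
# Higher rank means more restrictive.
TIER_RANK = {"PUBLIC": 0, "INTERNAL": 1, "PROPRIETARY": 2}


def derive_classification(inputs: list[tuple[str, set[str]]]) -> tuple[str, set[str]]:
    """Combine (tier, tags) pairs from all inputs of a join or transformation."""
    if not inputs:
        raise ValueError("at least one input is required")
    # The derived tier is the most restrictive tier among the inputs.
    tier = max((t for t, _ in inputs), key=TIER_RANK.__getitem__)
    # Tags accumulate: anything tagged on an input is tagged on the output.
    tags = set().union(*(tags for _, tags in inputs))
    return tier, tags


customers = ("PROPRIETARY", {"PII", "GDPR"})
sales = ("INTERNAL", {"FINANCIAL"})
print(derive_classification([customers, sales]))
```

Applied to the earlier example, joining `customers.csv` with `sales.csv` gives a PROPRIETARY result tagged PII, GDPR, and FINANCIAL, matching the lineage output shown for `analysis.csv`.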
Key Features
Governance & Classification
- Automatic data classification using three-layer pipeline (heuristics, embeddings, LLM)
- Context-aware decisions considering conversation, files, lineage
- Policy-as-code using Open Policy Agent (OPA)
- User override with feedback loop for continuous improvement
Lineage & Provenance
- Automatic lineage tracking across file operations, SQL queries, transformations
- Classification inheritance through joins, aggregations, derivations
- Tag propagation (PII, PHI, FINANCIAL) through data flows
- Business context capture (purpose, justification, approvals)
Audit & Compliance
- ISO 27001-compliant logging with tamper-evident hash chains
- Complete provenance (who, what, when, why, how)
- Real-time alerting for policy violations and security events
- Compliance reports (A.9.4, A.12.4, GDPR, HIPAA)
- 7-year retention with automated archival to cold storage
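A tamper-evident hash chain links each audit entry to the hash of the one before it, so altering any earlier record invalidates every later hash. The sketch below illustrates the general technique with SHA-256 and in-memory storage; the record fields and function names are assumptions, and Lacuna's actual log schema may differ.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry


def entry_hash(prev_hash: str, record: dict) -> str:
    """Hash the previous entry's hash together with a canonical record encoding."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_entry(chain: list[dict], record: dict) -> None:
    """Append a record, linking it to the current tail of the chain."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    chain.append(
        {"record": record, "prev_hash": prev_hash, "hash": entry_hash(prev_hash, record)}
    )


def verify_chain(chain: list[dict]) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS_HASH
    for entry in chain:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != entry_hash(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True


log: list[dict] = []
append_entry(log, {"user": "analyst", "action": "export", "result": "denied"})
append_entry(log, {"user": "analyst", "action": "read", "result": "allowed"})
print(verify_chain(log))

log[0]["record"]["result"] = "allowed"  # simulate tampering with an old entry
print(verify_chain(log))
```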
Integration & Extensibility
- Pluggable architecture for custom classifiers and policies
- Native integrations: dbt, Databricks Unity Catalog, Snowflake, OPA
- Developer tools: Jupyter magic, VS Code extension, CLI
- REST API for custom integrations
Performance
- <10ms classification for 98% of operations (heuristics + embeddings)
- Caching layer for repeated patterns
- Asynchronous audit logging (non-blocking)
- Batch processing for bulk operations
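Non-blocking audit logging is commonly built from a queue and a background writer: the caller enqueues an event and returns at once, while a worker thread persists it. The sketch below shows that pattern using only the standard library; `AsyncAuditLogger` and its `sink` callback are hypothetical, not Lacuna's API.

```python
import queue
import threading
from typing import Callable, Optional


class AsyncAuditLogger:
    """Queue events on the caller's thread; write them on a background thread."""

    def __init__(self, sink: Callable[[dict], None]) -> None:
        self._queue: "queue.Queue[Optional[dict]]" = queue.Queue()
        self._sink = sink
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def log(self, event: dict) -> None:
        """Enqueue and return immediately; the operation is not delayed."""
        self._queue.put(event)

    def _drain(self) -> None:
        while True:
            event = self._queue.get()
            if event is None:  # sentinel: stop after all earlier events are written
                break
            self._sink(event)

    def close(self) -> None:
        """Flush pending events and stop the worker."""
        self._queue.put(None)
        self._worker.join()


written: list[dict] = []
logger = AsyncAuditLogger(sink=written.append)
for i in range(3):
    logger.log({"seq": i, "action": "read"})
logger.close()
print(len(written))
```

A single worker draining a FIFO queue preserves event order, which matters if the sink also maintains a hash chain.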
Quick Start
Development Mode
The fastest way to try Lacuna locally:
# Clone and install
git clone https://github.com/witlox/lacuna.git
cd lacuna
pip install -e .
# Start in dev mode (uses SQLite, no external dependencies)
lacuna dev
# Open in browser
# API Docs: http://127.0.0.1:8000/docs
# User Dashboard: http://127.0.0.1:8000/user/dashboard
# Admin Dashboard: http://127.0.0.1:8000/admin/
Dev mode uses lightweight backends (SQLite, in-memory cache) so you can explore Lacuna without setting up PostgreSQL, Redis, or OPA.
Production Mode
For production deployments with full features:
# Using Docker
docker pull ghcr.io/witlox/lacuna:latest
docker run -d -p 8000:8000 ghcr.io/witlox/lacuna:latest
# Or install via pip
pip install lacuna
lacuna serve --host 0.0.0.0 --port 8000
See Deployment Guide for details, or use the production-ready configurations:
# Docker Compose production stack
docker compose -f deploy/docker/docker-compose.prod.yaml up -d
# High-availability with PostgreSQL replication
docker compose -f deploy/docker/docker-compose.ha.yaml up -d
# Kubernetes with Helm
helm install lacuna ./deploy/helm/lacuna -f deploy/helm/lacuna/values-production.yaml
Documentation
- User Guide - Using the web UI and CLI
- Architecture Overview - System design and data flow
- Development Guide - Local setup and dev mode
- Data Governance Guide - Self-service governance model
- Lineage & Provenance - Tracking data flows
- ISO 27001 Audit Logging - Compliance implementation
- Policy-as-Code - Writing OPA policies
- Integration Guide - dbt, Databricks, Snowflake
- Plugin Development - Extending Lacuna
- Deployment Guide - Production setup and authentication
Examples
The examples/ directory contains runnable scripts demonstrating Lacuna features:
| Example | Description |
|---|---|
| basic_classification.py | Classify data and check sensitivity tiers |
| policy_evaluation.py | Evaluate operations against policies |
| lineage_tracking.py | Track data lineage and provenance |
| audit_logging.py | Query and inspect audit logs |
| api_client.py | HTTP client for the REST API |
| batch_classification.py | Classify multiple items efficiently |
| custom_classifier.py | Create custom classification rules |
| governance_workflow.py | Complete governance workflow |
# Run examples after starting dev server
lacuna dev &
python examples/basic_classification.py
Why Lacuna?
The Name
Lacuna (Latin): A gap, cavity, or protected space
In anatomy, a lacuna is a small cavity in bone or cartilage that houses and protects cells. In manuscripts, a lacuna is a gap where a portion of the text is missing, leaving that content out of view.
In data governance, Lacuna creates the protected space where:
- Sensitive data stays secure (within the cavity)
- Appropriate data flows freely (through the controlled gap)
- The boundary is enforced automatically (by classification and policy)
The Market Gap
Existing solutions each address only part of the problem:
- Data catalogs (Alation, Collibra) - Passive metadata, no real-time enforcement
- Access control (Databricks, Snowflake) - Permission gates, but no operation-level governance
- DLP tools (Microsoft Purview) - Detection only, limited lineage
- Policy engines (OPA) - Enforcement infrastructure, but no data-aware classification
Lacuna uniquely combines:
- Real-time operation interception
- Automatic data classification with lineage
- Policy enforcement with user feedback
- ISO 27001-compliant audit logging
- Self-service model with central governance
Who This Is For
Target Organizations:
- Enterprises with data governance requirements
- Regulated industries (finance, healthcare, government)
- Companies with proprietary data assets
- Organizations deploying local data platforms
- Teams needing self-service with compliance
Target Users:
- Data analysts (need self-service access)
- Data engineers (building pipelines)
- Data governance teams (defining policies)
- Compliance officers (generating audit reports)
- Security teams (monitoring access)
Contributing
We welcome contributions! See CONTRIBUTING.md for:
- How to set up a development environment
- Code style guidelines
- Testing requirements
- Plugin development guide
- Documentation standards
License
Lacuna is licensed under the Apache License 2.0.
Support
- Issues: https://github.com/witlox/lacuna/issues
- Discussions: https://github.com/witlox/lacuna/discussions
Citation
If you use Lacuna in academic research, please cite:
@software{lacuna2025,
title = {Lacuna: Self-Service Data Governance with Real-Time Policy Enforcement},
author = {Lacuna Contributors},
year = {2025},
url = {https://github.com/witlox/lacuna}
}