Skip to main content

Privacy-aware query classification and routing for RAG systems - the protected space where your knowledge stays yours

Project description

Lacuna

Protected space for data governance, lineage, and privacy-aware operations


License: Apache 2.0 Build Test Validate Package codecov


FOSSA Status FOSSA Status


The Problem

Organizations deploying local LLMs and data platforms face a critical challenge: How do you enable self-service data access while maintaining governance, lineage tracking, and compliance?

Current solutions require choosing between:

  • Strict centralized control → Bottlenecks, slow innovation
  • Complete self-service → Compliance violations, data leaks, audit failures

Lacuna solves this by creating a "protected space" where:

  • Users see what they're doing in real-time
  • Central teams define policies as code
  • Systems automatically classify and route data operations
  • Complete audit trails satisfy ISO 27001/27002
  • Lineage and provenance are captured automatically

The Solution

Lacuna is a policy-aware data governance engine that:

  1. Classifies data operations automatically using a three-layer pipeline (heuristics → embeddings → LLM)
  2. Enforces policies in real-time with clear, actionable feedback to users
  3. Tracks complete lineage across transformations, joins, and exports
  4. Captures comprehensive provenance (who, what, when, why, how)
  5. Maintains ISO 27001-compliant audit logs with tamper-evident hash chains
  6. Integrates with existing tools (dbt, Databricks, Snowflake, OPA)

Core Use Cases

Use Case 1: Real-Time Policy Enforcement

Scenario: Data analyst attempting to export customer data

# User's notebook
import pandas as pd

customers = pd.read_csv("customers.csv")  
# ✓ Lacuna detects: PII data loaded, context updated

analysis = customers.merge(sales, on="customer_id")
# ✓ Lacuna classifies: PII propagates through join

analysis.to_csv("~/Downloads/export.csv")
# ✗ Lacuna blocks with clear message:
"""
❌ Governance Policy Violation

Action: Export to ~/Downloads/export.csv
Reason: Cannot export PII data to unmanaged location
Classification: PROPRIETARY (inherited from customers.csv)
Tags: PII, GDPR, FINANCIAL

Alternatives:
1. Use anonymized version: analysis_anon = anonymize(analysis, ['customer_id', 'email'])
2. Save to governed location: analysis.to_csv("/governed/workspace/analysis.csv")
3. Request exception: https://governance.example.com/exception

Policy: P-2024-001 (PII Export Restrictions)
Steward: data-governance@example.com
"""

Use Case 2: Automated Lineage Tracking

Scenario: Understanding data dependencies

from lacuna import LineageTracker

# Query lineage
lineage = LineageTracker.get_lineage("analysis.csv")

print(lineage.to_graph())
"""
analysis.csv (PROPRIETARY, tags: PII, GDPR, FINANCIAL)
├─ customers.csv (PROPRIETARY, tags: PII, GDPR)
│  └─ raw.customer_master (PROPRIETARY, tags: PII)
│     └─ salesforce.contacts (PROPRIETARY, tags: PII)
└─ sales.csv (INTERNAL, tags: FINANCIAL)
   └─ raw.transactions (INTERNAL, tags: FINANCIAL)
"""

# Check downstream impact
downstream = LineageTracker.get_downstream("customers.csv")
print(f"Changing customers.csv will impact {len(downstream)} artifacts")

Use Case 3: ISO 27001 Audit Compliance

Scenario: Annual compliance audit

from lacuna.audit import ComplianceReporter

# Generate ISO 27001 A.9.4 report (Access Control)
report = ComplianceReporter.generate_a_9_4_report(
    start_date="2025-01-01",
    end_date="2025-12-31"
)

# Report includes:
# - All data access attempts (successful and failed)
# - Classification decisions with reasoning
# - Policy violations with user responses
# - Administrative actions
# - Complete audit trail with hash chain verification

Architecture Overview

┌─────────────────────────────────────────────────────────────┐
│                    User Data Operation                      │
│  (read, write, join, export, transform, query)              │
└───────────────────────────┬─────────────────────────────────┘
                            │
                            ↓
┌─────────────────────────────────────────────────────────────┐
│              Operation Interceptor Layer                    │
│  • File system operations (FUSE)                            │
│  • Database queries (SQLAlchemy middleware)                 │
│  • Notebook operations (IPython magic)                      │
│  • dbt runs (post-hooks)                                    │
│  • API calls (proxy)                                        │
└───────────────────────────┬─────────────────────────────────┘
                            │
                            ↓
┌─────────────────────────────────────────────────────────────┐
│           Three-Layer Classification Pipeline               │
│                                                             │
│  Layer 1: Heuristics (<1ms)                                 │
│  ├─ Regex patterns for known sensitive terms                │
│  ├─ File path analysis                                      │
│  └─ Handles 90% of operations                               │
│                                                             │
│  Layer 2: Embeddings (<10ms)                                │
│  ├─ Semantic similarity to known examples                   │
│  ├─ Pre-computed embeddings                                 │
│  └─ Handles 8% of operations                                │
│                                                             │
│  Layer 3: LLM Reasoning (<200ms)                            │
│  ├─ Complex context-dependent decisions                     │
│  ├─ Multi-source lineage inference                          │
│  └─ Handles 2% of operations                                │
└───────────────────────────┬─────────────────────────────────┘
                            │
                            ↓
┌─────────────────────────────────────────────────────────────┐
│              Lineage & Provenance Engine                    │
│  • Track source → transformation → destination              │
│  • Classify derived data (inheritance rules)                │
│  • Tag propagation through operations                       │
│  • Business context capture                                 │
└───────────────────────────┬─────────────────────────────────┘
                            │
                            ↓
┌─────────────────────────────────────────────────────────────┐
│            Policy Engine (OPA Integration)                  │
│  • Evaluate operation against policies                      │
│  • Consider: data tier, user role, destination, purpose     │
│  • Return: allow/deny + reasoning + alternatives            │
└───────────────────────────┬─────────────────────────────────┘
                            │
                            ↓
┌─────────────────────────────────────────────────────────────┐
│            ISO 27001 Audit Logging                          │
│  • Tamper-evident hash chain                                │
│  • Complete provenance (who, what, when, why, how)          │
│  • PostgreSQL append-only storage                           │
│  • Real-time alerting for violations                        │
│  • Compliance report generation                             │
└───────────────────────────┬─────────────────────────────────┘
                            │
                            ↓
┌─────────────────────────────────────────────────────────────┐
│              User Feedback Interface                        │
│  • Inline notebook warnings                                 │
│  • IDE integration (VS Code, PyCharm)                       │
│  • CLI pre-execution checks                                 │
│  • Web dashboard for compliance status                      │
└─────────────────────────────────────────────────────────────┘

Sensitivity Tiers

Lacuna classifies all data into three tiers:

PROPRIETARY

  • Definition: Data that would provide competitive advantage or violate confidentiality if disclosed
  • Examples: Customer PII, proprietary algorithms, internal pricing, strategic plans
  • Routing: Local only, requires approval for export
  • Retention: 7+ years for compliance

INTERNAL

  • Definition: Data that should remain within organization but isn't competitively sensitive
  • Examples: Internal tooling, team processes, general analytics
  • Routing: Internal systems, no external sharing
  • Retention: 1-3 years

PUBLIC

  • Definition: Information that is or could be publicly available
  • Examples: Public documentation, open-source code, published research
  • Routing: No restrictions
  • Retention: 1 year minimum

Key principle: Classification propagates through lineage. Joining PUBLIC + PROPRIETARY = PROPRIETARY.

Key Features

Governance & Classification

  • Automatic data classification using three-layer pipeline (heuristics, embeddings, LLM)
  • Context-aware decisions considering conversation, files, lineage
  • Policy-as-code using Open Policy Agent (OPA)
  • User override with feedback loop for continuous improvement

Lineage & Provenance

  • Automatic lineage tracking across file operations, SQL queries, transformations
  • Classification inheritance through joins, aggregations, derivations
  • Tag propagation (PII, PHI, FINANCIAL) through data flows
  • Business context capture (purpose, justification, approvals)

Audit & Compliance

  • ISO 27001-compliant logging with tamper-evident hash chains
  • Complete provenance (who, what, when, why, how)
  • Real-time alerting for policy violations and security events
  • Compliance reports (A.9.4, A.12.4, GDPR, HIPAA)
  • 7-year retention with automated archival to cold storage

Integration & Extensibility

  • Pluggable architecture for custom classifiers and policies
  • Native integrations: dbt, Databricks Unity Catalog, Snowflake, OPA
  • Developer tools: Jupyter magic, VS Code extension, CLI
  • REST API for custom integrations

Performance

  • <10ms classification for 98% of operations (heuristics + embeddings)
  • Caching layer for repeated patterns
  • Asynchronous audit logging (non-blocking)
  • Batch processing for bulk operations

Quick Start

Development Mode

The fastest way to try Lacuna locally:

# Clone and install
git clone https://github.com/witlox/lacuna.git
cd lacuna
pip install -e .

# Start in dev mode (uses SQLite, no external dependencies)
lacuna dev

# Open in browser
# API Docs: http://127.0.0.1:8000/docs
# User Dashboard: http://127.0.0.1:8000/user/dashboard
# Admin Dashboard: http://127.0.0.1:8000/admin/

Dev mode uses lightweight backends (SQLite, in-memory cache) so you can explore Lacuna without setting up PostgreSQL, Redis, or OPA.

Production Mode

For production deployments with full features:

# Using Docker
docker pull ghcr.io/witlox/lacuna:latest
docker run -d -p 8000:8000 ghcr.io/witlox/lacuna:latest

# Or install via pip
pip install lacuna
lacuna serve --host 0.0.0.0 --port 8000

See Deployment Guide for details, or use the production-ready configurations:

# Docker Compose production stack
docker compose -f deploy/docker/docker-compose.prod.yaml up -d

# High-availability with PostgreSQL replication
docker compose -f deploy/docker/docker-compose.ha.yaml up -d

# Kubernetes with Helm
helm install lacuna ./deploy/helm/lacuna -f deploy/helm/lacuna/values-production.yaml

Documentation

Examples

The examples/ directory contains runnable scripts demonstrating Lacuna features:

Example Description
basic_classification.py Classify data and check sensitivity tiers
policy_evaluation.py Evaluate operations against policies
lineage_tracking.py Track data lineage and provenance
audit_logging.py Query and inspect audit logs
api_client.py HTTP client for the REST API
batch_classification.py Classify multiple items efficiently
custom_classifier.py Create custom classification rules
governance_workflow.py Complete governance workflow
# Run examples after starting dev server
lacuna dev &
python examples/basic_classification.py

Why Lacuna?

The Name

Lacuna (Latin): A gap, cavity, or protected space

In anatomy, a lacuna is a small cavity in bone or cartilage that protects cells. In manuscripts, a lacuna is a missing section that reveals what's intentionally kept private.

In data governance, Lacuna creates the protected space where:

  • Sensitive data stays secure (within the cavity)
  • Appropriate data flows freely (through the controlled gap)
  • The boundary is enforced automatically (by classification and policy)

The Market Gap

Existing solutions address either:

  1. Data catalogs (Alation, Collibra) - Passive metadata, no real-time enforcement
  2. Access control (Databricks, Snowflake) - Permission gates, but no operation-level governance
  3. DLP tools (Microsoft Purview) - Detection only, limited lineage
  4. Policy engines (OPA) - Enforcement infrastructure, but no data-aware classification

Lacuna uniquely combines:

  • Real-time operation interception
  • Automatic data classification with lineage
  • Policy enforcement with user feedback
  • ISO 27001-compliant audit logging
  • Self-service model with central governance

Who This Is For

Target Organizations:

  • Enterprises with data governance requirements
  • Regulated industries (finance, healthcare, government)
  • Companies with proprietary data assets
  • Organizations deploying local data platforms
  • Teams needing self-service with compliance

Target Users:

  • Data analysts (need self-service access)
  • Data engineers (building pipelines)
  • Data governance teams (defining policies)
  • Compliance officers (generating audit reports)
  • Security teams (monitoring access)

Contributing

We welcome contributions! See CONTRIBUTING.md for:

  • How to set up development environment
  • Code style guidelines
  • Testing requirements
  • Plugin development guide
  • Documentation standards

License

Lacuna is licensed under the Apache 2.0.

Support

Citation

If you use Lacuna in academic research, please cite:

@software{lacuna2025,
  title = {Lacuna: Self-Service Data Governance with Real-Time Policy Enforcement},
  author = {Lacuna Contributors},
  year = {2025},
  url = {https://github.com/witlox/lacuna}
}

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

lacuna-2026.1.51.tar.gz (84.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

lacuna-2026.1.51-py3-none-any.whl (98.5 kB view details)

Uploaded Python 3

File details

Details for the file lacuna-2026.1.51.tar.gz.

File metadata

  • Download URL: lacuna-2026.1.51.tar.gz
  • Upload date:
  • Size: 84.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for lacuna-2026.1.51.tar.gz
Algorithm Hash digest
SHA256 430fedab2875335ac264f7276f75671586dafdb1751f52aab3ecf7f3da7aa088
MD5 caa921888e760bad0e49f972328bb7fe
BLAKE2b-256 30049b06209b9f1550c7f7bb734d4f66ab4324bb0da1726e06a645a71ff3976b

See more details on using hashes here.

Provenance

The following attestation bundles were made for lacuna-2026.1.51.tar.gz:

Publisher: package.yml on witlox/lacuna

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file lacuna-2026.1.51-py3-none-any.whl.

File metadata

  • Download URL: lacuna-2026.1.51-py3-none-any.whl
  • Upload date:
  • Size: 98.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for lacuna-2026.1.51-py3-none-any.whl
Algorithm Hash digest
SHA256 f5bde0a63488f4eafd45b08628c62da7b5460d8cd3f2176b15b4cc7ebb028425
MD5 5c2b45752f05bc4ad32af1ab86d8f1f4
BLAKE2b-256 f9cd30b686bfe067bb337e88587fa4be0c257d61fd32e0cf1ca37453948c2581

See more details on using hashes here.

Provenance

The following attestation bundles were made for lacuna-2026.1.51-py3-none-any.whl:

Publisher: package.yml on witlox/lacuna

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page