Scan your AI's vector database for exposed sensitive data.

RAGLeakGuard

Scan your AI's vector database for exposed sensitive data — before it becomes a breach you can't delete.

RAGLeakGuard is a CLI that connects to your vector store (Chroma today; more soon), reads what's stored, detects sensitive data (PII, health, financial), and writes a risk-scored report. No changes to your app — point it at the store and scan.

What it is: a data-inventory & compliance scanner — it answers the question a compliance officer actually asks: "what regulated data is sitting in our vector store, and can we prove we can delete it?" Read-only; safe to run against production.

What it isn't: a red-team tool. It doesn't fire prompt-injection or jailbreak attacks — it audits the data at rest, not how the model responds under attack.
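To make the read, detect, and score flow concrete, here is a minimal standard-library sketch. It is not RAGLeakGuard's implementation: the regex patterns, the severity mapping, and the function names `scan_documents` and `risk_level` are all assumptions made for illustration, and the input is a plain list of strings standing in for text chunks read from a vector store.

```python
import re
from collections import Counter

# Hypothetical patterns and severities, chosen for illustration only.
# They are NOT the recognisers RAGLeakGuard ships.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}
SEVERITY = {"EMAIL": "medium", "US_SSN": "high"}


def scan_documents(documents):
    """Count pattern matches per entity type. Read-only: inputs are never modified."""
    counts = Counter()
    for text in documents:
        for entity_type, pattern in PATTERNS.items():
            counts[entity_type] += len(pattern.findall(text))
    return counts


def risk_level(counts):
    """Return the highest severity among entity types that were actually found."""
    found = {SEVERITY[t] for t, n in counts.items() if n > 0}
    for level in ("high", "medium"):
        if level in found:
            return level
    return "low"


if __name__ == "__main__":
    chunks = [
        "Patient Jane, SSN 123-45-6789, email jane@example.com",
        "No sensitive data here.",
    ]
    counts = scan_documents(chunks)
    print(dict(counts), risk_level(counts))
```

A real scanner would also need context-aware detection (names, locations, dates), which simple regexes cannot provide.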

🚧 Early development — building in public. Not production-ready yet.

Why this matters

RAG systems embed your private data into vector databases. That data can be reconstructed from the vectors (embedding inversion), is hard to delete (backups, replicas, caches, fine-tuned models), and usually isn't inventoried. RAGLeakGuard finds it.

Install (from source)

git clone https://github.com/Agenvana/RAGLeakGuard.git
cd RAGLeakGuard
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip          # fresh venvs ship an old pip; the editable install needs a newer one
pip install -e ".[chroma,detect,dev]"
python -m spacy download en_core_web_sm

Python 3.9 note: dependencies are pinned (spaCy<3.8, numpy<2) so prebuilt wheels are used — no source build needed.

Quickstart (≈2 minutes)

# 1. Create a test vector store full of FAKE sensitive records
python scripts/seed_synthetic.py                          # -> ./sample_store (100 fake clinic records)

# 2. Scan it — global + US recognisers are on by default
ragleakguard scan --source chroma --path ./sample_store --report report.md

# 3. The fixture is Australian, so add the AU locale pack for full coverage
ragleakguard scan --source chroma --path ./sample_store --locale au --report report.md

# 4. Open report.md  (summary, findings by type + severity, risk level, remediation)
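
If you would rather drive the scan from a Python script (a scheduled job, for example), a thin wrapper around the CLI might look like the sketch below. It uses only the flags shown in the Quickstart above; the helper names `build_scan_command` and `run_scan` are hypothetical, and because the meaning of the CLI's exit codes isn't documented here, the return code is passed back without interpretation.

```python
import shutil
import subprocess


def build_scan_command(path, report="report.md", locale=None):
    """Build the argv list for `ragleakguard scan` using the Quickstart flags."""
    cmd = ["ragleakguard", "scan", "--source", "chroma", "--path", path]
    if locale:
        cmd += ["--locale", locale]  # e.g. "au" for the Australian locale pack
    cmd += ["--report", report]
    return cmd


def run_scan(path, report="report.md", locale=None):
    """Run the scan and return the process exit code (semantics are an assumption)."""
    if shutil.which("ragleakguard") is None:
        raise RuntimeError("ragleakguard is not on PATH; activate the venv first")
    return subprocess.run(build_scan_command(path, report, locale)).returncode


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch is safe to execute.
    print(" ".join(build_scan_command("./sample_store", locale="au")))
```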

Detection

  • Default: global + US recognisers, covering SSN, bank number, driver's license, credit card, email, phone, names, locations, dates, IP address, crypto…
  • Locale packs (--locale): au (Medicare / TFN / ABN), uk, sg, in. These add country-specific ID recognisers and are opt-in.
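
As an illustration of what a country-specific recogniser involves, the sketch below finds Australian Business Numbers (ABN) by pairing a loose regex with the publicly documented ABN checksum (subtract 1 from the first digit, apply the weights, and check that the sum is divisible by 89). How RAGLeakGuard's own au pack detects ABNs is not described here, so treat this as an assumption-laden example; `is_valid_abn` and `find_abns` are hypothetical names.

```python
import re

# Weights from the published ABN checksum algorithm.
ABN_WEIGHTS = (10, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19)

# Loose candidate pattern: 11 digits, optionally grouped as 2-3-3-3 with spaces.
ABN_CANDIDATE = re.compile(r"\b\d{2} ?\d{3} ?\d{3} ?\d{3}\b")


def is_valid_abn(candidate):
    """Return True if the digits in `candidate` pass the ABN checksum."""
    digits = [int(ch) for ch in candidate if ch.isdigit()]
    if len(digits) != 11:
        return False
    digits[0] -= 1  # the algorithm subtracts 1 from the leading digit
    return sum(d * w for d, w in zip(digits, ABN_WEIGHTS)) % 89 == 0


def find_abns(text):
    """Return candidates in `text` that also pass the checksum."""
    return [m.group() for m in ABN_CANDIDATE.finditer(text) if is_valid_abn(m.group())]


if __name__ == "__main__":
    sample = "Supplier ABN 51 824 753 556, order ref 12 345 678 901"
    print(find_abns(sample))
```

Checksum validation matters because an 11-digit run on its own is a weak signal: order numbers and phone-like strings match the pattern but rarely pass the checksum.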

Roadmap

See ROADMAP.md — next up includes a custom AU phone recogniser, more connectors (Pinecone, pgvector), and the Fix/Prove layers.

License

Apache-2.0

Download files

Source Distribution

ragleakguard-0.1.0.tar.gz (17.6 kB)

Uploaded Source

Built Distribution

ragleakguard-0.1.0-py3-none-any.whl (14.6 kB)

Uploaded Python 3

File details

Details for the file ragleakguard-0.1.0.tar.gz.

File metadata

  • Download URL: ragleakguard-0.1.0.tar.gz
  • Upload date:
  • Size: 17.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.6

File hashes

Hashes for ragleakguard-0.1.0.tar.gz
Algorithm Hash digest
SHA256 a2dadb025fe65e7e2582b74356dc773399dda524488e27d872418f5b6e225de3
MD5 8370d23bac46bbd5a242a1592fbeb34e
BLAKE2b-256 5073a4757d9532850a67752f9562c2af7bce4bdfc14f34ab2900a2e4bf5d23bb

File details

Details for the file ragleakguard-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: ragleakguard-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 14.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.6

File hashes

Hashes for ragleakguard-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 0ba250dd2a70b9666b02dbc6b18bb2d735dfd10ab5f489498b1be4699c9c2d2a
MD5 7c28dd176bd0e58b16a60e7c191bb5c8
BLAKE2b-256 3afd729fba78de1447212295c791cfe2f4cf6d3a3737edc520c84a6b4a8dba65
