RAGLeakGuard
Scan your AI's vector database for exposed sensitive data — before it becomes a breach you can't delete.
RAGLeakGuard is a CLI that connects to your vector store (Chroma today; more soon), reads what's stored, detects sensitive data (PII, health, financial), and writes a risk-scored report. No changes to your app — point it at the store and scan.
What it is: a data-inventory & compliance scanner — it answers the question a compliance officer actually asks: "what regulated data is sitting in our vector store, and can we prove we can delete it?" Read-only; safe to run against production.
What it isn't: a red-team tool. It doesn't fire prompt-injection or jailbreak attacks — it audits the data at rest, not how the model responds under attack.
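The read, detect, score, report flow described above can be sketched in a few lines. This is a minimal illustration only, not RAGLeakGuard's implementation: the two regex patterns, the `SEVERITY` weights, the `Finding` class, and the report layout are all assumptions made for the example, and the in-memory `records` dict stands in for text read from a vector store.

```python
import re
from collections import Counter
from dataclasses import dataclass

# Hypothetical patterns and severity weights, chosen only for this sketch.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}
SEVERITY = {"EMAIL": 1, "US_SSN": 3}
LEVELS = {1: "low", 2: "medium", 3: "high"}


@dataclass(frozen=True)
class Finding:
    record_id: str
    entity_type: str
    severity: int


def scan_records(records):
    """Read-only pass: collect one Finding per pattern match in each record."""
    findings = []
    for record_id, text in records.items():
        for entity_type, pattern in PATTERNS.items():
            for _ in pattern.finditer(text):
                findings.append(Finding(record_id, entity_type, SEVERITY[entity_type]))
    return findings


def risk_level(findings):
    """Overall risk is the worst severity seen, or 'none' for a clean scan."""
    if not findings:
        return "none"
    return LEVELS[max(f.severity for f in findings)]


def render_report(findings):
    """Render a small Markdown summary: risk level plus counts per entity type."""
    counts = Counter(f.entity_type for f in findings)
    lines = ["# Scan report", "", f"Risk level: {risk_level(findings)}", ""]
    lines += [f"- {etype}: {n}" for etype, n in sorted(counts.items())]
    return "\n".join(lines)


# Stand-in for text chunks read from a vector store (all values are fake).
records = {
    "doc-1": "Contact jane@example.com about the invoice.",
    "doc-2": "Patient SSN 123-45-6789, email bob@example.org.",
    "doc-3": "No sensitive content here.",
}
findings = scan_records(records)
report = render_report(findings)
print(report)
```

Nothing in the sketch writes back to `records`, which mirrors the read-only property the tool claims for itself.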
🚧 Early development — building in public. Not production-ready yet.
Why this matters
RAG systems embed your private data into vector databases. That data can often be at least partially reconstructed from the vectors (embedding inversion), is hard to delete (backups, replicas, caches, fine-tuned models), and usually isn't inventoried. RAGLeakGuard finds it.
Install (from source)
git clone https://github.com/Agenvana/RAGLeakGuard.git
cd RAGLeakGuard
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip # fresh venvs ship an old pip; the editable install needs a newer one
pip install -e ".[chroma,detect,dev]"
python -m spacy download en_core_web_sm
Python 3.9 note: dependencies are pinned (spaCy<3.8, numpy<2) so prebuilt wheels are used; no source build is needed.
Quickstart (≈2 minutes)
# 1. Create a test vector store full of FAKE sensitive records
python scripts/seed_synthetic.py # -> ./sample_store (100 fake clinic records)
# 2. Scan it — global + US recognisers are on by default
ragleakguard scan --source chroma --path ./sample_store --report report.md
# 3. The fixture is Australian, so add the AU locale pack for full coverage
ragleakguard scan --source chroma --path ./sample_store --locale au --report report.md
# 4. Open report.md (summary, findings by type + severity, risk level, remediation)
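To run the same scans from a script (for example, once per locale), the Quickstart commands can be assembled programmatically. The sketch below uses only the flags shown above (`--source`, `--path`, `--locale`, `--report`); the helper names `build_scan_command` and `run_scan` are hypothetical, and the meaning of the CLI's exit code is not documented here, so the wrapper simply returns it without interpreting it.

```python
import shutil
import subprocess
from typing import List, Optional


def build_scan_command(path: str, report: str, locale: Optional[str] = None) -> List[str]:
    """Assemble argv for `ragleakguard scan` using only flags from the Quickstart."""
    cmd = ["ragleakguard", "scan", "--source", "chroma", "--path", path, "--report", report]
    if locale:
        cmd += ["--locale", locale]
    return cmd


def run_scan(path: str, report: str, locale: Optional[str] = None) -> int:
    """Run the scan and return the process exit code (semantics are an unknown here)."""
    if shutil.which("ragleakguard") is None:
        raise RuntimeError("ragleakguard is not on PATH; activate the virtualenv first")
    return subprocess.run(build_scan_command(path, report, locale)).returncode


# Show the command that step 3 of the Quickstart corresponds to.
print(" ".join(build_scan_command("./sample_store", "report.md", locale="au")))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when the store path contains spaces.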
Detection
- Default: global + US recognisers — SSN, bank number, driver license, credit card, email, phone, names, locations, dates, IP, crypto…
- Locale packs (--locale): au (Medicare / TFN / ABN), uk, sg, in. These are opt-in country IDs.
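The split between always-on recognisers and opt-in locale packs can be illustrated with a small registry. This is a sketch under stated assumptions, not the project's code: the entity labels (`CREDIT_CARD`, `AU_ABN`), the regexes, and the `detect` function are invented for the example. The two checksums are the standard ones: Luhn for card numbers, and the published ABN check (subtract 1 from the first digit, apply the weights, sum must be divisible by 89).

```python
import re


def luhn_ok(digits: str) -> bool:
    """Luhn checksum: double every second digit from the right, sum, test mod 10."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def abn_ok(digits: str) -> bool:
    """ABN checksum: subtract 1 from the first digit, weight, sum must be 0 mod 89."""
    if len(digits) != 11:
        return False
    weights = (10, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19)
    nums = [int(c) for c in digits]
    nums[0] -= 1
    return sum(w * n for w, n in zip(weights, nums)) % 89 == 0


# Hypothetical registry: name -> (pattern, checksum validator).
DEFAULT_RECOGNISERS = {
    "CREDIT_CARD": (re.compile(r"\b\d(?:[ -]?\d){12,18}\b"), luhn_ok),
}
LOCALE_PACKS = {
    "au": {"AU_ABN": (re.compile(r"\b\d{2} ?\d{3} ?\d{3} ?\d{3}\b"), abn_ok)},
}


def detect(text, locales=()):
    """Run the default recognisers plus any opted-in locale packs."""
    recognisers = dict(DEFAULT_RECOGNISERS)
    for code in locales:
        recognisers.update(LOCALE_PACKS[code])
    hits = []
    for name, (pattern, validate) in recognisers.items():
        for match in pattern.finditer(text):
            digits = re.sub(r"\D", "", match.group())
            if validate(digits):  # the checksum filters out look-alike numbers
                hits.append((name, match.group()))
    return hits


sample = "Card 4111 1111 1111 1111, ABN 51 824 753 556."
print(detect(sample))                   # defaults only: the ABN is not reported
print(detect(sample, locales=("au",)))  # AU pack enabled: the ABN is reported too
```

Keeping country identifiers opt-in avoids false positives from formats that are irrelevant to a given data set, at the cost of missing them when the pack is not enabled, which is why the Quickstart reruns the scan with `--locale au`.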
Roadmap
See ROADMAP.md — next up includes a custom AU phone recogniser, more connectors (Pinecone, pgvector), and the Fix/Prove layers.
License
Apache-2.0
File details
Details for the file ragleakguard-0.1.0.tar.gz.
File metadata
- Download URL: ragleakguard-0.1.0.tar.gz
- Upload date:
- Size: 17.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a2dadb025fe65e7e2582b74356dc773399dda524488e27d872418f5b6e225de3 |
| MD5 | 8370d23bac46bbd5a242a1592fbeb34e |
| BLAKE2b-256 | 5073a4757d9532850a67752f9562c2af7bce4bdfc14f34ab2900a2e4bf5d23bb |
File details
Details for the file ragleakguard-0.1.0-py3-none-any.whl.
File metadata
- Download URL: ragleakguard-0.1.0-py3-none-any.whl
- Upload date:
- Size: 14.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0ba250dd2a70b9666b02dbc6b18bb2d735dfd10ab5f489498b1be4699c9c2d2a |
| MD5 | 7c28dd176bd0e58b16a60e7c191bb5c8 |
| BLAKE2b-256 | 3afd729fba78de1447212295c791cfe2f4cf6d3a3737edc520c84a6b4a8dba65 |