datagate-llm
The inference boundary layer between your data and outbound AI requests.
Scan text for sensitive data — PII, secrets, credentials, and sector-specific identifiers — before it leaves your system and reaches an LLM API.
The Problem
In 2023, Samsung engineers reportedly leaked proprietary source code and internal meeting notes by pasting them into ChatGPT. Text submitted to a consumer AI service can be retained by the provider and, depending on its terms, used for training. This is not a hypothetical risk: it is what happens by default when you send unrestricted text to an external AI model.
datagate-llm is the layer you put in front of that API call. It checks what you are about to send, tells you what it found, and lets you decide: flag it, redact it, or block it.
Install
pip install datagate-llm
Zero dependencies. Python 3.9+. Works offline.
Quickstart
from datagate_llm import scan

# Basic scan
result = scan("Contact Alice at alice@company.com or call 415-555-0192")
print(result["safe"])        # False
print(result["risk_score"])  # 0.8 (or similar)
print(result["findings"])    # list of matched spans

# Redact mode: replace PII before sending to an LLM
result = scan(
    "My SSN is 123-45-6789 and card number 4111111111111111",
    mode="redact",
)
print(result["redacted_text"])
# "My SSN is [REDACTED:universal/ssn] and card number [REDACTED:universal/credit_card]"

# Block mode: hard stop on high-risk content
result = scan("AKIAIOSFODNN7EXAMPLEKEY", sectors=["technology"], mode="block")
if result["action"] == "block":
    raise ValueError("Refusing to send credentials to LLM")

# Multi-sector scan
result = scan(
    "Patient MRN: AB12345, account 123456789012",
    sectors=["healthcare", "finance"],
)
for finding in result["findings"]:
    print(finding["rule_id"], finding["severity"], finding["confidence"])
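To make the redacted output above concrete, here is a minimal sketch of how span-based redaction can produce `[REDACTED:<rule_id>]` placeholders. It is not the package's implementation. The `start` and `end` keys on each finding are assumptions made for illustration (the Quickstart only shows `rule_id`, `severity`, and `confidence`), and the spans are assumed not to overlap.

```python
from typing import Dict, List


def redact(text: str, findings: List[Dict]) -> str:
    """Replace each finding's span with a [REDACTED:<rule_id>] placeholder.

    Assumes hypothetical integer "start"/"end" offsets and non-overlapping spans.
    """
    # Work right to left so earlier offsets stay valid after each replacement.
    for f in sorted(findings, key=lambda f: f["start"], reverse=True):
        placeholder = f"[REDACTED:{f['rule_id']}]"
        text = text[: f["start"]] + placeholder + text[f["end"]:]
    return text


text = "My SSN is 123-45-6789"
findings = [{"rule_id": "universal/ssn", "start": 10, "end": 21}]
print(redact(text, findings))  # My SSN is [REDACTED:universal/ssn]
```

Replacing from the end of the string backwards avoids recomputing offsets after each substitution changes the text length.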
What It Detects
| Category | Rule ID | Severity |
|---|---|---|
| Email address | universal/email | high |
| US phone number | universal/phone_us | medium |
| Social Security Number | universal/ssn | critical |
| Credit card number | universal/credit_card | critical |
| IP address | universal/ip_address | low |
| AWS access key | technology/aws_access_key | critical |
| OpenAI API key | technology/openai_key | critical |
| Anthropic API key | technology/anthropic_key | critical |
| GitHub token | technology/github_token | critical |
| Stripe key | technology/stripe_key | critical |
| JWT token | technology/jwt_token | high |
| Private key (PEM) | technology/private_key | critical |
| Database connection string | technology/connection_string | critical |
| NPI number | healthcare/npi_number | high |
| ICD-10 diagnosis code | healthcare/icd10_code | medium |
| Insurance member ID | healthcare/insurance_member_id | high |
| Medical record number | healthcare/medical_record_number | critical |
| DEA number | healthcare/dea_number | critical |
| IBAN | finance/iban | high |
| SWIFT/BIC code | finance/swift_bic | medium |
| ABA routing number | finance/routing_number | high |
| Bank account number | finance/bank_account | high |
| Tax ID / EIN | finance/tax_id_ein | critical |
| Bitcoin address | finance/crypto_btc | medium |
| Ethereum address | finance/crypto_eth | medium |
How It Works
text input
│
▼
tokenize() ← NFKC normalization, zero-width char removal
│
▼
match() ← regex scan against compiled rule set
│
▼
score() ← context-aware confidence (boost / suppress words)
│
▼
resolve() ← remove overlapping spans, keep highest confidence
│
▼
aggregate() ← single risk_score in [0.0, 1.0]
│
▼
build_result() ← assemble final dict with action, findings, fingerprint
Every step is a pure function. No network calls. No disk writes. No global state except the in-process rule cache.
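Two of the stages above are easy to illustrate in isolation: the normalization done before matching, and the overlap resolution done after scoring. The sketch below is one way to write them as pure functions using only the standard library. The function names, the set of zero-width code points, and the `start`/`end` keys are assumptions for illustration, not the package's internals.

```python
import unicodedata
from typing import Dict, List

# Zero-width and BOM code points that can invisibly split a pattern.
# This particular set is an assumption; the package may strip others.
_ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))


def normalize(text: str) -> str:
    """NFKC-normalize, then delete zero-width characters."""
    return unicodedata.normalize("NFKC", text).translate(_ZERO_WIDTH)


def resolve(findings: List[Dict]) -> List[Dict]:
    """Drop overlapping spans, keeping the highest-confidence one.

    Findings are visited from most to least confident; a finding is kept
    only if it does not overlap anything already kept.
    """
    kept: List[Dict] = []
    for f in sorted(findings, key=lambda f: f["confidence"], reverse=True):
        if all(f["end"] <= k["start"] or f["start"] >= k["end"] for k in kept):
            kept.append(f)
    return sorted(kept, key=lambda f: f["start"])


print(normalize("4111\u200b1111"))  # 41111111
```

Normalizing first matters because a card number written with full-width digits, or with a zero-width space in the middle, would otherwise slip past a plain ASCII regex.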
Scan Modes
| Mode | When risk > 0 | Use case |
|---|---|---|
| flag (default) | action = "flag" | Log and review before sending |
| redact | action = "flag", spans replaced in redacted_text | Strip PII, send cleaned text |
| block | action = "block" | Hard stop: raise an error upstream |
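The modes in the table map naturally onto a small guard placed in front of an LLM call. The sketch below is an illustration, not part of the package: `guarded_prompt` and `fake_scan` are hypothetical names, and `fake_scan` is a stand-in with the same result keys as the Quickstart so the example runs without datagate-llm installed. In real use you would pass `datagate_llm.scan` as `scan_fn`.

```python
from typing import Callable, Dict


def guarded_prompt(text: str, scan_fn: Callable[..., Dict], mode: str = "redact") -> str:
    """Return text that may be sent onward, or raise if the scan says to block."""
    result = scan_fn(text, mode=mode)
    if result.get("action") == "block":
        raise ValueError("Refusing to send flagged content to the LLM")
    if mode == "redact":
        # Fall back to the original text if the result has no redacted_text key.
        return result.get("redacted_text", text)
    return text


def fake_scan(text: str, mode: str = "flag", sectors=None) -> Dict:
    """Stand-in for datagate_llm.scan; only recognizes one hard-coded SSN."""
    hit = "123-45-6789" in text
    if not hit:
        action = "allow"  # placeholder: the safe-case value is not documented in this README
    else:
        action = "block" if mode == "block" else "flag"
    return {
        "safe": not hit,
        "action": action,
        "redacted_text": text.replace("123-45-6789", "[REDACTED:universal/ssn]"),
    }


print(guarded_prompt("SSN 123-45-6789", fake_scan))  # SSN [REDACTED:universal/ssn]
```

Passing the scanner in as a parameter keeps the guard testable and leaves the decision about what to do on a block (raise, log, or ask the user) in the calling code.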
Honest Limits
- Regex-only: datagate-llm uses deterministic pattern matching. It will not catch PII embedded in obfuscated prose, paraphrased content, or novel formats it has never seen.
- English-centric: Phone and ID patterns currently target US formats. International variants may be missed.
- No semantic understanding: "The patient's temperature was 98.6" will not be flagged as health data because there is no pattern for it. Semantic scanning requires the optional onnxruntime layer (not yet released).
- False positives are possible: Short patterns like SWIFT codes can match arbitrary uppercase strings. Use context.suppress words in your rule JSON to reduce noise.
- Not a compliance tool: Passing a scan does not mean a document is HIPAA, GDPR, or PCI-DSS compliant. Use this as one layer of defense, not the only one.
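To show why suppress words help with short patterns such as SWIFT codes, here is a rough sketch of context-aware confidence. It is an illustration of the general idea only: the function name, the 40-character window, the 0.2 adjustment, and the word lists are all assumptions, and the package's actual rule JSON schema and scoring are not documented in this README.

```python
import re
from typing import Iterable


def context_confidence(
    text: str,
    start: int,
    end: int,
    base: float,
    boost: Iterable[str] = (),
    suppress: Iterable[str] = (),
    window: int = 40,   # characters of context on each side (assumed value)
    step: float = 0.2,  # size of each adjustment (assumed value)
) -> float:
    """Raise or lower a match's confidence based on nearby words."""
    nearby = text[max(0, start - window): end + window].lower()
    words = set(re.findall(r"[a-z0-9]+", nearby))
    score = base
    if words & {w.lower() for w in boost}:
        score += step
    if words & {w.lower() for w in suppress}:
        score -= step
    # Clamp to [0.0, 1.0] and round away floating-point noise.
    return round(min(1.0, max(0.0, score)), 2)


boost = {"swift", "bic", "wire"}
suppress = {"host", "build"}
print(context_confidence("SWIFT code DEUTDEFF for the wire transfer", 11, 19, 0.5, boost, suppress))
print(context_confidence("Build host DEVSTAGE rebooted", 11, 19, 0.5, boost, suppress))
```

Both strings contain an eight-letter uppercase token that a SWIFT regex could match. In this sketch the first call scores higher than the base and the second scores lower, which is the effect suppress words are meant to have.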
Contributing
See CONTRIBUTING.md. In short: add rules in JSON, add tests, open a PR.
License
MIT. See LICENSE.