The inference boundary layer between your data and outbound AI requests

datagate-llm

Scan text for sensitive data — PII, secrets, credentials, and sector-specific identifiers — before it leaves your system and reaches an LLM API.


The Problem

In 2023, Samsung engineers reportedly leaked proprietary source code and internal meeting notes by pasting them into ChatGPT. Text submitted this way could be retained by the provider and potentially used for training. The risk is not hypothetical: any text you send unfiltered to an external AI model leaves your control.

datagate-llm is the layer you put in front of that API call. It checks what you are about to send, tells you what it found, and lets you decide: flag it, redact it, or block it.


Install

pip install datagate-llm

Zero dependencies. Python 3.9+. Works offline.


Quickstart

from datagate_llm import scan

# Basic scan
result = scan("Contact Alice at alice@company.com or call 415-555-0192")
print(result["safe"])        # False
print(result["risk_score"])  # 0.8 (or similar)
print(result["findings"])    # list of matched spans

# Redact mode — replace PII before sending to an LLM
result = scan(
    "My SSN is 123-45-6789 and card number 4111111111111111",
    mode="redact"
)
print(result["redacted_text"])
# "My SSN is [REDACTED:universal/ssn] and card number [REDACTED:universal/credit_card]"

# Block mode — hard stop on high-risk content
result = scan("AKIAIOSFODNN7EXAMPLE", sectors=["technology"], mode="block")
if result["action"] == "block":
    raise ValueError("Refusing to send credentials to LLM")

# Multi-sector scan
result = scan(
    "Patient MRN: AB12345, account 123456789012",
    sectors=["healthcare", "finance"]
)
for finding in result["findings"]:
    print(finding["rule_id"], finding["severity"], finding["confidence"])

What It Detects

| Category | Rule ID | Severity |
| --- | --- | --- |
| Email address | universal/email | high |
| US phone number | universal/phone_us | medium |
| Social Security Number | universal/ssn | critical |
| Credit card number | universal/credit_card | critical |
| IP address | universal/ip_address | low |
| AWS access key | technology/aws_access_key | critical |
| OpenAI API key | technology/openai_key | critical |
| Anthropic API key | technology/anthropic_key | critical |
| GitHub token | technology/github_token | critical |
| Stripe key | technology/stripe_key | critical |
| JWT token | technology/jwt_token | high |
| Private key (PEM) | technology/private_key | critical |
| Database connection string | technology/connection_string | critical |
| NPI number | healthcare/npi_number | high |
| ICD-10 diagnosis code | healthcare/icd10_code | medium |
| Insurance member ID | healthcare/insurance_member_id | high |
| Medical record number | healthcare/medical_record_number | critical |
| DEA number | healthcare/dea_number | critical |
| IBAN | finance/iban | high |
| SWIFT/BIC code | finance/swift_bic | medium |
| ABA routing number | finance/routing_number | high |
| Bank account number | finance/bank_account | high |
| Tax ID / EIN | finance/tax_id_ein | critical |
| Bitcoin address | finance/crypto_btc | medium |
| Ethereum address | finance/crypto_eth | medium |

How It Works

text input
    │
    ▼
tokenize()          ← NFKC normalization, zero-width char removal
    │
    ▼
match()             ← regex scan against compiled rule set
    │
    ▼
score()             ← context-aware confidence (boost / suppress words)
    │
    ▼
resolve()           ← remove overlapping spans, keep highest confidence
    │
    ▼
aggregate()         ← single risk_score in [0.0, 1.0]
    │
    ▼
build_result()      ← assemble final dict with action, findings, fingerprint

Every step is a pure function. No network calls. No disk writes. No global state except the in-process rule cache.
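
To make the stages in the diagram concrete, here is a toy version of the same flow. It is a sketch under stated assumptions, not the package's code: the two-rule `RULES` list, the boost and suppress words, the 0.2 and 0.4 adjustments, the 30-character context window, and the max-based aggregation are all invented for illustration, and `normalize` and `pipeline` are hypothetical names for the first stage and the overall driver.

```python
import re
import unicodedata

# Hypothetical rule set; the real rules are defined in the package's JSON files.
RULES = [
    {"rule_id": "universal/ssn", "pattern": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
     "base": 0.7, "boost": ("ssn", "social security"), "suppress": ("order", "ticket")},
    {"rule_id": "universal/email", "pattern": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
     "base": 0.9, "boost": (), "suppress": ()},
]
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))


def normalize(text):
    # NFKC folds look-alike forms (e.g. full-width digits); then drop zero-width chars.
    return unicodedata.normalize("NFKC", text).translate(ZERO_WIDTH)


def match(text):
    return [{"rule": r, "start": m.start(), "end": m.end()}
            for r in RULES for m in r["pattern"].finditer(text)]


def score(text, hits, window=30):
    lowered = text.lower()
    scored = []
    for h in hits:
        ctx = lowered[max(0, h["start"] - window):h["end"] + window]
        conf = h["rule"]["base"]
        if any(w in ctx for w in h["rule"]["boost"]):
            conf += 0.2
        if any(w in ctx for w in h["rule"]["suppress"]):
            conf -= 0.4
        # Build a new dict instead of mutating the input, keeping the step pure.
        scored.append({**h, "confidence": round(min(1.0, max(0.0, conf)), 2)})
    return scored


def resolve(hits):
    # Greedy: visit hits by descending confidence, keep those overlapping nothing kept.
    kept = []
    for h in sorted(hits, key=lambda h: -h["confidence"]):
        if all(h["end"] <= k["start"] or h["start"] >= k["end"] for k in kept):
            kept.append(h)
    return sorted(kept, key=lambda h: h["start"])


def aggregate(hits):
    # One possible aggregation (assumption): the highest confidence, 0.0 if no hits.
    return max((h["confidence"] for h in hits), default=0.0)


def pipeline(text):
    clean = normalize(text)
    hits = resolve(score(clean, match(clean)))
    findings = [{"rule_id": h["rule"]["rule_id"], "confidence": h["confidence"]}
                for h in hits]
    return {"risk_score": aggregate(hits), "findings": findings}


# A zero-width space hidden inside the number does not defeat the match.
print(pipeline("My SSN is 123\u200b-45-6789"))
```

Normalizing before matching is what lets a plain regex still see a number that was split by an invisible character or written with full-width digits.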


Scan Modes

| Mode | When risk > 0 | Use case |
| --- | --- | --- |
| flag (default) | action = "flag" | Log and review before sending |
| redact | action = "flag", spans replaced in redacted_text | Strip PII, send cleaned text |
| block | action = "block" | Hard stop; raise an error upstream |
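
The mode table can be read as a small decision function plus a span replacement. The sketch below assumes each finding carries `start` and `end` offsets alongside `rule_id` (the Quickstart only says findings are "matched spans"), and that spans do not overlap. `apply_mode` and `redact` are hypothetical names, and `None` is a placeholder for the no-findings case, which the table does not specify.

```python
def redact(text, findings):
    """Replace each finding's span with a [REDACTED:<rule_id>] placeholder."""
    # Work right to left so earlier offsets stay valid after each replacement.
    for f in sorted(findings, key=lambda f: f["start"], reverse=True):
        text = text[:f["start"]] + f"[REDACTED:{f['rule_id']}]" + text[f["end"]:]
    return text


def apply_mode(text, findings, mode="flag"):
    if mode not in ("flag", "redact", "block"):
        raise ValueError(f"unknown mode: {mode!r}")
    # Placeholder (assumption): action stays None when nothing was found.
    result = {"action": None, "findings": findings}
    if findings:
        result["action"] = "block" if mode == "block" else "flag"
    if mode == "redact":
        result["redacted_text"] = redact(text, findings)
    return result


text = "My SSN is 123-45-6789 and card number 4111111111111111"
ssn_at = text.index("123-45-6789")
card_at = text.index("4111111111111111")
findings = [
    {"rule_id": "universal/ssn", "start": ssn_at, "end": ssn_at + 11},
    {"rule_id": "universal/credit_card", "start": card_at, "end": card_at + 16},
]
print(apply_mode(text, findings, mode="redact")["redacted_text"])
```

Replacing from the last span backwards avoids recomputing offsets after each placeholder changes the string length.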

Honest Limits

  • Regex-only: datagate-llm uses deterministic pattern matching. It will not catch PII embedded in obfuscated prose, paraphrased content, or novel formats it has never seen.
  • US-centric: Phone and ID patterns currently target US formats. International variants may be missed.
  • No semantic understanding: "The patient's temperature was 98.6" will not be flagged as health data because there is no pattern for it. Semantic scanning requires the optional onnxruntime layer (not yet released).
  • False positives are possible: Short patterns like SWIFT codes can match arbitrary uppercase strings. Use context.suppress words in your rule JSON to reduce noise.
  • Not a compliance tool: Passing a scan does not mean a document is HIPAA, GDPR, or PCI-DSS compliant. Use this as one layer of defense, not the only one.
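
The false-positive point is easy to reproduce. The sketch below uses a SWIFT/BIC-shaped regex written for this example (not the package's actual pattern) and a hypothetical list of suppress words to show how ordinary uppercase words match, and how nearby context can be used to discount them.

```python
import re

# BIC shape: 4-letter bank code, 2-letter country, 2-char location, optional 3-char branch.
SWIFT_BIC = re.compile(r"\b[A-Z]{4}[A-Z]{2}[A-Z0-9]{2}(?:[A-Z0-9]{3})?\b")
SUPPRESS = ("warning", "error", "status")  # hypothetical suppress words


def swift_candidates(text, window=25):
    """Yield (match_text, suppressed) for every SWIFT-shaped token in `text`."""
    lowered = text.lower()
    for m in SWIFT_BIC.finditer(text):
        ctx = lowered[max(0, m.start() - window):m.end() + window]
        yield m.group(), any(w in ctx for w in SUPPRESS)


print(list(swift_candidates("Wire to DEUTDEFF today")))     # a real-looking BIC
print(list(swift_candidates("PLEASE DOWNLOAD the file")))   # false positive
print(list(swift_candidates("STATUS: SHUTDOWN COMPLETE")))  # matched, but suppressed
```

Any eight-letter uppercase word fits the pattern, which is why context words matter more for short identifiers than for long, structured ones such as API keys.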

Contributing

See CONTRIBUTING.md. In short: add rules in JSON, add tests, open a PR.


License

MIT. See LICENSE.

Download files

Source Distribution

datagate_llm-0.1.0.tar.gz (13.8 kB view details)

Uploaded Source

Built Distribution

datagate_llm-0.1.0-py3-none-any.whl (11.3 kB view details)

Uploaded Python 3

File details

Details for the file datagate_llm-0.1.0.tar.gz.

File metadata

  • Download URL: datagate_llm-0.1.0.tar.gz
  • Upload date:
  • Size: 13.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for datagate_llm-0.1.0.tar.gz
Algorithm Hash digest
SHA256 8372c3ed289d11d10c48282dda71a028025f46b52b615449e674c92be63c191c
MD5 b1d269934f66cc3931c7722172b1b798
BLAKE2b-256 57f30a5963cde7a0f7e2e9f24e8e6439db712f31fe626f77412eb0ec509dbd17

Provenance

The following attestation bundles were made for datagate_llm-0.1.0.tar.gz:

Publisher: publish.yml on PreethiAndichamy342/datagate-llm

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file datagate_llm-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: datagate_llm-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 11.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for datagate_llm-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 2ad3955409838b45201e1a54c8554c837d3e10ec7838436b7a23dc1d2652b923
MD5 5e7c7a3fcdd94af72cc8512c51d362c8
BLAKE2b-256 a220be4ed04a399cbd6e18a53d9df51f204588085cd73b1ece9798079b6dd2e0

Provenance

The following attestation bundles were made for datagate_llm-0.1.0-py3-none-any.whl:

Publisher: publish.yml on PreethiAndichamy342/datagate-llm

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
