
A local security tool to detect secrets and PII in LLM prompts, code, and logs.

Project description

LLM Secrets Leak Detector


LLM Secrets Leak Detector is a security tool designed to prevent accidental exposure of sensitive data when interacting with Large Language Models (LLMs).

Modern AI development workflows frequently involve sending code snippets, configuration files, logs, and debugging output to language models. In many cases developers unintentionally include sensitive information such as API keys, database credentials, private tokens, or internal infrastructure details.

This project detects those secrets before they leave the developer’s environment.

The system scans prompts, responses, logs, and source code to identify potential secrets and warns the user when confidential data may be exposed.


Problem

LLM-assisted development has dramatically increased developer productivity. However, it has also introduced a new class of security risk.

Developers regularly paste entire code blocks, configuration files, or logs into AI assistants to ask questions like:

“Here is my code, can you help debug it?”

These inputs often contain secrets such as:

  • API keys
  • database credentials
  • private tokens
  • authentication secrets
  • internal infrastructure URLs
  • encryption keys
  • JWT tokens

Example input that might be leaked:

OPENAI_API_KEY=sk-abc123
DATABASE_URL=postgres://user:pass@db
JWT_SECRET=super-secret-token

Once sent to an external LLM service, this data may:

  • appear in provider logs
  • be stored for debugging
  • violate compliance policies
  • leak sensitive infrastructure details

The exposure of secrets in software artifacts has been increasing rapidly, with millions of credentials discovered in public repositories in recent years. (arXiv)

Security teams now treat secret detection as a critical part of modern development pipelines.


Solution

LLM Secrets Leak Detector automatically scans AI interaction data and identifies potential secrets before they are transmitted.

The tool analyzes:

  • prompts sent to LLMs
  • LLM responses
  • application logs
  • code snippets
  • configuration files

When a potential secret is detected, the tool generates a warning describing:

  • the secret type
  • location
  • severity level

Example output:

⚠ Secrets detected

Type: OpenAI API Key
Location: line 3
Risk: HIGH

Type: Database credentials
Location: line 4
Risk: CRITICAL

This allows developers to remove or redact sensitive information before it reaches an external AI system.


Core Detection Approach

The detection engine uses a layered strategy similar to modern secret detection systems.

Most secret scanners rely on three complementary techniques:

  1. Pattern matching (regex): identifies secrets with known formats such as AWS keys or GitHub tokens.

  2. Entropy analysis: detects strings that appear random, which is typical for cryptographic tokens.

  3. Contextual analysis: reduces false positives by analyzing surrounding code and variable names. (gitguardian.com)

Combining these methods significantly improves accuracy.
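As a rough illustration, the first two layers can be sketched in a few lines of Python. The rules and the entropy threshold below are illustrative placeholders, not the shipped rule set:

```python
import math
import re

# Hypothetical rules for illustration; real scanners ship thousands of patterns.
RULES = [
    ("AWS Access Key ID", re.compile(r"AKIA[0-9A-Z]{16}")),
    ("OpenAI API Key", re.compile(r"sk-[A-Za-z0-9]{20,}")),
]

def shannon_entropy(s: str) -> float:
    """Bits of entropy per character; random tokens score higher than prose."""
    if not s:
        return 0.0
    freq = {c: s.count(c) for c in set(s)}
    return -sum(n / len(s) * math.log2(n / len(s)) for n in freq.values())

def scan(text: str, entropy_threshold: float = 4.0):
    findings = []
    # Layer 1: pattern matching against known secret formats.
    for name, pattern in RULES:
        for m in pattern.finditer(text):
            findings.append((name, m.group()))
    # Layer 2: entropy analysis on long tokens no rule already matched.
    for token in re.findall(r"\S{20,}", text):
        if shannon_entropy(token) >= entropy_threshold and not any(
            t == token for _, t in findings
        ):
            findings.append(("High-entropy string", token))
    return findings
```

A contextual layer would then adjust each finding's confidence based on nearby variable names and keywords.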


Secret Types Detected

The scanner detects over 180 classes of sensitive data, including:

API Keys

Examples:

sk-xxxxxxxxxxxxxxxx
AIzaSyxxxxxxxxxxxx

Cloud Credentials

Examples:

AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
AZURE_TOKEN

Version Control Tokens

Examples:

ghp_xxxxxxxxxxxxxxxxx
glpat_xxxxxxxxxxxxx

Authentication Secrets

Examples:

JWT_SECRET
SESSION_SECRET
PRIVATE_KEY

Database Credentials

Examples:

postgres://user:password@host
mysql://root:pass@db

Cryptographic Material

Examples:

-----BEGIN PRIVATE KEY-----

Feature Matrix

The LLM Secrets Leak Detector provides a comprehensive suite of features designed for security, performance, and developer experience.

Detection Engines

  • Regex Matching (RE2): primary engine using google-re2; fast, linear-time matching.
  • Regex Matching (Legacy): Python regex fallback for complex patterns, with ReDoS protection.
  • Entropy Analysis: Shannon entropy scoring for random-looking tokens (minimum 20 characters).
  • Contextual Heuristics: identifies secrets from surrounding keywords such as prod, password, and key; supports multi-lingual conversational intent matching (English, Spanish, French, German).
  • Rule-based Logic: 1750+ rules loaded from data/ (expanded 2026).

Input Sources

  • File Scanning: scans local files with UTF-8 support and error handling.
  • Stdin / Piped Input: real-time processing of piped data (e.g., cat log | ./run.sh).
  • Direct Text: --text flag for quick prompt validation.
  • Streaming: optimized line-by-line generator for low-latency processing.

Obfuscation

  • Redact: masks the middle of secrets (e.g., AKIA...CDEF).
  • Hash: consistent SHA-256 hashing (first 12 characters) for safe debugging.
  • Synthetic [NEW]: realistic fake data generation (AWS, GitHub, emails) using Faker.

Safety & Performance

  • Keyword Filtering: ahocorasick-rs automaton (with SIMD) skips rules whose required keywords are absent.
  • Parallel Scanning [NEW]: ProcessPoolExecutor for high-speed historical audits and multi-file directory scans.
  • Commit Caching [NEW]: incremental scanning via .secretscan_cache to skip verified SHAs.
  • Zero-Copy Scanning: memory-mapped I/O (mmap) with chunk overlaps for gigabyte-scale logs.
  • ReDoS Protection: SIGALRM timeouts (1 s) for non-RE2 regex execution.
  • Input Truncation: input blocks capped at 1 MB to prevent memory exhaustion.
  • Deduplication: merges overlapping findings and prioritizes the longest match.
  • Force All Scan: --force-scan-all bypasses keyword filters so every line is scored.

Reporting & UI

  • Surgical Highlighting [NEW]: ANSI-colored context lines with the secret highlighted in red.
  • Remediation Hints [NEW]: actionable advice with links to official provider documentation.
  • Colorized Output: ANSI colors for risk levels (red = high, yellow = medium, blue = low).
  • Report Formats: summary (counts only), short (redacted), full (raw secrets + context), and SARIF (GitHub Code Scanning).
  • CI/CD Friendly: --nocolors flag and standard exit codes for automation.

Testing & Dev

  • BDD Acceptance: 25 scenarios in acceptance.feature (including Git workflows) using pytest-bdd.
  • Performance Bench [NEW]: automated suite to verify caching and parallelization gains.
  • Unit Testing: comprehensive suite for core logic (detector, obfuscator, cli).
  • Synthetic Corpus: generate_test_data.py creates a balanced test set from rules.
  • Rule Deduplication: tools/deduplicate_rules.py keeps the catalog clean before release.
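The zero-copy scanning approach boils down to mmap-based reading with overlapping chunks, so a secret straddling a chunk boundary is still caught. A minimal sketch of that idea (chunk and overlap sizes here are illustrative, not the library's actual values):

```python
import mmap
import re

CHUNK = 1 << 20   # 1 MiB chunks (illustrative)
OVERLAP = 4096    # overlap so a match straddling a boundary is not missed

def scan_large_file(path: str, pattern: re.Pattern) -> set:
    """Scan an arbitrarily large file without loading it into memory wholesale."""
    matches = set()
    with open(path, "rb") as f:
        size = f.seek(0, 2)
        if size == 0:
            return matches
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            start = 0
            while start < size:
                end = min(start + CHUNK + OVERLAP, size)
                # Slicing an mmap touches only the needed pages.
                matches.update(pattern.findall(mm[start:end]))
                start += CHUNK
    return matches
```

Deduplicating via a set also absorbs any double-counting caused by the overlap region.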

Git & CI/CD Integration

The detector is now natively aware of Git lifecycles, allowing for surgical scans of changes rather than entire files.

🛠 Git Scanning Modes

# Scan staged changes (perfect for pre-commit hooks)
./run.sh --git-staged --mode fast

# Scan unstaged changes in the working directory
./run.sh --git-working

# Scan the diff between a feature branch and main (PR audits)
./run.sh --git-branch origin/main --format sarif

# Deep audit of repository history (Parallelized & Cached)
./run.sh --scan-history --limit-depth 30 --limit-commits 100

🏎 Performance & Scalability

  • Parallel Execution: Large-scale historical audits and multi-file directory scans automatically utilize multiple CPU cores for regex and entropy analysis.
  • Commit Caching: The engine maintains a .secretscan_cache to track verified "clean" commits, reducing redundant scan times by up to 90% in incremental audits.
  • Modes: Choose between fast (optimized for <1s hooks), balanced (standard dev), and deep (thorough CI audits).

Surgical Highlighting & Remediation

When a secret is detected, the terminal output provides immediate visual context and actionable fix instructions.

⚠ Secrets detected: 1
- HIGH: 1

Type: stripe_api_key
Location: line 1
Risk: HIGH
Suggestion: Rotate this Stripe API key immediately in your dashboard. See: https://stripe.com/docs/keys#api-key-rotation
Context: config process result Stripe secret: [SECRET_HIGHLIGHTED_IN_RED]

Remediation hints now include direct links to official security guides for AWS, GitHub, Stripe, and Google Cloud to guide developers through the revocation and rotation process.

Natural Language Contextual Matching

The detection engine uses a 100-character context window to detect natural-language conversational intents, such as "Here is my prod api key:", which are common when interacting with LLMs. This feature is fully multi-lingual, boosting confidence scores when intent is detected in English, Spanish, French, or German.
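A simplified sketch of this kind of context boost follows; the phrase lists and bonus value are illustrative placeholders, not the shipped intent catalog:

```python
# Hypothetical phrase lists; the real rule set covers far more intents.
INTENT_PHRASES = {
    "en": ["here is my", "api key", "password is"],
    "es": ["aquí está mi", "mi clave"],
    "fr": ["voici ma clé", "mot de passe"],
    "de": ["hier ist mein", "passwort"],
}

CONTEXT_WINDOW = 100  # characters inspected on each side of a match

def context_boost(text: str, match_start: int, match_end: int) -> float:
    """Return a confidence bonus when conversational intent appears near a match."""
    window = text[max(0, match_start - CONTEXT_WINDOW):match_end + CONTEXT_WINDOW].lower()
    for phrases in INTENT_PHRASES.values():
        if any(p in window for p in phrases):
            return 0.2  # illustrative bonus; real weights would be tuned per rule
    return 0.0
```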


Extended Infrastructure Mode

The latest feature expansion brings the infrastructure-focused taxonomy front and center:

  • data/infrastructure now houses rules for credit cards, IBANs/SEPA references, national ID numbers, and other high-risk identifiers.
  • Entropy-aware scoring plus overlap resolution lets structured infrastructure matches win over generic keywords or high-entropy heuristics.
  • The CLI --force-scan-all option ensures legacy logs that omit keywords still get evaluated (see the new acceptance scenario for this mode).
  • Dedicated tests cover deduped rules, synthetic obfuscation, and the expanded dataset to ensure the library stays precise.
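The overlap resolution described above can be sketched as a longest-match-wins pass over findings; the field names here are illustrative, not the library's internal schema:

```python
def resolve_overlaps(findings):
    """Keep the longest finding in each overlapping group. Structured rules tend
    to win over generic keyword or entropy hits because their spans cover the
    whole credential rather than a fragment of it."""
    ordered = sorted(findings, key=lambda f: (f["start"], -(f["end"] - f["start"])))
    kept = []
    for f in ordered:
        if kept and f["start"] < kept[-1]["end"]:
            # Overlap with the previous finding: replace it only if longer.
            if f["end"] - f["start"] > kept[-1]["end"] - kept[-1]["start"]:
                kept[-1] = f
            continue
        kept.append(f)
    return kept
```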

Development Utilities

Keep the catalog healthy with the accompanying tools:

  • tools/migrate_patterns.py normalizes schema fields, adds entropy defaults, and maps external categories to the in-tree taxonomy.
  • tools/generate_test_data.py rebuilds the base64-encoded data/*/test_data.json files from regexes so every rule ships with reproducible samples.
  • tools/deduplicate_rules.py merges duplicate patterns across categories before rules ship.
  • Use tools/regex_lint.py, tools/run_safe_regex.py, and tools/run_redoctor.py to guard against ReDoS, syntax drift, and schema regressions.

Run pytest tests/test_acceptance.py::test_force_scan_keywordless before releasing to exercise the keywordless mode.


Pattern Database

The detection engine can leverage large open-source pattern databases containing thousands of secret signatures.

For example, open datasets include over 1600 regular expressions that detect API keys, tokens, passwords, and other credentials across hundreds of services. (GitHub)

This allows the scanner to stay updated with newly introduced API key formats.


Project Goals

The project focuses on protecting AI workflows rather than traditional repository scanning.

Key design goals:

AI-first security

Detect secrets inside:

  • LLM prompts
  • chat transcripts
  • agent logs
  • debugging sessions

Developer-first experience

The tool integrates directly into developer workflows without requiring complex configuration.

Local processing

All scanning occurs locally to ensure no data leaves the environment.

Fast feedback

Secrets should be detected immediately during development.


Core Components

The system is composed of several modules.

Detection Engine

Responsible for identifying potential secrets using:

  • regex pattern matching
  • entropy scoring
  • context heuristics

Pattern Database

A continuously updated collection of secret signatures.

Includes patterns for:

  • API providers
  • cloud platforms
  • CI/CD tokens
  • authentication systems

Scanner Interface

The scanner processes different input sources:

  • text prompts
  • log streams
  • source files
  • application outputs

Reporting System

Findings are returned as structured results including:

  • secret type
  • location
  • confidence score
  • risk level
  • risk score (0-100)

The CLI produces clear, color-coded output highlighting the location, risk level (HIGH, MEDIUM, LOW), and an Advanced Risk Score (0-100) of detected secrets. The risk score is determined by a weighted heuristic that incorporates regex confidence, contextual proximity bonuses, and entropy adjustments. The report.py module manages deduplication and formatting. Use --format sarif for CI/CD integration.

You can tune sensitivity and filter out low-confidence noise by using the --min-score flag (e.g., --min-score 70).
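The exact weighting is internal to the scorer, but a weighted heuristic of this shape could look like the following sketch; the weights and the entropy normalization are assumptions for illustration, not the library's actual formula:

```python
def risk_score(regex_confidence: float, context_bonus: float, entropy_bits: float) -> int:
    """Illustrative 0-100 score from three inputs, each expected in 0..1
    (entropy is normalized; ~6 bits/char is near the ceiling for base64-like text)."""
    entropy_term = min(entropy_bits / 6.0, 1.0)
    # Assumed weights: regex confidence dominates, context and entropy adjust.
    score = 100 * (0.6 * regex_confidence + 0.25 * context_bonus + 0.15 * entropy_term)
    return max(0, min(100, round(score)))
```

A --min-score 70 filter would then drop any finding whose combined score falls below 70.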


Architecture

The architecture prioritizes simplicity and speed.

Input Sources
    │
    │
    ▼
Preprocessing Layer
    │
    │
    ▼
Detection Engine
    ├── Regex Matching
    ├── Entropy Detection
    └── Context Analysis
    │
    ▼
Secret Classification
    │
    ▼
Security Report

Example Detection

Input text:

Here is my configuration:

DATABASE_URL=postgres://admin:password@localhost

Output:

Secrets detected:

[1] Database Credentials
location: line 3
risk: CRITICAL

Installation

# Install from PyPI (Recommended)
pip install py-secret-scan

# Run
secret-scan example_file.txt

Developer Installation

# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install in editable mode
pip install -e .

Or scan text directly:

# Standard scan
secret-scan --text "My API key is AIzaSy-12345"

# Force scan all lines (bypasses keyword filters)
secret-scan --force-scan-all .

Data Obfuscation & Masking

You can redact sensitive data from logs or prompts while preserving the rest of the text. This is useful for sanitizing data before sharing it with an LLM or for safe debugging.

Enable obfuscation with the --obfuscate flag:

# Default mode: redact (Redacts middle of the secret)
# Input: "My key is ghp_1234567890abcdefghijklmnopqrstuvwx"
# Output: "My key is ghp_...uvwx"
cat logs.txt | secret-scan --obfuscate

Choose different obfuscation strategies with --obfuscate-mode:

1. redact (Default)

Partial masking that keeps the prefix/suffix for context but hides the sensitive core.

  • Example: AKIA...CDEF

2. hash

Replaces secrets with a consistent, short SHA-256 hash. Identical secrets will result in identical hashes, which is crucial for debugging data flows without seeing the actual values.

  • Example: [HASHED_d8c7b92f4a19]

3. synthetic (Recommended for LLM Prompts)

Replaces secrets with realistic-looking fake data that matches the original format (using the Faker library). This allows LLMs to still "understand" the structure of your data (e.g., seeing a fake AWS key where a real one was) without exposing real credentials.

  • Example (AWS ID): AKIAJ7O2N6M4L9K0P8R1
  • Example (GitHub Token): ghp_zXyWvUtSrQpOnMlKjIhGfEdCbA9876543210
  • Example (Email): fake_user@example.org

# Use synthetic mode for realistic placeholders
secret-scan --obfuscate --obfuscate-mode synthetic logs.txt
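The redact and hash strategies each boil down to a few lines. This sketch mirrors the documented output formats but is not the library's actual implementation:

```python
import hashlib

def redact(secret: str, keep: int = 4) -> str:
    """Keep a short prefix/suffix for context; hide the sensitive core."""
    if len(secret) <= keep * 2:
        return "*" * len(secret)  # too short to safely show any part
    return f"{secret[:keep]}...{secret[-keep:]}"

def hash_mask(secret: str) -> str:
    """Deterministic mask: identical secrets map to identical placeholders,
    so data flows stay traceable without exposing the value."""
    digest = hashlib.sha256(secret.encode()).hexdigest()[:12]
    return f"[HASHED_{digest}]"
```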

Custom CLI helpers

The repository ships with a few convenience commands:

  • ./run.sh <file> is deprecated; use secret-scan <file> directly.
  • secret-scan --text "<string>" runs the scanner on an inline string (useful when building prompts before sending them to an LLM).
  • python tools/generate_test_data.py rebuilds data/test_data.json from data/rules.json and should be rerun whenever the rule set changes.

Usage Example

$ secret-scan test_file.py

⚠ Secrets detected: 1
- CRITICAL: 1

Type: Database Credentials
Location: line 10
Risk: CRITICAL
Content: post...ocal (redacted)

Test Data & Custom Cases

Every rule in data/rules.json maps to a base64-encoded positive and negative sample under data/test_data.json. tools/generate_test_data.py drives the data:

  • It loads each regex, runs it through exrex (with length caps) to emit matches, and encodes them so the detector tests operate on the same strings as production data.
  • Negatives are hand-crafted near-misses that resemble real-world secrets but should not trigger a hit.
  • Rules listed in STRICT_RULES bypass the default encode_str mutation because even inserting DUMMY_IGNORE would break the required format.
  • Custom helpers generate valid payloads for the trickiest patterns (auth0_domain_url, skybiometry_api_key, okta_api_domain_url, facebook_oauth_id, linemessaging_api_key, nethunt_api_key) so the detector still sees legal samples even though those regexes restrict character sets or lengths tightly.

Run python tools/generate_test_data.py after any rule changes; it prints progress every 100 rules and overwrites data/test_data.json with the refreshed corpus that powers the pytest suite.

Use Cases

AI Application Development

Developers building:

  • chatbots
  • RAG pipelines
  • AI agents
  • coding assistants

can scan prompts before sending them to LLM APIs.


Security Auditing

Security teams can analyze:

  • prompt logs
  • application logs
  • LLM interaction history

to ensure no secrets were exposed.


Compliance

Organizations can enforce policies preventing sensitive information from being sent to external AI providers.


DevSecOps Integration

The scanner can be integrated into:

  • CI/CD pipelines
  • AI gateways
  • API proxies
  • developer tooling

PII Detection

The scanner now supports detecting Personally Identifiable Information (PII) including emails, phone numbers, credit cards, and SSNs.

Enable PII detection with the --pii flag:

# Scan a file for secrets and PII
secret-scan --pii example_file.txt

# Limit PII scanning to specific regions (e.g., US only for SSNs and US phone numbers)
secret-scan --pii --pii-region US example_file.txt

PII findings are integrated into the multi-tier reporting system, where highly structured secrets (Tier 1) take precedence over contextual or generic entropy hits.
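One common way to keep structured card-number matches precise is a Luhn checksum, which cheaply rejects digit strings that merely look like card numbers. Whether the library applies it internally is not documented here, so treat this as an illustrative validation step:

```python
def luhn_valid(number: str) -> bool:
    """Luhn checksum over the digits of a candidate card number."""
    digits = [int(d) for d in number if d.isdigit()]
    if len(digits) < 13:  # payment cards are at least 13 digits
        return False
    checksum = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0
```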


Target Users

AI Developers

Engineers building LLM-powered applications.

Security Engineers

Teams responsible for application security reviews.

AI Startups

Companies working with prompt engineering and LLM pipelines.


Roadmap

The project evolves in several stages.

CLI Scanner

A lightweight command-line tool for scanning prompts and logs.

API Service

A service that allows AI systems to validate prompts before sending them to LLM providers.

Developer Tooling

Integration with:

  • IDE plugins
  • Git hooks
  • CI pipelines

Enterprise Security Platform

Future capabilities may include:

  • real-time prompt filtering
  • AI data loss prevention (DLP)
  • secret monitoring across AI infrastructure

Why This Matters

AI-assisted development dramatically increases the speed of coding and debugging, but it also increases the risk of accidentally exposing sensitive data.

Developers frequently paste large blocks of code or logs into AI systems without reviewing them for secrets.

LLM Secrets Leak Detector provides a safety layer that prevents confidential data from leaving the organization.


License

MIT License

Project details


Download files

Download the file for your platform.

Source Distribution

py_secret_scan-3.0.2.tar.gz (217.6 kB view details)

Uploaded Source

Built Distribution


py_secret_scan-3.0.2-py3-none-any.whl (30.5 kB view details)

Uploaded Python 3

File details

Details for the file py_secret_scan-3.0.2.tar.gz.

File metadata

  • Download URL: py_secret_scan-3.0.2.tar.gz
  • Size: 217.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for py_secret_scan-3.0.2.tar.gz:

  • SHA256: d0ceaa2322fe3b33547126abd01fb376be900ccca23ded1ccf00aba3bbcd6409
  • MD5: 1911664caed1b39572e6bb84b6e2352f
  • BLAKE2b-256: 50de8951c6238375c8e8963e574fcb8feb70d2a80b904cc842b3f75126a34aaa


Provenance

The following attestation bundles were made for py_secret_scan-3.0.2.tar.gz:

Publisher: pypi-publish.yml on JMartynov/secret-scan

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file py_secret_scan-3.0.2-py3-none-any.whl.

File metadata

  • Download URL: py_secret_scan-3.0.2-py3-none-any.whl
  • Size: 30.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for py_secret_scan-3.0.2-py3-none-any.whl:

  • SHA256: e91877212bf72bbc86c2ff5835575b25102252aac0bb6987326cd4765a3a5abf
  • MD5: 01aa51bc5397213b2bdaa610be9b2871
  • BLAKE2b-256: 8918d80dd5145335c58b602e9a2f3c7b9611e2961d7dd4d34032d3f88c5d68cc


Provenance

The following attestation bundles were made for py_secret_scan-3.0.2-py3-none-any.whl:

Publisher: pypi-publish.yml on JMartynov/secret-scan

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
