Deterministic, Injection-Resistant Web → Markdown Engine for LLM Pipelines
Project description
MarkDownIngress
Deterministic, injection-resistant Web → Markdown engine for LLM pipelines
Overview
MarkDownIngress is a security-first web content ingestion engine for LLM pipelines. It fetches web pages, sanitizes HTML via Mozilla Readability, detects prompt injection patterns, converts to token-optimized Markdown, and produces deterministic output. It ships as a Python library, a FastAPI server, and a CLI.
It is not a recursive crawler, a full RAG framework, or a generic HTML→Markdown converter. It is an ingestion security boundary that flags untrusted content before it reaches a model.
Key Features
| Feature | Description |
|---|---|
| Injection Detection | 10+ pattern detectors with 0.0–1.0 risk scoring; optional Nova / LLM tiers |
| Token Optimization | 70–80% average token reduction via Readability + sanitization |
| Deterministic Output | Stable Markdown and SHA256 content/structural hashes in fast mode |
| Fast / Render / Auto | HTTP-only, Playwright SPA rendering, or automatic fallback |
| Structured Blocks & Chunks | Heading/table/code/list extraction with stable RAG chunks |
| Domain Policies | Per-host overrides for mode, thresholds, selectors, allowed/blocked tags |
| Output Profiles | llm_safe, rag_chunkable, for_search, for_archive, default |
| Batch & Async | Concurrent ingestion with in-flight dedup and per-mode stats |
| Library + CLI + API | ingest() / markdown-ingress / FastAPI /api/v1/* |
Supported Outputs
Document SafeDocument (markdown + metadata + hashes + score + flags)
Serialization Markdown, JSON
Security Injection score 0.0–1.0, risk level, JSON security report
Structure Structured blocks, native chunks, structural hash
API FastAPI /api/v1 with persistent batch jobs + webhooks
Installation
From PyPI (Recommended)
pip install markdown-ingress
From Source
git clone https://github.com/seifreed/MarkDownIngress.git
cd MarkDownIngress
python3 -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install -e .
Optional Extras
pip install "markdown-ingress[all]" # everything
pip install "markdown-ingress[render]" # Playwright SPA rendering
pip install "markdown-ingress[security]" # Nova advanced injection detection
pip install "markdown-ingress[api]" # FastAPI server
# Render mode also needs a browser binary:
playwright install chromium
Quick Start
# Ingest a single URL and print the report
markdown-ingress ingest https://example.com
# Save sanitized Markdown to a file
markdown-ingress ingest https://example.com --save example.md
# JSON output with metadata, hashes, and injection score
markdown-ingress ingest https://example.com --json --save example.json
Example report output:
============================================================
MarkDownIngress v1.0.0 - Ingestion Report
============================================================
📄 Title: Example Domain
🔗 URL: http://example.com
✔ Tokens: 33
↳ Saved: 119 tokens (78.29% reduction)
🔒 Injection Score: 0.000 (SAFE)
🔑 Hash: sha256:d6ac852cf2392c04d2cf3e3e4156f786cfbc4f46308ebe756ebd72cf9ffef4ef
⏱️ Fetch time: 116ms
Usage
Command Line Interface
# Render JavaScript-heavy SPAs with Playwright
markdown-ingress ingest https://spa-app.example.com --render --timeout 60
# RAG-ready structured output with heading-based chunks
markdown-ingress ingest https://docs.example.com \
--output-profile rag_chunkable \
--extract-blocks \
--chunking-strategy heading \
--show-chunks
# Batch a URL list into a directory of Markdown files
markdown-ingress batch urls.txt --output results/
# Compare extractors on a local HTML file (runs offline)
markdown-ingress compare tests/fixtures/technical_doc.html --json
# Benchmark token reduction across a URL list
markdown-ingress benchmark urls.txt --iterations 5 --compare-extractors
Commands
| Command | Description |
|---|---|
markdown-ingress ingest <url> |
Ingest a single URL (text, --json, --save) |
markdown-ingress batch <file> |
Process a newline-delimited URL file concurrently |
markdown-ingress compare <html> |
Compare Readability vs. Trafilatura on local HTML |
markdown-ingress benchmark <file> |
Measure latency and token reduction over a URL list |
Key Flags (ingest)
| Option | Description |
|---|---|
--render / --fast |
Force Playwright render mode or HTTP-only fast mode |
--strict / --permissive |
Toggle the strict security threshold (strict is default) |
--config FILE |
Load runtime settings from a YAML/JSON config file |
--model MODEL |
Token-estimation model (gpt-4, claude, gpt-3.5-turbo) |
--output-profile PROFILE |
Apply llm_safe, rag_chunkable, for_search, for_archive |
--extract-blocks |
Emit structured blocks (headings, tables, code, lists) |
--chunking-strategy {none,heading,size} |
Build stable native chunks |
--domain-policy-file FILE |
Load per-host policy overrides from JSON |
--json / --save FILE |
Emit JSON / write primary output to a file |
--show-observability |
Print stage timings and policy/cost telemetry |
Python Library
Basic Usage
from markdown_ingress import ingest
doc = ingest("https://example.com", mode="fast", strict=True)
print(doc.markdown) # Sanitized Markdown
print(doc.token_estimate) # Token count for the chosen model
print(doc.injection_score) # 0.0 (safe) → 1.0 (critical)
print(doc.content_hash) # "sha256:..." for dedup/versioning
print(doc.flags) # Security warning flags
Batch and Async
import asyncio
from markdown_ingress import ingest_many, ingest_async
# Concurrent batch ingestion with in-flight dedup
result = ingest_many(
["https://example.com/a", "https://example.com/b"],
mode="auto",
max_concurrent=4,
)
print(f"safe: {result.successful}/{result.total}")
# Async single ingestion
async def main():
doc = await ingest_async("https://example.com", mode="auto")
print(doc.metadata["title"])
asyncio.run(main())
RAG-Ready Structured Output
from markdown_ingress import ingest
doc = ingest(
"https://docs.example.com/guide",
mode="fast",
output_profile="rag_chunkable",
extract_blocks=True,
chunking_strategy="heading",
)
print(doc.structured_blocks[0]["block_type"])
print(doc.chunks[0]["chunk_id"])
Security Report API
from markdown_ingress import generate_security_report
report = generate_security_report("https://suspicious-site.example.com")
report.save("security_report.json")
print(report.injection_score) # numeric risk score
print(report.risk_level) # SAFE / LOW / MEDIUM / HIGH / CRITICAL
print(report.token_reduction_percent) # token savings %
print(report.pattern_matches) # matched injection patterns
Domain-Specific Hardening
from markdown_ingress import DomainPolicy, ingest
doc = ingest(
"https://forum.example.com/thread",
mode="auto",
domain_policies=[
DomainPolicy(
domain="forum.example.com",
output_profile="llm_safe",
policy_name="strict",
blocked_selectors=[".reply-box", ".promo"],
blocked_tags=["form"],
)
],
)
print(doc.metadata["domain_policy"])
More runnable scripts live in examples/ — see
library_usage.py and
library_batch_async.py.
Security Model
Injection Detection Patterns
| Pattern | Weight | Example |
|---|---|---|
| Instruction Override | 0.8 | "ignore previous instructions" |
| Secret Extraction | 0.9 | "reveal secret keys" |
| Mode Switching | 0.7 | "enable developer mode" |
| System Prompt Access | 0.6 | "reveal system prompt" |
| Policy Override | 0.8 | "override policy settings" |
| Model Manipulation | 0.5 | "you are ChatGPT" |
Risk Levels
| Score | Level | Action |
|---|---|---|
| 0.0 – 0.2 | SAFE | Content appears safe |
| 0.2 – 0.4 | LOW | Review recommended |
| 0.4 – 0.6 | MEDIUM | Manual review required |
| 0.6 – 0.8 | HIGH | Use with caution |
| 0.8 – 1.0 | CRITICAL | Blocking recommended |
Base installs use deterministic heuristics. Install [security] to add Nova
semantic detection, and set ANTHROPIC_API_KEY with --use-llm for the
optional LLM-assisted tier.
FastAPI Server
pip install "markdown-ingress[api]"
uvicorn markdown_ingress.api_server:app --port 8000
Versioned endpoints under /api/v1:
POST /api/v1/ingest Single URL ingestion
POST /api/v1/ingest/batch Synchronous batch
POST /api/v1/jobs/batch Persistent batch job (TTL + optional webhook)
GET /api/v1/jobs/{job_id} Job status / result
POST /api/v1/security/report Security report for a URL
GET /api/v1/stats Process-level ingest stats
GET /api/v1/health Health check
curl -X POST http://localhost:8000/api/v1/ingest \
-H "Content-Type: application/json" \
-d '{"url":"https://example.com","mode":"fast","strict":true}'
Requirements
- Python 3.13 or 3.14
- Core:
httpx,selectolax,readability-lxml,markdownify,tiktoken - Optional:
playwright(render),nova-hunting(security),fastapi(api) - See pyproject.toml for the complete dependency list
Development
git clone https://github.com/seifreed/MarkDownIngress.git
cd MarkDownIngress
python3 -m venv venv && source venv/bin/activate
pip install -e ".[dev]"
playwright install chromium
make test # full local suite (campaign/baseline excluded)
make test-fast # suite excluding opt-in live dataset tests
ruff check markdown_ingress tests
black --check markdown_ingress tests
mypy markdown_ingress
bandit -r markdown_ingress
Every bug fix must include a regression test that fails before the fix and
passes after it. ruff, black --check, mypy, and bandit must pass
before code is considered complete.
Release
Releases are tag driven. Update the package version in pyproject.toml and
markdown_ingress/__init__.py, commit the change, then create and push a v*
tag:
git tag v1.0.0
git push origin v1.0.0
The publish workflow builds the wheel and source distribution, checks them with
twine, creates or updates the matching GitHub Release, uploads dist/* as
release assets, and publishes to PyPI via Trusted Publishing/OIDC. No long-lived
PyPI API token is used.
Configure the PyPI trusted publisher before pushing a release tag:
- Owner:
seifreed - Repository:
MarkDownIngress - Workflow:
publish.yml - Environment:
pypi
Support the Project
If this project is useful in your workflows, you can support development:
License
This project is licensed under the MIT License. See LICENSE.
Attribution
- Author: Marc Rivero López | @seifreed
- Repository: github.com/seifreed/MarkDownIngress
Built for the LLM era. Secure by default. Deterministic by design.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file markdown_ingress-1.0.0.tar.gz.
File metadata
- Download URL: markdown_ingress-1.0.0.tar.gz
- Upload date:
- Size: 392.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.13
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
3aa684e4678da4c83355dcab4bb94fcd2416141c4f3082d8c81a619abef0eff5
|
|
| MD5 |
0fe2e978e4bdfe1907691fe167ab6a28
|
|
| BLAKE2b-256 |
99199e95089cbcd5aaca523ae139ef1c6ad33b677482071551a9fcca11ea17c9
|
Provenance
The following attestation bundles were made for markdown_ingress-1.0.0.tar.gz:
Publisher:
publish.yml on seifreed/MarkDownIngress
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
markdown_ingress-1.0.0.tar.gz -
Subject digest:
3aa684e4678da4c83355dcab4bb94fcd2416141c4f3082d8c81a619abef0eff5 - Sigstore transparency entry: 1594980494
- Sigstore integration time:
-
Permalink:
seifreed/MarkDownIngress@214b6296f9ab623eb74a1dc572a57a8aa8a63924 -
Branch / Tag:
refs/tags/v1.0.0 - Owner: https://github.com/seifreed
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@214b6296f9ab623eb74a1dc572a57a8aa8a63924 -
Trigger Event:
push
-
Statement type:
File details
Details for the file markdown_ingress-1.0.0-py3-none-any.whl.
File metadata
- Download URL: markdown_ingress-1.0.0-py3-none-any.whl
- Upload date:
- Size: 312.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.13
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
3177fbb8d54b33468268968bae1cc2065cb055d5d8ffb4c85622296817de1169
|
|
| MD5 |
973a3f69f34c7260b408b5d119e45bde
|
|
| BLAKE2b-256 |
53301ab7dcdbb8a40470b0163d836998c984ebb8d09359006df23ac77ac9fe40
|
Provenance
The following attestation bundles were made for markdown_ingress-1.0.0-py3-none-any.whl:
Publisher:
publish.yml on seifreed/MarkDownIngress
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
markdown_ingress-1.0.0-py3-none-any.whl -
Subject digest:
3177fbb8d54b33468268968bae1cc2065cb055d5d8ffb4c85622296817de1169 - Sigstore transparency entry: 1594980594
- Sigstore integration time:
-
Permalink:
seifreed/MarkDownIngress@214b6296f9ab623eb74a1dc572a57a8aa8a63924 -
Branch / Tag:
refs/tags/v1.0.0 - Owner: https://github.com/seifreed
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@214b6296f9ab623eb74a1dc572a57a8aa8a63924 -
Trigger Event:
push
-
Statement type: