Alembic
Distil any webpage into clean Markdown for LLM pipelines — 84–98% token reduction.
Alembic is a local HTTP proxy and CLI that sits between your agent and the open web. It fetches a URL, strips navigation, ads, scripts, and boilerplate through a multi-stage extraction cascade, and returns clean LLM-ready Markdown — typically at 84–98% token reduction. It also rewrites registry/documentation URLs to their LLM-optimised equivalents, extracts PDF text, and searches the web via Brave Search. Everything runs locally; no API keys required for basic use.
Named for the alchemical distillation apparatus — we turn raw web pages into the pure essence an agent needs.
Install
pip install alembic-proxy
Or from source:
git clone https://github.com/InunuNet/Alembic.git
cd Alembic
pip install -e .
Quick Start
# Start the proxy (recommended for agent workflows)
alembic serve
# Distil a URL — returns clean Markdown
curl http://localhost:7077/https://example.com
# Search + distil + synthesise
curl "http://localhost:7077/?q=python+async+patterns&fetch=true"
# JSON response with full metadata
curl -H "Accept: application/json" http://localhost:7077/https://example.com
# CLI fetch with token savings report
alembic fetch https://example.com --stats
Features
URL Normalization
Before fetching, Alembic rewrites certain "human-readable" URLs to their LLM-optimised equivalents — documentation hosts and structured APIs instead of noisy HTML pages:
| From | To | Quality gain |
|---|---|---|
| arxiv.org/pdf/{id} | arxiv.org/abs/{id} | PDF → clean abstract (quality 0→85) |
| github.com/{owner}/{repo}/blob/{branch}/{file} | raw.githubusercontent.com/... | HTML → raw file (quality 35→84) |
| hex.pm/packages/{name} | hexdocs.pm/{name}/ | Version list → docs (quality 30→88) |
| rubygems.org/gems/{name} | rubydoc.info/gems/{name}/ | Version list → docs (quality 35→93) |
| crates.io/crates/{name} | docs.rs/{name}/ | og-description → API docs (quality 85→100) |
| formulae.brew.sh/formula/{name} | formulae.brew.sh/api/formula/{name}.json | HTML → JSON API (quality 52→85) |
| formulae.brew.sh/cask/{name} | formulae.brew.sh/api/cask/{name}.json | HTML → JSON API |
| npmjs.com/package/{name} | registry.npmjs.org/{name}/latest | HTML → JSON API (quality 35→85); supports @scope/pkg |
| opam.ocaml.org/packages/{name} | ocaml.org/p/{name}/latest | HTML → docs (quality 33→88) |
| mvnrepository.com/artifact/{G}/{A} | search.maven.org/artifact/{G}/{A} | HTML → Maven Central (quality 12→70) |
| gopkg.in/{package} | pkg.go.dev/gopkg.in/{package} | install page → full API docs (quality 64→100) |
| swiftpackageindex.com/{owner}/{name} | github.com/{owner}/{name} | SPI page → GitHub llms.txt (quality 55→100) |
| lib.rs/crates/{name} | docs.rs/{name}/ | browser page → full API docs (quality 35→100) |
| cran.r-project.org/package={name} | rdocumentation.org/packages/{name} | link-heavy → clean R docs (quality 55→100) |
| clojars.org/{group}/{artifact} | clojars.org/api/artifacts/{group}/{artifact} | HTML → JSON API (quality 69→85) |
| pypi.org/project/{name} | pypi.org/pypi/{name}/json | HTML → JSON API (quality 81→85; 46 → 6K-160K words) |
| registry.terraform.io/providers/{ns}/{type} | registry.terraform.io/v1/providers/{ns}/{type} | HTML → JSON API (10 → 43K words) |
| registry.terraform.io/modules/{ns}/{name}/{provider} | registry.terraform.io/v1/modules/{ns}/{name}/{provider} | HTML → JSON API (10 → 52K words) |
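To make the rewrite step concrete, here is a minimal sketch of how a few rows of the table above could be expressed as pattern/replacement pairs. The regexes, the `REWRITE_RULES` list, and the `normalize_url` name are assumptions made for illustration; they are not taken from Alembic's source.

```python
import re

# Illustrative subset of the rewrite table. The patterns are assumptions,
# not Alembic's actual rules.
REWRITE_RULES = [
    (re.compile(r"^https?://(?:www\.)?crates\.io/crates/([^/?#]+)/?$"),
     r"https://docs.rs/\1/"),
    (re.compile(r"^https?://(?:www\.)?pypi\.org/project/([^/?#]+)/?$"),
     r"https://pypi.org/pypi/\1/json"),
    (re.compile(r"^https?://(?:www\.)?arxiv\.org/pdf/([^/?#]+)/?$"),
     r"https://arxiv.org/abs/\1"),
    (re.compile(r"^https?://(?:www\.)?hex\.pm/packages/([^/?#]+)/?$"),
     r"https://hexdocs.pm/\1/"),
]


def normalize_url(url: str) -> str:
    """Return the rewritten URL for the first matching rule, else the URL unchanged."""
    for pattern, replacement in REWRITE_RULES:
        if pattern.match(url):
            return pattern.sub(replacement, url)
    return url
```

URLs that match no rule pass through untouched, so the normal cascade still handles them.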
Extraction Cascade
Every URL goes through a cascade. Alembic stops at the first stage that produces clean content:
| Stage | Strategy | What it handles |
|---|---|---|
| Pre | URL normalization | Registry/doc URL rewrites (see table above) |
| Pre | arXiv abstract adapter | arxiv.org/abs/{id} → title + authors + abstract via lxml (quality 85) |
| Pre | PDF extraction | application/pdf → text via pypdf (5MB limit, encrypted/scanned = fallback) |
| 0a | Sitemap adapter | XML sitemap / sitemap index → clean URL list |
| 0b | RSS/Atom adapter | Feeds → structured Markdown with title + items |
| 0c | Page-type adapters | Recipes, forums (Lobste.rs, Reddit, HN, SE), products |
| 0d | SVG adapter | image/svg+xml → title + desc + text nodes |
| 0e | Code-file detection | text/plain with .py/.ts/.go/.rs/.yaml/.json ext → fenced code block |
| 1 | llms.txt discovery | Sites with pre-built LLM index (+ URL-targeted excerpt; falls through if < 25 quality or < 50 words) |
| 1.5 | Hydration extraction | Next.js __NEXT_DATA__, Nuxt 3, Remix — SSR state without Playwright |
| 1.8 | JSON-LD articleBody | Articles, HowTo, FAQPage, Event, Course embedded in structured data |
| 2 | Content negotiation | Servers that return text/markdown natively |
| 3 | Trafilatura | Production article extractor — handles most pages |
| 4 | Readability | Mozilla's DOM scoring — unusual layouts |
| 5 | FitCleaner | Heuristic block scoring — dev docs and engineering blogs |
| 6 | og:description fallback | When thin extraction < 50 words + og:description ≥ 30 chars |
| 7 | Fallback | Basic tag stripping — always succeeds |
A strategy of llms.txt is the best possible result. A strategy of fallback or og-description is a yellow flag: the page is probably a JS-heavy SPA or sits behind a paywall, so check X-Alembic-JS-Hint-Score before trusting the output.
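The "stop at the first stage that produces clean content" rule can be sketched as a loop over extractor functions. This is an illustration only: `run_cascade`, `strip_tags`, and the single word-count test are hypothetical stand-ins, and the real stages apply their own quality checks (the og:description rule, for instance, is left out here).

```python
import re
from typing import Callable, Optional

# Each stage takes raw HTML and returns extracted text, or None when it finds nothing.
Extractor = Callable[[str], Optional[str]]

MIN_WORDS = 50  # the "thin extraction" threshold quoted in the table above


def strip_tags(html: str) -> str:
    """Basic tag stripping: drop anything in angle brackets and collapse whitespace."""
    return " ".join(re.sub(r"<[^>]+>", " ", html).split())


def run_cascade(html: str, stages: list[tuple[str, Extractor]]) -> tuple[str, str]:
    """Try stages in order; return (strategy, content) from the first with enough text."""
    for name, extract in stages:
        content = extract(html)
        if content and len(content.split()) >= MIN_WORDS:
            return name, content
    # The final fallback always succeeds, however thin the result.
    return "fallback", strip_tags(html)
```

Ordering the stages from most to least precise means the cheaper, cleaner sources win whenever they are available, and the fallback guarantees a response for every page.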
Bot Protection Bypass
Alembic ships a pool of 50 correlated synthetic browser personas — each with a consistent OS, browser version, screen, GPU, timezone, and language fingerprint. The curl_cffi fetch stage uses TLS impersonation matched to the persona's browser version to defeat Cloudflare Bot Management at the JA3/JA4 layer.
Stage 5 (optional) adds patchright (stealth Playwright) + a residential proxy for DataDome and Akamai Bot Manager.
| Site | Protection | Result |
|---|---|---|
| AllRecipes | Cloudflare | Passes via curl_cffi |
| Reuters | Cloudflare | Passes via curl_cffi |
| Leboncoin | DataDome | Blocked (Stage 5 target) |
| Glassdoor | Cloudflare Bot Management | Blocked (Stage 5 target) |
Enable Stage 5:
ALEMBIC_STEALTH=1 ALEMBIC_PROXY_URL=http://user:pass@proxy:port alembic serve
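The persona pool and the automatic single retry (see ALEMBIC_BLOCK_RETRY under Configuration) can be pictured roughly as follows. This is a sketch under stated assumptions: the `Persona` fields, `FetchResult`, and `fetch_with_retry` are hypothetical names, the persona values are made up, and the actual fetch (curl_cffi with TLS impersonation) is replaced by an injected callable.

```python
import random
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Persona:
    """One internally consistent fingerprint. Field names are illustrative."""
    os: str
    browser: str
    timezone: str
    language: str


@dataclass
class FetchResult:
    body: str
    blocked: bool
    retried: bool = False


def fetch_with_retry(
    url: str,
    pool: Sequence[Persona],
    fetch: Callable[[str, Persona], FetchResult],
) -> FetchResult:
    """Fetch once; if a block is detected, retry exactly once with a different persona."""
    first = random.choice(list(pool))
    result = fetch(url, first)
    if not result.blocked:
        return result
    others = [p for p in pool if p != first]
    if not others:
        return result  # nothing fresh to try
    second = fetch(url, random.choice(others))
    second.retried = True  # corresponds to the X-Alembic-Retry header
    return second
```

Keeping every field of a persona fixed together is what makes it "correlated": a retry swaps the whole fingerprint rather than one header at a time.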
JSON API Distillation
Pass a JMESPath filter to extract fields from JSON APIs without writing glue code:
# Filter a JSON API response
curl "http://localhost:7077/https://api.example.com/users" \
-H "X-Alembic-JQ: data[*].email"
Invalid expressions return HTTP 400 with {"error": "...", "expression": "..."}.
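The same request can be built from Python with only the standard library. The sketch below constructs the request without sending it and formats the documented HTTP 400 payload; `build_filtered_request` and `describe_filter_error` are hypothetical helper names, and the proxy address assumes the default localhost:7077.

```python
import json
import urllib.request
from typing import Optional

PROXY = "http://localhost:7077"  # default address used throughout this README


def build_filtered_request(target_url: str, expression: str) -> urllib.request.Request:
    """Build (but do not send) a proxy request carrying a JMESPath filter."""
    return urllib.request.Request(
        f"{PROXY}/{target_url}",
        headers={"X-Alembic-JQ": expression},
    )


def describe_filter_error(status: int, body: bytes) -> Optional[str]:
    """Turn the documented HTTP 400 payload into a readable message, or None otherwise."""
    if status != 400:
        return None
    payload = json.loads(body)
    return f"bad expression {payload['expression']!r}: {payload['error']}"
```

When the request is sent with `urllib.request.urlopen`, a 400 response surfaces as `urllib.error.HTTPError`; its `code` and `read()` can be passed straight to `describe_filter_error`.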
Response Headers
Every response carries telemetry headers:
| Header | Value |
|---|---|
| X-Alembic-Strategy | Extraction strategy: trafilatura, llms.txt, llms.txt:excerpt, hydration-*, rss-feed, sitemap, svg-text, code-file, adapter:arxiv, pdf-text, pdf-unsupported, json-passthrough, json-jmespath, og-description, plain-text, fallback |
| X-Alembic-Page-Type | article, recipe, forum, product, api, youtube, unknown |
| X-Alembic-Title | Extracted page title |
| X-Alembic-Author | Extracted author (if available) |
| X-Alembic-Date | Extracted publish date (if available) |
| X-Alembic-Language | Page language as BCP-47 primary subtag (en, fr, de, …). Empty when unknown. |
| X-Alembic-Word-Count | Word count of the clean extracted content |
| X-Alembic-Link-Count | Number of unique links extracted from the page (available in JSON envelope as links[{url,text}]) |
| X-Alembic-JS-Hint | true if the page shows strong JavaScript-rendering signals |
| X-Alembic-JS-Hint-Score | JS hint confidence score 0–10. ≥6 = likely SPA, retry with ?js=true |
| X-Alembic-Cached | true / false |
| X-Alembic-Original-Tokens | Token count before extraction |
| X-Alembic-Clean-Tokens | Token count after extraction |
| X-Alembic-Saved-Pct | Percentage of raw tokens saved (e.g. 93%) |
| X-Alembic-Yield-Pct | Percentage of raw tokens in clean output. Low yield (< 1%) = likely SPA or paywall; use X-Alembic-JS-Hint-Score to decide whether to retry with JS |
| X-Alembic-Quality-Score | 0–100 content quality score. 80+ = clean prose/docs; 45–79 = moderate; 0–19 = challenge page or empty |
| X-Alembic-Blocked | true if a bot-wall interstitial was detected |
| X-Alembic-Blocked-By | Blocker name: cloudflare, datadome, perimeterx, incapsula, kasada, bot_wall, unknown |
| X-Alembic-Retry | 1 if a second persona was tried automatically after first block |
| X-Alembic-Upstream-Status | HTTP status from the upstream server (4xx/5xx only) |
| X-Alembic-Wait-Status | Playwright wait-for-selector outcome |
| X-Alembic-Search-Backend | brave / searxng |
| X-Alembic-Search-Count | Number of search results returned |
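An agent will usually act on only a handful of the telemetry headers listed above. The sketch below shows one way to read them from a plain mapping. The function names are hypothetical, the label for scores 20–44 is invented because the table leaves that band unnamed, and the savings formula is an assumption about how the percentages relate to the two token counts.

```python
from typing import Mapping


def _int_header(headers: Mapping[str, str], name: str) -> int:
    """Read an integer header, treating a missing or empty value as 0."""
    return int(headers.get(name, "0") or 0)


def should_retry_with_js(headers: Mapping[str, str]) -> bool:
    """Apply the documented rule: a JS hint score of 6 or more suggests ?js=true."""
    return _int_header(headers, "X-Alembic-JS-Hint-Score") >= 6


def quality_band(headers: Mapping[str, str]) -> str:
    """Map the 0-100 quality score onto the bands described in the table."""
    score = _int_header(headers, "X-Alembic-Quality-Score")
    if score >= 80:
        return "clean"
    if score >= 45:
        return "moderate"
    if score >= 20:
        return "unlabelled"  # the table does not name the 20-44 band
    return "challenge-or-empty"


def token_savings(headers: Mapping[str, str]) -> tuple[float, float]:
    """Return (saved_pct, yield_pct), assuming saved = 100 - yield."""
    original = _int_header(headers, "X-Alembic-Original-Tokens")
    clean = _int_header(headers, "X-Alembic-Clean-Tokens")
    if original == 0:
        return 0.0, 0.0
    return 100 * (original - clean) / original, 100 * clean / original
```

Lookups here are case-sensitive, so with an HTTP client that lowercases header names the keys would need adjusting.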
Configuration
| Variable | Default | Purpose |
|---|---|---|
| ALEMBIC_PROXY_URL | (none) | Outbound proxy for fetch requests (http://user:pass@host:port or socks5://…) |
| ALEMBIC_STEALTH | 0 | Set to 1 to enable Stage 5 patchright stealth browser |
| ALEMBIC_RATE_LIMIT_RPM | 0 | Per-IP rate limit in requests/minute. 0 = disabled. Exceeded requests get HTTP 429 + Retry-After header. Health endpoint exempt. |
| ALEMBIC_BLOCK_RETRY | 1 | Automatically retry once with a fresh browser persona when a block is detected. Defeats probabilistic ML scoring (Cloudflare BM). Set to 0 to disable. |
| ALEMBIC_SEARXNG_URL | (none) | SearXNG instance URL for web search (e.g. http://localhost:8080). Takes priority over Brave when set. |
| ALEMBIC_SEARCH_BRAVE_API_KEY | (none) | Brave Search API key (2,000 queries/month free) |
| BRAVE_SEARCH_API_KEY | (none) | Alias for the Brave API key |
| ANTHROPIC_API_KEY | (none) | Claude Haiku for search synthesis (?fetch=true) |
| FIRECRAWL_API_KEY | (none) | Firecrawl SaaS JS rendering |
| BROWSERLESS_API_TOKEN | (none) | Browserless SaaS JS rendering |
| GITHUB_TOKEN | (none) | GitHub personal access token → 5,000 req/hr on api.github.com (default: 60/hr unauthenticated) |
| SEMANTIC_SCHOLAR_API_KEY | (none) | Semantic Scholar API key → per-key rate limit on api.semanticscholar.org |
| HF_TOKEN | (none) | HuggingFace token → private model access + higher rate limits on huggingface.co |
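The defaults and the search-backend priority in the table above can be summarised in a few lines of Python. This is a sketch, not Alembic's configuration loader: the `Settings` dataclass and `load_settings` are hypothetical, and which Brave variable wins when both are set is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    proxy_url: Optional[str]
    stealth: bool
    rate_limit_rpm: int
    block_retry: bool
    search_backend: Optional[str]


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the documented variables, applying the defaults from the table."""
    # Assumption: the ALEMBIC_-prefixed key is preferred over its alias.
    brave_key = env.get("ALEMBIC_SEARCH_BRAVE_API_KEY") or env.get("BRAVE_SEARCH_API_KEY")
    if env.get("ALEMBIC_SEARXNG_URL"):
        backend = "searxng"  # documented: takes priority over Brave when set
    elif brave_key:
        backend = "brave"
    else:
        backend = None
    return Settings(
        proxy_url=env.get("ALEMBIC_PROXY_URL") or None,
        stealth=env.get("ALEMBIC_STEALTH", "0") == "1",
        rate_limit_rpm=int(env.get("ALEMBIC_RATE_LIMIT_RPM", "0")),  # 0 disables the limit
        block_retry=env.get("ALEMBIC_BLOCK_RETRY", "1") != "0",
        search_backend=backend,
    )
```

Passing a plain dict instead of `os.environ` makes the behaviour easy to check, for example `load_settings({"ALEMBIC_STEALTH": "1"})`.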
Authentication
When deploying Alembic as a public service, set ALEMBIC_API_KEY to require authentication:
export ALEMBIC_API_KEY=your-secret-key
alembic serve
Clients must then include one of:
curl -H "Authorization: Bearer your-secret-key" http://your-server:7077/https://example.com
curl -H "X-API-Key: your-secret-key" http://your-server:7077/https://example.com
The health endpoint (GET /) is always accessible without auth so you can check service status.
Without ALEMBIC_API_KEY, no auth is required (default — local dev mode).
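The rules above amount to a small decision function. The following is a minimal sketch assuming the server compares the presented key with a constant-time check; `is_authorized` is a hypothetical name, and treating only a bare `GET /` (no query string) as the health endpoint is an assumption.

```python
import hmac
from typing import Mapping, Optional


def is_authorized(
    method: str,
    target: str,
    headers: Mapping[str, str],
    api_key: Optional[str],
) -> bool:
    """Decide whether a request may proceed under the documented auth rules."""
    if not api_key:
        return True  # ALEMBIC_API_KEY unset: local dev mode, no auth
    if method == "GET" and target == "/":
        return True  # health endpoint stays open
    authorization = headers.get("Authorization", "")
    if authorization.startswith("Bearer "):
        presented = authorization[len("Bearer "):]
    else:
        presented = headers.get("X-API-Key", "")
    # Constant-time comparison avoids leaking key prefixes through timing.
    return hmac.compare_digest(presented.encode(), api_key.encode())
```

Either header is accepted, which matches the two curl examples above.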
Docker
# Pull and run
docker compose up -d
# Or build from source
docker build -t alembic-proxy .
docker run -p 7077:7077 -e ALEMBIC_API_KEY=secret alembic-proxy
See docs/DEPLOYMENT.md for Fly.io, Railway, and Render deployment guides.
Health Check
curl -H "Accept: application/json" http://localhost:7077/
# → {"status": "ok", "version": "1.13.0", "cache": "active"}
Plain text health check:
curl http://localhost:7077/
# → Alembic Proxy v1.13.0
CLI Reference
| Command | Purpose |
|---|---|
| alembic <url> | Fetch and print clean content |
| alembic fetch <url> --stats | Fetch with token savings report |
| alembic batch <urls…> | Fetch multiple URLs in parallel |
| alembic search "query" | Web search via Brave / SearXNG |
| alembic search "query" --fetch | Search + distil + synthesise |
| alembic serve | Start the HTTP proxy on localhost:7077 |
| alembic clear | Clear the entire cache |
| alembic clear-url <url> | Evict a single URL from cache |
| alembic vacuum | Remove expired entries, reclaim disk space |
| alembic lifetime | Show lifetime token savings stats |
Key flags for fetch:
| Flag | Effect |
|---|---|
| --format markdown\|json\|text | Output format (default: markdown) |
| --stats | Print token savings report |
| --no-cache | Bypass cache, always refetch |
| --js | Use Playwright for JS rendering |
| --auto-js | Auto-escalate to JS if page is heavily dynamic |
| --saas firecrawl\|browserless | Use cloud rendering |
| -H "Key: Value" | Forward custom header to target site |
| --ls key=value | Inject into browser localStorage |
| --ss key=value | Inject into sessionStorage |
Advanced Usage
Authenticated SPA (localStorage injection)
curl "http://localhost:7077/https://app.example.com/dashboard?js=true" \
-H "X-Alembic-LocalStorage: session_token=eyJ..."
JSON API with JMESPath filtering
curl "http://localhost:7077/https://api.example.com/users" \
-H "Authorization: Bearer token" \
-H "X-Alembic-JQ: data[*].email"
Python API
from src.processor import Processor
from src.config import DEFAULT_CONFIG
import asyncio
processor = Processor(DEFAULT_CONFIG)
result = asyncio.run(processor.process("https://example.com", fmt="markdown"))
print(result.content) # clean Markdown
print(result.strategy) # trafilatura / llms.txt / hydration / etc.
print(result.original_tokens) # token count before
print(result.clean_tokens) # token count after
print(result.page_type) # article / recipe / forum / product / unknown
print(result.author) # extracted author (if available)
print(result.publish_date) # extracted date (if available)
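Because `process` is a coroutine, several URLs can be fetched concurrently from Python, much as `alembic batch` does from the CLI. The helper below is a sketch and not part of Alembic: `process_many` and its `limit` parameter are hypothetical, and the per-URL coroutine is passed in so the example does not depend on the library's internals.

```python
import asyncio
from typing import Awaitable, Callable, Sequence, TypeVar

T = TypeVar("T")


async def process_many(
    urls: Sequence[str],
    process: Callable[[str], Awaitable[T]],
    limit: int = 4,
) -> list[T]:
    """Run `process` over all URLs with at most `limit` in flight; results keep input order."""
    semaphore = asyncio.Semaphore(limit)

    async def one(url: str) -> T:
        async with semaphore:
            return await process(url)

    return await asyncio.gather(*(one(url) for url in urls))
```

With the `processor` object from the example above, this would be called as `asyncio.run(process_many(urls, lambda u: processor.process(u, fmt="markdown")))`.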
Development
make test # unit tests (685 passing)
make install-daemon # install as launchd service (macOS)
# Run tests directly
pytest tests/ -q
# Expected: 685+ passing, 0 failures
# Live integration tests (30 URLs, real proxy)
pytest tests/integration/ -q
Documentation
| File | Contents |
|---|---|
llms.txt |
AI-to-AI quick reference — comprehensive, machine-readable |
docs/API.md |
Complete CLI, proxy, and Python API reference |
docs/ARCHITECTURE.md |
System design, data flow, module map |
docs/GUIDE.md |
Integration patterns and practical recipes |
docs/CHANGELOG.md |
Version history |
License
MIT. See LICENSE for the full text.