Alembic

Distil any webpage into clean Markdown for LLM pipelines — 84–98% token reduction.

Alembic is a local HTTP proxy and CLI that sits between your agent and the open web. It fetches a URL, strips navigation, ads, scripts, and boilerplate through a multi-stage extraction cascade, and returns clean, LLM-ready Markdown, typically at 84–98% token reduction. It also rewrites registry and documentation URLs to their LLM-optimised equivalents, extracts PDF text, and searches the web via Brave Search or a SearXNG instance. Everything runs locally; no API keys are required for basic use.

Named for the alchemical distillation apparatus — we turn raw web pages into the pure essence an agent needs.


Install

pip install alembic-proxy

Or from source:

git clone https://github.com/InunuNet/Alembic.git
cd Alembic
pip install -e .

Quick Start

# Start the proxy (recommended for agent workflows)
alembic serve

# Distil a URL — returns clean Markdown
curl http://localhost:7077/https://example.com

# Search + distil + synthesise
curl "http://localhost:7077/?q=python+async+patterns&fetch=true"

# JSON response with full metadata
curl -H "Accept: application/json" http://localhost:7077/https://example.com

# CLI fetch with token savings report
alembic fetch https://example.com --stats
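
For agents written in Python, the same requests can be built with the standard library. The sketch below is a minimal client, assuming the proxy listens on the default localhost:7077 shown above. The helper names (distil_url, search_url, fetch_markdown) are hypothetical and not part of Alembic, and decoding the body as UTF-8 is an assumption.

```python
import urllib.parse
import urllib.request

BASE = "http://localhost:7077"  # default address used by `alembic serve`


def distil_url(target: str, base: str = BASE) -> str:
    """Proxy URL for distilling a page: the target URL is appended to the base as-is."""
    return f"{base}/{target}"


def search_url(query: str, fetch: bool = False, base: str = BASE) -> str:
    """Proxy URL for a search; fetch=True also requests distil + synthesis."""
    params = {"q": query}
    if fetch:
        params["fetch"] = "true"
    return f"{base}/?{urllib.parse.urlencode(params)}"


def fetch_markdown(target: str, as_json: bool = False, timeout: float = 30.0) -> str:
    """Fetch a distilled page through a running proxy and return the body as text."""
    headers = {"Accept": "application/json"} if as_json else {}
    req = urllib.request.Request(distil_url(target), headers=headers)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")  # assumption: UTF-8 body
```

Only fetch_markdown touches the network; the two URL builders are pure functions and can be checked without a running proxy.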

Features

URL Normalization

Before fetching, Alembic rewrites certain "human-readable" URLs to their LLM-optimised equivalents — documentation hosts and structured APIs instead of noisy HTML pages:

| From | To | Quality gain |
| --- | --- | --- |
| arxiv.org/pdf/{id} | arxiv.org/abs/{id} | PDF → clean abstract (quality 0→85) |
| github.com/{owner}/{repo}/blob/{branch}/{file} | raw.githubusercontent.com/... | HTML → raw file (quality 35→84) |
| hex.pm/packages/{name} | hexdocs.pm/{name}/ | Version list → docs (quality 30→88) |
| rubygems.org/gems/{name} | rubydoc.info/gems/{name}/ | Version list → docs (quality 35→93) |
| crates.io/crates/{name} | docs.rs/{name}/ | og-description → API docs (quality 85→100) |
| formulae.brew.sh/formula/{name} | formulae.brew.sh/api/formula/{name}.json | HTML → JSON API (quality 52→85) |
| formulae.brew.sh/cask/{name} | formulae.brew.sh/api/cask/{name}.json | HTML → JSON API |
| npmjs.com/package/{name} | registry.npmjs.org/{name}/latest | HTML → JSON API (quality 35→85); supports @scope/pkg |
| opam.ocaml.org/packages/{name} | ocaml.org/p/{name}/latest | HTML → docs (quality 33→88) |
| mvnrepository.com/artifact/{G}/{A} | search.maven.org/artifact/{G}/{A} | HTML → Maven Central (quality 12→70) |
| gopkg.in/{package} | pkg.go.dev/gopkg.in/{package} | install page → full API docs (quality 64→100) |
| swiftpackageindex.com/{owner}/{name} | github.com/{owner}/{name} | SPI page → GitHub llms.txt (quality 55→100) |
| lib.rs/crates/{name} | docs.rs/{name}/ | browser page → full API docs (quality 35→100) |
| cran.r-project.org/package={name} | rdocumentation.org/packages/{name} | link-heavy → clean R docs (quality 55→100) |
| clojars.org/{group}/{artifact} | clojars.org/api/artifacts/{group}/{artifact} | HTML → JSON API (quality 69→85) |
| pypi.org/project/{name} | pypi.org/pypi/{name}/json | HTML → JSON API (quality 81→85; 46 → 6K-160K words) |
| registry.terraform.io/providers/{ns}/{type} | registry.terraform.io/v1/providers/{ns}/{type} | HTML → JSON API (10 → 43K words) |
| registry.terraform.io/modules/{ns}/{name}/{provider} | registry.terraform.io/v1/modules/{ns}/{name}/{provider} | HTML → JSON API (10 → 52K words) |
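
To make the rewrite idea concrete, here is a rough sketch of how a few of these rows could be expressed as pattern/template pairs. The RULES list, the regexes, and the normalize function are illustrative assumptions covering only four rows; they are not Alembic's actual rule set.

```python
import re

# Hypothetical rule table: (compiled pattern, replacement template).
# The regexes and their order are assumptions made for this sketch.
RULES = [
    (re.compile(r"^https?://(?:www\.)?arxiv\.org/pdf/([^/?#]+)/?$"),
     r"https://arxiv.org/abs/\1"),
    (re.compile(r"^https?://crates\.io/crates/([^/?#]+)/?$"),
     r"https://docs.rs/\1/"),
    (re.compile(r"^https?://pypi\.org/project/([^/?#]+)/?$"),
     r"https://pypi.org/pypi/\1/json"),
    # The optional "@scope/" prefix keeps scoped npm packages in one capture group.
    (re.compile(r"^https?://(?:www\.)?npmjs\.com/package/((?:@[^/?#]+/)?[^/?#]+)/?$"),
     r"https://registry.npmjs.org/\1/latest"),
]


def normalize(url: str) -> str:
    """Return the rewritten URL for the first matching rule, else the URL unchanged."""
    for pattern, template in RULES:
        match = pattern.match(url)
        if match:
            return match.expand(template)
    return url
```

URLs that match no rule pass through untouched, so the rewrite step is safe to run on every request.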

Extraction Cascade

Every URL goes through a cascade. Alembic stops at the first stage that produces clean content:

| Stage | Strategy | What it handles |
| --- | --- | --- |
| Pre | URL normalization | Registry/doc URL rewrites (see table above) |
| Pre | arXiv abstract adapter | arxiv.org/abs/{id} → title + authors + abstract via lxml (quality 85) |
| Pre | PDF extraction | application/pdf → text via pypdf (5MB limit, encrypted/scanned = fallback) |
| 0a | Sitemap adapter | XML sitemap / sitemap index → clean URL list |
| 0b | RSS/Atom adapter | Feeds → structured Markdown with title + items |
| 0c | Page-type adapters | Recipes, forums (Lobste.rs, Reddit, HN, SE), products |
| 0d | SVG adapter | image/svg+xml → title + desc + text nodes |
| 0e | Code-file detection | text/plain with .py/.ts/.go/.rs/.yaml/.json ext → fenced code block |
| 1 | llms.txt discovery | Sites with pre-built LLM index (+ URL-targeted excerpt; falls through if < 25 quality or < 50 words) |
| 1.5 | Hydration extraction | Next.js __NEXT_DATA__, Nuxt 3, Remix: SSR state without Playwright |
| 1.8 | JSON-LD articleBody | Articles, HowTo, FAQPage, Event, Course embedded in structured data |
| 2 | Content negotiation | Servers that return text/markdown natively |
| 3 | Trafilatura | Production article extractor; handles most pages |
| 4 | Readability | Mozilla's DOM scoring; unusual layouts |
| 5 | FitCleaner | Heuristic block scoring; dev docs and engineering blogs |
| 6 | og:description fallback | When thin extraction < 50 words + og:description ≥ 30 chars |
| 7 | Fallback | Basic tag stripping; always succeeds |

A strategy of llms.txt is the best possible result. A strategy of fallback or og-description is a yellow flag that usually points to a JS-heavy SPA or a paywall; check X-Alembic-JS-Hint-Score.
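
The first-success pattern behind the cascade can be sketched in a few lines. This is a simplified illustration, not Alembic's code: the Extraction type, the run_cascade and strip_tags helpers, and the single 50-word acceptance threshold applied to every stage are all assumptions made for the example.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional

# An extractor returns cleaned text, or None when it cannot handle the page.
Extractor = Callable[[str], Optional[str]]


@dataclass
class Extraction:
    strategy: str
    content: str


def strip_tags(html: str) -> str:
    """Last-resort cleaner: drop anything that looks like a tag, collapse whitespace."""
    return " ".join(re.sub(r"<[^>]+>", " ", html).split())


def run_cascade(html: str, stages: list[tuple[str, Extractor]],
                min_words: int = 50) -> Extraction:
    """Try each stage in order and stop at the first one with enough content."""
    for name, extractor in stages:
        text = extractor(html)
        if text is not None and len(text.split()) >= min_words:
            return Extraction(name, text)
    # No stage produced enough text, so fall back to basic tag stripping.
    return Extraction("fallback", strip_tags(html))
```

Because the final step never fails, callers always get some content back and can use the reported strategy to judge how much to trust it.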

Bot Protection Bypass

Alembic ships a pool of 50 correlated synthetic browser personas — each with a consistent OS, browser version, screen, GPU, timezone, and language fingerprint. The curl_cffi fetch stage uses TLS impersonation matched to the persona's browser version to defeat Cloudflare Bot Management at the JA3/JA4 layer.

Stage 5 here refers to an optional bot-bypass stage, not to stage 5 (FitCleaner) of the extraction cascade. It adds patchright (stealth Playwright) plus a residential proxy for DataDome and Akamai Bot Manager.

| Site | Protection | Result |
| --- | --- | --- |
| AllRecipes | Cloudflare | Passes via curl_cffi |
| Reuters | Cloudflare | Passes via curl_cffi |
| Leboncoin | DataDome | Blocked (Stage 5 target) |
| Glassdoor | Cloudflare Bot Management | Blocked (Stage 5 target) |

Enable Stage 5:

ALEMBIC_STEALTH=1 ALEMBIC_PROXY_URL=http://user:pass@proxy:port alembic serve

JSON API Distillation

Pass a JMESPath filter to extract fields from JSON APIs without writing glue code:

# Filter a JSON API response
curl "http://localhost:7077/https://api.example.com/users" \
  -H "X-Alembic-JQ: data[*].email"

Invalid expressions return HTTP 400 with {"error": "...", "expression": "..."}.
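
From Python, the same filter can be sent as a request header. The following is a minimal sketch using only the standard library; build_filtered_request and fetch_filtered are hypothetical helper names, and the sketch assumes the proxy runs on the default localhost:7077 and that the HTTP 400 body is the JSON object described above.

```python
import json
import urllib.error
import urllib.request

PROXY = "http://localhost:7077"  # default proxy address


def build_filtered_request(api_url: str, expression: str) -> urllib.request.Request:
    """Build a proxy request that carries a JMESPath filter in X-Alembic-JQ."""
    return urllib.request.Request(
        f"{PROXY}/{api_url}",
        headers={"X-Alembic-JQ": expression},
    )


def fetch_filtered(api_url: str, expression: str, timeout: float = 30.0) -> str:
    """Send the request; turn an HTTP 400 into a ValueError naming the bad expression."""
    req = build_filtered_request(api_url, expression)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        if err.code == 400:
            detail = json.loads(err.read().decode("utf-8"))
            raise ValueError(
                f"invalid expression {detail.get('expression')!r}: {detail.get('error')}"
            ) from err
        raise
```

Surfacing the 400 response as a distinct exception lets an agent correct the expression instead of treating it as a fetch failure.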

Response Headers

Every response carries telemetry headers:

| Header | Value |
| --- | --- |
| X-Alembic-Strategy | Extraction strategy: trafilatura, llms.txt, llms.txt:excerpt, hydration-*, rss-feed, sitemap, svg-text, code-file, adapter:arxiv, pdf-text, pdf-unsupported, json-passthrough, json-jmespath, og-description, plain-text, fallback |
| X-Alembic-Page-Type | article, recipe, forum, product, api, youtube, unknown |
| X-Alembic-Title | Extracted page title |
| X-Alembic-Author | Extracted author (if available) |
| X-Alembic-Date | Extracted publish date (if available) |
| X-Alembic-Language | Page language as BCP-47 primary subtag (en, fr, de, …). Empty when unknown. |
| X-Alembic-Word-Count | Word count of the clean extracted content |
| X-Alembic-Link-Count | Number of unique links extracted from the page (available in JSON envelope as links[{url,text}]) |
| X-Alembic-JS-Hint | true if the page shows strong JavaScript-rendering signals |
| X-Alembic-JS-Hint-Score | JS hint confidence score 0–10. ≥6 = likely SPA, retry with ?js=true |
| X-Alembic-Cached | true / false |
| X-Alembic-Original-Tokens | Token count before extraction |
| X-Alembic-Clean-Tokens | Token count after extraction |
| X-Alembic-Saved-Pct | Percentage of raw tokens saved (e.g. 93%) |
| X-Alembic-Yield-Pct | Percentage of raw tokens in clean output. Low yield (< 1%) = likely SPA or paywall; use X-Alembic-JS-Hint-Score to decide whether to retry with JS |
| X-Alembic-Quality-Score | 0–100 content quality score. 80+ = clean prose/docs; 45–79 = moderate; 0–19 = challenge page or empty |
| X-Alembic-Blocked | true if a bot-wall interstitial was detected |
| X-Alembic-Blocked-By | Blocker name: cloudflare, datadome, perimeterx, incapsula, kasada, bot_wall, unknown |
| X-Alembic-Retry | 1 if a second persona was tried automatically after first block |
| X-Alembic-Upstream-Status | HTTP status from the upstream server (4xx/5xx only) |
| X-Alembic-Wait-Status | Playwright wait-for-selector outcome |
| X-Alembic-Search-Backend | brave / searxng |
| X-Alembic-Search-Count | Number of search results returned |
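
An agent can use these headers to decide what to do with a response. The sketch below is one possible triage routine, not part of Alembic: the triage function and its return labels are hypothetical, and it assumes header values arrive as the plain strings shown in the table (for example "true" and "7").

```python
def triage(headers: dict[str, str]) -> str:
    """Classify a response as 'blocked', 'retry-js', 'inspect', or 'ok'."""
    # HTTP header names are case-insensitive, so normalise the keys first.
    h = {k.lower(): v.strip() for k, v in headers.items()}

    if h.get("x-alembic-blocked", "").lower() == "true":
        return "blocked"

    try:
        js_score = float(h.get("x-alembic-js-hint-score", "0") or "0")
    except ValueError:
        js_score = 0.0  # assumption: treat an unparsable score as no signal
    if js_score >= 6:
        return "retry-js"  # likely SPA: retry the same URL with ?js=true

    if h.get("x-alembic-strategy", "") in {"fallback", "og-description"}:
        return "inspect"  # thin extraction; content may be incomplete

    return "ok"
```

The ordering is a design choice for this example: a detected block takes priority, then the JS hint, then the weak-strategy check.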

Configuration

| Variable | Default | Purpose |
| --- | --- | --- |
| ALEMBIC_PROXY_URL | | Outbound proxy for fetch requests (http://user:pass@host:port or socks5://…) |
| ALEMBIC_STEALTH | 0 | Set to 1 to enable Stage 5 patchright stealth browser |
| ALEMBIC_RATE_LIMIT_RPM | 0 | Per-IP rate limit in requests/minute. 0 = disabled. Exceeded requests get HTTP 429 + Retry-After header. Health endpoint exempt. |
| ALEMBIC_BLOCK_RETRY | 1 | Automatically retry once with a fresh browser persona when a block is detected. Defeats probabilistic ML scoring (Cloudflare BM). Set to 0 to disable. |
| ALEMBIC_SEARXNG_URL | | SearXNG instance URL for web search (e.g. http://localhost:8080). Takes priority over Brave when set. |
| ALEMBIC_SEARCH_BRAVE_API_KEY | | Brave Search API key (2,000 queries/month free) |
| BRAVE_SEARCH_API_KEY | | Alias for the Brave API key |
| ANTHROPIC_API_KEY | | Claude Haiku for search synthesis (?fetch=true) |
| FIRECRAWL_API_KEY | | Firecrawl SaaS JS rendering |
| BROWSERLESS_API_TOKEN | | Browserless SaaS JS rendering |
| GITHUB_TOKEN | | GitHub personal access token → 5,000 req/hr on api.github.com (default: 60/hr unauthenticated) |
| SEMANTIC_SCHOLAR_API_KEY | | Semantic Scholar API key → per-key rate limit on api.semanticscholar.org |
| HF_TOKEN | | HuggingFace token → private model access + higher rate limits on huggingface.co |

Authentication

When deploying Alembic as a public service, set ALEMBIC_API_KEY to require authentication:

export ALEMBIC_API_KEY=your-secret-key
alembic serve

Clients must then include one of:

curl -H "Authorization: Bearer your-secret-key" http://your-server:7077/https://example.com
curl -H "X-API-Key: your-secret-key" http://your-server:7077/https://example.com

The health endpoint (GET /) is always accessible without auth so you can check service status. Without ALEMBIC_API_KEY, no auth is required (default — local dev mode).
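
A Python client can attach either header in the same way. This is a small sketch with a hypothetical helper (authed_request); reading the key from the ALEMBIC_API_KEY environment variable on the client side is an assumption made for convenience, not documented behaviour.

```python
import os
import urllib.request
from typing import Optional


def authed_request(target_url: str, base: str = "http://localhost:7077",
                   api_key: Optional[str] = None,
                   use_bearer: bool = True) -> urllib.request.Request:
    """Build a proxy request, adding an auth header only when a key is available."""
    # Assumption: fall back to the client's own ALEMBIC_API_KEY if no key is passed.
    key = api_key if api_key is not None else os.environ.get("ALEMBIC_API_KEY")
    headers = {}
    if key:
        if use_bearer:
            headers["Authorization"] = f"Bearer {key}"
        else:
            headers["X-API-Key"] = key
    return urllib.request.Request(f"{base}/{target_url}", headers=headers)
```

With no key configured the request carries no auth header, which matches the default local dev mode.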


Docker

# Pull and run
docker compose up -d

# Or build from source
docker build -t alembic-proxy .
docker run -p 7077:7077 -e ALEMBIC_API_KEY=secret alembic-proxy

See docs/DEPLOYMENT.md for Fly.io, Railway, and Render deployment guides.


Health Check

curl -H "Accept: application/json" http://localhost:7077/

# → {"status": "ok", "version": "1.13.0", "cache": "active"}

Plain text health check:

curl http://localhost:7077/
# → Alembic Proxy v1.13.0

CLI Reference

| Command | Purpose |
| --- | --- |
| `alembic <url>` | Fetch and print clean content |
| `alembic fetch <url> --stats` | Fetch with token savings report |
| `alembic batch <urls…>` | Fetch multiple URLs in parallel |
| `alembic search "query"` | Web search via Brave / SearXNG |
| `alembic search "query" --fetch` | Search + distil + synthesise |
| `alembic serve` | Start the HTTP proxy on localhost:7077 |
| `alembic clear` | Clear the entire cache |
| `alembic clear-url <url>` | Evict a single URL from cache |
| `alembic vacuum` | Remove expired entries, reclaim disk space |
| `alembic lifetime` | Show lifetime token savings stats |

Key flags for fetch:

| Flag | Effect |
| --- | --- |
| `--format markdown\|json\|text` | Output format (default: markdown) |
| `--stats` | Print token savings report |
| `--no-cache` | Bypass cache, always refetch |
| `--js` | Use Playwright for JS rendering |
| `--auto-js` | Auto-escalate to JS if page is heavily dynamic |
| `--saas firecrawl\|browserless` | Use cloud rendering |
| `-H "Key: Value"` | Forward custom header to target site |
| `--ls key=value` | Inject into browser localStorage |
| `--ss key=value` | Inject into sessionStorage |

Advanced Usage

Authenticated SPA (localStorage injection)

curl "http://localhost:7077/https://app.example.com/dashboard?js=true" \
  -H "X-Alembic-LocalStorage: session_token=eyJ..."

JSON API with JMESPath filtering

curl "http://localhost:7077/https://api.example.com/users" \
  -H "Authorization: Bearer token" \
  -H "X-Alembic-JQ: data[*].email"

Python API

from src.processor import Processor
from src.config import DEFAULT_CONFIG
import asyncio

processor = Processor(DEFAULT_CONFIG)
result = asyncio.run(processor.process("https://example.com", fmt="markdown"))

print(result.content)          # clean Markdown
print(result.strategy)         # trafilatura / llms.txt / hydration / etc.
print(result.original_tokens)  # token count before
print(result.clean_tokens)     # token count after
print(result.page_type)        # article / recipe / forum / product / unknown
print(result.author)           # extracted author (if available)
print(result.publish_date)     # extracted date (if available)
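
The two token counts are enough to reproduce the savings figures reported in X-Alembic-Saved-Pct and X-Alembic-Yield-Pct. The helper below is a worked sketch of that arithmetic; token_savings is a hypothetical function, and how Alembic itself rounds the values is not specified here.

```python
def token_savings(original_tokens: int, clean_tokens: int) -> tuple[float, float]:
    """Return (saved_pct, yield_pct) for a page.

    yield_pct is the share of raw tokens that survive extraction;
    saved_pct is the remainder, so the two always sum to 100.
    """
    if original_tokens <= 0:
        return 0.0, 0.0  # nothing to compare against
    yield_pct = 100.0 * clean_tokens / original_tokens
    return 100.0 - yield_pct, yield_pct


# Example: a 12,000-token page distilled to 840 tokens.
saved, kept = token_savings(12_000, 840)
print(f"saved {saved:.0f}%, yield {kept:.0f}%")  # saved 93%, yield 7%
```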

Development

make test          # unit tests (685 passing)
make install-daemon  # install as launchd service (macOS)
# Run tests directly
pytest tests/ -q
# Expected: 685+ passing, 0 failures

# Live integration tests (30 URLs, real proxy)
pytest tests/integration/ -q

Documentation

| File | Contents |
| --- | --- |
| llms.txt | AI-to-AI quick reference: comprehensive, machine-readable |
| docs/API.md | Complete CLI, proxy, and Python API reference |
| docs/ARCHITECTURE.md | System design, data flow, module map |
| docs/GUIDE.md | Integration patterns and practical recipes |
| docs/CHANGELOG.md | Version history |

License

MIT. See LICENSE for the full text.
