Distil any webpage into clean Markdown for LLM pipelines — 84–98% token reduction.

Maintainers

InunuNet

Alembic

Alembic is a local HTTP proxy and CLI that sits between your agent and the open web. It fetches a URL, strips navigation, ads, scripts, and boilerplate through a multi-stage extraction cascade, and returns clean, LLM-ready Markdown, typically with 84–98% fewer tokens than the raw page. It also rewrites registry and documentation URLs to their LLM-optimised equivalents, extracts PDF text, and searches the web via Brave Search or a SearXNG instance. Everything runs locally; no API keys are required for basic use.

Named for the alchemical distillation apparatus — we turn raw web pages into the pure essence an agent needs.

Install

pip install alembic-proxy

Or from source:

git clone https://github.com/InunuNet/Alembic.git
cd Alembic
pip install -e .

Quick Start

# Start the proxy (recommended for agent workflows)
alembic serve

# Distil a URL — returns clean Markdown
curl http://localhost:7077/https://example.com

# Search + distil + synthesise
curl "http://localhost:7077/?q=python+async+patterns&fetch=true"

# JSON response with full metadata
curl -H "Accept: application/json" http://localhost:7077/https://example.com

# CLI fetch with token savings report
alembic fetch https://example.com --stats
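
The same proxy calls can be made from Python. The following is a minimal sketch using only the standard library; it assumes the proxy is running on `localhost:7077` as in the curl examples, and the helper names `proxy_url` and `distil` are illustrative, not part of Alembic.

```python
from urllib.request import Request, urlopen

PROXY = "http://localhost:7077"  # default address used in the Quick Start


def proxy_url(target: str, proxy: str = PROXY) -> str:
    """Append the target URL to the proxy root, mirroring the curl examples."""
    return f"{proxy.rstrip('/')}/{target}"


def distil(target: str, as_json: bool = False, timeout: float = 30.0) -> str:
    """Fetch a page through the proxy and return the response body as text.

    With as_json=True the Accept header asks for the JSON envelope instead
    of plain Markdown.
    """
    headers = {"Accept": "application/json"} if as_json else {}
    request = Request(proxy_url(target), headers=headers)
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

With `alembic serve` running, `distil("https://example.com")` should return the same Markdown as the first curl command.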

Features

URL Normalization

Before fetching, Alembic rewrites certain "human-readable" URLs to their LLM-optimised equivalents — documentation hosts and structured APIs instead of noisy HTML pages:

From	To	Quality gain
`arxiv.org/pdf/{id}`	`arxiv.org/abs/{id}`	PDF → clean abstract (quality 0→85)
`github.com/{owner}/{repo}/blob/{branch}/{file}`	`raw.githubusercontent.com/...`	HTML → raw file (quality 35→84)
`hex.pm/packages/{name}`	`hexdocs.pm/{name}/`	Version list → docs (quality 30→88)
`rubygems.org/gems/{name}`	`rubydoc.info/gems/{name}/`	Version list → docs (quality 35→93)
`crates.io/crates/{name}`	`docs.rs/{name}/`	og-description → API docs (quality 85→100)
`formulae.brew.sh/formula/{name}`	`formulae.brew.sh/api/formula/{name}.json`	HTML → JSON API (quality 52→85)
`formulae.brew.sh/cask/{name}`	`formulae.brew.sh/api/cask/{name}.json`	HTML → JSON API
`npmjs.com/package/{name}`	`registry.npmjs.org/{name}/latest`	HTML → JSON API (quality 35→85); supports `@scope/pkg`
`opam.ocaml.org/packages/{name}`	`ocaml.org/p/{name}/latest`	HTML → docs (quality 33→88)
`mvnrepository.com/artifact/{G}/{A}`	`search.maven.org/artifact/{G}/{A}`	HTML → Maven Central (quality 12→70)
`gopkg.in/{package}`	`pkg.go.dev/gopkg.in/{package}`	install page → full API docs (quality 64→100)
`swiftpackageindex.com/{owner}/{name}`	`github.com/{owner}/{name}`	SPI page → GitHub llms.txt (quality 55→100)
`lib.rs/crates/{name}`	`docs.rs/{name}/`	browser page → full API docs (quality 35→100)
`cran.r-project.org/package={name}`	`rdocumentation.org/packages/{name}`	link-heavy → clean R docs (quality 55→100)
`clojars.org/{group}/{artifact}`	`clojars.org/api/artifacts/{group}/{artifact}`	HTML → JSON API (quality 69→85)
`pypi.org/project/{name}`	`pypi.org/pypi/{name}/json`	HTML → JSON API (quality 81→85; 46 → 6K-160K words)
`registry.terraform.io/providers/{ns}/{type}`	`registry.terraform.io/v1/providers/{ns}/{type}`	HTML → JSON API (10 → 43K words)
`registry.terraform.io/modules/{ns}/{name}/{provider}`	`registry.terraform.io/v1/modules/{ns}/{name}/{provider}`	HTML → JSON API (10 → 52K words)
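
To make the rewrite step concrete, here is a small sketch of pattern-based normalization covering four rows of the table. It is not Alembic's implementation: the regular expressions, the `RULES` list, and the `normalize` function are assumptions made for illustration, and anything after the package name or paper id in the path is simply dropped.

```python
import re

# (pattern, replacement template) pairs mirroring four rows of the table.
RULES = [
    (re.compile(r"^https?://(?:www\.)?arxiv\.org/pdf/(?P<id>[^/?#]+)"),
     r"https://arxiv.org/abs/\g<id>"),
    (re.compile(r"^https?://(?:www\.)?crates\.io/crates/(?P<name>[^/?#]+)"),
     r"https://docs.rs/\g<name>/"),
    (re.compile(r"^https?://(?:www\.)?pypi\.org/project/(?P<name>[^/?#]+)"),
     r"https://pypi.org/pypi/\g<name>/json"),
    # The optional "@scope/" prefix keeps scoped npm packages intact.
    (re.compile(r"^https?://(?:www\.)?npmjs\.com/package/"
                r"(?P<name>(?:@[^/?#]+/)?[^/?#]+)"),
     r"https://registry.npmjs.org/\g<name>/latest"),
]


def normalize(url: str) -> str:
    """Return the rewritten URL for the first matching rule, else the input."""
    for pattern, template in RULES:
        match = pattern.match(url)
        if match:
            return match.expand(template)
    return url
```

URLs that match no rule pass through unchanged, so the function is safe to call on every request.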

Extraction Cascade

Every URL goes through a cascade. Alembic stops at the first stage that produces clean content:

Stage	Strategy	What it handles
Pre	URL normalization	Registry/doc URL rewrites (see table above)
Pre	arXiv abstract adapter	`arxiv.org/abs/{id}` → title + authors + abstract via lxml (quality 85)
Pre	PDF extraction	`application/pdf` → text via pypdf (5MB limit, encrypted/scanned = fallback)
0a	Sitemap adapter	XML sitemap / sitemap index → clean URL list
0b	RSS/Atom adapter	Feeds → structured Markdown with title + items
0c	Page-type adapters	Recipes, forums (Lobste.rs, Reddit, HN, SE), products
0d	SVG adapter	`image/svg+xml` → title + desc + text nodes
0e	Code-file detection	`text/plain` with `.py/.ts/.go/.rs/.yaml/.json` ext → fenced code block
1	`llms.txt` discovery	Sites with pre-built LLM index (+ URL-targeted excerpt; falls through if < 25 quality or < 50 words)
1.5	Hydration extraction	Next.js `__NEXT_DATA__`, Nuxt 3, Remix — SSR state without Playwright
1.8	JSON-LD `articleBody`	Articles, HowTo, FAQPage, Event, Course embedded in structured data
2	Content negotiation	Servers that return `text/markdown` natively
3	Trafilatura	Production article extractor — handles most pages
4	Readability	Mozilla's DOM scoring — unusual layouts
5	FitCleaner	Heuristic block scoring — dev docs and engineering blogs
6	og:description fallback	When thin extraction < 50 words + og:description ≥ 30 chars
7	Fallback	Basic tag stripping — always succeeds

Reading the reported strategy: `llms.txt` is the best possible result. `fallback` or `og-description` is a yellow flag that usually points to a JS-heavy SPA or a paywall; check `X-Alembic-JS-Hint-Score` to decide whether to retry with JS rendering.
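
The first-success behaviour of the cascade can be sketched as a loop over extractors. This is an illustration under stated assumptions, not the project's code: "clean content" is approximated here as at least 50 words (the threshold the table quotes for thin extraction), and `run_cascade`, `strip_tags`, and the stage callables are hypothetical names.

```python
import re
from typing import Callable, Optional

Extractor = Callable[[str], Optional[str]]

MIN_WORDS = 50  # assumed cleanliness threshold, borrowed from the stage 6 rule


def strip_tags(html: str) -> str:
    """Basic tag stripping: the last-resort stage that always succeeds."""
    return " ".join(re.sub(r"<[^>]+>", " ", html).split())


def run_cascade(html: str, stages: list[tuple[str, Extractor]]) -> tuple[str, str]:
    """Try each (name, extractor) in order and stop at the first usable result."""
    for name, extract in stages:
        content = extract(html)
        if content and len(content.split()) >= MIN_WORDS:
            return name, content
    # No stage produced enough text, so fall back to plain tag stripping.
    return "fallback", strip_tags(html)
```

The returned name plays the role of the strategy label, so a caller can tell a strong extraction from a `fallback` result.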

Bot Protection Bypass

Alembic ships a pool of 50 correlated synthetic browser personas — each with a consistent OS, browser version, screen, GPU, timezone, and language fingerprint. The curl_cffi fetch stage uses TLS impersonation matched to the persona's browser version to defeat Cloudflare Bot Management at the JA3/JA4 layer.

Fetch Stage 5 (optional; unrelated to stage 5 of the extraction cascade) adds patchright (stealth Playwright) plus a residential proxy for DataDome and Akamai Bot Manager.

Site	Protection	Result
AllRecipes	Cloudflare	Passes via curl_cffi
Reuters	Cloudflare	Passes via curl_cffi
Leboncoin	DataDome	Blocked (Stage 5 target)
Glassdoor	Cloudflare Bot Management	Blocked (Stage 5 target)

Enable Stage 5:

ALEMBIC_STEALTH=1 ALEMBIC_PROXY_URL=http://user:pass@proxy:port alembic serve

JSON API Distillation

Pass a JMESPath filter in the `X-Alembic-JQ` request header to extract fields from JSON APIs without writing glue code:

# Filter a JSON API response
curl "http://localhost:7077/https://api.example.com/users" \
  -H "X-Alembic-JQ: data[*].email"

Invalid expressions return HTTP 400 with {"error": "...", "expression": "..."}.
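
For readers new to JMESPath, the expression `data[*].email` is a wildcard projection: it collects the `email` field of every element in the `data` array and drops elements where the field is missing. A pure-Python equivalent is sketched below; `project_field` is a hypothetical helper written for this explanation and does not call Alembic or the `jmespath` library.

```python
from typing import Any, Optional


def project_field(doc: Any, list_key: str, field: str) -> Optional[list]:
    """Emulate the JMESPath expression `<list_key>[*].<field>`."""
    items = doc.get(list_key) if isinstance(doc, dict) else None
    if not isinstance(items, list):
        return None  # a list projection over a non-list evaluates to null
    result = []
    for item in items:
        value = item.get(field) if isinstance(item, dict) else None
        if value is not None:  # null results are omitted from a projection
            result.append(value)
    return result
```

Calling `project_field(payload, "data", "email")` on an API response shaped like `{"data": [{"email": ...}, ...]}` yields the same list the filter above would return.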

Response Headers

Every response carries telemetry headers:

Header	Value
`X-Alembic-Strategy`	Extraction strategy: `trafilatura`, `llms.txt`, `llms.txt:excerpt`, `hydration-*`, `rss-feed`, `sitemap`, `svg-text`, `code-file`, `adapter:arxiv`, `pdf-text`, `pdf-unsupported`, `json-passthrough`, `json-jmespath`, `og-description`, `plain-text`, `fallback`
`X-Alembic-Page-Type`	`article`, `recipe`, `forum`, `product`, `api`, `youtube`, `unknown`
`X-Alembic-Title`	Extracted page title
`X-Alembic-Author`	Extracted author (if available)
`X-Alembic-Date`	Extracted publish date (if available)
`X-Alembic-Language`	Page language as BCP-47 primary subtag (`en`, `fr`, `de`, …). Empty when unknown.
`X-Alembic-Word-Count`	Word count of the clean extracted content
`X-Alembic-Link-Count`	Number of unique links extracted from the page (available in JSON envelope as `links[{url,text}]`)
`X-Alembic-JS-Hint`	`true` if the page shows strong JavaScript-rendering signals
`X-Alembic-JS-Hint-Score`	JS hint confidence score 0–10. ≥6 = likely SPA, retry with `?js=true`
`X-Alembic-Cached`	`true` / `false`
`X-Alembic-Original-Tokens`	Token count before extraction
`X-Alembic-Clean-Tokens`	Token count after extraction
`X-Alembic-Saved-Pct`	Percentage of raw tokens saved (e.g. `93%`)
`X-Alembic-Yield-Pct`	Percentage of raw tokens in clean output. Low yield (< 1%) = likely SPA or paywall — use `X-Alembic-JS-Hint-Score` to decide whether to retry with JS
`X-Alembic-Quality-Score`	0–100 content quality score. 80+ = clean prose/docs; 45–79 = moderate; 0–19 = challenge page or empty
`X-Alembic-Blocked`	`true` if a bot-wall interstitial was detected
`X-Alembic-Blocked-By`	Blocker name: `cloudflare`, `datadome`, `perimeterx`, `incapsula`, `kasada`, `bot_wall`, `unknown`
`X-Alembic-Retry`	`1` if a second persona was tried automatically after first block
`X-Alembic-Upstream-Status`	HTTP status from the upstream server (4xx/5xx only)
`X-Alembic-Wait-Status`	Playwright wait-for-selector outcome
`X-Alembic-Search-Backend`	`brave` / `searxng`
`X-Alembic-Search-Count`	Number of search results returned
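
An agent can use these headers to decide what to do with a response. The sketch below shows one possible policy built from the thresholds in the table (blocked flag, JS hint score of 6 or more, weak strategies), along with the arithmetic that relates the two percentage headers. The function names and the returned labels are assumptions, and the score is assumed to arrive as an integer string.

```python
def savings(original_tokens: int, clean_tokens: int) -> tuple[float, float]:
    """Return (saved %, yield %); the two always add up to 100."""
    if original_tokens <= 0:
        return 0.0, 0.0
    yield_pct = 100.0 * clean_tokens / original_tokens
    return 100.0 - yield_pct, yield_pct


def triage(headers: dict[str, str]) -> str:
    """Classify a response as 'blocked', 'retry-with-js', 'inspect', or 'ok'."""
    h = {key.lower(): value for key, value in headers.items()}
    if h.get("x-alembic-blocked", "false").lower() == "true":
        return "blocked"
    score = int(h.get("x-alembic-js-hint-score") or 0)
    if score >= 6:  # likely SPA: retry with ?js=true
        return "retry-with-js"
    if h.get("x-alembic-strategy", "") in ("fallback", "og-description"):
        return "inspect"  # yellow flag: thin or low-confidence extraction
    return "ok"
```

For example, a page of 1,000 raw tokens distilled to 70 clean tokens gives `savings(1000, 70) == (93.0, 7.0)`, which corresponds to `X-Alembic-Saved-Pct: 93%`.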

Configuration

Variable	Default	Purpose
`ALEMBIC_PROXY_URL`	—	Outbound proxy for fetch requests (`http://user:pass@host:port` or `socks5://…`)
`ALEMBIC_STEALTH`	`0`	Set to `1` to enable Stage 5 patchright stealth browser
`ALEMBIC_RATE_LIMIT_RPM`	`0`	Per-IP rate limit in requests/minute. `0` = disabled. Exceeded requests get HTTP 429 + `Retry-After` header. Health endpoint exempt.
`ALEMBIC_BLOCK_RETRY`	`1`	Automatically retry once with a fresh browser persona when a block is detected. Defeats probabilistic ML scoring (Cloudflare BM). Set to `0` to disable.
`ALEMBIC_SEARXNG_URL`	—	SearXNG instance URL for web search (e.g. `http://localhost:8080`). Takes priority over Brave when set.
`ALEMBIC_SEARCH_BRAVE_API_KEY`	—	Brave Search API key (2,000 queries/month free)
`BRAVE_SEARCH_API_KEY`	—	Alias for the Brave API key
`ANTHROPIC_API_KEY`	—	Claude Haiku for search synthesis (`?fetch=true`)
`FIRECRAWL_API_KEY`	—	Firecrawl SaaS JS rendering
`BROWSERLESS_API_TOKEN`	—	Browserless SaaS JS rendering
`GITHUB_TOKEN`	—	GitHub personal access token → 5,000 req/hr on `api.github.com` (default: 60/hr unauthenticated)
`SEMANTIC_SCHOLAR_API_KEY`	—	Semantic Scholar API key → per-key rate limit on `api.semanticscholar.org`
`HF_TOKEN`	—	HuggingFace token → private model access + higher rate limits on `huggingface.co`
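
The search-related variables interact: SearXNG takes priority over Brave when both are configured, and the Brave key can be supplied under either name. A sketch of that precedence follows. `pick_search_backend` is a hypothetical helper; checking `ALEMBIC_SEARCH_BRAVE_API_KEY` before its alias and returning `(None, None)` when nothing is set are assumptions, since the table does not specify either case.

```python
import os
from typing import Mapping, Optional


def pick_search_backend(
    env: Mapping[str, str] = os.environ,
) -> tuple[Optional[str], Optional[str]]:
    """Return (backend, credential_or_url) following the documented priority."""
    searxng_url = env.get("ALEMBIC_SEARXNG_URL")
    if searxng_url:
        return "searxng", searxng_url  # takes priority over Brave when set
    # Assumption: the ALEMBIC_-prefixed name wins if both Brave variables exist.
    brave_key = env.get("ALEMBIC_SEARCH_BRAVE_API_KEY") or env.get("BRAVE_SEARCH_API_KEY")
    if brave_key:
        return "brave", brave_key
    return None, None  # assumption: nothing configured
```

The backend names match the values reported in `X-Alembic-Search-Backend`.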

Authentication

When deploying Alembic as a public service, set ALEMBIC_API_KEY to require authentication:

export ALEMBIC_API_KEY=your-secret-key
alembic serve

Clients must then include one of:

curl -H "Authorization: Bearer your-secret-key" http://your-server:7077/https://example.com
curl -H "X-API-Key: your-secret-key" http://your-server:7077/https://example.com

The health endpoint (GET /) is always accessible without auth so you can check service status. Without ALEMBIC_API_KEY, no auth is required (default — local dev mode).
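
The rules above (no key means no auth, the health endpoint is always open, and either header form is accepted) can be summarised in a few lines. This is a sketch of the described behaviour, not Alembic's server code; `is_authorized` is a hypothetical function, and `path` is assumed to be the raw request target, so a search request such as `/?q=...` is not treated as the health endpoint.

```python
import hmac
from typing import Mapping, Optional


def is_authorized(method: str, path: str, headers: Mapping[str, str],
                  api_key: Optional[str]) -> bool:
    """Decide whether a request may proceed under the documented auth rules."""
    if not api_key:
        return True  # ALEMBIC_API_KEY unset: local dev mode, no auth required
    if method == "GET" and path == "/":
        return True  # health endpoint is always accessible
    h = {key.lower(): value for key, value in headers.items()}
    candidates = []
    if "x-api-key" in h:
        candidates.append(h["x-api-key"])
    auth = h.get("authorization", "")
    if auth.startswith("Bearer "):
        candidates.append(auth[len("Bearer "):])
    # compare_digest avoids leaking key contents through timing differences.
    return any(hmac.compare_digest(c.encode(), api_key.encode()) for c in candidates)
```

Using a constant-time comparison is a general precaution for secret checks; whether Alembic does the same is not stated in this document.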

Docker

# Pull and run
docker compose up -d

# Or build from source
docker build -t alembic-proxy .
docker run -p 7077:7077 -e ALEMBIC_API_KEY=secret alembic-proxy

See docs/DEPLOYMENT.md for Fly.io, Railway, and Render deployment guides.

Health Check

curl -H "Accept: application/json" http://localhost:7077/

# → {"status": "ok", "version": "1.13.0", "cache": "active"}

Plain text health check:

curl http://localhost:7077/
# → Alembic Proxy v1.13.0

CLI Reference

Command	Purpose
`alembic <url>`	Fetch and print clean content
`alembic fetch <url> --stats`	Fetch with token savings report
`alembic batch <urls…>`	Fetch multiple URLs in parallel
`alembic search "query"`	Web search via Brave / SearXNG
`alembic search "query" --fetch`	Search + distil + synthesise
`alembic serve`	Start the HTTP proxy on `localhost:7077`
`alembic clear`	Clear the entire cache
`alembic clear-url <url>`	Evict a single URL from cache
`alembic vacuum`	Remove expired entries, reclaim disk space
`alembic lifetime`	Show lifetime token savings stats

Key flags for fetch:

Flag	Effect
`--format markdown|json|text`	Output format (default: markdown)
`--stats`	Print token savings report
`--no-cache`	Bypass cache, always refetch
`--js`	Use Playwright for JS rendering
`--auto-js`	Auto-escalate to JS if page is heavily dynamic
`--saas firecrawl|browserless`	Use cloud rendering
`-H "Key: Value"`	Forward custom header to target site
`--ls key=value`	Inject into browser localStorage
`--ss key=value`	Inject into sessionStorage

Advanced Usage

Authenticated SPA (localStorage injection)

curl "http://localhost:7077/https://app.example.com/dashboard?js=true" \
  -H "X-Alembic-LocalStorage: session_token=eyJ..."
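
The same request can be assembled programmatically. The sketch below builds the proxied URL with `js=true` and the `X-Alembic-LocalStorage` header for a single key/value pair; `spa_request` is a hypothetical helper, and how multiple storage entries should be encoded in the header is not covered here.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def spa_request(target: str, storage_key: str, storage_value: str,
                proxy: str = "http://localhost:7077") -> tuple[str, dict[str, str]]:
    """Return (url, headers) for a JS-rendered fetch with one localStorage entry."""
    parts = urlsplit(target)
    query = parse_qsl(parts.query, keep_blank_values=True)
    if ("js", "true") not in query:
        query.append(("js", "true"))  # ask the proxy to render with a browser
    target_with_js = urlunsplit(parts._replace(query=urlencode(query)))
    headers = {"X-Alembic-LocalStorage": f"{storage_key}={storage_value}"}
    return f"{proxy.rstrip('/')}/{target_with_js}", headers
```

The returned pair can be passed to any HTTP client, for example `urllib.request.Request(url, headers=headers)`.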

JSON API with JMESPath filtering

curl "http://localhost:7077/https://api.example.com/users" \
  -H "Authorization: Bearer token" \
  -H "X-Alembic-JQ: data[*].email"

Python API

from src.processor import Processor
from src.config import DEFAULT_CONFIG
import asyncio

processor = Processor(DEFAULT_CONFIG)
result = asyncio.run(processor.process("https://example.com", fmt="markdown"))

print(result.content)          # clean Markdown
print(result.strategy)         # trafilatura / llms.txt / hydration / etc.
print(result.original_tokens)  # token count before
print(result.clean_tokens)     # token count after
print(result.page_type)        # article / recipe / forum / product / unknown
print(result.author)           # extracted author (if available)
print(result.publish_date)     # extracted date (if available)
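
Because `processor.process` is a coroutine, several URLs can be processed concurrently. The following sketch bounds concurrency with a semaphore and works with any async callable, so it does not depend on Alembic's internals; `process_many` and its `limit` parameter are hypothetical names introduced here.

```python
import asyncio
from typing import Any, Awaitable, Callable, Iterable


async def process_many(process: Callable[[str], Awaitable[Any]],
                       urls: Iterable[str], limit: int = 5) -> list[tuple[str, Any]]:
    """Run `process` on every URL with at most `limit` calls in flight.

    Returns (url, result_or_exception) pairs in input order, so a single
    failing URL does not abort the whole batch.
    """
    semaphore = asyncio.Semaphore(limit)

    async def one(url: str) -> tuple[str, Any]:
        async with semaphore:
            try:
                return url, await process(url)
            except Exception as exc:  # keep going; report the error per URL
                return url, exc

    return await asyncio.gather(*(one(url) for url in urls))
```

With the objects from the example above, a call might look like `asyncio.run(process_many(lambda u: processor.process(u, fmt="markdown"), urls))`.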

Development

make test          # unit tests (685 passing)
make install-daemon  # install as launchd service (macOS)

# Run tests directly
pytest tests/ -q
# Expected: 685+ passing, 0 failures

# Live integration tests (30 URLs, real proxy)
pytest tests/integration/ -q

Documentation

File	Contents
`llms.txt`	AI-to-AI quick reference — comprehensive, machine-readable
`docs/API.md`	Complete CLI, proxy, and Python API reference
`docs/ARCHITECTURE.md`	System design, data flow, module map
`docs/GUIDE.md`	Integration patterns and practical recipes
`docs/CHANGELOG.md`	Version history

License

MIT. See LICENSE for the full text.

Project details

This version

1.61.0

May 20, 2026

Download files

Source Distribution

alembic_proxy-1.61.0.tar.gz (244.4 kB)

Uploaded May 20, 2026 Source

Built Distribution

alembic_proxy-1.61.0-py3-none-any.whl (100.1 kB)

Uploaded May 20, 2026 Python 3

File details

Details for the file alembic_proxy-1.61.0.tar.gz.

File metadata

Download URL: alembic_proxy-1.61.0.tar.gz
Upload date: May 20, 2026
Size: 244.4 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for alembic_proxy-1.61.0.tar.gz
Algorithm	Hash digest
SHA256	`ea170889c7a913820e2b7ebd5a6dc26a7a1af90389c0532d713eeded3cdeb1db`
MD5	`2a58fc99ea7c325c03b5e2b25262bbb0`
BLAKE2b-256	`43db945bdff4b30363cd7e12bf5cc5ab5aa49d5efefafce5f2fa1181b13441a2`
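
A downloaded file can be checked against the SHA256 digest listed above with the standard library. This is a generic verification sketch; the function names and the local file path in the usage note are assumptions.

```python
import hashlib

# SHA256 digest published above for alembic_proxy-1.61.0.tar.gz
EXPECTED_SHA256 = "ea170889c7a913820e2b7ebd5a6dc26a7a1af90389c0532d713eeded3cdeb1db"


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: str, expected: str = EXPECTED_SHA256) -> bool:
    """True when the file's SHA256 equals the expected hex digest."""
    return sha256_of(path) == expected.lower()
```

For example, `matches("alembic_proxy-1.61.0.tar.gz")` should be true for an intact copy of the source distribution saved under that name.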

Provenance

The following attestation bundles were made for alembic_proxy-1.61.0.tar.gz:

Publisher: publish.yml on InunuNet/Alembic

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: alembic_proxy-1.61.0.tar.gz
- Subject digest: ea170889c7a913820e2b7ebd5a6dc26a7a1af90389c0532d713eeded3cdeb1db
- Sigstore transparency entry: 1584015042
- Sigstore integration time: May 20, 2026
Source repository:
- Permalink: InunuNet/Alembic@2f3a9862b4007bef4ce437e0beebe287c89de01d
- Branch / Tag: refs/tags/v1.61.0
- Owner: https://github.com/InunuNet
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@2f3a9862b4007bef4ce437e0beebe287c89de01d
- Trigger Event: release

File details

Details for the file alembic_proxy-1.61.0-py3-none-any.whl.

File metadata

Download URL: alembic_proxy-1.61.0-py3-none-any.whl
Upload date: May 20, 2026
Size: 100.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for alembic_proxy-1.61.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e5e31a76430ae59561b0639fd3bd0e9e8339ba73535ebe06a5c76ab25642a3ae`
MD5	`b1825363f43e0fcbffae919710c1c14a`
BLAKE2b-256	`09ac6d4652427274bb3764fad35bbe3d23fb795f3e351bf0223aa126b8d930a3`

Provenance

The following attestation bundles were made for alembic_proxy-1.61.0-py3-none-any.whl:

Publisher: publish.yml on InunuNet/Alembic

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: alembic_proxy-1.61.0-py3-none-any.whl
- Subject digest: e5e31a76430ae59561b0639fd3bd0e9e8339ba73535ebe06a5c76ab25642a3ae
- Sigstore transparency entry: 1584015167
- Sigstore integration time: May 20, 2026
Source repository:
- Permalink: InunuNet/Alembic@2f3a9862b4007bef4ce437e0beebe287c89de01d
- Branch / Tag: refs/tags/v1.61.0
- Owner: https://github.com/InunuNet
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@2f3a9862b4007bef4ce437e0beebe287c89de01d
- Trigger Event: release
