Pure-Python article extraction library and HTTP service - Drop-in replacement for readability-js-server

Article Extractor

High-fidelity article extraction in pure Python: library, HTTP API, and CLI that turn messy web pages into clean Markdown or HTML for ingestion, archiving, and LLM pipelines.

Requires Python 3.12+

Who This Helps

  • Backend, data, and tooling teams that need reliable article text for search, RAG, or analytics.
  • Engineers who prefer a single-language stack with fast installs and reproducible results.
  • Teams that want a ready-to-ship server/CLI and a composable Python API.

Quick Start (pick one)

CLI (fastest)

pip install article-extractor
article-extractor https://en.wikipedia.org/wiki/Wikipedia --output markdown

You will see clean Markdown printed to stdout along with the detected title, excerpt, and word count.

Server (Docker)

docker run -p 3000:3000 ghcr.io/pankaj28843/article-extractor:latest
curl -XPOST http://localhost:3000/ \
    -H "Content-Type: application/json" \
    -d '{"url": "https://en.wikipedia.org/wiki/Wikipedia"}' | jq '.title, .word_count'
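The same call can be made from Python with nothing but the standard library. A minimal client sketch, assuming the container above is listening on localhost:3000 (`build_request` and `extract` are illustrative names, not part of the package):

```python
import json
import urllib.request

SERVICE = "http://localhost:3000/"

def build_request(url: str, endpoint: str = SERVICE) -> urllib.request.Request:
    """Build the JSON POST request the service expects."""
    return urllib.request.Request(
        endpoint,
        data=json.dumps({"url": url}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def extract(url: str, endpoint: str = SERVICE) -> dict:
    """POST a URL to the running service and return the parsed JSON body."""
    with urllib.request.urlopen(build_request(url, endpoint), timeout=30) as resp:
        return json.load(resp)

# Building the request needs no network; call extract() once the container is up:
req = build_request("https://en.wikipedia.org/wiki/Wikipedia")
```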

Override cache + Playwright storage in Docker

  • docker run -e passes environment variables into the container, so you can raise or lower the LRU cache limit (ARTICLE_EXTRACTOR_CACHE_SIZE, ARTICLE_EXTRACTOR_THREADPOOL_SIZE, etc.) without rebuilding the image.
  • Use -v/--volume to mount host directories and persist assets such as the Playwright storage-state file between runs.
  • ARTICLE_EXTRACTOR_STORAGE_STATE_FILE is a project-scoped alias for the legacy PLAYWRIGHT_STORAGE_STATE_FILE. Set either one (the alias wins when both are set) to keep cookies/session data on a mounted volume. ARTICLE_EXTRACTOR_PREFER_PLAYWRIGHT (defaults to true) controls which fetcher the server prefers when both Playwright and httpx are installed.
  • Example: keep Playwright cookies on the host, force Playwright as the default fetcher, and increase the cache to 2k entries while running the published image:
docker run --rm -p 3000:3000 \
    -e ARTICLE_EXTRACTOR_CACHE_SIZE=2000 \
    -e ARTICLE_EXTRACTOR_STORAGE_STATE_FILE=/data/storage-state.json \
    -e ARTICLE_EXTRACTOR_PREFER_PLAYWRIGHT=true \
    -v $HOME/.article-extractor:/data \
    ghcr.io/pankaj28843/article-extractor:latest

Mounting the host directory ensures /data/storage-state.json survives container restarts, so a headed Playwright session can pass a bot check once and reuse the same cookies on later runs.

Python Library

from article_extractor import extract_article

html = """
<html><body><article><h1>Title</h1><p>Content...</p></article></body></html>
"""

result = extract_article(html, url="https://example.com/demo")
print(result.title)
print(result.markdown.splitlines()[0])
print(f"Words: {result.word_count}")

Why Teams Choose Article Extractor

  • Accuracy first: Readability-style scoring tuned for long-form content and docs.
  • Clean output: GFM-ready Markdown and sanitised HTML safe for downstream pipelines.
  • Speed at scale: Caching plus early-termination heuristics keep typical pages in 50–150 ms.
  • Drop-in everywhere: Same engine across CLI, HTTP server, and Python API.
  • Test coverage that catches regressions before you do.

Installation

pip install "article-extractor[server]"  # HTTP server
pip install "article-extractor[all]"     # All optional extras

# Or with uv (fast installs)
uv add article-extractor --extra server

HTTP Server

# Local
uvicorn article_extractor.server:app --host 0.0.0.0 --port 3000

# Docker
docker run -d -p 3000:3000 --name article-extractor \
    --restart unless-stopped ghcr.io/pankaj28843/article-extractor:latest

Endpoints:

  • POST / — Extract article ({"url": "..."})
  • GET / — Service info
  • GET /health — Health check
  • GET /docs — Interactive OpenAPI UI
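In CI or docker-compose wrappers it is handy to block until GET /health answers before sending work. A small readiness probe, assuming a plain 200 response from the endpoint listed above (`wait_until_healthy` is an illustrative helper, not part of the package):

```python
import time
import urllib.error
import urllib.request

def wait_until_healthy(base: str = "http://localhost:3000",
                       timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Poll GET /health until it returns 200 or the deadline passes."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(base + "/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; retry after a short sleep
        time.sleep(interval)
    return False
```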

CLI

# Extract from URL
article-extractor https://en.wikipedia.org/wiki/Wikipedia

# From file
article-extractor --file article.html --output markdown

# One-off via Docker
docker run --rm ghcr.io/pankaj28843/article-extractor:latest \
    article-extractor https://en.wikipedia.org/wiki/Wikipedia --output text

Networking controls (CLI & Server)

  • Honor corporate proxies automatically: HTTP_PROXY, HTTPS_PROXY, ALL_PROXY, and NO_PROXY are applied to every fetcher, or override them with --proxy=http://user:pass@proxy:8080 and the optional --prefer-httpx/--prefer-playwright flags.
  • Rotate or pin User-Agents: pass --user-agent for deterministic runs or --random-user-agent to synthesize realistic desktop headers with fake-useragent (pip install article-extractor[ua] or use the all/server extras). Disable later via --no-random-user-agent.
  • Handle bot challenges: --headed --user-interaction-timeout 30 launches Chromium with a visible window, pauses for manual CAPTCHA solving, and persists cookies to ~/.article-extractor/storage_state.json (override with --storage-state).
  • Docker and server mode forward the same defaults. When the FastAPI server is launched via the CLI, these values seed every request, and individual requests can override them by supplying a POST body such as {"prefer_playwright": false, "network": {"proxy": "http://", "headed": true, ...}}.

Server POST example with overrides:

{
    "url": "https://example.com/paywalled",
    "prefer_playwright": true,
    "network": {
        "user_agent": "MyMonitor/1.0",
        "random_user_agent": false,
        "proxy": "http://proxy.internal:8080",
        "proxy_bypass": ["metadata.internal"],
        "headed": true,
        "user_interaction_timeout": 25,
        "storage_state": "/var/lib/article-extractor/storage_state.json"
    }
}

Python API

from article_extractor import extract_article, extract_article_from_url, ExtractionOptions

options = ExtractionOptions(min_word_count=120, include_images=True)
result = extract_article("<html>...</html>", url="https://example.com", options=options)

print(result.title)
print(result.excerpt)
print(result.success)

ArticleResult fields: title, content, markdown, excerpt, word_count, success, error, url, author, date_published, language, warnings.

Use Cases

  • LLM and RAG ingestion with clean Markdown ready for embeddings.
  • Content archiving and doc syncing without ads or cruft.
  • RSS/feed readers and knowledge tools that need readable HTML.
  • Research pipelines that batch-extract large reading lists.
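For the batch-extraction case, a thread pool over a reading list is usually enough, since extraction is I/O-bound on the fetch. A sketch where `extract` is any callable, e.g. extract_article_from_url or a POST to the server (`batch_extract` is an illustrative helper):

```python
from concurrent.futures import ThreadPoolExecutor

def batch_extract(urls, extract, max_workers=8):
    """Apply extract() to each URL concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(extract, urls))

# Usage with a stand-in extractor:
results = batch_extract(["https://a.example", "https://b.example"],
                        lambda url: {"url": url, "ok": True})
```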

How It Works

  1. Parse HTML with JustHTML.
  2. Strip noise (scripts, nav, styles) and find content candidates.
  3. Score candidates with a Readability-inspired model (density, link ratio, structure hints).
  4. Pick the winner, normalise headings/links, and emit clean HTML.
  5. Convert to GFM-compatible Markdown for downstream tools.
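The scoring in step 3 can be illustrated with a toy link-density heuristic: blocks that are mostly anchor text (navigation, footers) score low, while dense paragraphs score high. This is a simplification of the pipeline above, not the library's actual model:

```python
import re

def link_density(block: str) -> float:
    """Fraction of a block's words that live inside <a> tags."""
    words = len(re.sub(r"<[^>]+>", " ", block).split())
    anchors = " ".join(re.findall(r"<a\b[^>]*>(.*?)</a>", block, re.S | re.I))
    link_words = len(re.sub(r"<[^>]+>", " ", anchors).split())
    return link_words / words if words else 1.0

def score(block: str) -> float:
    """Toy candidate score: word count penalised by link density."""
    words = len(re.sub(r"<[^>]+>", " ", block).split())
    return words * (1.0 - link_density(block))
```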

Configuration

HOST=0.0.0.0
PORT=3000
LOG_LEVEL=info
WEB_CONCURRENCY=2
ARTICLE_EXTRACTOR_CACHE_SIZE=1000
ARTICLE_EXTRACTOR_THREADPOOL_SIZE=0
ARTICLE_EXTRACTOR_PREFER_PLAYWRIGHT=true
ARTICLE_EXTRACTOR_STORAGE_STATE_FILE=/data/storage-state.json  # alias for Playwright storage path
# Legacy equivalent (still supported):
# PLAYWRIGHT_STORAGE_STATE_FILE=/data/storage-state.json
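Since these are ordinary environment variables, a wrapper script can read them with the usual stdlib pattern. A sketch using the names from the block above (`env_int`/`env_bool` are illustrative helpers; the fallbacks are assumed to match the defaults shown):

```python
import os

def env_int(name: str, default: int) -> int:
    """Read an integer env var, falling back to default when unset or non-numeric."""
    raw = os.environ.get(name)
    return int(raw) if raw and raw.isdigit() else default

def env_bool(name: str, default: bool) -> bool:
    """Read a boolean env var ('1'/'true'/'yes' count as true)."""
    raw = os.environ.get(name)
    return raw.strip().lower() in {"1", "true", "yes"} if raw else default

cache_size = env_int("ARTICLE_EXTRACTOR_CACHE_SIZE", 1000)
prefer_playwright = env_bool("ARTICLE_EXTRACTOR_PREFER_PLAYWRIGHT", True)
```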

Troubleshooting

  • JavaScript-heavy sites: pip install "article-extractor[playwright]" and try again.
  • Empty or short output: lower min_word_count or inspect result.warnings.
  • Ports busy: lsof -i :3000 then restart with --port 8000.
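The port check can also be done from Python without lsof, which is useful on machines where it is not installed (`port_in_use` is an illustrative helper):

```python
import socket

def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is accepting TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1.0)
        return s.connect_ex((host, port)) == 0
```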

Development

git clone https://github.com/pankaj28843/article-extractor.git
cd article-extractor
uv sync --all-extras
uv run ruff format . && uv run ruff check --fix .
uv run pytest tests/ -v

Contributing

See CONTRIBUTING.md for setup, coding standards, and the full validation loop. PRs with tests and doc improvements are welcome.

License

MIT — see LICENSE

Download files

Source Distribution

article_extractor-0.3.2.tar.gz (116.9 kB)

Built Distribution

article_extractor-0.3.2-py3-none-any.whl (31.7 kB)

File details

Details for the file article_extractor-0.3.2.tar.gz.

File metadata

  • Download URL: article_extractor-0.3.2.tar.gz
  • Size: 116.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: uv/0.9.21 (Ubuntu 24.04, CI)

File hashes

Hashes for article_extractor-0.3.2.tar.gz

Algorithm    Hash digest
SHA256       abb15a8a6259355bdd8d66e30e54c3040c8f32d1a1277eb92bbc270b0500a18f
MD5          4a3cd71c81abb7af6d7c2f27e95aa1c8
BLAKE2b-256  cf1d4f1ad7764e038af1f74bed19b2156342e9341823719666a6fcdb096f2808

File details

Details for the file article_extractor-0.3.2-py3-none-any.whl.

File metadata

  • Download URL: article_extractor-0.3.2-py3-none-any.whl
  • Size: 31.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: uv/0.9.21 (Ubuntu 24.04, CI)

File hashes

Hashes for article_extractor-0.3.2-py3-none-any.whl

Algorithm    Hash digest
SHA256       e2459194f7d85ade1f527bb1799267c624da1dc515aa9e1a724939e306877ba6
MD5          22950ddc82574051092d1ee56b150d26
BLAKE2b-256  46a19a172010a22f97762bfbfbf390835c4ad2b44f037cb4b985c060ee0f1e71
