Pure-Python article extraction library and HTTP service - Drop-in replacement for readability-js-server
Article Extractor
Pure-Python article extraction—extract clean content from any web page, no Node.js required.
Article Extractor provides a Python library, HTTP API server, and CLI tool for extracting main content from HTML documents (articles, blog posts, documentation) and converting it to clean Markdown or HTML.
Why Article Extractor?
- Pure Python – No Node.js, no Selenium, no external APIs
- Battle-tested – Uses Mozilla Readability.js scoring algorithms
- Markdown output – Clean GFM for LLMs, docs, or archiving
- Fast – Cached calculations, early termination, 50-150ms typical extraction
- Safe – XSS-safe output via JustHTML
- Flexible – Library, HTTP server, or CLI
- Well-tested – 94%+ test coverage with comprehensive test suite
Installation
pip install article-extractor # Core library
pip install article-extractor[server] # HTTP server
pip install article-extractor[all] # All features
# Or with uv (faster)
uv add article-extractor --extra server
Quick Start
As an HTTP Server
# Run in foreground
docker run -p 3000:3000 ghcr.io/pankaj28843/article-extractor:latest
# Run in daemon mode (detached)
docker run -d -p 3000:3000 --name article-extractor ghcr.io/pankaj28843/article-extractor:latest
# Or run locally with uvicorn
uvicorn article_extractor.server:app --host 0.0.0.0 --port 3000
Extract from URL:
curl -XPOST http://localhost:3000/ \
-H "Content-Type: application/json" \
-d'{"url": "https://en.wikipedia.org/wiki/Wikipedia"}'
Response:
{
"url": "https://en.wikipedia.org/wiki/Wikipedia",
"title": "Wikipedia - Wikipedia",
"byline": null,
"dir": "ltr",
"content": "<div><p>Wikipedia is a free content online encyclopedia...</p></div>",
"length": 89234,
"excerpt": "Wikipedia is a free content online encyclopedia written and maintained by a community of volunteers, known as Wikipedians, through open collaboration and using a wiki-based editing system called MediaWiki.",
"siteName": null,
"markdown": "# Wikipedia\n\nWikipedia is a free content online encyclopedia...",
"word_count": 33414,
"success": true
}
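The same request can be made from Python with only the standard library. A minimal sketch against the endpoint shown above (the helper names here are illustrative, not part of the package):

```python
import json
import urllib.request

SERVER = "http://localhost:3000/"

def build_request(article_url, server=SERVER):
    """Build a POST request matching the server's JSON contract."""
    payload = json.dumps({"url": article_url}).encode("utf-8")
    return urllib.request.Request(
        server,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def extract_via_server(article_url, server=SERVER):
    """Send the request and return the decoded JSON response body."""
    with urllib.request.urlopen(build_request(article_url, server)) as resp:
        return json.load(resp)
```

With the server running, `extract_via_server("https://en.wikipedia.org/wiki/Wikipedia")` returns the same JSON document shown above.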
As a CLI Tool
# Extract from URL
article-extractor https://en.wikipedia.org/wiki/Wikipedia
# Extract from file
article-extractor --file article.html --output markdown
# Extract from stdin
echo '<html>...</html>' | article-extractor --output text
# Or via Docker
docker run --rm -it ghcr.io/pankaj28843/article-extractor:latest \
article-extractor https://en.wikipedia.org/wiki/Wikipedia
As a Python Library
from article_extractor import extract_article, extract_article_from_url
import asyncio
# From HTML string
html = '<html><body><article><h1>Title</h1><p>Content...</p></article></body></html>'
result = extract_article(html, url="https://en.wikipedia.org/wiki/Wikipedia")
print(result.markdown)
print(f"Extracted {result.word_count} words")
# From URL (async) - recommended for web pages
async def extract():
result = await extract_article_from_url("https://en.wikipedia.org/wiki/Wikipedia")
if result.success:
print(f"Title: {result.title}")
print(f"Words: {result.word_count}")
print(f"Excerpt: {result.excerpt[:100]}...")
else:
print(f"Extraction failed: {result.error}")
asyncio.run(extract())
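For many URLs, the async API composes naturally with `asyncio.gather`. A sketch of a bounded-concurrency helper (the helper name and the default limit are illustrative, not part of the library):

```python
import asyncio

async def extract_many(urls, extract, limit=10):
    """Run an async extractor (e.g. extract_article_from_url) over many
    URLs, keeping at most `limit` requests in flight at once."""
    sem = asyncio.Semaphore(limit)

    async def one(url):
        async with sem:
            return await extract(url)

    # gather preserves input order in its result list
    return await asyncio.gather(*(one(u) for u in urls))
```

Then `results = asyncio.run(extract_many(urls, extract_article_from_url))` and filter on `result.success`.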
Docker Usage
# Run in daemon mode
docker run -d -p 3000:3000 --name article-extractor \
--restart unless-stopped \
ghcr.io/pankaj28843/article-extractor:latest
# Check logs
docker logs article-extractor
# Stop/start/restart
docker stop article-extractor
docker start article-extractor
docker restart article-extractor
# CLI mode (one-off extraction)
docker run --rm ghcr.io/pankaj28843/article-extractor:latest \
article-extractor https://en.wikipedia.org/wiki/Wikipedia --output markdown
With docker-compose:
services:
article-extractor:
image: ghcr.io/pankaj28843/article-extractor:latest
ports:
- "3000:3000"
restart: unless-stopped
environment:
- LOG_LEVEL=info
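Since the server exposes `GET /health`, the compose file can also declare a container health check. A sketch, assuming `curl` is available inside the image (if it is not, substitute `wget` or a small Python one-liner):

```yaml
services:
  article-extractor:
    image: ghcr.io/pankaj28843/article-extractor:latest
    ports:
      - "3000:3000"
    restart: unless-stopped
    environment:
      - LOG_LEVEL=info
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 30s
      timeout: 5s
      retries: 3
```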
Test the server:
# Health check
curl http://localhost:3000/health
# Extract article
curl -XPOST http://localhost:3000/ \
-H "Content-Type: application/json" \
-d'{"url": "https://en.wikipedia.org/wiki/Wikipedia"}' | jq '.title'
Supported platforms: linux/amd64, linux/arm64
Available tags: latest, 0, 0.2, 0.2.0
API Reference
HTTP Endpoints
- POST / – Extract article (send {"url": "..."})
- GET / – Service info
- GET /health – Health check
- GET /docs – Interactive API docs
Python API
extract_article(html, url="", options=None) -> ArticleResult
extract_article_from_url(url, fetcher=None, options=None) -> ArticleResult
ArticleResult fields:
- title – Extracted article title
- content – Clean HTML content
- markdown – Markdown version (GFM-compatible)
- excerpt – First ~200 characters
- word_count – Total words in article
- success – Whether extraction succeeded
- error – Error message if extraction failed
- url – Original URL
- author – Article author (if detected)
- date_published – Publication date (if detected)
- language – Content language (if detected)
- warnings – List of extraction warnings
Options:
ExtractionOptions(
min_word_count=150,
min_char_threshold=500,
include_images=True,
include_code_blocks=True,
safe_markdown=True
)
CLI
article-extractor https://en.wikipedia.org/wiki/Wikipedia # Extract from URL
article-extractor --file article.html # From file
article-extractor --file article.html --output markdown # Markdown output
article-extractor --server --port 3000 # Start server
Use Cases
- LLM/RAG pipelines – Extract clean article text for vector databases or prompts
- Content archiving – Save web articles as Markdown for documentation
- RSS/feed readers – Display clean article content without ads
- Research tools – Batch extract articles from reading lists
- Web scrapers – Get main content without parsing complex HTML
How It Works
- Parse HTML – Uses JustHTML's HTML5-compliant parser
- Clean document – Removes scripts, styles, navigation, footers
- Find candidates – Identifies potential content containers (<article>, <main>, high-scoring divs)
- Score candidates – Applies readability scoring (tag type, class/ID patterns, text density, link density)
- Extract winner – Selects highest-scoring element as main content
- Convert to Markdown – Transforms HTML to clean GFM-compatible Markdown
Algorithm based on Mozilla Readability.js with Python optimizations.
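The scoring step can be illustrated with a simplified sketch. The weights, caps, and class/ID patterns below are invented for illustration; they are not the library's actual constants:

```python
import re

# Illustrative class/ID patterns (assumed, not the library's real lists)
CLASS_BONUS = re.compile(r"article|content|main|post", re.I)
CLASS_PENALTY = re.compile(r"nav|footer|sidebar|comment|ad", re.I)

def link_density(text_len, link_text_len):
    """Fraction of a candidate's text that sits inside links."""
    return link_text_len / text_len if text_len else 0.0

def score_candidate(class_id, text_len, link_text_len, comma_count):
    """Readability-style score: reward text mass and prose, penalize
    navigation-like names, then discount by link density."""
    score = min(text_len // 100, 3)   # reward text mass, capped
    score += comma_count              # prose tends to contain commas
    if CLASS_BONUS.search(class_id):
        score += 25
    if CLASS_PENALTY.search(class_id):
        score -= 25
    return score * (1 - link_density(text_len, link_text_len))
```

Under this scheme a long, comma-rich `div class="article-body"` handily outscores a link-heavy `div class="sidebar nav"`, which is the intuition behind the real algorithm.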
Configuration
Environment variables:
HOST=0.0.0.0 # Server bind address
PORT=3000 # Server port
LOG_LEVEL=info # Logging level (debug, info, warning, error)
Production deployment:
# With multiple workers
uvicorn article_extractor.server:app \
--host 0.0.0.0 \
--port 3000 \
--workers 4 \
--log-level info
# With Docker (daemon mode)
docker run -d \
-p 3000:3000 \
--name article-extractor \
--restart unless-stopped \
-e LOG_LEVEL=info \
ghcr.io/pankaj28843/article-extractor:latest
FAQ
JavaScript-heavy sites? Install playwright extra: pip install article-extractor[playwright]
Extraction fails? Check result.success / result.error. Common causes: login required, content too short, JavaScript rendering needed
Production-ready? Yes. Pin version: ghcr.io/pankaj28843/article-extractor:0
Rate limiting? Use reverse proxy (nginx, Caddy) or API gateway
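A hypothetical nginx snippet for per-client rate limiting in front of the server (zone name, rate, and burst are placeholder values to tune for your traffic):

```nginx
limit_req_zone $binary_remote_addr zone=extract:10m rate=5r/s;

server {
    listen 80;
    location / {
        limit_req zone=extract burst=10;
        proxy_pass http://127.0.0.1:3000;
    }
}
```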
Development
git clone https://github.com/pankaj28843/article-extractor.git
cd article-extractor
uv sync --all-extras
uv run pytest
uv run ruff format . && uv run ruff check --fix .
Run server: uv run uvicorn article_extractor.server:app --reload --port 3000
Structure:
src/article_extractor/
├── server.py # FastAPI HTTP server
├── cli.py # CLI interface
├── extractor.py # Extraction logic
├── scorer.py # Readability scoring
└── fetcher.py # URL fetching
Troubleshooting
Port in use: lsof -i :3000 → uvicorn article_extractor.server:app --port 8000
Empty extraction: Check result.success, may need playwright, lower min_word_count
Playwright errors: playwright install chromium
License
MIT – see LICENSE
Acknowledgments
- JustHTML – HTML5 parser
- Mozilla Readability.js – Extraction algorithm
- readability-js-server – API design inspiration
Built with ❤️ using pure Python. No Node.js required.