Pure-Python article extraction library and HTTP service - Drop-in replacement for readability-js-server

Article Extractor


Pure-Python article extraction—extract clean content from any web page, no Node.js required.

Article Extractor provides a Python library, HTTP API server, and CLI tool for extracting main content from HTML documents (articles, blog posts, documentation) and converting it to clean Markdown or HTML.

Why Article Extractor?

  • Pure Python – No Node.js, no Selenium, no external APIs
  • Battle-tested – Uses Mozilla Readability.js scoring algorithms
  • Markdown output – Clean GFM for LLMs, docs, or archiving
  • Fast – Cached calculations, early termination, 50-150ms typical extraction
  • Safe – XSS-safe output via JustHTML
  • Flexible – Library, HTTP server, or CLI
  • Well-tested – 94%+ test coverage with comprehensive test suite

Installation

pip install article-extractor[server]  # HTTP server
pip install article-extractor[all]     # All features

# Or with uv (faster)
uv add article-extractor --extra server

Quick Start

As an HTTP Server

# Run in foreground
docker run -p 3000:3000 ghcr.io/pankaj28843/article-extractor:latest

# Run in daemon mode (detached)
docker run -d -p 3000:3000 --name article-extractor ghcr.io/pankaj28843/article-extractor:latest

# Or run locally with uvicorn
uvicorn article_extractor.server:app --host 0.0.0.0 --port 3000

Extract from URL:

curl -X POST http://localhost:3000/ \
    -H "Content-Type: application/json" \
    -d '{"url": "https://en.wikipedia.org/wiki/Wikipedia"}'

Response:

{
  "url": "https://en.wikipedia.org/wiki/Wikipedia",
  "title": "Wikipedia - Wikipedia",
  "byline": null,
  "dir": "ltr",
  "content": "<div><p>Wikipedia is a free content online encyclopedia...</p></div>",
  "length": 89234,
  "excerpt": "Wikipedia is a free content online encyclopedia written and maintained by a community of volunteers, known as Wikipedians, through open collaboration and using a wiki-based editing system called MediaWiki.",
  "siteName": null,
  "markdown": "# Wikipedia\n\nWikipedia is a free content online encyclopedia...",
  "word_count": 33414,
  "success": true
}
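On the client side the response is plain JSON; a minimal sketch of consuming it (the helper name is ours, the field names are from the response above):

```python
import json

def handle_response(raw: str) -> str:
    """Parse an extraction response and return the Markdown body.

    Falls back to the HTML `content` field if `markdown` is empty,
    and raises on unsuccessful extractions.
    """
    data = json.loads(raw)
    if not data.get("success"):
        raise RuntimeError(f"extraction failed for {data.get('url')}")
    return data.get("markdown") or data["content"]

# Trimmed version of the response shown above
raw = json.dumps({
    "url": "https://en.wikipedia.org/wiki/Wikipedia",
    "title": "Wikipedia - Wikipedia",
    "markdown": "# Wikipedia\n\nWikipedia is a free content online encyclopedia...",
    "content": "<div><p>Wikipedia is a free content online encyclopedia...</p></div>",
    "success": True,
})
print(handle_response(raw)[:11])  # "# Wikipedia"
```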

As a CLI Tool

# Extract from URL
article-extractor https://en.wikipedia.org/wiki/Wikipedia

# Extract from file
article-extractor --file article.html --output markdown

# Extract from stdin
echo '<html>...</html>' | article-extractor --output text

# Or via Docker
docker run --rm -it ghcr.io/pankaj28843/article-extractor:latest \
    article-extractor https://en.wikipedia.org/wiki/Wikipedia

As a Python Library

from article_extractor import extract_article, extract_article_from_url
import asyncio

# From HTML string
html = '<html><body><article><h1>Title</h1><p>Content...</p></article></body></html>'
result = extract_article(html, url="https://en.wikipedia.org/wiki/Wikipedia")
print(result.markdown)
print(f"Extracted {result.word_count} words")

# From URL (async) - recommended for web pages
async def extract():
    result = await extract_article_from_url("https://en.wikipedia.org/wiki/Wikipedia")
    if result.success:
        print(f"Title: {result.title}")
        print(f"Words: {result.word_count}")
        print(f"Excerpt: {result.excerpt[:100]}...")
    else:
        print(f"Extraction failed: {result.error}")

asyncio.run(extract())

Docker Usage

# Run in daemon mode
docker run -d -p 3000:3000 --name article-extractor \
    --restart unless-stopped \
    ghcr.io/pankaj28843/article-extractor:latest

# Check logs
docker logs article-extractor

# Stop/start/restart
docker stop article-extractor
docker start article-extractor
docker restart article-extractor

# CLI mode (one-off extraction)
docker run --rm ghcr.io/pankaj28843/article-extractor:latest \
    article-extractor https://en.wikipedia.org/wiki/Wikipedia --output markdown

With docker-compose:

services:
  article-extractor:
    image: ghcr.io/pankaj28843/article-extractor:latest
    ports:
      - "3000:3000"
    restart: unless-stopped
    environment:
      - LOG_LEVEL=info

Test the server:

# Health check
curl http://localhost:3000/health

# Extract article
curl -X POST http://localhost:3000/ \
    -H "Content-Type: application/json" \
    -d '{"url": "https://en.wikipedia.org/wiki/Wikipedia"}' | jq '.title'

Supported platforms: linux/amd64, linux/arm64
Available tags: latest, 0, 0.2, 0.2.0

API Reference

HTTP Endpoints

  • POST / – Extract article (send {"url": "..."})
  • GET / – Service info
  • GET /health – Health check
  • GET /docs – Interactive API docs
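For scripted clients with no extra dependencies, the POST endpoint can be called from the standard library; a sketch (the endpoint URL and payload shape come from the examples above, the helper name is ours):

```python
import json
import urllib.request

def build_extract_request(url: str,
                          endpoint: str = "http://localhost:3000/") -> urllib.request.Request:
    """Build a POST request matching the server's {"url": ...} contract."""
    payload = json.dumps({"url": url}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_extract_request("https://en.wikipedia.org/wiki/Wikipedia")
# urllib.request.urlopen(req) returns the JSON document shown in Quick Start
```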

Python API

extract_article(html, url="", options=None) -> ArticleResult
extract_article_from_url(url, fetcher=None, options=None) -> ArticleResult

ArticleResult fields:

  • title – Extracted article title
  • content – Clean HTML content
  • markdown – Markdown version (GFM-compatible)
  • excerpt – First ~200 characters
  • word_count – Total words in article
  • success – Whether extraction succeeded
  • error – Error message if extraction failed
  • url – Original URL
  • author – Article author (if detected)
  • date_published – Publication date (if detected)
  • language – Content language (if detected)
  • warnings – List of extraction warnings

Options:

ExtractionOptions(
    min_word_count=150,
    min_char_threshold=500,
    include_images=True,
    include_code_blocks=True,
    safe_markdown=True
)

CLI

article-extractor https://en.wikipedia.org/wiki/Wikipedia  # Extract from URL
article-extractor --file article.html                      # From file
article-extractor --file article.html --output markdown    # Markdown output
article-extractor --server --port 3000                     # Start server

Use Cases

  • LLM/RAG pipelines – Extract clean article text for vector databases or prompts
  • Content archiving – Save web articles as Markdown for documentation
  • RSS/feed readers – Display clean article content without ads
  • Research tools – Batch extract articles from reading lists
  • Web scrapers – Get main content without parsing complex HTML
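For the LLM/RAG case, the extracted Markdown is usually chunked before embedding. A minimal word-window chunker, independent of this library's API (names and parameters are ours):

```python
def chunk_markdown(text: str, max_words: int = 200, overlap: int = 20) -> list[str]:
    """Split text into overlapping word windows suitable for embedding."""
    words = text.split()
    if not words:
        return []
    step = max(1, max_words - overlap)
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks

chunks = chunk_markdown("word " * 450, max_words=200, overlap=20)
# 450 words -> windows starting at word 0, 180, 360
```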

How It Works

  1. Parse HTML – Uses JustHTML's HTML5-compliant parser
  2. Clean document – Removes scripts, styles, navigation, footers
  3. Find candidates – Identifies potential content containers (<article>, <main>, high-scoring divs)
  4. Score candidates – Applies readability scoring (tag type, class/ID patterns, text density, link density)
  5. Extract winner – Selects highest-scoring element as main content
  6. Convert to Markdown – Transforms HTML to clean GFM-compatible Markdown

Algorithm based on Mozilla Readability.js with Python optimizations.
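The link-density signal from step 4 can be sketched in pure Python with the standard-library parser (a deliberate simplification; the real scorer combines this with tag type, class/ID patterns, and text density):

```python
from html.parser import HTMLParser

class LinkDensity(HTMLParser):
    """Measure the fraction of text characters that sit inside <a> tags.

    Navigation blocks score high (mostly links); article bodies score low.
    """
    def __init__(self):
        super().__init__()
        self.total = 0
        self.linked = 0
        self._in_link = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self._in_link -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.total += n
        if self._in_link:
            self.linked += n

def link_density(html: str) -> float:
    p = LinkDensity()
    p.feed(html)
    return p.linked / p.total if p.total else 0.0

nav = '<ul><li><a href="/">Home</a></li><li><a href="/about">About</a></li></ul>'
article = '<p>Wikipedia is a free content online encyclopedia with <a href="#">one link</a>.</p>'
# nav is almost entirely link text; the article paragraph is mostly prose
```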

Configuration

Environment variables:

HOST=0.0.0.0        # Server bind address
PORT=3000           # Server port
LOG_LEVEL=info      # Logging level (debug, info, warning, error)
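These can be read with standard-library defaults; a sketch of how an entry point might consume them (variable names and defaults are from the list above, the function is ours):

```python
import os

def server_config() -> dict:
    """Read server settings from the environment, using the documented defaults."""
    return {
        "host": os.environ.get("HOST", "0.0.0.0"),
        "port": int(os.environ.get("PORT", "3000")),
        "log_level": os.environ.get("LOG_LEVEL", "info"),
    }

cfg = server_config()
```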

Production deployment:

# With multiple workers
uvicorn article_extractor.server:app \
    --host 0.0.0.0 \
    --port 3000 \
    --workers 4 \
    --log-level info

# With Docker (daemon mode)
docker run -d \
    -p 3000:3000 \
    --name article-extractor \
    --restart unless-stopped \
    -e LOG_LEVEL=info \
    ghcr.io/pankaj28843/article-extractor:latest

FAQ

JavaScript-heavy sites? Install the playwright extra: pip install article-extractor[playwright]

Extraction fails? Check result.success / result.error. Common causes: login required, content too short, JavaScript rendering needed

Production-ready? Yes. Pin version: ghcr.io/pankaj28843/article-extractor:0

Rate limiting? Use reverse proxy (nginx, Caddy) or API gateway

Development

git clone https://github.com/pankaj28843/article-extractor.git
cd article-extractor
uv sync --all-extras
uv run pytest
uv run ruff format . && uv run ruff check --fix .

Run server: uv run uvicorn article_extractor.server:app --reload --port 3000

Structure:

src/article_extractor/
├── server.py    # FastAPI HTTP server
├── cli.py       # CLI interface  
├── extractor.py # Extraction logic
├── scorer.py    # Readability scoring
└── fetcher.py   # URL fetching

Troubleshooting

Port in use: find the conflicting process with lsof -i :3000, or start on another port: uvicorn article_extractor.server:app --port 8000

Empty extraction: check result.success and result.error; the page may need Playwright rendering, or try lowering min_word_count

Playwright errors: playwright install chromium

License

MIT – see LICENSE

Acknowledgments


Built with ❤️ using pure Python. No Node.js required.
