Pure-Python article extraction library and HTTP service - Drop-in replacement for readability-js-server

Article Extractor


Pure-Python article extraction—extract clean content from any web page, no Node.js required.

Article Extractor provides a Python library, HTTP API server, and CLI tool for extracting main content from HTML documents (articles, blog posts, documentation) and converting it to clean Markdown or HTML.

Why Article Extractor?

  • Pure Python – No Node.js, no Selenium, no external APIs
  • Battle-tested – Uses Mozilla Readability.js scoring algorithms
  • Markdown output – Clean GFM for LLMs, docs, or archiving
  • Fast – Cached calculations, early termination, 50-150ms typical extraction
  • Safe – XSS-safe output via JustHTML
  • Flexible – Library, HTTP server, or CLI
  • Well-tested – 94%+ test coverage with comprehensive test suite

Installation

pip install article-extractor[server]  # HTTP server
pip install article-extractor[all]     # All features

# Or with uv (faster)
uv add article-extractor --extra server

Quick Start

As an HTTP Server

# Run in foreground
docker run -p 3000:3000 ghcr.io/pankaj28843/article-extractor:latest

# Run in daemon mode (detached)
docker run -d -p 3000:3000 --name article-extractor ghcr.io/pankaj28843/article-extractor:latest

# Or run locally with uvicorn
uvicorn article_extractor.server:app --host 0.0.0.0 --port 3000

Extract from URL:

curl -X POST http://localhost:3000/ \
    -H "Content-Type: application/json" \
    -d '{"url": "https://en.wikipedia.org/wiki/Wikipedia"}'

Response:

{
  "url": "https://en.wikipedia.org/wiki/Wikipedia",
  "title": "Wikipedia - Wikipedia",
  "byline": null,
  "dir": "ltr",
  "content": "<div><p>Wikipedia is a free content online encyclopedia...</p></div>",
  "length": 89234,
  "excerpt": "Wikipedia is a free content online encyclopedia written and maintained by a community of volunteers, known as Wikipedians, through open collaboration and using a wiki-based editing system called MediaWiki.",
  "siteName": null,
  "markdown": "# Wikipedia\n\nWikipedia is a free content online encyclopedia...",
  "word_count": 33414,
  "success": true
}
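On the client side the response is plain JSON; a minimal sketch of consuming it (the helper name is ours, the field names are from the response above):

```python
import json

def handle_response(raw: str) -> str:
    """Parse an extraction response and return the Markdown body.

    Falls back to the HTML `content` field if `markdown` is empty,
    and raises on unsuccessful extractions.
    """
    data = json.loads(raw)
    if not data.get("success"):
        raise RuntimeError(f"extraction failed for {data.get('url')}")
    return data.get("markdown") or data["content"]

# Trimmed version of the response shown above
raw = json.dumps({
    "url": "https://en.wikipedia.org/wiki/Wikipedia",
    "title": "Wikipedia - Wikipedia",
    "markdown": "# Wikipedia\n\nWikipedia is a free content online encyclopedia...",
    "content": "<div><p>Wikipedia is a free content online encyclopedia...</p></div>",
    "success": True,
})
print(handle_response(raw)[:11])  # "# Wikipedia"
```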

As a CLI Tool

# Extract from URL
article-extractor https://en.wikipedia.org/wiki/Wikipedia

# Extract from file
article-extractor --file article.html --output markdown

# Extract from stdin
echo '<html>...</html>' | article-extractor --output text

# Or via Docker
docker run --rm -it ghcr.io/pankaj28843/article-extractor:latest \
    article-extractor https://en.wikipedia.org/wiki/Wikipedia

As a Python Library

from article_extractor import extract_article, extract_article_from_url
import asyncio

# From HTML string
html = '<html><body><article><h1>Title</h1><p>Content...</p></article></body></html>'
result = extract_article(html, url="https://en.wikipedia.org/wiki/Wikipedia")
print(result.markdown)
print(f"Extracted {result.word_count} words")

# From URL (async) - recommended for web pages
async def extract():
    result = await extract_article_from_url("https://en.wikipedia.org/wiki/Wikipedia")
    if result.success:
        print(f"Title: {result.title}")
        print(f"Words: {result.word_count}")
        print(f"Excerpt: {result.excerpt[:100]}...")
    else:
        print(f"Extraction failed: {result.error}")

asyncio.run(extract())

Docker Usage

# Run in daemon mode
docker run -d -p 3000:3000 --name article-extractor \
    --restart unless-stopped \
    ghcr.io/pankaj28843/article-extractor:latest

# Check logs
docker logs article-extractor

# Stop/start/restart
docker stop article-extractor
docker start article-extractor
docker restart article-extractor

# CLI mode (one-off extraction)
docker run --rm ghcr.io/pankaj28843/article-extractor:latest \
    article-extractor https://en.wikipedia.org/wiki/Wikipedia --output markdown

With docker-compose:

services:
  article-extractor:
    image: ghcr.io/pankaj28843/article-extractor:latest
    ports:
      - "3000:3000"
    restart: unless-stopped
    environment:
      - LOG_LEVEL=info

Test the server:

# Health check
curl http://localhost:3000/health

# Extract article
curl -X POST http://localhost:3000/ \
    -H "Content-Type: application/json" \
    -d '{"url": "https://en.wikipedia.org/wiki/Wikipedia"}' | jq '.title'

Supported platforms: linux/amd64, linux/arm64
Available tags: latest, 0, 0.2, 0.2.0

API Reference

HTTP Endpoints

  • POST / – Extract article (send {"url": "..."})
  • GET / – Service info
  • GET /health – Health check
  • GET /docs – Interactive API docs
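For scripted clients with no extra dependencies, the POST endpoint can be called from the standard library; a sketch (the endpoint URL and payload shape come from the examples above, the helper name is ours):

```python
import json
import urllib.request

def build_extract_request(url: str,
                          endpoint: str = "http://localhost:3000/") -> urllib.request.Request:
    """Build a POST request matching the server's {"url": ...} contract."""
    payload = json.dumps({"url": url}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_extract_request("https://en.wikipedia.org/wiki/Wikipedia")
# urllib.request.urlopen(req) returns the JSON document shown in Quick Start
```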

Python API

extract_article(html, url="", options=None) -> ArticleResult
extract_article_from_url(url, fetcher=None, options=None) -> ArticleResult

ArticleResult fields:

  • title – Extracted article title
  • content – Clean HTML content
  • markdown – Markdown version (GFM-compatible)
  • excerpt – First ~200 characters
  • word_count – Total words in article
  • success – Whether extraction succeeded
  • error – Error message if extraction failed
  • url – Original URL
  • author – Article author (if detected)
  • date_published – Publication date (if detected)
  • language – Content language (if detected)
  • warnings – List of extraction warnings

Options:

ExtractionOptions(
    min_word_count=150,
    min_char_threshold=500,
    include_images=True,
    include_code_blocks=True,
    safe_markdown=True
)

CLI

article-extractor https://en.wikipedia.org/wiki/Wikipedia  # Extract from URL
article-extractor --file article.html                      # From file
article-extractor --file article.html --output markdown    # Markdown output
article-extractor --server --port 3000                     # Start server

Use Cases

  • LLM/RAG pipelines – Extract clean article text for vector databases or prompts
  • Content archiving – Save web articles as Markdown for documentation
  • RSS/feed readers – Display clean article content without ads
  • Research tools – Batch extract articles from reading lists
  • Web scrapers – Get main content without parsing complex HTML
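For the LLM/RAG case, the extracted Markdown is usually chunked before embedding. A minimal word-window chunker, independent of this library's API (names and parameters are ours):

```python
def chunk_markdown(text: str, max_words: int = 200, overlap: int = 20) -> list[str]:
    """Split text into overlapping word windows suitable for embedding."""
    words = text.split()
    if not words:
        return []
    step = max(1, max_words - overlap)
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks

chunks = chunk_markdown("word " * 450, max_words=200, overlap=20)
# 450 words -> windows starting at word 0, 180, 360
```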

How It Works

  1. Parse HTML – Uses JustHTML's HTML5-compliant parser
  2. Clean document – Removes scripts, styles, navigation, footers
  3. Find candidates – Identifies potential content containers (<article>, <main>, high-scoring divs)
  4. Score candidates – Applies readability scoring (tag type, class/ID patterns, text density, link density)
  5. Extract winner – Selects highest-scoring element as main content
  6. Convert to Markdown – Transforms HTML to clean GFM-compatible Markdown

Algorithm based on Mozilla Readability.js with Python optimizations.
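The link-density signal from step 4 can be sketched in pure Python with the standard-library parser (a deliberate simplification; the real scorer combines this with tag type, class/ID patterns, and text density):

```python
from html.parser import HTMLParser

class LinkDensity(HTMLParser):
    """Measure the fraction of text characters that sit inside <a> tags.

    Navigation blocks score high (mostly links); article bodies score low.
    """
    def __init__(self):
        super().__init__()
        self.total = 0
        self.linked = 0
        self._in_link = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self._in_link -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.total += n
        if self._in_link:
            self.linked += n

def link_density(html: str) -> float:
    p = LinkDensity()
    p.feed(html)
    return p.linked / p.total if p.total else 0.0

nav = '<ul><li><a href="/">Home</a></li><li><a href="/about">About</a></li></ul>'
article = '<p>Wikipedia is a free content online encyclopedia with <a href="#">one link</a>.</p>'
# nav is almost entirely link text; the article paragraph is mostly prose
```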

Configuration

Environment variables:

HOST=0.0.0.0        # Server bind address
PORT=3000           # Server port
LOG_LEVEL=info      # Logging level (debug, info, warning, error)
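These can be read with standard-library defaults; a sketch of how an entry point might consume them (variable names and defaults are from the list above, the function is ours):

```python
import os

def server_config() -> dict:
    """Read server settings from the environment, using the documented defaults."""
    return {
        "host": os.environ.get("HOST", "0.0.0.0"),
        "port": int(os.environ.get("PORT", "3000")),
        "log_level": os.environ.get("LOG_LEVEL", "info"),
    }

cfg = server_config()
```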

Production deployment:

# With multiple workers
uvicorn article_extractor.server:app \
    --host 0.0.0.0 \
    --port 3000 \
    --workers 4 \
    --log-level info

# With Docker (daemon mode)
docker run -d \
    -p 3000:3000 \
    --name article-extractor \
    --restart unless-stopped \
    -e LOG_LEVEL=info \
    ghcr.io/pankaj28843/article-extractor:latest

FAQ

JavaScript-heavy sites? Install the playwright extra: pip install article-extractor[playwright]

Extraction fails? Check result.success / result.error. Common causes: login required, content too short, JavaScript rendering needed

Production-ready? Yes. Pin version: ghcr.io/pankaj28843/article-extractor:0

Rate limiting? Use reverse proxy (nginx, Caddy) or API gateway

Development

git clone https://github.com/pankaj28843/article-extractor.git
cd article-extractor
uv sync --all-extras
uv run pytest
uv run ruff format . && uv run ruff check --fix .

Run server: uv run uvicorn article_extractor.server:app --reload --port 3000

Structure:

src/article_extractor/
├── server.py    # FastAPI HTTP server
├── cli.py       # CLI interface  
├── extractor.py # Extraction logic
├── scorer.py    # Readability scoring
└── fetcher.py   # URL fetching

Troubleshooting

Port in use: find the conflicting process with lsof -i :3000, or start on another port: uvicorn article_extractor.server:app --port 8000

Empty extraction: check result.success and result.error; the page may need Playwright rendering, or try lowering min_word_count

Playwright errors: playwright install chromium

License

MIT – see LICENSE

Acknowledgments


Built with ❤️ using pure Python. No Node.js required.
