Pure-Python article extraction library using Readability-style scoring
Article Extractor
Extract the content you care about from any web page—no JavaScript runtime, no external services, just Python.
Article Extractor pulls the main content from HTML documents (articles, blog posts, documentation) and converts it to clean Markdown. If you've ever wanted the "Reader Mode" experience in your Python code, this is it.
Why Article Extractor?
- Pure Python – No Node.js, no Selenium, no external APIs. Install it, import it, use it.
- Battle-tested algorithms – Uses scoring techniques from Mozilla's Readability.js to identify what's actually content vs. navigation, ads, and sidebars.
- Markdown output – Get clean GitHub-Flavored Markdown, ready for LLMs, documentation, or archiving.
- Fast – Caches text calculations and uses early termination. Extracts most articles in milliseconds.
- Safe by default – XSS-safe output via JustHTML sanitization.
Who Is This For?
- LLM/AI developers building RAG pipelines or agents that need clean text from web pages
- Content archivists who want to save articles as Markdown
- Researchers scraping article text for analysis
- Anyone tired of wrangling BeautifulSoup to extract "just the article"
Installation
```shell
# From PyPI
pip install article-extractor

# With optional fetchers
pip install article-extractor[httpx]       # Lightweight HTTP fetcher
pip install article-extractor[playwright]  # JavaScript rendering support
pip install article-extractor[all]        # All optional dependencies

# With uv
uv add article-extractor
uv add article-extractor --extra all      # With all optional dependencies
```
Quick Start
Here's a complete example—paste this into a Python file and run it:
```python
from article_extractor import extract_article

html = """
<html>
  <body>
    <nav><a href="/">Home</a></nav>
    <article>
      <h1>My Article</h1>
      <p>This is the main content of the article...</p>
    </article>
    <footer>Copyright 2025</footer>
  </body>
</html>
"""

result = extract_article(html, url="https://example.com/article")
print(result.title)       # "My Article"
print(result.markdown)    # "# My Article\n\nThis is the main content..."
print(result.word_count)  # 8
print(result.success)     # True
```
That's it. The library automatically ignores the <nav> and <footer>, extracts the <article>, and gives you clean Markdown.
Fetching from URLs
Need to fetch a page first? Use the async extract_article_from_url:
```python
import asyncio

from article_extractor import extract_article_from_url

async def main():
    result = await extract_article_from_url("https://example.com/some-article")
    print(result.markdown)

asyncio.run(main())
```
By default, this uses Playwright if available (for JavaScript-heavy sites), falling back to httpx.
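Conceptually, that preference works like a try-Playwright-then-httpx fallback. The sketch below illustrates the idea with a hypothetical `pick_fetcher` helper (it is not part of the library's API; it only checks which optional dependency is importable):

```python
from importlib.util import find_spec

def pick_fetcher(prefer_playwright: bool = True) -> str:
    """Illustrative backend selection: Playwright if installed, else httpx."""
    if prefer_playwright and find_spec("playwright") is not None:
        return "playwright"
    if find_spec("httpx") is not None:
        return "httpx"
    raise RuntimeError(
        "No fetcher available: install article-extractor[httpx] or [playwright]"
    )
```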
How It Works
The extraction algorithm is inspired by Mozilla's Readability.js (the engine behind Firefox Reader View):
- Parse HTML – Uses JustHTML's HTML5-compliant parser
- Clean the document – Removes scripts, styles, nav, footer, and other non-content elements
- Find candidates – Looks for `<article>`, `<main>`, or high-scoring `<div>`/`<section>` elements
- Score candidates – Each element gets a score based on:
  - Tag type (`article`: +5, `div`: +5, `h1`-`h6`: -5)
  - Class/ID patterns (+25 for "article", "content"; -25 for "sidebar", "footer")
  - Paragraph content (+1 per comma, +1 per 100 characters)
  - Link density (high link ratio = probably navigation, penalized)
- Extract the winner – Takes content from the highest-scoring candidate
- Convert to Markdown – Uses JustHTML's GFM-compatible converter
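To make the class/ID and link-density heuristics above concrete, here is a simplified sketch in plain Python. The pattern lists and weights are illustrative examples in the Readability style, not the library's actual tables:

```python
import re

# Example hint patterns; the real library's lists are more extensive.
POSITIVE = re.compile(r"article|content|body|entry|main|post", re.I)
NEGATIVE = re.compile(r"sidebar|footer|nav|comment|promo", re.I)

def class_id_score(class_attr: str, id_attr: str) -> int:
    """Readability-style +/-25 bonus based on class/id hints."""
    score = 0
    for value in (class_attr, id_attr):
        if NEGATIVE.search(value):
            score -= 25
        elif POSITIVE.search(value):
            score += 25
    return score

def link_density(text: str, link_text: str) -> float:
    """Fraction of an element's characters that sit inside links."""
    return len(link_text) / len(text) if text else 0.0

# A candidate with a content-like class and few links keeps most of its score;
# a nav-like block is penalized twice (class hint + high link density).
base = class_id_score("article-body", "")            # +25
final = base * (1 - link_density("Read more here", "here"))
```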
API Reference
extract_article(html, url="", options=None) -> ArticleResult
Extract article content from an HTML string.
| Parameter | Type | Description |
|---|---|---|
| `html` | `str \| bytes` | HTML content to extract from |
| `url` | `str` | Original URL (used for resolving relative links and title fallback) |
| `options` | `ExtractionOptions \| None` | Extraction configuration (see below) |
Returns: ArticleResult with extracted content.
extract_article_from_url(url, fetcher=None, options=None, *, prefer_playwright=True) -> ArticleResult
Async function to fetch a URL and extract article content.
| Parameter | Type | Description |
|---|---|---|
| `url` | `str` | URL to fetch |
| `fetcher` | `Fetcher \| None` | Custom fetcher (auto-creates if not provided) |
| `options` | `ExtractionOptions \| None` | Extraction configuration |
| `prefer_playwright` | `bool` | Prefer Playwright over httpx when auto-creating (default: `True`) |
ArticleResult
The result object returned by extraction functions:
```python
from dataclasses import dataclass, field

@dataclass
class ArticleResult:
    url: str                           # Original URL
    title: str                         # Extracted title
    content: str                       # Cleaned HTML content
    markdown: str                      # Markdown conversion
    excerpt: str                       # Short excerpt (~200 chars)
    word_count: int                    # Word count of extracted content
    success: bool                      # Whether extraction succeeded
    error: str | None = None           # Error message if failed
    author: str | None = None          # Author (if found)
    date_published: str | None = None  # Publication date (if found)
    language: str | None = None        # Document language (if detected)
    warnings: list[str] = field(default_factory=list)  # Non-fatal warnings
```
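The `excerpt` field is a roughly 200-character preview of the content. Conceptually it can be produced by a word-boundary truncation like the sketch below (illustrative only, not the library's exact logic):

```python
def make_excerpt(text: str, limit: int = 200) -> str:
    """Truncate text near `limit` characters without splitting a word."""
    if len(text) <= limit:
        return text
    # Cut at the last space inside the limit, then mark the truncation.
    cut = text[:limit].rsplit(" ", 1)[0]
    return cut + "…"
```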
ExtractionOptions
Configure extraction behavior:
```python
from dataclasses import dataclass

@dataclass
class ExtractionOptions:
    min_word_count: int = 150         # Minimum words for valid content
    min_char_threshold: int = 500     # Minimum chars for candidate consideration
    include_images: bool = True       # Include images in output
    include_code_blocks: bool = True  # Include code blocks
    safe_markdown: bool = True        # XSS-safe output (recommended)
```
FAQ
Q: Why not just use BeautifulSoup?
BeautifulSoup parses HTML, but doesn't know what's "content" vs. "navigation." You'd need to write heuristics yourself. Article Extractor has those heuristics built in.
Q: Does this work on JavaScript-heavy sites?
If you install the playwright extra, yes. The async fetcher will render the page with a real browser before extraction.
Q: What if extraction fails?
Check result.success and result.error. Common issues: page is behind a login, content is too short (below min_word_count), or the site uses an unusual structure.
Q: Can I customize what gets extracted?
Use ExtractionOptions to tune thresholds. For more control, you can also pre-process the HTML before passing it to extract_article.
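As an example of pre-processing, you can strip elements you know are boilerplate before handing the HTML to `extract_article`. The regex-based helper below is a rough sketch using only the standard library (a real pipeline might prefer an HTML parser); `strip_elements` is a hypothetical name, not a library function:

```python
import re

def strip_elements(html: str, tags: tuple[str, ...] = ("aside", "form")) -> str:
    """Remove whole <tag>...</tag> blocks before extraction (rough sketch)."""
    for tag in tags:
        html = re.sub(rf"<{tag}\b.*?</{tag}>", "", html, flags=re.S | re.I)
    return html

cleaned = strip_elements("<div><aside>Ads</aside><p>Body</p></div>")
# cleaned == "<div><p>Body</p></div>"
```

The cleaned string can then be passed to `extract_article` as usual.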
Development
Contributions are welcome! Here's how to get set up:
```shell
# Clone and install
git clone https://github.com/pankaj28843/article-extractor.git
cd article-extractor
uv sync --all-extras

# Run tests
uv run pytest

# Format and lint
uv run ruff format .
uv run ruff check --fix .

# Run with coverage
uv run pytest --cov
```
Project Structure
```
src/article_extractor/
├── __init__.py    # Public API (extract_article, extract_article_from_url)
├── extractor.py   # Main extraction logic
├── scorer.py      # Readability-style scoring algorithm
├── fetcher.py     # URL fetching (httpx, Playwright)
├── cache.py       # Text calculation caching
├── constants.py   # Scoring weights, tag lists
├── types.py       # Data classes (ArticleResult, ExtractionOptions)
└── utils.py       # Helper functions
```
License
MIT License – see LICENSE for details.
Acknowledgments
- JustHTML – Pure Python HTML5 parser and Markdown converter
- Mozilla Readability.js – Scoring algorithm inspiration
- Postlight Parser – Additional scoring patterns