Pure-Python article extraction library using Readability-style scoring
Article Extractor
Pure-Python article extraction library that extracts main content from HTML documents and converts to Markdown.
Uses JustHTML for HTML parsing and implements Readability.js-style scoring for content detection.
Features
- Pure Python - No external services or JavaScript runtime required
- Readability-style scoring - Identifies main content using proven algorithms from Mozilla's Readability.js
- Markdown output - Converts extracted content to clean GitHub-Flavored Markdown
- Fast - Caches text calculations, uses early termination optimizations
- Safe by default - XSS-safe HTML and Markdown output via JustHTML sanitization
Installation
# From PyPI
pip install article-extractor
# With optional fetchers
pip install article-extractor[httpx] # Lightweight HTTP fetcher
pip install article-extractor[playwright] # JavaScript rendering support
pip install article-extractor[all] # All optional dependencies
# With uv
uv add article-extractor
uv add article-extractor --extra all # With all optional dependencies
Quick Start
from article_extractor import extract_article
html = """
<html>
<body>
<nav><a href="/">Home</a></nav>
<article>
<h1>My Article</h1>
<p>This is the main content of the article...</p>
</article>
<footer>Copyright 2025</footer>
</body>
</html>
"""
result = extract_article(html, url="https://example.com/article")
print(result.title) # "My Article"
print(result.markdown) # "# My Article\n\nThis is the main content..."
print(result.word_count) # 8
print(result.success) # True
API Reference
extract_article(html, url="", options=None) -> ArticleResult
Extract main article content from HTML.
Parameters:
- html (str | bytes): HTML content to extract from
- url (str): Original URL (used for title fallback)
- options (ExtractionOptions | None): Extraction configuration
Returns: ArticleResult with extracted content
extract_article_from_url(url, fetcher=None, options=None, *, prefer_playwright=True) -> ArticleResult
Async function to fetch URL and extract article content. If no fetcher is provided, auto-selects the best available fetcher.
Parameters:
- url (str): URL to fetch
- fetcher (Fetcher | None): Object implementing the Fetcher protocol (optional; auto-created if not provided)
- options (ExtractionOptions | None): Extraction configuration
- prefer_playwright (bool): If auto-creating the fetcher, prefer Playwright (default: True)
ArticleResult
from dataclasses import dataclass, field

@dataclass
class ArticleResult:
    url: str                            # Original URL
    title: str                          # Extracted title
    content: str                        # Cleaned HTML content
    markdown: str                       # Markdown conversion
    excerpt: str                        # Short excerpt (first ~200 chars)
    word_count: int                     # Word count of extracted content
    success: bool                       # Whether extraction succeeded
    error: str | None = None            # Error message if failed
    author: str | None = None           # Extracted author (if found)
    date_published: str | None = None   # Publication date (if found)
    language: str | None = None         # Document language (if detected)
    # A bare [] default is rejected by dataclasses; use a factory instead
    warnings: list[str] = field(default_factory=list)  # Non-fatal warnings
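As a rough illustration of how a caller might consume these fields, the sketch below checks `success` before touching the content and reports `error` and `warnings` otherwise. `ResultSketch` and `summarize` are hypothetical names introduced here so the example runs without the library installed; `ResultSketch` mirrors only a few of the fields listed above and is not the library's class.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ResultSketch:
    """Hypothetical stand-in mirroring a few ArticleResult fields."""

    title: str
    markdown: str
    success: bool
    error: str | None = None
    warnings: list[str] = field(default_factory=list)


def summarize(result: ResultSketch) -> str:
    """Return a one-line status, checking success before using content."""
    if not result.success:
        return f"failed: {result.error or 'unknown error'}"
    suffix = f" ({len(result.warnings)} warning(s))" if result.warnings else ""
    return f"ok: {result.title}{suffix}"


ok = ResultSketch(title="My Article", markdown="# My Article", success=True)
bad = ResultSketch(title="", markdown="", success=False, error="too short")
print(summarize(ok))
print(summarize(bad))
```

The same branching should apply to a real `ArticleResult`, since it exposes `success`, `error`, and `warnings` under the same names.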
ExtractionOptions
@dataclass
class ExtractionOptions:
    min_word_count: int = 150         # Minimum words for valid content
    min_char_threshold: int = 500     # Minimum chars for candidate consideration
    include_images: bool = True       # Include images in output
    include_code_blocks: bool = True  # Include code blocks in output
    safe_markdown: bool = True        # XSS-safe output (recommended)
How It Works
- Parse HTML using JustHTML's HTML5-compliant parser
- Clean document by removing scripts, styles, nav, footer, etc.
- Find candidates - Look for <article>, <main>, or high-scoring <div>/<section> elements
- Score candidates using Readability.js-style algorithm:
  - Tag-based scores (article: +5, div: +5, h1-h6: -5)
  - Class/ID pattern matching (+25 for "article", "content"; -25 for "sidebar", "footer")
  - Paragraph content scoring (+1 per comma, +1 per 100 chars)
  - Link density penalty (high link ratio = navigation)
- Extract content from top-scoring candidate
- Convert to Markdown using JustHTML's GFM-compatible converter
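The scoring step above can be sketched in a few lines of standard-library Python. This is a simplified illustration, not the library's implementation: the function names are hypothetical, the pattern lists contain only the four keywords named above, and scaling the score by `(1 - link density)` is an assumption borrowed from how Readability.js applies its link penalty.

```python
import re

# Only these four keywords are named in the description; real pattern
# lists are likely longer (assumption).
POSITIVE = re.compile(r"article|content", re.IGNORECASE)
NEGATIVE = re.compile(r"sidebar|footer", re.IGNORECASE)


def class_id_weight(class_and_id: str) -> int:
    """Return the +25 / -25 adjustment for class or id hints."""
    weight = 0
    if POSITIVE.search(class_and_id):
        weight += 25
    if NEGATIVE.search(class_and_id):
        weight -= 25
    return weight


def paragraph_score(text: str) -> int:
    """Score +1 per comma and +1 per full 100 characters of text."""
    return text.count(",") + len(text) // 100


def link_density(text_length: int, link_text_length: int) -> float:
    """Return the fraction of a node's text that sits inside links."""
    if text_length == 0:
        return 0.0
    return link_text_length / text_length


def candidate_score(
    class_and_id: str, paragraphs: list[str], link_text_length: int
) -> float:
    """Combine the signals, then scale down by link density (assumption)."""
    base = class_id_weight(class_and_id)
    base += sum(paragraph_score(p) for p in paragraphs)
    total_text = sum(len(p) for p in paragraphs)
    return base * (1 - link_density(total_text, link_text_length))


# A content-like block with no links outranks a link-heavy sidebar.
body = candidate_score("post-content", ["One, two, three, and four."], 0)
nav = candidate_score("sidebar", ["Home, About, Contact"], 20)
print(body > nav)
```

Tag-based scores are left out to keep the sketch short; they would be added to `base` in the same way as the class/ID weight.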
Development
# Clone and install
git clone https://github.com/pankaj28843/article-extractor.git
cd article-extractor
uv sync --all-extras
# Run tests
uv run pytest
# Format and lint
uv run ruff format .
uv run ruff check --fix .
# Run with coverage
uv run pytest --cov
License
MIT License - see LICENSE for details.
Credits
- JustHTML - Pure Python HTML5 parser
- Mozilla Readability.js - Scoring algorithm inspiration
- Postlight Parser - Additional scoring patterns