Pure-Python article extraction library using Readability-style scoring

Maintainers

psjinx

Article Extractor

Pure-Python article extraction library that extracts the main content from HTML documents and converts it to Markdown.

Uses JustHTML for HTML parsing and implements Readability.js-style scoring for content detection.

Features

- Pure Python: No external services or JavaScript runtime required
- Readability-style scoring: Identifies main content using proven algorithms from Mozilla's Readability.js
- Markdown output: Converts extracted content to clean GitHub-Flavored Markdown
- Fast: Caches text calculations and uses early-termination optimizations
- Safe by default: XSS-safe HTML and Markdown output via JustHTML sanitization

Installation

# From PyPI
pip install article-extractor

# From GitHub
pip install git+https://github.com/pankaj28843/article-extractor.git

# With uv
uv add git+https://github.com/pankaj28843/article-extractor.git

Quick Start

from article_extractor import extract_article

html = """
<html>
<body>
    <nav><a href="/">Home</a></nav>
    <article>
        <h1>My Article</h1>
        <p>This is the main content of the article...</p>
    </article>
    <footer>Copyright 2025</footer>
</body>
</html>
"""

result = extract_article(html, url="https://example.com/article")

print(result.title)      # "My Article"
print(result.markdown)   # "# My Article\n\nThis is the main content..."
print(result.word_count) # 8
print(result.success)    # True

API Reference

`extract_article(html, url="", options=None) -> ArticleResult`

Extract main article content from HTML.

Parameters:

html (str | bytes): HTML content to extract from
url (str): Original URL (used for title fallback)
options (ExtractionOptions | None): Extraction configuration

Returns: ArticleResult with extracted content

`extract_article_from_url(url, fetcher=None, options=None, *, prefer_playwright=True) -> ArticleResult`

Async function that fetches a URL and extracts the article content. If no fetcher is provided, the best available fetcher is selected automatically.

Parameters:

url (str): URL to fetch
fetcher (Fetcher | None): Object implementing the Fetcher protocol (optional; one is created automatically if not provided)
options (ExtractionOptions | None): Extraction configuration
prefer_playwright (bool): If auto-creating fetcher, prefer Playwright (default: True)
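
Because `extract_article_from_url` is async, several URLs can be processed concurrently. The sketch below shows one possible pattern and is not part of the library: `extract_many`, its `limit` parameter, and `fake_extract` are hypothetical names, and the extractor is passed in as an argument so the helper can be exercised with a stand-in coroutine.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def extract_many(
    urls: list[str],
    extract: Callable[[str], Awaitable[Any]],
    limit: int = 3,
) -> list[Any]:
    """Run `extract(url)` for every URL, at most `limit` at a time.

    Results are returned in the same order as `urls`.
    """
    semaphore = asyncio.Semaphore(limit)

    async def extract_one(url: str) -> Any:
        # The semaphore caps how many extractions are in flight at once.
        async with semaphore:
            return await extract(url)

    return await asyncio.gather(*(extract_one(url) for url in urls))


async def fake_extract(url: str) -> str:
    """Stand-in for extract_article_from_url, for demonstration only."""
    await asyncio.sleep(0)
    return f"extracted:{url}"


demo_results = asyncio.run(
    extract_many(["https://example.com/a", "https://example.com/b"], fake_extract)
)
```

With the library installed, passing `extract_article_from_url` in place of `fake_extract` should work the same way, since the documented signature requires only `url`.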

`ArticleResult`

from dataclasses import dataclass, field

@dataclass
class ArticleResult:
    url: str                       # Original URL
    title: str                     # Extracted title
    content: str                   # Cleaned HTML content
    markdown: str                  # Markdown conversion
    excerpt: str                   # Short excerpt (first ~200 chars)
    word_count: int                # Word count of extracted content
    success: bool                  # Whether extraction succeeded
    error: str | None = None       # Error message if failed
    author: str | None = None      # Extracted author (if found)
    date_published: str | None = None  # Publication date (if found)
    language: str | None = None    # Document language (if detected)
    # A mutable default must go through default_factory in a dataclass
    warnings: list[str] = field(default_factory=list)  # Non-fatal warnings
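
A caller would typically check `success` before using the extracted content. The sketch below relies only on the field names documented above; `describe_result` is a hypothetical helper, and `SimpleNamespace` objects stand in for real `ArticleResult` instances so the snippet runs without the library installed.

```python
from types import SimpleNamespace


def describe_result(result) -> str:
    """Return a one-line summary of an ArticleResult-like object."""
    if not result.success:
        # `error` is optional, so fall back to a generic message.
        return f"FAILED {result.url}: {result.error or 'unknown error'}"
    summary = f"OK {result.title!r} ({result.word_count} words)"
    if result.warnings:
        summary += f", {len(result.warnings)} warning(s)"
    return summary


# Stand-ins that carry only the documented fields used above.
ok = SimpleNamespace(url="https://example.com/a", title="My Article",
                     word_count=8, success=True, error=None, warnings=[])
failed = SimpleNamespace(url="https://example.com/b", title="",
                         word_count=0, success=False, error=None, warnings=[])

print(describe_result(ok))
print(describe_result(failed))
```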

`ExtractionOptions`

from dataclasses import dataclass

@dataclass
class ExtractionOptions:
    min_word_count: int = 150       # Minimum words for valid content
    min_char_threshold: int = 500   # Minimum chars for candidate consideration
    include_images: bool = True     # Include images in output
    include_code_blocks: bool = True # Include code blocks in output
    safe_markdown: bool = True      # XSS-safe output (recommended)
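
To make the two thresholds concrete, here is a rough sketch of the kind of checks they imply. It is an assumption about how such limits are commonly applied, not the library's code: both function names are hypothetical, and words are counted by splitting on whitespace.

```python
def long_enough_for_candidate(text: str, min_char_threshold: int = 500) -> bool:
    """True when an element holds enough characters to be considered."""
    return len(text.strip()) >= min_char_threshold


def long_enough_for_article(text: str, min_word_count: int = 150) -> bool:
    """True when extracted text has enough words to count as an article."""
    return len(text.split()) >= min_word_count


sample = "word " * 120  # 120 words, 599 characters after stripping

print(long_enough_for_candidate(sample))
print(long_enough_for_article(sample))
print(long_enough_for_article(sample, min_word_count=100))
```

Under these assumptions, lowering `min_word_count` would let shorter pages count as valid articles, while `min_char_threshold` only affects which elements are considered in the first place.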

How It Works

1. Parse HTML using JustHTML's HTML5-compliant parser
2. Clean document by removing scripts, styles, nav, footer, etc.
3. Find candidates - Look for <article>, <main>, or high-scoring <div>/<section> elements
4. Score candidates using Readability.js-style algorithm:
   - Tag-based scores (article: +5, div: +5, h1-h6: -5)
   - Class/ID pattern matching (+25 for "article", "content"; -25 for "sidebar", "footer")
   - Paragraph content scoring (+1 per comma, +1 per 100 chars)
   - Link density penalty (high link ratio = navigation)
5. Extract content from top-scoring candidate
6. Convert to Markdown using JustHTML's GFM-compatible converter
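
The scoring step can be illustrated with a small, self-contained sketch. It uses the weights listed above but is a simplification, not the library's implementation: the function names, the regular expressions, and the way link density scales the final score (multiplying by `1 - link_density`, as Readability.js does) are assumptions made for illustration.

```python
import re

TAG_SCORES = {"article": 5, "div": 5,
              "h1": -5, "h2": -5, "h3": -5, "h4": -5, "h5": -5, "h6": -5}
POSITIVE = re.compile(r"article|content", re.IGNORECASE)
NEGATIVE = re.compile(r"sidebar|footer", re.IGNORECASE)


def base_score(tag: str, class_and_id: str = "") -> int:
    """Tag-based score plus class/ID pattern adjustments."""
    score = TAG_SCORES.get(tag.lower(), 0)
    if POSITIVE.search(class_and_id):
        score += 25
    if NEGATIVE.search(class_and_id):
        score -= 25
    return score


def paragraph_score(text: str) -> int:
    """+1 per comma and +1 per full 100 characters of text."""
    return text.count(",") + len(text) // 100


def link_density(text_length: int, link_text_length: int) -> float:
    """Fraction of an element's text that sits inside links."""
    return link_text_length / text_length if text_length else 0.0


def candidate_score(tag, class_and_id, paragraphs, link_text_length=0) -> float:
    """Combine the pieces; link-heavy elements are scaled down."""
    total_text = sum(len(p) for p in paragraphs)
    raw = base_score(tag, class_and_id) + sum(paragraph_score(p) for p in paragraphs)
    return raw * (1 - link_density(total_text, link_text_length))


# A content-like div with comma-rich prose versus a div made entirely of links.
body = candidate_score("div", "post-content", ["One, two, three. " * 10])
nav = candidate_score("div", "menu", ["Home About Contact"], link_text_length=18)
print(body, nav)
```

In this sketch the prose-heavy element ends up with a much higher score than the link-only element, which is the behavior the link density penalty is meant to produce.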

Development

# Clone and install
git clone https://github.com/pankaj28843/article-extractor.git
cd article-extractor
uv sync --all-extras

# Run tests
uv run pytest

# Format and lint
uv run ruff format .
uv run ruff check --fix .

# Run with coverage
uv run pytest --cov

License

MIT License - see LICENSE for details.

Credits

- JustHTML: Pure Python HTML5 parser
- Mozilla Readability.js: Scoring algorithm inspiration
- Mercury Parser: Additional scoring patterns

Release history

Version	Release date
0.5.9	Apr 19, 2026
0.5.8	Mar 25, 2026
0.5.7	Mar 25, 2026
0.5.6	Mar 5, 2026
0.5.5	Jan 20, 2026
0.5.4	Jan 19, 2026
0.5.3	Jan 9, 2026
0.5.2	Jan 8, 2026
0.5.1	Jan 7, 2026
0.5.0	Jan 7, 2026
0.4.2	Jan 6, 2026
0.4.1	Jan 4, 2026
0.4.0	Jan 3, 2026
0.3.2	Jan 2, 2026
0.3.1	Jan 2, 2026
0.3.0	Jan 2, 2026
0.2.1	Jan 1, 2026
0.2.0	Jan 1, 2026
0.1.2	Jan 1, 2026
0.1.1	Dec 29, 2025
0.1.0	Dec 29, 2025 (this version)

Download files

Source Distribution

article_extractor-0.1.0.tar.gz (56.7 kB)

Uploaded Dec 29, 2025 Source

Built Distribution

article_extractor-0.1.0-py3-none-any.whl (19.1 kB)

Uploaded Dec 29, 2025 Python 3

File details

Details for the file article_extractor-0.1.0.tar.gz.

File metadata

Download URL: article_extractor-0.1.0.tar.gz
Upload date: Dec 29, 2025
Size: 56.7 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: uv/0.9.18 {"installer":{"name":"uv","version":"0.9.18","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for article_extractor-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`52e37e0a305e93c3c9e0d56840df9859f7e4c4f0337501616350d658c916319e`
MD5	`e478b83cfcfcb481b9f626b4f57991da`
BLAKE2b-256	`22a55a801addc87743f3d64dc846aeeaf04fb106bf88aec8f2d2a0ed21245b8d`
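
A downloaded file can be checked against the SHA256 digest above using Python's standard library. This is a generic sketch: `sha256_of` and `matches_expected` are hypothetical helper names, and the path in the usage comment assumes the archive was saved to the current directory.

```python
import hashlib
from pathlib import Path

# SHA256 digest published above for article_extractor-0.1.0.tar.gz
EXPECTED_SHA256 = "52e37e0a305e93c3c9e0d56840df9859f7e4c4f0337501616350d658c916319e"


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: Path, expected: str = EXPECTED_SHA256) -> bool:
    """True when the file's digest equals the published one."""
    return sha256_of(path) == expected.lower()


# Usage, assuming the sdist is in the current directory:
# matches_expected(Path("article_extractor-0.1.0.tar.gz"))
```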

File details

Details for the file article_extractor-0.1.0-py3-none-any.whl.

File metadata

Download URL: article_extractor-0.1.0-py3-none-any.whl
Upload date: Dec 29, 2025
Size: 19.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: uv/0.9.18 {"installer":{"name":"uv","version":"0.9.18","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for article_extractor-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`80cdbb0af08bda6a240abba9f7f87ef7285b920fe5775ce95567e2aa855c34a9`
MD5	`9331dc5eb2ca4bcb96bb5f7e5e14ce29`
BLAKE2b-256	`10e8d00839d1bea84ebab31d121e67b11e241f45eab08f52a4f62baaea0e6e66`
