Pure-Python article extraction library using Readability-style scoring

Article Extractor

Pure-Python article extraction library that extracts the main content from HTML documents and converts it to Markdown.

Uses JustHTML for HTML parsing and implements Readability.js-style scoring for content detection.

Features

  • Pure Python - No external services or JavaScript runtime required
  • Readability-style scoring - Identifies main content using proven algorithms from Mozilla's Readability.js
  • Markdown output - Converts extracted content to clean GitHub-Flavored Markdown
  • Fast - Caches text calculations and stops early once a clear candidate is found
  • Safe by default - XSS-safe HTML and Markdown output via JustHTML sanitization

Installation

# From PyPI
pip install article-extractor

# From GitHub
pip install git+https://github.com/pankaj28843/article-extractor.git

# With uv
uv add git+https://github.com/pankaj28843/article-extractor.git

Quick Start

from article_extractor import extract_article

html = """
<html>
<body>
    <nav><a href="/">Home</a></nav>
    <article>
        <h1>My Article</h1>
        <p>This is the main content of the article...</p>
    </article>
    <footer>Copyright 2025</footer>
</body>
</html>
"""

result = extract_article(html, url="https://example.com/article")

print(result.title)      # "My Article"
print(result.markdown)   # "# My Article\n\nThis is the main content..."
print(result.word_count) # 8
print(result.success)    # True

API Reference

extract_article(html, url="", options=None) -> ArticleResult

Extract main article content from HTML.

Parameters:

  • html (str | bytes): HTML content to extract from
  • url (str): Original URL (used for title fallback)
  • options (ExtractionOptions | None): Extraction configuration

Returns: ArticleResult with extracted content

extract_article_from_url(url, fetcher=None, options=None, *, prefer_playwright=True) -> ArticleResult

Async function that fetches a URL and extracts its article content. If no fetcher is provided, the best available fetcher is selected automatically.

Parameters:

  • url (str): URL to fetch
  • fetcher (Fetcher | None): Object implementing the Fetcher protocol (optional - auto-creates if not provided)
  • options (ExtractionOptions | None): Extraction configuration
  • prefer_playwright (bool): If auto-creating fetcher, prefer Playwright (default: True)
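
The fetcher argument lends itself to a small illustration. The sketch below is self-contained and does not import the library: the `Fetcher` protocol shown here, its `fetch` method, the `StaticFetcher` class, and the `fetch_html` helper are all hypothetical, because the real protocol's method names are not given in this document. It only shows the general pattern of passing an object that satisfies a protocol into an async function, using canned HTML so that no network access is needed.

```python
import asyncio
from typing import Protocol


class Fetcher(Protocol):
    """Hypothetical protocol shape; the library's real Fetcher may differ."""

    async def fetch(self, url: str) -> str: ...


class StaticFetcher:
    """Serves canned HTML from a dict, which is handy in tests."""

    def __init__(self, pages: dict[str, str]) -> None:
        self._pages = pages

    async def fetch(self, url: str) -> str:
        try:
            return self._pages[url]
        except KeyError:
            raise LookupError(f"no canned page for {url}") from None


async def fetch_html(url: str, fetcher: Fetcher) -> str:
    """Hypothetical helper: await whatever object satisfies the protocol."""
    return await fetcher.fetch(url)


fetcher = StaticFetcher({"https://example.com/article": "<h1>Hello</h1>"})
html = asyncio.run(fetch_html("https://example.com/article", fetcher))
print(html)
```

Because the protocol is structural, `StaticFetcher` does not need to inherit from `Fetcher`; any object with a matching `fetch` coroutine would do.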

ArticleResult

from dataclasses import dataclass, field

@dataclass
class ArticleResult:
    url: str                       # Original URL
    title: str                     # Extracted title
    content: str                   # Cleaned HTML content
    markdown: str                  # Markdown conversion
    excerpt: str                   # Short excerpt (first ~200 chars)
    word_count: int                # Word count of extracted content
    success: bool                  # Whether extraction succeeded
    error: str | None = None       # Error message if failed
    author: str | None = None      # Extracted author (if found)
    date_published: str | None = None  # Publication date (if found)
    language: str | None = None    # Document language (if detected)
    warnings: list[str] = field(default_factory=list)  # Non-fatal warnings

ExtractionOptions

@dataclass
class ExtractionOptions:
    min_word_count: int = 150       # Minimum words for valid content
    min_char_threshold: int = 500   # Minimum chars for candidate consideration
    include_images: bool = True     # Include images in output
    include_code_blocks: bool = True # Include code blocks in output
    safe_markdown: bool = True      # XSS-safe output (recommended)
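
To make the thresholds concrete, here is a hedged sketch of how `min_word_count` could gate a result. It uses a local stand-in with the defaults listed above. The `has_enough_words` helper and its whitespace-based word count are assumptions for illustration; the library may count words differently.

```python
from dataclasses import dataclass, replace


@dataclass
class ExtractionOptions:
    """Local stand-in mirroring the documented defaults (not the library class)."""

    min_word_count: int = 150
    min_char_threshold: int = 500
    include_images: bool = True
    include_code_blocks: bool = True
    safe_markdown: bool = True


def has_enough_words(text: str, options: ExtractionOptions) -> bool:
    """Hypothetical check: whitespace-separated words against the minimum."""
    return len(text.split()) >= options.min_word_count


defaults = ExtractionOptions()
# dataclasses.replace copies the options and overrides only the named fields.
lenient = replace(defaults, min_word_count=5, include_images=False)

text = "This is the main content of the article"
print(has_enough_words(text, defaults))
print(has_enough_words(text, lenient))
```

Deriving `lenient` with `replace` leaves `defaults` untouched, which keeps a shared default configuration safe to reuse.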

How It Works

  1. Parse HTML using JustHTML's HTML5-compliant parser
  2. Clean document by removing non-content elements such as scripts, styles, nav, and footer
  3. Find candidates - Look for <article>, <main>, or high-scoring <div>/<section> elements
  4. Score candidates using Readability.js-style algorithm:
    • Tag-based scores (article: +5, div: +5, h1-h6: -5)
    • Class/ID pattern matching (+25 for "article", "content"; -25 for "sidebar", "footer")
    • Paragraph content scoring (+1 per comma, +1 per 100 chars)
    • Link density penalty (high link ratio = navigation)
  5. Extract content from top-scoring candidate
  6. Convert to Markdown using JustHTML's GFM-compatible converter
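
The scoring in step 4 can be sketched in a few functions. This is an illustration, not the library's code. The weights come from the list above, while the exact regex patterns, the function names, and the final `(1 - link density)` scaling are assumptions. The scaling follows the approach Readability.js takes and may not match this library.

```python
import re

# Tag weights as listed above; unlisted tags score 0 in this sketch.
TAG_SCORES = {"article": 5, "div": 5, **{f"h{n}": -5 for n in range(1, 7)}}
# Illustrative patterns only; the real pattern lists are likely longer.
POSITIVE = re.compile(r"article|content", re.IGNORECASE)
NEGATIVE = re.compile(r"sidebar|footer", re.IGNORECASE)


def tag_score(tag: str) -> int:
    return TAG_SCORES.get(tag.lower(), 0)


def class_id_score(class_and_id: str) -> int:
    """+25 for a positive hint, -25 for a negative one (they can cancel out)."""
    score = 0
    if POSITIVE.search(class_and_id):
        score += 25
    if NEGATIVE.search(class_and_id):
        score -= 25
    return score


def paragraph_score(text: str) -> int:
    """+1 per comma and +1 per full 100 characters."""
    return text.count(",") + len(text) // 100


def link_density(text_length: int, link_text_length: int) -> float:
    """Share of the text that sits inside links; 0.0 for empty text."""
    return link_text_length / text_length if text_length else 0.0


def candidate_score(tag: str, class_and_id: str,
                    paragraphs: list[str], link_text_length: int) -> float:
    text_length = sum(len(p) for p in paragraphs)
    raw = (tag_score(tag) + class_id_score(class_and_id)
           + sum(paragraph_score(p) for p in paragraphs))
    # Assumed penalty: scale the raw score down as link density rises.
    return raw * (1.0 - link_density(text_length, link_text_length))


body = "One, two, three. " + "x" * 200
print(candidate_score("article", "main-content", [body], 0))
print(candidate_score("div", "", ["a" * 100], 50))
```

In this sketch a link-heavy `<div>` loses half of its score at a link density of 0.5, which is how navigation blocks would fall behind real article bodies.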

Development

# Clone and install
git clone https://github.com/pankaj28843/article-extractor.git
cd article-extractor
uv sync --all-extras

# Run tests
uv run pytest

# Format and lint
uv run ruff format .
uv run ruff check --fix .

# Run with coverage
uv run pytest --cov

License

MIT License - see LICENSE for details.

Credits
