Python port of the go-readability library for extracting the main content from web pages

Readability Python

A high-fidelity Python port of the go-readability library, which is itself a Go port of Mozilla's Readability library. It extracts the main content from HTML pages and removes navigation, ads, and other non-content elements, leaving the article itself easier to read and process.

Features

Extract the main article content from HTML pages
Extract metadata (title, author, publication date, etc.)
Convert relative URLs to absolute URLs
Generate both HTML and plain text versions of the content
Handle various edge cases (hidden elements, malformed HTML, etc.)
Pythonic API with explicit error handling
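
To illustrate what converting relative URLs to absolute URLs means in practice, here is a minimal sketch that uses only the standard library's urllib.parse.urljoin. It is not the library's own implementation; the helper name absolutize_links and the sample URLs are assumptions made for illustration.

```python
from urllib.parse import urljoin


def absolutize_links(base_url, links):
    """Resolve each link against base_url.

    Hypothetical helper for illustration; not part of readability-python.
    """
    return [urljoin(base_url, link) for link in links]


base = "https://example.com/blog/post.html"
resolved = absolutize_links(
    base,
    ["images/cover.png", "/about", "https://other.example/x"],
)
for url in resolved:
    print(url)
```

Path-relative links resolve against the directory of the page, root-relative links against the host, and links that are already absolute pass through unchanged.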

Installation

# Install from PyPI
pip install readability-python

# Install from source
git clone https://github.com/CyranoB/readability-python.git
cd readability-python
pip install -e .

# With Poetry
poetry add readability-python

Usage

Basic Usage

from readability import Readability

# html_content must already hold the page's HTML markup (loaded or fetched beforehand)
# Parse HTML content
parser = Readability()
article, error = parser.parse(html_content, url="https://example.com/article")

if error:
    print(f"Error: {error}")
else:
    # Access extracted content and metadata
    print(f"Title: {article.title}")
    print(f"Byline: {article.byline}")
    print(f"Content: {article.content}")  # HTML content
    print(f"Text Content: {article.text_content}")  # Plain text content
    print(f"Excerpt: {article.excerpt}")
    print(f"Site Name: {article.site_name}")
    print(f"Image: {article.image}")
    print(f"Favicon: {article.favicon}")
    print(f"Length: {article.length}")
    print(f"Published Time: {article.published_time}")
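
Because parse returns an (article, error) pair rather than raising, callers who prefer exceptions can wrap it. The sketch below assumes only the behavior shown in the example above; extract_text is a hypothetical helper, not part of the library, and it accepts any object with a compatible parse method.

```python
def extract_text(parser, html, url):
    """Return the article's plain text, or raise ValueError if parsing failed.

    `parser` is any object whose parse(html, url=...) returns an
    (article, error) pair, as Readability().parse does in the basic example.
    extract_text is a hypothetical helper, not part of readability-python.
    """
    article, error = parser.parse(html, url=url)
    if error:
        raise ValueError(f"Could not extract article from {url}: {error}")
    return article.text_content


# Usage, assuming readability-python is installed and html_content is loaded:
#   from readability import Readability
#   text = extract_text(Readability(), html_content, "https://example.com/article")
```

Passing the parser in as an argument keeps the helper easy to test with a stub and lets one Readability instance be reused across calls.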

CLI Usage

# Extract content from a URL
readability-python https://example.com/article --output article.html

# Extract content from a file
readability-python article.html --output extracted.html

# Output as JSON
readability-python https://example.com/article --format json --output article.json

# Output as plain text
readability-python https://example.com/article --format text --output article.txt

Note: When specifying output files, use either absolute paths or paths within a dedicated output directory (e.g., output/article.html) to avoid cluttering your project directory. Output files in the root directory (like extracted.html) are automatically added to .gitignore.
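
When the CLI is driven from a script, the output path can be built so that it always lands in a dedicated directory. The sketch below only assembles the argument list from the flags shown above (--format and --output); build_command, the extension mapping, and the default directory name are assumptions for illustration, and omitting --format is assumed to produce HTML as in the first CLI example.

```python
from pathlib import Path

# File extension used for each --format value shown in the CLI examples.
# None means "no --format flag", which is assumed to produce HTML.
EXTENSIONS = {None: ".html", "json": ".json", "text": ".txt"}


def build_command(source, name, fmt=None, output_dir="output"):
    """Build the argument list for one readability-python run.

    Hypothetical helper: it does not run the CLI or create the directory.
    """
    output_path = Path(output_dir) / f"{name}{EXTENSIONS[fmt]}"
    command = ["readability-python", source]
    if fmt is not None:
        command += ["--format", fmt]
    command += ["--output", output_path.as_posix()]
    return command


print(build_command("https://example.com/article", "article", fmt="json"))
```

The resulting list can be handed to subprocess.run(command, check=True) once the output directory exists.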

Testing

The library includes a comprehensive test suite to ensure compatibility with the original Go implementation. The tests are categorized by:

Functional Areas

HTML Parsing
Metadata Extraction
Content Identification
Content Cleaning
URL Handling
Visibility Detection
Text Normalization
Real-world Websites

Criticality Levels

P0 (Critical) - Core functionality that must work
P1 (High) - Important functionality with significant impact
P2 (Medium) - Functionality that should work but has workarounds
P3 (Low) - Nice-to-have functionality with minimal impact

Test Types

Basic - Tests for basic functionality
Feature - Tests for specific features
Edge Case - Tests for handling edge cases
Real-world - Tests using real-world websites

To run the tests:

# Run all tests
pytest

# Run tests by functional area
pytest -m "area_html_parsing"

# Run tests by criticality
pytest -m "criticality_p0"

# Run tests by type
pytest -m "type_real_world"
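
pytest's -m option also accepts boolean expressions built with and, or, and not, so the marker categories can be combined. The sketch below composes such an expression from the marker names used above; marker_expression and pytest_command are hypothetical helpers, and which tests a given combination selects depends on how the suite is tagged.

```python
def marker_expression(*markers, exclude=()):
    """Join marker names with 'and'; prefix each excluded marker with 'not'."""
    parts = list(markers) + [f"not {name}" for name in exclude]
    return " and ".join(parts)


def pytest_command(*markers, exclude=()):
    """Return the pytest argument list for the given marker selection."""
    return ["pytest", "-m", marker_expression(*markers, exclude=exclude)]


# Critical tests that use real-world pages
print(pytest_command("criticality_p0", "type_real_world"))

# Critical tests, skipping the real-world pages
print(pytest_command("criticality_p0", exclude=["type_real_world"]))
```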

Test Coverage

The test suite contains 37 test cases, distributed across functional areas and criticality levels as follows:

Functional Area	P0	P1	P2	P3	Total
HTML Parsing	0	0	2	0	2
Metadata Extraction	0	3	0	0	3
Content Identification	2	0	0	0	2
Content Cleaning	1	5	1	0	7
URL Handling	0	3	0	0	3
Visibility Detection	1	1	0	0	2
Text Normalization	0	1	3	0	4
Real-world Websites	4	1	2	7	14
Total	8	14	8	7	37

Test Type Distribution

Test Type	Count	Percentage
Basic	2	5.4%
Feature	14	37.8%
Edge Case	7	18.9%
Real-world	14	37.8%
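
The percentages in the table follow directly from the counts, each divided by the total of 37 tests and rounded to one decimal place. A short check, using only the numbers from the table:

```python
# Test counts per type, copied from the table above.
counts = {"Basic": 2, "Feature": 14, "Edge Case": 7, "Real-world": 14}

total = sum(counts.values())

# Share of each test type, formatted to one decimal place.
percentages = {
    name: f"{100 * count / total:.1f}%" for name, count in counts.items()
}

print(f"Total tests: {total}")
for name, share in percentages.items():
    print(f"{name}: {share}")
```

Because each value is rounded independently, the four percentages add up to 99.9% rather than exactly 100%.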

Comparison with Go Implementation

This library aims to be a high-fidelity port of the go-readability library, with the following considerations:

Maintains the same functionality and behavior
Uses Python best practices and idioms where appropriate
Adapts the API to be more Pythonic while maintaining the same core functionality
Uses BeautifulSoup for HTML parsing instead of Go's DOM implementation
Maps Go's DOM traversal methods to BeautifulSoup's methods

Development

Requirements

Python 3.6+
Poetry (optional, for dependency management)

Setup

# Clone the repository
git clone https://github.com/CyranoB/readability-python.git
cd readability-python

# Install dependencies with pip
pip install -e ".[dev]"

# Or with Poetry
poetry install

Development Workflow

# Run tests
pytest

# Format code
black readability tests

# Lint code
ruff check readability tests

# Type check
mypy readability

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Adding New Test Cases

Create a new directory in tests/test-pages/ with a descriptive name
Add the following files to the directory:
- source.html - The HTML to parse
- expected.html - The expected content
- expected-metadata.json - The expected metadata
Add the test case to tests/test_categories.py with appropriate categorization
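
The directory layout described in the steps above can be scaffolded with a few lines of Python. This is a sketch under stated assumptions: scaffold_test_case is a hypothetical helper, the files are created as empty placeholders to be filled in by hand, and the empty JSON object does not reflect the real metadata schema. Registering the case in tests/test_categories.py remains a manual step.

```python
import json
import tempfile
from pathlib import Path


def scaffold_test_case(root, name):
    """Create tests/test-pages/<name>/ under root with placeholder files.

    Hypothetical helper; fails if the directory already exists so that an
    existing test case is never overwritten.
    """
    case_dir = Path(root) / "tests" / "test-pages" / name
    case_dir.mkdir(parents=True, exist_ok=False)
    (case_dir / "source.html").write_text("", encoding="utf-8")
    (case_dir / "expected.html").write_text("", encoding="utf-8")
    # Placeholder only: the actual metadata fields are not specified here.
    (case_dir / "expected-metadata.json").write_text(
        json.dumps({}, indent=2) + "\n", encoding="utf-8"
    )
    return case_dir


# Demonstration in a throwaway directory instead of a real checkout.
with tempfile.TemporaryDirectory() as tmp:
    created = scaffold_test_case(tmp, "my-new-case")
    print(sorted(path.name for path in created.iterdir()))
```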

License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.

Release history

0.5.0 (May 2, 2025)
0.4.0 (May 2, 2025)
0.3.0 (May 1, 2025), the version described here
0.2.0 (Apr 30, 2025)
0.1.0 (Apr 30, 2025)

Download files

Source Distribution

readability_python-0.3.0.tar.gz (35.0 kB)

Uploaded May 1, 2025 Source

Built Distribution

readability_python-0.3.0-py3-none-any.whl (36.3 kB)

Uploaded May 1, 2025 Python 3

File details

Details for the file readability_python-0.3.0.tar.gz.

File metadata

Download URL: readability_python-0.3.0.tar.gz
Upload date: May 1, 2025
Size: 35.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.1.2 CPython/3.13.3 Darwin/24.4.0

File hashes

Hashes for readability_python-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`236cf713fadd6c2d1b93e88e0901c92b79e1fcbed1abe180499e300504714276`
MD5	`fdbe7e2637e7a26149205fce7d191404`
BLAKE2b-256	`92c0980a221fd8c6aabcb5f2056d9b86ffae2f9abd15a9d6861b96e23c0fde0f`

File details

Details for the file readability_python-0.3.0-py3-none-any.whl.

File metadata

Download URL: readability_python-0.3.0-py3-none-any.whl
Upload date: May 1, 2025
Size: 36.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.1.2 CPython/3.13.3 Darwin/24.4.0

File hashes

Hashes for readability_python-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2e41a73f45edf5e655edd551f897491ce6fdc934369a053239e2c123007697ed`
MD5	`64133345d65ae7db2471f90f8ebb4c3a`
BLAKE2b-256	`5dc0a01cd97a40be37ae4e03ced732050c2c1d3cbfd4f0313c81404b333c3519`
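
A downloaded file can be checked against the published SHA256 digests with the standard library's hashlib. The sketch below is one way to do it; sha256_of and matches_sha256 are hypothetical helper names, and the digests are copied from the tables above.

```python
import hashlib

# Published SHA256 digests for the 0.3.0 files, copied from the hash tables.
PUBLISHED_SHA256 = {
    "readability_python-0.3.0.tar.gz":
        "236cf713fadd6c2d1b93e88e0901c92b79e1fcbed1abe180499e300504714276",
    "readability_python-0.3.0-py3-none-any.whl":
        "2e41a73f45edf5e655edd551f897491ce6fdc934369a053239e2c123007697ed",
}


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha256(path, expected_hex):
    """Compare a file's digest with an expected hex string, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()


# Usage, assuming the sdist has been downloaded to the current directory:
#   name = "readability_python-0.3.0.tar.gz"
#   print(matches_sha256(name, PUBLISHED_SHA256[name]))
```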
