Python port of the go-readability library for extracting the main content from web pages
Readability Python
A high-fidelity Python port of the go-readability library, which itself is a Go port of Mozilla's Readability library. This library extracts the main content from HTML pages, removing navigation, ads, and other non-content elements, making it easier to read and process the actual content.
Features
- Extract the main article content from HTML pages
- Extract metadata (title, author, publication date, etc.)
- Convert relative URLs to absolute URLs
- Generate both HTML and plain text versions of the content
- Handle various edge cases (hidden elements, malformed HTML, etc.)
- Pythonic API with explicit error handling
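To make the relative-to-absolute URL feature concrete, here is a minimal sketch of what that resolution means, using only the standard library. It is an illustration, not the library's own implementation; the `absolutize` helper and the example URLs are assumptions made for this example.

```python
from urllib.parse import urljoin

# Hypothetical helper for illustration only; not part of readability-python.
def absolutize(href: str, base_url: str) -> str:
    """Resolve a link found in a page against the URL the page was fetched from."""
    return urljoin(base_url, href)

base = "https://example.com/blog/post.html"

# Path-relative links are resolved against the directory of the base URL.
print(absolutize("images/a.png", base))
# Root-relative links keep only the scheme and host of the base URL.
print(absolutize("/about", base))
# Links that are already absolute are returned unchanged.
print(absolutize("https://cdn.example.org/x.js", base))
```

This is why passing the page's own URL to the parser matters: without a base URL, relative links cannot be resolved.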
Installation
# Install from PyPI
pip install readability-python
# Install from source
git clone https://github.com/CyranoB/readability-python.git
cd readability-python
pip install -e .
# With Poetry
poetry add readability-python
Usage
Basic Usage
from readability import Readability
# Parse HTML content
parser = Readability()
article, error = parser.parse(html_content, url="https://example.com/article")
if error:
    print(f"Error: {error}")
else:
    # Access extracted content and metadata
    print(f"Title: {article.title}")
    print(f"Byline: {article.byline}")
    print(f"Content: {article.content}")  # HTML content
    print(f"Text Content: {article.text_content}")  # Plain text content
    print(f"Excerpt: {article.excerpt}")
    print(f"Site Name: {article.site_name}")
    print(f"Image: {article.image}")
    print(f"Favicon: {article.favicon}")
    print(f"Length: {article.length}")
    print(f"Published Time: {article.published_time}")
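Since parse returns an (article, error) pair instead of raising, callers who prefer exceptions can wrap the call. The following is a minimal sketch that assumes only the tuple shape shown above; `ExtractionError` and `unwrap` are hypothetical names introduced here and are not part of the library.

```python
from typing import Optional, Tuple, TypeVar

T = TypeVar("T")


class ExtractionError(Exception):
    """Hypothetical exception type for this sketch; not provided by the library."""


def unwrap(result: Tuple[Optional[T], Optional[object]]) -> T:
    """Turn an (article, error) pair into a return value or a raised exception."""
    article, error = result
    if error:
        raise ExtractionError(str(error))
    return article


# Intended use with the parser from the example above (not executed here):
#   article = unwrap(parser.parse(html_content, url="https://example.com/article"))
```

The explicit pair keeps failures visible at the call site; a wrapper like this is only a convenience for code that already relies on try/except.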
CLI Usage
# Extract content from a URL
readability-python https://example.com/article --output article.html
# Extract content from a file
readability-python article.html --output extracted.html
# Output as JSON
readability-python https://example.com/article --format json --output article.json
# Output as plain text
readability-python https://example.com/article --format text --output article.txt
Note: When specifying output files, it's recommended to use either absolute paths or paths within a dedicated output directory (e.g., output/article.html) to avoid cluttering your project directory. Output files in the root directory (like extracted.html) are automatically added to .gitignore.
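When driving the CLI from a script, the output-directory advice can be applied with a small path helper. This is a sketch using only pathlib; the `output_path` function and the default directory name `output` are assumptions for illustration, not something the CLI provides.

```python
from pathlib import Path


# Hypothetical helper; the CLI itself only receives the resulting path via --output.
def output_path(filename: str, directory: str = "output") -> Path:
    """Return a path for `filename` inside `directory`, creating the directory if needed."""
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Keep only the final component so results never land outside the output directory.
    return out_dir / Path(filename).name
```

The returned path can then be passed as the value of --output, for example `output_path("article.html")`.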
Testing
The library includes a test suite that checks compatibility with the original Go implementation. The tests are categorized by:
Functional Areas
- HTML Parsing
- Metadata Extraction
- Content Identification
- Content Cleaning
- URL Handling
- Visibility Detection
- Text Normalization
- Real-world Websites
Criticality Levels
- P0 (Critical) - Core functionality that must work
- P1 (High) - Important functionality with significant impact
- P2 (Medium) - Functionality that should work but has workarounds
- P3 (Low) - Nice-to-have functionality with minimal impact
Test Types
- Basic - Tests for basic functionality
- Feature - Tests for specific features
- Edge Case - Tests for handling edge cases
- Real-world - Tests using real-world websites
To run the tests:
# Run all tests
pytest
# Run tests by functional area
pytest -m "area_html_parsing"
# Run tests by criticality
pytest -m "criticality_p0"
# Run tests by type
pytest -m "type_real_world"
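Selecting tests with -m relies on pytest markers, which are normally registered so that pytest does not warn about unknown marks. The sketch below shows one way such markers could be registered in a conftest.py. It covers only the three marker names that appear in the commands above; the descriptions are assumptions, and the project may register its markers differently (for example in its pytest configuration file).

```python
# conftest.py (sketch): register the custom markers used with `pytest -m`.
# Only marker names shown in the commands above are listed; descriptions are assumed.
MARKERS = {
    "area_html_parsing": "tests covering the HTML Parsing functional area",
    "criticality_p0": "P0 (Critical) tests for core functionality",
    "type_real_world": "tests that use real-world websites",
}


def pytest_configure(config):
    """pytest hook: declare each custom marker so `-m` selection works without warnings."""
    for name, description in MARKERS.items():
        config.addinivalue_line("markers", f"{name}: {description}")
```

A test is then tagged with decorators such as `@pytest.mark.criticality_p0`, and markers can be combined on the command line, for example `pytest -m "criticality_p0 and type_real_world"`.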
Test Coverage
The suite currently contains 37 test cases, distributed across functional areas and criticality levels as follows:
| Functional Area | P0 | P1 | P2 | P3 | Total |
|---|---|---|---|---|---|
| HTML Parsing | 0 | 0 | 2 | 0 | 2 |
| Metadata Extraction | 0 | 3 | 0 | 0 | 3 |
| Content Identification | 2 | 0 | 0 | 0 | 2 |
| Content Cleaning | 1 | 5 | 1 | 0 | 7 |
| URL Handling | 0 | 3 | 0 | 0 | 3 |
| Visibility Detection | 1 | 1 | 0 | 0 | 2 |
| Text Normalization | 0 | 1 | 3 | 0 | 4 |
| Real-world Websites | 4 | 1 | 2 | 7 | 14 |
| Total | 8 | 14 | 8 | 7 | 37 |
Test Type Distribution
| Test Type | Count | Percentage |
|---|---|---|
| Basic | 2 | 5.4% |
| Feature | 14 | 37.8% |
| Edge Case | 7 | 18.9% |
| Real-world | 14 | 37.8% |
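The percentages in this table follow directly from the counts, each divided by the total number of tests. A short check, using only the counts listed above:

```python
# Counts taken from the Test Type Distribution table.
counts = {"Basic": 2, "Feature": 14, "Edge Case": 7, "Real-world": 14}

total = sum(counts.values())
# Each share is count / total, expressed as a percentage rounded to one decimal place.
percentages = {name: round(100 * n / total, 1) for name, n in counts.items()}

for name, share in percentages.items():
    print(f"{name}: {counts[name]} of {total} ({share}%)")
```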
Comparison with Go Implementation
This library aims to be a high-fidelity port of the go-readability library, with the following considerations:
- Maintains the same functionality and behavior
- Uses Python best practices and idioms where appropriate
- Adapts the API to be more Pythonic while maintaining the same core functionality
- Uses BeautifulSoup for HTML parsing instead of Go's DOM implementation
- Maps Go's DOM traversal methods to BeautifulSoup's methods
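As an illustration of the last point, DOM-style element traversal can be expressed with BeautifulSoup by filtering out text nodes. This is a sketch of the general idea, not the library's actual mapping; the helper names are hypothetical, and it assumes the beautifulsoup4 package is installed.

```python
from typing import List, Optional

from bs4 import BeautifulSoup
from bs4.element import Tag


# Hypothetical helpers for illustration; the library's internal names may differ.
def element_children(node: Tag) -> List[Tag]:
    """Child elements only, skipping text nodes and comments."""
    return [child for child in node.children if isinstance(child, Tag)]


def first_element_child(node: Tag) -> Optional[Tag]:
    """First child that is an element, or None if there is none."""
    return next((child for child in node.children if isinstance(child, Tag)), None)


def next_element_sibling(node: Tag) -> Optional[Tag]:
    """Next sibling that is an element, or None if there is none."""
    return next((sib for sib in node.next_siblings if isinstance(sib, Tag)), None)


soup = BeautifulSoup("<div> <p>a</p> text <span>b</span></div>", "html.parser")
div = soup.div
print([child.name for child in element_children(div)])
```

The isinstance filter matters because BeautifulSoup's children and next_siblings iterators also yield whitespace and text nodes, whereas element-oriented DOM accessors skip them.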
Development
Requirements
- Python 3.6+
- Poetry (optional, for dependency management)
Setup
# Clone the repository
git clone https://github.com/CyranoB/readability-python.git
cd readability-python
# Install dependencies with pip
pip install -e ".[dev]"
# Or with Poetry
poetry install
Development Workflow
# Run tests
pytest
# Format code
black readability tests
# Lint code
ruff check readability tests
# Type check
mypy readability
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Adding New Test Cases
- Create a new directory in tests/test-pages/ with a descriptive name
- Add the following files to the directory:
  - source.html - The HTML to parse
  - expected.html - The expected content
  - expected-metadata.json - The expected metadata
- Add the test case to tests/test_categories.py with appropriate categorization
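Before registering a new case, it can help to confirm that the directory contains the three expected files. The following is a minimal sketch of such a check; the `missing_files` helper is hypothetical and not part of the repository, and the example directory name is made up.

```python
from pathlib import Path
from typing import List

# File names required for each test case, as listed above.
REQUIRED_FILES = ("source.html", "expected.html", "expected-metadata.json")


# Hypothetical helper for contributors; not shipped with the project.
def missing_files(case_dir: Path) -> List[str]:
    """Return the required file names that are absent from a test-case directory."""
    return [name for name in REQUIRED_FILES if not (case_dir / name).is_file()]


# Example (directory name is illustrative):
#   missing_files(Path("tests/test-pages/my-new-case"))
```

An empty list means the directory has everything the test runner is described as needing.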
License
This project is licensed under the Apache 2.0 License - see the LICENSE file for details.