Extracts the main content from webpages and generates clean, reader-friendly HTML
Web Reader Mode
A Python package that extracts the main content (text and images) from a webpage, similar to Safari's Reader mode on iPhone.
Features
- Extracts the main article content from a webpage
- Removes ads, navigation, and other distractions
- Downloads and saves images locally
- Outputs content in plain text or JSON format
- Generates clean, reader-friendly HTML pages
Installation
Using Poetry (Recommended)
- Clone this repository:
  git clone <repository-url>
  cd <repository-directory>
- Install with Poetry:
  poetry install
Alternatively, install from PyPI:
pip install orman-news-reader
Usage
Using Poetry
poetry run orman-web-reader [command] [options]
Available commands:
- reader: Extract content from a webpage
- html: Generate a clean HTML page from a webpage
- google: Extract content from Google News RSS articles
Reader Mode
Extract content from a webpage:
poetry run orman-web-reader reader https://example.com/article
Save images to a specific directory:
poetry run orman-web-reader reader https://example.com/article --output-dir images
Output in JSON format:
poetry run orman-web-reader reader https://example.com/article --json
Full options:
poetry run orman-web-reader reader --help
Google News RSS Extractor
Extract content from Google News RSS articles with redirect handling:
poetry run orman-web-reader google https://news.google.com/rss/articles/[article-id]
Output in JSON format:
poetry run orman-web-reader google https://news.google.com/rss/articles/[article-id] --json
Save images to a specific directory:
poetry run orman-web-reader google https://news.google.com/rss/articles/[article-id] --output-dir images
Full options:
poetry run orman-web-reader google --help
HTML Generator
Generate a clean HTML page from a webpage:
poetry run orman-web-reader html https://example.com/article
Specify output file and image directory:
poetry run orman-web-reader html https://example.com/article --output-file my_article.html --image-dir my_images
Process a Google News RSS article with redirect handling:
poetry run orman-web-reader html https://news.google.com/rss/articles/[article-id] --google-news
Use a custom CSS file:
poetry run orman-web-reader html https://example.com/article --css-file custom.css
Full options:
poetry run orman-web-reader html --help
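To run the HTML generator over several articles from a script, one option is to assemble the command line in Python. The following is a minimal sketch that uses only the flags shown above (--output-file, --image-dir, --css-file, --google-news); the helper name build_html_command is hypothetical and not part of the package.

```python
import shlex
from typing import List, Optional


def build_html_command(
    url: str,
    output_file: Optional[str] = None,
    image_dir: Optional[str] = None,
    css_file: Optional[str] = None,
    google_news: bool = False,
) -> List[str]:
    """Return the argv list for one `orman-web-reader html` invocation.

    Hypothetical helper: it only assembles the flags documented above.
    """
    argv = ["poetry", "run", "orman-web-reader", "html", url]
    if output_file:
        argv += ["--output-file", output_file]
    if image_dir:
        argv += ["--image-dir", image_dir]
    if css_file:
        argv += ["--css-file", css_file]
    if google_news:
        argv.append("--google-news")
    return argv


if __name__ == "__main__":
    cmd = build_html_command(
        "https://example.com/article",
        output_file="my_article.html",
        image_dir="my_images",
    )
    # Print the command for inspection; to execute it, pass the list to
    # subprocess.run(cmd, check=True) instead.
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell-quoting problems when URLs contain characters such as & or ?.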
Direct Script Execution
You can also use the individual scripts directly:
poetry run orman-reader-mode https://example.com/article
poetry run orman-html-generator https://example.com/article
Using as a Module
You can use the reader mode as a module in your own Python scripts:
from orman_news_reader import extract_content
from orman_news_reader.html_generator import generate_html
from orman_news_reader.google_rss_extractor import extract_google_news_content
# Extract content from a regular URL
content = extract_content("https://example.com/article", "images")
# Access the extracted content
title = content['title']
paragraphs = content['text']
images = content['images']
# Generate HTML from the content
generate_html(content, "output.html")
# Extract content from a Google News RSS URL
google_content = extract_google_news_content("https://news.google.com/rss/articles/[article-id]", "images")
# Access Google News content with banner image
title = google_content['title']
banner_image = google_content.get('banner_image_url')
content_elements = google_content['content_elements']
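As an illustration of working with the dictionary returned by extract_content, the sketch below turns it into a plain-text summary. It assumes that content['text'] is a list of paragraph strings and content['images'] is a list of saved image paths; the README shows these keys but does not spell out their types, so adjust to the real shape. The function name content_to_text is hypothetical, and the sample dict is hand-built so the snippet runs without network access.

```python
from typing import Any, Dict, List


def content_to_text(content: Dict[str, Any]) -> str:
    """Join the title and paragraphs of an extracted article into plain text.

    Assumes 'text' is a list of strings and 'images' is a list of paths.
    """
    title = content.get("title", "")
    paragraphs: List[str] = content.get("text") or []
    images: List[str] = content.get("images") or []

    # Title underlined with '=' followed by a blank line, if a title exists.
    lines = [title, "=" * len(title), ""] if title else []
    # Skip paragraphs that are empty or whitespace-only.
    lines.extend(p.strip() for p in paragraphs if p.strip())
    if images:
        lines.append("")
        lines.append(f"[{len(images)} image(s) saved]")
    return "\n".join(lines)


# Hand-built stand-in for the dict returned by extract_content (shape assumed).
sample = {
    "title": "Example Article",
    "text": ["First paragraph.", "  ", "Second paragraph."],
    "images": ["images/photo1.jpg"],
}
print(content_to_text(sample))
```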
Example
poetry run orman-web-reader html https://example.com/article --output-file article.html
This will:
- Extract the main content from the article
- Save any images to the 'images' directory
- Generate a clean HTML file with the article content
- Apply responsive styling so the page reads well on both desktop and mobile screens
How It Works
The package uses:
- requests to fetch the webpage
- readability-lxml to extract the main content
- BeautifulSoup to parse the HTML and extract text and images
- Pillow for image processing
- Selenium for handling JavaScript redirects in Google News RSS articles
Standard Reader Mode
The standard reader mode extracts content directly from the provided URL using readability algorithms.
Google News RSS Extractor
The Google News RSS extractor:
- Uses Selenium WebDriver to follow redirects from Google News URLs to the actual article
- Extracts high-quality banner images for carousels from Open Graph tags, Twitter cards, or featured images
- Processes the article content using the standard reader mode
- Returns a structured object with the article title, content, and banner image URL
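The banner-image step in the list above can be sketched as a lookup over a page's meta tags with a fixed order of preference. This is not the package's implementation: it is a standard-library illustration that assumes the og:image and twitter:image keys, prefers Open Graph over Twitter cards, and leaves out the featured-image fallback. The names find_banner_image and PREFERRED_KEYS are hypothetical.

```python
from html.parser import HTMLParser
from typing import Dict, Optional

# Assumed order of preference; the package's actual order is not documented here.
PREFERRED_KEYS = ("og:image", "twitter:image")


class _MetaImageParser(HTMLParser):
    """Collect the first content value seen for each preferred <meta> key."""

    def __init__(self) -> None:
        super().__init__()
        self.found: Dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: value for name, value in attrs if value is not None}
        # Open Graph uses property=..., Twitter cards commonly use name=...
        key = attr.get("property") or attr.get("name")
        url = attr.get("content")
        if key in PREFERRED_KEYS and url and key not in self.found:
            self.found[key] = url


def find_banner_image(html: str) -> Optional[str]:
    """Return the best banner image URL found in the HTML, or None."""
    parser = _MetaImageParser()
    parser.feed(html)
    for key in PREFERRED_KEYS:
        if key in parser.found:
            return parser.found[key]
    return None


page = (
    '<head>'
    '<meta name="twitter:image" content="https://example.com/twitter.jpg">'
    '<meta property="og:image" content="https://example.com/og.jpg">'
    '</head>'
)
print(find_banner_image(page))
```

Because the preference order is applied after parsing, the Open Graph image wins even when the Twitter card tag appears first in the page.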
Requirements
- Python 3.8+
- Chrome/Chromium browser (for Selenium when using Google News features)
- Dependencies are managed by Poetry
Development
Setting up the development environment
git clone <repository-url>
cd <repository-directory>
poetry install
Running tests
poetry run pytest
Building the package
poetry build
Publishing to PyPI
poetry publish
File details
Details for the file orman_news_reader-0.3.4.tar.gz.
File metadata
- Download URL: orman_news_reader-0.3.4.tar.gz
- Upload date:
- Size: 14.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.3 CPython/3.12.7 Darwin/24.0.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 98a5627db60e9204a9ca1980f59ab5a78a7cc202f6b2a51446625fea0b2816b6 |
| MD5 | e4ba84d4a0b026463bda2002c3aa4da9 |
| BLAKE2b-256 | 021ed1d19ed755d3f16e16d89bab2959afb5ac1730207cd394f07b17445f07bc |
File details
Details for the file orman_news_reader-0.3.4-py3-none-any.whl.
File metadata
- Download URL: orman_news_reader-0.3.4-py3-none-any.whl
- Upload date:
- Size: 16.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.3 CPython/3.12.7 Darwin/24.0.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 63ebbe6e7096e16186ae31efe2006b5d081726906882756bdeb53ce77d291553 |
| MD5 | b3475997a360e8cf640103436883e12f |
| BLAKE2b-256 | 58aa0c411a44e1e3ab2f47084f500e22f05b25935a8bb03b73c78974d72592b3 |