Extracts the main content from webpages and generates clean, reader-friendly HTML

Web Reader Mode

A Python package that extracts the main content (text and images) from a webpage, similar to Reader mode in Safari on iPhone.

Features

Extracts the main article content from a webpage
Removes ads, navigation, and other distractions
Downloads and saves images locally
Outputs content in plain text or JSON format
Generates clean, reader-friendly HTML pages

Installation

Using Poetry (Recommended)

Clone this repository:

git clone <repository-url>
cd <repository-directory>

Install with Poetry:
```
poetry install
```
Alternatively, install from PyPI:
```
pip install orman-news-reader
```

Usage

Using Poetry

poetry run orman-web-reader [command] [options]

Available commands:

reader: Extract content from a webpage
html: Generate a clean HTML page from a webpage
google: Extract content from Google News RSS articles
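
The command layout above maps naturally onto subcommands. As a rough sketch only (the package's actual CLI code is not shown in this README), a parser with the same three commands and the flags used in the examples below could be assembled with argparse. The name `build_parser` is hypothetical, and no default values are assumed.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the commands and flags shown in this README."""
    parser = argparse.ArgumentParser(prog="orman-web-reader")
    subparsers = parser.add_subparsers(dest="command", required=True)

    # reader: extract content from a webpage
    reader = subparsers.add_parser("reader", help="Extract content from a webpage")
    reader.add_argument("url")
    reader.add_argument("--output-dir")
    reader.add_argument("--json", action="store_true")

    # google: extract content from Google News RSS articles
    google = subparsers.add_parser("google", help="Extract content from Google News RSS articles")
    google.add_argument("url")
    google.add_argument("--output-dir")
    google.add_argument("--json", action="store_true")

    # html: generate a clean HTML page from a webpage
    html = subparsers.add_parser("html", help="Generate a clean HTML page from a webpage")
    html.add_argument("url")
    html.add_argument("--output-file")
    html.add_argument("--image-dir")
    html.add_argument("--google-news", action="store_true")
    html.add_argument("--css-file")
    return parser


# Parse an example command line without touching the network.
args = build_parser().parse_args(["reader", "https://example.com/article", "--json"])
print(args.command, args.url, args.json)
```

Run `--help` on the real commands for the authoritative list of options.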

Reader Mode

Extract content from a webpage:

poetry run orman-web-reader reader https://example.com/article

Save images to a specific directory:

poetry run orman-web-reader reader https://example.com/article --output-dir images

Output in JSON format:

poetry run orman-web-reader reader https://example.com/article --json

Full options:

poetry run orman-web-reader reader --help

Google News RSS Extractor

Extract content from Google News RSS articles with redirect handling:

poetry run orman-web-reader google https://news.google.com/rss/articles/[article-id]

Output in JSON format:

poetry run orman-web-reader google https://news.google.com/rss/articles/[article-id] --json

Save images to a specific directory:

poetry run orman-web-reader google https://news.google.com/rss/articles/[article-id] --output-dir images

Full options:

poetry run orman-web-reader google --help

HTML Generator

Generate a clean HTML page from a webpage:

poetry run orman-web-reader html https://example.com/article

Specify output file and image directory:

poetry run orman-web-reader html https://example.com/article --output-file my_article.html --image-dir my_images

Process a Google News RSS article with redirect handling:

poetry run orman-web-reader html https://news.google.com/rss/articles/[article-id] --google-news

Use a custom CSS file:

poetry run orman-web-reader html https://example.com/article --css-file custom.css

Full options:

poetry run orman-web-reader html --help

Direct Script Execution

You can also use the individual scripts directly:

poetry run orman-reader-mode https://example.com/article
poetry run orman-html-generator https://example.com/article

Using as a Module

You can also import the package and call it from your own Python scripts:

from orman_news_reader import extract_content
from orman_news_reader.html_generator import generate_html
from orman_news_reader.google_rss_extractor import extract_google_news_content

# Extract content from a regular URL
content = extract_content("https://example.com/article", "images")

# Access the extracted content
title = content['title']
paragraphs = content['text']
images = content['images']

# Generate HTML from the content
generate_html(content, "output.html")

# Extract content from a Google News RSS URL
google_content = extract_google_news_content("https://news.google.com/rss/articles/[article-id]", "images")

# Access Google News content with banner image
title = google_content['title']
banner_image = google_content.get('banner_image_url')
content_elements = google_content['content_elements']

Example

poetry run orman-web-reader html https://example.com/article --output-file article.html

This will:

Extract the main content from the article
Save any images to the 'images' directory
Generate a clean HTML file with the article content
Apply a responsive design that works well on all devices
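
To make the last two steps concrete, here is a minimal sketch of turning an extracted content dictionary into a responsive page. It is not the package's `generate_html`. The function name `render_reader_html` is hypothetical, and the sketch assumes that `content['text']` is a list of paragraph strings and `content['images']` is a list of local image paths.

```python
from html import escape


def render_reader_html(content: dict) -> str:
    """Render a content dict as a small, responsive HTML page (illustrative only)."""
    title = escape(content.get("title", ""))
    # Escape every paragraph so stray markup in the text cannot break the page.
    paragraphs = "\n".join(f"<p>{escape(p)}</p>" for p in content.get("text", []))
    images = "\n".join(
        f'<img src="{escape(src)}" alt="">' for src in content.get("images", [])
    )
    return f"""<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>{title}</title>
<style>
body {{ max-width: 40em; margin: 0 auto; padding: 1em; line-height: 1.6; }}
img {{ max-width: 100%; height: auto; }}
</style>
</head>
<body>
<h1>{title}</h1>
{images}
{paragraphs}
</body>
</html>
"""


page = render_reader_html(
    {"title": "Sample", "text": ["First paragraph."], "images": ["images/a.png"]}
)
print(page)
```

The viewport meta tag and the `max-width` rules are what keep the page readable on narrow screens in this sketch.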

How It Works

The package uses:

requests to fetch the webpage
readability-lxml to extract the main content
BeautifulSoup to parse the HTML and extract text and images
Pillow for image processing
Selenium for handling JavaScript redirects in Google News RSS articles
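
The text and image extraction step can be illustrated without any third-party dependencies. The sketch below swaps BeautifulSoup for the standard library's `html.parser` and works on HTML that has already been reduced to the article body; it is a stand-in for the idea, not the package's code, and the names `ParagraphImageParser` and `extract_text_and_images` are hypothetical.

```python
from html.parser import HTMLParser


class ParagraphImageParser(HTMLParser):
    """Collect paragraph text and image URLs from article HTML."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self.images = []
        self._in_p = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._buffer = []
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.images.append(src)

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            # Join the text fragments and collapse runs of whitespace.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buffer.append(data)


def extract_text_and_images(article_html: str) -> dict:
    """Return paragraphs and image URLs in document order."""
    parser = ParagraphImageParser()
    parser.feed(article_html)
    parser.close()
    return {"text": parser.paragraphs, "images": parser.images}


sample = '<div><p>First <b>bold</b> paragraph.</p><img src="a.jpg"><p>Second one.</p></div>'
print(extract_text_and_images(sample))
```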

Standard Reader Mode

The standard reader mode fetches the provided URL directly and extracts the main content with readability-lxml.

Google News RSS Extractor

The Google News RSS extractor:

Uses Selenium WebDriver to follow redirects from Google News URLs to the actual article
Extracts a high-quality banner image (suitable for carousels) from Open Graph tags, Twitter Card tags, or featured images
Processes the article content using the standard reader mode
Returns a structured object with the article title, content, and banner image URL
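
The banner image lookup can be sketched with the standard library alone. The example below checks the `og:image` meta tag first and falls back to `twitter:image`; that priority order is an assumption, the featured-image fallback is left out, and `MetaImageParser` and `find_banner_image` are hypothetical names rather than part of the package.

```python
from html.parser import HTMLParser
from typing import Optional


class MetaImageParser(HTMLParser):
    """Collect <meta> content values keyed by their property or name attribute."""

    def __init__(self):
        super().__init__()
        self.candidates = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        key = attr.get("property") or attr.get("name")
        content = attr.get("content")
        if key and content:
            # Keep the first value seen for each key.
            self.candidates.setdefault(key.lower(), content)


def find_banner_image(page_html: str) -> Optional[str]:
    """Return the Open Graph image URL, else the Twitter Card image, else None."""
    parser = MetaImageParser()
    parser.feed(page_html)
    parser.close()
    for key in ("og:image", "twitter:image"):
        if key in parser.candidates:
            return parser.candidates[key]
    return None


sample = (
    '<head><meta name="twitter:image" content="https://example.com/t.jpg">'
    '<meta property="og:image" content="https://example.com/og.jpg"></head>'
)
print(find_banner_image(sample))
```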

Requirements

Python 3.8+
Chrome/Chromium browser (for Selenium when using Google News features)
Dependencies are managed by Poetry

Development

Setting up the development environment

git clone <repository-url>
cd <repository-directory>
poetry install

Running tests

poetry run pytest

Building the package

poetry build

Publishing to PyPI

poetry publish

Release history

0.3.4 (this version): Mar 13, 2025
0.3.3: Mar 13, 2025
0.3.2: Mar 13, 2025
0.3.1: Mar 13, 2025
0.3.0: Mar 13, 2025
0.2.2: Mar 13, 2025
0.2.1: Mar 13, 2025
0.2.0: Mar 13, 2025
0.1.0: Mar 13, 2025

Download files

Source Distribution

orman_news_reader-0.3.4.tar.gz (14.2 kB), uploaded Mar 13, 2025, Source

Built Distribution

orman_news_reader-0.3.4-py3-none-any.whl (16.7 kB), uploaded Mar 13, 2025, Python 3

File details

Details for the file orman_news_reader-0.3.4.tar.gz.

File metadata

Download URL: orman_news_reader-0.3.4.tar.gz
Upload date: Mar 13, 2025
Size: 14.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.8.3 CPython/3.12.7 Darwin/24.0.0

File hashes

Hashes for orman_news_reader-0.3.4.tar.gz
Algorithm	Hash digest
SHA256	`98a5627db60e9204a9ca1980f59ab5a78a7cc202f6b2a51446625fea0b2816b6`
MD5	`e4ba84d4a0b026463bda2002c3aa4da9`
BLAKE2b-256	`021ed1d19ed755d3f16e16d89bab2959afb5ac1730207cd394f07b17445f07bc`


File details

Details for the file orman_news_reader-0.3.4-py3-none-any.whl.

File metadata

Download URL: orman_news_reader-0.3.4-py3-none-any.whl
Upload date: Mar 13, 2025
Size: 16.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.8.3 CPython/3.12.7 Darwin/24.0.0

File hashes

Hashes for orman_news_reader-0.3.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`63ebbe6e7096e16186ae31efe2006b5d081726906882756bdeb53ce77d291553`
MD5	`b3475997a360e8cf640103436883e12f`
BLAKE2b-256	`58aa0c411a44e1e3ab2f47084f500e22f05b25935a8bb03b73c78974d72592b3`
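
A downloaded file can be checked against the published SHA256 digest with Python's `hashlib`. This is a minimal sketch; the helper names `sha256_of_file` and `matches_published_hash` are hypothetical, and the file path is whatever location you saved the download to.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: str, expected_hex: str) -> bool:
    """Compare a file's digest with a published hex digest, ignoring case and whitespace."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, calling `matches_published_hash` with the path to the downloaded wheel and the SHA256 value listed for it should return `True` for an intact download.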

