Minimal news fetching: article text, RSS, Google News search, RSS discovery

📰 open-news

Zero-Config News Fetching & Article Extraction for Python

A lightweight, batteries-included Python package for fetching news articles, extracting content, discovering RSS feeds, and batch processing with summarization.

Features • Installation • Quick Start • API Reference • Contributing

🔁 Latest Updates

[29-06-2026 - Latest] - Major v0.2.0 release: auto-discovering feed registry, Google News merged into curated feeds, article dedupe, smart URL resolution, category field, search_site(), force-refresh/cache control, full internal package restructure — Release v0.2.0

[21-06-2026] - Fixed a critical import bug, fixed publish_date extraction (3 code paths), added js=True support and rotating User-Agents — Release v0.1.2

[18-06-2026] - Packaging metadata for PyPI release (no functional changes), Release v0.1.1.

[17-06-2026] - Initial Stable Release v0.1.0

View more on our Changelog

🎯 Features

📄 Article Extraction

Pulls full text and metadata (title, authors, publish date, top image) straight from a page's HTML using a built-in lxml-based extractor — no third-party extraction library required.

📡 Live News Feeds

Access curated RSS feeds with zero local configuration:

50+ country-specific feeds (India, USA, UK, and many more)
Category feeds (business, politics, geopolitics)
Every feed file now includes a locale-targeted Google News RSS entry merged in alongside direct outlet feeds
Auto-discovered via a remote registry — new categories/countries can be added without a package update
Sourced from open-feeds

🔍 Google News Search

Search across Google News with decoded URLs:

Real article links (via googlenewsdecoder), with a graceful fallback to the raw redirect URL if decoding fails
Rich metadata included

🔗 RSS Discovery

Auto-discover RSS feeds from any website:

Built with BeautifulSoup + lxml
Fetch articles from discovered feeds instantly
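As an illustration of what RSS discovery involves, the sketch below scans a page's HTML for feed `<link>` tags. It uses the standard library's `html.parser` instead of BeautifulSoup + lxml so that it runs without extra installs, and the names `FeedLinkParser` and `discover_feeds` are hypothetical. It is not the package's actual implementation.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types commonly used to advertise RSS and Atom feeds
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collect feed URLs advertised via <link rel="alternate" ...> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = {key: (value or "") for key, value in attrs}
        rel = attr.get("rel", "").lower().split()
        is_feed = attr.get("type", "").lower() in FEED_TYPES
        if "alternate" in rel and is_feed and attr.get("href"):
            # Resolve relative hrefs such as "/feed.xml" against the page URL
            self.feeds.append(urljoin(self.base_url, attr["href"]))


def discover_feeds(html, base_url):
    """Return absolute feed URLs found in an HTML document (hypothetical helper)."""
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    return parser.feeds


page = (
    '<html><head>'
    '<link rel="alternate" type="application/rss+xml" href="/feed.xml">'
    '<link rel="stylesheet" href="/site.css">'
    '</head></html>'
)
print(discover_feeds(page, "https://example.com"))
```

Only the first `<link>` qualifies here; the stylesheet link is ignored because its `rel` and `type` do not match.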

⚡ Smart Caching

24-hour feed caching to minimize network requests and improve performance

🚀 Batch Processing & Summarization

Process multiple articles concurrently with built-in summarization:

Fetch and summarize batch URLs
Search Google News + fetch + summarize in one call
Configurable concurrency and timeouts
Lightweight extractive summarization
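To make "extractive summarization" concrete, here is a minimal sketch of one common approach: score each sentence by the frequency of its words and keep the top few in their original order. The function name `summarize` and the scoring rule are assumptions for illustration; the package's own summarizer may work differently.

```python
import re
from collections import Counter


def summarize(text, sentence_count=3):
    """Return the `sentence_count` highest-scoring sentences, in original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if len(sentences) <= sentence_count:
        return " ".join(sentences)

    # Word frequencies over the whole text; short words are skipped as a
    # crude stand-in for a stop-word list.
    words = re.findall(r"[a-z']+", text.lower())
    freq = Counter(w for w in words if len(w) > 3)

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:sentence_count])  # restore reading order
    return " ".join(sentences[i] for i in keep)


sample_text = (
    "Solar power is growing. Solar panels convert sunlight into power. "
    "The weather was nice. Cats sleep."
)
print(summarize(sample_text, sentence_count=2))
```

Because sentences are selected rather than rewritten, the summary never contains text that is not in the article.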

📦 Installation

Important: the package is published on PyPI as open-news-api instead of the project name open-news, because of PyPI issues. The import name is open_news.

From GitHub

git clone https://github.com/alphap365/open-news.git
cd open-news
pip install -e .

Direct Install

pip install open-news-api

To install a specific version

pip install open-news-api==0.1.2  # replace 0.1.2 with the version you want

Dependencies installed automatically:

beautifulsoup4 • lxml • python-dateutil
feedparser • googlenewsdecoder • httpx • requests

🚀 Quick Start

1️⃣ Extract Article Content

from open_news import fetch_article

article = fetch_article("https://www.bbc.com/news/world-us-canada-12345678")

print(article["title"])
print(article["text"][:500])
print(f"Source: {article['source']}")
print(f"Published: {article['publish_date']}")

2️⃣ Search Google News

from open_news import search_news

results = search_news("artificial intelligence", limit=5)

for article in results:
    print(f"✓ {article['title']}")
    print(f"  → {article['url']}\n")

3️⃣ Get Live News (Country-Specific)

from open_news import get_live_news

# Get top news from India
india_news = get_live_news(country="india", limit_per_feed=3)

for article in india_news:
    print(f"[{article['source']}] {article['title']}")
    print(f"Published: {article['published']}\n")

4️⃣ Get Category News

from open_news import get_live_news

# Business news from curated feeds
business = get_live_news(category="business", limit_per_feed=2)

for article in business:
    print(article["title"])

5️⃣ Discover & Fetch RSS Feeds

from open_news import get_articles_from_website_rss

# Auto-discover RSS from any website
articles = get_articles_from_website_rss("https://techcrunch.com", limit=5)

for article in articles:
    print(f"✓ {article['title']}")

6️⃣ Batch Fetch & Summarize Articles

from open_news import fetch_and_summarize_batch

urls = [
    "https://example.com/article1",
    "https://example.com/article2",
    "https://example.com/article3",
]

results = fetch_and_summarize_batch(urls, sentence_count=2, max_workers=3)

for result in results:
    if result["status"] == "success":
        print(f"📰 {result['title']}")
        print(f"   Summary: {result['summary']}\n")
    else:
        print(f"❌ Failed: {result['error']}")

7️⃣ Search & Summarize in One Call

from open_news import fetch_and_summarize_search_results

results = fetch_and_summarize_search_results(
    "climate change",
    limit=5,
    sentence_count=2,
    max_workers=3
)

for article in results:
    print(f"🔗 {article['url']}")
    print(f"📰 {article['title']}")
    print(f"   {article['summary']}\n")

8️⃣ Search a Single Domain

from open_news import search_site

results = search_site("alcoholism", domain="timesofindia.indiatimes.com", limit=5)

for article in results:
    print(f"✓ {article['title']}")
    print(f"  → {article['url']}\n")

9️⃣ Force-Refresh Feeds & Clear Cache

from open_news import live_news, clear_feed_cache

# Bypass the 24h cache and fetch fresh feed lists right now
fresh = live_news(category="business", force_refresh=True)

# Clear one cached entry, or wipe everything
clear_feed_cache(category="business")
clear_feed_cache()  # clears the entire cache directory

🔟 Dedupe Articles

from open_news import live_news, dedupe_articles

# Dedupe is on by default for live_news, discover_and_get, batch_summarize,
# and search_and_summarize — disable per-call if you want raw results:
raw = live_news(category="news", dedupe=False)

# Or dedupe any list of article dicts yourself, with optional fuzzy
# title matching for same-story-different-outlet duplicates:
merged = dedupe_articles(raw, fuzzy=True)

1️⃣1️⃣ JS-Heavy Pages (Optional)

Some sites render their article body client-side and return little to nothing in the raw HTML. For these, install the optional js extra:

pip install "open-news-api[js]"   # quotes keep shells such as zsh from expanding the brackets
playwright install chromium   # one-time browser download

Then pass js=True to any fetch function:

from open_news import fetch_article

article = fetch_article("https://js-heavy-site.example.com/article", js=True)

js=True works on fetch_article, fetch_and_summarize_batch, and fetch_and_summarize_search_results. It's slower per-request (a real browser is launched), so consider lowering max_workers when batching with js=True. If the js extra isn't installed, it logs a warning and transparently falls back to the plain HTTP fetch instead of raising.
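The fallback described above follows a common optional-dependency pattern, sketched below. The helper name `choose_fetch_mode` and the returned labels are hypothetical, and checking for the `playwright` module is an assumption based on the install step; the package's internals may differ.

```python
import importlib.util
import logging

logger = logging.getLogger(__name__)


def choose_fetch_mode(js, module_name="playwright"):
    """Return "browser" when JS rendering is requested and available, else "http"."""
    if not js:
        return "http"
    if importlib.util.find_spec(module_name) is None:
        # Warn and degrade instead of raising, mirroring the documented behaviour
        logger.warning(
            "js=True requested but %s is not installed; falling back to plain HTTP",
            module_name,
        )
        return "http"
    return "browser"


print(choose_fetch_mode(js=False))
```

Checking with `find_spec` avoids importing the heavy dependency until it is actually needed.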

🔀 Function Names: New vs Legacy

Every function is available under two names — a short modern name and a longer legacy-style name (kept for backward compatibility with early releases). They are exact aliases; pick whichever reads better in your code.

Short name	Legacy alias
`get_article`	`fetch_article`
`search`	`search_news`
`live_news`	`get_live_news`
`discover_and_get`	`get_articles_from_website_rss`
`batch_summarize`	`fetch_and_summarize_batch`
`search_and_summarize`	`fetch_and_summarize_search_results`
`search_site`	(none — new in v0.2.0)
`clear_feed_cache`	(none — new in v0.2.0)
`dedupe_articles`	(none — new in v0.2.0)
`list_categories`	(none — new in v0.2.0)
`list_countries`	(none — new in v0.2.0)

Both forms are stable public API and neither is deprecated. The reference below uses the legacy names because they are more descriptive for newcomers; import whichever you prefer.

📚 API Reference

`fetch_article(url: str) → Dict`

(alias: get_article)

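Extract article content and metadata from a given URL. Pass js=True to render the page in a headless browser first (requires the js extra; see JS-Heavy Pages above).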

Returns:

{
    "url": str,            # Original article URL
    "title": str,          # Article headline
    "text": str,           # Full article text
    "authors": list,       # Author names, if found
    "publish_date": str,   # ISO 8601 timestamp, or None if undetected
    "top_image": str,      # Best-guess main image URL, or None
    "images": list,        # All image URLs found in the article body
    "videos": list,        # Embedded video URLs (YouTube, Vimeo, etc.)
    "source": str,         # Website domain
    "meta": dict           # Raw metadata: description, site name, keywords, JSON-LD
}

Example:

article = fetch_article("https://example.com/article")
if article["text"]:
    print(f"✓ Successfully extracted: {article['title']}")
else:
    print("✗ Could not extract article content")

A note on reliability: extraction quality depends entirely on how clean and structured the page's HTML is. Heavily templated sites with lots of navigation or ad markup around the article body may need some trial and error — if a result looks off, check meta and images for clues about what got picked up.
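Since publish_date is either an ISO 8601 string or None, callers usually want a guarded conversion to a datetime. The sketch below shows one way to do that with the standard library; `parse_publish_date` is a hypothetical helper, not part of the package.

```python
from datetime import datetime


def parse_publish_date(value):
    """Convert an ISO 8601 string to a datetime, or return None if absent or invalid."""
    if not value:
        return None
    try:
        # Older Python versions reject a trailing "Z", so map it to an explicit offset
        return datetime.fromisoformat(value.replace("Z", "+00:00"))
    except ValueError:
        return None


print(parse_publish_date("2026-06-29T10:30:00Z"))
print(parse_publish_date(None))
```

In practice you would pass `article["publish_date"]` from the dict returned by fetch_article.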

`search_news(query: str, limit: int = 10) → List[Dict]`

(alias: search)

Search Google News for recent articles.

Parameters:

query (str): Search terms
limit (int): Maximum results to return (default: 10)

Returns:

[
    {
        "title": str,
        "url": str,           # Decoded real URL (when possible)
        "source": str,
        "published": str,     # ISO 8601 timestamp
        "description": str
    },
    ...
]

Example:

results = search_news("climate change", limit=5)
print(f"Found {len(results)} articles")

`get_live_news(country: str = None, category: str = "news", limit_per_feed: int = None) → List[Dict]`

(alias: live_news)

Fetch articles from curated RSS feeds.

Parameters:

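country (str, optional): Country key from the feed registry (a name-style key, not a two-letter code)
- Examples: "india", "usa", "uk", "pakistan"
- list_countries() returns the keys currently available
- When set, category is ignored
category (str): News category, used when no country is specified
- Options include "news", "business", "politics", "geopolitics"; list_categories() returns the current set
- Default: "news"
limit_per_feed (int, optional): Articles per feed (defaults to the value in the remote config)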

Returns:

[
    {
        "title": str,
        "url": str,
        "source": str,
        "published": str,
        "description": str
    },
    ...
]

Examples:

# Country-specific
india_news = get_live_news(country="india", limit_per_feed=5)

# Category-specific
business = get_live_news(category="business")

# Default news
general = get_live_news()

`get_articles_from_website_rss(website_url: str, limit: int = 10) → List[Dict]`

(alias: discover_and_get)

Discover and fetch articles from a website's RSS feed.

Parameters:

website_url (str): Website homepage URL
limit (int): Maximum articles to return

Returns: Same structure as get_live_news()

Example:

articles = get_articles_from_website_rss("https://news.ycombinator.com", limit=10)
for article in articles:
    print(f"• {article['title']}")

`search_site(keyword: str, domain: str, limit: int = 10) → List[Dict]`

Search for articles on a single news domain matching a keyword. Scoped via Google News RSS + a site: filter, with a post-fetch domain-match check to filter out stray off-domain results.

Parameters:

keyword (str): Search terms
domain (str): Target domain — bare ("timesofindia.indiatimes.com") or a full URL (scheme/path stripped automatically)
limit (int): Maximum results (default: 10)

Returns: Same shape as search_news().
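The domain handling described above (accepting a bare domain or a full URL, then checking that results really belong to that domain) can be sketched as follows. `normalize_domain` and `on_domain` are hypothetical names, and stripping a leading "www." and accepting subdomains are assumptions rather than documented behaviour.

```python
from urllib.parse import urlparse


def normalize_domain(domain):
    """Reduce 'https://www.reuters.com/world/' or 'reuters.com' to 'reuters.com'."""
    value = domain.strip().lower()
    if "://" not in value:
        value = "//" + value  # lets urlparse treat a bare domain as a network location
    host = urlparse(value).netloc.split(":")[0]  # drop any port
    return host[4:] if host.startswith("www.") else host


def on_domain(url, domain):
    """Return True if `url` is hosted on `domain` or one of its subdomains."""
    host = normalize_domain(url)
    target = normalize_domain(domain)
    return host == target or host.endswith("." + target)


print(normalize_domain("https://www.reuters.com/world/"))
print(on_domain("https://notreuters.com/story", "reuters.com"))
```

Matching on "." + target rather than a plain suffix keeps look-alike hosts such as notreuters.com from slipping through.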

Example:

results = search_site("budget", domain="reuters.com", limit=5)

`live_news(..., force_refresh: bool = False, dedupe: bool = True, dedupe_fuzzy: bool = False)`

Three new parameters on top of the existing signature:

force_refresh (bool): Bypass and refresh the 24h feed-list cache.
dedupe (bool): Remove duplicate articles across feeds (default: True).
dedupe_fuzzy (bool): Also collapse near-duplicate titles across different URLs — same story, different outlets (default: False, slower).

`clear_feed_cache(category: str = None, country: str = None) → None`

Clears cached feed-list data. With no arguments, wipes the entire cache directory (including the registry index cache). With category or country, clears just that one entry.

`dedupe_articles(articles: List[Dict], fuzzy: bool = False) → List[Dict]`

Deduplicate a list of article dicts. Stage 1 (always): normalizes URLs — resolving Google News redirects, stripping www./scheme/trailing-slash/tracking params — and removes exact matches, preferring direct-outlet entries over aggregator entries on collision. Stage 2 (if fuzzy=True): collapses near-duplicate titles across different URLs using sequence matching (skipped automatically above 300 articles).
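A simplified sketch of the two stages is shown below. It assumes `difflib.SequenceMatcher` for the title comparison and a 0.9 similarity threshold, neither of which is confirmed by the package, and it leaves out Google News redirect resolution and the direct-outlet preference. The names `normalize_url` and `dedupe` are hypothetical.

```python
from difflib import SequenceMatcher
from urllib.parse import parse_qsl, urlencode, urlparse

TRACKING_PREFIXES = ("utm_",)        # assumed list of tracking parameters
TRACKING_KEYS = {"fbclid", "gclid"}


def normalize_url(url):
    """Strip scheme, 'www.', trailing slash, and tracking params from a URL."""
    parts = urlparse(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    query = [
        (k, v) for k, v in parse_qsl(parts.query)
        if not k.lower().startswith(TRACKING_PREFIXES) and k.lower() not in TRACKING_KEYS
    ]
    key = host + parts.path.rstrip("/")
    return key + ("?" + urlencode(query) if query else "")


def dedupe(articles, fuzzy=False, threshold=0.9):
    """Stage 1: drop exact URL matches. Stage 2 (fuzzy): drop near-identical titles."""
    seen, kept = set(), []
    for article in articles:
        key = normalize_url(article["url"])
        if key in seen:
            continue
        title = article["title"].lower()
        if fuzzy and any(
            SequenceMatcher(None, title, other["title"].lower()).ratio() >= threshold
            for other in kept
        ):
            continue
        seen.add(key)
        kept.append(article)
    return kept


sample_articles = [
    {"url": "https://www.example.com/story/?utm_source=x", "title": "Markets rally on rate cut"},
    {"url": "http://example.com/story", "title": "Markets rally on rate cut"},
    {"url": "https://other.example.org/a", "title": "Markets rally on rate cuts"},
    {"url": "https://other.example.org/b", "title": "Local team wins final"},
]
print(len(dedupe(sample_articles)), len(dedupe(sample_articles, fuzzy=True)))
```

The fuzzy stage compares each article against every kept article, so its cost grows quadratically, which is consistent with the documented cutoff above 300 articles.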

`list_categories() → List[str]` / `list_countries() → List[str]`

Returns the currently available category/country keys from the open-feeds registry — useful for building UIs or validating input without hardcoding strings.
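For input validation, a small helper along these lines can check user input against the returned keys. `resolve_key` is a hypothetical name, and the hard-coded list stands in for what list_categories() might return.

```python
def resolve_key(value, available):
    """Return the matching key from `available`, ignoring case and outer whitespace."""
    wanted = value.strip().lower()
    for key in available:
        if key.lower() == wanted:
            return key
    options = ", ".join(sorted(available))
    raise ValueError(f"Unknown key {value!r}; expected one of: {options}")


# Stand-in for list_categories(); the real list comes from the open-feeds registry
categories = ["news", "business", "politics", "geopolitics"]
print(resolve_key(" Business ", categories))
```

With the package installed, `categories` would be replaced by a call to list_categories() or list_countries().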

`fetch_and_summarize_batch(urls, sentence_count=3, include_full_text=False, include_images_videos=False, max_workers=5, timeout_per_article=30, js=False) → List[Dict]`

(alias: batch_summarize)

Parameters:

urls (List[str]): Article URLs to process
sentence_count (int): Sentences per summary (default: 3)
include_full_text (bool): Include full article text in results (default: False)
include_images_videos (bool): Include images, videos, and top_image in results (default: False)
max_workers (int): Concurrent threads (default: 5)
timeout_per_article (int): Timeout per article in seconds (default: 30)
js (bool): Render pages with a headless browser before extraction (default: False, requires [js] extra)

Returns:

[
    {
        "url": str,
        "status": str,         # "success" or "failed"
        "title": str,
        "summary": str,
        "text": str,            # only if include_full_text=True
        "images": list,         # only if include_images_videos=True
        "videos": list,         # only if include_images_videos=True
        "top_image": str,       # only if include_images_videos=True
        "error": str            # only present when status == "failed"
    },
    ...
]

A timeout just shows up as a "failed" result with the timeout message in error — there's no separate "timeout" status, so check error if you need to distinguish why something failed.

Example:

from open_news import fetch_and_summarize_batch

urls = ["https://example.com/1", "https://example.com/2"]
results = fetch_and_summarize_batch(urls, sentence_count=2)

for result in results:
    if result["status"] == "success":
        print(f"✓ {result['title']}")
        print(f"  {result['summary']}")
    else:
        print(f"✗ {result['url']}: {result['error']}")
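Because timeouts share the "failed" status, separating them means inspecting error. The sketch below groups results accordingly; `split_results` is a hypothetical helper, the sample dicts are made up to match the documented shape, and the assumption that a timeout message contains "timeout" or "timed out" is not confirmed by the source.

```python
def split_results(results):
    """Group batch results into successes, timeouts, and other failures."""
    groups = {"success": [], "timeout": [], "failed": []}
    for result in results:
        if result.get("status") == "success":
            groups["success"].append(result)
            continue
        # Assumption: a timeout is recognisable from the wording of `error`
        message = (result.get("error") or "").lower()
        is_timeout = "timeout" in message or "timed out" in message
        groups["timeout" if is_timeout else "failed"].append(result)
    return groups


# Hypothetical results shaped like the documented return value
sample_results = [
    {"url": "https://example.com/1", "status": "success", "title": "A", "summary": "..."},
    {"url": "https://example.com/2", "status": "failed", "error": "Timeout after 30s"},
    {"url": "https://example.com/3", "status": "failed", "error": "HTTP 404"},
]
groups = split_results(sample_results)
print({name: len(items) for name, items in groups.items()})
```

URLs in the timeout group are natural candidates for a retry with a larger timeout_per_article.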

`fetch_and_summarize_search_results(query, limit=10, sentence_count=3, include_full_text=False, include_images_videos=False, max_workers=5, js=False) → List[Dict]`

(alias: search_and_summarize)

Parameters:

query (str): Search term
limit (int): Max results (default: 10)
sentence_count (int): Sentences per summary (default: 3)
include_full_text (bool): Include full text (default: False)
include_images_videos (bool): Include images/videos (default: False)
max_workers (int): Concurrent threads (default: 5)
js (bool): Render pages with a headless browser (default: False)

Returns: Merged list combining search metadata with extracted & summarized content

Example:

from open_news import fetch_and_summarize_search_results

results = fetch_and_summarize_search_results(
    "artificial intelligence",
    limit=5,
    sentence_count=2,
    max_workers=3
)

for article in results:
    print(f"Title: {article['title']}")
    print(f"Summary: {article['summary']}")

📡 RSS Feeds

This package uses curated RSS feed definitions from the open-feeds repository.

Feed Sources

Country-specific feeds (India, USA, UK, Pakistan, etc.)
Category feeds: General news, Business, Politics, Geopolitics
All feeds are community-maintained and regularly tested

Using the Feeds

The get_live_news() function fetches feeds dynamically from the open-feeds repository, so you always get the latest available feeds.

Contributing to Feeds

To add new RSS feeds or report broken feeds, visit the open-feeds repository and follow their contributing guidelines.

⚙️ Caching

Feeds are automatically cached for 24 hours in ~/.open_news/feeds_cache/ to reduce network requests.

Force refresh: pass force_refresh=True to live_news(), or call clear_feed_cache() to clear one entry or the entire cache directory programmatically — no need to manually delete files anymore.
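The cache rule can be pictured as a simple age check on the cached file. The sketch below is an assumption about how such a check might look, using the file's modification time; `should_refetch` is a hypothetical name and the package may track freshness differently.

```python
import time
from pathlib import Path

CACHE_TTL_SECONDS = 24 * 60 * 60  # the documented 24-hour lifetime


def should_refetch(cache_file, force_refresh=False, now=None):
    """Return True when the cached file is missing, older than the TTL, or bypassed."""
    if force_refresh or not cache_file.exists():
        return True
    current = time.time() if now is None else now
    return (current - cache_file.stat().st_mtime) >= CACHE_TTL_SECONDS


# A path that does not exist always triggers a fetch
print(should_refetch(Path("no_such_dir") / "business.json"))
```

The optional `now` argument only exists to make the age check easy to test without waiting a day.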

🔧 Requirements

Python: 3.7+
Network: Internet connection for live feeds

📝 License

Licensed under the MIT License – see LICENSE file for details.

🤝 Contributing

We'd love your contributions! Whether it's:

🐛 Bug reports
✨ Feature requests
📝 Documentation improvements
🔗 Feed suggestions (see open-feeds)
💻 Pull requests

Please check out our Contributing Guide before getting started.

Ways to help:

Improve article extraction quality
Add language/region support
Write tests and documentation
Share and star the project ⭐
Contribute feeds to open-feeds

🙏 Acknowledgements

Built on the shoulders of amazing open-source projects:

feedparser – RSS parsing
googlenewsdecoder – URL decoding
BeautifulSoup4 – HTML parsing
lxml – XML processing
open-feeds – curated RSS feed lists

Made with ❤️ by Arajit Paul

⭐ Star us on GitHub | 📧 Email

Release history

0.2.0 (this version): Jun 29, 2026
0.1.2: Jun 21, 2026
0.1.1: Jun 19, 2026

Download files

Source Distribution

open_news_api-0.2.0.tar.gz (24.7 kB)

Uploaded Jun 29, 2026 Source

Built Distribution

open_news_api-0.2.0-py3-none-any.whl (11.3 kB)

Uploaded Jun 29, 2026 Python 3

File details

Details for the file open_news_api-0.2.0.tar.gz.

File metadata

Download URL: open_news_api-0.2.0.tar.gz
Upload date: Jun 29, 2026
Size: 24.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for open_news_api-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`84556d39ae92c1ce447784455119b2c4117b36f876c7a578c3c0c221ab98aaa8`
MD5	`65ed8ea88efa07d7186a39f3173082e7`
BLAKE2b-256	`950b0f06f283eeafaaf4444344bb8d7516e2d29373b1995fbdaa2118414bcb5c`

File details

Details for the file open_news_api-0.2.0-py3-none-any.whl.

File metadata

Download URL: open_news_api-0.2.0-py3-none-any.whl
Upload date: Jun 29, 2026
Size: 11.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for open_news_api-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`3be0332a2d511b4c46f8bf70defbe1628919eb029b082ad2ffa4b5c984dff5c4`
MD5	`9d5667e9d3fb2bb4baa8a4a8c5842b7e`
BLAKE2b-256	`786bd3751b8555ec9bc531f1b856303b828a1719743be39487fc5fe0b4d53dca`
