Minimal news fetching: article text, RSS, Google News search, RSS discovery
📰 open-news
Zero-Config News Fetching & Article Extraction for Python
A lightweight, batteries-included Python package for fetching news articles, extracting content, discovering RSS feeds, and batch processing with summarization.
Features • Installation • Quick Start • API Reference • Contributing
🔁 Latest Updates
[21-06-2026 - Latest] - Fixed a critical import bug, fixed publish_date extraction (3 code paths), added js=True support and rotating User-Agents. Release v0.1.2.
[18-06-2026] - Packaging metadata for PyPI release (no functional changes). Release v0.1.1.
[17-06-2026] - Initial stable release. Release v0.1.0.
View more on our Changelog
🎯 Features
- 📄 Article Extraction: Pulls full text and metadata (title, authors, publish date, top image) straight from a page's HTML using a built-in lxml-based extractor, with no third-party extraction library required.
- 📡 Live News Feeds: Access curated RSS feeds with zero local configuration.
- 🔍 Google News Search: Search across Google News with decoded URLs.
- 🔗 RSS Discovery: Auto-discover RSS feeds from any website.
- ⚡ Smart Caching: 24-hour feed caching to minimize network requests and improve performance.
- 🚀 Batch Processing & Summarization: Process multiple articles concurrently with built-in summarization.
📦 Installation
Important note: the package is published on PyPI as open-news-api instead of the project name open-news, due to some PyPI issues. The import name is still open_news.
From GitHub
git clone https://github.com/alphap365/open-news.git
cd open-news
pip install -e .
Direct Install
pip install open-news-api
To install a specific version
pip install open-news-api==0.1.2  # replace 0.1.2 with the version you want
Dependencies installed automatically:
beautifulsoup4 • lxml • python-dateutil • feedparser • googlenewsdecoder • httpx • requests
🚀 Quick Start
1️⃣ Extract Article Content
from open_news import fetch_article
article = fetch_article("https://www.bbc.com/news/world-us-canada-12345678")
print(article["title"])
print(article["text"][:500])
print(f"Source: {article['source']}")
print(f"Published: {article['publish_date']}")
2️⃣ Search Google News
from open_news import search_news
results = search_news("artificial intelligence", limit=5)
for article in results:
    print(f"✓ {article['title']}")
    print(f"  → {article['url']}\n")
3️⃣ Get Live News (Country-Specific)
from open_news import get_live_news
# Get top news from India
india_news = get_live_news(country="india", limit_per_feed=3)
for article in india_news:
    print(f"[{article['source']}] {article['title']}")
    print(f"Published: {article['published']}\n")
4️⃣ Get Category News
from open_news import get_live_news
# Business news from curated feeds
business = get_live_news(category="business", limit_per_feed=2)
for article in business:
    print(f"{article['title']}")
5️⃣ Discover & Fetch RSS Feeds
from open_news import get_articles_from_website_rss
# Auto-discover RSS from any website
articles = get_articles_from_website_rss("https://techcrunch.com", limit=5)
for article in articles:
    print(f"✓ {article['title']}")
6️⃣ Batch Fetch & Summarize Articles
from open_news import fetch_and_summarize_batch
urls = [
    "https://example.com/article1",
    "https://example.com/article2",
    "https://example.com/article3",
]
results = fetch_and_summarize_batch(urls, sentence_count=2, max_workers=3)
for result in results:
    if result["status"] == "success":
        print(f"📰 {result['title']}")
        print(f"   Summary: {result['summary']}\n")
    else:
        print(f"❌ Failed: {result['error']}")
7️⃣ Search & Summarize in One Call
from open_news import fetch_and_summarize_search_results
results = fetch_and_summarize_search_results(
    "climate change",
    limit=5,
    sentence_count=2,
    max_workers=3,
)
for article in results:
    print(f"🔗 {article['url']}")
    print(f"📰 {article['title']}")
    print(f"   {article['summary']}\n")
8️⃣ JS-Heavy Pages (Optional)
Some sites render their article body client-side and return little to nothing
in the raw HTML. For these, install the optional js extra:
pip install open-news-api[js]
playwright install chromium # one-time browser download
Then pass js=True to any fetch function:
from open_news import fetch_article
article = fetch_article("https://js-heavy-site.example.com/article", js=True)
js=True works on fetch_article, fetch_and_summarize_batch, and
fetch_and_summarize_search_results. It's slower per-request (a real browser
is launched), so consider lowering max_workers when batching with js=True.
If the js extra isn't installed, it logs a warning and transparently falls
back to the plain HTTP fetch instead of raising.
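The fallback described above follows a common optional-dependency pattern. The sketch below is not the package's actual code: fetch_html, plain_fetch, and browser_fetch are hypothetical names, and it only assumes that the js extra makes a module named playwright importable.

```python
import importlib.util
import logging

logger = logging.getLogger("js_fallback_sketch")


def js_backend_available(module_name="playwright"):
    """Return True if the optional browser dependency can be imported."""
    return importlib.util.find_spec(module_name) is not None


def fetch_html(url, js=False, plain_fetch=None, browser_fetch=None,
               module_name="playwright"):
    """Pick a fetch strategy, falling back to plain HTTP when JS support is missing.

    plain_fetch and browser_fetch are caller-supplied callables, used here as
    hypothetical stand-ins for the real HTTP and headless-browser fetchers.
    """
    if js and js_backend_available(module_name):
        return browser_fetch(url)
    if js:
        # Optional extra not installed: warn and fall back instead of raising.
        logger.warning(
            "JS rendering requested but %s is not installed; "
            "falling back to plain HTTP",
            module_name,
        )
    return plain_fetch(url)
```

Passing the fetchers in as arguments keeps the sketch runnable without a browser or network access; a real implementation would call its own HTTP client and headless browser instead.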
🔀 Function Names: New vs Legacy
Every function is available under two names — a short modern name and a longer legacy-style name (kept for backward compatibility with early releases). They are exact aliases; pick whichever reads better in your code.
| Short name | Legacy alias |
|---|---|
| get_article | fetch_article |
| search | search_news |
| live_news | get_live_news |
| discover_and_get | get_articles_from_website_rss |
| batch_summarize | fetch_and_summarize_batch |
| search_and_summarize | fetch_and_summarize_search_results |
Both forms are stable public API — neither is deprecated, and the docs below use the legacy names since they're more descriptive for newcomers, but feel free to import either.
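In Python, an "exact alias" usually means two names bound to the same function object. The sketch below illustrates that idea with a hypothetical stand-in for fetch_article; it is an assumption about how the aliases are defined, not a copy of the package's source.

```python
def fetch_article(url):
    """Hypothetical stand-in for the real function; returns a minimal dict."""
    return {"url": url, "title": None, "text": ""}


# An alias is just a second name bound to the same function object,
# so both names always behave identically.
get_article = fetch_article

result_a = fetch_article("https://example.com/a")
result_b = get_article("https://example.com/a")
```

If the package defines its aliases this way, importing either name gives the same callable, so mixing both styles in one codebase is harmless.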
📚 API Reference
fetch_article(url: str, js: bool = False) → Dict
(alias: get_article)
Extract article content and metadata from a given URL.
Returns:
{
    "url": str,            # Original article URL
    "title": str,          # Article headline
    "text": str,           # Full article text
    "authors": list,       # Author names, if found
    "publish_date": str,   # ISO 8601 timestamp, or None if undetected
    "top_image": str,      # Best-guess main image URL, or None
    "images": list,        # All image URLs found in the article body
    "videos": list,        # Embedded video URLs (YouTube, Vimeo, etc.)
    "source": str,         # Website domain
    "meta": dict           # Raw metadata: description, site name, keywords, JSON-LD
}
Example:
article = fetch_article("https://example.com/article")
if article["text"]:
    print(f"✓ Successfully extracted: {article['title']}")
else:
    print("✗ Could not extract article content")
A note on reliability: extraction quality depends entirely on how clean and structured the page's HTML is. Heavily templated sites with lots of navigation or ad markup around the article body may need some trial and error — if a result looks off, check meta and images for clues about what got picked up.
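One way to act on this note is a small sanity check on the returned dict before using it. The helpers below are a sketch: the 200-character threshold is arbitrary, and the "description" key inside meta is an assumption based on the field list above rather than a confirmed key name.

```python
def extraction_looks_usable(article, min_chars=200):
    """Heuristic check on a fetch_article-style result dict.

    min_chars is an arbitrary illustrative threshold, not a package setting.
    """
    text = (article.get("text") or "").strip()
    title = (article.get("title") or "").strip()
    return bool(title) and len(text) >= min_chars


def debug_clues(article):
    """Collect the fields worth inspecting when a result looks off."""
    meta = article.get("meta") or {}
    return {
        "description": meta.get("description"),  # assumed key name
        "image_count": len(article.get("images") or []),
        "has_publish_date": article.get("publish_date") is not None,
    }
```

If extraction_looks_usable returns False for a page that clearly has an article, retrying with js=True is a reasonable next step.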
search_news(query: str, limit: int = 10) → List[Dict]
(alias: search)
Search Google News for recent articles.
Parameters:
- query (str): Search terms
- limit (int): Maximum results to return (default: 10)
Returns:
[
    {
        "title": str,
        "url": str,          # Decoded real URL (when possible)
        "source": str,
        "published": str,    # ISO 8601 timestamp
        "description": str
    },
    ...
]
Example:
results = search_news("climate change", limit=5)
print(f"Found {len(results)} articles")
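Because published is documented as an ISO 8601 string, results can be ordered by date with the standard library. The sketch below uses made-up sample data shaped like the return value above; treating timestamps without an offset as UTC is an assumption, not documented behaviour.

```python
from datetime import datetime, timezone


def parse_published(value):
    """Parse an ISO 8601 string into an aware datetime (UTC assumed if naive)."""
    # datetime.fromisoformat only accepts a trailing "Z" on Python 3.11+.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def newest_first(results):
    """Return results sorted by their 'published' field, newest first."""
    return sorted(results, key=lambda r: parse_published(r["published"]), reverse=True)


# Made-up sample data in the documented shape.
sample = [
    {"title": "older", "published": "2026-06-17T08:00:00+00:00"},
    {"title": "newest", "published": "2026-06-21T09:30:00Z"},
    {"title": "middle", "published": "2026-06-18T12:00:00"},
]
ordered = [r["title"] for r in newest_first(sample)]
```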
get_live_news(country: str = None, category: str = "news", limit_per_feed: int = None) → List[Dict]
(alias: live_news)
Fetch articles from curated RSS feeds.
Parameters:
- country (str, optional): Country key, given as a lowercase name or short abbreviation
  - Examples: "india", "usa", "uk", "pakistan"
  - When set, category is ignored
- category (str): News category, used when no country is specified
  - Options: "news", "business", "politics", "geopolitics"
  - Default: "news"
- limit_per_feed (int, optional): Articles per feed (default comes from the remote config)
Returns:
[
    {
        "title": str,
        "url": str,
        "source": str,
        "published": str,
        "description": str
    },
    ...
]
Examples:
# Country-specific
india_news = get_live_news(country="india", limit_per_feed=5)
# Category-specific
business = get_live_news(category="business")
# Default news
general = get_live_news()
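Since a single call can return articles from several feeds, grouping them by source is often the first post-processing step. This is a small sketch that relies only on the documented title and source keys; the "unknown" bucket for a missing source is an illustrative choice.

```python
from collections import defaultdict


def group_by_source(articles):
    """Map each source name to the list of its article titles."""
    grouped = defaultdict(list)
    for article in articles:
        # Fall back to a placeholder bucket when the source is missing or empty.
        grouped[article.get("source") or "unknown"].append(article["title"])
    return dict(grouped)
```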
get_articles_from_website_rss(website_url: str, limit: int = 10) → List[Dict]
(alias: discover_and_get)
Discover and fetch articles from a website's RSS feed.
Parameters:
- website_url (str): Website homepage URL
- limit (int): Maximum articles to return
Returns: Same structure as get_live_news()
Example:
articles = get_articles_from_website_rss("https://hackernews.com", limit=10)
for article in articles:
    print(f"• {article['title']}")
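For background, sites commonly advertise their feeds through link tags with rel="alternate" in the page head. The sketch below shows that widely used convention with the standard library only; it is not necessarily how this package discovers feeds, and FeedLinkParser and discover_feeds are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collect href values of <link rel="alternate"> tags that advertise a feed."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = {name: (value or "") for name, value in attrs}
        rel = attr.get("rel", "").lower().split()
        if "alternate" in rel and attr.get("type", "").lower() in FEED_TYPES:
            href = attr.get("href")
            if href:
                # Resolve relative feed paths against the page URL.
                self.feeds.append(urljoin(self.base_url, href))


def discover_feeds(html, base_url):
    """Return absolute feed URLs advertised in the given HTML document."""
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    return parser.feeds
```

The function takes HTML as a string so it can be tried on a saved page without any network access.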
fetch_and_summarize_batch(urls, sentence_count=3, include_full_text=False, include_images_videos=False, max_workers=5, timeout_per_article=30, js=False) → List[Dict]
(alias: batch_summarize)
Parameters:
- urls (List[str]): Article URLs to process
- sentence_count (int): Sentences per summary (default: 3)
- include_full_text (bool): Include full article text in results (default: False)
- include_images_videos (bool): Include images, videos, and top_image in results (default: False)
- max_workers (int): Concurrent threads (default: 5)
- timeout_per_article (int): Timeout per article in seconds (default: 30)
- js (bool): Render pages with a headless browser before extraction (default: False, requires the [js] extra)
Returns:
[
    {
        "url": str,
        "status": str,       # "success" or "failed"
        "title": str,
        "summary": str,
        "text": str,         # only if include_full_text=True
        "images": list,      # only if include_images_videos=True
        "videos": list,      # only if include_images_videos=True
        "top_image": str,    # only if include_images_videos=True
        "error": str         # only present when status == "failed"
    },
    ...
]
A timeout just shows up as a "failed" result with the timeout message in error — there's no separate "timeout" status, so check error if you need to distinguish why something failed.
Example:
from open_news import fetch_and_summarize_batch
urls = ["https://example.com/1", "https://example.com/2"]
results = fetch_and_summarize_batch(urls, sentence_count=2)
for result in results:
    if result["status"] == "success":
        print(f"✓ {result['title']}")
        print(f"  {result['summary']}")
    else:
        print(f"✗ {result['url']}: {result['error']}")
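Since timeouts are reported as ordinary "failed" results, separating them means inspecting the error text. The helper below is a sketch: the exact wording of the timeout message is not documented here, so the "timeout" substring is an assumption to verify against real output.

```python
def partition_results(results, timeout_marker="timeout"):
    """Split batch results into successes, timeouts, and other failures.

    timeout_marker is an assumed substring of the timeout error message;
    check the wording your version produces before relying on it.
    """
    ok, timed_out, failed = [], [], []
    for result in results:
        if result.get("status") == "success":
            ok.append(result)
        elif timeout_marker in (result.get("error") or "").lower():
            timed_out.append(result)
        else:
            failed.append(result)
    return ok, timed_out, failed
```

Timed-out URLs are natural candidates for a retry with a larger timeout_per_article, while other failures usually need a look at the error itself.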
fetch_and_summarize_search_results(query, limit=10, sentence_count=3, include_full_text=False, include_images_videos=False, max_workers=5, js=False) → List[Dict]
(alias: search_and_summarize)
Parameters:
- query (str): Search term
- limit (int): Max results (default: 10)
- sentence_count (int): Sentences per summary (default: 3)
- include_full_text (bool): Include full text (default: False)
- include_images_videos (bool): Include images/videos (default: False)
- max_workers (int): Concurrent threads (default: 5)
- js (bool): Render pages with a headless browser (default: False)
Returns: Merged list combining search metadata with extracted & summarized content
Example:
from open_news import fetch_and_summarize_search_results
results = fetch_and_summarize_search_results(
    "artificial intelligence",
    limit=5,
    sentence_count=2,
    max_workers=3,
)
for article in results:
    print(f"Title: {article['title']}")
    print(f"Summary: {article['summary']}")
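To make the "merged list" idea concrete, the sketch below overlays summary records onto search records that share a URL. It is an illustration with a hypothetical helper name (merge_by_url); the package's real merge logic and key precedence may differ.

```python
def merge_by_url(search_results, summaries):
    """Overlay each summary dict onto the search result with the same URL."""
    summaries_by_url = {item["url"]: item for item in summaries}
    merged = []
    for result in search_results:
        combined = dict(result)  # start from the search metadata
        # Extracted fields win on key clashes; unmatched results stay as-is.
        combined.update(summaries_by_url.get(result["url"], {}))
        merged.append(combined)
    return merged
```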
📡 RSS Feeds
This package uses curated RSS feed definitions from the open-feeds repository.
Feed Sources
- Country-specific feeds (India, USA, UK, Pakistan, etc.)
- Category feeds: General news, Business, Politics, Geopolitics
- All feeds are community-maintained and regularly tested
Using the Feeds
The get_live_news() function fetches feeds dynamically from the open-feeds repository, so you always get the latest available feeds.
Contributing to Feeds
To add new RSS feeds or report broken feeds, visit the open-feeds repository and follow their contributing guidelines.
⚙️ Caching
Feeds are automatically cached for 24 hours in ~/.open_news/feeds_cache/ to reduce network requests.
Current implementation: Cache is managed internally. Force refresh by clearing the cache directory if needed.
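Clearing the cache directory can be scripted. The helper below is a sketch built on the path given above; clear_feed_cache is a hypothetical name, not a package function, and it assumes the package recreates the directory on the next fetch.

```python
import shutil
from pathlib import Path

# Cache location as stated in this section.
DEFAULT_CACHE_DIR = Path.home() / ".open_news" / "feeds_cache"


def clear_feed_cache(cache_dir=DEFAULT_CACHE_DIR):
    """Delete the cache directory; return True if something was removed."""
    cache_dir = Path(cache_dir)
    if not cache_dir.exists():
        return False
    shutil.rmtree(cache_dir)
    return True
```

Calling clear_feed_cache() before get_live_news() should force a fresh download of the feed definitions, at the cost of one extra network round trip.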
🔧 Requirements
- Python: 3.7+
- Network: Internet connection for live feeds
📝 License
Licensed under the MIT License – see LICENSE file for details.
🤝 Contributing
We'd love your contributions! Whether it's:
- 🐛 Bug reports
- ✨ Feature requests
- 📝 Documentation improvements
- 🔗 Feed suggestions (see open-feeds)
- 💻 Pull requests
Please check out our Contributing Guide before getting started.
Ways to help:
- Improve article extraction quality
- Add language/region support
- Write tests and documentation
- Share and star the project ⭐
- Contribute feeds to open-feeds
🙏 Acknowledgements
Built on the shoulders of amazing open-source projects:
- feedparser – RSS parsing
- googlenewsdecoder – URL decoding
- BeautifulSoup4 – HTML parsing
- lxml – XML processing
- open-feeds – Curated RSS feed definitions
Made with ❤️ by Arajit Paul