
Minimal news fetching: article text, RSS, Google News search, RSS discovery


📰 open-news

Zero-Config News Fetching & Article Extraction for Python


A lightweight, batteries-included Python package for fetching news articles, extracting content, discovering RSS feeds, and batch processing with summarization.

Features · Installation · Quick Start · API Reference · Contributing


🔁 Latest Updates

[18-06-2026 - Latest] - Packaging metadata for PyPI release (no functional changes), Release v0.1.1.

[17-06-2026] - Initial Stable Release v0.1.0

View more on our Changelog

🎯 Features

📄 Article Extraction

Pulls full text and metadata (title, authors, publish date, top image) straight from a page's HTML using a built-in lxml-based extractor — no third-party extraction library required.

📡 Live News Feeds

Access curated RSS feeds with zero local configuration:

  • 50+ country-specific feeds (India, USA, Pakistan, etc.)
  • Category feeds (business, politics, geopolitics)
  • Sourced from open-feeds

🔍 Google News Search

Search across Google News with decoded URLs:

  • Real article links (via googlenewsdecoder), with a graceful fallback to the raw redirect URL if decoding fails
  • Rich metadata included
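
Because decoding can fall back to the raw redirect URL, it may be useful to check which results still point at Google News. The sketch below assumes raw redirect links are hosted on news.google.com; the helper name is_google_redirect and the sample data are illustrative and not part of open-news.

```python
from urllib.parse import urlparse


def is_google_redirect(url: str) -> bool:
    """Return True if the URL still points at Google News (assumed redirect host)."""
    host = urlparse(url).hostname or ""
    return host == "news.google.com"


# Sample data shaped like search_news() results (made up for illustration)
results = [
    {"title": "Decoded", "url": "https://www.example.com/story"},
    {"title": "Fallback", "url": "https://news.google.com/rss/articles/CBMiExample"},
]

# Collect the titles whose links were not decoded to a publisher URL
undecoded = [item["title"] for item in results if is_google_redirect(item["url"])]
print(undecoded)
```

Results flagged this way can be skipped or retried later, depending on what your pipeline needs.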

🔗 RSS Discovery

Auto-discover RSS feeds from any website:

  • Built with BeautifulSoup + lxml
  • Fetch articles from discovered feeds instantly
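
For readers curious what feed auto-discovery involves, here is a minimal sketch of the common approach: scan a page for link tags with rel="alternate" and an RSS or Atom type. It uses the standard library's html.parser rather than BeautifulSoup + lxml so that it runs without extra installs, and it is not the package's actual implementation. The names FeedLinkParser and discover_feeds are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collect feed URLs advertised through <link rel="alternate"> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = {name: (value or "") for name, value in attrs}
        rel = attr.get("rel", "").lower().split()
        is_feed = attr.get("type", "").lower() in FEED_TYPES
        if "alternate" in rel and is_feed and attr.get("href"):
            # Resolve relative hrefs against the page URL
            self.feeds.append(urljoin(self.base_url, attr["href"]))


def discover_feeds(html, base_url):
    """Return every feed URL found in the given HTML string."""
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    return parser.feeds


sample_html = """
<html><head>
<link rel="stylesheet" href="/style.css">
<link rel="alternate" type="application/rss+xml" href="/feed.xml">
</head><body></body></html>
"""
print(discover_feeds(sample_html, "https://example.com"))
# ['https://example.com/feed.xml']
```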

⚡ Smart Caching

24-hour feed caching to minimize network requests and improve performance

🚀 Batch Processing & Summarization

Process multiple articles concurrently with built-in summarization:

  • Fetch and summarize batch URLs
  • Search Google News + fetch + summarize in one call
  • Configurable concurrency and timeouts
  • Lightweight extractive summarization
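
To give a sense of what "extractive" means here, the sketch below picks the highest-scoring sentences from the text instead of generating new ones. It uses a simple word-frequency score and is only an illustration; the scoring open-news actually applies is not documented on this page and may differ. The function name summarize is hypothetical.

```python
import re
from collections import Counter


def summarize(text: str, sentence_count: int = 3) -> str:
    """Return the top-scoring sentences, kept in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if len(sentences) <= sentence_count:
        return " ".join(sentences)

    # Count longer words only, as a crude stand-in for stop-word removal
    words = re.findall(r"[a-z']+", text.lower())
    freq = Counter(word for word in words if len(word) > 3)

    def score(sentence: str) -> float:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[token] for token in tokens) / (len(tokens) or 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:sentence_count])
    return " ".join(sentences[i] for i in keep)


text = (
    "Solar power is growing fast. "
    "Solar panels are cheaper and solar power output is rising. "
    "The weather was nice."
)
print(summarize(text, sentence_count=1))
# Solar power is growing fast.
```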

📦 Installation

Important note: the package is published on PyPI as open-news-api instead of the project name open-news, due to some PyPI issues. The import name is still open_news.

From GitHub

git clone https://github.com/alphap365/open-news.git
cd open-news
pip install -e .

Direct Install

pip install open-news-api

To install a specific version

pip install open-news-api==0.1.1  # Replace 0.1.1 with the version you want

Dependencies installed automatically:

  • beautifulsoup4
  • lxml
  • python-dateutil
  • feedparser
  • googlenewsdecoder
  • httpx
  • requests

🚀 Quick Start

1️⃣ Extract Article Content

from open_news import fetch_article

article = fetch_article("https://www.bbc.com/news/world-us-canada-12345678")

print(article["title"])
print(article["text"][:500])
print(f"Source: {article['source']}")
print(f"Published: {article['publish_date']}")

2️⃣ Search Google News

from open_news import search_news

results = search_news("artificial intelligence", limit=5)

for article in results:
    print(f"✓ {article['title']}")
    print(f"  → {article['url']}\n")

3️⃣ Get Live News (Country-Specific)

from open_news import get_live_news

# Get top news from India
india_news = get_live_news(country="india", limit_per_feed=3)

for article in india_news:
    print(f"[{article['source']}] {article['title']}")
    print(f"Published: {article['published']}\n")

4️⃣ Get Category News

from open_news import get_live_news

# Business news from curated feeds
business = get_live_news(category="business", limit_per_feed=2)

for article in business:
    print(f"{article['title']}")

5️⃣ Discover & Fetch RSS Feeds

from open_news import get_articles_from_website_rss

# Auto-discover RSS from any website
articles = get_articles_from_website_rss("https://techcrunch.com", limit=5)

for article in articles:
    print(f"✓ {article['title']}")

6️⃣ Batch Fetch & Summarize Articles

from open_news import fetch_and_summarize_batch

urls = [
    "https://example.com/article1",
    "https://example.com/article2",
    "https://example.com/article3",
]

results = fetch_and_summarize_batch(urls, sentence_count=2, max_workers=3)

for result in results:
    if result["status"] == "success":
        print(f"📰 {result['title']}")
        print(f"   Summary: {result['summary']}\n")
    else:
        print(f"❌ Failed: {result['error']}")

7️⃣ Search & Summarize in One Call

from open_news import fetch_and_summarize_search_results

results = fetch_and_summarize_search_results(
    "climate change",
    limit=5,
    sentence_count=2,
    max_workers=3
)

for article in results:
    print(f"🔗 {article['url']}")
    print(f"📰 {article['title']}")
    print(f"   {article['summary']}\n")

📚 API Reference

fetch_article(url: str) → Dict

Extract article content and metadata from a given URL.

Returns:

{
    "url": str,            # Original article URL
    "title": str,          # Article headline
    "text": str,           # Full article text
    "authors": list,       # Author names, if found
    "publish_date": str,   # ISO 8601 timestamp, or None if undetected
    "top_image": str,      # Best-guess main image URL, or None
    "images": list,        # All image URLs found in the article body
    "videos": list,        # Embedded video URLs (YouTube, Vimeo, etc.)
    "source": str,         # Website domain
    "meta": dict           # Raw metadata: description, site name, keywords, JSON-LD
}

Example:

from open_news import fetch_article

article = fetch_article("https://example.com/article")
if article["text"]:
    print(f"✓ Successfully extracted: {article['title']}")
else:
    print("✗ Could not extract article content")

A note on reliability: extraction quality depends entirely on how clean and structured the page's HTML is. Heavily templated sites with lots of navigation or ad markup around the article body may need some trial and error — if a result looks off, check meta and images for clues about what got picked up.
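
One way to act on that advice is a small sanity check over the returned dictionary. The sketch below relies only on the keys documented above; the helper name extraction_problems and the 200-character threshold are arbitrary choices for illustration, not part of open-news.

```python
def extraction_problems(article: dict, min_chars: int = 200) -> list:
    """List reasons why an extracted article might need a closer look."""
    problems = []
    text = article.get("text") or ""
    if not article.get("title"):
        problems.append("missing title")
    if len(text) < min_chars:
        problems.append(f"text shorter than {min_chars} characters")
    if not article.get("publish_date"):
        problems.append("no publish date detected")
    return problems


# Sample data shaped like a fetch_article() result (made up for illustration)
sample = {
    "url": "https://example.com/article",
    "title": "Example headline",
    "text": "Too short.",
    "publish_date": None,
}
print(extraction_problems(sample))
# ['text shorter than 200 characters', 'no publish date detected']
```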


search_news(query: str, limit: int = 10) → List[Dict]

Search Google News for recent articles.

Parameters:

  • query (str): Search terms
  • limit (int): Maximum results to return (default: 10)

Returns:

[
    {
        "title": str,
        "url": str,           # Decoded real URL (when possible)
        "source": str,
        "published": str,     # ISO 8601 timestamp
        "description": str
    },
    ...
]

Example:

from open_news import search_news

results = search_news("climate change", limit=5)
print(f"Found {len(results)} articles")
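
Since published is documented as an ISO 8601 string, results can be de-duplicated and ordered with the standard library alone. This is a sketch under that assumption; the helper names parse_published and newest_first are hypothetical, and the sample data is made up.

```python
from datetime import datetime, timezone


def parse_published(value):
    """Parse an ISO 8601 string; unparseable values sort as the oldest."""
    try:
        parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    except (AttributeError, ValueError):
        return datetime.min.replace(tzinfo=timezone.utc)
    if parsed.tzinfo is None:
        # Treat naive timestamps as UTC so all values are comparable
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def newest_first(results):
    """Drop repeated URLs (keeping the first) and sort by publish time, newest first."""
    seen = set()
    unique = []
    for item in results:
        if item["url"] not in seen:
            seen.add(item["url"])
            unique.append(item)
    return sorted(unique, key=lambda item: parse_published(item.get("published")), reverse=True)


sample_results = [
    {"title": "Older", "url": "https://example.com/a", "published": "2026-06-16T08:00:00+00:00"},
    {"title": "Newer", "url": "https://example.com/b", "published": "2026-06-17T09:30:00Z"},
    {"title": "Older (duplicate)", "url": "https://example.com/a", "published": "2026-06-16T08:00:00+00:00"},
]
print([item["title"] for item in newest_first(sample_results)])
# ['Newer', 'Older']
```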

get_live_news(country: str = None, category: str = "news", limit_per_feed: int = None) → List[Dict]

Fetch articles from curated RSS feeds.

Parameters:

  • country (str, optional): Country name or abbreviation
    • Examples: "india", "usa", "uk", "pakistan"
    • When set, category is ignored
  • category (str): News category when no country specified
    • Options: "news", "business", "politics", "geopolitics"
    • Default: "news"
  • limit_per_feed (int, optional): Articles per feed (default from remote config)
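
The precedence between country and category can be pictured with the sketch below. It only mirrors the rules stated in the parameter list; resolve_feed_selection is a hypothetical helper, and raising ValueError for an unknown category is an assumption, since this page does not say how get_live_news() reacts to one.

```python
CATEGORIES = ("news", "business", "politics", "geopolitics")


def resolve_feed_selection(country=None, category="news"):
    """Mirror the documented precedence: country wins, otherwise category."""
    if country:
        # When a country is given, the category argument is ignored
        return ("country", country)
    if category not in CATEGORIES:
        # Assumed behavior for illustration only
        raise ValueError(f"Unknown category {category!r}; expected one of {CATEGORIES}")
    return ("category", category)


print(resolve_feed_selection(country="india", category="business"))
# ('country', 'india')
print(resolve_feed_selection(category="business"))
# ('category', 'business')
print(resolve_feed_selection())
# ('category', 'news')
```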

Returns:

[
    {
        "title": str,
        "url": str,
        "source": str,
        "published": str,
        "description": str
    },
    ...
]

Examples:

from open_news import get_live_news

# Country-specific
india_news = get_live_news(country="india", limit_per_feed=5)

# Category-specific
business = get_live_news(category="business")

# Default news
general = get_live_news()

get_articles_from_website_rss(website_url: str, limit: int = 10) → List[Dict]

Discover and fetch articles from a website's RSS feed.

Parameters:

  • website_url (str): Website homepage URL
  • limit (int): Maximum articles to return

Returns: Same structure as get_live_news()

Example:

from open_news import get_articles_from_website_rss

articles = get_articles_from_website_rss("https://hackernews.com", limit=10)
for article in articles:
    print(f"• {article['title']}")

fetch_and_summarize_batch(urls: List[str], include_full_text: bool = False, sentence_count: int = 3, max_workers: int = 5, timeout_per_article: int = 30) → List[Dict]

Fetch and summarize multiple articles concurrently.

Parameters:

  • urls (List[str]): Article URLs to process
  • include_full_text (bool): Include full article text in results (default: False)
  • sentence_count (int): Sentences per summary (default: 3)
  • max_workers (int): Concurrent threads (default: 5, adjust based on CPU/network)
  • timeout_per_article (int): Timeout per article in seconds (default: 30)

Returns:

[
    {
        "url": str,
        "status": str,        # "success" or "failed"
        "title": str,         # Article title (empty string if failed)
        "summary": str,       # Summarized content (empty string if failed)
        "text": str,          # Full text (only present if include_full_text=True)
        "error": str          # Present only when status is "failed" — includes timeouts
    },
    ...
]

A timeout just shows up as a "failed" result with the timeout message in error — there's no separate "timeout" status, so check error if you need to distinguish why something failed.
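
If you do need to separate timeouts from other failures, a small helper over the result list may be enough. The exact wording of the timeout message is not documented here, so the substrings checked below ("timeout", "timed out") are an assumption you should adjust after looking at a real error value. The name partition_results and the sample data are hypothetical.

```python
# Assumed wording of timeout messages; verify against real results
TIMEOUT_HINTS = ("timeout", "timed out")


def partition_results(results):
    """Group batch results into successes, timeouts, and other failures."""
    buckets = {"success": [], "timeout": [], "failed": []}
    for result in results:
        if result.get("status") == "success":
            buckets["success"].append(result)
            continue
        message = (result.get("error") or "").lower()
        kind = "timeout" if any(hint in message for hint in TIMEOUT_HINTS) else "failed"
        buckets[kind].append(result)
    return buckets


# Sample data shaped like fetch_and_summarize_batch() results (made up)
sample_batch = [
    {"url": "https://example.com/1", "status": "success", "title": "One", "summary": "Text."},
    {"url": "https://example.com/2", "status": "failed", "title": "", "summary": "", "error": "Timed out after 30s"},
    {"url": "https://example.com/3", "status": "failed", "title": "", "summary": "", "error": "HTTP 404"},
]
groups = partition_results(sample_batch)
print({kind: [item["url"] for item in items] for kind, items in groups.items()})
```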

Example:

from open_news import fetch_and_summarize_batch

urls = ["https://example.com/1", "https://example.com/2"]
results = fetch_and_summarize_batch(urls, sentence_count=2)

for result in results:
    if result["status"] == "success":
        print(f"✓ {result['title']}")
        print(f"  {result['summary']}")
    else:
        print(f"✗ {result['url']}: {result['error']}")

fetch_and_summarize_search_results(query: str, limit: int = 10, sentence_count: int = 3, **kwargs) → List[Dict]

Search Google News, fetch, and summarize all results in one call.

Parameters:

  • query (str): Search term
  • limit (int): Max results (default: 10)
  • sentence_count (int): Sentences per summary (default: 3)
  • **kwargs: Additional arguments passed to fetch_and_summarize_batch (e.g., max_workers, timeout_per_article)

Returns: Merged list combining search metadata with extracted & summarized content

Example:

from open_news import fetch_and_summarize_search_results

results = fetch_and_summarize_search_results(
    "artificial intelligence",
    limit=5,
    sentence_count=2,
    max_workers=3
)

for article in results:
    print(f"Title: {article['title']}")
    print(f"Summary: {article['summary']}")

📡 RSS Feeds

This package uses curated RSS feed definitions from the open-feeds repository.

Feed Sources

  • Country-specific feeds (India, USA, UK, Pakistan, etc.)
  • Category feeds: General news, Business, Politics, Geopolitics
  • All feeds are community-maintained and regularly tested

Using the Feeds

The get_live_news() function fetches feeds dynamically from the open-feeds repository, so you always get the latest available feeds.

Contributing to Feeds

To add new RSS feeds or report broken feeds, visit the open-feeds repository and follow their contributing guidelines.


⚙️ Caching

Feeds are automatically cached for 24 hours in ~/.open_news/feeds_cache/ to reduce network requests.

Current implementation: Cache is managed internally. Force refresh by clearing the cache directory if needed.
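
Clearing the cache directory can be scripted. The sketch below assumes the cache lives at the path stated above and that deleting the whole directory is safe, which this page implies but does not spell out. clear_feed_cache is a hypothetical helper, not part of open-news, and the demo runs against a throwaway directory with a made-up file name so your real cache is left untouched.

```python
import shutil
import tempfile
from pathlib import Path

# Cache location as documented above
DEFAULT_CACHE_DIR = Path("~/.open_news/feeds_cache")


def clear_feed_cache(cache_dir=DEFAULT_CACHE_DIR) -> bool:
    """Delete the cache directory; return True if something was removed."""
    cache_dir = Path(cache_dir).expanduser()
    if not cache_dir.is_dir():
        return False
    shutil.rmtree(cache_dir)
    return True


# Demo against a temporary directory instead of the real cache
demo_dir = Path(tempfile.mkdtemp()) / "feeds_cache"
demo_dir.mkdir()
(demo_dir / "india.json").write_text("{}")  # hypothetical cache file name

print(clear_feed_cache(demo_dir))  # True
print(clear_feed_cache(demo_dir))  # False
```

To refresh the real cache, call clear_feed_cache() with no arguments.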


🔧 Requirements

  • Python: 3.7+
  • Network: Internet connection for live feeds

📝 License

Licensed under the MIT License – see LICENSE file for details.


🤝 Contributing

We'd love your contributions! Whether it's:

  • 🐛 Bug reports
  • ✨ Feature requests
  • 📝 Documentation improvements
  • 🔗 Feed suggestions (see open-feeds)
  • 💻 Pull requests

Please check out our Contributing Guide before getting started.

Ways to help:

  • Improve article extraction quality
  • Add language/region support
  • Write tests and documentation
  • Share and star the project ⭐
  • Contribute feeds to open-feeds

🙏 Acknowledgements

Built on the shoulders of amazing open-source projects:

