Advanced news feed extractor and finder library. It automatically extracts news from websites that have no RSS/Atom feeds.

newsworker

Turn any news page into an RSS/Atom feed — even when the site publishes no feed at all.

newsworker is a Python 3 library and command-line tool that extracts news feeds from plain HTML pages. It is built for the common case where a site publishes fresh news but offers no RSS/Atom feed, and where generic "page change" monitors are too noisy to be useful.

The extracted feed can be emitted as JSON, RSS, Atom, CSV or OPML, so you can plug it straight into a feed reader, a pipeline, or your own storage.


How it works

The core idea is simple. Most news pages carry a publication date next to each item — 2017-09-27, 1 jul 2016, 18/06/2018, and hundreds of other variants. newsworker:

  1. Finds every date on the page using qddate, a fast pattern-based date parser that recognizes 340+ date formats across many languages.
  2. Clusters repeated, similarly-structured date nodes to tell apart a page date (footer, "last updated") from the news list area.
  3. Reconstructs each news item around its date node, pulling out the title, description, link and image.

The result is a structured feed you can serialize into whatever format you need.
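
The three steps above can be illustrated with a toy version built on the standard library only. This is a sketch of the idea, not newsworker's implementation: a single regular expression stands in for qddate, the tag path of each date node stands in for the real structural clustering, and the names `DateNodeFinder` and `news_dates` are hypothetical.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

# Stand-in for qddate (assumption): recognizes only ISO dates such as 2017-09-27.
ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

# Void elements never get a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class DateNodeFinder(HTMLParser):
    """Record the tag path of every text node that consists of a date."""

    def __init__(self):
        super().__init__()
        self.stack = []                    # currently open tags
        self.by_path = defaultdict(list)   # tag path -> dates found there

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if ISO_DATE.match(text):
            self.by_path["/".join(self.stack)].append(text)


def news_dates(html):
    """Return the dates from the most frequently repeated tag path."""
    finder = DateNodeFinder()
    finder.feed(html)
    if not finder.by_path:
        return []
    return max(finder.by_path.values(), key=len)


PAGE = """
<ul>
  <li><span>2018-06-18</span><a href="/a">First story</a></li>
  <li><span>2018-06-17</span><a href="/b">Second story</a></li>
</ul>
<footer><p>2018-01-01</p></footer>
"""

print(news_dates(PAGE))
```

The two dates under `ul/li/span` repeat with the same structure and are kept as the news list, while the single footer date is treated as a page date and dropped.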


Installation

pip install newsworker

Requires Python 3.7+. Installing from source:

git clone https://github.com/ivbeg/newsworker.git
cd newsworker
pip install -e .

Quick start

Extract a feed from a page and print it as RSS:

newsworker extract "https://www.eib.org/en/index.htm" --format rss

Discover feeds already published on a site and export them as an OPML subscription list:

newsworker scan "https://www.dta.gov.au/news/" --format opml --output feeds.opml

Or use it directly from Python:

from newsworker.extractor import FeedExtractor

extractor = FeedExtractor(filtered_text_length=150)
feed, session = extractor.get_feed(url="https://www.eib.org/en/index.htm")

for item in feed["items"]:
    print(item["pubdate"], item["title"])

Command-line interface

The package installs a single newsworker executable exposing five commands:

newsworker [COMMAND] [ARGS] [OPTIONS]

Commands:
  extract    Extract feed records from a web page
  serve      Run a local HTTP server exposing pages as RSS/Atom/JSON/CSV feeds
  scan       Scan a page and find existing feeds
  analyze    Analyze a page and generate a reusable YAML parsing spec
  parsedate  Parse a date/time string (debugging helper)

Add --verbose / -v to any command for detailed execution logs.

extract — build a feed from a page

Extracts news items from an HTML page and renders them in the chosen format.

newsworker extract URL [OPTIONS]

| Option | Alias | Default | Description |
| --- | --- | --- | --- |
| --format | -f | json | Output format: json, rss, atom, csv. |
| --output | -o | (stdout) | Write the result to a file instead of printing it. |
| --spec | -s | | Path to a YAML spec produced by analyze. Uses fast deterministic extraction instead of the dynamic heuristics. |
| --no-cache | | false | Bypass the spec and content caches for this run. |
| --refresh | | false | Force re-fetching the page, ignoring cached content. |
| --config | -c | (default) | Path to a settings YAML file (see Settings and caching). |
| --verbose | -v | false | Verbose logging. |

By default, extract builds a parsing spec dynamically on the first run for a URL and caches it, along with the fetched page content, under the configured cache directory. Subsequent runs reuse the cached spec (deterministic, fast) and the cached page (until its TTL expires). See Settings and caching.

Examples:

# Default JSON output
newsworker extract "https://example.com/news"

# RSS 2.0 to stdout
newsworker extract "https://example.com/news" -f rss

# Atom saved to a file
newsworker extract "https://example.com/news" -f atom -o feed.xml

# CSV table of items
newsworker extract "https://example.com/news" -f csv -o news.csv

# Fast, repeatable extraction using a pre-built spec
newsworker extract "https://example.com/news" -s example.yaml -f rss

# Force a re-fetch, ignoring the cached page content
newsworker extract "https://example.com/news" --refresh

serve — local feed server

Runs a lightweight local HTTP server (built on the Python standard library, no extra dependencies) that turns any page URL into a feed on demand over GET. Because the feed URLs are plain GET requests, you can paste them straight into any RSS reader and let it poll for updates.

newsworker serve [OPTIONS]

| Option | Alias | Default | Description |
| --- | --- | --- | --- |
| --host | -h | 127.0.0.1 | Interface to bind. Overrides the settings value. |
| --port | -p | 8787 | Port to listen on. Overrides the settings value. |
| --config | -c | (default) | Path to a settings YAML file. |
| --cache-dir | | (settings) | Directory for cached specs and page content. |
| --content-ttl | | (settings) | Seconds a cached page stays fresh. |
| --verbose | -v | false | Verbose logging. |

Endpoints:

| Route | Description |
| --- | --- |
| GET /feed?url=<page>&format=atom | Build a feed from <page>. format is one of atom (default), rss, json, csv. Add &refresh=1 to bypass the caches for one request. |
| GET /health | Health check (returns ok). |
| GET / | Short usage help. |

Example — start the server and subscribe from a reader:

newsworker serve --port 8787

Then add this URL to your RSS reader (URL-encode the page URL):

http://127.0.0.1:8787/feed?url=https%3A%2F%2Fexample.com%2Fnews&format=atom
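
Encoding the page URL by hand is easy to get wrong, so a small helper can build the subscription URL. This is a minimal sketch: the function name `feed_url` is hypothetical, while the route, the query parameters and the default host and port are the ones documented above.

```python
from urllib.parse import urlencode


def feed_url(page_url, fmt="atom", host="127.0.0.1", port=8787, refresh=False):
    """Build a /feed URL for a locally running `newsworker serve`."""
    params = {"url": page_url, "format": fmt}
    if refresh:
        params["refresh"] = "1"
    # urlencode percent-encodes ":" and "/" so the page URL survives as one parameter.
    return f"http://{host}:{port}/feed?{urlencode(params)}"


print(feed_url("https://example.com/news"))
# http://127.0.0.1:8787/feed?url=https%3A%2F%2Fexample.com%2Fnews&format=atom
```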

The first request for a URL builds and caches a parsing spec dynamically; later requests reuse the cached spec and serve the cached page content until its TTL expires, so the reader can poll frequently without hammering the source site.

scan — discover existing feeds

Scans a page for already-published RSS/Atom feeds (via autodiscovery links, feed icons and link heuristics) and reports them.

newsworker scan URL [OPTIONS]

| Option | Alias | Default | Description |
| --- | --- | --- | --- |
| --format | -f | json | Output format: json, rss, atom, csv, opml. |
| --output | -o | (stdout) | Write the result to a file instead of printing it. |
| --verbose | -v | false | Verbose logging. |

Examples:

# Default JSON list of discovered feeds
newsworker scan "https://www.dta.gov.au/news/"

# OPML subscription list ready to import into a feed reader
newsworker scan "https://www.dta.gov.au/news/" -f opml -o feeds.opml

# CSV table of discovered feeds
newsworker scan "https://www.dta.gov.au/news/" -f csv

# Represent each discovered feed as an entry in a single RSS/Atom feed
newsworker scan "https://www.dta.gov.au/news/" -f rss

Note: scan verifies every candidate feed by parsing it, so it may take longer than a raw link scan. feedtype, num_entries and language metadata are included where available.
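
To make the autodiscovery part concrete, here is a minimal sketch that collects `<link rel="alternate">` feed links with the standard library. It is not the FeedsFinder implementation: it covers autodiscovery links only (no feed icons, link heuristics or verification), and the names `AutodiscoveryParser` and `discover_feeds` are hypothetical. The result keys mirror the title, url and feedtype fields that scan reports.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

FEED_TYPES = {"application/rss+xml": "rss", "application/atom+xml": "atom"}


class AutodiscoveryParser(HTMLParser):
    """Collect <link rel="alternate"> tags that advertise a feed."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = {name: (value or "") for name, value in attrs}
        rel = attr.get("rel", "").lower().split()
        feedtype = FEED_TYPES.get(attr.get("type", "").lower())
        if "alternate" in rel and feedtype and attr.get("href"):
            self.feeds.append({
                "title": attr.get("title", ""),
                "url": urljoin(self.base_url, attr["href"]),  # resolve relative hrefs
                "feedtype": feedtype,
            })


def discover_feeds(html, base_url):
    """Return the feeds advertised in the page's <link> tags."""
    parser = AutodiscoveryParser(base_url)
    parser.feed(html)
    return parser.feeds


HTML = """<html><head>
<link rel="stylesheet" href="/site.css">
<link rel="alternate" type="application/rss+xml" title="News" href="/feed.xml">
</head><body></body></html>"""

print(discover_feeds(HTML, "https://example.com/news/"))
# [{'title': 'News', 'url': 'https://example.com/feed.xml', 'feedtype': 'rss'}]
```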

analyze — generate a reusable spec

Runs the dynamic heuristics once and distills them into a portable YAML parsing spec. Feeding that spec back into extract --spec skips the expensive analysis step and runs deterministic selectors, which is far faster on repeat crawls of the same layout.

newsworker analyze URL [--output spec.yaml]

# Build a spec once, then reuse it for fast extraction
newsworker analyze "https://example.com/news" -o example.yaml
newsworker extract "https://example.com/news" -s example.yaml -f rss

parsedate — inspect date parsing

A debugging helper that shows how qddate interprets a date string.

newsworker parsedate "18/06/2018"

Settings and caching

Both extract and serve share a small caching layer that avoids redundant work:

  • Spec cache — the parsing spec for a URL is built dynamically on first use and stored as YAML. Subsequent runs reuse it (fast, deterministic).
  • Content cache — the fetched page bytes are stored with a configurable time-to-live, so a page is not re-downloaded on every request while it is still fresh.

Settings are read from a YAML file, by default ~/.newsworker/config.yaml (created with defaults on first run). Point to a different file with --config / -c.

cache_dir: ~/.newsworker/cache   # where cached specs and page content live
content_ttl: 3600                # seconds a cached page stays fresh
spec_ttl: 0                      # seconds a cached spec is valid (0 = never expires)
host: 127.0.0.1                  # local server bind interface
port: 8787                       # local server port
filtered_text_length: 150        # max text length considered for date detection

Cached specs live under <cache_dir>/specs/ and cached page content under <cache_dir>/content/, keyed by a hash of the source URL. Use --no-cache (bypass caches) or --refresh / ?refresh=1 (force a re-fetch) to override the caches for a single run/request.
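
The lookup described above can be sketched with the standard library. This is an illustration under stated assumptions, not the newsworker.cache module: the SHA-256 hash, the file suffixes and the helper names `cache_path` and `is_fresh` are hypothetical. Only the specs/ and content/ subdirectories and the TTL settings come from this document. Treating a TTL of 0 as "never expires" follows the spec_ttl comment and is assumed here for any TTL.

```python
import hashlib
import tempfile
import time
from pathlib import Path


def cache_path(cache_dir, kind, url, suffix):
    """Map a URL to <cache_dir>/<kind>/<hash><suffix> (hash choice is an assumption)."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return Path(cache_dir).expanduser() / kind / f"{digest}{suffix}"


def is_fresh(path, ttl, now=None):
    """True if the file exists and is younger than ttl seconds (0 = never expires)."""
    if not path.exists():
        return False
    if ttl == 0:
        return True
    now = time.time() if now is None else now
    return now - path.stat().st_mtime < ttl


with tempfile.TemporaryDirectory() as tmp:
    page = cache_path(tmp, "content", "https://example.com/news", ".html")
    page.parent.mkdir(parents=True)
    page.write_bytes(b"<html></html>")
    print(is_fresh(page, ttl=3600))                          # just written: True
    print(is_fresh(page, ttl=3600, now=time.time() + 7200))  # two hours later: False
```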


Output formats

extract

| Format | Description |
| --- | --- |
| json | The raw internal representation (feed metadata + items). Default. |
| rss | RSS 2.0 document generated with feedgen. |
| atom | Atom 1.0 document generated with feedgen. |
| csv | Flat table of items: title, link, pubdate, description, image, unique_id. |

scan

| Format | Description |
| --- | --- |
| json | The raw list of discovered feeds. Default. |
| rss / atom | Each discovered feed becomes an entry (its title and URL), so a feed reader can browse them. |
| csv | Flat table: title, url, feedtype, num_entries, language, confidence. |
| opml | OPML 2.0 subscription list, the standard interchange format for importing feeds into readers. |

Dates extracted from HTML are timezone-naive. When rendering RSS or Atom they are treated as UTC, because both feed formats require timestamps that carry a time zone.
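
The UTC rule can be reproduced with the standard library if you post-process items yourself. The helper name `as_utc` is hypothetical, and this is a sketch of the normalization step only, not the code path newsworker uses.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def as_utc(dt):
    """Attach UTC to a naive datetime; convert an aware one to UTC."""
    if dt.tzinfo is None:
        return dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)


pubdate = as_utc(datetime(2018, 6, 18, 0, 0))
print(format_datetime(pubdate))  # RFC 822 style, as used by RSS 2.0
print(pubdate.isoformat())       # RFC 3339 style, as used by Atom
# Mon, 18 Jun 2018 00:00:00 +0000
# 2018-06-18T00:00:00+00:00
```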


Library usage

Extract a feed dynamically

from newsworker.extractor import FeedExtractor

extractor = FeedExtractor(filtered_text_length=150)
feed, session = extractor.get_feed(url="https://www.eib.org/en/index.htm")

feed is a dictionary shaped like:

{
    "title": "European Investment Bank (EIB)",
    "language": "en",
    "link": "https://www.eib.org/en/index.htm",
    "description": "European Investment Bank (EIB)",
    "items": [
        {
            "title": "Blockchain Challenge: coders at the EIB",
            "description": "...",
            "pubdate": datetime.datetime(2018, 6, 18, 0, 0),
            "unique_id": "f9d359f76118076c5331ffec3cdb82eb",
            "link": "https://www.youtube.com/watch?v=YlKa2LZgxhE",
            "extra": {"links": [...], "images": [...]},
            "raw_html": b"...",
        },
        # ...
    ],
    "cache": {"pats": ["dt:date:date_1"]},
}
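
Because the feed is a plain dictionary, ordinary Python is enough to post-process it. The sketch below uses a hand-built dictionary with the same keys as the structure above (not real extractor output) to pick out unseen items, newest first, keyed by unique_id. The helper name `new_items` is hypothetical.

```python
from datetime import datetime

# Hand-built stand-in for the dictionary returned by get_feed().
feed = {
    "title": "Example news",
    "link": "https://example.com/news",
    "items": [
        {"title": "Older story", "pubdate": datetime(2018, 6, 17),
         "unique_id": "a1", "link": "https://example.com/a"},
        {"title": "Newer story", "pubdate": datetime(2018, 6, 18),
         "unique_id": "b2", "link": "https://example.com/b"},
        {"title": "Older story", "pubdate": datetime(2018, 6, 17),
         "unique_id": "a1", "link": "https://example.com/a"},
    ],
}


def new_items(feed, seen_ids):
    """Return unseen items, newest first, and record their ids in seen_ids."""
    fresh = []
    for item in sorted(feed["items"], key=lambda i: i["pubdate"], reverse=True):
        if item["unique_id"] not in seen_ids:
            seen_ids.add(item["unique_id"])
            fresh.append(item)
    return fresh


seen = set()
for item in new_items(feed, seen):
    print(item["pubdate"].date(), item["title"])
print(len(new_items(feed, seen)))  # second pass finds nothing new
```

Keeping `seen` between runs (for example in a small file or database) turns repeated extractions into a stream of new items only.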

Render a feed in any format

from newsworker.formats import format_feed

print(format_feed(feed, fmt="rss", public_url="https://example.com/feed.xml"))
print(format_feed(feed, fmt="atom"))
print(format_feed(feed, fmt="csv"))

Reuse cached date patterns (big speed-up)

Re-parsing the same site is much faster if you reuse the date patterns discovered on the first pass. Matching narrows from the full set of 340+ patterns to the 2–3 that actually occur on the site:

pats = feed["cache"]["pats"]
feed, session = extractor.get_feed(
    url="https://www.eib.org/en/index.htm", cached_p=pats
)
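
To carry the patterns across separate runs they have to be stored somewhere. A minimal sketch using a JSON file follows. The helper names and the file name are hypothetical, and whether get_feed accepts cached_p=None is not confirmed here, so pass cached_p only when patterns were actually loaded.

```python
import json
import tempfile
from pathlib import Path


def load_patterns(path):
    """Return the cached pattern names, or None when nothing is cached yet."""
    try:
        return json.loads(Path(path).read_text(encoding="utf-8"))
    except FileNotFoundError:
        return None


def save_patterns(path, feed):
    """Persist feed["cache"]["pats"] so a later run can pass it as cached_p."""
    Path(path).write_text(json.dumps(feed["cache"]["pats"]), encoding="utf-8")


# Demonstration with a feed-shaped dict; in real use `feed` comes from get_feed().
with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "eib.pats.json"
    print(load_patterns(store))   # first run: None
    save_patterns(store, {"cache": {"pats": ["dt:date:date_1"]}})
    print(load_patterns(store))   # later runs: ['dt:date:date_1']
```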

Set a custom User-Agent

USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 Chrome/23 Safari/537.11"
feed, session = extractor.get_feed(
    url="https://www.eib.org/en/index.htm", user_agent=USER_AGENT
)

Analyze once, extract fast (spec workflow)

from newsworker.spec import SpecAnalyzer, SpecExtractor, FeedSpec

# 1. Build and persist a spec.
spec = SpecAnalyzer(filtered_text_length=150).analyze("https://example.com/news")
spec.save("example.yaml")

# 2. Reuse it later with deterministic, low-overhead extraction.
spec = FeedSpec.load("example.yaml")
feed = SpecExtractor().extract("https://example.com/news", spec)

Find existing feeds on a page

from newsworker.finder import FeedsFinder

finder = FeedsFinder()

# Fast: collect candidate feed links without verifying them.
finder.find_feeds("https://www.dta.gov.au/news/")

# Verify each candidate by parsing it (slower, richer metadata).
finder.find_feeds("https://www.dta.gov.au/news/", noverify=False)
# {'url': 'https://www.dta.gov.au/news/',
#  'items': [{'title': 'Digital Transformation Agency',
#             'url': 'https://www.dta.gov.au/feed.xml',
#             'feedtype': 'rss', 'num_entries': 10}]}

# Fall back to HTML extraction when a page has no real feed.
finder.find_feeds("https://government.bg/bg/prestsentar/novini", extractrss=True)

# Include the parsed feed entries in the result.
finder.find_feeds("https://www.dta.gov.au/news/", noverify=False, include_entries=True)

You can also render discovered feeds with newsworker.formats.format_scan:

from newsworker.formats import format_scan

results = finder.find_feeds("https://www.dta.gov.au/news/", noverify=False)
print(format_scan(results, fmt="opml"))
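
For readers unfamiliar with OPML, the sketch below shows roughly what a subscription list built from a scan result looks like. It is a standalone illustration using xml.etree, not the format_scan implementation, and the exact attributes newsworker emits may differ. The function name `scan_to_opml` is hypothetical, and the input reuses the result shape shown under "Find existing feeds on a page".

```python
import xml.etree.ElementTree as ET


def scan_to_opml(results, title="Discovered feeds"):
    """Render a scan result dict as an OPML 2.0 subscription list."""
    opml = ET.Element("opml", version="2.0")
    head = ET.SubElement(opml, "head")
    ET.SubElement(head, "title").text = title
    body = ET.SubElement(opml, "body")
    for item in results["items"]:
        ET.SubElement(
            body, "outline",
            type="rss",              # by convention used for RSS and Atom subscriptions alike
            text=item["title"],
            xmlUrl=item["url"],      # the feed itself
            htmlUrl=results["url"],  # the page the feed was found on
        )
    return ET.tostring(opml, encoding="unicode")


RESULTS = {
    "url": "https://www.dta.gov.au/news/",
    "items": [{"title": "Digital Transformation Agency",
               "url": "https://www.dta.gov.au/feed.xml",
               "feedtype": "rss", "num_entries": 10}],
}

print(scan_to_opml(RESULTS))
```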

Features

  • Identifies news blocks on arbitrary HTML pages using date patterns — 340+ patterns via qddate.
  • Very fast pattern matching built on pyparsing.
  • Discovers existing RSS/Atom feeds, and falls back to HTML extraction when none exist.
  • Multiple output formats for both extract (JSON, RSS, Atom, CSV) and scan (JSON, RSS, Atom, CSV, OPML).
  • Reusable YAML specs for fast, deterministic re-crawling of known layouts.
  • Pattern caching for repeated extraction from the same site.

Supported languages

Language-specific date recognition currently covers:

Bulgarian · Czech · English · French · German · Portuguese · Russian · Spanish


Performance

  • qddate was built specifically for this algorithm; pattern matching is already fast.
  • Cache date patterns (cached_p=...) to reuse the 2–3 patterns found on a site and skip the full pattern set on subsequent runs.
  • Prefer specs (analyze → extract --spec) for repeated crawls: deterministic selectors avoid re-running the discovery heuristics.
  • Feed discovery without verification (noverify=True) is fast; enabling verification parses every candidate and is slower.

Limitations

  • Not every language-specific date format is supported yet.
  • Right-aligned dates such as Published - 27-01-2018 are intentionally unsupported — supporting them measurably increases false positives.
  • Pages that expose no dates in item text or URLs are not yet supported.

Dependencies

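Key runtime dependencies named elsewhere in this document: qddate for date parsing, feedgen for RSS/Atom rendering, and feedparser, cssselect, pyyaml, requests and urllib3.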


Documentation

Full documentation is built automatically and hosted on Read the Docs.


Contributing

Issues and pull requests are welcome. Please open an issue to discuss substantial changes before submitting a PR, and keep additions covered by the changelog.


License

Released under the MIT License. Copyright © Ivan Begtin.


Acknowledgements

This news-extraction code was first written in 2008 and has been refactored several times — most notably migrating from regular expressions to pyparsing. The original project was later split into two: the qddate date parsing library and newsworker for news identification on HTML pages.

Questions? Reach out at ivan@begtin.tech.

Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

[1.1.0] - 2026-07-03

Added

  • New serve command running a local HTTP feed server (standard-library only, no extra dependencies) that turns any page URL into a feed on demand over GET (GET /feed?url=<page>&format=atom), plus /health and / endpoints. Feed URLs can be pasted straight into any RSS reader.
  • New analyze command that runs the dynamic heuristics once and distills them into a reusable YAML parsing spec, and a --spec / -s option for extract to run fast, deterministic extraction from a pre-built spec.
  • Reusable parsing specs via the new newsworker.spec module (FeedSpec, SpecAnalyzer, SpecExtractor) — deterministic CSS/XPath selectors that avoid re-running the discovery heuristics on known layouts.
  • Caching layer (newsworker.cache) with a spec cache and a content cache (configurable TTL), so extract and serve avoid rebuilding specs and re-fetching pages. New --no-cache, --refresh and --config / -c options for extract, and --cache-dir / --content-ttl for serve.
  • Settings support (newsworker.settings) backed by a YAML config file at ~/.newsworker/config.yaml (created with defaults on first run), controlling cache directory, TTLs, server host/port and detection parameters.
  • High-level newsworker.service.FeedService tying together caching, spec building and extraction; shared by both the extract command and the server.
  • Multiple output formats for the extract command via --format / -f: json (default), rss, atom and csv.
  • Multiple output formats for the scan command via --format / -f: json (default), rss, atom, csv and opml (subscription list).
  • --output / -o option for extract and scan to write results to a file instead of stdout.
  • New newsworker.formats module with format_feed() and format_scan() helpers (RSS/Atom generated via feedgen, plus CSV and OPML serializers).

Changed

  • Rewrote README.md with a modern structure: table of contents, CLI reference tables, output-format and caching documentation, and up-to-date library usage examples.
  • extract now builds and caches a parsing spec on first use (plus the fetched page content) so subsequent runs are faster; pass --spec to use an explicit spec.
  • scan now emits structured, format-aware output instead of a raw pretty-print.
  • Added cssselect, pyyaml, requests and urllib3 as dependencies (and declared feedparser explicitly in setup.py).
  • Moved PERFORMANCE_ANALYSIS.md under docs/ and removed the standalone AUTHORS.md (authorship is tracked in setup.py and the README).

Fixed

  • Naive datetimes are normalized to UTC when rendering RSS/Atom feeds, as required by the feed formats.

[1.0.1] - 2018-07-21

Added

  • First public release on PyPI and GitHub.
