Advanced news feeds extractor and finder library. Helps to automatically extract news from websites without RSS/ATOM feeds
newsworker
Turn any news page into an RSS/Atom feed — even when the site publishes no feed at all.
newsworker is a Python 3 library and command-line tool that extracts news feeds from plain HTML pages. It is built for the common case where a site publishes fresh news but offers no RSS/ATOM feed, and where generic "page change" monitors are too noisy to be useful.
The extracted feed can be emitted as JSON, RSS, Atom, CSV or OPML, so you can plug it straight into a feed reader, a pipeline, or your own storage.
Table of contents
- How it works
- Installation
- Quick start
- Command-line interface
- Settings and caching
- Output formats
- Library usage
- Features
- Supported languages
- Performance
- Limitations
- Dependencies
- Documentation
- Contributing
- License
- Acknowledgements
How it works
The core idea is simple. Most news pages carry a publication date next to each item — 2017-09-27, 1 jul 2016, 18/06/2018, and hundreds of other variants. newsworker:
- Finds every date on the page using qddate, a fast pattern-based date parser that recognizes 340+ date formats across many languages.
- Clusters repeated, similarly-structured date nodes to tell apart a page date (footer, "last updated") from the news list area.
- Reconstructs each news item around its date node, pulling out the title, description, link and image.
The result is a structured feed you can serialize into whatever format you need.
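The clustering step above can be illustrated with a small standard-library sketch. This is not newsworker's implementation: it assumes a single hypothetical ISO-date regex in place of qddate, and it uses the tag path of each date-bearing text node as a crude stand-in for "similarly-structured". The names `DatePathCollector` and `find_news_dates` are made up for this example.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

# Hypothetical, deliberately tiny date matcher (ISO dates only).
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
# Tags that never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class DatePathCollector(HTMLParser):
    """Record the tag path of every text node that contains a date."""

    def __init__(self):
        super().__init__()
        self.stack = []                # currently open tags
        self.hits = defaultdict(list)  # tag path -> dates found on that path

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        for match in DATE_RE.findall(data):
            self.hits["/".join(self.stack)].append(match)


def find_news_dates(html):
    """Return (path, dates) for the most frequently repeated date path."""
    collector = DatePathCollector()
    collector.feed(html)
    collector.close()
    if not collector.hits:
        return None, []
    path = max(collector.hits, key=lambda p: len(collector.hits[p]))
    return path, collector.hits[path]


PAGE = """
<html><body>
  <ul>
    <li><span>2017-09-27</span><a href="/a">First story</a></li>
    <li><span>2017-09-25</span><a href="/b">Second story</a></li>
    <li><span>2017-09-20</span><a href="/c">Third story</a></li>
  </ul>
  <footer><p>Last updated 2017-09-28</p></footer>
</body></html>
"""

path, dates = find_news_dates(PAGE)
print(path, dates)
```

The footer date sits alone on its own path, while the three list dates share one path, which is why the repeated path is treated as the news list in this sketch.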
Installation
pip install newsworker
Requires Python 3.7+. Installing from source:
git clone https://github.com/ivbeg/newsworker.git
cd newsworker
pip install -e .
Quick start
Extract a feed from a page and print it as RSS:
newsworker extract "https://www.eib.org/en/index.htm" --format rss
Discover feeds already published on a site and export them as an OPML subscription list:
newsworker scan "https://www.dta.gov.au/news/" --format opml --output feeds.opml
Or use it directly from Python:
from newsworker.extractor import FeedExtractor

extractor = FeedExtractor(filtered_text_length=150)
feed, session = extractor.get_feed(url="https://www.eib.org/en/index.htm")
for item in feed["items"]:
    print(item["pubdate"], item["title"])
Command-line interface
The package installs a single newsworker executable exposing five commands:
newsworker [COMMAND] [ARGS] [OPTIONS]

Commands:
  extract    Extract feed records from a web page
  serve      Run a local HTTP server exposing pages as RSS/Atom/JSON/CSV feeds
  scan       Scan a page and find existing feeds
  analyze    Analyze a page and generate a reusable YAML parsing spec
  parsedate  Parse a date/time string (debugging helper)
Add --verbose / -v to any command for detailed execution logs.
extract — build a feed from a page
Extracts news items from an HTML page and renders them in the chosen format.
newsworker extract URL [OPTIONS]
| Option | Alias | Default | Description |
|---|---|---|---|
| --format | -f | json | Output format: json, rss, atom, csv. |
| --output | -o | (stdout) | Write the result to a file instead of printing it. |
| --spec | -s | (none) | Path to a YAML spec produced by analyze. Uses fast deterministic extraction instead of the dynamic heuristics. |
| --no-cache | | false | Bypass the spec and content caches for this run. |
| --refresh | | false | Force re-fetching the page, ignoring cached content. |
| --config | -c | (default) | Path to a settings YAML file (see Settings and caching). |
| --verbose | -v | false | Verbose logging. |
By default, extract builds a parsing spec dynamically on the first run for a
URL and caches it, along with the fetched page content, under the configured
cache directory. Subsequent runs reuse the cached spec (deterministic, fast) and
the cached page (until its TTL expires). See Settings and caching.
Examples:
# Default JSON output
newsworker extract "https://example.com/news"
# RSS 2.0 to stdout
newsworker extract "https://example.com/news" -f rss
# Atom saved to a file
newsworker extract "https://example.com/news" -f atom -o feed.xml
# CSV table of items
newsworker extract "https://example.com/news" -f csv -o news.csv
# Fast, repeatable extraction using a pre-built spec
newsworker extract "https://example.com/news" -s example.yaml -f rss
# Ignore caches and re-fetch the page
newsworker extract "https://example.com/news" --refresh
serve — local feed server
Runs a lightweight local HTTP server (built on the Python standard library, no extra dependencies) that turns any page URL into a feed on demand over GET. Because the feed URLs are plain GET requests, you can paste them straight into any RSS reader and let it poll for updates.
newsworker serve [OPTIONS]
| Option | Alias | Default | Description |
|---|---|---|---|
| --host | -h | 127.0.0.1 | Interface to bind. Overrides the settings value. |
| --port | -p | 8787 | Port to listen on. Overrides the settings value. |
| --config | -c | (default) | Path to a settings YAML file. |
| --cache-dir | | (settings) | Directory for cached specs and page content. |
| --content-ttl | | (settings) | Seconds a cached page stays fresh. |
| --verbose | -v | false | Verbose logging. |
Endpoints:
| Route | Description |
|---|---|
| GET /feed?url=<page>&format=atom | Build a feed from <page>. format is one of atom (default), rss, json, csv. Add &refresh=1 to bypass the caches for one request. |
| GET /health | Health check (returns ok). |
| GET / | Short usage help. |
Example — start the server and subscribe from a reader:
newsworker serve --port 8787
Then add this URL to your RSS reader (URL-encode the page URL):
http://127.0.0.1:8787/feed?url=https%3A%2F%2Fexample.com%2Fnews&format=atom
The first request for a URL builds and caches a parsing spec dynamically; later requests reuse the cached spec and serve the cached page content until its TTL expires, so the reader can poll frequently without hammering the source site.
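Building the subscription URL by hand is easy to get wrong because the page URL must be percent-encoded. A minimal sketch using only the standard library follows; `feed_url` is a hypothetical helper, not part of newsworker, and the host, port and query parameters are the defaults documented above.

```python
from urllib.parse import urlencode


def feed_url(page_url, fmt="atom", host="127.0.0.1", port=8787, refresh=False):
    """Build a /feed URL for the local server, percent-encoding the page URL."""
    params = {"url": page_url, "format": fmt}
    if refresh:
        # Documented switch that bypasses the caches for one request.
        params["refresh"] = "1"
    return f"http://{host}:{port}/feed?{urlencode(params)}"


print(feed_url("https://example.com/news"))
print(feed_url("https://example.com/news", fmt="rss", refresh=True))
```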
scan — discover existing feeds
Scans a page for already-published RSS/Atom feeds (via autodiscovery links, feed icons and link heuristics) and reports them.
newsworker scan URL [OPTIONS]
| Option | Alias | Default | Description |
|---|---|---|---|
| --format | -f | json | Output format: json, rss, atom, csv, opml. |
| --output | -o | (stdout) | Write the result to a file instead of printing it. |
| --verbose | -v | false | Verbose logging. |
Examples:
# Default JSON list of discovered feeds
newsworker scan "https://www.dta.gov.au/news/"
# OPML subscription list ready to import into a feed reader
newsworker scan "https://www.dta.gov.au/news/" -f opml -o feeds.opml
# CSV table of discovered feeds
newsworker scan "https://www.dta.gov.au/news/" -f csv
# Represent each discovered feed as an entry in a single RSS/Atom feed
newsworker scan "https://www.dta.gov.au/news/" -f rss
Note: scan verifies every candidate feed by parsing it, so it may take longer than a raw link scan. The feedtype, num_entries and language metadata are included where available.
analyze — generate a reusable spec
Runs the dynamic heuristics once and distills them into a portable YAML parsing spec. Feeding that spec back into extract --spec skips the expensive analysis step and runs deterministic selectors, which is far faster on repeat crawls of the same layout.
newsworker analyze URL [--output spec.yaml]
newsworker analyze "https://example.com/news" -o example.yaml
newsworker extract "https://example.com/news" -s example.yaml -f rss
parsedate — inspect date parsing
A debugging helper that shows how qddate interprets a date string.
newsworker parsedate "18/06/2018"
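For comparison, the standard library can only parse a date when the format is spelled out in advance, which is the gap a pattern-based parser fills. The sketch below is not how qddate works; it simply tries a short, hypothetical list of formats in order and reports which one matched.

```python
from datetime import datetime

# Hypothetical shortlist; a real page may use formats far beyond these two.
CANDIDATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y"]


def parse_with_known_formats(text):
    """Return (datetime, format) for the first format that fits, else (None, None)."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt), fmt
        except ValueError:
            continue
    return None, None


print(parse_with_known_formats("18/06/2018"))
print(parse_with_known_formats("2017-09-27"))
print(parse_with_known_formats("yesterday"))
```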
Settings and caching
Both extract and serve share a small caching layer that avoids redundant
work:
- Spec cache — the parsing spec for a URL is built dynamically on first use and stored as YAML. Subsequent runs reuse it (fast, deterministic).
- Content cache — the fetched page bytes are stored with a configurable time-to-live, so a page is not re-downloaded on every request while it is still fresh.
Settings are read from a YAML file, by default ~/.newsworker/config.yaml
(created with defaults on first run). Point to a different file with
--config / -c.
cache_dir: ~/.newsworker/cache # where cached specs and page content live
content_ttl: 3600 # seconds a cached page stays fresh
spec_ttl: 0 # seconds a cached spec is valid (0 = never expires)
host: 127.0.0.1 # local server bind interface
port: 8787 # local server port
filtered_text_length: 150 # max text length considered for date detection
Cached specs live under <cache_dir>/specs/ and cached page content under
<cache_dir>/content/, keyed by a hash of the source URL. Use --no-cache
(bypass caches) or --refresh / ?refresh=1 (force a re-fetch) to override the
caches for a single run/request.
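The cache layout and freshness rule described above can be sketched as follows. The details are assumptions made for illustration: the text only says the key is "a hash of the source URL", so SHA-256 and the file extensions here are guesses, and `cache_paths` and `is_fresh` are hypothetical names rather than functions from newsworker.cache.

```python
import hashlib
import time
from pathlib import Path


def cache_paths(cache_dir, url):
    """Return (spec_path, content_path) for a URL under the cache directory."""
    # Assumption: SHA-256 hex digest as the key; the real hash may differ.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    root = Path(cache_dir).expanduser()
    return root / "specs" / f"{key}.yaml", root / "content" / f"{key}.html"


def is_fresh(stored_at, ttl, now=None):
    """True while an entry stored at `stored_at` is younger than `ttl` seconds.

    A ttl of 0 is treated as "never expires", matching the spec_ttl comment.
    """
    if ttl == 0:
        return True
    now = time.time() if now is None else now
    return (now - stored_at) < ttl


spec_path, content_path = cache_paths("~/.newsworker/cache", "https://example.com/news")
print(spec_path.parent.name, content_path.parent.name)
print(is_fresh(stored_at=1000, ttl=3600, now=2000))  # still within the TTL
print(is_fresh(stored_at=1000, ttl=3600, now=5000))  # TTL exceeded
```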
Output formats
extract
| Format | Description |
|---|---|
| json | The raw internal representation (feed metadata + items). Default. |
| rss | RSS 2.0 document generated with feedgen. |
| atom | Atom 1.0 document generated with feedgen. |
| csv | Flat table of items: title, link, pubdate, description, image, unique_id. |
scan
| Format | Description |
|---|---|
| json | The raw list of discovered feeds. Default. |
| rss / atom | Each discovered feed becomes an entry (its title and URL), so a feed reader can browse them. |
| csv | Flat table: title, url, feedtype, num_entries, language, confidence. |
| opml | OPML 2.0 subscription list, the standard interchange format for importing feeds into readers. |
Dates coming from HTML are timezone-naive; when rendering RSS/Atom they are assumed to be UTC (a requirement of the feed formats).
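If you post-process items yourself, the same UTC assumption can be applied with the standard library. This is a minimal sketch; `as_utc` is a hypothetical helper, not a newsworker function.

```python
from datetime import datetime, timezone


def as_utc(dt):
    """Attach UTC to a naive datetime; leave timezone-aware values untouched."""
    if dt.tzinfo is None:
        return dt.replace(tzinfo=timezone.utc)
    return dt


pubdate = datetime(2018, 6, 18, 0, 0)  # naive, as extracted from HTML
print(as_utc(pubdate).isoformat())     # 2018-06-18T00:00:00+00:00
```

Note that `replace` only labels the value as UTC; it does not shift the clock time, which matches "assumed to be UTC" rather than a conversion from local time.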
Library usage
Extract a feed dynamically
from newsworker.extractor import FeedExtractor
extractor = FeedExtractor(filtered_text_length=150)
feed, session = extractor.get_feed(url="https://www.eib.org/en/index.htm")
feed is a dictionary shaped like:
{
    "title": "European Investment Bank (EIB)",
    "language": "en",
    "link": "https://www.eib.org/en/index.htm",
    "description": "European Investment Bank (EIB)",
    "items": [
        {
            "title": "Blockchain Challenge: coders at the EIB",
            "description": "...",
            "pubdate": datetime.datetime(2018, 6, 18, 0, 0),
            "unique_id": "f9d359f76118076c5331ffec3cdb82eb",
            "link": "https://www.youtube.com/watch?v=YlKa2LZgxhE",
            "extra": {"links": [...], "images": [...]},
            "raw_html": b"...",
        },
        # ...
    ],
    "cache": {"pats": ["dt:date:date_1"]},
}
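Because every item carries a unique_id, a polling script can report only the items it has not seen before. The sketch below works on a hand-built dictionary with the shape shown above, so it runs without fetching anything; `new_items` is a hypothetical helper and the second item is invented sample data.

```python
from datetime import datetime


def new_items(feed, seen_ids):
    """Return unseen items, newest first, and remember their ids in seen_ids."""
    fresh = [item for item in feed["items"] if item["unique_id"] not in seen_ids]
    seen_ids.update(item["unique_id"] for item in fresh)
    return sorted(fresh, key=lambda item: item["pubdate"], reverse=True)


# Hand-built sample in the documented shape (only the fields used here).
sample_feed = {
    "title": "European Investment Bank (EIB)",
    "items": [
        {"title": "Older story", "pubdate": datetime(2018, 6, 12),
         "unique_id": "id-older"},
        {"title": "Blockchain Challenge: coders at the EIB",
         "pubdate": datetime(2018, 6, 18),
         "unique_id": "f9d359f76118076c5331ffec3cdb82eb"},
    ],
}

seen = set()
first_poll = new_items(sample_feed, seen)
second_poll = new_items(sample_feed, seen)
print([item["title"] for item in first_poll])
print(second_poll)  # nothing new on the second poll
```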
Render a feed in any format
from newsworker.formats import format_feed
print(format_feed(feed, fmt="rss", public_url="https://example.com/feed.xml"))
print(format_feed(feed, fmt="atom"))
print(format_feed(feed, fmt="csv"))
Reuse cached date patterns (big speed-up)
Re-parsing the same site is dramatically faster if you reuse the date patterns discovered on the first pass — it narrows matching from ~350 patterns down to the 2–3 that actually occur:
pats = feed["cache"]["pats"]
feed, session = extractor.get_feed(
url="https://www.eib.org/en/index.htm", cached_p=pats
)
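The reason pattern reuse helps can be shown with a toy matcher. This is only an illustration of the idea, not qddate's matching code: the three regular expressions and their keys are made up, and `match_dates` is a hypothetical function. The first pass tries every pattern and records which ones hit; the second pass tries only those.

```python
import re

# Hypothetical pattern table standing in for the full pattern set.
PATTERNS = {
    "iso": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "dmy_slash": re.compile(r"\b\d{2}/\d{2}/\d{4}\b"),
    "dmy_dot": re.compile(r"\b\d{2}\.\d{2}\.\d{4}\b"),
}


def match_dates(texts, cached_keys=None):
    """Find one date per text; return (dates, keys of the patterns that matched)."""
    keys = list(cached_keys) if cached_keys else list(PATTERNS)
    found, used = [], set()
    for text in texts:
        for key in keys:
            match = PATTERNS[key].search(text)
            if match:
                found.append(match.group())
                used.add(key)
                break
    return found, sorted(used)


texts = ["18/06/2018 Blockchain Challenge", "12/06/2018 Another story"]
dates, used_keys = match_dates(texts)                 # tries all patterns
dates_again, _ = match_dates(texts, cached_keys=used_keys)  # tries one pattern
print(dates, used_keys)
```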
Set a custom User-Agent
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 Chrome/23 Safari/537.11"
feed, session = extractor.get_feed(
url="https://www.eib.org/en/index.htm", user_agent=USER_AGENT
)
Analyze once, extract fast (spec workflow)
from newsworker.spec import SpecAnalyzer, SpecExtractor, FeedSpec
# 1. Build and persist a spec.
spec = SpecAnalyzer(filtered_text_length=150).analyze("https://example.com/news")
spec.save("example.yaml")
# 2. Reuse it later with deterministic, low-overhead extraction.
spec = FeedSpec.load("example.yaml")
feed = SpecExtractor().extract("https://example.com/news", spec)
Find existing feeds on a page
from newsworker.finder import FeedsFinder
finder = FeedsFinder()
# Fast: collect candidate feed links without verifying them.
finder.find_feeds("https://www.dta.gov.au/news/")
# Verify each candidate by parsing it (slower, richer metadata).
finder.find_feeds("https://www.dta.gov.au/news/", noverify=False)
# {'url': 'https://www.dta.gov.au/news/',
# 'items': [{'title': 'Digital Transformation Agency',
# 'url': 'https://www.dta.gov.au/feed.xml',
# 'feedtype': 'rss', 'num_entries': 10}]}
# Fall back to HTML extraction when a page has no real feed.
finder.find_feeds("https://government.bg/bg/prestsentar/novini", extractrss=True)
# Include the parsed feed entries in the result.
finder.find_feeds("https://www.dta.gov.au/news/", noverify=False, include_entries=True)
You can also render discovered feeds with newsworker.formats.format_scan:
from newsworker.formats import format_scan
results = finder.find_feeds("https://www.dta.gov.au/news/", noverify=False)
print(format_scan(results, fmt="opml"))
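To show what an OPML subscription list contains, here is a standard-library sketch that builds one from a result dictionary shaped like the find_feeds output above. It is not the serializer used by format_scan; `scan_to_opml` is a hypothetical name, and the sample data is copied from the example result.

```python
import xml.etree.ElementTree as ET


def scan_to_opml(results, title="Discovered feeds"):
    """Serialize discovered feeds as a minimal OPML 2.0 subscription list."""
    opml = ET.Element("opml", {"version": "2.0"})
    head = ET.SubElement(opml, "head")
    ET.SubElement(head, "title").text = title
    body = ET.SubElement(opml, "body")
    for item in results["items"]:
        # Subscription outlines use type="rss" and carry the feed URL in xmlUrl.
        ET.SubElement(body, "outline", {
            "type": "rss",
            "text": item["title"],
            "xmlUrl": item["url"],
        })
    return ET.tostring(opml, encoding="unicode")


sample_results = {
    "url": "https://www.dta.gov.au/news/",
    "items": [{"title": "Digital Transformation Agency",
               "url": "https://www.dta.gov.au/feed.xml",
               "feedtype": "rss", "num_entries": 10}],
}
opml_text = scan_to_opml(sample_results)
print(opml_text)
```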
Features
- Identifies news blocks on arbitrary HTML pages using date patterns — 340+ patterns via qddate.
- Very fast pattern matching built on pyparsing.
- Discovers existing RSS/Atom feeds, and falls back to HTML extraction when none exist.
- Multiple output formats for both extract (JSON, RSS, Atom, CSV) and scan (JSON, RSS, Atom, CSV, OPML).
- Reusable YAML specs for fast, deterministic re-crawling of known layouts.
- Pattern caching for repeated extraction from the same site.
Supported languages
Language-specific date recognition currently covers:
Bulgarian · Czech · English · French · German · Portuguese · Russian · Spanish
Performance
- qddate was built specifically for this algorithm; pattern matching is already fast.
- Cache date patterns (cached_p=...) to reuse the 2–3 patterns found on a site and skip the full pattern set on subsequent runs.
- Prefer specs (analyze → extract --spec) for repeated crawls: deterministic selectors avoid re-running the discovery heuristics.
- Feed discovery without verification (noverify=True) is fast; enabling verification parses every candidate and is slower.
Limitations
- Not every language-specific date format is supported yet.
- Right-aligned dates such as "Published - 27-01-2018" are intentionally unsupported; supporting them measurably increases false positives.
- Pages that expose no dates in item text or URLs are not yet supported.
Dependencies
Key runtime dependencies:
- qddate — fast date parsing (the heart of the algorithm).
- pyparsing — text pattern matching.
- lxml + cssselect — HTML parsing and selectors.
- feedgen — RSS/Atom generation.
- feedparser — parsing discovered feeds.
- typer — the command-line interface.
- requests, pyyaml, beautifulsoup4.
Documentation
Full documentation is built automatically and hosted on Read the Docs.
Contributing
Issues and pull requests are welcome. Please open an issue to discuss substantial changes before submitting a PR, and keep additions covered by the changelog.
License
Released under the MIT License. Copyright © Ivan Begtin.
Acknowledgements
This news-extraction code was first written in 2008 and has been refactored several
times — most notably migrating from regular expressions to pyparsing. The original
project was later split into two: the qddate date
parsing library and newsworker for news identification on HTML pages.
Questions? Reach out at ivan@begtin.tech.
Changelog
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
[1.1.0] - 2026-07-03
Added
- New serve command running a local HTTP feed server (standard-library only, no extra dependencies) that turns any page URL into a feed on demand over GET (GET /feed?url=<page>&format=atom), plus /health and / endpoints. Feed URLs can be pasted straight into any RSS reader.
- New analyze command that runs the dynamic heuristics once and distills them into a reusable YAML parsing spec, and a --spec / -s option for extract to run fast, deterministic extraction from a pre-built spec.
- Reusable parsing specs via the new newsworker.spec module (FeedSpec, SpecAnalyzer, SpecExtractor): deterministic CSS/XPath selectors that avoid re-running the discovery heuristics on known layouts.
- Caching layer (newsworker.cache) with a spec cache and a content cache (configurable TTL), so extract and serve avoid rebuilding specs and re-fetching pages. New --no-cache, --refresh and --config / -c options for extract, and --cache-dir / --content-ttl for serve.
- Settings support (newsworker.settings) backed by a YAML config file at ~/.newsworker/config.yaml (created with defaults on first run), controlling cache directory, TTLs, server host/port and detection parameters.
- High-level newsworker.service.FeedService tying together caching, spec building and extraction; shared by both the extract command and the server.
- Multiple output formats for the extract command via --format / -f: json (default), rss, atom and csv.
- Multiple output formats for the scan command via --format / -f: json (default), rss, atom, csv and opml (subscription list).
- --output / -o option for extract and scan to write results to a file instead of stdout.
- New newsworker.formats module with format_feed() and format_scan() helpers (RSS/Atom generated via feedgen, plus CSV and OPML serializers).
Changed
- Rewrote README.md with a modern structure: table of contents, CLI reference tables, output-format and caching documentation, and up-to-date library usage examples.
- extract now builds and caches a parsing spec on first use (plus the fetched page content) so subsequent runs are faster; pass --spec to use an explicit spec.
- scan now emits structured, format-aware output instead of a raw pretty-print.
- Added cssselect, pyyaml, requests and urllib3 as dependencies (and declared feedparser explicitly in setup.py).
- Moved PERFORMANCE_ANALYSIS.md under docs/ and removed the standalone AUTHORS.md (authorship is tracked in setup.py and the README).
Fixed
- Naive datetimes are normalized to UTC when rendering RSS/Atom feeds, as required by the feed formats.
[1.0.1] - 2018-07-21
Added
- First public release on PyPI and GitHub.