
๐Ÿ—ž๏ธ news48

Autonomous news ingestion & verification pipeline with self-learning AI agents

Python 3.12+ License: MIT uv


Collect → Download → Parse → Fact-check – on repeat, with agents that learn.




๐Ÿ” What Is It?

news48 is a self-hosted news pipeline that:

  1. Ingests RSS/Atom feeds from sources you choose
  2. Downloads full article content (with anti-bot bypass)
  3. Parses unstructured HTML into structured data via LLM
  4. Fact-checks claims against external evidence
  5. Purges stale data on a 48-hour retention window

All of this runs autonomously through four AI agents that schedule themselves via Dramatiq + Periodiq. The agents also learn from mistakes, saving lessons that carry across runs so they get smarter over time.
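The five stages above can be read as a simple chain. Here is a minimal, purely illustrative sketch of that flow – the function and field names are hypothetical, not news48 internals, and in the real system each stage runs as its own Dramatiq actor:

```python
from dataclasses import dataclass, field

@dataclass
class Article:
    url: str
    html: str = ""
    parsed: dict = field(default_factory=dict)
    verdicts: list = field(default_factory=list)

def fetch(feed_urls):
    # Pull RSS/Atom entries and create article stubs (stubbed out here).
    return [Article(url=u + "/story-1") for u in feed_urls]

def download(article):
    # Real pipeline fetches full HTML through the anti-bot bypass.
    article.html = f"<html>content of {article.url}</html>"
    return article

def parse(article):
    # Real pipeline extracts structured data via an LLM.
    article.parsed = {"title": article.url.rsplit("/", 1)[-1]}
    return article

def fact_check(article):
    # Real pipeline searches external evidence per claim.
    article.verdicts = ["verified"]
    return article

def run_pipeline(feed_urls):
    return [fact_check(parse(download(a))) for a in fetch(feed_urls)]

articles = run_pipeline(["https://example.com/feed"])
print(articles[0].parsed["title"])  # story-1
```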


๐ŸŒ Web Interface

news48 ships a FastAPI web interface that displays the last 48 hours of verified news. In Docker it's available at http://localhost:8765 (dev) or http://localhost:8000 (prod).

Pages

| Route | Description |
|---|---|
| `/` | Homepage – top 10 stories, trending topics, expiring articles |
| `/all` | All stories with optional tone filter (`?sentiment=` set to `positive`, `neutral`, or `negative`) |
| `/category/{slug}` | Category view with tone filter (e.g. `/category/politics?sentiment=negative`) |
| `/article/{id}/{slug}` | Article detail with fact-check breakdown and related coverage |
| `/cluster/{slug}` | Topic cluster – all stories sharing a tag |
| `/sitemap.xml` | Auto-generated XML sitemap |
| `/health` | Health check endpoint |

Features

  • AI-rewritten summaries – clear, plain-English summaries for every parsed story
  • Fact-check breakdown – per-claim verdicts (verified, disputed, mixed, unverifiable) with evidence
  • Tone filter – filter stories by sentiment (positive, neutral, negative) across all pages
  • Trending topics – auto-generated topic clusters from article tags
  • Expiring stories – catch time-sensitive reporting before it leaves the 48-hour window
  • Deduplication – the same story from multiple sources is shown once per category
  • Category normalization – consistent category names (e.g. artificial-intelligence and artificial intelligence are merged)
  • SEO-friendly – canonical URLs, Open Graph tags, JSON-LD structured data, XML sitemap
  • Rate limiting – 120 req/min general, 20 req/min for search
  • Security headers – X-Content-Type-Options, X-Frame-Options, CSP, Referrer-Policy
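Category normalization of the kind described above can be sketched with a small slugify helper. This is an assumption about the approach, not the actual news48 code – the idea is that all spelling variants collapse to one canonical slug:

```python
import re
import unicodedata

def category_slug(name: str) -> str:
    """Normalize a category name to a canonical slug so variants merge."""
    # Strip accents, lowercase, then collapse any non-alphanumeric run to "-".
    name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    name = name.lower().strip()
    return re.sub(r"[^a-z0-9]+", "-", name).strip("-")

# Both variants from the bullet above map to the same slug.
print(category_slug("Artificial Intelligence"))   # artificial-intelligence
print(category_slug("artificial-intelligence"))   # artificial-intelligence
```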

๐Ÿ” Pipeline

 seed.txt ──► seed ──► fetch ──► download ──► parse ──► fact-check
                │        │          │           │           │
                ▼        ▼          ▼           ▼           ▼
             DB feeds  DB articles  HTML → MD  structured  verdicts
                                                  data

| Stage | Command | What it does |
|---|---|---|
| 🌱 Seed | `news48 seed seed.txt` | Load feed URLs into the database |
| 📡 Fetch | `news48 fetch` | Pull RSS/Atom entries and store them as articles |
| ⬇️ Download | `news48 download` | Fetch full article HTML (with bypass) |
| 🧩 Parse | `news48 parse` | Extract title, summary, categories, sentiment via LLM |
| 🔬 Fact-check | `news48 fact-check` | Verify claims against evidence, record verdicts |
| 🧹 Cleanup | `news48 cleanup purge` | Remove articles older than 48 hours |

Most commands support --json for machine-readable output and --limit to control batch size.


🤖 Agents

Four agents run on schedules through Periodiq → Redis → Dramatiq:

| Agent | Cron | What it does |
|---|---|---|
| Sentinel | `*/5 * * * *` | Monitors health, creates fix plans, deletes bad feeds |
| Executor | `* * * * *` | Claims a plan, runs its steps, verifies outcomes |
| Parser | `* * * * *` | Claims articles, runs LLM parsing autonomously |
| Fact-checker | `*/10 * * * *` | Verifies claims, searches evidence, records verdicts |
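To see how the schedules above interleave, here is a tiny cron-minute matcher covering just the `*` and `*/N` forms used in the table (illustrative only – in news48 itself Periodiq evaluates the cron expressions):

```python
from datetime import datetime

# Agent schedules from the table above; only the minute field varies.
SCHEDULES = {
    "sentinel": "*/5 * * * *",
    "executor": "* * * * *",
    "parser": "* * * * *",
    "fact-checker": "*/10 * * * *",
}

def minute_matches(field: str, minute: int) -> bool:
    # Supports only the "*" and "*/N" forms that appear in the table.
    if field == "*":
        return True
    if field.startswith("*/"):
        return minute % int(field[2:]) == 0
    return int(field) == minute

def agents_due(now: datetime):
    return sorted(name for name, cron in SCHEDULES.items()
                  if minute_matches(cron.split()[0], now.minute))

print(agents_due(datetime(2025, 1, 1, 12, 7)))   # only the every-minute agents
print(agents_due(datetime(2025, 1, 1, 12, 10)))  # minute 10: all four fire
```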

🧠 Self-Learning

Agents save lessons when they discover something useful. On the next run, all accumulated lessons are injected into every agent's prompt:

Run 1:  Executor fails with wrong timeout → discovers 600s works → saves lesson
Run 2:  Executor starts with "timeout for fact-check should be 600s" already loaded

Lessons are stored in data/lessons.db (SQLite), cross-pollinated across agents, and human-auditable.

news48 lessons list                              # view all
news48 lessons list --agent executor --json      # filter by agent
news48 lessons add -a executor -c "Timing" -l "Use 600s timeout for fact-checks"
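A lessons store like the one described could be sketched with plain SQLite. The schema here is an assumption for illustration – the actual layout of `data/lessons.db` may differ – but it shows the key property: every agent's prompt is built from all accumulated lessons, not just its own:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for data/lessons.db
conn.execute("""CREATE TABLE lessons (
    agent TEXT, category TEXT, lesson TEXT,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP)""")

def save_lesson(agent: str, category: str, lesson: str) -> None:
    conn.execute("INSERT INTO lessons (agent, category, lesson) VALUES (?, ?, ?)",
                 (agent, category, lesson))

def lessons_prompt() -> str:
    # Cross-pollination: inject ALL lessons into every agent's prompt.
    rows = conn.execute(
        "SELECT agent, lesson FROM lessons ORDER BY created_at").fetchall()
    return "\n".join(f"- [{agent}] {lesson}" for agent, lesson in rows)

save_lesson("executor", "Timing", "Use 600s timeout for fact-checks")
print(lessons_prompt())
# - [executor] Use 600s timeout for fact-checks
```

Because the table is plain SQLite, the lessons stay human-auditable with any SQLite client.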

📋 CLI Reference

Pipeline Commands

news48 seed <file>              # Load feed URLs from file
news48 fetch                    # Pull RSS/Atom feeds
news48 download                 # Download article content
news48 parse                    # Parse articles with LLM
news48 fact-check               # Fact-check parsed articles
news48 stats                    # Show system statistics

Resource Management

# Feeds
news48 feeds list                          # List all feeds
news48 feeds add <url>                     # Add a feed
news48 feeds info <url-or-id>              # Feed details
news48 feeds update <url-or-id> -t "Title" # Update metadata
news48 feeds delete <url-or-id>            # Delete feed + articles
news48 feeds rss --hours 48 --output feed.xml  # Generate RSS

# Articles
news48 articles list --status parsed       # List by status
news48 articles info <id-or-url>           # Article details
news48 articles content <id-or-url>        # Show content
news48 articles update <id> --content-file <path>  # Update fields
news48 articles delete <id-or-url>         # Delete article
news48 articles reset <id> --all           # Reset failure flags
news48 articles feature <id>               # Mark as featured
news48 articles breaking <id>              # Mark as breaking
news48 articles check <id> -s verified     # Set fact-check result
news48 articles claims <id>                # Show per-claim verdicts

# Fetches
news48 fetches list                        # View fetch history

Search

news48 search articles "climate change"                    # Full-text search
news48 search articles "election" --sentiment negative -l 5  # Filtered

Agents & Plans

news48 agents status                 # Queue depths + cron schedules
news48 agents run -a parser          # Run one agent (enqueue to Dramatiq)
news48 agents run -a parser --inline # Run inline (debug, no Redis needed)

news48 plans list                    # List all plans
news48 plans list -s pending         # Filter by status
news48 plans show <plan-id>          # Show plan details
news48 plans cancel <plan-id>        # Cancel a plan
news48 plans remediate --apply       # Repair plan corruption

Observability

news48 lessons list                  # View agent lessons

Retention & Health

news48 cleanup status                # Retention policy stats
news48 cleanup purge                 # Purge old articles (default: 48h)
news48 cleanup purge --dry-run       # Preview without deleting
news48 cleanup health                # Database connectivity check
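The purge-with-dry-run behaviour can be sketched as a cutoff query. This is a simplified sketch using in-memory SQLite (news48 itself targets MySQL via SQLAlchemy, and the column names here are assumptions):

```python
import sqlite3
from datetime import datetime, timedelta, timezone

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, fetched_at TEXT)")
now = datetime.now(timezone.utc)
# Four articles: 1h, 24h, 49h, and 72h old.
conn.executemany("INSERT INTO articles (fetched_at) VALUES (?)",
                 [((now - timedelta(hours=h)).isoformat(),) for h in (1, 24, 49, 72)])

def purge(hours: int = 48, dry_run: bool = False) -> int:
    """Delete (or, with dry_run, just count) articles past the retention window."""
    cutoff = (datetime.now(timezone.utc) - timedelta(hours=hours)).isoformat()
    if dry_run:
        return conn.execute("SELECT COUNT(*) FROM articles WHERE fetched_at < ?",
                            (cutoff,)).fetchone()[0]
    return conn.execute("DELETE FROM articles WHERE fetched_at < ?",
                        (cutoff,)).rowcount

print(purge(dry_run=True))  # 2 -- the 49h and 72h rows would be removed
print(purge())              # 2 rows actually deleted
```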

Web & MCP

news48 mcp serve                     # Start MCP server (stdio)
news48 mcp create-key --label "Dev"  # Create API key
news48 mcp list-keys                 # List active keys
news48 mcp revoke-key <key>          # Revoke a key

💡 Tip: Append --json to any command for machine-readable output.


🚀 Quick Start

Prerequisites

  • Python 3.12+
  • uv package manager
  • An OpenAI-compatible LLM endpoint
  • A Byparr instance

Install

Option A – Docker (recommended):

One-liner install:

curl -fsSL https://raw.githubusercontent.com/malvavisc0/news48/master/scripts/install.sh | bash

Or clone and run manually:

git clone https://github.com/malvavisc0/news48.git && cd news48
./scripts/install.sh

The interactive installer clones the repository, checks prerequisites, prompts for deployment mode (GPU or external LLM), generates secure passwords, and launches all services.

Option B – Local (uv):

git clone https://github.com/malvavisc0/news48.git && cd news48
uv sync --extra all
cp .env.example .env
# Edit .env with your API keys (see table below)
uv run news48 --help

Extras:

uv sync --extra cli    # CLI + agents only
uv sync --extra web    # Web server only
uv sync --extra all    # Everything

Environment Variables

| Variable | Required | Description |
|---|---|---|
| `DATABASE_URL` | ✅ | SQLAlchemy database URL (MySQL) |
| `BYPARR_API_URL` | ✅ | Byparr service URL |
| `API_BASE` | ✅ | LLM API base URL |
| `API_KEY` | ✅ | LLM API key |
| `MODEL` | ✅ | Model identifier |
| `REDIS_URL` | | Redis URL for Dramatiq (required for agents) |
| `SEARXNG_URL` | | SearXNG for fact-checker evidence search |
| `CONTEXT_WINDOW` | | Context window size (default: 1048576) |
| `SMTP_HOST` | | SMTP host for sentinel email alerts |
| `SMTP_PORT` | | SMTP port (default: 587) |
| `SMTP_USER` | | SMTP username |
| `SMTP_PASS` | | SMTP password |
| `SMTP_FROM` | | Sender email address |
| `MONITOR_EMAIL_TO` | | Recipient for sentinel alerts |
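A minimal `.env` covering the required variables might look like this. All values are placeholders, and the `mysql+pymysql://` URL scheme is an assumption – use whichever SQLAlchemy driver your setup provides:

```shell
# Required
DATABASE_URL=mysql+pymysql://news48:change-me@localhost:3306/news48
BYPARR_API_URL=http://localhost:8191
API_BASE=https://api.openai.com/v1
API_KEY=sk-your-key-here
MODEL=gpt-4o-mini

# Optional: needed for agents and evidence search
REDIS_URL=redis://localhost:6379/0
SEARXNG_URL=http://localhost:8080
```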

Run It

# 1. Seed feeds
uv run news48 seed seed.txt

# 2. Fetch articles
uv run news48 fetch

# 3. Download content
uv run news48 download --limit 10

# 4. Parse with LLM
uv run news48 parse --limit 10

# 5. Check stats
uv run news48 stats

๐Ÿณ Docker

news48 runs entirely in Docker with separate containers for each service.

Services

| Service | Port | Role |
|---|---|---|
| web | 8000 | FastAPI web interface |
| mysql | 3306 | Primary database |
| redis | 6379 | Dramatiq broker + RedisInsight (8001) |
| dramatiq-worker | – | Executes agents and pipeline actors |
| periodiq-scheduler | – | Enqueues scheduled work |
| searxng | 8080† | Meta-search engine |
| byparr | 8191† | Anti-bot bypass |
| dozzle | 9999 | Container log viewer (dev) |

† internal only

Development

# Start with live reload
docker compose up

# Web UI       → http://localhost:8765
# RedisInsight → http://localhost:8001
# Dozzle       → http://localhost:9999

# Run CLI inside container
docker compose exec dramatiq-worker news48 stats
docker compose exec dramatiq-worker news48 feeds list

# Logs
docker compose logs -f dramatiq-worker
docker compose logs -f web

# Stop
docker compose down        # keep data
docker compose down -v     # fresh start

Production

# Start
docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d

# Web UI → http://localhost:8000

# Backup
docker compose exec mysql mysqldump -unews48 -pnews48 news48 > backup.sql

# Update
docker compose -f docker-compose.yml -f docker-compose.prod.yml build --no-cache
docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d

# Stop
docker compose -f docker-compose.yml -f docker-compose.prod.yml down

Seeding in Docker

The sentinel agent auto-detects an empty database and creates a seed plan, so if seed.txt is in the image, seeding happens automatically.

# Manual seed
docker compose exec dramatiq-worker news48 seed /app/seed.txt

# Verify
docker compose exec dramatiq-worker news48 feeds list

Worker Observability

  • RedisInsight → http://localhost:8001 – inspect queues and broker state
  • Dozzle → http://localhost:9999 – container log viewer
  • CLI → news48 agents status --json – queue depths and cron schedules

🔌 MCP Integration

news48 exposes tools via the Model Context Protocol so AI assistants can interact with your pipeline.

Local Server (stdio)

No auth required – ideal for Claude Desktop, Cursor, etc.

uv run news48 mcp serve

Example client configuration:

{
  "mcpServers": {
    "news48": {
      "command": "news48",
      "args": ["mcp", "serve"]
    }
  }
}

Tools: fetch_feeds, list_feeds, search_articles, get_article_detail, get_stats, parse_article

Remote Endpoint (HTTP)

The web app exposes an authenticated endpoint at /mcp/. Keys are stored in Redis.

# Create a key
uv run news48 mcp create-key --label "Claude Desktop"
# → Created MCP API key: n48-aBcDeFgHiJkLmNoPqRsTuVwXyZ...
# ⚠️  Copy it now – it can't be retrieved later

# List keys (masked)
uv run news48 mcp list-keys

# Revoke a key
uv run news48 mcp revoke-key n48-...

Example client configuration:

{
  "mcpServers": {
    "news48-remote": {
      "url": "https://your-domain.com/mcp/",
      "headers": {
        "Authorization": "Bearer n48-your-api-key-here"
      }
    }
  }
}

Tools: browse_articles, get_topic_clusters, article_detail, web_stats

🔒 All keys are prefixed n48- for secret scanner detection. If Redis is unreachable, all MCP requests are denied (fail-closed).


🧬 Development

# Run tests
uv run pytest

# Format
uv run black .
uv run isort .

📄 License

MIT – see LICENSE for details.

Project details


Download files

Download the file for your platform.

Source Distribution

news48-0.2.0.tar.gz (209.2 kB)

Uploaded Source

Built Distribution


news48-0.2.0-py3-none-any.whl (238.0 kB)

Uploaded Python 3

File details

Details for the file news48-0.2.0.tar.gz.

File metadata

  • Download URL: news48-0.2.0.tar.gz
  • Upload date:
  • Size: 209.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for news48-0.2.0.tar.gz
| Algorithm | Hash digest |
|---|---|
| SHA256 | `aea8e75c51e39b7d0ea84502bbb0ac06c8ced83a2415abec29b98cae39a80714` |
| MD5 | `2ac551e791a565e84a8e5eac7d7923c4` |
| BLAKE2b-256 | `4e96d956eb4dd71105a0c4e5009b0d41dc546638540cb7ca51efeb94f2575da4` |


Provenance

The following attestation bundles were made for news48-0.2.0.tar.gz:

Publisher: ci.yml on malvavisc0/news48

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file news48-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: news48-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 238.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for news48-0.2.0-py3-none-any.whl
| Algorithm | Hash digest |
|---|---|
| SHA256 | `5b833927643cce7cf239ac1607121249f7a95f2355ebc784b9d69724793fee68` |
| MD5 | `d3832b9ab4a70f6b9e676bd48c5df7df` |
| BLAKE2b-256 | `a4b3e18c48539f091f827375138a08b11d3328d78c2d4e7bca0cf23c52df1c4c` |


Provenance

The following attestation bundles were made for news48-0.2.0-py3-none-any.whl:

Publisher: ci.yml on malvavisc0/news48

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
