Blog Toolkit

A comprehensive Python toolkit for collecting, storing, and analyzing multiple blogs with RSS feed and web crawler support.

Features

  • Multiple Collection Methods: Automatically detect and use RSS feeds, or fall back to web crawling
  • Comprehensive Analysis: Temporal patterns, content metrics, topic analysis, and sentiment analysis
  • Cross-Blog Comparison: Compare metrics across blogs by the same author or different authors
  • CLI Interface: Full-featured command-line interface for all operations
  • Web Dashboard: Interactive web interface with charts and visualizations
  • SQLite Storage: Local database for storing all blog data and metadata
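
The "detect RSS, else fall back to crawling" decision can be illustrated with a small sketch. This is not the toolkit's actual code: `FeedLinkFinder`, `choose_method`, and the method name "crawler" are hypothetical, and the sketch only checks for the standard `<link rel="alternate">` feed advertisement in a page's HTML.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types commonly used to advertise RSS and Atom feeds.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkFinder(HTMLParser):
    """Collects href values of <link rel="alternate"> tags that advertise a feed."""

    def __init__(self):
        super().__init__()
        self.feed_hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = {name: (value or "") for name, value in attrs}
        rel = attr.get("rel", "").lower().split()
        is_feed = attr.get("type", "").lower() in FEED_TYPES
        if "alternate" in rel and is_feed and attr.get("href"):
            self.feed_hrefs.append(attr["href"])


def choose_method(page_url, html):
    """Return ("rss", feed_url) if the page advertises a feed, else ("crawler", None)."""
    finder = FeedLinkFinder()
    finder.feed(html)
    if finder.feed_hrefs:
        # Feed links are often relative, so resolve against the page URL.
        return "rss", urljoin(page_url, finder.feed_hrefs[0])
    return "crawler", None
```

A real implementation would also need to fetch the page and handle sites that expose a feed without advertising it in the HTML head.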

Installation

This project uses uv for package management.

# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh

# Set up the project (run from your local clone of the repository)
cd blog-toolkit
uv sync

Quick Pull (One-Off, No Install)

Pull blog posts from any URL directly to a file, with no database or setup required:

uvx blog-toolkit pull https://example.substack.com -o ./posts.json

Requires uv. Choose the output format with --format json (default) or --format csv, and use -o to set the output file or directory.

Quick Start

Using the CLI

# Add a blog (auto-detects RSS or uses crawler)
uv run blog-toolkit add https://example.com/blog

# Add a blog with specific method
uv run blog-toolkit add https://example.com/blog --method rss

# List all blogs
uv run blog-toolkit list

# Update a blog (collect new posts)
uv run blog-toolkit update --blog-id 1

# Update all blogs
uv run blog-toolkit update --all

# Analyze a blog
uv run blog-toolkit analyze --blog-id 1

# Analyze all blogs by an author
uv run blog-toolkit analyze --author "John Doe"

# Compare two blogs
uv run blog-toolkit compare 1 2

# Export data
uv run blog-toolkit export --format json --output data.json

Using the Web Dashboard

# Start the web server
uv run python -m blog_toolkit.web.app

# Or use the Flask CLI
uv run flask --app blog_toolkit.web.app run

Then open your browser to http://127.0.0.1:5000

Project Structure

blog-toolkit/
├── src/
│   └── blog_toolkit/
│       ├── config.py       # Configuration management
│       ├── database.py     # SQLite database models
│       ├── feeds.py        # RSS/Atom feed parser
│       ├── crawler.py      # Web crawler
│       ├── collector.py    # Unified collection interface
│       ├── analyzer.py     # Analysis engine
│       ├── cli.py          # CLI interface
│       └── web/            # Web dashboard
│           ├── app.py      # Flask application
│           └── templates/  # HTML templates
├── tests/                  # Test files
├── data/                   # Database storage (gitignored)
└── pyproject.toml          # Project configuration
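
To give a sense of what a feed parser such as feeds.py has to do, here is a minimal sketch that extracts posts from an RSS 2.0 document using only the standard library. The function name `parse_rss_items` and the returned field names are assumptions for illustration; the toolkit's own parser may work differently and also handles Atom, which this sketch does not.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime


def parse_rss_items(xml_text):
    """Return a list of dicts (title, link, published) for each RSS <item>."""
    root = ET.fromstring(xml_text)
    posts = []
    for item in root.iter("item"):
        pub_date = item.findtext("pubDate")
        posts.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            # RSS 2.0 dates use the RFC 822 format, which email.utils can parse.
            "published": parsedate_to_datetime(pub_date) if pub_date else None,
        })
    return posts
```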

Configuration

Copy .env.example to .env and customize settings:

cp .env.example .env

Key settings:

  • BLOG_TOOLKIT_DB: Database file path (default: data/blogs.db)
  • CRAWLER_MAX_DEPTH: Maximum crawl depth (default: 10)
  • REQUEST_TIMEOUT: HTTP request timeout in seconds (default: 30)
  • WEB_PORT: Web dashboard port (default: 5000)
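
As an illustration of how these settings and their defaults fit together, the sketch below reads them from the process environment. The `Settings` class and `load_settings` function are hypothetical names, not the toolkit's config.py API, and the sketch does not load the .env file itself.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    db_path: str
    crawler_max_depth: int
    request_timeout: int
    web_port: int


def load_settings(env=None):
    """Build Settings from environment variables, using the documented defaults."""
    env = os.environ if env is None else env
    return Settings(
        db_path=env.get("BLOG_TOOLKIT_DB", "data/blogs.db"),
        crawler_max_depth=int(env.get("CRAWLER_MAX_DEPTH", "10")),
        request_timeout=int(env.get("REQUEST_TIMEOUT", "30")),
        web_port=int(env.get("WEB_PORT", "5000")),
    )
```

Passing a plain dict as `env` makes the function easy to exercise without touching the real environment, for example `load_settings({"WEB_PORT": "8080"})`.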

Analysis Features

Temporal Analysis

  • Posting frequency (daily/weekly/monthly)
  • Posting patterns (time of day, day of week)
  • Gaps between posts
  • Date range analysis
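
The temporal metrics listed above can be derived from post timestamps alone. The following is a rough sketch under that assumption; `temporal_summary` and its output keys are illustrative names, not the analyzer's actual interface.

```python
from collections import Counter
from datetime import datetime


def temporal_summary(timestamps):
    """Summarize posting cadence for a list of datetime objects."""
    ordered = sorted(timestamps)
    # Whole days between each pair of consecutive posts.
    gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
    return {
        "posts_per_month": dict(Counter(ts.strftime("%Y-%m") for ts in ordered)),
        # weekday() uses 0 for Monday through 6 for Sunday.
        "posts_per_weekday": dict(Counter(ts.weekday() for ts in ordered)),
        "longest_gap_days": max(gaps, default=0),
        "date_range": (ordered[0], ordered[-1]) if ordered else None,
    }
```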

Content Analysis

  • Word count distribution and trends
  • Reading time calculations
  • Content length over time
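
Word count and reading time are simple to approximate. The sketch below assumes a reading speed of 200 words per minute, which is a common rule of thumb and not a value taken from the toolkit; the function names are likewise hypothetical.

```python
import math
import re

WORDS_PER_MINUTE = 200  # assumed average reading speed, not from the toolkit


def word_count(text):
    """Count runs of word characters as words."""
    return len(re.findall(r"\b\w+\b", text))


def reading_time_minutes(text, wpm=WORDS_PER_MINUTE):
    """Estimated reading time, rounded up to a whole minute (0 for empty text)."""
    words = word_count(text)
    return math.ceil(words / wpm) if words else 0
```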

Topic Analysis

  • Keyword extraction
  • Tag and category distribution
  • Top keywords identification
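
One simple way to identify top keywords is plain term frequency after removing stopwords. This is a sketch of that technique only; the toolkit may use a different method, and the `STOPWORDS` set and `top_keywords` function here are illustrative.

```python
import re
from collections import Counter

# A deliberately tiny stopword list for illustration.
STOPWORDS = {
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "in",
    "is", "it", "of", "on", "or", "that", "the", "this", "to", "with",
}


def top_keywords(text, limit=5):
    """Return the most frequent non-stopword tokens as (word, count) pairs."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    return counts.most_common(limit)
```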

Sentiment Analysis

  • Overall sentiment (positive/neutral/negative)
  • Per-post sentiment scores
  • Sentiment trends over time
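
To make the positive/neutral/negative labels concrete, here is a toy lexicon-based scorer. It is a stand-in for illustration and not the toolkit's sentiment method; the word lists, the `threshold` parameter, and both function names are assumptions.

```python
import re

# Toy lexicons; a real analysis would use a much larger, validated resource.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "poor", "hate", "terrible", "sad"}


def sentiment_score(text):
    """Score in [-1, 1]: (positive hits - negative hits) / total hits, or 0.0 if none."""
    tokens = re.findall(r"[a-z]+", text.lower())
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    total = pos + neg
    return (pos - neg) / total if total else 0.0


def sentiment_label(score, threshold=0.1):
    """Map a score to one of the three labels used in the overall summary."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"
```

Per-post scores computed this way can then be grouped by month to show a trend over time.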

Database Schema

  • blogs: Blog metadata (name, URL, feed URL, author, collection method)
  • posts: Individual blog posts (title, content, metadata, word count, etc.)
  • analyses: Cached analysis results for performance
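
The three tables could be laid out roughly as follows. Column names and types here are guesses based on the descriptions above, not the schema shipped in database.py, and `init_db` is a hypothetical helper.

```python
import sqlite3

# Assumed columns; the real schema may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS blogs (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    url TEXT NOT NULL UNIQUE,
    feed_url TEXT,
    author TEXT,
    collection_method TEXT
);
CREATE TABLE IF NOT EXISTS posts (
    id INTEGER PRIMARY KEY,
    blog_id INTEGER NOT NULL REFERENCES blogs(id),
    title TEXT,
    content TEXT,
    published_at TEXT,
    word_count INTEGER
);
CREATE TABLE IF NOT EXISTS analyses (
    id INTEGER PRIMARY KEY,
    blog_id INTEGER NOT NULL REFERENCES blogs(id),
    analysis_type TEXT NOT NULL,
    result_json TEXT NOT NULL,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
"""


def init_db(path=":memory:"):
    """Open a SQLite database and create the tables if they are missing."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```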

CLI Commands

  • add <url> [--method <method>] - Add a new blog
  • update [--blog-id <id>] [--all] - Update blog(s)
  • analyze [--blog-id <id>] [--author <name>] - Run analysis
  • list - List all blogs
  • show <blog-id> - Show blog details
  • compare <blog-id1> <blog-id2> - Compare two blogs
  • export [--format json|csv] [--output <file>] - Export data
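
For readers curious what the json and csv export formats involve, the sketch below serializes a list of post records to either format. The record shape (flat dicts) and the `export_posts` function are assumptions; the files produced by the actual export command may be structured differently.

```python
import csv
import io
import json


def export_posts(posts, fmt="json"):
    """Serialize a list of flat dicts to a JSON or CSV string."""
    if fmt == "json":
        return json.dumps(posts, indent=2, ensure_ascii=False)
    if fmt == "csv":
        # Use the union of keys so records with missing fields still export.
        fieldnames = sorted({key for post in posts for key in post})
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
        writer.writeheader()
        writer.writerows(posts)
        return buffer.getvalue()
    raise ValueError(f"unsupported format: {fmt!r}")
```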

Web Dashboard Features

  • Dashboard: Overview of all blogs, recent posts, statistics
  • Blog Detail: Individual blog view with posts, metrics, and charts
  • Author View: Aggregate view of all blogs by an author
  • Comparison View: Side-by-side comparison of blogs
  • Interactive Charts: Plotly charts for trends and metrics

Documentation

  • Feed Extraction Workarounds: Mechanisms for pulling RSS feed data from Substack and other platforms (platform limits, JS rendering, feed discovery, content parsing). A shareable guide for developers building similar tools.

Development

# Install development dependencies
uv sync --dev

# Run tests
uv run pytest

# Format code
uv run black src/

# Type checking
uv run mypy src/

License

MIT
