A toolkit for collecting, storing, and analyzing multiple blogs with RSS feed and web crawler support

Blog Toolkit

A comprehensive Python toolkit for collecting, storing, and analyzing multiple blogs with RSS feed and web crawler support.

Features

  • Multiple Collection Methods: Automatically detect and use RSS feeds, or fall back to web crawling
  • Comprehensive Analysis: Temporal patterns, content metrics, topic analysis, and sentiment analysis
  • Cross-Blog Comparison: Compare metrics across blogs by the same author or different authors
  • CLI: Full-featured command-line interface for all operations
  • Web Dashboard: Interactive web interface with charts and visualizations
  • SQLite Storage: Local database for storing all blog data and metadata
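
The RSS-or-crawler decision in the first bullet can be sketched with the standard library alone. This is an illustration under stated assumptions, not the toolkit's actual detection logic: the names FeedLinkFinder and find_feed_url are hypothetical, and the sketch assumes the blog advertises its feed through a <link rel="alternate"> tag in the page head.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types commonly used to advertise RSS and Atom feeds.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkFinder(HTMLParser):
    """Collect href values of <link rel="alternate"> tags that point at feeds."""

    def __init__(self):
        super().__init__()
        self.feed_hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = {name: (value or "") for name, value in attrs}
        rel = attr.get("rel", "").lower().split()
        if "alternate" in rel and attr.get("type", "").lower() in FEED_TYPES:
            if attr.get("href"):
                self.feed_hrefs.append(attr["href"])


def find_feed_url(page_url, html):
    """Return the first advertised feed URL, or None to signal a crawler fallback."""
    finder = FeedLinkFinder()
    finder.feed(html)
    if finder.feed_hrefs:
        # Feed links are often relative, so resolve them against the page URL.
        return urljoin(page_url, finder.feed_hrefs[0])
    return None
```

A caller would fetch the blog's home page, pass the HTML to find_feed_url, and switch to crawling when the result is None.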

Installation

This project uses UV for package management.

# Install UV if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh

# From your clone of the repository, set up the project
cd blog-toolkit
uv sync

Quick Pull (One-Off, No Install)

Pull blog posts from any URL directly to a file, with no database or setup required:

uvx blog-toolkit pull https://example.substack.com -o ./posts.json

Requires uv. Choose the output format with --format json (default) or --format csv, and pass -o to set the output file or directory.

Quick Start

Using the CLI

# Add a blog (auto-detects RSS or uses crawler)
uv run blog-toolkit add https://example.com/blog

# Add a blog with specific method
uv run blog-toolkit add https://example.com/blog --method rss

# List all blogs
uv run blog-toolkit list

# Update a blog (collect new posts)
uv run blog-toolkit update --blog-id 1

# Update all blogs
uv run blog-toolkit update --all

# Analyze a blog
uv run blog-toolkit analyze --blog-id 1

# Analyze all blogs by an author
uv run blog-toolkit analyze --author "John Doe"

# Compare two blogs
uv run blog-toolkit compare 1 2

# Export data
uv run blog-toolkit export --format json --output data.json

Using the Web Dashboard

# Start the web server
uv run python -m blog_toolkit.web.app

# Or use the Flask CLI
uv run flask --app blog_toolkit.web.app run

Then open your browser to http://127.0.0.1:5000

Project Structure

blog-toolkit/
├── src/
│   └── blog_toolkit/
│       ├── config.py       # Configuration management
│       ├── database.py     # SQLite database models
│       ├── feeds.py        # RSS/Atom feed parser
│       ├── crawler.py      # Web crawler
│       ├── collector.py    # Unified collection interface
│       ├── analyzer.py     # Analysis engine
│       ├── cli.py          # CLI interface
│       └── web/            # Web dashboard
│           ├── app.py      # Flask application
│           └── templates/  # HTML templates
├── tests/                  # Test files
├── data/                   # Database storage (gitignored)
└── pyproject.toml          # Project configuration

Configuration

Copy .env.example to .env and customize settings:

cp .env.example .env

Key settings:

  • BLOG_TOOLKIT_DB: Database file path (default: data/blogs.db)
  • CRAWLER_MAX_DEPTH: Maximum crawl depth (default: 10)
  • REQUEST_TIMEOUT: HTTP request timeout in seconds (default: 30)
  • WEB_PORT: Web dashboard port (default: 5000)
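
As a rough sketch of how these settings could be read, the snippet below pulls the four variables from the process environment and falls back to the defaults listed above. The Settings class and load_settings function are hypothetical names for illustration; the toolkit's own config.py may work differently, and loading the .env file into the environment is assumed to happen elsewhere.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    db_path: str
    crawler_max_depth: int
    request_timeout: int  # seconds
    web_port: int


def load_settings(env=None):
    """Build Settings from environment variables, using the documented defaults."""
    env = os.environ if env is None else env
    return Settings(
        db_path=env.get("BLOG_TOOLKIT_DB", "data/blogs.db"),
        crawler_max_depth=int(env.get("CRAWLER_MAX_DEPTH", "10")),
        request_timeout=int(env.get("REQUEST_TIMEOUT", "30")),
        web_port=int(env.get("WEB_PORT", "5000")),
    )
```

Passing a plain dict instead of os.environ, for example load_settings({"WEB_PORT": "8080"}), makes the function easy to exercise in tests.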

Analysis Features

Temporal Analysis

  • Posting frequency (daily/weekly/monthly)
  • Posting patterns (time of day, day of week)
  • Gaps between posts
  • Date range analysis
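
Two of the temporal metrics above, gaps between posts and day-of-week patterns, can be sketched from a list of publication timestamps. The function names posting_gaps and posts_by_weekday are hypothetical, and the sketch does not claim to match how analyzer.py computes these values.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ("Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday")


def posting_gaps(timestamps):
    """Return the gaps in days between consecutive posts, oldest first."""
    ordered = sorted(timestamps)
    return [(later - earlier).total_seconds() / 86400
            for earlier, later in zip(ordered, ordered[1:])]


def posts_by_weekday(timestamps):
    """Count posts per weekday name."""
    return Counter(WEEKDAYS[ts.weekday()] for ts in timestamps)


if __name__ == "__main__":
    posts = [datetime(2024, 1, 8), datetime(2024, 1, 1), datetime(2024, 1, 3)]
    print(posting_gaps(posts))
    print(posts_by_weekday(posts))
```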

Content Analysis

  • Word count distribution and trends
  • Reading time calculations
  • Content length over time
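
A reading-time estimate is usually the word count divided by an assumed reading speed. The sketch below uses 200 words per minute and whitespace tokenization; both are assumptions chosen for illustration, not values taken from the toolkit, and the function names are hypothetical.

```python
import math

WORDS_PER_MINUTE = 200  # assumed reading speed, not a toolkit setting


def word_count(text):
    """Count whitespace-separated tokens."""
    return len(text.split())


def reading_time_minutes(text, wpm=WORDS_PER_MINUTE):
    """Round the estimate up to whole minutes; empty text takes zero minutes."""
    words = word_count(text)
    return math.ceil(words / wpm) if words else 0
```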

Topic Analysis

  • Keyword extraction
  • Tag and category distribution
  • Top keywords identification
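
One simple way to approximate keyword extraction is a frequency count over lowercased words with common stopwords removed. The sketch below takes that approach; the tiny STOPWORDS set and the top_keywords name are illustrative assumptions, and the toolkit may use a different technique.

```python
import re
from collections import Counter

# A deliberately small stopword list for illustration only.
STOPWORDS = {"a", "an", "and", "are", "for", "in", "is", "of",
             "on", "the", "to", "with"}


def top_keywords(text, n=5):
    """Return the n most frequent non-stopword terms as (word, count) pairs."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return counts.most_common(n)
```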

Sentiment Analysis

  • Overall sentiment (positive/neutral/negative)
  • Per-post sentiment scores
  • Sentiment trends over time
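
To make the positive/neutral/negative labels concrete, here is a toy lexicon-based scorer. It is only a sketch: the word lists, the 0.1 threshold, and the function names are assumptions for illustration, and the source does not say which sentiment method or library the toolkit actually uses.

```python
import re

# Toy lexicons for illustration; real lexicons are far larger.
POSITIVE = {"good", "great", "excellent", "love", "happy"}
NEGATIVE = {"bad", "poor", "terrible", "hate", "sad"}


def sentiment_score(text):
    """Return (positive - negative) / matched words, a value in [-1, 1]."""
    words = re.findall(r"[a-z]+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total


def sentiment_label(score, threshold=0.1):
    """Map a score to one of the three labels."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"
```

Per-post scores computed this way can be averaged for an overall figure or plotted against publication date to show a trend.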

Database Schema

  • blogs: Blog metadata (name, URL, feed URL, author, collection method)
  • posts: Individual blog posts (title, content, metadata, word count, etc.)
  • analyses: Cached analysis results for performance
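
The three tables could look roughly like the following in SQLite. Only the table names and the fields named above come from this README; every other column (ids, blog_id, published_at, analysis_type, result_json, created_at) and the constraints are assumptions added to make the sketch runnable.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS blogs (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    url TEXT NOT NULL UNIQUE,
    feed_url TEXT,
    author TEXT,
    collection_method TEXT
);
CREATE TABLE IF NOT EXISTS posts (
    id INTEGER PRIMARY KEY,
    blog_id INTEGER NOT NULL REFERENCES blogs(id),
    title TEXT,
    content TEXT,
    published_at TEXT,          -- assumed column
    word_count INTEGER
);
CREATE TABLE IF NOT EXISTS analyses (
    id INTEGER PRIMARY KEY,
    blog_id INTEGER NOT NULL REFERENCES blogs(id),
    analysis_type TEXT NOT NULL,  -- assumed column
    result_json TEXT NOT NULL,    -- assumed column holding the cached result
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
"""


def create_schema(conn):
    """Create the three tables if they do not exist yet."""
    conn.executescript(SCHEMA)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type='table'")
    print(sorted(row[0] for row in rows))
```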

CLI Commands

  • add <url> [--method <method>] - Add a new blog
  • update [--blog-id <id>] [--all] - Update blog(s)
  • analyze [--blog-id <id>] [--author <name>] - Run analysis
  • list - List all blogs
  • show <blog-id> - Show blog details
  • compare <blog-id1> <blog-id2> - Compare two blogs
  • export [--format json|csv] [--output <file>] - Export data

Web Dashboard Features

  • Dashboard: Overview of all blogs, recent posts, statistics
  • Blog Detail: Individual blog view with posts, metrics, and charts
  • Author View: Aggregate view of all blogs by an author
  • Comparison View: Side-by-side comparison of blogs
  • Interactive Charts: Plotly charts for trends and metrics

Documentation

  • Feed Extraction Workarounds: Mechanisms for pulling RSS feed data from Substack and other platforms (platform limits, JS rendering, feed discovery, content parsing). Shareable guide for developers building similar tools.

Development

# Install development dependencies
uv sync --dev

# Run tests
uv run pytest

# Format code
uv run black src/

# Type checking
uv run mypy src/

License

MIT
