Capcat: a command-line tool for preserving web content through ethical scraping.

Capcat captures articles from 17 sources as clean Markdown files, with optional self-contained HTML output. It supports both an interactive TUI and batch automation.

Installation

pipx install capcat

Requires Python 3.8+.

Quick Start

# Interactive TUI
capcat catch

# Fetch a bundle
capcat bundle tech --count 10

# Fetch specific sources
capcat fetch hn,bbc --count 15

# Archive a single article
capcat single https://example.com/article

# List available sources
capcat list sources

# Show version
capcat --version

Capcat initializes the vault automatically on first run.

Commands

Command                   Description
catch                     Launch the interactive TUI
single <url>              Archive a single article
fetch <sources>           Batch fetch from sources (comma-separated)
bundle <name>             Fetch a pre-configured bundle
list sources              List all available sources
list bundles              List all available bundles
add-source --url <url>    Add a custom RSS/news source
remove-source             Remove a source
generate-config           Generate a YAML config
init                      Explicitly scaffold vault (runs automatically on first use)

Options

Flag             Description
--count N        Number of articles to fetch (default: 30)
--output DIR     Output directory (default: current dir)
--media          Download images, video, audio, and PDF files
--pdfs           Download PDF files only (independent of --media)
--html           Generate self-contained HTML output
--update         Re-fetch and update existing articles
-V, --verbose    Verbose output
-q, --quiet      Quiet output
-L <file>        Log output to file
--version        Show version and exit
--help           Show help and exit
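
When Capcat is driven from a script rather than typed by hand, it can help to assemble the command line in one place. The following is a minimal sketch, not part of Capcat itself: the helper name build_capcat_command is hypothetical, and it only uses subcommands and flags listed in this README. Whether every flag is accepted by every subcommand is an assumption to check against capcat --help.

```python
import shlex
from typing import List, Optional


def build_capcat_command(
    subcommand: str,
    target: str,
    count: Optional[int] = None,
    output_dir: Optional[str] = None,
    html: bool = False,
    media: bool = False,
) -> List[str]:
    """Build an argv list for `capcat fetch`, `capcat bundle`, or `capcat single`.

    Hypothetical helper: it only strings together flags documented above.
    """
    if subcommand not in {"fetch", "bundle", "single"}:
        raise ValueError(f"unsupported subcommand: {subcommand}")
    argv = ["capcat", subcommand, target]
    if count is not None:
        argv += ["--count", str(count)]
    if output_dir is not None:
        argv += ["--output", output_dir]
    if html:
        argv.append("--html")
    if media:
        argv.append("--media")
    return argv


if __name__ == "__main__":
    cmd = build_capcat_command("bundle", "tech", count=10, html=True)
    # Print a shell-safe rendering of the command instead of running it.
    print(shlex.join(cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True) once Capcat is installed; keeping it as a list avoids shell quoting problems with URLs.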

Bundles

Pre-configured topic collections:

Bundle     Sources                        Description
tech       IEEE, Mashable                 Consumer technology news
techpro    HN, Lobsters, InfoQ            Professional developer news
ai         MIT News, Google Research      AI research and developments
science    Nature, Scientific American    Scientific publications
news       BBC, Guardian                  General news
sports     BBC Sport                      Sports coverage

Available Sources

Tech Pro: Hacker News (hn), Lobsters (lb), InfoQ (iq)

Tech: IEEE Spectrum (ieee), Mashable (mashable)

AI: Google Research (google-research), MIT News (mitnews)

News: BBC (bbc), The Guardian (guardian)

Science: Nature (nature), Scientific American (scientificamerican)

Sports: BBC Sport (bbcsport)

Custom: Medium, Substack (add via capcat add-source)
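
To combine several bundles into one capcat fetch call, the bundle table and the source IDs above can be joined by hand. The sketch below does that in Python. The mapping is transcribed from this README, not read from Capcat's own configuration, so treat it as an assumption and confirm it with capcat list bundles. The names BUNDLE_SOURCES and sources_argument are hypothetical.

```python
from typing import Dict, List

# Transcribed from the Bundles and Available Sources sections of this README.
# Assumption: the installed version may define bundles differently.
BUNDLE_SOURCES: Dict[str, List[str]] = {
    "tech": ["ieee", "mashable"],
    "techpro": ["hn", "lb", "iq"],
    "ai": ["mitnews", "google-research"],
    "science": ["nature", "scientificamerican"],
    "news": ["bbc", "guardian"],
    "sports": ["bbcsport"],
}


def sources_argument(*bundles: str) -> str:
    """Return a comma-separated source list suitable for `capcat fetch`.

    Duplicates are dropped and first-seen order is preserved.
    """
    seen: List[str] = []
    for bundle in bundles:
        for source in BUNDLE_SOURCES[bundle]:  # KeyError on unknown bundle
            if source not in seen:
                seen.append(source)
    return ",".join(seen)


if __name__ == "__main__":
    print(sources_argument("news", "sports"))
```

The result can be pasted after capcat fetch, for example capcat fetch bbc,guardian,bbcsport --count 15.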

Output Structure

Batch mode (fetch / bundle)

News/news_DD-MM-YYYY/
├── Hacker-News_DD-MM-YYYY/
│   ├── 01_Article_Title/
│   │   ├── article.md
│   │   ├── comments.md
│   │   ├── html/
│   │   │   ├── article.html
│   │   │   └── comments.html
│   │   └── images/
│   └── 02_Another_Article/
└── BBC_DD-MM-YYYY/

Single article mode

Capcats/cc_DD-MM-YYYY-Title/
├── article.md
├── html/
│   └── article.html
└── images/

HTML output is fully self-contained — embedded CSS, no external dependencies. Open in any browser, share via email, archive permanently.
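
Because every article directory in both layouts contains an article.md, an archive can be indexed by searching for that file name. This is a minimal sketch under that assumption; iter_articles is a hypothetical helper, not a Capcat API.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_articles(root: Path) -> Iterator[Tuple[str, Path]]:
    """Yield (article_directory_name, path_to_article_md) under root.

    Works for batch output (News/...) and single-article output (Capcats/...)
    because it only relies on the article.md file name shown above.
    """
    for markdown_file in sorted(root.rglob("article.md")):
        yield markdown_file.parent.name, markdown_file


if __name__ == "__main__":
    # Walk the current directory, which is the default output location.
    for name, path in iter_articles(Path(".")):
        print(f"{name}: {path}")
```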

Configuration

Optional capcat.yml in your project directory:

output_base_dir: "../MyNews"
max_workers: 8
download_media: false

Config priority, highest first: CLI flag, TUI prompt, per-source Config/sources/active/<source>/config.yaml, then Config/Global-settings.yaml.
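
The priority list can be read as a first-match lookup across layers. The sketch below illustrates that reading only; it assumes the list runs from highest to lowest priority and is not Capcat's actual resolution code. The function name resolve_setting is hypothetical, and the keys are borrowed from the capcat.yml sample above.

```python
from typing import Any, Dict, Optional, Sequence


def resolve_setting(
    key: str,
    layers: Sequence[Optional[Dict[str, Any]]],
    default: Any = None,
) -> Any:
    """Return the value of key from the first layer that defines it.

    layers must be ordered from highest to lowest priority; a layer may be
    None or empty when that source of configuration was not provided.
    """
    for layer in layers:
        if layer and key in layer:
            return layer[key]
    return default


if __name__ == "__main__":
    cli_flags = {"download_media": True}       # e.g. --media on the command line
    tui_prompt = None                          # TUI not used in this run
    per_source = {"max_workers": 4}            # per-source config.yaml
    global_settings = {"max_workers": 8, "download_media": False}

    layers = [cli_flags, tui_prompt, per_source, global_settings]
    print(resolve_setting("download_media", layers))
    print(resolve_setting("max_workers", layers))
```

A first-match lookup keeps each layer independent: a lower layer never needs to know whether a higher one overrode it.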

Automation

# Daily tech news
0 9 * * * cd ~/news && capcat bundle tech --count 20 --html

# Weekly science digest
0 10 * * 0 cd ~/news && capcat bundle science --count 30 --media

Privacy and Ethics

  • Usernames anonymized as "Anonymous" in comment archives
  • Respects robots.txt
  • Rate limiting: 1 request per 10 seconds
  • Prefers RSS/APIs over HTML scraping
  • No paywall circumvention
  • Proper source attribution
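
Two of the points above, honoring robots.txt and spacing requests 10 seconds apart, can be illustrated with the Python standard library. This is a sketch of the general technique, not Capcat's implementation; the class name PoliteGate and the user agent string example-bot are hypothetical.

```python
import time
from typing import Callable, Optional
from urllib.robotparser import RobotFileParser


class PoliteGate:
    """Checks robots.txt rules and enforces a minimum delay between requests."""

    def __init__(
        self,
        robots_txt: str,
        user_agent: str = "example-bot",  # hypothetical agent name
        min_interval: float = 10.0,       # 1 request per 10 seconds
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._parser = RobotFileParser()
        self._parser.parse(robots_txt.splitlines())
        self._user_agent = user_agent
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None

    def allowed(self, url: str) -> bool:
        """Return True if robots.txt permits this agent to fetch url."""
        return self._parser.can_fetch(self._user_agent, url)

    def wait_turn(self) -> float:
        """Sleep until min_interval has passed since the previous turn.

        Returns the number of seconds slept (0.0 on the first call).
        """
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self._min_interval - (now - self._last))
        if delay > 0:
            self._sleep(delay)
        self._last = now + delay
        return delay


if __name__ == "__main__":
    gate = PoliteGate("User-agent: *\nDisallow: /private/\n")
    for url in ("https://example.com/news/1", "https://example.com/private/x"):
        if gate.allowed(url):
            gate.wait_turn()
            print("would fetch", url)
        else:
            print("skipping", url)
```

The clock and sleep functions are injectable so the delay logic can be tested without waiting in real time.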

Documentation

Full documentation at capcat.org.

Contributing

Open an issue or pull request on GitHub.

License

MIT License — see LICENSE.txt


