A command-line tool designed to solve content preservation challenges through ethical scraping.

Capcat - Archive and Share Articles with Confidence

A dual-mode news archiving tool that captures articles from 12 curated sources as clean Markdown files (Obsidian-ready) with optional self-contained HTML output - perfect for knowledge management and offline sharing.

Why Capcat?

Build Your Knowledge Base: Every article is saved as clean Markdown that you can drop directly into Obsidian for full-text search, backlinks, and graph views. Perfect for researchers and lifelong learners.

Share Without Breaking: Optional self-contained HTML output with all styles and scripts embedded. Send to anyone, open anywhere, years later - it just works.

Two Ways to Use:

  • Interactive Menu (./capcat catch) - Visual interface for browsing sources and bundles
  • Command Line - Fast automation for power users

Curated Bundles: Pre-configured collections like Tech, AI, Science, News - fetch multiple related sources at once.

Quick Start

Interactive Mode (Recommended)

./capcat catch

Choose from:

  • Fetch by Source - Browse 12 curated sources (Hacker News, BBC, IEEE, Nature, etc.)
  • Fetch by Bundle - Curated collections (Tech, AI, Science, News, Sports)
  • Single Article - Archive any URL instantly
  • Source Management - Add custom RSS/news sources

Command Line Mode

# Fetch curated tech bundle (IEEE + Mashable)
./capcat bundle tech --count 10

# Fetch specific sources with media
./capcat fetch hn,bbc --count 15 --media

# Archive a single article
./capcat single https://example.com/article

# List all available sources
./capcat list sources

Key Features

Self-Contained HTML for Easy Sharing

When HTML output is generated, each article is a complete, portable HTML file:

  • Embedded CSS - All styles inline, no external stylesheets
  • Embedded JavaScript - Interactive features work offline
  • Local Images - Downloaded and stored with the article
  • No Dependencies - Open in any browser, share via email, archive forever

Perfect for:

  • Email attachments that always look right
  • Long-term archiving without link rot
  • Offline reading on any device
  • Sharing articles that might disappear

Dual Interface

Interactive Menu (./capcat catch):

  • Visual source selection
  • Bundle browsing
  • Progress tracking
  • Error handling with retries
  • No commands to memorize

Command Line:

  • Fast automation and scripting
  • Batch processing
  • CI/CD integration
  • Power user workflows

Smart Content Extraction

  • 12 Curated Sources - HN, BBC, Guardian, Nature, IEEE, Scientific American, MIT News, and more
  • Intelligent Fallback - Finds images even when primary extraction misses them
  • Comment Preservation - Captures discussions with privacy anonymization
  • Media Handling - Images always downloaded, video/audio/PDFs with --media flag

Markdown-Native Output

  • Obsidian-Ready - Clean markdown files you can drop directly into your vault
  • Portable Archives - Standard markdown format works everywhere
  • Local Images - All media downloaded and referenced with relative paths
  • Metadata Headers - Source, date, and URL preserved in frontmatter-style headers

Bundle System

Pre-configured topic collections:

Bundle    Sources                       Description
tech      IEEE, Mashable                Consumer technology news
techpro   HN, Lobsters, InfoQ           Professional developer news
ai        MIT News, Google Research     AI research and developments
science   Nature, Scientific American   Scientific publications
news      BBC, Guardian                 General news
sports    BBC Sport                     Sports coverage

Add your own bundles in sources/active/bundles.yml.
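
For scripts that need to know which sources a bundle expands to, the bundle table above can be mirrored as a plain mapping. This is only a sketch: BUNDLES and sources_for are hypothetical names used for illustration, and the actual schema of bundles.yml is not shown here and may differ.

```python
# Hypothetical in-memory mirror of the bundle table above.
# The real bundles.yml schema may look different.
BUNDLES = {
    "tech": ["IEEE", "Mashable"],
    "techpro": ["HN", "Lobsters", "InfoQ"],
    "ai": ["MIT News", "Google Research"],
    "science": ["Nature", "Scientific American"],
    "news": ["BBC", "Guardian"],
    "sports": ["BBC Sport"],
}


def sources_for(*bundle_names):
    """Return the de-duplicated sources for the given bundles, in order."""
    seen = []
    for name in bundle_names:
        if name not in BUNDLES:
            raise KeyError(f"unknown bundle: {name}")
        for source in BUNDLES[name]:
            if source not in seen:
                seen.append(source)
    return seen


if __name__ == "__main__":
    print(sources_for("tech", "techpro"))
```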

Installation

Quick Setup

# Clone the repository
git clone https://github.com/stayukasabov/capcat.git
cd capcat/Application

# Auto-fix dependencies (recommended)
./scripts/fix_dependencies.sh

# Or manual setup
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

First Run

# Launch interactive menu
./capcat catch

# Or try a quick fetch
./capcat fetch hn --count 5

Markdown-First Workflow (Obsidian Compatible)

Every article is saved as clean Markdown with proper formatting:

# Article Title

**Source**: Hacker News | **Date**: 2025-12-31 | **URL**: [Original Link]

## Content

Article body with images referenced locally...

![Image Description](images/image.jpg)

Perfect for Knowledge Management:

  • Obsidian: Drag folders directly into your vault for full-text search and backlinks
  • Notion: Import markdown files while preserving structure
  • Logseq/Roam: Compatible with daily notes and graph views
  • Standard Editors: Works in VS Code, Typora, iA Writer, or any markdown editor

Metadata Included:

  • Source attribution
  • Publication date
  • Original URLs
  • Local image paths (relative linking)
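
If you want to index archived articles yourself, the header line shown in the sample above can be read back with a small parser. The sketch below assumes the header always follows the "**Key**: value | **Key**: value" pattern from that sample; parse_header and FIELD_RE are hypothetical names, not part of Capcat.

```python
import re

# Matches "**Key**: value" pairs separated by "|", as in the sample header.
FIELD_RE = re.compile(r"\*\*(?P<key>[^*]+)\*\*:\s*(?P<value>[^|]+)")


def parse_header(markdown_text):
    """Return (title, metadata) for an article in the sample layout."""
    title = None
    metadata = {}
    for line in markdown_text.splitlines():
        stripped = line.strip()
        if title is None and stripped.startswith("# "):
            # First level-one heading is treated as the article title.
            title = stripped[2:].strip()
        elif stripped.startswith("**"):
            # First bold-prefixed line is treated as the metadata header.
            for match in FIELD_RE.finditer(stripped):
                metadata[match["key"].strip().lower()] = match["value"].strip()
            break
    return title, metadata


if __name__ == "__main__":
    sample = (
        "# Article Title\n\n"
        "**Source**: Hacker News | **Date**: 2025-12-31 | **URL**: [Original Link]\n"
    )
    print(parse_header(sample))
```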

Output Structure

Batch Mode (fetch/bundle)

../News/news_31-12-2025/
├── Hacker-News_31-12-2025/
│   ├── 01_Article_Title/
│   │   ├── article.md           # Primary markdown file
│   │   ├── html/
│   │   │   └── article.html     # Self-contained HTML with embedded CSS/JS
│   │   ├── images/
│   │   │   ├── content1.jpg
│   │   │   └── content2.png
│   │   └── comments.md          # Discussions (HN, Reddit sources)
│   └── 02_Another_Article/
└── BBC_31-12-2025/
    └── ...

Single Article Mode

../Capcats/cc_31-12-2025-Article-Title/
├── article.md                    # Standalone markdown
├── html/
│   └── article.html              # Complete standalone file
└── images/
    └── ...
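
To locate archives from another script, the folder names in the two trees above can be reconstructed from a source name, a title, and a date. This is a sketch under assumptions: the slug rule (non-alphanumeric runs collapsed into a separator) is inferred from the example names only, and slug, batch_article_dir, and single_article_dir are hypothetical helpers.

```python
import re
from datetime import date
from pathlib import PurePosixPath


def slug(text, sep):
    """Collapse runs of non-alphanumeric characters into sep (assumed rule)."""
    return re.sub(r"[^A-Za-z0-9]+", sep, text).strip(sep)


def batch_article_dir(base, source, index, title, day):
    """Build a path like news_DD-MM-YYYY/Source_DD-MM-YYYY/NN_Title."""
    stamp = day.strftime("%d-%m-%Y")
    return (
        PurePosixPath(base)
        / f"news_{stamp}"
        / f"{slug(source, '-')}_{stamp}"
        / f"{index:02d}_{slug(title, '_')}"
    )


def single_article_dir(base, title, day):
    """Build a path like cc_DD-MM-YYYY-Article-Title."""
    stamp = day.strftime("%d-%m-%Y")
    return PurePosixPath(base) / f"cc_{stamp}-{slug(title, '-')}"


if __name__ == "__main__":
    day = date(2025, 12, 31)
    print(batch_article_dir("../News", "Hacker News", 1, "Article Title", day))
    print(single_article_dir("../Capcats", "Article Title", day))
```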

Privacy & Ethics

Privacy-First Design:

  • Usernames anonymized as "Anonymous" in comments
  • Profile links preserved for reference
  • No personal data collection or storage
  • Only public content archived
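
The anonymization rule above (replace the username, keep the profile link) can be illustrated as follows. This is not Capcat's implementation; the Comment record and its fields are hypothetical and only show the shape of the transformation.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Comment:
    # Hypothetical record; Capcat's internal comment model is not documented here.
    author: str
    profile_url: str
    text: str


def anonymize(comments):
    """Return copies with the author replaced and everything else unchanged."""
    return [replace(comment, author="Anonymous") for comment in comments]


if __name__ == "__main__":
    original = [Comment("alice", "https://example.com/user/alice", "Nice article")]
    print(anonymize(original))
```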

Ethical Scraping:

  • Respects robots.txt
  • Rate limiting (1 request per 10 seconds)
  • Prefers RSS/APIs over HTML scraping
  • No paywall circumvention
  • Proper source attribution
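
Two of the practices above, honoring robots.txt and spacing requests 10 seconds apart, can be sketched with the Python standard library. This is an illustration, not Capcat's own code: RateLimiter, allowed_by_robots, and the "capcat-example" user agent are assumptions made for the example.

```python
import time
from urllib.robotparser import RobotFileParser


class RateLimiter:
    """Allow at most one request per `interval` seconds."""

    def __init__(self, interval=10.0, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until the interval has elapsed; return the seconds waited."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self._last + self.interval - now)
            if delay:
                self._sleep(delay)
        self._last = now + delay
        return delay


def allowed_by_robots(robots_txt, url, user_agent="capcat-example"):
    """Check a URL against the text of an already-downloaded robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = "User-agent: *\nDisallow: /private/\n"
    print(allowed_by_robots(rules, "https://example.com/news/1"))
    print(allowed_by_robots(rules, "https://example.com/private/page"))
```

Calling wait() before each request keeps the spacing even when individual downloads finish quickly; the clock and sleep parameters exist only so the limiter can be exercised without real delays.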

Advanced Usage

Add Custom Sources

# Interactive source addition
./capcat add-source --url https://example.com/rss

# Or edit configuration
nano sources/active/config_driven/configs/newsource.yaml

Configuration Priority

Settings are resolved in this order, highest priority first:

  1. CLI arguments
  2. Environment variables
  3. capcat.yml
  4. Defaults

Example capcat.yml:

output_base_dir: "../MyNews"
max_workers: 8
download_media: true
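
The priority order described in this section can be pictured as layered lookups, where the first layer that defines a key wins. The sketch below is an assumption-heavy illustration: the CAPCAT_ environment prefix, the default values, and the resolve_config helper are all hypothetical and are not taken from Capcat.

```python
from collections import ChainMap

# Hypothetical defaults for illustration only.
DEFAULTS = {"output_base_dir": "../News", "max_workers": 4, "download_media": False}


def coerce(raw, example):
    """Convert an environment string to the type of the default value."""
    if isinstance(example, bool):  # check bool first: bool is a subclass of int
        return raw.strip().lower() in {"1", "true", "yes"}
    if isinstance(example, int):
        return int(raw)
    return raw


def resolve_config(cli_args, environ, file_config, defaults=DEFAULTS, prefix="CAPCAT_"):
    """Merge layers: CLI > environment > capcat.yml > defaults."""
    env_layer = {}
    for key, example in defaults.items():
        raw = environ.get(prefix + key.upper())
        if raw is not None:
            env_layer[key] = coerce(raw, example)
    # Unset CLI options (None) must not shadow lower layers.
    cli_layer = {k: v for k, v in cli_args.items() if v is not None}
    return dict(ChainMap(cli_layer, env_layer, file_config, defaults))


if __name__ == "__main__":
    file_config = {"output_base_dir": "../MyNews", "max_workers": 8, "download_media": True}
    print(resolve_config({"max_workers": 2}, {}, file_config))
```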

Automation

# Daily tech news cron job
0 9 * * * cd /path/to/capcat && ./capcat bundle tech --count 20

# Weekly science digest
0 10 * * 0 cd /path/to/capcat && ./capcat bundle science --count 30 --media
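
Where cron is not available, the same jobs can be launched from a small Python wrapper. This sketch uses only the bundle, --count, and --media options shown above; bundle_command and run_bundle are hypothetical helper names, /path/to/capcat is the same placeholder as in the cron lines, and the meaning of Capcat's exit codes is not documented here.

```python
import subprocess


def bundle_command(bundle, count, media=False, executable="./capcat"):
    """Build the argument list for a `capcat bundle` invocation."""
    cmd = [executable, "bundle", bundle, "--count", str(count)]
    if media:
        cmd.append("--media")
    return cmd


def run_bundle(bundle, count, media=False, cwd="/path/to/capcat"):
    """Run the command in the Capcat directory and return its exit code."""
    result = subprocess.run(bundle_command(bundle, count, media), cwd=cwd, check=False)
    return result.returncode


if __name__ == "__main__":
    # Print the commands instead of running them, since the path is a placeholder.
    print(bundle_command("tech", 20))
    print(bundle_command("science", 30, media=True))
```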

Available Sources

Tech: Hacker News, Lobsters, InfoQ, IEEE Spectrum, Mashable

AI: Google Research, MIT News

News: BBC, The Guardian

Science: Nature, Scientific American

Sports: BBC Sport

See all: ./capcat list sources

Documentation

Full documentation is available at capcat.org.

Requirements

  • Python 3.8+
  • Internet connection
  • ~50MB disk space for application
  • Additional space for archived content

Troubleshooting

Dependency issues?

./scripts/fix_dependencies.sh --force

Module not found?

./capcat list sources  # Wrapper handles venv activation

Source failing?

  • Check test-diagnose-*.md reports
  • Most sources use RSS/APIs for reliable, ethical access
  • Run ./capcat catch and try individual sources

Contributing

Contributions welcome! Open an issue or pull request on GitHub.

License

MIT License - See LICENSE.txt

Archive with confidence. Share without limits.
