A Python package for archiving YouTube community posts with zero dependencies

Post Archiver Improved

A Python package for archiving YouTube community posts, including post metadata, comments, and images. It is built entirely on the Python standard library, so it has no external dependencies.

Post Archiver Improved is a complete rewrite of the original post-archiver project, with a modular architecture, more robust error handling, and extensive test coverage.

Key Features

  • Comprehensive Data Extraction - Complete archival of YouTube community posts with metadata preservation
  • Advanced Comment Processing - Full comment trees with reply chains and author information
  • High-Quality Image Archiving - Original resolution image downloads with metadata
  • Zero External Dependencies - Built entirely on Python standard library for maximum compatibility
  • Performance Optimized - Intelligent rate limiting and concurrent processing capabilities
  • Comprehensive Logging - Configurable logging levels with structured output and file rotation
  • Flexible Configuration - Multi-source configuration management (CLI, files, environment variables)
  • Progress Monitoring - Real-time progress tracking with detailed statistics and ETA
  • Comprehensive Reporting - Detailed summary reports with archival statistics and health metrics
  • Data Integrity - Automatic backup creation and data validation to prevent corruption
  • Robust Error Handling - Graceful failure recovery with detailed error reporting
  • Extensible Architecture - Modular design supporting custom extractors and output formats

Installation

From PyPI (Recommended)

pip install post-archiver-improved

From Source (Development)

git clone https://github.com/sadadYes/post-archiver-improved.git
cd post-archiver-improved
pip install -e .

Development Installation

git clone https://github.com/sadadYes/post-archiver-improved.git
cd post-archiver-improved
pip install -e ".[dev]"

Usage

Basic Usage

Archive all posts from a channel:

post-archiver UC5CwaMl1eIgY8h02uZw7u8A

Archive with comments:

post-archiver UC5CwaMl1eIgY8h02uZw7u8A --comments

Archive with images:

post-archiver UC5CwaMl1eIgY8h02uZw7u8A --download-images

Archive a single post by post ID:

post-archiver UgkxMVl0vgxzNvE3I52s0oKlEHO3KyfocebU --comments --download-images

Archive a single post by URL:

post-archiver "https://www.youtube.com/post/UgkxMVl0vgxzNvE3I52s0oKlEHO3KyfocebU" --comments --download-images

Advanced Usage

Full archival with all features:

post-archiver UC5CwaMl1eIgY8h02uZw7u8A \
  --comments \
  --download-images \
  --max-comments 500 \
  --max-replies 100 \
  --output ./archive \
  --verbose

Archive members-only content with cookies:

post-archiver UC5CwaMl1eIgY8h02uZw7u8A \
  --comments \
  --download-images \
  --cookies ./cookies.txt \
  --output ./archive \
  --verbose

With custom configuration:

post-archiver UC5CwaMl1eIgY8h02uZw7u8A \
  --config my_config.json \
  --log-file archive.log \
  --timeout 60 \
  --retries 5
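
For batch jobs, the commands above can also be assembled from Python. The sketch below is illustrative only: `build_argv` and `archive_many` are hypothetical helper names, not part of the package, and because the meaning of the CLI's exit codes is not documented here, the code only records them.

```python
import subprocess
from typing import Dict, Iterable, List, Optional


def build_argv(
    target: str,
    *,
    comments: bool = False,
    download_images: bool = False,
    output: Optional[str] = None,
    cookies: Optional[str] = None,
) -> List[str]:
    """Assemble a post-archiver command line using only flags documented in this README."""
    argv = ["post-archiver", target]
    if comments:
        argv.append("--comments")
    if download_images:
        argv.append("--download-images")
    if output is not None:
        argv += ["--output", output]
    if cookies is not None:
        argv += ["--cookies", cookies]
    return argv


def archive_many(targets: Iterable[str], **options) -> Dict[str, int]:
    """Run the CLI once per target and map each target to its exit code."""
    results = {}
    for target in targets:
        # check=False: a failure on one channel should not abort the batch.
        completed = subprocess.run(build_argv(target, **options), check=False)
        results[target] = completed.returncode
    return results
```

Passing the arguments as a list (rather than a shell string) avoids quoting problems with URLs and paths.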

Channel ID Formats

The tool accepts a channel in any of the following forms:

  • Channel ID: UC5CwaMl1eIgY8h02uZw7u8A
  • Handle: @username
  • Channel URL: https://youtube.com/channel/UC5CwaMl1eIgY8h02uZw7u8A
  • Custom URL: https://youtube.com/c/channelname
  • Handle URL: https://youtube.com/@username
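
The forms listed above can be told apart with a few string checks. This is a rough sketch, not the package's actual parser: `classify_channel_ref` is a hypothetical name, and the pattern for channel IDs ("UC" followed by 22 URL-safe characters) is an assumption based on the example ID.

```python
import re
from urllib.parse import urlparse

# Assumption: channel IDs are "UC" plus 22 characters from [0-9A-Za-z_-].
CHANNEL_ID_RE = re.compile(r"^UC[0-9A-Za-z_-]{22}$")


def classify_channel_ref(ref: str) -> str:
    """Return a label describing which documented form `ref` uses."""
    ref = ref.strip()
    if CHANNEL_ID_RE.match(ref):
        return "channel_id"
    if ref.startswith("@"):
        return "handle"
    parsed = urlparse(ref)
    host = (parsed.hostname or "").lower()
    if host == "youtube.com" or host.endswith(".youtube.com"):
        if parsed.path.startswith("/channel/"):
            return "channel_url"
        if parsed.path.startswith("/c/"):
            return "custom_url"
        if parsed.path.startswith("/@"):
            return "handle_url"
    return "unknown"
```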

Individual Post Formats

You can also archive individual posts by providing:

  • Post ID: UgkxMVl0vgxzNvE3I52s0oKlEHO3KyfocebU
  • Post URL: https://www.youtube.com/post/UgkxMVl0vgxzNvE3I52s0oKlEHO3KyfocebU

When archiving individual posts, the tool automatically extracts the channel information and creates an archive containing just that specific post.
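
A post reference can be reduced to a bare post ID before further processing. The sketch below is one possible approach, not the tool's own logic: `extract_post_id` is a hypothetical name, and the "Ug" prefix check for bare IDs is an assumption drawn from the example IDs in this README.

```python
from typing import Optional
from urllib.parse import urlparse


def extract_post_id(ref: str) -> Optional[str]:
    """Return the post ID from a post URL or bare ID, or None if it is neither."""
    ref = ref.strip()
    parsed = urlparse(ref)
    if parsed.scheme in ("http", "https"):
        host = (parsed.hostname or "").lower()
        parts = [part for part in parsed.path.split("/") if part]
        is_youtube = host == "youtube.com" or host.endswith(".youtube.com")
        # Expected path shape: /post/<POST_ID>
        if is_youtube and len(parts) == 2 and parts[0] == "post":
            return parts[1]
        return None
    # Assumption: bare post IDs start with "Ug", as in the examples above.
    if ref.startswith("Ug") and "/" not in ref:
        return ref
    return None
```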

Accessing Members-Only Content

To access members-only posts, you'll need to provide authentication cookies from a logged-in YouTube session:

  1. Export Cookies: Use a browser extension or tool to export cookies in Netscape format

  2. Use Cookie File: Pass the cookie file to the archiver

    post-archiver UC5CwaMl1eIgY8h02uZw7u8A --cookies ./cookies.txt
    
  3. Cookie File Format: The tool expects Netscape HTTP Cookie File format:

    # Netscape HTTP Cookie File
    .youtube.com	TRUE	/	FALSE	1735689600	SIDCC	cookie_value
    .google.com	TRUE	/	TRUE	1735689600	__Secure-1PSIDCC	secure_value
    

Security Note: Cookie files contain sensitive authentication data. Keep them secure and never share them publicly.

Important: Cookies must be from a YouTube account that has membership access to the target channel.
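
Before passing a cookie file to the archiver, it can help to check that it really is in the seven-column, tab-separated Netscape layout shown above. The following is a minimal validation sketch; `CookieRow` and `parse_netscape_cookies` are hypothetical names and do not reflect how the package reads cookies internally.

```python
from typing import List, NamedTuple


class CookieRow(NamedTuple):
    domain: str
    include_subdomains: bool
    path: str
    secure: bool
    expires: int  # Unix timestamp
    name: str
    value: str


def parse_netscape_cookies(text: str) -> List[CookieRow]:
    """Parse Netscape cookie file text into rows, raising on malformed lines."""
    rows = []
    for line in text.splitlines():
        # Blank lines and "#" lines are skipped. Caveat: some exporters mark
        # HttpOnly cookies with a "#HttpOnly_" prefix, which this sketch ignores.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 7:
            raise ValueError(
                f"expected 7 tab-separated fields, got {len(fields)}: {line!r}"
            )
        domain, subdomains, path, secure, expires, name, value = fields
        rows.append(
            CookieRow(domain, subdomains == "TRUE", path, secure == "TRUE",
                      int(expires), name, value)
        )
    return rows
```

A common cause of load failures is a file whose tabs were converted to spaces by an editor; the `ValueError` above makes that visible. The standard library's `http.cookiejar.MozillaCookieJar` can also load files in this format.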

Configuration

Command Line Options

Scraping Options

  • -n, --num-posts N - Maximum number of posts to scrape
  • -c, --comments - Extract comments for each post
  • --max-comments N - Maximum comments per post (default: 100)
  • --max-replies N - Maximum replies per comment (default: 200)
  • -i, --download-images - Download images to local directory
  • --cookies FILE - Path to Netscape format cookie file for accessing members-only posts

Output Options

  • -o, --output DIR - Output directory
  • --no-summary - Skip summary report creation
  • --compact - Save JSON without pretty printing

Network Options

  • --timeout SECONDS - Request timeout (default: 30)
  • --retries N - Maximum retry attempts (default: 3)
  • --delay SECONDS - Delay between requests (default: 1.0)

Logging Options

  • -v, --verbose - Enable verbose output (INFO level)
  • --debug - Enable debug output (DEBUG level)
  • --log-file FILE - Log to file in addition to console
  • --quiet - Suppress all output except errors
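
The logging flags above map naturally onto levels in Python's `logging` module. The sketch below shows one way such a mapping could look; it is not the package's `logging_config.py`. The function names, the logger name, the precedence between flags, and the WARNING default are all assumptions.

```python
import logging
from typing import Optional


def pick_level(*, verbose: bool = False, debug: bool = False, quiet: bool = False) -> int:
    """Translate the documented flags into a logging level."""
    # Assumed precedence: --debug wins over --quiet, which wins over --verbose.
    if debug:
        return logging.DEBUG
    if quiet:
        return logging.ERROR
    if verbose:
        return logging.INFO
    return logging.WARNING  # assumed default when no flag is given


def configure_logging(level: int, log_file: Optional[str] = None) -> logging.Logger:
    """Attach a console handler and, optionally, a file handler."""
    logger = logging.getLogger("archiver_example")  # hypothetical logger name
    logger.setLevel(level)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    handlers = [logging.StreamHandler()]
    if log_file is not None:
        handlers.append(logging.FileHandler(log_file, encoding="utf-8"))
    for handler in handlers:
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger
```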

Configuration Files

Create a configuration file for repeated use:

{
  "scraping": {
    "max_posts": 100,
    "extract_comments": true,
    "max_comments_per_post": 200,
    "max_replies_per_comment": 50,
    "download_images": true,
    "request_timeout": 30,
    "max_retries": 3,
    "retry_delay": 1.0
  },
  "output": {
    "output_dir": "./archives",
    "pretty_print": true,
    "include_metadata": true
  },
  "log_file": "./logs/archiver.log"
}

Save current settings:

post-archiver UC5CwaMl1eIgY8h02uZw7u8A --save-config my_config.json

Use saved configuration:

post-archiver UC5CwaMl1eIgY8h02uZw7u8A --config my_config.json
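
When a configuration file and command-line flags are combined, the nested sections have to be merged rather than replaced wholesale. The sketch below illustrates that idea with a recursive merge. The precedence shown (defaults, then file, then CLI) is an assumption about how the sources are layered, and `deep_merge` is a hypothetical helper, not a function exported by the package.

```python
import json
from copy import deepcopy
from typing import Any, Dict


def deep_merge(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `base` with `override` applied, merging nested dicts."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Defaults taken from the documented --timeout and --retries values.
DEFAULTS = {"scraping": {"request_timeout": 30, "max_retries": 3}}

# Stand-in for the contents of my_config.json.
file_config = json.loads('{"scraping": {"max_posts": 100, "request_timeout": 30}}')

# Stand-in for values parsed from "--timeout 60 --retries 5".
cli_overrides = {"scraping": {"request_timeout": 60, "max_retries": 5}}

effective = deep_merge(deep_merge(DEFAULTS, file_config), cli_overrides)
```

With this layering, `max_posts` survives from the file while the timeout and retry count come from the command line.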

Output Format

Archive File Structure

The tool creates a JSON file with the following structure:

{
  "channel_id": "UC5CwaMl1eIgY8h02uZw7u8A",
  "scrape_date": "2025-01-15T10:30:00",
  "scrape_timestamp": 1736937000,
  "posts_count": 25,
  "total_comments": 150,
  "total_images": 10,
  "images_downloaded": 10,
  "config_used": {...},
  "posts": [
    {
      "post_id": "UgxKp7...",
      "content": "Post content here...",
      "timestamp": "2 days ago",
      "timestamp_estimated": true,
      "likes": "42",
      "comments_count": "15",
      "members_only": false,
      "author": "Channel Name",
      "author_id": "UC5CwaMl1eIgY8h02uZw7u8A",
      "author_url": "https://youtube.com/channel/...",
      "author_thumbnail": "https://...",
      "author_is_verified": true,
      "author_is_member": false,
      "images": [
        {
          "src": "https://...",
          "local_path": "./images/post_123.jpg",
          "width": 1920,
          "height": 1080,
          "file_size": 245760
        }
      ],
      "links": [
        {
          "text": "Link text",
          "url": "https://..."
        }
      ],
      "comments": [
        {
          "id": "UgwKp7...",
          "text": "Comment text...",
          "like_count": "5",
          "timestamp": "1 day ago",
          "timestamp_estimated": true,
          "author_id": "UC...",
          "author": "Commenter Name",
          "author_thumbnail": "https://...",
          "author_is_verified": false,
          "author_is_member": true,
          "author_url": "https://...",
          "is_favorited": false,
          "is_pinned": false,
          "reply_count": "2",
          "replies": [...]
        }
      ]
    }
  ]
}
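
Because the archive is plain JSON, it can be post-processed with the standard library alone. The sketch below computes a few totals from a loaded archive. Note that counts such as `likes` are stored as strings in the structure above; treating unparseable values as zero is an assumption of this example, and `summarize_archive` is a hypothetical helper.

```python
from typing import Any, Dict


def to_int(value: Any, default: int = 0) -> int:
    """Convert a string count such as "42" to int, falling back to `default`."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return default


def summarize_archive(archive: Dict[str, Any]) -> Dict[str, int]:
    """Compute simple totals from an archive dict shaped like the example above."""
    posts = archive.get("posts", [])
    return {
        "posts": len(posts),
        "members_only_posts": sum(1 for p in posts if p.get("members_only")),
        "total_likes": sum(to_int(p.get("likes")) for p in posts),
        "archived_comments": sum(len(p.get("comments", [])) for p in posts),
        "images": sum(len(p.get("images", [])) for p in posts),
    }
```

To use it on a real file, load the archive with `json.load` and pass the resulting dict to `summarize_archive`.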

Files Created

  • posts_[CHANNEL_ID]_[TIMESTAMP].json - Main archive file
  • summary_[CHANNEL_ID]_[TIMESTAMP].txt - Summary report
  • images/ - Downloaded images (if enabled)
  • [LOG_FILE] - Log file (if specified)

Development

Project Structure

src/post_archiver_improved/
├── __init__.py              # Package initialization
├── api.py                   # YouTube API client
├── cli.py                   # Command-line interface
├── comment_processor.py     # Comment extraction logic
├── config.py                # Configuration management
├── exceptions.py            # Custom exception classes
├── extractors.py            # Data extraction utilities
├── logging_config.py        # Logging configuration
├── models.py                # Data models
├── output.py                # Output handling
├── scraper.py               # Main scraper logic
└── utils.py                 # Utility functions

Architecture Highlights

Modular Architecture

  • Separation of concerns with dedicated modules
  • Clean interfaces between components
  • Easy to extend and maintain

Robust Error Handling

  • Custom exception hierarchy for different error types
  • Graceful degradation when non-critical operations fail
  • Retry logic with exponential backoff
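
Retry with exponential backoff, as mentioned above, can be sketched in a few lines. This is a generic illustration of the technique and not the package's own implementation; `retry_with_backoff` and its parameters are hypothetical, with defaults loosely mirroring the documented `--retries 3` and `--delay 1.0`.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    *,
    retries: int = 3,
    base_delay: float = 1.0,
    factor: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    retry_on: Tuple[Type[BaseException], ...] = (OSError,),
) -> T:
    """Call `operation`, retrying up to `retries` times with growing delays."""
    attempt = 0
    while True:
        try:
            return operation()
        except retry_on:
            if attempt >= retries:
                raise  # retries exhausted: surface the last error
            # Delay grows geometrically: base_delay, base_delay * factor, ...
            sleep(base_delay * factor ** attempt)
            attempt += 1
```

The `sleep` parameter is injectable so the behavior can be tested without actually waiting.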

Comprehensive Logging

  • Configurable verbosity levels (ERROR, WARNING, INFO, DEBUG)
  • Colored console output for better readability
  • File logging with detailed tracebacks
  • Progress tracking with detailed statistics

Configuration Management

  • Multiple configuration sources (CLI args, config files, defaults)
  • Environment-specific settings support
  • Configuration validation and error reporting

Running Tests

# Install development dependencies
pip install -e ".[dev]"

# Run tests
python -m pytest tests/ -v

# Run with coverage
python -m pytest tests/ --cov=post_archiver_improved --cov-report=html

Troubleshooting

Common Issues

"No community tab found"

  • The channel might not have community posts enabled
  • Try using the channel's full URL instead of just the ID
  • Some channels restrict community tab access

"Rate limiting detected"

  • YouTube may be limiting requests
  • Increase the --delay parameter
  • Try again later

"Network timeout"

  • Check your internet connection
  • Increase the --timeout parameter
  • Increase --retries to allow more attempts (default: 3)

"Permission denied" for file operations

  • Check write permissions in the output directory
  • Make sure the output directory exists
  • Try running with appropriate permissions

Debug Mode

Enable debug mode for detailed troubleshooting:

post-archiver UC5CwaMl1eIgY8h02uZw7u8A --debug --log-file debug.log

This will provide detailed information about:

  • API requests and responses
  • Data extraction processes
  • File operations
  • Error stack traces

Contributing

Contributions are welcome! Please feel free to submit issues, feature requests, or pull requests.

Development Setup

  1. Fork the repository
  2. Clone your fork
  3. Create a virtual environment
  4. Install in development mode: pip install -e ".[dev]"
  5. Make your changes
  6. Run tests: python -m pytest
  7. Submit a pull request

Coding Standards

  • Follow PEP 8 style guidelines
  • Add type hints to all functions
  • Write comprehensive docstrings
  • Include tests for new functionality
  • Update documentation as needed

License

This project is licensed under the GNU General Public License v3.0 - see the LICENSE file for details.

Acknowledgments

This project is heavily inspired by the yt-dlp community plugin by biggestsonicfan.

Support

If you encounter any issues or have questions:

  1. Check the troubleshooting section
  2. Search existing issues
  3. Create a new issue with:
    • Your command line arguments
    • Error messages or logs
    • System information (OS, Python version)
    • Expected vs actual behavior
