Download and organize media from web archives

Project description

Media Archive Sync

Download and organize media files from web archives (Apache-style directory listings).

Features

  • 🌐 Web Archive Crawling - Crawl Apache-style directory listings
  • 📥 Parallel Downloads - Download multiple files simultaneously with resume support
  • 📁 Smart Organization - Organize by date or by custom naming conventions
  • 📝 Metadata Support - Generate NFO sidecar files
  • 🎬 Video Merging - Concatenate multipart video files
  • 💾 Resume Support - Cache progress and resume interrupted downloads
  • 🔧 Configurable - Everything is configurable, no hardcoded values
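
To make the crawling feature above more concrete, here is a minimal sketch of how links can be pulled out of an Apache-style directory listing using only the Python standard library. It is an illustration of the general technique, not this package's implementation: `ListingParser`, `parse_listing`, and the `SAMPLE` page are hypothetical names and data invented for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ListingParser(HTMLParser):
    """Collect href targets from an index page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Every entry in an Apache-style listing is an <a href="..."> link.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def parse_listing(html, base_url):
    """Split a listing page into (subdirectory URLs, file URLs)."""
    parser = ListingParser()
    parser.feed(html)
    dirs, files = [], []
    for href in parser.hrefs:
        # Skip column-sort links ("?C=N;O=D"), parent/root links,
        # fragments, and absolute links to other sites.
        if href.startswith(("?", "/", "#", "..")) or "://" in href:
            continue
        url = urljoin(base_url, href)
        # In these listings, directory entries end with a slash.
        (dirs if href.endswith("/") else files).append(url)
    return dirs, files


# Invented sample page, shaped like a typical Apache index.
SAMPLE = """
<html><body><h1>Index of /vods</h1><pre>
<a href="?C=N;O=D">Name</a>
<a href="/">Parent Directory</a>
<a href="2024-01/">2024-01/</a>
<a href="stream_part1.mp4">stream_part1.mp4</a>
<a href="stream_part2.mp4">stream_part2.mp4</a>
</pre></body></html>
"""

dirs, files = parse_listing(SAMPLE, "https://archive.example.com/vods/")
print(dirs)
print(files)
```

A real crawler would then fetch each URL in `dirs` and repeat the process recursively.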

Installation

From PyPI (recommended)

pip install media-archive-sync

From source

pip install git+https://github.com/djdembeck/media-archive-sync.git

Development install

git clone https://github.com/djdembeck/media-archive-sync.git
cd media-archive-sync
pip install -e ".[dev]"

# Enable git hooks (runs lint/tests automatically)
git config --local core.hooksPath .githooks

Or use the Makefile for one-step setup:

make dev-install  # Installs deps and enables hooks

Quick Start

from media_archive_sync import ArchiveConfig, crawl_archive, download_files

# Configure
config = ArchiveConfig(
    remote_base="https://archive.example.com/vods/",
    local_root="./downloads",
    workers=5,
)

# Crawl remote archive
media_list, dir_counts = crawl_archive(config=config)
print(f"Found {len(media_list)} files")

# Download
download_files(config=config)

CLI Usage

# Basic download
media-archive-sync --remote https://archive.example.com/vods/ --local ./downloads

# With organization by month
media-archive-sync --remote https://archive.example.com/vods/ --local ./media --organize

# Dry run (preview only)
media-archive-sync --remote https://archive.example.com/vods/ --dry-run

# Parallel downloads with 10 workers
media-archive-sync --remote https://archive.example.com/vods/ --workers 10
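
As a rough illustration of what a worker pool for parallel downloads typically looks like, the sketch below runs a fetch function over a list of URLs with a bounded number of threads. It is an assumption-laden example and not how `media-archive-sync` is implemented: `download_all` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real HTTP download so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(urls, fetch, workers=5):
    """Run fetch(url) for every URL using at most `workers` threads.

    Returns a dict mapping each URL to its result, or to the exception
    raised for that URL, so one failure does not abort the whole batch.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                results[url] = exc
    return results


def fake_fetch(url):
    """Hypothetical stand-in for a download; returns a fake byte count."""
    if url.endswith("bad.mp4"):
        raise OSError("simulated failure")
    return len(url)


urls = [
    "https://archive.example.com/vods/a.mp4",
    "https://archive.example.com/vods/b.mp4",
    "https://archive.example.com/vods/bad.mp4",
]
results = download_all(urls, fake_fetch, workers=10)
for url in urls:
    print(url, "->", results[url])
```

Threads suit this kind of workload because downloads spend most of their time waiting on the network rather than using the CPU.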

Docker Usage

# Pull image
docker pull ghcr.io/djdembeck/media-archive-sync:latest

# Run with mounted volumes
docker run --rm \
    -v /host/media:/media:rw \
    -v /host/cache:/app/.cache:rw \
    ghcr.io/djdembeck/media-archive-sync:latest \
    --remote https://archive.example.com/vods/ \
    --local /media

Configuration

See Configuration Guide for all available options.

Contributing

See CONTRIBUTING.md for guidelines.

License

MIT License - see LICENSE file.

Download files

Source Distribution

media_archive_sync-0.2.0.tar.gz (75.9 kB view details)

Uploaded Source

Built Distribution

media_archive_sync-0.2.0-py3-none-any.whl (47.9 kB view details)

Uploaded Python 3

File details

Details for the file media_archive_sync-0.2.0.tar.gz.

File metadata

  • Download URL: media_archive_sync-0.2.0.tar.gz
  • Size: 75.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for media_archive_sync-0.2.0.tar.gz

Algorithm    Hash digest
SHA256       1d68ca0cc3db74e7d16d9dfcc6afe4027acb619e5fddad72b7a095930a491db6
MD5          0958af9388d7233a6047bca5eed98bcf
BLAKE2b-256  faa400526a3f9e9a0fe381d42c68462e5030068c487470ee457b3d64c896cbae

Provenance

The following attestation bundles were made for media_archive_sync-0.2.0.tar.gz:

Publisher: publish.yml on djdembeck/media-archive-sync

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file media_archive_sync-0.2.0-py3-none-any.whl.

File hashes

Hashes for media_archive_sync-0.2.0-py3-none-any.whl

Algorithm    Hash digest
SHA256       2553067d4b45cd8fcebdd1270d41d2eaa70991f5d2c5adbaa1184845c2c42238
MD5          f418883577c8f821e9d24bd38d3cb6ce
BLAKE2b-256  c4fa716f69a7a0d50c71359e70aa7b281b71aa523508ba091d8a0cc1dc4b138f

Provenance

The following attestation bundles were made for media_archive_sync-0.2.0-py3-none-any.whl:

Publisher: publish.yml on djdembeck/media-archive-sync
