Media Archive Sync

Download and organize media files from web archives (Apache-style directory listings).

Features

  • 🌐 Web Archive Crawling - Crawl Apache-style directory listings
  • 📥 Parallel Downloads - Download multiple files simultaneously with resume support
  • 📁 Smart Organization - Organize files by date or by custom naming conventions
  • 📝 Metadata Support - Generate NFO sidecar files
  • 🎬 Video Merging - Concatenate multipart video files
  • 💾 Resume Support - Cache progress and resume interrupted downloads
  • 🔧 Configurable - Every setting is configurable; no values are hardcoded
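
To make "Apache-style directory listing" concrete, here is a minimal standard-library sketch of how such an index page can be split into subdirectories and files. It is an illustration only and not this package's crawler: ListingParser, split_listing, and the SAMPLE page are hypothetical names and data invented for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ListingParser(HTMLParser):
    """Collect href targets from an Apache-style index page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def split_listing(base_url, html):
    """Return (subdirectory URLs, file URLs) found in one listing page."""
    parser = ListingParser()
    parser.feed(html)
    dirs, files = [], []
    for href in parser.hrefs:
        # Skip column-sort links ("?C=N;O=D"), absolute or parent links, and anchors.
        if href.startswith(("?", "/", "#", "..")) or "://" in href:
            continue
        target = urljoin(base_url, href)
        # In these listings a trailing slash marks a subdirectory.
        (dirs if href.endswith("/") else files).append(target)
    return dirs, files


# Made-up page in the shape Apache's autoindex typically produces.
SAMPLE = """
<html><body><h1>Index of /vods</h1><pre>
<a href="?C=N;O=D">Name</a>
<a href="/">Parent Directory</a>
<a href="2024-01/">2024-01/</a>
<a href="intro.mp4">intro.mp4</a>
</pre></body></html>
"""

dirs, files = split_listing("https://archive.example.com/vods/", SAMPLE)
print(dirs)
print(files)
```

A crawler would repeat this for each URL in dirs until none remain.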

Installation

From PyPI (recommended)

pip install media-archive-sync

From source

pip install git+https://github.com/djdembeck/media-archive-sync.git

Development install

git clone https://github.com/djdembeck/media-archive-sync.git
cd media-archive-sync
pip install -e ".[dev]"

# Enable git hooks (runs lint/tests automatically)
git config --local core.hooksPath .githooks

Or use the Makefile for one-step setup:

make dev-install  # Installs deps and enables hooks

Quick Start

from media_archive_sync import ArchiveConfig, crawl_archive, download_files

# Configure
config = ArchiveConfig(
    remote_base="https://archive.example.com/vods/",
    local_root="./downloads",
    workers=5,
)

# Crawl remote archive
media_list, dir_counts = crawl_archive(config=config)
print(f"Found {len(media_list)} files")

# Download
download_files(config=config)
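
The feature list mentions resuming interrupted downloads, but this page does not describe how the package does it. As a general illustration, resumable HTTP downloads are commonly built on the Range request header. The sketch below assumes that approach; resume_headers, download_with_resume, and the ".part" suffix are hypothetical and not part of this package's API.

```python
import os
import tempfile
import urllib.request


def resume_headers(partial_path):
    """Return a Range header asking for the bytes not yet on disk."""
    size = os.path.getsize(partial_path) if os.path.exists(partial_path) else 0
    return {"Range": f"bytes={size}-"} if size else {}


def download_with_resume(url, dest, chunk_size=1 << 16):
    """Append the missing bytes to dest + '.part', then rename it to dest."""
    part = dest + ".part"
    request = urllib.request.Request(url, headers=resume_headers(part))
    with urllib.request.urlopen(request) as response:
        # 206 means the server honoured the range; any other success
        # status means it sent the whole file, so start over.
        mode = "ab" if response.status == 206 else "wb"
        with open(part, mode) as out:
            while chunk := response.read(chunk_size):
                out.write(chunk)
    os.replace(part, dest)


# Offline demonstration of the header logic only (no network access).
with tempfile.TemporaryDirectory() as tmp:
    part = os.path.join(tmp, "video.mp4.part")
    print(resume_headers(part))  # nothing on disk yet
    with open(part, "wb") as fh:
        fh.write(b"x" * 1024)
    print(resume_headers(part))  # 1024 bytes already on disk
```

This sketch does not handle a partial file that is already complete; a server may answer such a request with 416, which urllib raises as an HTTPError.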

CLI Usage

# Basic download
media-archive-sync --remote https://archive.example.com/vods/ --local ./downloads

# With organization by month
media-archive-sync --remote https://archive.example.com/vods/ --local ./media --organize

# Dry run (preview only)
media-archive-sync --remote https://archive.example.com/vods/ --dry-run

# Parallel downloads with 10 workers
media-archive-sync --remote https://archive.example.com/vods/ --workers 10
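
If the CLI needs to be driven from a script, for example a scheduled job, one option is to build the argument list in Python and hand it to subprocess. The sketch below uses only the flags shown in the examples above; build_command and run_sync are hypothetical helper names, and the CLI's exit codes are an assumption since they are not documented here.

```python
import shlex
import subprocess


def build_command(remote, local=None, workers=None, organize=False, dry_run=False):
    """Build an argv list using only the flags shown in the CLI examples."""
    argv = ["media-archive-sync", "--remote", remote]
    if local is not None:
        argv += ["--local", str(local)]
    if workers is not None:
        argv += ["--workers", str(workers)]
    if organize:
        argv.append("--organize")
    if dry_run:
        argv.append("--dry-run")
    return argv


def run_sync(**options):
    """Run the CLI; check=True raises CalledProcessError on a non-zero exit."""
    return subprocess.run(build_command(**options), check=True)


# Print the command instead of running it, so this works without the package installed.
preview = build_command("https://archive.example.com/vods/", dry_run=True)
print(shlex.join(preview))
```

Passing a list instead of a shell string avoids quoting problems when a path contains spaces.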

Docker Usage

# Pull image
docker pull ghcr.io/djdembeck/media-archive-sync:latest

# Run with mounted volumes
docker run --rm \
    -v /host/media:/media:rw \
    -v /host/cache:/app/.cache:rw \
    ghcr.io/djdembeck/media-archive-sync:latest \
    --remote https://archive.example.com/vods/ \
    --local /media

Configuration

See Configuration Guide for all available options.

Contributing

See CONTRIBUTING.md for guidelines.

License

MIT License. See the LICENSE file.
