Media Archive Sync

Download and organize media files from web archives (Apache-style directory listings).

Features

  • 🌐 Web Archive Crawling - Crawl Apache-style directory listings
  • 📥 Parallel Downloads - Download multiple files simultaneously with resume support
  • 📁 Smart Organization - Organize files by date or by custom naming conventions
  • 📝 Metadata Support - Generate NFO sidecar files
  • 🎬 Video Merging - Concatenate multipart video files
  • 💾 Resume Support - Cache progress and resume interrupted downloads
  • 🔧 Configurable - All settings are exposed as options, with no hardcoded values
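
To make the crawling feature concrete, here is a minimal sketch of how one Apache-style listing page can be split into files and subdirectories. It uses only the standard library and is not taken from this package; `ListingParser`, `parse_listing`, and the sample HTML are hypothetical names and data chosen for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ListingParser(HTMLParser):
    """Collect href targets from an Apache-style index page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def parse_listing(html, base_url):
    """Return (files, subdirs) as absolute URLs found in one listing page."""
    parser = ListingParser()
    parser.feed(html)
    files, subdirs = [], []
    for href in parser.hrefs:
        # Skip column-sort links ("?C=N;O=D"), parent links, and off-site links.
        if href.startswith(("?", "/", "../")) or "://" in href:
            continue
        target = urljoin(base_url, href)
        # A trailing slash marks a subdirectory in these listings.
        (subdirs if href.endswith("/") else files).append(target)
    return files, subdirs


# Hypothetical listing body, trimmed to the anchor tags that matter.
SAMPLE = """
<a href="?C=N;O=D">Name</a>
<a href="/vods/">Parent Directory</a>
<a href="2024-01/">2024-01/</a>
<a href="stream_part1.mp4">stream_part1.mp4</a>
"""
files, subdirs = parse_listing(SAMPLE, "https://archive.example.com/vods/")
```

A full crawl would repeat this for every URL in `subdirs` until no unvisited directories remain.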

Installation

From PyPI (recommended)

pip install media-archive-sync

From source

pip install git+https://github.com/djdembeck/media-archive-sync.git

Development install

git clone https://github.com/djdembeck/media-archive-sync.git
cd media-archive-sync
pip install -e ".[dev]"

# Enable git hooks (runs lint/tests automatically)
git config --local core.hooksPath .githooks

Or use the Makefile for one-step setup:

make dev-install  # Installs deps and enables hooks

Quick Start

from media_archive_sync import ArchiveConfig, crawl_archive, download_files

# Configure
config = ArchiveConfig(
    remote_base="https://archive.example.com/vods/",
    local_root="./downloads",
    workers=5,
)

# Crawl remote archive
media_list, dir_counts = crawl_archive(config=config)
print(f"Found {len(media_list)} files")

# Download
download_files(config=config)
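
The `workers` setting above controls how many downloads run at once. The sketch below shows one common way to get that behavior with a thread pool; it is an illustration under stated assumptions, not the package's implementation. `download_all` and `fake_fetch` are hypothetical, and `fake_fetch` stands in for a real HTTP request so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(urls, fetch, workers=5):
    """Run fetch(url) for every URL on a pool of worker threads.

    Returns (done, failed): done maps url -> result, failed maps url -> exception.
    """
    done, failed = {}, {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                done[url] = future.result()
            except Exception as exc:  # one failed file should not stop the rest
                failed[url] = exc
    return done, failed


def fake_fetch(url):
    """Stand-in for a real HTTP download (hypothetical)."""
    if url.endswith("broken.mp4"):
        raise OSError("simulated network error")
    return len(url)


urls = [
    "https://archive.example.com/vods/a.mp4",
    "https://archive.example.com/vods/broken.mp4",
]
done, failed = download_all(urls, fake_fetch, workers=2)
```

Collecting failures separately lets a caller retry only the files that did not complete.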

CLI Usage

# Basic download
media-archive-sync --remote https://archive.example.com/vods/ --local ./downloads

# Organize downloads by month
media-archive-sync --remote https://archive.example.com/vods/ --local ./media --organize

# Dry run (preview only)
media-archive-sync --remote https://archive.example.com/vods/ --dry-run

# Parallel downloads with 10 workers
media-archive-sync --remote https://archive.example.com/vods/ --workers 10
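
As a rough illustration of what organizing by month combined with a dry run could look like, the sketch below maps filenames to month folders and only reports the planned destinations. The `YYYY-MM-DD` filename pattern, the `unsorted` fallback folder, and the names `monthly_destination` and `plan` are all assumptions made for this example; the tool's actual layout may differ.

```python
import re
from pathlib import PurePosixPath

DATE_RE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")  # assumed filename date format


def monthly_destination(local_root, filename):
    """Return local_root/YYYY-MM/filename, or local_root/unsorted/filename."""
    match = DATE_RE.search(filename)
    folder = f"{match.group(1)}-{match.group(2)}" if match else "unsorted"
    return PurePosixPath(local_root) / folder / filename


def plan(local_root, filenames, dry_run=True):
    """List (filename, destination) pairs; a dry run only reports them."""
    moves = [(name, monthly_destination(local_root, name)) for name in filenames]
    if dry_run:
        for name, dest in moves:
            print(f"would save {name} -> {dest}")
    return moves


moves = plan("./media", ["2024-03-15_stream.mp4", "intro.mp4"])
```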

Docker Usage

# Pull image
docker pull ghcr.io/djdembeck/media-archive-sync:latest

# Run with mounted volumes
docker run --rm \
    -v /host/media:/media:rw \
    -v /host/cache:/app/.cache:rw \
    ghcr.io/djdembeck/media-archive-sync:latest \
    --remote https://archive.example.com/vods/ \
    --local /media
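
Mounting the cache directory keeps state across container runs, which matters for resuming interrupted downloads. This README does not describe how resuming works internally; one widely used approach is an HTTP `Range` request that asks only for the bytes still missing. The sketch below shows that idea with hypothetical helpers (`resume_headers`, `open_mode`) and a hypothetical `.part` file suffix.

```python
import os
import tempfile


def resume_headers(partial_path):
    """Build request headers that ask the server for the missing bytes only."""
    try:
        have = os.path.getsize(partial_path)
    except OSError:
        have = 0  # nothing downloaded yet
    return {"Range": f"bytes={have}-"} if have else {}


def open_mode(status_code):
    """Append on 206 Partial Content; otherwise start the file over."""
    return "ab" if status_code == 206 else "wb"


with tempfile.TemporaryDirectory() as tmp:
    part = os.path.join(tmp, "video.mp4.part")
    fresh = resume_headers(part)      # no partial file yet
    with open(part, "wb") as fh:
        fh.write(b"x" * 1024)         # simulate an interrupted download
    resumed = resume_headers(part)    # now asks for byte 1024 onward
```

Checking the status code matters because a server that ignores `Range` replies with 200 and the full body, in which case appending would corrupt the file.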

Configuration

See the Configuration Guide for all available options.

Contributing

See CONTRIBUTING.md for guidelines.

License

MIT License. See the LICENSE file for details.
