Media Archive Sync
Download and organize media files from web archives (Apache-style directory listings).
Features
- 🌐 Web Archive Crawling - Crawl Apache-style directory listings
- 📥 Parallel Downloads - Download multiple files simultaneously with resume support
- 📁 Smart Organization - Organize by date or with custom naming conventions
- 📝 Metadata Support - Generate NFO sidecar files
- 🎬 Video Merging - Concatenate multipart video files
- 💾 Resume Support - Cache progress and resume interrupted downloads
- 🔧 Configurable - Everything is configurable, with no hardcoded values
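As a rough illustration of what crawling an Apache-style directory listing involves, the sketch below separates subdirectory links from file links on a single listing page using only the standard library. It is not the package's own crawler: `ListingParser`, `parse_listing`, and the sample HTML are hypothetical names and data made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ListingParser(HTMLParser):
    """Collect the href of every anchor on a listing page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def parse_listing(html, base_url):
    """Split one listing page into (subdirectory URLs, file URLs)."""
    parser = ListingParser()
    parser.feed(html)
    dirs, files = [], []
    for href in parser.hrefs:
        # Skip column-sort links ("?C=N;O=D"), parent/absolute links, and fragments.
        if href.startswith(("?", "/", "#", "../")) or "://" in href:
            continue
        target = urljoin(base_url, href)
        # Autoindex pages mark directories with a trailing slash.
        (dirs if href.endswith("/") else files).append(target)
    return dirs, files


# Made-up sample page in the shape of an Apache autoindex listing.
sample = """
<html><body><h1>Index of /vods</h1><pre>
<a href="?C=N;O=D">Name</a>
<a href="/">Parent Directory</a>
<a href="2024-01/">2024-01/</a>
<a href="stream_part1.mp4">stream_part1.mp4</a>
</pre></body></html>
"""
dirs, files = parse_listing(sample, "https://archive.example.com/vods/")
print(dirs)
print(files)
```

A full crawl would fetch each URL in `dirs` and repeat the same step until no subdirectories remain.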
Installation
From PyPI (recommended)
pip install media-archive-sync
From source
pip install git+https://github.com/djdembeck/media-archive-sync.git
Development install
git clone https://github.com/djdembeck/media-archive-sync.git
cd media-archive-sync
pip install -e ".[dev]"
# Enable git hooks (runs lint/tests automatically)
git config --local core.hooksPath .githooks
Or use the Makefile for one-step setup:
make dev-install # Installs deps and enables hooks
Quick Start
from media_archive_sync import ArchiveConfig, crawl_archive, download_files
# Configure
config = ArchiveConfig(
    remote_base="https://archive.example.com/vods/",
    local_root="./downloads",
    workers=5,
)
# Crawl remote archive
media_list, dir_counts = crawl_archive(config=config)
print(f"Found {len(media_list)} files")
# Download
download_files(config=config)
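This README does not describe how interrupted downloads are resumed. One common approach is an HTTP `Range` request that asks only for the bytes missing from a partial file. The sketch below assumes that approach; `build_resume_request` and the `.part` file name are hypothetical and not part of the package's API.

```python
import os
import tempfile
import urllib.request


def build_resume_request(url, partial_path):
    """Return (request, offset), asking only for bytes not yet on disk."""
    offset = os.path.getsize(partial_path) if os.path.exists(partial_path) else 0
    request = urllib.request.Request(url)
    if offset > 0:
        # Open-ended range: everything from `offset` to the end of the file.
        request.add_header("Range", f"bytes={offset}-")
    return request, offset


# Simulate a download that stopped after 1024 bytes (no network access needed).
with tempfile.TemporaryDirectory() as tmp:
    partial = os.path.join(tmp, "video.mp4.part")
    with open(partial, "wb") as handle:
        handle.write(b"\0" * 1024)
    request, offset = build_resume_request(
        "https://archive.example.com/vods/video.mp4", partial
    )
    print(offset, request.get_header("Range"))
```

When sending such a request, the caller should append to the partial file only if the server answers `206 Partial Content`. A plain `200 OK` means the server ignored the range, and the download has to start over.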
CLI Usage
# Basic download
media-archive-sync --remote https://archive.example.com/vods/ --local ./downloads
# With organization by month
media-archive-sync --remote https://archive.example.com/vods/ --local ./media --organize
# Dry run (preview only)
media-archive-sync --remote https://archive.example.com/vods/ --dry-run
# Parallel downloads with 10 workers
media-archive-sync --remote https://archive.example.com/vods/ --workers 10
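To make "organization by month" concrete, here is a minimal sketch of one way to derive a monthly folder from a date embedded in a file name. The `YYYY-MM-DD` pattern, the `undated` fallback folder, and the `monthly_target` helper are assumptions made for illustration. The layout that `--organize` actually produces may differ.

```python
import re
from pathlib import PurePosixPath

# Assumed naming scheme: the file name contains a YYYY-MM-DD date somewhere.
DATE_RE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")


def monthly_target(local_root, filename):
    """Return the destination path under a YYYY-MM folder (hypothetical layout)."""
    match = DATE_RE.search(filename)
    folder = f"{match.group(1)}-{match.group(2)}" if match else "undated"
    return PurePosixPath(local_root) / folder / filename


print(monthly_target("media", "2024-03-15_stream.mp4"))
print(monthly_target("media", "intro.mp4"))
```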
Docker Usage
# Pull image
docker pull ghcr.io/djdembeck/media-archive-sync:latest
# Run with mounted volumes
docker run --rm \
  -v /host/media:/media:rw \
  -v /host/cache:/app/.cache:rw \
  ghcr.io/djdembeck/media-archive-sync:latest \
  --remote https://archive.example.com/vods/ \
  --local /media
Configuration
See Configuration Guide for all available options.
Contributing
See CONTRIBUTING.md for guidelines.
License
MIT License - see LICENSE file.
File details
Details for the file media_archive_sync-0.1.0.tar.gz.
File metadata
- Download URL: media_archive_sync-0.1.0.tar.gz
- Upload date:
- Size: 53.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 15d223f622e8a71c02772bdcc76a5470f1c885788f9dc743172e0229faf882eb |
| MD5 | 2cbfd671d571045ab70fa94355a4e95d |
| BLAKE2b-256 | c71a225f3ce43efd3a0259a4f411313b75ded1a301748c7dcde7206b9bb508f0 |
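A downloaded archive can be checked against the SHA256 digest listed above for media_archive_sync-0.1.0.tar.gz. The sketch below uses the standard `hashlib` module; `sha256_of` and `verify_file` are hypothetical helper names, and the file path passed to `verify_file` is whatever location the archive was saved to.

```python
import hashlib
import io

# SHA256 digest published above for media_archive_sync-0.1.0.tar.gz.
EXPECTED_SDIST_SHA256 = (
    "15d223f622e8a71c02772bdcc76a5470f1c885788f9dc743172e0229faf882eb"
)


def sha256_of(stream, chunk_size=1 << 20):
    """Hash a binary stream in chunks so large files are never fully in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def verify_file(path, expected=EXPECTED_SDIST_SHA256):
    """Return True if the file at `path` matches the expected hex digest."""
    with open(path, "rb") as handle:
        return sha256_of(handle) == expected.lower()


# Self-check on an in-memory stream; a real check would call verify_file(path).
print(sha256_of(io.BytesIO(b"abc")))
```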
Provenance
The following attestation bundles were made for media_archive_sync-0.1.0.tar.gz:
Publisher: publish.yml on djdembeck/media-archive-sync
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: media_archive_sync-0.1.0.tar.gz
  - Subject digest: 15d223f622e8a71c02772bdcc76a5470f1c885788f9dc743172e0229faf882eb
- Sigstore transparency entry: 1087752713
- Sigstore integration time:
- Permalink: djdembeck/media-archive-sync@5bbcfa630769e5b328ce4044929c54deaf516a7f
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/djdembeck
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@5bbcfa630769e5b328ce4044929c54deaf516a7f
- Trigger Event: push
File details
Details for the file media_archive_sync-0.1.0-py3-none-any.whl.
File metadata
- Download URL: media_archive_sync-0.1.0-py3-none-any.whl
- Upload date:
- Size: 36.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4ec12ff40551d5d9e348e536a37f3d1a6e36f3908a524f1ab12f35218fcab325 |
| MD5 | 729695d3d06cc9e111259e0542253976 |
| BLAKE2b-256 | 053c0e2d07e0f565d34d23e46a845e1f0846a5a3b929fe8bff59217cf3dcee49 |
Provenance
The following attestation bundles were made for media_archive_sync-0.1.0-py3-none-any.whl:
Publisher: publish.yml on djdembeck/media-archive-sync
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: media_archive_sync-0.1.0-py3-none-any.whl
  - Subject digest: 4ec12ff40551d5d9e348e536a37f3d1a6e36f3908a524f1ab12f35218fcab325
- Sigstore transparency entry: 1087752796
- Sigstore integration time:
- Permalink: djdembeck/media-archive-sync@5bbcfa630769e5b328ce4044929c54deaf516a7f
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/djdembeck
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@5bbcfa630769e5b328ce4044929c54deaf516a7f
- Trigger Event: push