Bulkget

Bulkget is a Python-based command-line tool for downloading a large number of files from a list of URLs. It supports two download managers: aria2c, a robust and feature-rich option for high-performance downloads, and a simple built-in urllib manager for environments where aria2c is not available.

Features

  • Bulk Downloading: Download a large number of files from a list of URLs specified in a JSON file.
  • Choice of Download Manager:
    • aria2c: For fast and reliable downloads, with support for features like parallel downloads and automatic retries. Requires aria2c to be installed on your system.
    • urllib: A lightweight, dependency-free downloader for basic needs. It writes each download to a temporary *.tmp file and renames it on completion, so an interrupted transfer never leaves a partial file under the final name.
  • Parallel Downloads: Download multiple files concurrently to maximize bandwidth usage (configurable with -n or --n-workers).
  • Checksum Verification: Ensure file integrity by verifying checksums after download using the --checksum flag.
  • Dry Run Mode: Simulate the download process without actually downloading any files using the --dry-run flag.
  • Customizable File Paths: Use a Python script with a filepath_hook function to define custom output paths for downloaded files via the --filepath-hook argument.
  • Overwrite Control: Choose whether to overwrite files that already exist in the destination with the --overwrite flag.
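
The temporary-file behaviour described for the urllib manager can be illustrated with a short standalone sketch. This is not Bulkget's actual implementation: the function name download_atomically and its overwrite parameter are hypothetical, chosen only to show the "write to *.tmp, then rename" pattern together with a simple overwrite check.

```python
import os
import shutil
import urllib.request
from pathlib import Path


def download_atomically(url: str, destination: Path, overwrite: bool = False) -> Path:
    """Fetch url into destination via a temporary *.tmp sibling file.

    Hypothetical helper for illustration; not part of Bulkget's API.
    """
    destination = Path(destination)
    if destination.exists() and not overwrite:
        # Leave the existing file untouched unless overwriting was requested.
        return destination

    destination.parent.mkdir(parents=True, exist_ok=True)
    tmp_path = destination.with_name(destination.name + ".tmp")
    try:
        with urllib.request.urlopen(url) as response, open(tmp_path, "wb") as out:
            shutil.copyfileobj(response, out)
        # The final name only appears once the whole body has been written.
        os.replace(tmp_path, destination)
    finally:
        # After a failed transfer, remove the leftover partial file.
        if tmp_path.exists():
            tmp_path.unlink()
    return destination
```

Because the rename happens only after the transfer finishes, anything found under the final name is a complete download.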

Installation

  1. Install aria2c (optional, only needed for the aria2c manager).

    On Debian/Ubuntu:

    sudo apt-get install aria2

    On macOS:

    brew install aria2

  2. Install Bulkget. Clone the repository and install the package with Poetry:

    git clone https://github.com/ayghri/bulkget.git
    cd bulkget
    poetry install

Usage

The primary entry point for the tool is the bulkget command-line interface.

Command-Line Interface

bulkget [OPTIONS] list

Arguments:

  • list: Path to the JSON file containing the list of files to download.

Options:

  • --path TEXT: Target directory to download files to. Defaults to the current directory.
  • --manager [aria2c|urllib]: The download manager to use. Defaults to aria2c.
  • -n, --n-workers INTEGER: Number of parallel download workers. Defaults to 4.
  • --overwrite: Overwrite existing files.
  • --checksum: Verify file checksums after download.
  • --dry-run: Simulate the download without actual file transfers.
  • --port INTEGER: Port for the aria2c RPC server. Defaults to 6800.
  • --filepath-hook TEXT: Path to a Python file with a 'filepath_hook' function to customize output file paths.
  • --help: Show the help message and exit.

JSON File Format

The list file should be a JSON object with a files key holding a list of file information objects. The example below also carries a top-level properties object, shown empty here.

{
  "properties": {},
  "files": [
    {
      "name": "file1.txt",
      "url": "http://example.com/file1.txt",
      "checksum": "f2ca1bb6c7e907d06dafe4687e579fce76b37e4e93b7605022da52e6ccc26fd2",
      "checksum_type": "sha256"
    },
    {
      "name": "file2.zip",
      "url": "http://example.com/file2.zip",
      "size": 1024
    }
  ]
}

  • name: The name of the file.
  • url: The URL to download the file from.
  • checksum (optional): The checksum hash of the file.
  • checksum_type (optional): The checksum algorithm (e.g., 'md5', 'sha256').
  • size (optional): The size of the file in bytes.
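
When the list of URLs comes from another program, it may be easier to generate the list file than to write it by hand. The sketch below produces the layout documented above using only the standard library. The helper names make_entry and write_list_file are hypothetical and are not part of Bulkget.

```python
import json
from pathlib import Path


def make_entry(name, url, checksum=None, checksum_type=None, size=None):
    """Build one file information object; optional keys are omitted when unset."""
    entry = {"name": name, "url": url}
    if checksum is not None:
        entry["checksum"] = checksum
    if checksum_type is not None:
        entry["checksum_type"] = checksum_type
    if size is not None:
        entry["size"] = size
    return entry


def write_list_file(path, entries, properties=None):
    """Write a list file with top-level "properties" and "files" keys."""
    document = {"properties": properties or {}, "files": list(entries)}
    Path(path).write_text(json.dumps(document, indent=2), encoding="utf-8")
    return document


if __name__ == "__main__":
    write_list_file(
        "dataset.json",
        [
            make_entry("file1.txt", "http://example.com/file1.txt"),
            make_entry("file2.zip", "http://example.com/file2.zip", size=1024),
        ],
    )
```

The resulting dataset.json can then be passed as the list argument.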

Customizing File Paths

You can customize the output directory and filename for each downloaded file by providing a Python script that defines a filepath_hook function. This function receives a UrlInfo object and should return the desired path for the file, relative to the target directory (--path).

Use the --filepath-hook argument to specify your script.

Example hook file (my_hooks.py):

from pathlib import Path
from bulkget.utils import UrlInfo

def filepath_hook(file_info: UrlInfo) -> Path:
    # Example: save files into subdirectories based on the first letter of the filename
    first_letter = file_info.name[0].lower()
    return Path(first_letter) / file_info.name

Usage:

bulkget --filepath-hook my_hooks.py data/dataset.json

This will save files into subdirectories like a/, b/, etc., inside the target path.
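
As a second illustration, a hook could group files by extension instead of by first letter. This sketch relies only on the name attribute used in the example above. It leaves out the UrlInfo import and type annotation so that it can be run and tested without Bulkget installed, and the directory name no_extension is an arbitrary choice.

```python
from pathlib import Path


def filepath_hook(file_info) -> Path:
    # Group files by lowercased extension, e.g. "zip/file2.zip".
    name = file_info.name
    suffix = Path(name).suffix.lower().lstrip(".")
    # Files without an extension go into a catch-all directory.
    return Path(suffix or "no_extension") / name
```

With the JSON example shown earlier, file1.txt would map to txt/file1.txt and file2.zip to zip/file2.zip, both relative to the target path.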

Examples

Basic Download

To download the files specified in dataset.json to the downloads directory:

bulkget --path downloads data/dataset.json

Using the urllib Manager

To use the urllib manager with 8 parallel workers:

bulkget --manager urllib -n 8 data/dataset.json
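
The idea behind -n can be pictured as a fixed-size worker pool that pulls jobs from the list until it is empty. The sketch below shows that pattern with the standard library's thread pool. Whether Bulkget uses threads internally is an assumption, and run_in_parallel is a hypothetical name used only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_in_parallel(jobs, worker, n_workers=4):
    """Apply worker to every job with n_workers threads.

    Returns (results, errors), two dicts keyed by job, so that one failed
    download does not abort the remaining ones. Jobs must be hashable.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = {pool.submit(worker, job): job for job in jobs}
        for future in as_completed(futures):
            job = futures[future]
            try:
                results[job] = future.result()
            except Exception as exc:  # record the failure and keep going
                errors[job] = exc
    return results, errors
```

Here worker would be whatever function downloads a single URL, and n_workers plays the role of the -n option.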

Dry Run

To see which files would be downloaded without actually downloading them, including their source URLs and target paths:

bulkget --dry-run data/dataset.json

Verify Checksums

To verify file integrity after download:

bulkget --checksum data/dataset.json
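
To check a single file by hand against the checksum and checksum_type fields of the list file, something like the following would work. It is a minimal sketch built on hashlib, not Bulkget's own verification code. The name verify_checksum is hypothetical, and it assumes the checksum is stored as a hexadecimal digest, as in the JSON example.

```python
import hashlib


def verify_checksum(path, expected, checksum_type="sha256", chunk_size=1 << 20):
    """Return True if the file's hex digest matches expected."""
    digest = hashlib.new(checksum_type)  # e.g. "md5" or "sha256"
    with open(path, "rb") as handle:
        # Read in chunks so large files are never loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected.strip().lower()
```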

Use Case: Downloading CESM2 Data

For a detailed guide on how to use bulkget to download data from the CESM2 Large Ensemble Project, please see the CESM2 Download Guide.
