A tool to rescue (download) data from CKAN portals.

CKAN Rescue

A Python CLI tool to rescue (download) data from CKAN portals that implement the ckanext-datajson extension. This tool downloads all datasets and their distributions from a CKAN portal's data.json endpoint, organizing them in a structured directory format.

Description

CKAN Rescue lets you bulk download datasets from CKAN data portals by fetching their data.json file and downloading all associated data files. The tool creates an organized directory structure based on the portal's homepage and dataset identifiers, making it easy to archive or back up entire data portals.
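To make the idea of "fetching data.json and downloading all associated files" concrete, here is a minimal sketch of walking a parsed catalog. It assumes the DCAT-US layout that data.json files generally follow (a top-level `dataset` list whose entries carry `identifier` and a `distribution` list with `downloadURL`). The function name `iter_distributions` and the sample catalog are hypothetical, and this is not necessarily how CKAN Rescue does it internally.

```python
import json


def iter_distributions(catalog):
    """Yield (dataset_identifier, download_url) pairs from a parsed data.json catalog."""
    for dataset in catalog.get("dataset", []):
        identifier = dataset.get("identifier", "")
        for distribution in dataset.get("distribution", []):
            url = distribution.get("downloadURL")
            if url:  # skip distributions that only expose an accessURL
                yield identifier, url


# Hypothetical, trimmed-down catalog used only for illustration.
sample_catalog = json.loads("""
{
  "dataset": [
    {
      "identifier": "population-data-2023",
      "distribution": [
        {"downloadURL": "https://data.example.gov/files/population.csv"},
        {"accessURL": "https://data.example.gov/api/population"}
      ]
    }
  ]
}
""")

pairs = list(iter_distributions(sample_catalog))
print(pairs)
```

In this sketch, a distribution without a `downloadURL` is skipped because it has no direct file to fetch. Whether the real tool treats such entries the same way is an assumption.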

Key features:

  • Parallel downloads with configurable thread count
  • Organized directory structure by portal and dataset
  • Comprehensive logging of successful and failed downloads
  • Preserves original filenames when available
  • Handles large data portals efficiently
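The parallel download and success/failure tracking described above can be sketched with the standard library's thread pool. This is an illustrative pattern only: `download_all` and `fake_fetch` are hypothetical names, and the actual implementation in CKAN Rescue may differ. The fetch function is injected so the sketch runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(urls, fetch, threads=5):
    """Call fetch(url) for every URL on a thread pool; return (succeeded, failed)."""
    succeeded, failed = [], []
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                future.result()  # re-raises any exception from fetch
                succeeded.append(url)
            except Exception as exc:
                failed.append((url, str(exc)))
    return succeeded, failed


def fake_fetch(url):
    """Stand-in for a real download; fails for one URL to show error capture."""
    if url.endswith("missing.csv"):
        raise IOError("404 Not Found")
    return url


urls = [
    "https://data.example.gov/files/population.csv",
    "https://data.example.gov/files/budget_2023.xlsx",
    "https://data.example.gov/files/missing.csv",
]
succeeded, failed = download_all(urls, fake_fetch, threads=3)
print(sorted(succeeded))
print(failed)
```

In a real run, `fetch` could be a function that streams the URL to disk. Collecting failures instead of aborting lets one broken link be reported without stopping the rest of the archive.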

Installation from PyPI

Install the latest version using pip:

pip install ckan-rescue

Or install using uv:

uv add ckan-rescue

How to Use

Basic Usage

ckan-dcat-download <data.json_url>

Advanced Usage

# Specify output directory
ckan-dcat-download https://example.com/data.json -o /path/to/output

# Use more threads for faster downloads
ckan-dcat-download https://example.com/data.json -t 10

# Combine options
ckan-dcat-download https://example.com/data.json -o downloads -t 8

Command Line Options

  • url (required): URL of the data.json file from the CKAN portal
  • -o, --output: Output directory (default: output)
  • -t, --threads: Number of threads for parallel downloads (default: 5)
  • -v, --version: Show version information
  • -h, --help: Show help message
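For readers who want to see how these options fit together, the following is a minimal `argparse` sketch that mirrors the documented flags and defaults. It is an assumption about the interface shape, not the tool's actual source; `build_parser` is a hypothetical name and the version string is a placeholder.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="ckan-dcat-download",
        description="Download all datasets listed in a CKAN portal's data.json.",
    )
    parser.add_argument("url", help="URL of the data.json file from the CKAN portal")
    parser.add_argument("-o", "--output", default="output", help="Output directory")
    parser.add_argument(
        "-t", "--threads", type=int, default=5,
        help="Number of threads for parallel downloads",
    )
    # Placeholder version string; the real value comes from the package metadata.
    parser.add_argument("-v", "--version", action="version", version="%(prog)s (placeholder)")
    return parser


# -h/--help is added automatically by argparse.
args = build_parser().parse_args(["https://example.com/data.json", "-t", "8"])
print(args.url, args.output, args.threads)
```

Here `-o` falls back to `output` and `-t` is parsed as an integer, matching the defaults listed above.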

Examples

Download from a government data portal:

ckan-dcat-download https://data.gov/data.json

Download to a specific directory with 10 parallel threads:

ckan-dcat-download https://opendata.city.gov/data.json -o city_data -t 10

Output Structure

The tool creates the following directory structure:

output/
└── <portal_homepage>/
    ├── data.json                    # Original data.json file
    ├── logs.txt                     # Download logs
    └── data/
        └── <dataset_id>/
            └── <distribution_id>/
                └── <filename>       # Downloaded data file

Example Output Structure

output/
└── data.example.gov/
    ├── data.json
    ├── logs.txt
    └── data/
        ├── population-data-2023/
        │   ├── csv-distribution/
        │   │   └── population.csv
        │   └── json-distribution/
        │       └── population.json
        └── budget-dataset/
            └── excel-distribution/
                └── budget_2023.xlsx
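The layout above can be expressed as a small path-building helper. This sketch assumes the portal directory is named after the host part of the portal homepage and that the filename is taken from the last segment of the download URL. Both are guesses based on the example tree, and `target_path` and `filename_from_url` are hypothetical names.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def filename_from_url(url, fallback="download"):
    """Use the last path segment of the URL as the filename, if there is one."""
    return PurePosixPath(urlparse(url).path).name or fallback


def target_path(output_dir, portal_homepage, dataset_id, distribution_id, filename):
    """Build <output>/<portal>/data/<dataset_id>/<distribution_id>/<filename>."""
    # Fall back to the raw value when the homepage has no scheme (no netloc).
    portal = urlparse(portal_homepage).netloc or portal_homepage
    return PurePosixPath(output_dir) / portal / "data" / dataset_id / distribution_id / filename


url = "https://data.example.gov/files/population.csv"
path = target_path(
    "output",
    "https://data.example.gov/",
    "population-data-2023",
    "csv-distribution",
    filename_from_url(url),
)
print(path)
```

`PurePosixPath` is used only to keep the printed result identical across platforms. Code that actually writes files would normally use `pathlib.Path`.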

Log Files

  • Success: logs.txt will contain "All downloads completed successfully."
  • Failures: logs.txt will list all failed downloads with error details
  • Console Output: Real-time progress and status updates
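As a rough illustration of the logging behaviour described above, the sketch below writes either the success message or one line per failed download. The success message is taken from this README. The `FAILED <url>: <error>` line format and the name `write_log` are assumptions, since the exact failure format is not documented here.

```python
import tempfile
from pathlib import Path


def write_log(log_path, failures):
    """Write the success message, or one line per (url, error) failure."""
    if failures:
        # Assumed line format; the real tool's format may differ.
        lines = [f"FAILED {url}: {error}" for url, error in failures]
    else:
        lines = ["All downloads completed successfully."]
    Path(log_path).write_text("\n".join(lines) + "\n", encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    log_file = Path(tmp) / "logs.txt"

    write_log(log_file, [])
    success_text = log_file.read_text(encoding="utf-8")

    write_log(log_file, [("https://data.example.gov/files/missing.csv", "404 Not Found")])
    failure_text = log_file.read_text(encoding="utf-8")

print(success_text, failure_text, sep="")
```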

How to Develop

This project uses uv for dependency management and development.

Prerequisites

Install uv if you haven't already:

curl -LsSf https://astral.sh/uv/install.sh | sh

Development Setup

  1. Clone the repository:
git clone https://github.com/pdelboca/ckan-rescue.git
cd ckan-rescue
  2. Create and activate a virtual environment:
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  3. Install the project in development mode:
uv pip install -e .

Local Testing

Test your changes locally:

# Install in development mode
uv pip install -e .

# Test the CLI
ckan-dcat-download --help

How to Publish to PyPI

This project uses uv for building and publishing to PyPI.

Prerequisites

  1. Ensure you have PyPI credentials configured:
# Set the environment variable (replace <your-pypi-token> with your PyPI API token)
export UV_PUBLISH_TOKEN=<your-pypi-token>

# or pass it to the command directly
uv publish --token <your-pypi-token>

Publishing Steps

  1. Update the project version:
uv version --bump [patch|minor|major]
  2. Build the package:
uv build
  3. Publish to PyPI:
# Publish to PyPI
uv publish

# Or publish to TestPyPI first (recommended)
uv publish --publish-url https://test.pypi.org/legacy/

Issues

If you encounter any problems or have feature requests, please file an issue on the project's GitHub Issues page.
