Pooch: A friend to fetch your data files

Documentation (latest) | Documentation (main branch) | Contributing | Contact

Part of the Fatiando a Terra project

About

Just want to download a file without messing with requests and urllib? Trying to add sample datasets to your Python package? Pooch is here to help!

Pooch is a Python library that can manage data by downloading files from a server (only when needed) and storing them locally in a data cache (a folder on your computer).

  • Pure Python and minimal dependencies.
  • Download files over HTTP, FTP, and from data repositories like Zenodo and figshare.
  • Built-in post-processors to unzip/decompress the data after download.
  • Designed to be extended: create custom downloaders and post-processors.
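As a rough illustration of the extension point mentioned in the last bullet, a post-processor can be written as a plain callable that takes the downloaded file name, the action Pooch took ("download", "update", or "fetch"), and the Pooch instance, and returns the path that should be handed back to the caller. The sketch below is a hypothetical `decompress_xz` helper written for this page, not part of Pooch; the built-in processors already cover this case.

```python
import lzma
import shutil
from pathlib import Path


def decompress_xz(fname, action, pooch):
    """Hypothetical post-processor: decompress an .xz file next to the original.

    fname  -- full path of the file in the local cache
    action -- "download", "update", or "fetch" (already cached)
    pooch  -- the Pooch instance calling us (unused here)
    """
    source = Path(fname)
    target = source.with_name(source.name + ".decomp")
    # Redo the work only when the file changed or the output is missing.
    if action in ("download", "update") or not target.exists():
        with lzma.open(source, "rb") as fin, open(target, "wb") as fout:
            shutil.copyfileobj(fin, fout)
    return str(target)
```

A callable like this would be passed as the `processor` argument of `pooch.retrieve` or `Pooch.fetch`. For everyday use, the built-in `pooch.Decompress`, `pooch.Unzip`, and `pooch.Untar` processors are likely the better choice.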

Are you a scientist or researcher? Pooch can help you too!

  • Host your data on a repository and download using the DOI.
  • Automatically download data using code instead of telling colleagues to do it themselves.
  • Make sure everyone running the code has the same version of the data files.
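For DOI-based downloads, the URL given to Pooch takes the form `doi:{DOI}/{file name}`. The sketch below only illustrates how such a string breaks down into its two parts. `split_doi_url` is a hypothetical helper written for this page, and Pooch's own parsing may differ in the details.

```python
def split_doi_url(url):
    """Split 'doi:{DOI}/{file name}' into (doi, file_name).

    Hypothetical helper for illustration; not part of Pooch.
    """
    prefix = "doi:"
    if not url.startswith(prefix):
        raise ValueError(f"Not a DOI URL: {url!r}")
    # Assumption: the file name is everything after the last slash.
    doi, sep, file_name = url[len(prefix):].rpartition("/")
    if not sep or not doi or not file_name:
        raise ValueError(f"Expected 'doi:<DOI>/<file name>', got {url!r}")
    return doi, file_name


doi, file_name = split_doi_url(
    "doi:10.5281/zenodo.5882430/southern-africa-gravity.csv.xz"
)
print(doi)        # 10.5281/zenodo.5882430
print(file_name)  # southern-africa-gravity.csv.xz
```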

Projects using Pooch

SciPy, scikit-image, xarray, Ensaio, GemPy, MetPy, napari, Satpy, yt, PyVista, icepack, histolab, seaborn-image, Open AR-Sandbox, climlab, mne-python, GemGIS, SHTOOLS, MOABB, GeoViews, ScopeSim, Brainrender, pyxem, cellfinder, PVGeo, geosnap, BioCypher, cf-xarray, Scirpy, rembg, DASCore, scikit-mobility, Py-ART, HyperSpy, RosettaSciIO, eXSpy

If you're using Pooch, send us a pull request adding your project to the list.

Example

For a scientist downloading a data file for analysis:

import pooch
import pandas as pd

# Download a file and save it locally, returning the path to it.
# Running this again will not cause a download. Pooch will check the hash
# (checksum) of the downloaded file against the given value to make sure
# it's the right file (not corrupted or outdated).
fname_bathymetry = pooch.retrieve(
    url="https://github.com/fatiando-data/caribbean-bathymetry/releases/download/v1/caribbean-bathymetry.csv.xz",
    known_hash="md5:a7332aa6e69c77d49d7fb54b764caa82",
)

# Pooch can also download based on a DOI from certain providers.
fname_gravity = pooch.retrieve(
    url="doi:10.5281/zenodo.5882430/southern-africa-gravity.csv.xz",
    known_hash="md5:1dee324a14e647855366d6eb01a1ef35",
)

# Load the data with Pandas
data_bathymetry = pd.read_csv(fname_bathymetry)
data_gravity = pd.read_csv(fname_gravity)
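The `known_hash` values above are written as `algorithm:hexdigest`; a bare hex digest is treated as SHA256. The following standard-library sketch shows roughly what such a check amounts to. The function names are made up for this page, and the code is not Pooch's implementation.

```python
import hashlib


def file_digest(path, alg="sha256", chunk_size=65536):
    """Hash a file in chunks so large files never need to fit in memory."""
    hasher = hashlib.new(alg)
    with open(path, "rb") as fin:
        for chunk in iter(lambda: fin.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def matches_known_hash(path, known_hash):
    """Compare a file against 'alg:hexdigest' or a bare SHA256 hex digest."""
    alg, sep, digest = known_hash.partition(":")
    if not sep:
        # No "alg:" prefix, so assume SHA256.
        alg, digest = "sha256", known_hash
    return file_digest(path, alg) == digest.lower()
```

To get the value to paste into `known_hash` for a new file, Pooch provides `pooch.file_hash(fname, alg="sha256")`, so a helper like the one above is rarely needed in practice.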

For package developers including sample data in their projects:

"""
Module mypackage/datasets.py
"""
import pkg_resources
import pandas
import pooch

# Get the version string from your project. You have one of these, right?
from . import version

# Create a new friend to manage your sample data storage
GOODBOY = pooch.create(
    # Folder where the data will be stored. For a sensible default, use the
    # default cache folder for your OS.
    path=pooch.os_cache("mypackage"),
    # Base URL of the remote data store. Will call .format on this string
    # to insert the version (see below).
    base_url="https://github.com/myproject/mypackage/raw/{version}/data/",
    # Pooches are versioned so that you can use multiple versions of a
    # package simultaneously. Use a PEP 440 compliant version number. The
    # version will be appended to the path.
    version=version,
    # If a version has a "+XX.XXXXX" suffix, we'll assume that this is a dev
    # version and replace the version with this string.
    version_dev="main",
    # An environment variable that overrides the path.
    env="MYPACKAGE_DATA_DIR",
    # The cache file registry. A dictionary with all files managed by this
    # pooch. Keys are the file names (relative to *base_url*) and values
    # are their respective SHA256 hashes. Files will be downloaded
    # automatically when needed (see fetch_gravity_data). The hash below is
    # a placeholder, not a real digest.
    registry={"gravity-data.csv": "89y10phsdwhs09whljwc09whcowsdhcwodcydw"},
)
# You can also load the registry from a file. Each line contains a file
# name and its SHA256 hash separated by a space. This makes it easier to
# manage large numbers of data files. The registry file should be packaged
# and distributed with your software.
# Note: pkg_resources is deprecated in recent setuptools releases;
# importlib.resources is the standard-library replacement.
GOODBOY.load_registry(
    pkg_resources.resource_stream("mypackage", "registry.txt")
)

# Define functions that your users can call to get back the data in memory
def fetch_gravity_data():
    """
    Load some sample gravity data to use in your docs.
    """
    # Fetch the path to a file in the local storage. If it's not there,
    # we'll download it.
    fname = GOODBOY.fetch("gravity-data.csv")
    # Load it with numpy/pandas/etc
    data = pandas.read_csv(fname)
    return data
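The configuration in the module above boils down to two small rules: a registry file maps file names to hashes, and the download URL is the formatted `base_url` followed by the file name, with `version_dev` substituted for development versions. The sketch below illustrates those rules with two hypothetical helpers, `parse_registry` and `resolve_url`. It assumes that a "+" in the version string marks a development version, which is a simplification and not necessarily how Pooch checks it.

```python
def parse_registry(text):
    """Parse 'file-name hash' lines into a dict, skipping blanks and comments."""
    registry = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Assumption: name and hash are the first two whitespace-separated fields.
        fname, file_hash = line.split()[:2]
        registry[fname] = file_hash.lower()
    return registry


def resolve_url(base_url, version, fname, version_dev="main"):
    """Build the download URL for one registry entry."""
    # Assumption: a "+" suffix (e.g. "v1.0.0+12.gabc123") means a dev version.
    effective = version_dev if "+" in version else version
    return base_url.format(version=effective) + fname


base = "https://github.com/myproject/mypackage/raw/{version}/data/"
print(resolve_url(base, "v1.0.0", "gravity-data.csv"))
print(resolve_url(base, "v1.0.0+12.gabc123", "gravity-data.csv"))
```

The first call points at the `v1.0.0` tag and the second at the `main` branch. To generate a registry file from a folder of data files, Pooch provides `pooch.make_registry(directory, output)`.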

Getting involved

🗨️ Contact us: Find out more about how to reach us at fatiando.org/contact.

👩🏾‍💻 Contributing to project development: Please read our Contributing Guide to see how you can help and give feedback.

🧑🏾‍🤝‍🧑🏼 Code of conduct: This project is released with a Code of Conduct. By participating in this project you agree to abide by its terms.

Imposter syndrome disclaimer: We want your help. No, really. There may be a little voice inside your head that is telling you that you're not ready, that you aren't skilled enough to contribute. We assure you that the little voice in your head is wrong. Most importantly, there are many valuable ways to contribute besides writing code.

This disclaimer was adapted from the MetPy project.

License

This is free software: you can redistribute it and/or modify it under the terms of the BSD 3-clause License. A copy of this license is provided in LICENSE.txt.

Download files

Source Distribution

pooch-1.8.2.tar.gz (59.4 kB)

Uploaded Source

Built Distribution

pooch-1.8.2-py3-none-any.whl (64.6 kB)

Uploaded Python 3

File details

Details for the file pooch-1.8.2.tar.gz.

File metadata

  • Download URL: pooch-1.8.2.tar.gz
  • Size: 59.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/5.0.0 CPython/3.12.3

File hashes

Hashes for pooch-1.8.2.tar.gz

Algorithm    Hash digest
SHA256       76561f0de68a01da4df6af38e9955c4c9d1a5c90da73f7e40276a5728ec83d10
MD5          7a333ef27c34984385c25f1e0b156185
BLAKE2b-256  c677b3d3e00c696c16cf99af81ef7b1f5fe73bd2a307abca41bd7605429fe6e5

File details

Details for the file pooch-1.8.2-py3-none-any.whl.

File metadata

  • Download URL: pooch-1.8.2-py3-none-any.whl
  • Size: 64.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/5.0.0 CPython/3.12.3

File hashes

Hashes for pooch-1.8.2-py3-none-any.whl

Algorithm    Hash digest
SHA256       3529a57096f7198778a5ceefd5ac3ef0e4d06a6ddaf9fc2d609b806f25302c47
MD5          830674534379589ada41ffb585feeea4
BLAKE2b-256  a88777cc11c7a9ea9fd05503def69e3d18605852cd0d4b0d3b8f15bbeb3ef1d1
