DISPATCHES data packages

Introduction

What is it?

A simple way to distribute and refer to data files that cannot be included directly in the DISPATCHES repository.

Goals

  • Provide a reliable way to access the location of data files from DISPATCHES client code, regardless of how DISPATCHES is installed (editable vs non-editable installation, local development vs CI, etc.)
  • Leverage the built-in Python package distribution infrastructure as much as possible to distribute collections of related non-code files of small to moderate size (< 100 MB compressed)
  • Allow multiple repositories/package distributions to be used seamlessly, so that the size limits apply to each data package independently

Non-goals

  • Manage the packaged data in any way beyond the file-system level
    • i.e. the data package infrastructure only provides paths, which the client code uses to load the data in memory according to its specific needs
  • Manage and/or expose metadata beyond the name of the package and the Python package distribution used to install it
  • Automatically enforce data distribution compliance requirements (LICENSE, COPYRIGHT, etc.)
    • This MUST still be done, but the process shall be manual rather than automatic

Requirements and Conventions

  • DISPATCHES data packages SHALL be available on GitHub as repositories owned by the https://github.com/gmlc-dispatches organization

  • DISPATCHES data packages MAY be available on PyPI

  • The naming scheme SHOULD be consistent and follow this convention (using my-example as a placeholder):

    • Repository URL: https://github.com/gmlc-dispatches/my-example-data
    • Python package distribution name: dispatches-my-example-data
  • The repository SHOULD register itself by adding the dispatches-data-package topic so that all data package repositories can be browsed at the URL https://github.com/topics/dispatches-data-package

  • The repository MUST follow this directory structure:

    my-example-data/
    `- .git/
    `- pyproject.toml
    `- src/
      `- dispatches_data/
        `- packages/
          `- my_example/
            `- __init__.py
            `- README.md
    
  • Once installed, the data files SHALL be stored within the Python environment's site-packages directory as .../lib/python3.8/site-packages/dispatches_data/packages/my_example, i.e. the data package directory

  • The name of the data package directory (my_example) SHALL be used to refer to the data package

  • Users should access the data package and its contents using the functions available in the dispatches_data.api module

  • The Python package directory (i.e. .../lib/python3.8/site-packages/dispatches_data/packages/my_example) MUST contain ALL information required for distribution of the data

    • This includes, but is not limited to:
      • License
      • Copyright
  • The same information MAY be repeated at the top level of the repository, but it MUST be in the package directory

    • This is to ensure that all required information is always present when the data files are installed (which might not be the case if the information is stored at the top level of the repository)
  • More than one data package MAY be distributed together (i.e. as part of the same repository and/or Python package distribution)

  • In this case, all of the above requirements apply to each data package individually (i.e. each separate data package directory MUST contain the appropriate required information)
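Because compliance is checked manually, a small script can help confirm that a data package directory contains the expected files before publishing. The following is a minimal sketch, not part of the DISPATCHES tooling: the function name find_missing_files is hypothetical, and the file names LICENSE.md and COPYRIGHT.md are assumptions, since the requirements above state only that license and copyright information must be present, not what the files are called.

```python
from pathlib import Path
from typing import List

# Hypothetical file names: the requirements only say that license and
# copyright information must be present, not how the files are named.
REQUIRED_FILES = ("__init__.py", "README.md", "LICENSE.md", "COPYRIGHT.md")


def find_missing_files(repo_root: Path, package_name: str) -> List[str]:
    """Return the required files absent from a data package directory.

    The directory is resolved following the repository layout described
    above: <repo_root>/src/dispatches_data/packages/<package_name>.
    """
    package_dir = repo_root / "src" / "dispatches_data" / "packages" / package_name
    return [name for name in REQUIRED_FILES if not (package_dir / name).is_file()]
```

An empty list means that every file in REQUIRED_FILES was found. When several data packages share a repository, the check would be run once per data package directory.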

Usage

Step 1

Locate the data package(s) required by your application. In general, unless otherwise indicated, the naming conventions described above apply.

Using the same my_example placeholder as above, the data package repository will be located at https://github.com/gmlc-dispatches/my-example-data.git
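The naming conventions can be written down as a small helper, which may be useful when several data packages need to be located at once. This is an illustrative sketch only: DataPackageNames and conventional_names are hypothetical names, and the hyphen-to-underscore mapping is inferred from the my-example / my_example pair used in this document rather than from a stated rule.

```python
from typing import NamedTuple


class DataPackageNames(NamedTuple):
    repository_url: str
    distribution_name: str
    package_name: str


def conventional_names(placeholder: str) -> DataPackageNames:
    """Derive the conventional names for a placeholder such as "my-example"."""
    return DataPackageNames(
        repository_url=f"https://github.com/gmlc-dispatches/{placeholder}-data",
        distribution_name=f"dispatches-{placeholder}-data",
        # Assumption: the data package directory name is the placeholder
        # with hyphens replaced by underscores.
        package_name=placeholder.replace("-", "_"),
    )
```

Since the convention is a SHOULD rather than a MUST, individual data packages may deviate from it, so the derived names are a starting point rather than a guarantee.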

Step 2

Install the data package(s) required by your application, using pip.

pip install git+https://github.com/gmlc-dispatches/my-example-data.git

Step 3

Verify that the data packages were installed correctly, e.g.:

pip show dispatches-my-example-data
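The same check can be done from Python, for instance at the start of a script or in CI. The sketch below uses the standard-library importlib.metadata module (available from Python 3.8); the helper name installed_version is hypothetical and not part of dispatches_data.

```python
from importlib import metadata
from typing import Optional


def installed_version(distribution_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution_name)
    except metadata.PackageNotFoundError:
        return None
```

For example, installed_version("dispatches-my-example-data") would return a version string once the data package from Step 2 is installed, and None otherwise.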

Step 4

It should now be possible to access the data package from the client code, i.e. the DISPATCHES code that will load and use the data files, using the functions exposed in the dispatches_data.api module. These are simple functions that typically take the data package name (my_example) as a str argument.

Example

Let's assume we want to create a dataframe from a file named mydata.csv in the my_example data package.

In a Python file or Jupyter notebook:

import pandas as pd

from dispatches_data.api import path


def load_data() -> pd.DataFrame:
    path_to_csv_file = path("my_example") / "mydata.csv"
    df = pd.read_csv(path_to_csv_file)
    # process df as needed
    return df


def main():
    df = load_data()
    ...  # rest of the code
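When the file names inside a data package are not known in advance, the directory can be inspected first. The following is a minimal sketch using only pathlib; list_data_files is a hypothetical helper, and it assumes that the object returned by path() behaves like a pathlib.Path, as the / operator in the example above suggests.

```python
from pathlib import Path
from typing import List


def list_data_files(package_dir: Path, pattern: str = "*.csv") -> List[Path]:
    """Return the files in a directory matching a glob pattern, sorted by path."""
    # Only regular files are kept; subdirectories matching the pattern are skipped.
    return sorted(p for p in package_dir.glob(pattern) if p.is_file())
```

Under that assumption, calling list_data_files(path("my_example")) would return the CSV files located directly in the my_example data package directory.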

API Reference

See the documentation for the dispatches_data.api module on ReadTheDocs.
