DISPATCHES data packages

Introduction

What is it?

A simple way to distribute and refer to data files that cannot be included directly in the DISPATCHES repository.

Goals

  • Provide a reliable way to access the location of data files from DISPATCHES client code, regardless of the specifics of how DISPATCHES is installed (editable vs non-editable installation, local development vs CI, etc)
  • Leverage as much as possible the built-in Python package distribution infrastructure to distribute collections of related non-code files of small to moderate size (< 100 MB compressed)
  • Allow multiple repositories/package distributions to be used seamlessly, so that the size limits apply to each data package independently (illustrated in the sketch after this list)
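
As a minimal illustration of the last goal, the following sketch (not part of the documented API) assumes the implicit namespace package layout described under Requirements and Conventions below: every installed data package, whichever distribution it came from, appears under the shared dispatches_data.packages namespace and can be enumerated with the standard library.

import importlib
import pkgutil

# Import the shared namespace package that all data package distributions
# contribute to, then list the data packages currently installed in this
# environment, regardless of which distribution provided each of them.
packages_ns = importlib.import_module("dispatches_data.packages")
installed = sorted(info.name for info in pkgutil.iter_modules(packages_ns.__path__))
print(installed)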

Non-goals

  • Manage the packaged data in any way beyond the file-system level
    • i.e. the data package infrastructure only provides paths, which the client code uses to load the data in memory according to its specific needs
  • Manage and/or expose metadata beyond the name of the package and the Python package distribution used to install it
  • Automatically enforce data distribution compliance requirements (LICENSE, COPYRIGHT, etc)
    • This MUST still be done, but the process shall be manual rather than automatic

Requirements and Conventions

  • DISPATCHES data packages SHALL be available on GitHub as repositories owned by the https://github.com/gmlc-dispatches organization

  • DISPATCHES data packages MAY be available on PyPI

  • The naming scheme SHOULD be consistent and follow this convention (using my-example as a placeholder):

    • Repository URL: https://github.com/gmlc-dispatches/my-example-data
    • Python package distribution name: dispatches-my-example-data
  • The repository SHOULD register itself by adding the dispatches-data-package topic so that all data package repositories can be browsed at the URL https://github.com/topics/dispatches-data-package

  • The repository MUST follow this directory structure:

    my-example-data/
    ├── .git/
    ├── pyproject.toml
    └── src/
        └── dispatches_data/
            └── packages/
                └── my_example/
                    ├── __init__.py
                    └── README.md
    
  • Once installed, the data files SHALL be stored within the Python environment's site-packages directory as .../lib/python3.8/site-packages/dispatches_data/packages/my_example, i.e. the data package directory (a sketch of resolving this path directly follows this list)

  • The name of the data package directory (my_example) SHALL be used to refer to the data package

  • Users SHOULD access the data package and its contents using the functions available in the dispatches_data.api module

  • The Python package directory (i.e. .../lib/python3.8/site-packages/dispatches_data/packages/my_example) MUST contain ALL information required for distribution of the data

    • This includes, but is not limited to:
      • License
      • Copyright
  • The same information MAY be repeated at the top level of the repository, but it MUST be present in the data package directory

    • This is to ensure that all required information is always present when the data files are installed (which might not be the case if the information is stored at the top level of the repository)
  • More than one data package MAY be distributed together (i.e. as part of the same repository and/or Python package distribution)

  • In this case, all of the above requirements apply to each data package individually (i.e. each separate data package directory MUST contain the appropriate required information)
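
The layout above is what makes the installed data files locatable from Python. As a minimal sketch (this is not the documented dispatches_data.api interface, and it assumes Python 3.9+ where importlib.resources.files is available), the data package directory can be resolved directly with the standard library:

from importlib.resources import files

# Resolves to .../site-packages/dispatches_data/packages/my_example,
# regardless of how the distribution was installed (editable or not).
data_dir = files("dispatches_data.packages.my_example")
readme_text = (data_dir / "README.md").read_text()

Client code should normally prefer the dispatches_data.api functions described under Usage below.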

Usage

Step 1

Locate the data package(s) required by your application. In general, unless otherwise indicated, the naming conventions described above apply.

Using the same my_example placeholder as above, the data package repository will be located at https://github.com/gmlc-dispatches/my-example-data.git

Step 2

Install the data package(s) required by your application, using pip.

pip install git+https://github.com/gmlc-dispatches/my-example-data.git

Step 3

Verify that the data packages were installed correctly, e.g.:

pip show dispatches-my-example-data
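
The same check can also be done from Python with the standard library (a sketch; the distribution name follows the naming convention described above):

from importlib.metadata import version

# Prints the installed version, or raises PackageNotFoundError if the
# distribution is not installed.
print(version("dispatches-my-example-data"))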

Step 4

It should now be possible to access the data package from the client code, i.e. the DISPATCHES code that will load and use the data files, using the functions exposed in the dispatches_data.api module. These are simple functions that typically take the data package name (my_example) as a str argument.

Example

Let's assume we want to create a dataframe from a file named mydata.csv in the my_example data package.

In a Python file or Jupyter notebook:

import pandas as pd

from dispatches_data.api import path


def load_data() -> pd.DataFrame:
    path_to_csv_file = path("my_example") / "mydata.csv"
    df = pd.read_csv(path_to_csv_file)
    # process df as needed
    return df


def main():
    df = load_data()
    ...  # rest of the code
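
If the module is run as a script rather than imported, add the usual entry-point guard:

if __name__ == "__main__":
    main()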

API Reference

See the documentation for the dispatches_data.api module on ReadTheDocs.
