Data Catalog

A catalog to define, create, store, and access datasets.

This Python package aims to streamline data engineering and data analysis during data science projects:

  • organize all datasets used or created by your project,
  • define datasets as a transformation of others,
  • easily propagate updates when datasets are updated,
  • avoid boilerplate code,
  • access datasets from anywhere in your code, without having to remember file paths,
  • share dataset definitions within a project team,
  • document datasets,
  • and enable smooth transitions from exploration to deployment.

Many data cataloging Python packages exist (kedro, prefect, ...) for slightly different use cases. This package is tailored to managing datasets during data science projects; the emphasis is on minimal boilerplate, easy data access, and no-effort updates.

Installation

Use a Python environment compatible with this project, created for example with conda:

conda create -n my_env python "pandas>=0.19" "dask>=0.2.0" "s3fs>=0.2.0" "pytz>=2011k" pytest pyarrow

Install this package:

pip install git+https://github.com/numerical-io/data_catalog.git@main

Example 1: a catalog of datasets

A data catalog is defined in a Python module or package. The following catalog defines three classes, each representing a dataset:

  • DatasetA, defined by code (create function),
  • DatasetB, defined as a transformation of DatasetA (create function and parents attribute),
  • DatasetC, defined and read from a CSV file (relative_path attribute).

Each class inherits from a base class that defines the data format on disk. This example uses CSV and Parquet files.

# example_catalog.py

import pandas as pd
from data_catalog.datasets import CsvDataset, ParquetDataset


class DatasetA(CsvDataset):
    """A dataset defined in code, and saved as a CSV file.
    """

    def create(self):
        df_a = pd.DataFrame({"col1": [1, 2, 3]})
        return df_a

    read_kwargs = {"index_col": 0}  # keyword arguments used when reading the dataset


class DatasetB(ParquetDataset):
    """A dataset defined from another dataset, and saved as a Parquet file.
    """

    parents = [DatasetA]

    def create(self, df_a):
        df_b = 2 * df_a
        return df_b


class DatasetC(CsvDataset):
    """A dataset defined in a CSV file.
    """

    relative_path = "dataset_c.csv"

This catalog definition contains all that is needed. The datasets defined in code can be generated by a succession of tasks, which we encode in a task graph. The task graph follows the Dask DAG format, and we execute it with Dask. (The graph itself is otherwise independent of Dask, and could be run by an engine of your choice.)

from data_catalog.taskgraph import create_task_graph
from dask.threaded import get

from example_catalog import DatasetA, DatasetB

# Define a context. The context is necessary to instantiate datasets.
# It contains a URI indicating where to save all the catalog's datasets.
context = {
    "catalog_uri": "file:///path/to/data/folder"
}

# Generate a task graph to create datasets, resolving dependencies between them.
datasets = [DatasetA, DatasetB] # leave out DatasetC unless you provide a file dataset_c.csv
taskgraph, targets = create_task_graph(datasets, context)

# Let Dask generate all datasets on disk
_ = get(taskgraph, targets)

Once the files are created, you can access datasets from anywhere in your project.

dataset_b = DatasetB(context)

# Load into a dataframe
df = dataset_b.read()

# View its description
dataset_b.description()

# View the file path
dataset_b.path()

Example 2: a catalog with collections of datasets

Sometimes data comes as a collection of identically formatted files. Collections of datasets handle this case.

Collections can be defined in a catalog as follows:

# example_catalog.py

import pandas as pd
from data_catalog.datasets import ParquetDataset
from data_catalog.collections import FileCollection, same_key_in


class CollectionA(FileCollection):
    """A collection of datasets saved as Parquet files.
    """

    def keys(self):
        return ["file_1", "file_2", "file_3"]

    class Item(ParquetDataset):
        def create(self):
            df = pd.DataFrame({"col1": [1, 2, 3]})
            return df


class CollectionB(FileCollection):
    """A collection defined from CollectionA.

    Each item corresponds to one item in CollectionA.
    """

    def keys(self):
        return ["file_1", "file_2", "file_3"]

    class Item(ParquetDataset):
        parents = [same_key_in(CollectionA)]

        def create(self, df):
            return 2 * df


class DatasetD(ParquetDataset):
    """A dataset concatenating all items from CollectionA.
    """

    parents = [CollectionA]

    def create(self, collection_a):
        df = pd.concat(collection_a)
        return df

Files are generated exactly as in the previous example:

from data_catalog.taskgraph import create_task_graph
from dask.threaded import get

from example_catalog import CollectionA, CollectionB, DatasetD

# Define the catalog's context
context = {
    "catalog_uri": "file:///path/to/data/folder"
}

# Generate the task graph and run it with Dask
taskgraph, targets = create_task_graph(
    [CollectionA, CollectionB, DatasetD], context
)
_ = get(taskgraph, targets)

You can then access data anywhere in your project.

# Load a collection
CollectionA(context).read()

# Get a single dataset from a collection.
item_2 = CollectionA.get("file_2")

# item_2 is a regular dataset class, instantiated like any other dataset
df = item_2(context).read()

The task graph only includes the necessary updates. If all files exist and every parent was last updated before its children, no task is executed. If you modify a file, however, the task graph will contain tasks to update all of its descendants. When you modify the code of a dataset, delete the corresponding file to trigger its re-creation and the update of all its descendants.
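
The sketch below illustrates this behaviour using only the calls shown above; the exact tasks contained in a rebuilt graph depend on which files exist and on their modification times.

from data_catalog.taskgraph import create_task_graph
from dask.threaded import get

from example_catalog import CollectionA, CollectionB, DatasetD

context = {"catalog_uri": "file:///path/to/data/folder"}

# First run: all files are created.
taskgraph, targets = create_task_graph([CollectionA, CollectionB, DatasetD], context)
_ = get(taskgraph, targets)

# Second run: all files exist and are up to date, so nothing is re-created.
taskgraph, targets = create_task_graph([CollectionA, CollectionB, DatasetD], context)
_ = get(taskgraph, targets)

# To force the re-creation of one item (e.g. after changing its code), delete
# its file and rebuild the graph; its descendants in CollectionB and DatasetD
# are then updated as well. path() gives the file location.
CollectionA.get("file_2")(context).path()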

Dataset attributes

When defining a dataset class, you can set the following attributes:

  • parents: A list of dataset/collection classes from which this dataset is defined.
  • create: A method to create the dataset. It takes as inputs, aside from self, the data loaded from all classes in parents. The number of input arguments (not counting self) must therefore be equal to the length of parents. The method must return the created data.
  • relative_path: The file path, relative to the catalog URI.
  • file_extension: The file extension.
  • is_binary_file: A boolean indicating whether the file is binary (as opposed to text).
  • read_kwargs: A dict of keyword arguments for reading the dataset.
  • write_kwargs: A dict of keyword arguments for writing the dataset.

All these attributes are optional, and have default values if omitted.

When relative_path is missing, it is inferred from the class name and path in the package. For instance, a CSV dataset SomeDataset defined in the submodule example_catalog.part_one will have a relative path set to part_one/SomeDataset.csv.

If a docstring is set, it becomes the dataset description available through the description() method.

Datasets must inherit from a subclass of AbstractDataset. The data catalog provides a few such classes for common cases: CsvDataset, ParquetDataset, PickleDataset, ExcelDataset, and YamlDataset.
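
For illustration, here is a hypothetical dataset (SalesSummary does not appear in the examples above) that sets several of these attributes explicitly; it assumes that read_kwargs and write_kwargs are forwarded to the underlying pandas reader and writer, as read_kwargs is for DatasetA above.

import pandas as pd
from data_catalog.datasets import CsvDataset

from example_catalog import DatasetA


class SalesSummary(CsvDataset):
    """Column totals computed from DatasetA."""  # exposed through description()

    parents = [DatasetA]                            # create() receives one argument per parent
    relative_path = "summaries/sales_summary.csv"   # overrides the inferred path
    read_kwargs = {"index_col": 0}                  # keyword arguments used when reading
    write_kwargs = {"float_format": "%.2f"}         # keyword arguments used when writing

    def create(self, df_a):
        # df_a is the data loaded from DatasetA
        return df_a.sum().to_frame("total")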

Collection attributes

A collection can have the following attributes:

  • Item: A nested class defining a dataset in the collection. It is a template for each item in the collection.
  • keys: A method returning a list of keys. Each key maps to a collection item. Files in the collection are named after keys, and conversely.
  • relative_path: If set, this path refers to the directory containing collection data files. This value is used to define the relative_path for each Item.

Collections inherit from FileCollection.

Collections have a class method get that returns the dataset class for a given key.
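
As an illustration, the hypothetical collection below (not part of the catalog above) sets relative_path so that its files live in a dedicated directory, and uses get to retrieve a single item class:

import pandas as pd
from data_catalog.datasets import CsvDataset
from data_catalog.collections import FileCollection


class MonthlyReports(FileCollection):
    """One CSV file per month."""

    relative_path = "monthly_reports"  # directory containing the collection's files

    def keys(self):
        return ["2021-01", "2021-02", "2021-03"]

    class Item(CsvDataset):
        def create(self):
            return pd.DataFrame({"value": [1, 2, 3]})


# get() returns the dataset class corresponding to a key
Report202101 = MonthlyReports.get("2021-01")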

Managing the catalog

The data files reside at the URI set in the context variable, which is used to instantiate all objects. The catalog currently supports URIs pointing to local files (file://) or to S3 (s3://). Note that the catalog itself is defined independently of its location; only data instances depend on the context. This makes it easy to keep several copies, e.g. for sharing between users or for versioning datasets.
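
For example, the same dataset class can read from a local copy or from a shared copy on S3 simply by switching contexts. The bucket name below is hypothetical, and S3 access requires s3fs (included in the conda environment above).

from example_catalog import DatasetB

# Local copy, e.g. for exploration
local_context = {"catalog_uri": "file:///path/to/data/folder"}

# Shared copy on S3, e.g. for the project team (hypothetical bucket)
shared_context = {"catalog_uri": "s3://my-bucket/project-data"}

# The same class reads from either location, depending on the context
df_local = DatasetB(local_context).read()
df_shared = DatasetB(shared_context).read()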

To view all datasets and collections defined in a catalog, use the following functions:

from data_catalog.utils import describe_catalog, list_catalog

import example_catalog

# Get dataset names and descriptions, in a dict.
describe_catalog(example_catalog)

# List all classes representing datasets and collections.
# If example_catalog is a package, the list will contain
# classes from all _imported_ submodules.
list_catalog(example_catalog)

When running the task graph, each task logs messages to a logger named data_catalog. The following configuration displays these messages on sys.stderr:

import logging

logger = logging.getLogger("data_catalog")

logger.setLevel(logging.INFO)
log_formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
log_handler = logging.StreamHandler()
log_handler.setLevel(logging.INFO)
log_handler.setFormatter(log_formatter)
logger.addHandler(log_handler)
