Data Catalog
A catalog to define, create, store, and access datasets.
This python package aims to streamline data engineering and data analysis during data science projects:
- organize all datasets used or created by your project,
- define datasets as a transformation of others,
- easily propagate updates when datasets are updated,
- avoid boilerplate code,
- access datasets from anywhere in your code, without having to remember file paths,
- share dataset definitions within a project team,
- document datasets,
- and enable smooth transitions from exploration to deployment.
Many data cataloging python packages exist (kedro, prefect, ...) for slightly different use cases. This package is tailored for managing datasets during data science projects. The emphasis is on minimal boilerplate, easy data access, and no-effort updates.
Installation
Use a python environment compatible with this project, e.g. with conda:
conda create -n my_env python "pandas>=0.19" "dask>=0.2.0" "s3fs>=0.2.0" "pytz>=2011k" pytest pyarrow
Install this package:
pip install git+https://github.com/numerical-io/data_catalog.git@main
Example 1: a catalog of datasets
A data catalog is defined in a python module or package. The following catalog defines three classes, each representing a dataset:
- DatasetA, defined by code (a create function),
- DatasetB, defined as a transformation of DatasetA (a create function and a parents attribute),
- DatasetC, defined and read from a CSV file (a relative_path attribute).
Each class inherits from a class defining the data format on disk. This example uses CSV and parquet files.
# example_catalog.py
import pandas as pd
from data_catalog.datasets import CsvDataset, ParquetDataset
class DatasetA(CsvDataset):
"""A dataset defined in code, and saved as a CSV file.
"""
def create(self):
df_a = pd.DataFrame({"col1": [1, 2, 3]})
return df_a
read_kwargs = {"index_col": 0}
class DatasetB(ParquetDataset):
"""A dataset defined from another dataset, and saved as a Parquet file.
"""
parents = [DatasetA]
def create(self, df_a):
df_b = 2 * df_a
return df_b
class DatasetC(CsvDataset):
"""A dataset defined in a CSV file.
"""
relative_path = "dataset_c.csv"
This catalog definition contains all that is needed. The datasets defined in code can be generated by a succession of tasks that we encode in a task graph. The task graph follows the Dask DAG format, and we execute it with Dask. (The graph itself is otherwise independent of Dask and could be run by an engine of your choice.)
from data_catalog.taskgraph import create_task_graph
from dask.threaded import get
from example_catalog import DatasetA, DatasetB
# Define a context. The context is necessary to instantiate datasets.
# It contains a URI indicating where to save all the catalog's datasets.
context = {
"catalog_uri": "file:///path/to/data/folder"
}
# Generate a task graph to create datasets, resolving dependencies between them.
datasets = [DatasetA, DatasetB] # leave out DatasetC unless you provide a file dataset_c.csv
taskgraph, targets = create_task_graph(datasets, context)
# Let Dask generate all datasets on disk
_ = get(taskgraph, targets)
Once the files are created, you can access datasets from anywhere in your project.
dataset_b = DatasetB(context)
# Load into a dataframe
df = dataset_b.read()
# View its description
dataset_b.description()
# View the file path
dataset_b.path()
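DatasetC was left out of the task graph above because it maps to a file you provide rather than to a create function. As a sketch, assuming you have placed a dataset_c.csv file in the catalog folder, it is read like any other dataset:
from example_catalog import DatasetC

# DatasetC defines no create() method; it only reads the existing CSV file
# located at <catalog_uri>/dataset_c.csv.
dataset_c = DatasetC(context)
df_c = dataset_c.read()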
Example 2: a catalog with collections of datasets
Sometimes data comes as a collection of identically formatted files. Collections of datasets are provided to handle this case.
Collections can be defined in a catalog as follows:
# example_catalog.py
import pandas as pd
from data_catalog.datasets import ParquetDataset
from data_catalog.collections import FileCollection, same_key_in
class CollectionA(FileCollection):
"""A collection of datasets saved as Parquet files.
"""
def keys(self):
return ["file_1", "file_2", "file_3"]
class Item(ParquetDataset):
def create(self):
df = pd.DataFrame({"col1": [1, 2, 3]})
return df
class CollectionB(FileCollection):
"""A collection defined from CollectionA.
Each item corresponds to one item in CollectionA.
"""
def keys(self):
return ["file_1", "file_2", "file_3"]
class Item(ParquetDataset):
parents = [same_key_in(CollectionA)]
def create(self, df):
return 2 * df
class DatasetD(ParquetDataset):
"""A dataset concatenating all items from CollectionA.
"""
parents = [CollectionA]
def create(self, collection_a):
df = pd.concat(collection_a)
return df
The generation of files is identical to the previous example:
from data_catalog.taskgraph import create_task_graph
from dask.threaded import get
from example_catalog import CollectionA, CollectionB, DatasetD
# Define the catalog's context
context = {
"catalog_uri": "file:///path/to/data/folder"
}
# Generate the task graph and run it with Dask
taskgraph, targets = create_task_graph(
[CollectionA, CollectionB, DatasetD], context
)
_ = get(taskgraph, targets)
You can then access data anywhere in your project.
# Load a collection
CollectionA(context).read()
# Get a single dataset from a collection.
item_2 = CollectionA.get("file_2")
# item_2 is a usual dataset object
df = item_2(context).read()
The task graph only includes necessary updates. If all files exist and parents have older update times than their children, no task will be executed. If however you modify a file, the task graph will contain tasks to update all its descendants. When modifying the code of a dataset, remove the corresponding file to trigger its re-creation, and the updates of all its descendants.
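For instance, after modifying the create code of DatasetA, a minimal sketch of this refresh workflow could look as follows (reusing the context from Example 1, and assuming that path() returns a local file path for a file:// catalog):
import os
from dask.threaded import get
from data_catalog.taskgraph import create_task_graph
from example_catalog import DatasetA, DatasetB

context = {"catalog_uri": "file:///path/to/data/folder"}

# Remove the stale file so that the task graph schedules its re-creation.
# (Assumes path() returns a local file path for file:// catalogs.)
os.remove(DatasetA(context).path())

# Re-running the graph recreates DatasetA and, since DatasetB is now older
# than its parent, updates DatasetB as well.
taskgraph, targets = create_task_graph([DatasetA, DatasetB], context)
_ = get(taskgraph, targets)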
Dataset attributes
When defining a dataset class, you can set the following attributes:
- parents: A list of dataset/collection classes from which this dataset is defined.
- create: A method to create the dataset. It takes as inputs, aside from self, the data loaded from all classes in parents. The number of input arguments (not counting self) must therefore equal the length of parents. The method must return the created data.
- relative_path: The file path, relative to the catalog URI.
- file_extension: The file extension.
- is_binary_file: A boolean indicating whether the file is a text or binary file.
- read_kwargs: A dict of keyword arguments for reading the dataset.
- write_kwargs: A dict of keyword arguments for writing the dataset.
All these attributes are optional, and have default values if omitted.
When relative_path is missing, it is inferred from the class name and its path in the package. For instance, a CSV dataset SomeDataset defined in the submodule example_catalog.part_one will have its relative path set to part_one/SomeDataset.csv.
If a docstring is set, it becomes the dataset description available through the description() method.
Datasets must inherit from a subclass of AbstractDataset. The data catalog provides a few such classes for common cases: CsvDataset, ParquetDataset, PickleDataset, ExcelDataset, and YamlDataset.
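As an illustration, the sketch below combines several of these attributes in one dataset. The module path, class name, and keyword arguments are hypothetical, and the keyword arguments are assumed to be forwarded to the underlying pandas reader and writer for a CsvDataset:
# example_catalog/part_one.py  (hypothetical submodule)
import pandas as pd
from data_catalog.datasets import CsvDataset

class SomeDataset(CsvDataset):
    """An illustrative dataset; this docstring is returned by description()."""

    # If omitted, the path would default to part_one/SomeDataset.csv
    relative_path = "raw/some_dataset.csv"

    # Assumed to be passed to pandas read_csv / to_csv for a CsvDataset
    read_kwargs = {"index_col": 0}
    write_kwargs = {"index": True}

    def create(self):
        return pd.DataFrame({"value": [0.1, 0.2]})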
Collection attributes
A collection can have the following attributes:
- Item: A nested class defining a dataset in the collection. It is a template for each item in the collection.
- keys: A method returning a list of keys. Each key maps to a collection item. Files in the collection are named after keys, and conversely.
- relative_path: If set, this path refers to the directory containing the collection's data files. This value is used to define the relative_path of each Item.
Collections inherit from FileCollection.
Collections have a class method get that returns dataset classes for given keys.
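As a sketch (the class name, keys, and directory are illustrative), a collection that sets relative_path and is accessed through get could look like this:
import pandas as pd
from data_catalog.datasets import CsvDataset
from data_catalog.collections import FileCollection

class MonthlyReports(FileCollection):
    """One CSV file per month."""

    # Item files are stored under <catalog_uri>/monthly_reports/
    relative_path = "monthly_reports"

    def keys(self):
        return ["2023-01", "2023-02"]

    class Item(CsvDataset):
        def create(self):
            return pd.DataFrame({"total": [42]})

# get() returns the dataset class associated with a key
context = {"catalog_uri": "file:///path/to/data/folder"}
JanuaryReport = MonthlyReports.get("2023-01")
df = JanuaryReport(context).read()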
Managing the catalog
The data files reside at the URI set in the context variable, which is used for instantiating all objects. The catalog currently supports URIs pointing to local files (file://) or to S3 (s3://). Note that the catalog itself is defined independently of its location; only data instances depend on the context. This facilitates keeping several copies, e.g. for sharing between different users or versioning datasets.
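For example, the same dataset class from Example 1 can be pointed at either location simply by changing the context; the bucket name and prefix below are placeholders:
from example_catalog import DatasetB

local_context = {"catalog_uri": "file:///path/to/data/folder"}
s3_context = {"catalog_uri": "s3://my-bucket/some/prefix"}  # placeholder bucket

# The same class definition works against either storage location.
df_local = DatasetB(local_context).read()
df_s3 = DatasetB(s3_context).read()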
To view all datasets and collections defined in a catalog, use the following functions:
from data_catalog.utils import describe_catalog, list_catalog
import example_catalog
# Get dataset names and descriptions, in a dict.
describe_catalog(example_catalog)
# List all classes representing datasets and collections.
# If example_catalog is a package, the list will contain
# classes from all _imported_ submodules.
list_catalog(example_catalog)
When running the task graph, each task logs messages to a logger named data_catalog. The following logger configuration shows these messages on sys.stderr:
import logging
logger = logging.getLogger("data_catalog")
logger.setLevel(logging.INFO)
log_formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
log_handler = logging.StreamHandler()
log_handler.setLevel(logging.INFO)
log_handler.setFormatter(log_formatter)
logger.addHandler(log_handler)
File details
Details for the file project_data_catalog-0.3.1.tar.gz.
File metadata
- Download URL: project_data_catalog-0.3.1.tar.gz
- Upload date:
- Size: 13.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: python-httpx/0.23.3
File hashes
Algorithm | Hash digest
---|---
SHA256 | d61d54e89e42e610594f24923cc1c81d8e16f6d6bafd1e09cda5721b1ef527cd
MD5 | 5f9b9b89f0c94ccf0be10635763330a6
BLAKE2b-256 | 0baaf44c5e803d2817b715bd0972c92b7392e5f33e87fe02c156147fbcf1db30
File details
Details for the file project_data_catalog-0.3.1-py3-none-any.whl.
File metadata
- Download URL: project_data_catalog-0.3.1-py3-none-any.whl
- Upload date:
- Size: 16.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: python-httpx/0.23.3
File hashes
Algorithm | Hash digest
---|---
SHA256 | 2b4eba5f4470834161f2fcaa0746aa317bfe3bcbb7f714198c83833b2025042c
MD5 | 4a936209469c65352d99ba9c616cc6e9
BLAKE2b-256 | c823bdced0d7418312e0fdddae39237857a52d6418270e7d1e14f046b3be8a1f