A configurable replacement for `kedro catalog create`.
Kedro Auto Catalog
A configurable version of the built-in `kedro catalog create` CLI command.
Default types can be configured in the project's settings.py, so that generated
entries use those types rather than MemoryDataSets.
Installation
pip install kedro-auto-catalog
Configuration
Configure the project defaults in src/<project_name>/settings.py with this dict.
AUTO_CATALOG = {
    "directory": "data",
    "subdirs": ["raw", "intermediate", "primary"],
    "layers": ["raw", "intermediate", "primary"],
    "default_extension": "parquet",
    "default_type": "pandas.ParquetDataSet",
}
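As an illustration of changing the defaults, the sketch below swaps the Parquet defaults for CSV ones. The keys are the ones shown above. The values "csv" and "pandas.CSVDataSet" are assumptions made for this example, and they presume a Kedro version that still provides dataset classes under the DataSet spelling.

```python
# src/<project_name>/settings.py
# Illustrative variation only: same keys as the example above,
# with CSV chosen as the assumed default format.
AUTO_CATALOG = {
    "directory": "data",
    "subdirs": ["raw", "intermediate", "primary"],
    "layers": ["raw", "intermediate", "primary"],
    "default_extension": "csv",  # assumed value for this sketch
    "default_type": "pandas.CSVDataSet",  # assumed value for this sketch
}
```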
Usage
To auto-create catalog entries for the __default__ pipeline, run this from the
command line.
kedro auto-catalog -p __default__
If you want a reminder of what to do, use the --help flag.
❯ kedro auto-catalog --help
Usage: kedro auto-catalog [OPTIONS]

  Create Data Catalog YAML configuration with missing datasets.

  Add configurable datasets to Data Catalog YAML configuration file for each
  dataset in a registered pipeline if it is missing from the `DataCatalog`.

  The catalog configuration will be saved to
  `<conf_source>/<env>/catalog/<pipeline_name>.yml` file.

  Configure the project defaults in `src/<project_name>/settings.py` with this
  dict.

Options:
  -e, --env TEXT       Environment to create Data Catalog YAML file in.
                       Defaults to `base`.
  -p, --pipeline TEXT  Name of a pipeline.  [required]
  -h, --help           Show this message and exit.
Example
Using the kedro-spaceflights example, running kedro auto-catalog -p __default__
yields the following catalog in conf/base/catalog/__default__.yml:
X_test:
  filepath: data/X_test.pq
  type: pandas.ParquetDataSet
X_train:
  filepath: data/X_train.pq
  type: pandas.ParquetDataSet
y_test:
  filepath: data/y_test.parquet
  type: pandas.ParquetDataSet
y_train:
  filepath: data/y_train.parquet
  type: pandas.ParquetDataSet
subdirs and layers
With the example configuration, "subdirs": ["raw", "intermediate", "primary"]
and "layers": ["raw", "intermediate", "primary"], any leading subdir or layer
name in a dataset name is converted into a directory and a layer. For example,
renaming y_test to raw_y_test puts y_test.parquet in the raw directory and
assigns it to the raw layer.
raw_y_test:
  filepath: data/raw/y_test.parquet
  layer: raw
  type: pandas.ParquetDataSet
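The naming rule described above can be sketched in a few lines of Python. This is a rough illustration, not the plugin's actual implementation: the helper name sketch_entry is hypothetical, and the sketch assumes the prefix is separated from the rest of the dataset name by the first underscore, as in raw_y_test.

```python
from typing import Any, Dict

# Example configuration copied from the Configuration section.
AUTO_CATALOG = {
    "directory": "data",
    "subdirs": ["raw", "intermediate", "primary"],
    "layers": ["raw", "intermediate", "primary"],
    "default_extension": "parquet",
    "default_type": "pandas.ParquetDataSet",
}


def sketch_entry(name: str, config: Dict[str, Any]) -> Dict[str, str]:
    """Hypothetical helper: build one catalog entry for a dataset name."""
    # Assumption: the candidate prefix is everything before the first underscore.
    prefix, sep, rest = name.partition("_")
    has_prefix = bool(sep) and bool(rest)
    subdir = prefix if has_prefix and prefix in config["subdirs"] else None
    layer = prefix if has_prefix and prefix in config["layers"] else None

    # Drop the prefix from the file name only when it became a directory.
    stem = rest if subdir else name
    parts = [config["directory"]]
    if subdir:
        parts.append(subdir)
    parts.append(f"{stem}.{config['default_extension']}")

    entry = {"filepath": "/".join(parts), "type": config["default_type"]}
    if layer:
        entry["layer"] = layer
    return entry


print(sketch_entry("raw_y_test", AUTO_CATALOG))
print(sketch_entry("y_test", AUTO_CATALOG))
```

Under these assumptions, raw_y_test maps to data/raw/y_test.parquet with layer raw, while y_test has no recognized prefix and stays at data/y_test.parquet with no layer.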
License
kedro-auto-catalog is distributed under the terms of the MIT license.