A configurable replacement for `kedro catalog create`.
Kedro Auto Catalog
A configurable version of the built-in `kedro catalog create` CLI command.
Default dataset types can be configured in the project's `settings.py`, so that
generated catalog entries use these types rather than `MemoryDataSet`.
Installation
pip install kedro-auto-catalog
Configuration
Configure the project defaults in `src/<project_name>/settings.py` with this dict.
AUTO_CATALOG = {
    "directory": "data",
    "subdirs": ["raw", "intermediate", "primary"],
    "layers": ["raw", "intermediate", "primary"],
    "default_extension": "parquet",
    "default_type": "pandas.ParquetDataSet",
}
Usage
To automatically create catalog entries for the `__default__` pipeline, run this
from the command line.
kedro auto-catalog -p __default__
If you want a reminder of the available options, use the `--help` flag.
❯ kedro auto-catalog --help
Usage: kedro auto-catalog [OPTIONS]
Create Data Catalog YAML configuration with missing datasets.
Add configurable datasets to Data Catalog YAML configuration file for each
dataset in a registered pipeline if it is missing from the `DataCatalog`.
The catalog configuration will be saved to
`<conf_source>/<env>/catalog/<pipeline_name>.yml` file.
Configure the project defaults in `src/<project_name>/settings.py` with this
dict.
Options:
  -e, --env TEXT       Environment to create Data Catalog YAML file in.
                       Defaults to `base`.
  -p, --pipeline TEXT  Name of a pipeline.  [required]
  -h, --help           Show this message and exit.
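The help text says the catalog is saved to `<conf_source>/<env>/catalog/<pipeline_name>.yml`. As a minimal sketch of how that path is assembled, the helper below joins the three parts with `pathlib`. The function name `catalog_path` and its `conf_source` default of `conf` are assumptions made for illustration; they are not part of the plugin's API.

```python
from pathlib import Path


def catalog_path(pipeline_name: str, env: str = "base", conf_source: str = "conf") -> Path:
    """Hypothetical helper: build <conf_source>/<env>/catalog/<pipeline_name>.yml."""
    # `env` defaults to "base", mirroring the documented default of --env.
    return Path(conf_source) / env / "catalog" / f"{pipeline_name}.yml"


if __name__ == "__main__":
    print(catalog_path("__default__").as_posix())
```

With the defaults above, the `__default__` pipeline maps to `conf/base/catalog/__default__.yml`.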
Example
Using the kedro-spaceflights example, running `kedro auto-catalog -p __default__`
yields the following catalog in `conf/base/catalog/__default__.yml`.
X_test:
  filepath: data/X_test.pq
  type: pandas.ParquetDataSet
X_train:
  filepath: data/X_train.pq
  type: pandas.ParquetDataSet
y_test:
  filepath: data/y_test.parquet
  type: pandas.ParquetDataSet
y_train:
  filepath: data/y_train.parquet
  type: pandas.ParquetDataSet
subdirs and layers
With the example configuration, `"subdirs": ["raw", "intermediate", "primary"]`
and `"layers": ["raw", "intermediate", "primary"]`, any leading subdir or layer
name in a dataset name is converted into a directory and a layer. If we rename
`y_test` to `raw_y_test`, the file `y_test.parquet` is placed in the `raw`
directory and assigned to the `raw` layer.
raw_y_test:
  filepath: data/raw/y_test.parquet
  layer: raw
  type: pandas.ParquetDataSet
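The prefix handling described above can be sketched in a few lines of Python. This is an illustration of the documented behaviour, not the plugin's actual implementation: the function name `build_entry` is hypothetical, and splitting the dataset name on the first underscore is an assumption.

```python
from __future__ import annotations

from typing import Any

# Same settings as the example configuration in this README.
AUTO_CATALOG = {
    "directory": "data",
    "subdirs": ["raw", "intermediate", "primary"],
    "layers": ["raw", "intermediate", "primary"],
    "default_extension": "parquet",
    "default_type": "pandas.ParquetDataSet",
}


def build_entry(name: str, config: dict[str, Any]) -> dict[str, str]:
    """Hypothetical helper: derive one catalog entry from a dataset name."""
    parts = [config["directory"]]
    entry: dict[str, str] = {}
    stem = name

    # Assumption: the leading subdir/layer is whatever precedes the first "_".
    prefix, sep, rest = name.partition("_")
    if sep and rest:
        if prefix in config.get("subdirs", []):
            parts.append(prefix)  # becomes a directory under "directory"
            stem = rest
        if prefix in config.get("layers", []):
            entry["layer"] = prefix  # becomes the entry's layer
            stem = rest

    entry["filepath"] = "/".join(parts) + f"/{stem}.{config['default_extension']}"
    entry["type"] = config["default_type"]
    return entry


if __name__ == "__main__":
    print(build_entry("y_test", AUTO_CATALOG))
    print(build_entry("raw_y_test", AUTO_CATALOG))
```

Under these assumptions, `y_test` keeps its full name and lands directly in `data/`, while `raw_y_test` loses the `raw_` prefix, moves into `data/raw/`, and gains `layer: raw`.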
License
kedro-auto-catalog is distributed under the terms of the MIT license.