Intake plugin for specifying a file-path pattern which can represent a number of different entries
Intake Pattern Catalog
intake-pattern-catalog is a plugin for Intake which allows you to specify a file-path pattern which can represent a number of different entries.
Note that this is different from the patterns you can write with the csv driver, which are turned into a single entry.
Installation instructions
pip install intake-pattern-catalog
# or
conda install intake-pattern-catalog
Usage
Specify driver: pattern_cat on a source to use this plugin in your catalogs.
Consider the following list of files in an S3 bucket:
- bucket-name/folder/a_1.csv
- bucket-name/folder/b_1.csv
- bucket-name/folder/c_1.csv
- bucket-name/folder/a_2.csv
- bucket-name/folder/b_2.csv
And the following catalog definition yaml file:
---
metadata:
  version: 1
sources:
  stuff:
    description: Stuff and things
    driver: pattern_cat
    args:
      urlpath: "s3://bucket-name/folder/{foo}_{bar}.csv"
      driver: csv
Derived datasets
If you would like to create a derived dataset based on a pattern_cat dataset, you can use driver: pattern_cat_transform, which will apply a transformation function to each entry returned by get_entry. For example, you can add the following under sources in the example YAML file above:
  stuff_transformed:
    description: Everything in stuff, doubled
    driver: pattern_cat_transform
    args:
      targets:
        - stuff
      transform: "path.to.doubling_function"
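The function named by transform is ordinary Python code importable at the given dotted path. A minimal sketch of what path.to.doubling_function might look like follows. The signature is an assumption (a single positional argument holding the data loaded from an entry, with the transformed data returned), so check the plugin's source before relying on it.

```python
def doubling_function(data):
    """Hypothetical transform: return the data with every value doubled.

    Assumes `data` supports multiplication by a scalar, as a pandas
    DataFrame containing only numeric columns does.
    """
    return data * 2
```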
Catalog API
Access entry by kwargs:
> catalog.stuff.get_entry(foo='a', bar=1)
sources:
  foo_a_bar_1:
    args:
      storage_options:
        use_listings_cache: false
      urlpath: s3://bucket-name/folder/a_1.csv
    description: ''
    driver: intake.source.csv.CSVSource
    metadata:
      catalog_dir: ...
Note that this entry can also be accessed as catalog.stuff.foo_a_bar_1.
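The relationship between the keyword arguments, the resolved urlpath, and the entry name in the output above can be illustrated with plain string formatting. This is a sketch, not the plugin's code: the helpers resolve_urlpath and entry_name are hypothetical, and the naming rule is inferred only from the single foo_a_bar_1 example.

```python
PATTERN = "s3://bucket-name/folder/{foo}_{bar}.csv"


def resolve_urlpath(pattern, **kwargs):
    """Fill the pattern's fields with the given values (hypothetical helper)."""
    return pattern.format(**kwargs)


def entry_name(**kwargs):
    """Join key/value pairs with underscores (rule inferred from the example)."""
    return "_".join(f"{key}_{value}" for key, value in kwargs.items())


urlpath = resolve_urlpath(PATTERN, foo="a", bar=1)
name = entry_name(foo="a", bar=1)
print(urlpath)  # s3://bucket-name/folder/a_1.csv
print(name)     # foo_a_bar_1
```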
See all valid kwarg combinations:
> catalog.stuff.get_entry_kwarg_sets()
[
    {"foo": "a", "bar": "1"},
    {"foo": "b", "bar": "1"},
    {"foo": "c", "bar": "1"},
    {"foo": "a", "bar": "2"},
    {"foo": "b", "bar": "2"},
]
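To see where a list like the one above could come from, the pattern can be turned into a regular expression and matched against each file path. The following is an illustrative sketch using only the standard library; it is not the plugin's implementation, and pattern_to_regex is a hypothetical helper. Because the values are extracted from file names, they come back as strings, which is consistent with "bar": "1" in the output above.

```python
import re

PATTERN = "bucket-name/folder/{foo}_{bar}.csv"
FILES = [
    "bucket-name/folder/a_1.csv",
    "bucket-name/folder/b_1.csv",
    "bucket-name/folder/c_1.csv",
    "bucket-name/folder/a_2.csv",
    "bucket-name/folder/b_2.csv",
]


def pattern_to_regex(pattern):
    """Turn each {field} into a named group and escape the literal text."""
    # Splitting on a capture group alternates literal text and field names,
    # so odd indices hold the field names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = "".join(
        f"(?P<{part}>[^/]+?)" if index % 2 else re.escape(part)
        for index, part in enumerate(parts)
    )
    return re.compile(regex)


regex = pattern_to_regex(PATTERN)
kwarg_sets = []
for path in FILES:
    match = regex.fullmatch(path)
    if match:  # paths that do not fit the pattern are skipped
        kwarg_sets.append(match.groupdict())

print(kwarg_sets[0])  # {'foo': 'a', 'bar': '1'}
```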
Caching
Caching in a pattern catalog is controlled by ttl (in seconds), an optional value under args that specifies how long the catalog waits after fetching the list of files matching the pattern before it fetches that list again. The default ttl is 60 seconds.
To force the catalog to always fetch the latest list of available entries, set ttl to 0.
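The ttl behaviour described above can be sketched as a small time-based cache around a listing function. This is an illustration of the idea only, not the plugin's code: the class TTLListing, its clock parameter, and fake_listing are all hypothetical names introduced for the example.

```python
import time


class TTLListing:
    """Cache the result of a listing function for `ttl` seconds (illustrative)."""

    def __init__(self, list_files, ttl=60, clock=time.monotonic):
        self._list_files = list_files
        self._ttl = ttl
        self._clock = clock  # injectable so the example can control time
        self._cached = None
        self._fetched_at = None

    def get(self):
        now = self._clock()
        expired = self._fetched_at is None or now - self._fetched_at >= self._ttl
        if expired:
            # Refresh the listing and remember when it was fetched.
            self._cached = self._list_files()
            self._fetched_at = now
        return self._cached


calls = []


def fake_listing():
    """Stand-in for an expensive remote listing; records each call."""
    calls.append(1)
    return ["bucket-name/folder/a_1.csv"]


fake_now = [0.0]
listing = TTLListing(fake_listing, ttl=60, clock=lambda: fake_now[0])
listing.get()        # first call fetches the listing
listing.get()        # within 60 seconds: served from the cache
fake_now[0] = 61.0
listing.get()        # ttl has elapsed: fetches again
print(len(calls))    # 2
```

With ttl=0 the elapsed time is never less than the ttl, so every call to get() in this sketch fetches a fresh listing.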