Skip to main content

Intake plugin for specifying a file-path pattern which can represent a number of different entries

Project description

Intake Pattern Catalog

intake-pattern-catalog is a plugin for Intake which allows you to specify a file-path pattern which can represent a number of different entries.

Note that this is different from the patterns you can write with the csv driver which get turned into a single entry

Installation instructions

pip install intake-pattern-catalog
# or
conda install intake-pattern-catalog

Usage

Use driver: pattern_cat to use this driver in your catalogs.

Consider the following list of files in an S3 bucket:

  • bucket-name/folder/a_1.csv
  • bucket-name/folder/b_1.csv
  • bucket-name/folder/c_1.csv
  • bucket-name/folder/a_2.csv
  • bucket-name/folder/b_2.csv

And the following catalog definition yaml file:

---
metadata:
  version: 1
sources:
  stuff:
    description: Stuff and things
    driver: pattern_cat
    args:
      urlpath: "s3://bucket-name/folder/{foo}_{bar}.csv"
      driver: csv

Derived datasets

If you would like to create a derived dataset based on a pattern_cat dataset, you can use driver: pattern_cat_transform, which will apply a transformation function to each entry returned by get_entry. For example, you can add to the above example yaml file:

  stuff_transformed:
    description: Everything in stuff, doubled
    driver: pattern_cat_transform
    args:
      targets:
        - stuff
      transform: "path.to.doubling_function"

Catalog API

Access entry by kwargs:

> catalog.stuff.get_entry(foo='a', bar=1)
sources:
  foo_a_bar_1:
    args:
      storage_options:
        use_listings_cache: false
      urlpath: s3://bucket-name/folder/a_1.csv
    description: ''
    driver: intake.source.csv.CSVSource
    metadata:
      catalog_dir: ...

Note that this could also be accessed with catalog.stuff.foo_a_bar_1

See all valid kwarg combinations:

> catalog.stuff.get_entry_kwarg_sets()
[
    {"foo": "a", "bar": "1"},
    {"foo": "b", "bar": "1"},
    {"foo": "c", "bar": "1"},
    {"foo": "a", "bar": "2"},
    {"foo": "b", "bar": "2"},
]

Caching

The default way of controlling any caching with a pattern-catalog is using a ttl (in seconds), which is an optional value under args which specifies how long should wait after fetching a list of files which match the pattern before it loads them again. The default ttl is 60 seconds. If you want to force it to always get the latest list of available entries, set the ttl to 0.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

intake-pattern-catalog-2021.9.0.tar.gz (10.5 kB view hashes)

Uploaded Source

Built Distribution

intake_pattern_catalog-2021.9.0-py2.py3-none-any.whl (10.0 kB view hashes)

Uploaded Python 2 Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page