Intake plugin for specifying a file-path pattern which can represent a number of different entries

Intake Pattern Catalog

Available on PyPI

intake-pattern-catalog is a plugin for Intake that lets you specify a file-path pattern which can represent a number of different entries, one for each file that matches the pattern.

Note that this is different from the path patterns you can write with the csv driver, where all matching files get turned into a single entry.

Installation instructions

pip install intake-pattern-catalog
# or
conda install intake-pattern-catalog

Usage

Set driver: pattern_cat on a source to use this plugin in your catalogs.

Consider the following list of files in an S3 bucket:

  • bucket-name/folder/a_1.csv
  • bucket-name/folder/b_1.csv
  • bucket-name/folder/c_1.csv
  • bucket-name/folder/a_2.csv
  • bucket-name/folder/b_2.csv

And the following catalog definition YAML file:

---
metadata:
  version: 1
sources:
  stuff:
    description: Stuff and things
    driver: pattern_cat
    args:
      urlpath: "s3://bucket-name/folder/{foo}_{bar}.csv"
      driver: csv
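
To make the mapping from pattern to entries more concrete, here is a minimal sketch in plain Python of how a pattern like the urlpath above could be matched against the listed files to recover one set of field values per file. It is an illustration only and not the plugin's actual implementation; the helper names pattern_to_regex and kwarg_sets are hypothetical.

```python
import re

PATTERN = "s3://bucket-name/folder/{foo}_{bar}.csv"

FILES = [
    "s3://bucket-name/folder/a_1.csv",
    "s3://bucket-name/folder/b_1.csv",
    "s3://bucket-name/folder/c_1.csv",
    "s3://bucket-name/folder/a_2.csv",
    "s3://bucket-name/folder/b_2.csv",
]


def pattern_to_regex(pattern):
    """Turn each '{name}' field into a named regex group; escape the rest."""
    # re.split with a capturing group alternates: literal, field, literal, ...
    pieces = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for index, piece in enumerate(pieces):
        if index % 2 == 0:
            regex += re.escape(piece)           # literal text
        else:
            regex += "(?P<%s>[^/]+?)" % piece   # field: anything but '/'
    return re.compile(regex)


def kwarg_sets(pattern, paths):
    """Return one dict of field values for every path matching the pattern."""
    compiled = pattern_to_regex(pattern)
    matches = (compiled.fullmatch(path) for path in paths)
    return [match.groupdict() for match in matches if match is not None]


for kwargs in kwarg_sets(PATTERN, FILES):
    print(kwargs)
```

Every value recovered this way is a string, because it comes straight from the file name.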

Derived datasets

If you would like to create a derived dataset based on a pattern_cat dataset, you can use driver: pattern_cat_transform, which will apply a transformation function to each entry returned by get_entry. For example, you can add to the above example yaml file:

  stuff_transformed:
    description: Everything in stuff, doubled
    driver: pattern_cat_transform
    args:
      targets:
        - stuff
      transform: "path.to.doubling_function"

Catalog API

Access entry by kwargs:

> catalog.stuff.get_entry(foo='a', bar=1)
sources:
  foo_a_bar_1:
    args:
      storage_options:
        use_listings_cache: false
      urlpath: s3://bucket-name/folder/a_1.csv
    description: ''
    driver: intake.source.csv.CSVSource
    metadata:
      catalog_dir: ...

Note that the same entry could also be accessed with catalog.stuff.foo_a_bar_1.
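
The entry name in the example appears to be built by joining each field name and its value with underscores. The sketch below reproduces that rule for the one case shown; the function entry_name is hypothetical and the rule is inferred from this single example, so the plugin may order or format fields differently in other cases.

```python
def entry_name(**kwargs):
    """Join each field name and value with underscores, in the order given."""
    return "_".join(f"{key}_{value}" for key, value in kwargs.items())


print(entry_name(foo="a", bar=1))  # prints foo_a_bar_1
```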

See all valid kwarg combinations:

> catalog.stuff.get_entry_kwarg_sets()
[
    {"foo": "a", "bar": "1"},
    {"foo": "b", "bar": "1"},
    {"foo": "c", "bar": "1"},
    {"foo": "a", "bar": "2"},
    {"foo": "b", "bar": "2"},
]
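
As a sketch of what can be done with this list, the snippet below groups the kwarg sets by foo and reports which foo values have no file for bar "2". The list is hard-coded from the output shown here so the example runs without a catalog; in real use it would come from get_entry_kwarg_sets(). Note that the values are strings.

```python
from collections import defaultdict

# Copied from the example output; normally returned by get_entry_kwarg_sets().
kwarg_sets = [
    {"foo": "a", "bar": "1"},
    {"foo": "b", "bar": "1"},
    {"foo": "c", "bar": "1"},
    {"foo": "a", "bar": "2"},
    {"foo": "b", "bar": "2"},
]

# Collect the available bar values for each foo.
bars_by_foo = defaultdict(list)
for kwargs in kwarg_sets:
    bars_by_foo[kwargs["foo"]].append(kwargs["bar"])

# Find the foo values that have no entry with bar == "2".
missing_bar_2 = [foo for foo, bars in bars_by_foo.items() if "2" not in bars]
print(missing_bar_2)  # prints ['c']
```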

Caching

The default way of controlling caching with a pattern-catalog is a ttl (time to live, in seconds). This is an optional value under args which specifies how long the catalog waits, after fetching the list of files that match the pattern, before it fetches the list again. The default ttl is 60 seconds. If you want to force it to always get the latest list of available entries, set the ttl to 0.
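
The ttl behaviour described above can be sketched as a small time-based cache around a file-listing function. This is a generic illustration, not the plugin's code: the class TTLListing and its parameters list_files and clock are hypothetical names chosen for the example.

```python
import time


class TTLListing:
    """Cache the result of a listing function for `ttl` seconds."""

    def __init__(self, list_files, ttl=60, clock=time.monotonic):
        self._list_files = list_files  # callable returning the matching files
        self._ttl = ttl                # seconds to keep a fetched listing
        self._clock = clock            # injectable so the cache can be tested
        self._files = None
        self._fetched_at = None

    def files(self):
        """Return the cached listing, refetching once the ttl has elapsed."""
        now = self._clock()
        expired = (
            self._fetched_at is None
            or now - self._fetched_at >= self._ttl
        )
        if expired:
            self._files = self._list_files()
            self._fetched_at = now
        return self._files
```

With ttl=0 the elapsed time is always at least the ttl, so every call to files() fetches a fresh listing, which matches the behaviour described for a ttl of 0.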
