API for accessing the generalized Cloud Catalog (cloudcatalog) specification for sharing data in and across clouds

Project description

Cloud Catalog (cloudcatalog) Tool

This tool retrieves file catalog (index) files for a specific ID entry in a catalog within a bucket. It also provides search functionality across all data index catalogs found in the bucket list.

Use Case

Suppose there is a mission on S3 that follows the HelioCloud 'Cloud Catalog' specification, and you want to obtain specific files from this mission.

Initial Setup and Global Catalog

First, install the tool if it has not already been installed. Then import it into a script or shell. You will likely want to search the global catalog to find the specific bucket/catalog containing the data catalog files.
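
For example, from PyPI:

pip install cloudcatalog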

import cloudcatalog

# Create a CatalogRegistry object. By default it pulls from the HelioCloud
# global catalog; if an environment variable has been set for another
# global catalog, it pulls from there instead.
cr = cloudcatalog.CatalogRegistry()

# Print out the entire global catalog
print(cr.get_catalog())

# Print out the name + region of all global catalog entries.
# If we know roughly what the overarching bucket is called,
# this will help us find the exact name we need for the mission we want.
# Otherwise, other methods must be used to search for the bucket of interest.
print(cr.get_entries())
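
If an entry's exact name is already known from get_entries(), its bucket endpoint can be resolved directly. A minimal sketch using the get_endpoint() call referenced in the next section; the entry name here is a hypothetical placeholder:

# 'Example HelioCloud' is a hypothetical entry name taken from cr.get_entries()
bucket_name = cr.get_endpoint('Example HelioCloud', region_prefix='')
print(bucket_name)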

Finding and Requesting the File Catalog

At this point, you should have found the bucket containing the data of interest. Next, you will want to search the bucket-specific catalog (data catalog) for the ID representing the mission you want to obtain data for.

# With the bucket name we have obtained (possibly via cr.get_endpoint(name, region_prefix=''))
bucket_name = 'a-bucket-name'
# If this is not a public bucket, you may need to pass access_key or other
# boto3 S3 client-specific parameters to get the data.
# cache_folder is only used if cache is True and defaults to bucket_name + '_cache'
fr = cloudcatalog.CloudCatalog(bucket_name, cache_folder=None, cache=True)

# Print out the entire local catalog (datasets)
print(fr.get_catalog())

# To find the specific ID, we can also get the ID + Title with
print(fr.get_entries())

# Now, with the ID, we can request the catalog index files.
# If successful, this returns a pandas DataFrame of the file index,
# and if cache was set to True at initialization, it also saves
# the downloaded file index locally.
fr_id = 'a_dataset_id_from_the_catalog'
start_date = '2007-02-01T00:00:00Z'  # An ISO 8601 time within the mission/file-index coverage
stop_date = None  # An ISO 8601 time, or None to get all file indices after start_date
myfiles = fr.request_cloud_catalog(fr_id, start_date=start_date, stop_date=stop_date, overwrite=False)
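
The result is an ordinary pandas DataFrame, so standard pandas operations apply, for example:

# Inspect the returned file index
print(len(myfiles))               # number of files in the requested time range
print(myfiles.head())             # first rows: startdate, stopdate, key, filesize
print(myfiles['filesize'].sum())  # total size in bytes of the matched files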

Streaming Data from the File Catalog

You now have a pandas DataFrame with startdate, stopdate, key, and filesize for all of the mission's files within your specified start and stop dates. From here, you can use the key to stream the data through EC2, Lambda, or other processing methods.

This tool also offers a simple function for streaming the data once the file catalog is obtained:

cloudcatalog.CloudCatalog.stream(myfiles, lambda bfile, startdate, stopdate, filesize: print(len(bfile.read()), filesize))
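
For anything beyond a one-liner, the callback can be a named function. A minimal sketch that tallies the bytes actually streamed, assuming stream() calls the callback with the open file object and that row's startdate, stopdate, and filesize, as in the lambda above:

total_bytes = 0

def tally(bfile, startdate, stopdate, filesize):
    # bfile is assumed to be a readable, file-like object for one catalog entry
    global total_bytes
    total_bytes += len(bfile.read())

cloudcatalog.CloudCatalog.stream(myfiles, tally)
print(total_bytes, 'bytes streamed')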

Searching the Entire Catalog

As an alternative to manually searching, you can use the EntireCatalogSearch class to find a catalog entry:

search = cloudcatalog.EntireCatalogSearch()
top_search_result = search.search_by_keywords(['vector', 'mission', 'useful'])[0]
# Prints out the top result with all the catalog info, including id, loc, startdate, etc.
print(top_search_result)
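
A search hit can feed straight into a file-index request. A minimal sketch, assuming the result's 'loc' field holds a bucket location that CloudCatalog accepts and 'id' holds the dataset ID (the field names come from the comment above; the rest is an assumption):

# Hypothetical: treat the search result's 'loc' as a bucket and fetch its file index
fr = cloudcatalog.CloudCatalog(top_search_result['loc'])
files = fr.request_cloud_catalog(top_search_result['id'],
                                 start_date=top_search_result['startdate'],
                                 stop_date=None)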

Terse example for an SDO fetch of the file list for all the 94 Å EUV images (1,624,900 files)

import cloudcatalog

dataid = "aia_0094"
s3bucket = "s3://gov-nasa-hdrl-data1/"
fr = cloudcatalog.CloudCatalog(s3bucket)
mySDOlist = fr.request_cloud_catalog(dataid,
                                     start_date=fr.get_entry(dataid)['start'],
                                     stop_date=fr.get_entry(dataid)['stop'])
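
As a quick sanity check, the length of the returned DataFrame should match the file count quoted in the heading:

print(len(mySDOlist))  # expected: 1,624,900 files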

Terse example for an MMS fetch of the file list for a specific MMS dataset (64,383 files)

import cloudcatalog

dataid = "mms1_feeps_brst_electron"
s3bucket = "s3://helio-public/"
fr = cloudcatalog.CloudCatalog(s3bucket)
myMMSlist = fr.request_cloud_catalog(dataid,
                                     start_date=fr.get_entry(dataid)['start'],
                                     stop_date=fr.get_entry(dataid)['stop'])

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

cloudcatalog-0.6.1.tar.gz (75.3 kB, Source)

Built Distribution

cloudcatalog-0.6.1-py3-none-any.whl (20.3 kB, Python 3)

File details

Details for the file cloudcatalog-0.6.1.tar.gz.

File metadata

  • Download URL: cloudcatalog-0.6.1.tar.gz
  • Upload date:
  • Size: 75.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.16

File hashes

Hashes for cloudcatalog-0.6.1.tar.gz:

  • SHA256: 87a62a7e0fb80ae226c1f5acd31a6d8ccb8432202095f2cd18475f60b0f05384
  • MD5: e26fb908f84ddd7ae66118541c0b0047
  • BLAKE2b-256: 5cc2785bda332e29dc552ff5dc5e90650eb171d5ce4ef7fb658a7a7c66401c3f

See the PyPI documentation for more details on using hashes.

File details

Details for the file cloudcatalog-0.6.1-py3-none-any.whl.

File metadata

  • Download URL: cloudcatalog-0.6.1-py3-none-any.whl
  • Upload date:
  • Size: 20.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.16

File hashes

Hashes for cloudcatalog-0.6.1-py3-none-any.whl:

  • SHA256: 7b0fabde601cc9d702dc536e37887f0d9cbe7b9f574836e1e17447931d289f20
  • MD5: 66ac79615583ad3e2a630ba0a388cb3c
  • BLAKE2b-256: 3afc5e4638203a7d454efcb5d455ce9eed6d3921d2ced98eb1c2ab88b2815d99

See the PyPI documentation for more details on using hashes.
