API for accessing the generalized CloudCatalog (cloudcatalog) specification for sharing data in and across clouds

CloudCatalog (cloudcatalog) API

Indexing millions of files for easy, searchable, yet serverless and decentralized access is hard. CloudCatalog is a lightweight CSV- and JSON-based indexing schema that enables HAPI-like "data ID + time range" queries on massive cloud datasets; it includes a Python implementation of the API and support tools. The key goals are that (1) data owners control their own indices, (2) indices are static files, which avoids server costs, (3) searching is efficient, and (4) indices are easy for scientists and data owners to construct and maintain (the 'lazy' part). Beyond supporting the FAIR principles of findability, accessibility, interoperability, and reusability, CloudCatalog is serverless and decentralized, so contributors can publish and update their open science data without worrying about external gatekeeping or server maintenance.

CloudCatalog is a generalized indexing specification for large cloud datasets.

For sharing datasets across cloud frameworks
Decentralized: data owners control their own data and access via JSON
RESTful & serverless (indices are flat CSV files alongside their datasets)
Removes need for doing slow/expensive disk ‘ls’ on large holdings
Searchable

The push to open science means many more published datasets, so finding and accessing them is an important problem to solve. CloudCatalog is an indexing method for sharing big datasets in cloud systems. It is scientist-friendly, and generating a set of indices is easy. It uses static index files in time-ordered CSV format that are easy to fetch, easy to access via an API, and cheap to host in both money and bandwidth. Metadata is kept in a simple JSON schema. We also provide a Python client toolset for scientists to access datasets that use CloudCatalog.
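To make the "data ID + time range" idea concrete, here is a minimal sketch of filtering one time-ordered CSV index by time range using only the Python standard library. It is not the cloudcatalog implementation: the column names (startdate, stopdate, key, filesize), the sample rows, the example-bucket paths, and the helper names parse_iso and filter_index are all assumptions made for illustration.

```python
import csv
import io
from datetime import datetime


def parse_iso(ts):
    # Map a trailing "Z" (UTC) to "+00:00", which every Python 3.7+ fromisoformat accepts
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def filter_index(csv_text, start, stop=None):
    """Return index rows whose time span overlaps [start, stop] (stop=None means open-ended)."""
    t0 = parse_iso(start)
    t1 = parse_iso(stop) if stop else None
    hits = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        row_start = parse_iso(row["startdate"])
        row_stop = parse_iso(row["stopdate"])
        if row_stop < t0:
            continue  # file ends before the requested window
        if t1 is not None and row_start > t1:
            break  # rows are time-ordered, so no later row can match
        hits.append(row)
    return hits


# Hypothetical index content; a real index would be fetched from the dataset's bucket
SAMPLE_INDEX = """startdate,stopdate,key,filesize
2007-01-31T00:00:00Z,2007-01-31T23:59:59Z,s3://example-bucket/d/20070131.fits,1024
2007-02-01T00:00:00Z,2007-02-01T23:59:59Z,s3://example-bucket/d/20070201.fits,2048
2007-02-02T00:00:00Z,2007-02-02T23:59:59Z,s3://example-bucket/d/20070202.fits,4096
"""

matches = filter_index(SAMPLE_INDEX, "2007-02-01T00:00:00Z", "2007-02-01T12:00:00Z")
for row in matches:
    print(row["key"], row["filesize"])
```

Because the index is a static, time-ordered file, a client can stop reading as soon as it passes the end of the requested window, and no server-side query engine is involved.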

The CloudCatalog specification and tools are open source, created by the HelioCloud project, and already used for 2 petabytes of publicly available NASA and scientist-contributed data. We hope the community continues to adopt the CloudCatalog standard (on GitHub, linked from heliocloud.org).

For sharing datasets across cloud frameworks
Decentralized: data owners control their own data and access
RESTful & serverless (indices are flat files alongside their datasets)
Removes need for doing slow/expensive disk ‘ls’ on large holdings
Global registry JSON points to owner-controlled ‘buckets’
Uses minimal JSON to list metadata, CSV files for indices
Searchable
Public specification here on GitHub.

The Specification enables anyone to index a public dataset such that other users can find it and retrieve file listings in a cost-effective serverless fashion.

The API retrieves the file catalog (index) files for a specific ID entry in a bucket's catalog. It can also search across all data index catalogs found in the bucket list.

Command-line tools

We also include command-line tools for creating and viewing the networked catalogs.

Viewing tools

cloudcatalog-tree: lists or returns all top-level datasets and the number of dataIDs available (fast)
cloudcatalog-spider: like 'tree', but also lists valid years and number of files (slow)

Generator/updater tools (beta, use at your own risk for now)

cloudcatalog-update-json: updates catalog.json using metadata from catalog_stub.json
cloudcatalog-update-csv: updates catalog.json using metadata from cat.csv
cloudcatalog-gui: GUI for selecting files for cloudcatalog-update-json
cloudcatalog-manifest2indices: tries to convert an AWS Manifest.csv to individual [dataID]_[YYYY].csv indices
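The per-dataset, per-year layout implied by the [dataID]_[YYYY].csv naming can be sketched as a simple grouping step. This is only an illustration and not the logic of cloudcatalog-manifest2indices: the input record layout, the sample values, the example-bucket paths, and the function name group_into_indices are assumptions.

```python
from collections import defaultdict


def group_into_indices(records):
    """Group (data_id, startdate, stopdate, key, filesize) tuples by data ID and start year."""
    indices = defaultdict(list)
    for data_id, startdate, stopdate, key, filesize in records:
        year = startdate[:4]  # ISO 8601 timestamps begin with the four-digit year
        indices[f"{data_id}_{year}.csv"].append((startdate, stopdate, key, filesize))
    # Identically formatted ISO 8601 UTC strings sort chronologically as plain text
    for rows in indices.values():
        rows.sort()
    return dict(indices)


# Hypothetical, unordered input records
records = [
    ("aia_0094", "2011-01-02T00:00:00Z", "2011-01-02T00:00:12Z", "s3://example-bucket/aia/b.fits", 200),
    ("aia_0094", "2010-12-31T00:00:00Z", "2010-12-31T00:00:12Z", "s3://example-bucket/aia/a.fits", 100),
    ("aia_0094", "2011-01-01T00:00:00Z", "2011-01-01T00:00:12Z", "s3://example-bucket/aia/c.fits", 300),
]

indices = group_into_indices(records)
for name in sorted(indices):
    print(name, len(indices[name]))
```

Splitting by year keeps each index file small, so a time-range query only needs to fetch the files for the years it touches.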

Use Case

Suppose there is a mission on S3 that follows the HelioCloud 'CloudCatalog' specification, and you want to obtain specific files from this mission.

Initial Setup and Global Catalog

First, install the tool if it is not already installed. Then import it into a script or shell. You will likely want to search the global catalog to find the specific bucket/catalog containing the data catalog files. To do so, create a CatalogRegistry object, which pulls from the default global catalog. The registry lists buckets, not datasets; each bucket owner retains direct control over which of their datasets to expose to the public.

import cloudcatalog

cr = cloudcatalog.CatalogRegistry()
print(cr.get_catalog())

print(cr.get_entries())

Finding and Requesting the File Catalog

At this point, you should have found the bucket containing the data of interest. Next, you will want to search the bucket-specific catalog (data catalog) for the ID representing the mission you want to obtain data for.

bucket_name = cr.get_endpoint('e.g. Bucket Mnemonic') # or hard-code, e.g. 's3://mybucket'
# If not a public bucket, pass access_key or boto S3 client params to access it
fr = cloudcatalog.CloudCatalog(bucket_name)  

# Print out the entire local catalog (datasets)
print(fr.get_catalog())

# To find the specific ID we can also get the ID + Title by
print(fr.get_entries_id_title())

# Now with the ID we can request the catalog index files as a Pandas dataframe
fr_id = 'a_dataset_id_from_the_catalog'
start_date = '2007-02-01T00:00:00Z'  # An ISO 8601 standard time
stop_date = None  # An ISO 8601 standard time or None for everything after start_date
# Dates are passed positionally (ID, start, stop), matching the SDO and MMS examples
myfiles = fr.request_cloud_catalog(fr_id, start_date, stop_date, overwrite=False)

Searching the Entire Catalog

You can use the EntireCatalogSearch class to find a catalog entry:

search = cloudcatalog.EntireCatalogSearch()
top_search_result = search.search_by_keywords(['vector', 'mission', 'useful'])[0]
print(top_search_result)
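To give a feel for what a keyword search over catalog entries involves, here is a minimal ranking sketch. It does not reproduce how EntireCatalogSearch scores results; the entry fields (id, title), the sample entries, and the function name rank_by_keywords are hypothetical.

```python
def rank_by_keywords(entries, keywords):
    """Return entries matching at least one keyword, best match first."""
    lowered = [k.lower() for k in keywords]
    scored = []
    for entry in entries:
        # Search across every field of the entry, case-insensitively
        text = " ".join(str(value) for value in entry.values()).lower()
        score = sum(1 for k in lowered if k in text)
        if score:
            scored.append((score, entry))
    # sort() is stable, so equally scored entries keep their catalog order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in scored]


# Hypothetical catalog entries
entries = [
    {"id": "img_euv", "title": "EUV images"},
    {"id": "vec_field", "title": "Vector field"},
    {"id": "mag_l2", "title": "Vector magnetometer mission data"},
]

results = rank_by_keywords(entries, ["vector", "mission", "useful"])
print([entry["id"] for entry in results])
```

Taking element [0] of such a result list, as in the example above, raises IndexError when nothing matches, so it is worth checking that the list is non-empty first.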

Specific example for an SDO fetch of the filelist for all the 94A EUV images (1,624,900 files)

import cloudcatalog
fr = cloudcatalog.CloudCatalog("s3://gov-nasa-hdrl-data1/")
dataid = "aia_0094"
start, stop = fr.get_entry(dataid)['start'], fr.get_entry(dataid)['stop']
mySDOlist = fr.request_cloud_catalog(dataid, start, stop)

Add-on example for an MMS fetch of the filelist for all of a specific MMS item (64,383 files)

dataid = "MMS1_FEEPS_BRST_L2_ELECTRON"
start, stop = fr.get_entry(dataid)['start'], fr.get_entry(dataid)['stop']
myMMSlist = fr.request_cloud_catalog(dataid, start, stop)

Streaming Data from the File Catalog

You now have a pandas DataFrame with startdate, stopdate, key, and filesize for all the files of the mission within your specified start and end dates. From here, you can use the key to stream some of the data through EC2, a Lambda, or other processing methods.

This tool also offers a simple function for streaming the data once the file catalog is obtained:

# cloud_catalog is the DataFrame returned by request_cloud_catalog (e.g. myfiles above)
cloudcatalog.CloudCatalog.stream(cloud_catalog, lambda bfile, startdate, stopdate, filesize: print(len(bfile.read()), filesize))
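The callback pattern used above (a function receiving bfile, startdate, stopdate, filesize for each file) can be sketched without any cloud access. This is an assumption-laden illustration, not the library's stream implementation: stream_rows, fake_opener, FAKE_STORE, and the example-bucket keys are hypothetical, and in-memory bytes stand in for objects fetched from S3.

```python
import io


def stream_rows(rows, opener, process):
    """Open each indexed file in turn and hand it to the callback."""
    for row in rows:
        with opener(row["key"]) as bfile:
            process(bfile, row["startdate"], row["stopdate"], row["filesize"])


# Hypothetical in-memory stand-in for a bucket
FAKE_STORE = {
    "s3://example-bucket/a.bin": b"abcd",
    "s3://example-bucket/b.bin": b"xyz",
}


def fake_opener(key):
    # Return a binary file-like object, as a real S3 read would
    return io.BytesIO(FAKE_STORE[key])


rows = [
    {"startdate": "2007-02-01T00:00:00Z", "stopdate": "2007-02-01T23:59:59Z",
     "key": "s3://example-bucket/a.bin", "filesize": 4},
    {"startdate": "2007-02-02T00:00:00Z", "stopdate": "2007-02-02T23:59:59Z",
     "key": "s3://example-bucket/b.bin", "filesize": 3},
]

sizes = []
stream_rows(
    rows,
    fake_opener,
    lambda bfile, startdate, stopdate, filesize: sizes.append((len(bfile.read()), filesize)),
)
print(sizes)
```

Comparing the number of bytes read with the filesize recorded in the index, as the one-line example does, is a cheap sanity check that the index and the stored files still agree.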

Full Notebook Tutorial

For an in-depth walkthrough using the CloudCatalog on NASA datasets, see CloudCatalog-Demo.ipynb

Release history

1.2.1 (this version): Mar 31, 2026
1.1.0: Jun 2, 2025
1.0.2: Feb 19, 2025
1.0.0: Sep 4, 2024
0.6.1: Aug 21, 2024
0.5: Mar 12, 2024
0.4: Oct 26, 2023

Download files

Source Distribution

cloudcatalog-1.2.1.tar.gz (497.9 kB)

Uploaded Mar 31, 2026 Source

Built Distribution

cloudcatalog-1.2.1-py3-none-any.whl (349.7 kB)

Uploaded Mar 31, 2026 Python 3

File details

Details for the file cloudcatalog-1.2.1.tar.gz.

File metadata

Download URL: cloudcatalog-1.2.1.tar.gz
Upload date: Mar 31, 2026
Size: 497.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.10

File hashes

Hashes for cloudcatalog-1.2.1.tar.gz
Algorithm	Hash digest
SHA256	`7ead497662ae54380c1cc6cfe35bbc38e87b0ae651dc31cd62c50c12a2b2f7b6`
MD5	`18d3f9169e6d87357debb024533b1c6a`
BLAKE2b-256	`feb15783de68ff3524b3a13d6dbf6e0aa5b930aaaedb5c6fc4e792a3b2ac3efe`

File details

Details for the file cloudcatalog-1.2.1-py3-none-any.whl.

File metadata

Download URL: cloudcatalog-1.2.1-py3-none-any.whl
Upload date: Mar 31, 2026
Size: 349.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.10

File hashes

Hashes for cloudcatalog-1.2.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`36e7c6f7c8a9dde0df3e4c9324578baec8b324476200f6ea9196c3acc8b574c5`
MD5	`da4b88528cbf61f5efb70858f84823ad`
BLAKE2b-256	`666b22e5beda18853f5c63443dbd22eaeacf5b0d34dbdc5229017d7f12ea9327`
