icoscp_core

A foundational ICOS Carbon Portal (CP) core products Python library for metadata and data access, designed to work with multiple data repositories that use the ICOS Carbon Portal core server software stack to host and serve their data. At the moment, three repositories are supported: ICOS, SITES, and ICOS Cities.

Design goals

  • offer basic functionality with good performance
  • good alignment with the server APIs and ICOS metadata model
  • minimise dependencies (only depend on numpy and dacite)
  • aim for good integration with pandas without depending on this package
  • provide a solid foundation for future versions of icoscp, an ICOS-specific meta- and data access library developed by the CP Elaborated Products team
  • extensive use of type annotations and Python data classes, to safeguard against preventable bugs, both in the library itself, and in the tools and apps written on top of it; a goal is to satisfy the typechecker in strict mode
  • usage of autogenerated data classes produced from Scala back end code representing various metadata entities (e.g. data objects, stations) and their parts
  • simultaneous support for three cross-cutting concerns:
    • multiple repositories (ICOS, SITES, ICOS Cities)
    • multiple ways of authentication
    • data access through the HTTP API (on an arbitrary machine) and through file system (on a Jupyter notebook with "backdoor" data access); in the latter case the library is responsible for reporting the data usage event.

Getting started

The library is available on PyPI and can be installed with pip:

$ pip install icoscp_core

The code examples below are usually provided for ICOS. For other repositories (SITES or ICOS Cities), use icoscp_core.sites or icoscp_core.cities, respectively, instead of icoscp_core.icos in the import directives.

Authentication

Metadata access does not require authentication, and is achieved by a simple import:

from icoscp_core.icos import meta

Additionally, when using the library on a suitably configured Jupyter notebook service hosted by the ICOS Carbon Portal, authentication is not required for two of the data access methods:

  • get_columns_as_arrays
  • batch_get_columns_as_arrays

available on the data object importable from the icoscp_core.icos package.

When using other data access methods, when running the code outside the ICOS Jupyter environment, or if the Jupyter environment has not been provisioned with file access to your repository, authentication is required for data access.

Authentication can be initialized in a number of ways.

Credentials and token cache file (default)

This approach should only be used on machines the developer trusts.

A username/password account with the respective authentication service (links for: ICOS, SITES, ICOS Cities) is required for this. An obfuscated (not human-readable) password is stored in a file on the local machine, in a default user-specific folder. To initialize this file, run the following code interactively (this only needs to be done once per machine):

from icoscp_core.icos import auth

auth.init_config_file()

After the initialization step is done, access to the metadata and data services is achieved by a simple import:

from icoscp_core.icos import meta, data

As an alternative, the developer may choose to use a specific file to store the credentials and token cache. In this scenario, the data service needs to be initialized as follows:

from icoscp_core.icos import bootstrap
auth, meta, data = bootstrap.fromPasswordFile("<desired path to the file>")

# the next line needs to be run interactively (only once per file)
auth.init_config_file()

Static authentication token (prototyping)

This option is good for testing, on a public machine, or in general. Its only disadvantage is that the tokens have a limited period of validity (100000 seconds, less than 28 hours), but this is precisely what makes it acceptable to include them directly in Python source code.

The token can be obtained from the "My Account" page (links for: ICOS, SITES, ICOS Cities), which can be accessed by logging in using one of the supported authentication mechanisms (username/password, university sign-in, OAuth sign-in). After this, the bootstrapping can be done as follows:

from icoscp_core.icos import bootstrap
cookie_token = 'cpauthToken=WzE2OTY2NzQ5OD...'
meta, data = bootstrap.fromCookieToken(cookie_token)

Explicit credentials (advanced option)

The user may choose their own mechanism of providing the credentials to initialize the authentication. This should be considered an advanced option. (Please do not put your password as clear text in your Python code!) It can be achieved as follows:

from icoscp_core.icos import bootstrap
meta, data = bootstrap.fromCredentials(username_variable, password_containing_variable)
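
For example, the credentials could be sourced from environment variables, so they never appear in the source code. The helper and variable names below are illustrative, not part of the library:

```python
import os

def credentials_from_env() -> tuple[str, str]:
    # Hypothetical helper (not part of icoscp_core): read the account
    # credentials from environment variables instead of hard-coding them.
    # The environment variable names are arbitrary examples.
    return os.environ["ICOS_CP_USERNAME"], os.environ["ICOS_CP_PASSWORD"]

# username, password = credentials_from_env()
# meta, data = bootstrap.fromCredentials(username, password)
```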

Metadata access

Metadata access requires no authentication, and is performed using an instance of MetadataClient class easily obtainable through an import:

from icoscp_core.icos import meta

An important piece of background information is that all metadata-represented entities (data objects, data types, documents, collections, measurement stations, people, etc.) are identified by URIs. The metadata-access methods usually accept these URIs as input arguments, and the returned values tend to be instances of Python dataclasses, which brings:

  • better syntax in comparison with generic dictionaries (dot-notation attribute access instead of dictionary value access, for example dobj_meta.specification.project.self.uri instead of dobj_meta["specification"]["project"]["self"]["uri"])
  • autocomplete of the dataclass attributes (works even in Jupyter notebooks)
  • type checking, when developing with type annotations and a type checker (typically available from an IDE, but not from Jupyter)
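
The difference can be illustrated with toy nested dataclasses (simplified stand-ins for illustration, not the library's actual autogenerated classes):

```python
from dataclasses import dataclass

# Simplified stand-ins for illustration only; the library's real metadata
# dataclasses are autogenerated from the server code and much richer.
@dataclass
class Project:
    uri: str

@dataclass
class Specification:
    project: Project

@dataclass
class DobjMeta:
    specification: Specification

dobj_meta = DobjMeta(Specification(Project("http://example.org/project/icos")))

# dot-notation attribute access, with autocomplete and type checking,
# instead of dobj_meta["specification"]["project"]["uri"]
project_uri = dobj_meta.specification.project.uri
```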

The following code showcases the main metadata access methods.

Discover data types

# fetches the list of known data types, including metadata associated with them
all_datatypes = meta.list_datatypes()

# data types with structured data access
previewable_datatypes = [dt for dt in all_datatypes if dt.has_data_access]

Discover stations

from icoscp_core.icos import meta, ATMO_STATION

# fetch lists of stations, with basic metadata
icos_stations = meta.list_stations()
atmo_stations = meta.list_stations(ATMO_STATION)
all_known_stations = meta.list_stations(False)

# get detailed metadata for a station
htm_uri = 'http://meta.icos-cp.eu/resources/stations/AS_HTM'
htm_station_meta = meta.get_station_meta(htm_uri)
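
Because the returned entries are dataclasses, a list of them converts to plain dictionaries with dataclasses.asdict, which pandas accepts directly. The Station class below is a simplified, hypothetical stand-in for the library's station dataclass:

```python
from dataclasses import dataclass, asdict

@dataclass
class Station:  # simplified stand-in, not the library's actual class
    id: str
    name: str
    lat: float
    lon: float

stations = [
    Station("HTM", "Hyltemossa", 56.0976, 13.4189),
    Station("GAT", "Gartow", 53.0657, 11.4429),
]

# each dataclass instance flattens to a plain dict ...
rows = [asdict(s) for s in stations]
# ... and a list of such dicts can be fed to pd.DataFrame(rows)
```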

Find data objects

from icoscp_core.metaclient import TimeFilter, SizeFilter, SamplingHeightFilter

# list data objects with basic metadata
# a contrived, complicated example to demonstrate the possibilities
# all the arguments are optional
# see the Python help for the method for more details
filtered_atc_co2 = meta.list_data_objects(
	datatype = [
		"http://meta.icos-cp.eu/resources/cpmeta/atcCo2L2DataObject",
		"http://meta.icos-cp.eu/resources/cpmeta/atcCo2NrtGrowingDataObject"
	],
	station = "http://meta.icos-cp.eu/resources/stations/AS_GAT",
	filters = [
		TimeFilter("submTime", ">", "2023-07-01T12:00:00Z"),
		TimeFilter("submTime", "<", "2023-07-10T12:00:00Z"),
		SizeFilter(">", 50000),
		SamplingHeightFilter("=", 216)
	],
	include_deprecated = True,
	order_by = "fileName",
	limit = 50
)

Geospatial filtering of data objects

Similarly to TimeFilter and SizeFilter, a GeoIntersectFilter is available to filter data objects by their geospatial coverage, specifically by selecting the objects whose geo coverage intersects a region of interest, represented by a polygon. GeoIntersectFilter has a list of Points as its only constructor argument, polygon.

from icoscp_core.metaclient import Point, GeoIntersectFilter

la_reunion_co2 = meta.list_data_objects(
	datatype="http://meta.icos-cp.eu/resources/cpmeta/atcCo2Product",
	filters=[
		GeoIntersectFilter([
			Point(-21.46555, 54.90857),
			Point(-20.65176, 55.423563),
			Point(-21.408027, 56.231058)
		])
	]
)

For convenient creation of standard rectangular lat/lon bounding boxes, there is a helper method box_intersect that takes two points as arguments (the south-western and north-eastern corners of the box):

from icoscp_core.metaclient import Point, box_intersect

sydney_model_data_archives = meta.list_data_objects(
	datatype="http://meta.icos-cp.eu/resources/cpmeta/modelDataArchive",
	filters=[box_intersect(Point(-40, 145), Point(-25, 155))]
)
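
Assuming box_intersect simply expands the two corners into a four-point rectangle polygon (an assumption; this is an illustrative sketch, not the library's code), the idea can be expressed as:

```python
from dataclasses import dataclass

@dataclass
class Point:  # mirrors the (latitude, longitude) argument order used above
    lat: float
    lon: float

def rect_polygon(sw: Point, ne: Point) -> list[Point]:
    # Walk the four corners of the bounding box counter-clockwise,
    # starting from the south-western corner.
    return [sw, Point(sw.lat, ne.lon), ne, Point(ne.lat, sw.lon)]

corners = rect_polygon(Point(-40, 145), Point(-25, 155))
```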

Fetch detailed metadata for a single data object

dobj_uri = 'https://meta.icos-cp.eu/objects/BbEO5i3rDLhS_vR-eNNLjp3Q'
dobj_meta = meta.get_dobj_meta(dobj_uri)

Fetch metadata for a collection

Some data objects belong to collections, and collections can also contain other collections. Collections can be discovered in the data portal app, or from individual data object metadata (as parent collections), for example:

dobj = meta.get_dobj_meta('https://meta.icos-cp.eu/objects/hujSGCfmNIRdxtOcEvEJLxGM')
coll_uri = dobj.parentCollections[0].uri
coll_meta = meta.get_collection_meta(coll_uri)

Note

Detailed help on the available metadata access methods can be obtained from help(meta) call.

Repository-specific functionality

The majority of the library's functionality is common to all the supported data repositories. However, in some cases repository-specific reusable code may be useful. Such code is planned to be placed in corresponding packages. There is only one example of such code at the moment:

from icoscp_core.icos import station_class_lookup
htm_uri = 'http://meta.icos-cp.eu/resources/stations/AS_HTM'
htm_class = station_class_lookup()[htm_uri]

Data access

After having identified an interesting data object (or a list of objects) in the previous step, one can access their data content in a few ways. Data access is provided by an instance of the DataClient class, most easily obtained by an import:

from icoscp_core.icos import data

The following are code examples showcasing the main data access methods.

Downloading original data object content

Given basic data object metadata (or just the URI id) one can download the original data to a folder like so:

filename = data.save_to_folder(dobj_uri, '/myhome/icosdata/')

This method requires authentication, even on ICOS Jupyter instances. It works on all data objects (of all kinds, regardless of variable metadata availability).

Station-specific time series

Station-specific time series that have variable metadata associated with them enjoy a higher level of support. The variables with metadata representation (which may be only a subset of the variables present in the original data) can be efficiently accessed using this library. For single-object access, complete data object metadata is required. The output can be readily converted to a pandas DataFrame, but can also be used as is (a dictionary of numpy arrays). It is possible to explicitly limit the variables for access, and to slice the time series.

Authentication may be optional on ICOS Jupyter instances.

import pandas as pd
# get dataset columns as typed arrays, ready to be imported into pandas
dobj_arrays = data.get_columns_as_arrays(dobj_meta, ['TIMESTAMP', 'co2'])
df = pd.DataFrame(dobj_arrays)

One way to distinguish objects with structured data access is that their data types (used for filtering data objects; see the metadata access section) have a has_data_access property equal to True.

Batch data access

In many scripting scenarios, data objects are processed in batches of uniform data types. In such cases, rather than calling the get_columns_as_arrays method in a loop, it is much more efficient to use the special batch-access method, which significantly reduces the number of round trips to the HTTP metadata service and greatly speeds up the operation:

multi_dobjs = data.batch_get_columns_as_arrays(filtered_atc_co2, ['TIMESTAMP', 'co2'])

where filtered_atc_co2 is either a list from the metadata examples above or just a list of plain data object URI IDs. The returned value is a generator of pairs, where the first value is the basic data object metadata (or just a plain URI ID, depending on what was used as the argument), and the second value is the same as the return value of the get_columns_as_arrays method (a dictionary of numpy arrays, with variable names as keys).

If it is desirable to convert the data to pandas DataFrames, it can be done like so:

import pandas as pd
multi_df = ( (dobj, pd.DataFrame(arrs)) for dobj, arrs in multi_dobjs)
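
To illustrate the pair structure without a server call, the mocked data below has the same shape, with a per-object statistic computed directly from the numpy arrays:

```python
import numpy as np

# mocked return value mimicking batch_get_columns_as_arrays output:
# pairs of (data object id, dict of numpy arrays keyed by variable name);
# the ids and values are invented for illustration
multi_dobjs = [
    ("dobj_1", {"co2": np.array([410.1, 411.3, 412.0])}),
    ("dobj_2", {"co2": np.array([415.2, 414.8])}),
]

# one pass over the generator/list computes a mean per data object
mean_co2 = {dobj: arrs["co2"].mean() for dobj, arrs in multi_dobjs}
```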

CSV representation access

The data server offers (partial) CSV representations for fully-supported time series datasets. That service can be used from this library as follows:

import pandas as pd
csv_stream = data.get_csv_byte_stream(dobj_uri)
df = pd.read_csv(csv_stream)

but using get_columns_as_arrays and batch_get_columns_as_arrays is preferable for performance reasons, especially on ICOS Jupyter instances. Authentication is always required for this method.
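
Assuming get_csv_byte_stream returns a binary file-like object (which is what pd.read_csv accepts above), the stream handling can be sketched with the stdlib csv module and mocked data, without a server call:

```python
import csv
import io

# mock byte stream standing in for data.get_csv_byte_stream(dobj_uri);
# the CSV content is invented for illustration
csv_stream = io.BytesIO(b"TIMESTAMP,co2\n2023-07-01T00:00:00Z,411.3\n")

# decode the bytes to text and parse row by row;
# pandas' read_csv does the same job in a single call
reader = csv.DictReader(io.TextIOWrapper(csv_stream, encoding="utf-8"))
rows = list(reader)
```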

Advanced metadata access (SPARQL)

For general metadata enquiries not offered explicitly by the API, it is often possible to design a SPARQL query that provides the required information. The query can be run with the sparql_select method of MetadataClient, and its output can be parsed using the "as_<rdf_datatype>"-named methods in the icoscp_core.sparql module. For example:

from icoscp_core.icos import meta
from icoscp_core.sparql import as_string, as_uri

query = """prefix cpmeta: <http://meta.icos-cp.eu/ontologies/cpmeta/>
	select *
	from <http://meta.icos-cp.eu/documents/>
	where{
		?doc a cpmeta:DocumentObject .
		FILTER NOT EXISTS {[] cpmeta:isNextVersionOf ?doc}
		?doc cpmeta:hasDoi ?doi .
		?doc cpmeta:hasName ?filename .
	}"""
latest_docs_with_dois = [
	{
		"uri": as_uri("doc", row),
		"filename": as_string("filename", row),
		"doi": as_string("doi", row)
	}
	for row in meta.sparql_select(query).bindings
]
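
As an illustration of what such parsing helpers do, the sketch below reads values from one row in the standard SPARQL 1.1 JSON results format (a hypothetical reimplementation; the library's as_string/as_uri helpers additionally validate the binding types):

```python
# one binding row in the standard SPARQL 1.1 JSON results format
# (the values are invented for illustration)
row = {
    "doc": {"type": "uri", "value": "https://meta.icos-cp.eu/objects/abc"},
    "filename": {"type": "literal", "value": "report.pdf"},
}

def binding_value(var: str, row: dict) -> str:
    # Each bound variable maps to a dict with "type" and "value" keys;
    # the value itself is always transported as a string.
    return row[var]["value"]

doc_uri = binding_value("doc", row)
```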

