icoscp_core
A foundational ICOS Carbon Portal (CP) core products Python library for metadata and data access, designed to work with multiple data repositories that use the ICOS Carbon Portal core server software stack to host and serve their data. At the moment, three repositories are supported: ICOS, SITES, and ICOS Cities.
Design goals
- offer basic functionality with good performance
- good alignment with the server APIs
- minimise dependencies (only depend on numpy and dacite)
- aim for good integration with pandas without depending on this package
- provide a solid foundation for future versions of icoscp, an ICOS-specific meta- and data access library developed by the CP Elaborated Products team
- extensive use of type annotations and Python data classes, to safeguard against preventable bugs, both in the library itself, and in the tools and apps written on top of it; a goal is to satisfy the typechecker in strict mode
- usage of autogenerated data classes produced from Scala back end code representing various metadata entities (e.g. data objects, stations) and their parts
- simultaneous support for three cross-cutting concerns:
- multiple repositories (ICOS, SITES, ICOS Cities)
- multiple ways of authentication
- data access through the HTTP API (on an arbitrary machine) and through file system (on a Jupyter notebook with "backdoor" data access); in the latter case the library is responsible for reporting the data usage event.
Getting started
The library is available on PyPI and can be installed with pip:
$ pip install icoscp_core
The code examples below are usually provided for ICOS. For the other Repositories (SITES or ICOS Cities), use icoscp_core.sites or icoscp_core.cities, respectively, instead of icoscp_core.icos in the import directives.
Authentication
Metadata access does not require authentication, and is achieved by a simple import:
from icoscp_core.icos import meta
When using the library on an accordingly configured Jupyter notebook service hosted by the ICOS Carbon Portal, authentication is not required for two of the data access methods, get_columns_as_arrays and batch_get_columns_as_arrays, available on the data object imported from the icoscp_core.icos package. In all other cases (other data access methods, code running outside the ICOS Jupyter environment, or a Jupyter environment that has not been provisioned with filesystem data access to your Repository), data access requires authentication.
Authentication can be initialized in a number of ways.
Credentials and token cache file (default)
This approach should only be used on machines the developer trusts.
A username/password account with the respective authentication service (links for: ICOS, SITES, ICOS Cities) is required for this. An obfuscated (not readable by humans) password is stored in a file on the local machine in a default user-specific folder. To initialize this file, run the following code interactively (this only needs to be done once for every machine):
from icoscp_core.icos import auth
auth.init_config_file()
After the initialization step is done, access to the metadata and data services is achieved by a simple import:
from icoscp_core.icos import meta, data
As an alternative, the developer may choose to use a specific file to store the credentials and token cache. In this scenario, the data service needs to be initialized as follows:
from icoscp_core.icos import bootstrap
auth, meta, data = bootstrap.fromPasswordFile("<desired path to the file>")
# the next line needs to be run interactively (only once per file)
auth.init_config_file()
Static authentication token (prototyping)
This option is good for testing, on a public machine or in general. Its only disadvantage is that the tokens have a limited period of validity (100000 seconds, less than 28 hours), but this is precisely what makes it acceptable to include them directly in the Python source code.
The token can be obtained from the "My Account" page (links for: ICOS, SITES, ICOS Cities), which can be accessed by logging in using one of the supported authentication mechanisms (username/password, university sign-in, OAuth sign-in). After this, the bootstrapping can be done as follows:
from icoscp_core.icos import bootstrap
cookie_token = 'cpauthToken=WzE2OTY2NzQ5OD...'
meta, data = bootstrap.fromCookieToken(cookie_token)
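The validity period quoted above can be turned into an expiry time with a few lines of standard-library Python. This is only a sketch of the arithmetic: the function name token_expiry and the example timestamp are made up for illustration and are not part of the library.

```python
from datetime import datetime, timedelta, timezone

# Validity period stated in the text above
TOKEN_VALIDITY = timedelta(seconds=100_000)


def token_expiry(obtained_at: datetime) -> datetime:
    """Return the moment a token obtained at `obtained_at` stops being valid."""
    return obtained_at + TOKEN_VALIDITY


# 100000 s / 3600 s per hour, i.e. a little under 28 hours
validity_hours = TOKEN_VALIDITY.total_seconds() / 3600

# made-up moment at which the token was copied from the "My Account" page
obtained = datetime(2023, 10, 7, 9, 0, tzinfo=timezone.utc)
expires = token_expiry(obtained)
```

A script that embeds a token could compare the current time with such an expiry value and ask for a fresh token instead of failing in the middle of a download.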
Explicit credentials (advanced option)
The user may choose to use their own mechanism of providing the credentials to initialize the authentication. This should be considered as an advanced option. (Please do not put your password as clear text in your Python code!) This can be achieved as follows:
from icoscp_core.icos import bootstrap
meta, data = bootstrap.fromCredentials(username_variable, password_containing_variable)
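One possible way of keeping the password out of the source code is to read it from environment variables. The sketch below assumes two hypothetical variable names, CP_USERNAME and CP_PASSWORD, chosen for this example only; the helper functions are likewise illustrative and not provided by the library. Only bootstrap.fromCredentials comes from the text above.

```python
import os
from typing import Mapping, Tuple

# Hypothetical variable names, chosen for this example only
USER_VAR = "CP_USERNAME"
PASS_VAR = "CP_PASSWORD"


def read_credentials(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Return (username, password) from `env`, failing loudly if either is missing."""
    missing = [name for name in (USER_VAR, PASS_VAR) if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return env[USER_VAR], env[PASS_VAR]


def bootstrap_from_env():
    """Initialise the clients with credentials taken from the environment."""
    # imported here so that read_credentials can be used without the library installed
    from icoscp_core.icos import bootstrap
    username, password = read_credentials()
    return bootstrap.fromCredentials(username, password)
```

With the variables set in the shell, meta, data = bootstrap_from_env() would then replace the explicit call shown above.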
Metadata access
Metadata access requires no authentication, and is performed using an instance of the MetadataClient class, easily obtainable through an import:
from icoscp_core.icos import meta
An important piece of background information is that all the metadata-represented entities (data objects, data types, documents, collections, measurement stations, people, etc.) are identified by URIs. The metadata-access methods usually accept these URIs as input arguments, and the returned values tend to be instances of Python dataclasses, which brings:
- better syntax in comparison with generic dictionaries (dot-notation attribute access instead of dictionary value access, for example dobj_meta.specification.project.self.uri instead of dobj_meta["specification"]["project"]["self"]["uri"])
- autocomplete of the dataclass attributes (works even in Jupyter notebooks)
- type checking, when developing with type annotations and a type checker (typically available from an IDE, but not from Jupyter)
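The difference between the two access styles can be illustrated without the library. The classes below are hypothetical, heavily simplified stand-ins written for this example; the real autogenerated dataclasses have many more attributes and different shapes.

```python
from dataclasses import asdict, dataclass


# Hypothetical, simplified stand-ins for the library's metadata dataclasses
@dataclass(frozen=True)
class Project:
    uri: str


@dataclass(frozen=True)
class Specification:
    project: Project


@dataclass(frozen=True)
class DataObject:
    specification: Specification


dobj = DataObject(Specification(Project("http://example.org/projects/demo")))

# attribute access: understood by type checkers and offered by autocomplete
via_attributes = dobj.specification.project.uri

# the same value through generic dictionaries: keys are plain strings,
# so a misspelled key is only detected at run time as a KeyError
as_dict = asdict(dobj)
via_keys = as_dict["specification"]["project"]["uri"]
```

A misspelled attribute such as dobj.specification.projekt is reported by a type checker before the code runs, whereas the dictionary variant gives no such warning.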
The following code showcases the main metadata access methods.
Discover data types
# fetches the list of known data types, including metadata associated with them
all_datatypes = meta.list_datatypes()
# data types with structured data access
previewable_datatypes = [dt for dt in all_datatypes if dt.has_data_access]
Discover stations
from icoscp_core.icos import meta, ATMO_STATION
# fetch lists of stations, with basic metadata
icos_stations = meta.list_stations()
atmo_stations = meta.list_stations(ATMO_STATION)
all_known_stations = meta.list_stations(False)
# get detailed metadata for a station
htm_uri = 'http://meta.icos-cp.eu/resources/stations/AS_HTM'
htm_station_meta = meta.get_station_meta(htm_uri)
Find data objects
from icoscp_core.metaclient import TimeFilter, SizeFilter, SamplingHeightFilter
# list data objects with basic metadata
# a contrived, complicated example to demonstrate the possibilities
# all the arguments are optional
# see the Python help for the method for more details
filtered_atc_co2 = meta.list_data_objects(
    datatype = [
        "http://meta.icos-cp.eu/resources/cpmeta/atcCo2L2DataObject",
        "http://meta.icos-cp.eu/resources/cpmeta/atcCo2NrtGrowingDataObject"
    ],
    station = "http://meta.icos-cp.eu/resources/stations/AS_GAT",
    filters = [
        TimeFilter("submTime", ">", "2023-07-01T12:00:00Z"),
        TimeFilter("submTime", "<", "2023-07-10T12:00:00Z"),
        SizeFilter(">", 50000),
        SamplingHeightFilter("=", 216)
    ],
    include_deprecated = True,
    order_by = "fileName",
    limit = 50
)
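The time filters in the example above take timestamps as strings in the form 2023-07-01T12:00:00Z. When the time window is computed rather than typed by hand, such strings can be produced with the standard library. The helper name to_utc_iso is made up for this sketch and is not part of the library.

```python
from datetime import datetime, timedelta, timezone


def to_utc_iso(moment: datetime) -> str:
    """Format a timezone-aware datetime as e.g. '2023-07-01T12:00:00Z'."""
    if moment.tzinfo is None:
        raise ValueError("a timezone-aware datetime is required")
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# a nine-day window starting at noon UTC on 1 July 2023
start = datetime(2023, 7, 1, 12, 0, tzinfo=timezone.utc)
end = start + timedelta(days=9)

start_iso = to_utc_iso(start)
end_iso = to_utc_iso(end)
```

The resulting strings can then be used in place of the literals, for example TimeFilter("submTime", ">", start_iso).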
Geospatial filtering of data objects
Similarly to TimeFilter and SizeFilter, GeoIntersectFilter is available to filter the data objects by their geospatial coverage. Its only constructor argument, polygon, is a list of Point instances. For convenient creation of standard rectangular lat/lon bounding boxes, there is a helper method:
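from icoscp_core.icos import meta
# absolute imports: the relative form (from .queries...) only works inside the package itself
# Point is assumed here to be importable from the same module as box_intersect
from icoscp_core.queries.dataobjlist import box_intersect, Point
from icoscp_core.metaclient import GeoIntersectFilter
australian_model_archives = meta.list_data_objects(
    datatype="http://meta.icos-cp.eu/resources/cpmeta/modelDataArchive",
    filters=[box_intersect(Point(-40, 145), Point(-25, 155))]
)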
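To make the relation between a bounding box and a polygon concrete, the following sketch expands two opposite corners into the four corners of a rectangle. It is not the library's implementation: LatLon is a hypothetical stand-in for Point, and the latitude-first argument order is an assumption inferred from the Australian example above.

```python
from typing import List, NamedTuple


class LatLon(NamedTuple):
    """Hypothetical stand-in for the library's Point, assuming latitude comes first."""
    lat: float
    lon: float


def box_corners(corner1: LatLon, corner2: LatLon) -> List[LatLon]:
    """Return the four corners of the lat/lon rectangle spanned by two opposite corners."""
    south, north = sorted((corner1.lat, corner2.lat))
    west, east = sorted((corner1.lon, corner2.lon))
    # listed counter-clockwise, starting from the south-west corner
    return [
        LatLon(south, west),
        LatLon(south, east),
        LatLon(north, east),
        LatLon(north, west),
    ]


polygon = box_corners(LatLon(-40, 145), LatLon(-25, 155))
```

This simple version does not handle boxes that cross the antimeridian (longitude 180), where west and east cannot be found by sorting.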
Fetch detailed metadata for a single data object
dobj_uri = 'https://meta.icos-cp.eu/objects/BbEO5i3rDLhS_vR-eNNLjp3Q'
dobj_meta = meta.get_dobj_meta(dobj_uri)
Detailed help on the available metadata access methods can be obtained by calling help(meta).
Repository-specific functionality
The majority of functionality of the library is common to all the supported data Repositories. However, in some cases Repository-specific reusable code may be useful. Such code is planned to be placed into corresponding packages. There is only one example of such code at the moment:
from icoscp_core.icos import station_class_lookup
htm_uri = 'http://meta.icos-cp.eu/resources/stations/AS_HTM'
htm_class = station_class_lookup()[htm_uri]
Data access
After having identified an interesting data object or a list of objects in the previous step, one can access their data content in a few ways. Data access is provided by an instance of the DataClient class, most easily obtained by import:
from icoscp_core.icos import data
The following are code examples showcasing the main data access methods.
Downloading original data object content
Given basic data object metadata (or just the URI id) one can download the original data to a folder like so:
filename = data.save_to_folder(dobj_uri, '/myhome/icosdata/')
The method requires authentication, even on ICOS Jupyter instances. It works on all data objects (all kinds, and regardless of variable metadata availability).
Station-specific time series
Station-specific time series that have variable metadata associated with them enjoy a higher level of support. The variables with metadata representation (which may be only a subset of the variables present in the original data) can be efficiently accessed using this library. For single-object access, complete data object metadata is required. The output can be readily converted to a pandas DataFrame, but can also be used as is (a dictionary of numpy arrays). It is possible to explicitly limit the variables to access, and to slice the time series.
Authentication may be optional on ICOS Jupyter instances.
import pandas as pd
# get dataset columns as typed arrays, ready to be imported into pandas
dobj_arrays = data.get_columns_as_arrays(dobj_meta, ['TIMESTAMP', 'co2'])
df = pd.DataFrame(dobj_arrays)
One way to distinguish the objects with structured data access is that their data types (used for filtering the data objects, see the metadata access section) have the has_data_access property equal to True.
Batch data access
In many scripting scenarios, data objects are processed in batches of uniform data types. In this case, rather than using the get_columns_as_arrays method in a loop, it is much more efficient to use a special batch-access method. This will significantly reduce the number of round trips to the HTTP metadata service, greatly speeding up the operation:
multi_dobjs = data.batch_get_columns_as_arrays(filtered_atc_co2, ['TIMESTAMP', 'co2'])
where filtered_atc_co2 is either a list from the metadata examples above, or just a list of plain data object URI IDs. The returned value is a generator of pairs, where the first value is the basic data object metadata (or just a plain URI ID, depending on what was used as the argument), and the second value is the same as the return value of the get_columns_as_arrays method (a dictionary of numpy arrays, with variable names as keys).
If it is desirable to convert the data to pandas DataFrames, it can be done like so:
import pandas as pd
multi_df = ( (dobj, pd.DataFrame(arrs)) for dobj, arrs in multi_dobjs)
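Because the batch result is a generator, it can be iterated only once. The sketch below illustrates this with a hypothetical stand-in generator, fake_batch, that yields (URI, columns) pairs without contacting any server; plain Python lists take the place of the numpy arrays returned by the library, and all URIs and values are made up.

```python
from typing import Dict, Iterator, List, Tuple

Columns = Dict[str, List[float]]


def fake_batch(uris: List[str]) -> Iterator[Tuple[str, Columns]]:
    """Hypothetical stand-in that lazily yields (URI, columns) pairs."""
    for i, uri in enumerate(uris):
        yield uri, {"TIMESTAMP": [0.0, 3600.0], "co2": [410.0 + i, 411.0 + i]}


uris = ["https://example.org/objects/a", "https://example.org/objects/b"]
pairs = fake_batch(uris)

# materialise the pairs if the results are needed more than once
columns_by_uri = dict(pairs)

# a second pass over the same generator yields nothing
leftover = list(pairs)

# per-object processing, here a simple mean of the co2 column
mean_co2 = {
    uri: sum(cols["co2"]) / len(cols["co2"])
    for uri, cols in columns_by_uri.items()
}
```

For large batches it may be preferable to process each pair inside a single for loop instead of materialising everything, so that only one object's data is held in memory at a time.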
CSV representation access
The data server offers (partial) CSV representations for fully-supported time series datasets. That service can be used from this library as follows:
import pandas as pd
csv_stream = data.get_csv_byte_stream(dobj_uri)
df = pd.read_csv(csv_stream)
but using get_columns_as_arrays and batch_get_columns_as_arrays is to be preferred for performance reasons, especially on ICOS Jupyter instances. Authentication is always required to use this method.
Advanced metadata access (SPARQL)
For general metadata enquiries not offered by the API explicitly, it is often possible to design a SPARQL query that would provide the required information. The query can be run with the sparql_select method of MetadataClient, and the output of the latter can be parsed using "as_<rdf_datatype>"-named methods in the icoscp_core.sparql module. For example:
from icoscp_core.icos import meta
from icoscp_core.sparql import as_string, as_uri
query = """prefix cpmeta: <http://meta.icos-cp.eu/ontologies/cpmeta/>
select *
from <http://meta.icos-cp.eu/documents/>
where{
?doc a cpmeta:DocumentObject .
FILTER NOT EXISTS {[] cpmeta:isNextVersionOf ?doc}
?doc cpmeta:hasDoi ?doi .
?doc cpmeta:hasName ?filename .
}"""
latest_docs_with_dois = [
    {
        "uri": as_uri("doc", row),
        "filename": as_string("filename", row),
        "doi": as_string("doi", row)
    }
    for row in meta.sparql_select(query).bindings
]
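For background, SPARQL endpoints commonly return results in the W3C "SPARQL 1.1 Query Results JSON Format", in which each binding maps a variable name to an object with a type and a value. The sketch below parses a hand-written response of that shape using only the standard library. It does not use icoscp_core, and how the library represents bindings internally is not covered here; the function plain_rows and all values are made up for illustration.

```python
import json
from typing import Dict, List

# A hand-written response in the standard JSON results format;
# the URI, file name and DOI are made up for this example
raw = """{
  "head": {"vars": ["doc", "filename", "doi"]},
  "results": {"bindings": [
    {"doc": {"type": "uri", "value": "https://example.org/objects/abc"},
     "filename": {"type": "literal", "value": "report.pdf"},
     "doi": {"type": "literal", "value": "10.1234/demo"}}
  ]}
}"""


def plain_rows(response: dict) -> List[Dict[str, str]]:
    """Reduce each binding to a {variable: value} dictionary, skipping unbound variables."""
    variables = response["head"]["vars"]
    return [
        {var: row[var]["value"] for var in variables if var in row}
        for row in response["results"]["bindings"]
    ]


rows = plain_rows(json.loads(raw))
```

Unlike this sketch, which treats every value as a string, typed helpers such as as_uri and as_string in the example above make the expected kind of each value explicit.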