Helper to download and subset sparse data that have been Arcoified and are available through STAC and SQLite-formatted data.
arcosparse: A Python library for ARCO sparse datasets subsetting
Disclaimer
It is not recommended to use the arcosparse library directly.
Instead, if you want to work with sparse datasets, use the copernicusmarine Toolbox or tools like earthkit.
Issues on the repository are welcome and we will do our best to answer them.
Usage
[!WARNING] This library is still in development. Breaking changes might be introduced from version 0.y.z to 0.y+1.z.
Main functions
arcosparse.subset_and_return_dataframe
Subset the data based on the input and return a dataframe.
arcosparse.subset_and_save
Subset the data based on the input and save the result as a partitioned Parquet dataset.
The data is saved in a single folder that contains many small Parquet files. You can still open all the data at once.
To open the data into a dataframe, use this snippet:
import glob

import pandas as pd

output_path = "some_folder"
# Get all partitioned Parquet files
parquet_files = glob.glob(f"{output_path}/*.parquet")
# Read all files into a single dataframe
df = pd.concat(pd.read_parquet(file) for file in parquet_files)
arcosparse.get_entities
A function to get the metadata about the entities that are available in the dataset. Since all the information is retrieved from the metadata, the argument is the url_metadata, the same used for the subset.
Returns a list of arcosparse.Entity. It contains information about the entities available in the dataset:
- entity_id: same as the entity_id column in the result of a subset.
- entity_type: same as the entity_type column in the result of a subset.
- doi: the DOI of the entity.
- institution: the institution associated with the entity.
- institution_edmo_code: the EDMO code of the institution associated with the entity.
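As a rough illustration of working with the returned list, the sketch below groups entity IDs by entity type. To stay runnable without network access it uses a local stand-in dataclass that mirrors the documented fields. The field types and the sample values are assumptions. With the real library the list would come from arcosparse.get_entities(url_metadata).

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entity:
    """Local stand-in mirroring the documented fields of arcosparse.Entity.

    The field types are assumptions; only the field names come from the docs.
    """

    entity_id: str
    entity_type: str
    doi: Optional[str]
    institution: Optional[str]
    institution_edmo_code: Optional[str]


# Made-up sample values. With the real library this list would come from
# arcosparse.get_entities(url_metadata).
entities = [
    Entity("A1", "drifter", None, "Institute A", "1000"),
    Entity("B2", "mooring", "10.0000/example", "Institute B", "2000"),
    Entity("C3", "drifter", None, "Institute A", "1000"),
]

# Group the entity IDs by entity type, preserving input order.
ids_by_type = defaultdict(list)
for entity in entities:
    ids_by_type[entity.entity_type].append(entity.entity_id)

print(dict(ids_by_type))
```

Because entity_id and entity_type match the columns of a subset result, a grouping like this can be used to pick the IDs to look for in the returned dataframe.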
arcosparse.get_dataset_metadata
A function to get the metadata about the dataset. Since all the information is retrieved from the metadata, the argument is the url_metadata, the same used for the subset.
Returns an object arcosparse.Dataset. It contains information about the dataset:
- dataset_id: the ID of the dataset.
- variables: a list of the names of the variables available in the dataset.
- assets: a list of the names of the assets available in the dataset.
- coordinates: a list of arcosparse.DatasetCoordinate objects. Each object contains the following information:
  - coordinate_id: the ID of the coordinate.
  - unit: the unit of the coordinate.
  - minimum: the minimum value of the coordinate.
  - maximum: the maximum value of the coordinate.
  - step: the step of the coordinate.
  - values: the values of the coordinate.
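One possible use of the coordinate metadata is to clip a requested range to the extent the dataset advertises before subsetting. The sketch below shows this with a local stand-in dataclass so that it runs offline. The field types, the sample values, and the helper clamp_range are assumptions for illustration, and the sketch assumes minimum and maximum are populated.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class DatasetCoordinate:
    """Local stand-in mirroring the documented fields of
    arcosparse.DatasetCoordinate. The field types are assumptions."""

    coordinate_id: str
    unit: str
    minimum: Optional[float]
    maximum: Optional[float]
    step: Optional[float]
    values: Optional[List[float]]


def clamp_range(
    coordinate: DatasetCoordinate, requested_min: float, requested_max: float
) -> Optional[Tuple[float, float]]:
    """Hypothetical helper: clip a requested range to the coordinate extent.

    Returns None when the requested range does not overlap the extent.
    Assumes coordinate.minimum and coordinate.maximum are set.
    """
    low = max(requested_min, coordinate.minimum)
    high = min(requested_max, coordinate.maximum)
    return (low, high) if low <= high else None


# Made-up sample coordinate for illustration.
latitude = DatasetCoordinate("latitude", "degrees_north", -90.0, 90.0, None, None)

print(clamp_range(latitude, 80.0, 120.0))   # partially inside the extent
print(clamp_range(latitude, 100.0, 120.0))  # entirely outside the extent
```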
Authentication
You may need to authenticate to access some datasets, particularly when working with ECMWF data.
To do so, use the user_configuration argument, which accepts an arcosparse.UserConfiguration instance containing the following fields:
- auth_token: the token used to authenticate requests. It is passed as the Authorization: Bearer {auth_token} header.
Example:
import arcosparse

user_configuration = arcosparse.UserConfiguration(
    auth_token="my_token"
)

df = arcosparse.subset_and_return_dataframe(
    url_metadata="https://example.com/metadata.json",
    minimum_latitude=10,
    maximum_latitude=20,
    minimum_longitude=30,
    maximum_longitude=40,
    minimum_time="2020-01-01T00:00:00Z",
    maximum_time="2020-12-31T23:59:59Z",
    minimum_elevation=0,
    maximum_elevation=1000,
    variables=["temperature", "precipitation"],
    user_configuration=user_configuration,
)
Note that STAC catalogues are typically public, so arcosparse will request the catalogue without authentication. However, any asset links found within the catalogue will be authenticated using the token provided in auth_token, if one is supplied.
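To make the header behaviour concrete, here is a minimal sketch of how a bearer token ends up in a request, written with the standard library only. It is not arcosparse's internal code: the helper build_request and the asset URL are hypothetical, and no request is actually sent.

```python
import urllib.request
from typing import Optional


def build_request(url: str, auth_token: Optional[str] = None) -> urllib.request.Request:
    """Hypothetical helper: attach a bearer token only when one is given."""
    headers = {}
    if auth_token:
        headers["Authorization"] = f"Bearer {auth_token}"
    return urllib.request.Request(url, headers=headers)


# Catalogue request: no token, so no Authorization header is set.
catalogue_request = build_request("https://example.com/metadata.json")

# Asset request: the token is sent as "Authorization: Bearer my_token".
# The asset URL is made up for illustration.
asset_request = build_request("https://example.com/asset.sqlite", "my_token")

print(catalogue_request.get_header("Authorization"))
print(asset_request.get_header("Authorization"))
```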
Changelog
0.5.1
0.5.1: New features
- Add some metadata retrieved about platforms in the arcosparse.Entity object. It now contains the institution_edmo_code associated with the entity.
0.5.0
0.5.0: Breaking Changes
- Deleted the disable_progress_bar argument in the functions subset_and_return_dataframe and subset_and_save. Use progress_bar_configuration={"disable": True} instead.
0.5.0: New features
- pandas>=3 is now available.
- Add a way to handle metadata in chunks. Now capable of reading overflow chunks.
- Change license to EUPL-1.2.
- Can authenticate the requests to the assets with a token provided in auth_token in user_configuration. It is passed as the Authorization: Bearer {auth_token} header. See the "Authentication" section in the doc for more details.
- arcosparse went public. The repository is now open.
0.4.2
0.4.2: Bug fixes
- Fix a bug where dates in the metadata like "2025-06-25T07:43:54.514180Z" could not be parsed and raised an error. It now uses dateutil.parser to parse the date strings correctly.
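As a quick check of what this parsing looks like, the sketch below feeds the timestamp quoted in this entry to dateutil.parser.parse. How arcosparse calls the parser internally is not shown in this document, so treat this as an illustration of the dateutil call only. It requires the python-dateutil package.

```python
from dateutil import parser

# The timestamp quoted in the changelog entry: fractional seconds plus a
# trailing "Z" that marks UTC.
parsed = parser.parse("2025-06-25T07:43:54.514180Z")

print(parsed.isoformat())
print(parsed.microsecond)
print(parsed.utcoffset())
```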
0.4.1
0.4.1: New features
- Added function get_dataset_metadata. It returns an arcosparse.Dataset object.
0.4.0
Breaking Changes
- Deleted function get_entities_ids. Use get_entities as a replacement. Example:
# old code
my_entities = get_entities_ids(url_metadata)
# new code
my_entities = [entity.entity_id for entity in get_entities(url_metadata)]
New features
- Added function get_entities. It returns a list of Entity objects.
Bug fixes
- Fix a bug where arcosparse would modify the dict that users pass in the columns_rename argument. It now makes a deep copy of the dict and modifies only the copy.
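The defensive-copy pattern described in this fix can be sketched as follows. The helper apply_renames and the column names are hypothetical and only illustrate the idea. This is not the library's actual code.

```python
import copy
from typing import Dict, List


def apply_renames(column_names: List[str], columns_rename: Dict[str, str]) -> List[str]:
    """Hypothetical helper: rename columns without touching the caller's dict."""
    # Work on a private copy so internal changes never leak to the caller.
    renames = copy.deepcopy(columns_rename)
    # Any internal mutation, such as filling in a default, stays local.
    renames.setdefault("value", "value")
    return [renames.get(name, name) for name in column_names]


user_renames = {"entity_id": "platform"}
result = apply_renames(["entity_id", "value"], user_renames)

print(result)
print(user_renames)  # unchanged by the call
```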
0.3.5
- Return all the columns even if full of NaNs.
0.3.4
- Deleted the deprecated get_platforms_names function.
- Fix an issue where the query on the chunk would not be correct if the requested subset is 0.
0.3.3
- Add GPLv3 license
0.3.2
- Fixes an issue on Windows where deleting a file is not permitted if the SQL connection is not closed explicitly.
0.3.1
- Reindex when concatenating. Fixes an issue where indexes would not be unique.
- Fixes an issue on Windows where datetime.to_timestamp does not support dates before 1970-1-1 (i.e. negative values for timestamps).
- Fixes an issue on Windows where a temporary sqlite file cannot be opened while it's already open in the process.
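For the pre-1970 timestamp issue listed above, one common platform-independent approach is to compute epoch seconds with plain datetime arithmetic instead of relying on the operating system's time functions. The sketch below shows that approach. It is not necessarily the fix applied in arcosparse, and both helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_seconds(moment: datetime) -> float:
    """Hypothetical helper: seconds since 1970-01-01 UTC, negative before it."""
    return (moment - EPOCH) / timedelta(seconds=1)


def from_epoch_seconds(seconds: float) -> datetime:
    """Hypothetical helper: the inverse conversion, also valid for negatives."""
    return EPOCH + timedelta(seconds=seconds)


day_before_epoch = datetime(1969, 12, 31, tzinfo=timezone.utc)

print(to_epoch_seconds(day_before_epoch))
print(from_epoch_seconds(-86400.0))
```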
0.3.0
- Change columns output: from "platform_id" to "entity_id" and from "platform_type" to "entity_type".
- Document the expected column names in the doc of the functions.
- Add columns_rename argument to subset_and_return_dataframe and subset_and_save to be able to choose the names of the columns in the output.