datacite-websnap
CLI tool that bulk exports DataCite metadata records for a specific repository to an S3 bucket.
Also supports exporting repository records to a local machine.
Purpose
datacite-websnap was developed to facilitate interoperability between the data platforms of the ETH research institutions in Switzerland.
datacite-websnap empowers research institutions to share their DataCite metadata records by exporting the records to publicly accessible S3 cloud storage.
Installation
pip install datacite-websnap
Terminal Documentation
To access CLI documentation:
datacite-websnap --help
To access more detailed documentation for the export command:
datacite-websnap export --help
CLI Options
Command: export
Bulk export DataCite XML metadata records that correspond to the records for a particular DataCite repository and/or DOI prefix.
The default behavior is to export DataCite XML records to an S3 bucket, but the command also supports exporting the records to a local machine.
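| Option | Default | Description |
|---|---|---|
| --doi-prefix | None | DOI prefix used to select records. Accepts single or multiple prefix arguments. |
| --client-id | None | DataCite repository account ID used to select records. |
| --destination | S3 | Export destination: an S3 bucket (default) or local. |
| --bucket | None | Existing S3 bucket that the records are written to. |
| --key-prefix | None | Key prefix for the exported S3 objects. |
| --directory-path | None | Existing local directory that the records are written to (used with --destination local). |
| --file-logs | False | Enables logging to a file. |
| --log-level | INFO | Logging level. |
| --early-exit | False | |
| --api-url | https://api.datacite.org | DataCite base URL used for API requests. |
| --page-size | 250 | Number of DOIs retrieved per page using pagination. |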
DataCite Filters
Repository account ID and DOI prefix are the supported filters used to select DataCite records that will be exported.
The filters can be applied for both S3 bucket and local machine usage.
Repository Account ID
Please note that applying this filter will bulk export ALL records for the specified repository account ID!
Repositories with records on DataCite each have their own DataCite repository account ID.
To confirm you have the correct repository ID you can call the DataCite API client endpoint.
If you do not know the repository ID but do know a specific DOI that belongs to the repository:
- Navigate to DataCite Commons
- Enter the DOI in the search box. For example: 10.16904/envidat.576
- Click on the record and then click "Download Metadata", select "DataCite JSON"
- The repository account ID is the value for "clientId". For DOI 10.16904/envidat.576 the "clientId" value is "ethz.wsl".
Example usage as a command line argument: --client-id ethz.wsl
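As a rough illustration of confirming a repository ID against the DataCite API client endpoint, the sketch below builds the endpoint URL from the base URL and the /clients endpoint. The helper names `client_url` and `fetch_client_name` are hypothetical and not part of datacite-websnap, and the response layout (`data.attributes.name`) is an assumption about the API's JSON:API format.

```python
import json
from urllib.request import urlopen

DATACITE_API_URL = "https://api.datacite.org"
DATACITE_API_CLIENTS_ENDPOINT = "/clients"


def client_url(client_id: str, api_url: str = DATACITE_API_URL) -> str:
    """Build the URL of the DataCite API client endpoint for one repository."""
    return f"{api_url.rstrip('/')}{DATACITE_API_CLIENTS_ENDPOINT}/{client_id}"


def fetch_client_name(client_id: str, timeout: int = 32) -> str:
    """Request the client record and return its name (needs network access)."""
    with urlopen(client_url(client_id), timeout=timeout) as response:
        payload = json.load(response)
    # Assumption: the name is stored under data.attributes in the response.
    return payload["data"]["attributes"]["name"]


print(client_url("ethz.wsl"))
# prints: https://api.datacite.org/clients/ethz.wsl
```

Calling `fetch_client_name("ethz.wsl")` performs the actual request; `urlopen` raises an `HTTPError` if the API does not recognize the ID.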
DOI Prefix
Please note that applying this filter will bulk export ALL records for the specified DOI prefix!
Records can also be exported by their DOI prefix.
The --doi-prefix argument accepts single or multiple prefix arguments.
Example usage as a command line argument: --doi-prefix 10.16904 --doi-prefix 10.25678
It can also be combined with the --client-id argument.
Usage: S3 Bucket
Utilizes the AWS SDK for Python (Boto3) to export DataCite XML metadata records for a specific repository and/or DOI prefix as objects in an S3 bucket.
Environment Variables
The environment variables listed below are required to export records to an S3 bucket.
| Environment Variable | Description |
|---|---|
| ENDPOINT_URL | URL to use for the constructed S3 client |
| AWS_ACCESS_KEY_ID | AWS access key ID |
| AWS_SECRET_ACCESS_KEY | AWS secret access key |
Supports setting environment variables in a .env file.
The .env file must be located in the directory where the CLI is being executed.
For example, if you are running the program from my-drive/cli-tools/datacite-websnap then the .env file must be in that directory.
Example .env file:
ENDPOINT_URL=https://dreamycloud.com
AWS_ACCESS_KEY_ID=1234567abcdefg
AWS_SECRET_ACCESS_KEY=hijklmn1234567
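To show how these three variables could end up in an S3 client, here is a minimal sketch. It is not the tool's own code: `parse_env_text` is a simplified stand-in for a real .env loader and `s3_client_kwargs` is a hypothetical helper. The keyword names `endpoint_url`, `aws_access_key_id` and `aws_secret_access_key` are the ones accepted by `boto3.client`.

```python
REQUIRED_VARS = ("ENDPOINT_URL", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def parse_env_text(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        values[key.strip()] = value.strip()
    return values


def s3_client_kwargs(env: dict) -> dict:
    """Map the required variables to boto3 client keyword arguments."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise ValueError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "endpoint_url": env["ENDPOINT_URL"],
        "aws_access_key_id": env["AWS_ACCESS_KEY_ID"],
        "aws_secret_access_key": env["AWS_SECRET_ACCESS_KEY"],
    }


example_env = """
ENDPOINT_URL=https://dreamycloud.com
AWS_ACCESS_KEY_ID=1234567abcdefg
AWS_SECRET_ACCESS_KEY=hijklmn1234567
"""
kwargs = s3_client_kwargs(parse_env_text(example_env))
# With boto3 installed, the client could then be created with:
#   boto3.client("s3", **kwargs)
```

Failing early with the list of missing variable names makes a misplaced .env file easier to diagnose than a later authentication error.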
Examples
To export the records to an S3 bucket:
- The --bucket option must be assigned to an existing S3 bucket
Basic Usage
- Return all DataCite records for the EnviDat repository (using client-id ethz.wsl)
- Write XML records to a bucket called "opendataswiss"
datacite-websnap export --client-id ethz.wsl --bucket opendataswiss
Advanced Usage
- Return all DataCite records for the EnviDat repository (using client-id ethz.wsl)
- Write XML records to a bucket called "opendataswiss"
- Use key prefix wsl
- Enable logging to a file
datacite-websnap export --client-id ethz.wsl --bucket opendataswiss --key-prefix wsl --file-logs
Usage: Local Machine
Export DataCite XML metadata records for a specific repository and/or DOI prefix to a local machine.
To write the records locally:
- The --destination option must be assigned to local
- The --directory-path option must be assigned to an existing local directory
Example
- Return all DataCite records for the EnviDat repository (using client-id ethz.wsl)
- Write XML records locally
- Write XML records to a directory called "opendata/wsl"
datacite-websnap export --client-id ethz.wsl --destination local --directory-path "opendata/wsl"
Record Name Formatting
Exported DataCite XML records are assigned file names (or S3 keys) using the DOI that corresponds to the record.
- The "/" slash character that divides the DOI prefix and suffix is replaced with a "_" underscore character
- ".xml" is appended to the DOI as a file extension
Example
Record DOI: 10.16904/envidat.31
File name (or S3 key) for exported record: 10.16904_envidat.31.xml
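The naming rule above can be sketched in a few lines of Python. The function name `record_file_name` is hypothetical. The sketch replaces only the slash that separates prefix and suffix, as described above; how the tool treats any further slashes inside a suffix is not stated here, so that part is an assumption.

```python
def record_file_name(doi: str) -> str:
    """Turn a DOI into a file name (or S3 key) following the rule above."""
    # Split once on the slash that divides the DOI prefix from the suffix.
    prefix, separator, suffix = doi.partition("/")
    if not separator or not prefix or not suffix:
        raise ValueError(f"Expected a DOI of the form prefix/suffix, got: {doi!r}")
    return f"{prefix}_{suffix}.xml"


print(record_file_name("10.16904/envidat.31"))
# prints: 10.16904_envidat.31.xml
```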
Logs
Info messages and errors are logged to the console.
Optionally, info messages and errors can also be written to a log file, which is named "datacite-websnap.log" by default.
To enable file logs, pass the --file-logs option.
Example
datacite-websnap export --client-id ethz.wsl --bucket opendataswiss --file-logs
Configuration: Logs
Default logging configuration variables are assigned in config.py.
To override the defaults, set the variables in the table below in config.py.
LOG_NAME is the name of the log file (used if the --file-logs option is enabled).
For the format variables, see the Python logging basic configuration documentation.
| Configuration Variable | Default |
|---|---|
| LOG_NAME | `"datacite-websnap.log"` |
| LOG_FORMAT | `"%(asctime)s \| %(levelname)s \| %(module)s.%(funcName)s:%(lineno)d \| %(message)s"` |
| LOG_DATE_FORMAT | `"%Y-%m-%d %H:%M:%S"` |
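To make the effect of these variables concrete, the sketch below wires the default values into Python's standard logging module. The `build_logger` function and the logger name are hypothetical; how datacite-websnap actually sets up its handlers is an assumption.

```python
import logging

LOG_NAME = "datacite-websnap.log"
LOG_FORMAT = (
    "%(asctime)s | %(levelname)s | %(module)s.%(funcName)s:%(lineno)d | %(message)s"
)
LOG_DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def build_logger(file_logs: bool = False, level: str = "INFO") -> logging.Logger:
    """Create a console logger and optionally add a file handler."""
    logger = logging.getLogger("datacite-websnap-sketch")
    logger.setLevel(level)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls

    formatter = logging.Formatter(LOG_FORMAT, datefmt=LOG_DATE_FORMAT)
    handlers = [logging.StreamHandler()]
    if file_logs:
        # Mirrors the --file-logs option: also write messages to LOG_NAME.
        handlers.append(logging.FileHandler(LOG_NAME))
    for handler in handlers:
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger


logger = build_logger()
logger.info("Export started")
```

With the default format, each line carries the timestamp, level, the module and function that logged it, the line number, and the message, separated by " | ".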
DataCite API
datacite-websnap retrieves XML metadata records from the DataCite API.
See the DataCite API documentation for the endpoints and pagination used in datacite-websnap.
Configuration: DataCite API
Default configuration variables are assigned in config.py for the DataCite API base URL, endpoints, page size and timeout.
To override the defaults, set the variables in the table below in config.py.
| Configuration Variable | Default | Description |
|---|---|---|
| TIMEOUT | 32 | Timeout of API requests in seconds. |
| DATACITE_API_URL | https://api.datacite.org | DataCite base URL used for API requests. Value is assigned as default to --api-url CLI option. |
| DATACITE_API_CLIENTS_ENDPOINT | /clients | Endpoint used to retrieve client. |
| DATACITE_API_DOIS_ENDPOINT | /dois | Endpoint used to retrieve list of DOIs. |
| DATACITE_PAGE_SIZE | 250 | Number of DOIs retrieved per page using pagination. Value is assigned as default to --page-size CLI option. |
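As an illustration of how the base URL, the /dois endpoint and the page size might combine into a first paginated request, consider the sketch below. The `dois_urls` helper is hypothetical, and both the choice of query parameters (`client-id`, `prefix`, `page[size]`, `page[cursor]`) and the one-request-per-prefix approach are assumptions about how such a request could be built, not a description of the tool's internals.

```python
from urllib.parse import urlencode

DATACITE_API_URL = "https://api.datacite.org"
DATACITE_API_DOIS_ENDPOINT = "/dois"
DATACITE_PAGE_SIZE = 250


def dois_urls(client_id=None, doi_prefixes=(), page_size=DATACITE_PAGE_SIZE,
              api_url=DATACITE_API_URL):
    """Build the first-page URL for each filter combination."""
    if not client_id and not doi_prefixes:
        raise ValueError("Provide a client ID, at least one DOI prefix, or both")
    urls = []
    # One request per DOI prefix, or a single request when only a client ID is set.
    for prefix in (doi_prefixes or [None]):
        params = {"page[size]": page_size, "page[cursor]": 1}
        if client_id:
            params["client-id"] = client_id
        if prefix:
            params["prefix"] = prefix
        urls.append(f"{api_url}{DATACITE_API_DOIS_ENDPOINT}?{urlencode(params)}")
    return urls


for url in dois_urls(client_id="ethz.wsl", doi_prefixes=["10.16904", "10.25678"]):
    print(url)
```

Later pages would be requested by following the next-page link returned with each response until no further link is provided.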
Author
Rebecca Buchholz, EnviDat Software Engineer
EnviDat is the environmental data portal of the Swiss Federal Institute for Forest, Snow and Landscape Research WSL.
Inspiration
websnap
An EnviDat PyPI package that copies files retrieved from an API to an S3 bucket or a local machine.
License