A flexible and lightweight Python interface to the DataCite database
pytacite
Pytacite is a Python library for DataCite. DataCite is a non-profit organisation that provides persistent identifiers (DOIs) for research data and other research outputs. It holds a large index of metadata for these outputs and offers an open and free REST API to query that metadata. Pytacite is a lightweight and thin Python interface to this API. Pytacite aims to stay as close as possible to the design of the original service.
The following features of DataCite are currently supported by pytacite:
- Get single entities
- Filter and query entities
- Sort entities
- Sample entities
- Pagination
- Usage reports
- Authentication
- Side-load associations with include
We aim to cover the entire API, and we are looking for help. Pull requests are welcome.
Key features
- Pipe operations - pytacite can handle multiple operations in a sequence. This allows the developer to write understandable queries. For examples, see the code snippets.
- Permissive license - DataCite data is CC0 licensed. pytacite is published under the MIT license.
Installation
pytacite requires Python 3.8 or later.
pip install pytacite
Getting started
Pytacite offers support for: DOIs, Clients, ClientPrefixes, Events, Prefixes, Providers, and ProviderPrefixes.
from pytacite import DOIs, Clients, Events, Prefixes, ClientPrefixes, Providers, ProviderPrefixes
Get single entity
Get a single DOI, Event, Prefix, or ProviderPrefix from DataCite by its id.
DOIs()["10.14454/FXWS-0523"]
The result is a DOI object, which is very similar to a dictionary. Find the available fields with .keys(). Most interesting attributes are stored in the "attributes" field.
For example, get the titles:
DOIs()["10.14454/FXWS-0523"]["attributes"]["titles"]
[{'title': 'DataCite Metadata Schema for the Publication and Citation of Research Data and Other Research Outputs v4.4'}]
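Because a DOI object behaves much like a dictionary, plain dictionary access is enough to pull out nested values. The sketch below uses a hand-written dictionary shaped like the output above instead of a live API call, and first_title is a hypothetical helper that is not part of pytacite.

```python
# Hand-written stand-in for a DOI record; the shape follows the output above.
record = {
    "attributes": {
        "titles": [
            {
                "title": "DataCite Metadata Schema for the Publication and "
                "Citation of Research Data and Other Research Outputs v4.4"
            }
        ]
    }
}


def first_title(record):
    """Return the first title of a record, or None if it has no titles."""
    titles = record.get("attributes", {}).get("titles", [])
    return titles[0]["title"] if titles else None


print(first_title(record))
```

Using .get() with defaults keeps the helper from raising a KeyError when a record has no titles.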
It works similarly for other resource collections.
Prefixes()["10.12682"]
Events()["9a34e232-5b30-453b-a393-ea10a6ce565d"]
Get lists of entities
results = DOIs().get()
For lists of entities, you can also count the number of records found instead of returning the results. This also works for search queries and filters.
DOIs().count()
# 50869984
For lists of entities, you can return the result as well as the metadata. By default, only the results are returned.
results, meta = DOIs().get(return_meta=True)
print(meta)
{'total': 50869984,
'totalPages': 400,
'page': 1,
'states': [{'id': 'findable', 'title': 'Findable', 'count': 50869984}],
'resourceTypes': [{'id': 'dataset', 'title': 'Dataset', 'count': 15426144}, <...>]
<...>
'subjects': [{'id': 'FOS: Biological sciences',
'title': 'Fos: Biological Sciences',
'count': 3304486}, <...>],
'citations': [],
'views': [],
'downloads': []}
Filters and queries
DataCite makes use of filters and queries. Filters narrow down the set of results, and queries search the metadata fields. See:
- Filtering: https://support.datacite.org/docs/api-queries#filtering-list-responses
- Making Queries: https://support.datacite.org/docs/api-queries#making-queries
The following example returns records created in the year 2020 on Dryad.
DOIs().filter(created=2020, client_id="dryad.dryad").get()
which is identical to:
DOIs().filter(created=2020).filter(client_id="dryad.dryad").get()
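One way to see why the two forms are identical is to think of every filter call as adding to the same set of parameters. The class below is a hypothetical stand-in written for illustration only. It is not pytacite's implementation, and FilterSketch and params are made-up names.

```python
class FilterSketch:
    """Hypothetical chainable query object (not pytacite's actual code)."""

    def __init__(self, params=None):
        self.params = dict(params or {})

    def filter(self, **kwargs):
        # Return a new object that carries the earlier filters plus the new ones.
        return FilterSketch({**self.params, **kwargs})


combined = FilterSketch().filter(created=2020, client_id="dryad.dryad")
chained = FilterSketch().filter(created=2020).filter(client_id="dryad.dryad")

print(combined.params == chained.params)  # True
```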
Queries can work in a similar fashion and can be applied to all fields. For example, search for records with climate change in the title.
DOIs().query("climate change").get()
Note that this returns a list of all the DOI records that contain the terms climate and change anywhere in their metadata (potential mistake in DataCite documentation).
Nested attribute filters
Some attribute filters are nested and separated with dots by DataCite. For example, filter on creators.nameIdentifiers.nameIdentifierScheme.
In case of nested attribute filters, use a dict to build the query.
DOIs() \
    .query(creators={"nameIdentifiers": {"nameIdentifierScheme": "ORCID"}}) \
    .query(publicationYear=2016) \
    .query(language="es") \
    .count()
# 562
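The nested dict and the dotted name describe the same field. The sketch below shows that correspondence with a small recursive function. How pytacite performs this step internally is not shown here, so flatten is a hypothetical helper rather than a library function.

```python
def flatten(nested, prefix=""):
    """Yield (dotted_path, value) pairs for a nested dict of query terms."""
    for key, value in nested.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Descend one level and extend the dotted path.
            yield from flatten(value, path)
        else:
            yield path, value


query = {"creators": {"nameIdentifiers": {"nameIdentifierScheme": "ORCID"}}}
print(dict(flatten(query)))
# {'creators.nameIdentifiers.nameIdentifierScheme': 'ORCID'}
```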
Sort entity lists
Clients().sort("created", ascending=True).get()
Logical expressions
See DataCite on logical operators like AND, OR, and NOT.
Paging
DataCite offers two methods for paging: basic paging and cursor paging. Both methods are supported by pytacite.
Basic (offset) paging
Only the first 10,000 records can be retrieved with basic (offset) paging.
pager = DOIs().filter(prefix="10.5438").paginate(method="number", per_page=100)
for page in pager:
    print(len(page))
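The 10,000 record limit puts a ceiling on how many pages basic paging can reach for a given page size. The arithmetic can be sketched as follows. reachable_pages and MAX_OFFSET_RECORDS are made-up names for illustration and are not part of pytacite.

```python
import math

MAX_OFFSET_RECORDS = 10_000  # limit for basic (offset) paging stated above


def reachable_pages(total, per_page):
    """Number of pages that basic paging can reach for a result set."""
    return math.ceil(min(total, MAX_OFFSET_RECORDS) / per_page)


print(reachable_pages(total=50_869_984, per_page=100))  # 100
print(reachable_pages(total=250, per_page=100))  # 3
```

For result sets larger than the limit, cursor paging is the way to reach the remaining records.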
Cursor paging
Use paginate() for paging results. By default, the n_max argument of paginate is set to 10000. Use None to retrieve all results.
pager = DOIs().filter(prefix="10.5438").paginate(per_page=100)
for page in pager:
    print(len(page))
Looking for an easy method to iterate the records of a pager?
from itertools import chain
from pytacite import DOIs
query = DOIs().filter(prefix="10.5438")
for record in chain(*query.paginate(per_page=100)):
    print(record["id"])
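One thing to keep in mind: unpacking with * makes Python read the whole iterable before chain starts, so if the pager fetches pages lazily, all pages are requested up front. chain.from_iterable reads one page at a time instead. The sketch below demonstrates this with fake_pager, a made-up stand-in generator that records each "request", so no API call is involved.

```python
from itertools import chain, islice

fetched = []


def fake_pager(n_pages=3, per_page=2):
    """Stand-in for a pager: yields lists of records and logs each 'request'."""
    for page_number in range(n_pages):
        fetched.append(page_number)
        yield [{"id": f"record-{page_number}-{i}"} for i in range(per_page)]


# chain.from_iterable pulls pages one at a time, only when they are needed.
first_two = list(islice(chain.from_iterable(fake_pager()), 2))
print([r["id"] for r in first_two])  # ['record-0-0', 'record-0-1']
print(fetched)  # [0]
```

Only the first page was generated, because the loop stopped after two records.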
Get random DOIs
Get random DOIs. Note that response times for this request are very slow (caused by DataCite).
DOIs().random().get(per_page=10)
Code snippets
A list of awesome use cases of the DataCite dataset.
Creators of a dataset
from pytacite import DOIs
w = DOIs()["10.34894/HE6NAQ"]
w["attributes"]["creators"]
Get the works of a single creator
Work in progress: get rid of quotes.
DOIs() \
    .query(creators={"nameIdentifiers": {"nameIdentifier": "\"https://orcid.org/0000-0001-7736-2091\""}}) \
    .get()
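Until the quotes are handled by the library, a tiny helper can keep the escaping out of the query itself. This is only a sketch, and quoted is a hypothetical helper name that pytacite does not provide.

```python
def quoted(value):
    """Wrap a value in literal double quotes."""
    return f'"{value}"'


orcid = quoted("https://orcid.org/0000-0001-7736-2091")
print(orcid)  # "https://orcid.org/0000-0001-7736-2091"

# The nested dict for the query above can then be written without backslashes.
creators = {"nameIdentifiers": {"nameIdentifier": orcid}}
```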
Software published on Zenodo in 2016
Resources:
Get the DataCite identifier of the client first:
from pytacite import Clients
c = Clients().query("Zenodo").get()
print(c[0]["id"])
# cern.zenodo
Filter the DOIs on the client identifier. It can be a bit confusing when to use filter and query here.
DOIs() \
    .filter(client_id=c[0]["id"]) \
    .filter(resource_type_id="software") \
    .query(publicationYear=2016) \
    .get()
# 9720
Number of repositories running on Dataverse software
from pytacite import Clients
Clients() \
    .filter(software="dataverse") \
    .count()
# 31
Alternatives
datacite is a nice Python wrapper for the Metadata Store API, which is not covered by pytacite.
R users can use the RDataCite library.
License
pytacite is published under the MIT license.
Contact
This library is a community contribution. The authors of this Python library aren't affiliated with DataCite.
Feel free to reach out with questions, remarks, and suggestions. The issue tracker is a good starting point. You can also email me at jonathandebruinos@gmail.com.