Python library to access OpenAlex Snapshot files
OpenAlex-RAW
This is a Python module for processing the OpenAlex dataset from the raw snapshot files available on the OpenAlex website.
Installation
To use the package, you need a Python (>=3.7) environment installed on your system. The package can be installed via pip or by downloading the source code from this repository.
Downloading the OpenAlex snapshot
If you have not already downloaded the snapshot, follow the installation instructions on the OpenAlex website under "Download OpenAlex Snapshot to your machine". Here we provide a summary of the steps to download the dataset. Please check the OpenAlex website for the most up-to-date instructions.
First, install the aws cli tool by following the instructions on the AWS CLI website.
Next, use the following command to download the snapshot:
aws s3 sync 's3://openalex' 'openalex-snapshot' --no-sign-request
A new folder named openalex-snapshot will be created in your current working directory containing the dataset. Note that this process can take a long time, as the dataset is over 300 GB.
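Once the sync finishes, a quick sanity check of the downloaded folder can help catch an interrupted transfer. The following is a minimal sketch, not part of this package: the helper name summarize_snapshot is hypothetical, and it assumes the snapshot stores each entity type under data/<entity>/updated_date=.../ as gzip-compressed part files. Check the OpenAlex website if the layout has changed.

```python
from pathlib import Path


def summarize_snapshot(snapshot_path):
    """Count the compressed part files found for each entity folder.

    Assumed layout (not guaranteed by this package):
    <snapshot>/data/<entity>/updated_date=.../part_*.gz
    """
    data_dir = Path(snapshot_path) / "data"
    summary = {}
    if not data_dir.is_dir():
        # Nothing downloaded yet, or the path is wrong
        return summary
    for entity_dir in sorted(p for p in data_dir.iterdir() if p.is_dir()):
        # Each dated subfolder is expected to hold one or more .gz part files
        summary[entity_dir.name] = sum(1 for _ in entity_dir.glob("*/*.gz"))
    return summary
```

Calling summarize_snapshot("openalex-snapshot") returns a dictionary mapping each entity folder name to its number of part files. An empty dictionary suggests the path is wrong or the download did not start.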
Installing the OpenAlex-RAW package
The package can be installed via pip by running the following command:
pip install openalex-raw
All the required packages are installed automatically.
You can also download the source code from this repository and install it manually. This can be done using git:
git clone https://github.com/filipinascimento/openalex-raw
Next, you need to install the package using pip or setup.py:
pip install -e ./openalex-raw
or
cd openalex-raw
python setup.py install
Usage RAW access
To go over all the entries of a certain type in the dataset, you can use the following code:
from pathlib import Path
# tqdm is used to print a nice progress bar
# install it using `pip install tqdm`
from tqdm.auto import tqdm
import openalexraw as oaraw
# Path to the OpenAlex snapshot
openAlexPath = Path("<Location of the OpenAlex Snapshot>")
# Initializing the OpenAlex object with the OpenAlex snapshot path
oa = oaraw.OpenAlex(
    openAlexPath = openAlexPath
)
# Which entity to process
# "works" | "authors" | "institutions" | "venues" | "concepts"
entityType = "works"
# Getting the number of entries
entitiesCount = oa.getRawEntityCount(entityType)
# Iterating over all the entities of a certain type
for entity in tqdm(oa.rawEntities(entityType), total=entitiesCount):
    openAlexID = entity["id"]
    # do something with the entity
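As an illustration of what "do something with the entity" might look like, the sketch below tallies records per publication year. It is written against any iterable of dictionaries, so it can be tried without the snapshot. The helper name count_by_year is hypothetical, and it assumes that each works record is a dict carrying a publication_year field, as described in the OpenAlex documentation.

```python
from collections import Counter


def count_by_year(entities):
    """Count records per publication year from an iterable of dict records."""
    counts = Counter()
    for entity in entities:
        # .get groups records that lack the field under the key None
        counts[entity.get("publication_year")] += 1
    return counts
```

With the objects from the example above, this could be called as count_by_year(oa.rawEntities("works")), optionally wrapping the iterable in tqdm to keep the progress bar.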
On fast storage, it may take a couple of hours to iterate over all the entities of the works or authors types. For the institutions, venues, and concepts types, it may take just a few minutes.
Generating Schema and Report
Schemas and reports for each entity type can be found in the Schema and Reports folders of this repository, respectively. Schema files are in machine-readable JSON format and contain all the fields and their non-null counts, including nested structures and lists. The reports show the number and percentage of entries in the dataset that cover each field. Both Schema and Report files are named according to the OpenAlex entity type. Schema files can also include the most common values (samples) for each field. Two schema files are provided per entity type: one with samples (e.g., Schema/schema_works_samples.json) and another without (e.g., Schema/schema_works.json).
To generate or update all the reports and schemas, check the file Examples/create_report.py. Building the report can take a long time. You can use the provided schema files when generating dbgz archives.
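To make the notion of field coverage concrete, here is a minimal sketch of how non-null counts and percentages could be computed for top-level fields. It is not the code in Examples/create_report.py: the helper name field_coverage is hypothetical, and unlike the provided schema files it does not descend into nested structures or lists.

```python
def field_coverage(entities):
    """Return {field: (non_null_count, percentage)} for top-level fields."""
    total = 0
    non_null = {}
    for entity in entities:
        total += 1
        for key, value in entity.items():
            if value is not None:
                non_null[key] = non_null.get(key, 0) + 1
            else:
                # Remember the field even if it has only been null so far
                non_null.setdefault(key, 0)
    # When no records were seen, non_null is empty and no division happens
    return {
        key: (count, 100.0 * count / total)
        for key, count in non_null.items()
    }
```

The function accepts any iterable of dictionaries, so it can be tested on a handful of hand-written records before being run on a full entity type.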
Coming soon
- Random access based on the OpenAlex ID via dbgz.
- Better documentation for Schema/Report generators.
Full API documentation
The following is the documentation of the package's API.
class OpenAlex
OpenAlex(
    openAlexPath,
    verbose = False
):
Class to access the OpenAlex data snapshots.
openAlexPath : str or pathlib.Path
    The path to the OpenAlex directory. (default: current working directory)
verbose : bool
    If True, print out more information. (default: False)
Returns
OpenAlex object
    The OpenAlex instance that can be used to access the dataset.
method getRawEntityCount
OpenAlex.getRawEntityCount(entityType):
Get the number of raw entities of the given entity type.
entityType : str
    Entity type can be "authors", "concepts", "institutions", "venues" or "works".
Returns
int
    The number of entities for the provided entityType.
method rawEntities
OpenAlex.rawEntities(entityType):
Iterate over the entities of the selected type directly from the raw snapshot.
entityType : str
    Entity type can be "authors", "concepts", "institutions", "venues" or "works".
Returns
iterable
    An iterable collection of entities of the provided entityType.
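For readers curious about what iterating over the raw snapshot involves, the sketch below reads records straight from the compressed files using only the standard library. It is not the implementation of rawEntities: the helper name iter_json_lines is hypothetical, and it assumes each entity folder contains dated subfolders of gzip-compressed files with one JSON object per line.

```python
import gzip
import json
from pathlib import Path


def iter_json_lines(entity_dir):
    """Yield one parsed JSON record per line from every .gz part file.

    Assumed layout: <entity_dir>/<dated subfolder>/<part>.gz
    """
    for part in sorted(Path(entity_dir).glob("*/*.gz")):
        # Open in text mode so each line decodes to a str
        with gzip.open(part, "rt", encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:
                    yield json.loads(line)
```

Because the function is a generator, records are decoded one at a time and the files are never loaded fully into memory, which matters for entity types as large as works.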