
Python library to access OpenAlex Snapshot files


OpenAlex-RAW

This is a Python module for processing the OpenAlex dataset from the raw snapshot files available from the OpenAlex website.

Installation

To use the package, you need a Python (>=3.7) environment installed on your system. The package can be installed via pip or by downloading the source code from this repository.

Downloading the OpenAlex snapshot

If you have not already downloaded the snapshot, follow the installation instructions on the OpenAlex website under Download OpenAlex Snapshot to your machine. Here we provide a summary of the steps to download the dataset. Please check the OpenAlex website for the most up-to-date instructions.

First, install the aws cli tool by following the instructions on the AWS-cli website.

Next, use the following command to download the snapshot:

aws s3 sync 's3://openalex' 'openalex-snapshot' --no-sign-request 

A new folder named openalex-snapshot will be created in your current working directory containing the dataset. Note that this process can take a long time, as the dataset is over 300 GB.
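Once the sync finishes, a quick sanity check of the downloaded folder can be useful. The following is a minimal sketch, not part of the OpenAlex-RAW package: the helper name summarize_snapshot is hypothetical, and it assumes the snapshot stores gzip-compressed part files under openalex-snapshot/data/<entity type>/, which may differ in your copy.

```python
from pathlib import Path


def summarize_snapshot(snapshot_path):
    """Return {entity_type: (file_count, total_bytes)} for the .gz part files.

    Hypothetical helper; assumes the layout <snapshot>/data/<entity type>/...
    """
    data_dir = Path(snapshot_path) / "data"
    summary = {}
    if not data_dir.is_dir():
        # Nothing downloaded yet, or the layout differs from the assumption
        return summary
    for entity_dir in sorted(p for p in data_dir.iterdir() if p.is_dir()):
        # Part files may sit in nested subfolders, so search recursively
        parts = list(entity_dir.rglob("*.gz"))
        total_bytes = sum(p.stat().st_size for p in parts)
        summary[entity_dir.name] = (len(parts), total_bytes)
    return summary


# Example usage after the download:
# for name, (count, size) in summarize_snapshot("openalex-snapshot").items():
#     print(f"{name}: {count} files, {size / 1e9:.2f} GB")
```

An empty result most likely means the path is wrong or the folder layout does not match the assumption above.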

Installing the OpenAlex-RAW package

The package can be installed via pip by running the following command:

pip install openalex-raw

All the required packages are installed automatically.

You can also download the source code from this repository and install it manually. This can be done using git:

git clone https://github.com/filipinascimento/openalex-raw

Next, you need to install the package using pip or setup.py:

pip install -e ./openalex-raw

or

cd openalex-raw
python setup.py install

Usage RAW access

To go over all the entries of a certain type in the dataset, you can use the following code:

from pathlib import Path

# tqdm is used to print a nice progress bar
# install it using `pip install tqdm`
from tqdm.auto import tqdm

import openalexraw as oaraw

# Path to the OpenAlex snapshot
openAlexPath = Path("<Location of the OpenAlex Snapshot>")

# Initializing the OpenAlex object with the OpenAlex snapshot path
oa = oaraw.OpenAlex(
    openAlexPath = openAlexPath
)

# Which entity to process
# "works" | "authors" | "institutions" | "venues" | "concepts"
entityType = "works"

# Getting the number of entries
entitiesCount = oa.getRawEntityCount(entityType)

# Iterating over all the entities of a certain type
for entity in tqdm(oa.rawEntities(entityType), total=entitiesCount):
    openAlexID = entity["id"]
    # do something with the entity

On fast storage, iterating over all the entities of the works or authors types may take a couple of hours. For the institutions, venues, and concepts types, it may take just a few minutes.
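As an illustration of the "do something with the entity" step, the sketch below tallies the values of one top-level field. The helper count_by_field is hypothetical and not part of the package. It assumes each entity behaves like a plain dict, as the entity["id"] access in the loop above suggests, and that the chosen field holds hashable values. The publication_year field used in the usage comment comes from the OpenAlex works schema.

```python
from collections import Counter


def count_by_field(entities, field):
    """Count how often each value of a top-level field occurs.

    Entities that lack the field are counted under the key None.
    Hypothetical helper; assumes dict-like entities with hashable values.
    """
    counts = Counter()
    for entity in entities:
        counts[entity.get(field)] += 1
    return counts


# Example usage with the snapshot (requires the package and the data):
# counts = count_by_field(oa.rawEntities("works"), "publication_year")
# for year, n in sorted(counts.items(), key=lambda kv: (kv[0] is None, kv[0])):
#     print(year, n)
```

Because the function accepts any iterable, it can be tried on a small list of dicts before committing to a full pass over the snapshot.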

Generating Schema and Report

Schemas and reports for each entity type can be found in the Schema and Reports folders of this repository, respectively. Schema files are in machine-readable JSON format and contain all the fields with their non-null counts, including nested structures and lists. The reports show the number and percentage of entries covering each field in the dataset. Both Schema and Report files are named according to the OpenAlex entity type. Two schema files are provided for each type: one that also includes the most common values (samples) for each field (e.g., Schema/schema_works_samples.json) and one without samples (e.g., Schema/schema_works.json).

To generate or update all the reports and schemas, see the file Examples/create_report.py. Building the report can take a long time. You can use the provided schema files when generating dbgz archives.
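To make the idea of non-null counts and coverage percentages concrete, here is a simplified sketch. It is not the implementation in Examples/create_report.py: the helper field_coverage is hypothetical, it looks only at top-level fields, and it ignores nested structures and lists.

```python
def field_coverage(entities):
    """Return {field: (non_null_count, percentage)} for top-level fields.

    Hypothetical, simplified helper: a field counts as covered when its
    value is not None. Nested structures are not inspected.
    """
    total = 0
    non_null = {}
    for entity in entities:
        total += 1
        for key, value in entity.items():
            # Register the field even when its value is null
            non_null.setdefault(key, 0)
            if value is not None:
                non_null[key] += 1
    # When no entities were seen, non_null is empty and no division happens
    return {key: (n, 100.0 * n / total) for key, n in non_null.items()}


# Example usage with the snapshot (requires the package and the data):
# coverage = field_coverage(oa.rawEntities("institutions"))
```

Running this on one of the smaller entity types first keeps the turnaround short.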

Coming soon

  • Random access based on the OpenAlex ID via dbgz.
  • Better documentation for Schema/Report generators.

Full API documentation

The following is the documentation of the package's API.

class OpenAlex

    OpenAlex(
        openAlexPath,
        verbose = False
        ):

Class to access the OpenAlex data snapshots.

  • openAlexPath : str or pathlib.Path
    The path to the OpenAlex directory. (default: current working directory)
  • verbose : bool
    If True, print out more information. (default: False)

Returns

  • OpenAlex object
    The OpenAlex instance that can be used to access the dataset.

method getRawEntityCount

    OpenAlex.getRawEntityCount(entityType):

Get the number of raw entities of the given entity type.

  • entityType : str
    Entity type can be "authors", "concepts", "institutions", "venues" or "works".

Returns

  • int
    The number of entities for the provided entityType.

method rawEntities

    OpenAlex.rawEntities(entityType):

Iterate over the entities of the selected type directly from the raw snapshot.

  • entityType : str
    Entity type can be "authors", "concepts", "institutions", "venues" or "works".

Returns

  • iterable
    An iterable collection of entities of the provided entityType.
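Since rawEntities returns an iterable, standard iterator tools can be applied to it. The sketch below inspects only the first few entities instead of running a full pass. The helper first_entities is hypothetical, and the sketch assumes the iterable yields entities lazily, which this documentation does not state explicitly.

```python
from itertools import islice


def first_entities(entities, limit=5):
    """Return up to `limit` entities from the front of an iterable as a list.

    Hypothetical helper; stops consuming the iterable after `limit` items.
    """
    return list(islice(entities, limit))


# Example usage with the snapshot (requires the package and the data):
# for entity in first_entities(oa.rawEntities("concepts"), limit=3):
#     print(entity["id"])
```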

