Python library to access OpenAlex Snapshot files

OpenAlex-RAW

This is a Python module for processing the OpenAlex dataset from the raw snapshot files available from the OpenAlex website.

Installation

To use the package, you need a Python (>=3.7) environment installed on your system. The package can be installed via pip or by downloading the source code from this repository.

Downloading the OpenAlex snapshot

If you have not already downloaded the snapshot, you can follow the instructions on the OpenAlex website under "Download OpenAlex Snapshot to your machine". Here we provide a summary of the steps to download the dataset. Please check the OpenAlex website for the most up-to-date instructions.

First, install the AWS CLI tool by following the instructions on the AWS CLI website.

Next, use the following command to download the snapshot:

aws s3 sync 's3://openalex' 'openalex-snapshot' --no-sign-request 

A new folder named openalex-snapshot will be created in your current working directory containing the dataset. Note that this process can take a long time as the dataset is over 300GB.
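
Once the sync finishes, a quick sanity check of the download can be useful. The following is a minimal sketch, not part of this package, that counts the compressed files and their total size per entity type. It assumes the snapshot is laid out as openalex-snapshot/data/<entity type>/.../*.gz; the function name summarize_snapshot is hypothetical, and the layout should be checked against the OpenAlex documentation.

```python
from collections import defaultdict
from pathlib import Path


def summarize_snapshot(snapshot_path):
    """Return {entity_type: (file_count, total_bytes)} for the .gz files found.

    Assumes the layout <snapshot>/data/<entity_type>/.../*.gz. This is an
    assumption about the snapshot structure, not a guarantee of the package.
    """
    data_dir = Path(snapshot_path) / "data"
    if not data_dir.is_dir():
        return {}
    summary = defaultdict(lambda: [0, 0])
    for gz_file in data_dir.rglob("*.gz"):
        parts = gz_file.relative_to(data_dir).parts
        if len(parts) < 2:
            # Skip files that are not inside an entity-type folder
            continue
        summary[parts[0]][0] += 1
        summary[parts[0]][1] += gz_file.stat().st_size
    return {name: tuple(values) for name, values in summary.items()}


# Prints nothing if the folder does not exist in the current directory
for name, (count, size) in sorted(summarize_snapshot("openalex-snapshot").items()):
    print(f"{name}: {count} files, {size / 1e9:.2f} GB")
```

An entity type with zero files, or a total far below the expected size, suggests the sync was interrupted and should be run again.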

Installing the OpenAlex-RAW package

The package can be installed via pip by running the following command:

pip install openalex-raw

All the required packages are installed automatically.

You can also download the source code from this repository and install it manually. This can be done using git:

git clone https://github.com/filipinascimento/openalex-raw

Next, you need to install the package using pip or setup.py:

pip install -e ./openalex-raw

or

cd openalex-raw
python setup.py install
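
After either installation route, you may want to confirm that the module is visible to the interpreter you intend to use. The sketch below uses only the standard library; the helper name is_installed is hypothetical, and openalexraw is the import name used in the usage example of this README.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(module_name) is not None


# Reports whether the package is importable in the current environment
print("openalexraw importable:", is_installed("openalexraw"))
```

If this reports False, the package was most likely installed into a different Python environment than the one running the check.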

Usage RAW access

To go over all the entries of a certain type in the dataset, you can use the following code:

from pathlib import Path

# tqdm is used to print a nice progress bar
# install it using `pip install tqdm`
from tqdm.auto import tqdm

import openalexraw as oaraw

# Path to the OpenAlex snapshot
openAlexPath = Path("/gpfs/sciencegenome/OpenAlex/openalex-snapshot")

# Path to where to save the schema files
schemasPath = Path("Schema")

# Initializing the OpenAlex object with the OpenAlex snapshot path
oa = oaraw.OpenAlex(
    openAlexPath = openAlexPath
)

# Creating any necessary directories
schemasPath.mkdir(parents=True, exist_ok=True)

# Which entity to process
# "works" | "authors" | "institutions" | "venues" | "concepts"
entityType = "works"

# Getting the number of entries
entitiesCount = oa.getRawEntityCount(entityType)

# Iterating over all the entities of a certain type
for entity in tqdm(oa.rawEntities(entityType),total=entitiesCount):
    openAlexID = entity["id"]
    # do something with the entity

On fast storage, it may take a couple of hours to iterate over all the entities of the works or authors types. For the institutions, venues, and concepts types, it may take just a few minutes.
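
Because a full pass over works or authors can take hours, it can help to try the processing logic on a small prefix first. The following is a minimal sketch under stated assumptions: tally_field is a hypothetical helper, the sample records are made up, and "publication_year" is assumed to be a field of works entities. It accepts any iterable of dictionaries, so the iterator returned by oa.rawEntities(entityType) could be passed in place of the sample list.

```python
from collections import Counter
from itertools import islice


def tally_field(entities, field, limit=None):
    """Count the values of `field` over an iterable of dict-like entities.

    `limit` caps how many entities are read, which is handy for a trial run.
    Entities without the field are counted under None. The field values must
    be hashable (numbers or strings, not lists or dicts).
    """
    counts = Counter()
    for entity in islice(entities, limit):
        counts[entity.get(field)] += 1
    return counts


# Made-up stand-in records; with the package this would be oa.rawEntities("works")
sample = [
    {"id": "W1", "publication_year": 2020},
    {"id": "W2", "publication_year": 2021},
    {"id": "W3", "publication_year": 2020},
]
print(tally_field(sample, "publication_year", limit=2))
```

Leaving limit as None processes the whole iterable, so the same function serves both the trial run and the full pass.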

Generating Schema and Report

Reports for each entity type can be found in the folder Schema of this repository. To generate/update all the reports, check the file Examples/create_report.py in the repository.
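
To give a rough idea of what a schema summary involves, here is a minimal sketch that records which value types appear under each top-level field. It is not the generator in Examples/create_report.py, whose implementation is not described here; infer_schema is a hypothetical name and the sample records are made up.

```python
from collections import defaultdict


def infer_schema(entities):
    """Map each top-level field name to the sorted type names seen for it."""
    seen = defaultdict(set)
    for entity in entities:
        for key, value in entity.items():
            seen[key].add(type(value).__name__)
    return {key: sorted(types) for key, types in seen.items()}


# Made-up records; any iterable of dictionaries works
records = [
    {"id": "A1", "works_count": 3},
    {"id": "A2", "works_count": None},
]
for field, types in infer_schema(records).items():
    print(field, types)
```

A field that shows more than one type (for example both int and NoneType) is one that downstream code should handle defensively.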

Coming soon

  • Random access based on the OpenAlex ID via dbgz.
  • Better documentation for Schema/Report generators.

Full API documentation

The following is the documentation of the package's API.

class OpenAlex

    OpenAlex(
        openAlexPath,
        verbose = False
        ):

Class to access the OpenAlex data snapshots.

  • openAlexPath : str or pathlib.Path
    The path to the OpenAlex directory. (default: current working directory)
  • verbose : bool
    If True, print out more information. (default: False)

Returns

  • OpenAlex object
    The OpenAlex instance that can be used to access the dataset.

method getRawEntityCount

    OpenAlex.getRawEntityCount(entityType):

Get the number of raw entities of the given entity type.

  • entityType : str
    Entity type can be "authors", "concepts", "institutions", "venues" or "works".

Returns

  • int
    The number of entities for the provided entityType.
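
As an illustration, the counts for every entity type can be collected in one dictionary. This is a sketch only: count_all and FakeOpenAlex are hypothetical names, and the fake class exists so the example runs without a snapshot. With the package installed, an OpenAlex instance would be passed instead of the fake.

```python
ENTITY_TYPES = ("authors", "concepts", "institutions", "venues", "works")


def count_all(oa, entity_types=ENTITY_TYPES):
    """Return {entity_type: count} by calling oa.getRawEntityCount per type."""
    return {
        entity_type: oa.getRawEntityCount(entity_type)
        for entity_type in entity_types
    }


class FakeOpenAlex:
    """Stand-in used only to exercise count_all without a snapshot."""

    def __init__(self, counts):
        self._counts = counts

    def getRawEntityCount(self, entityType):
        return self._counts[entityType]


# Made-up counts: 0 for authors, 1 for concepts, and so on
fake = FakeOpenAlex({name: index for index, name in enumerate(ENTITY_TYPES)})
print(count_all(fake))
```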

method rawEntities

    OpenAlex.rawEntities(entityType):

Iterate over the entities of the selected type directly from the raw snapshot.

  • entityType : str
    Entity type can be "authors", "concepts", "institutions", "venues" or "works".

Returns

  • iterable
    An iterable collection of entities of the provided entityType.
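
For intuition about what iterating over raw files involves, the sketch below reads gzip-compressed files line by line with the standard library. It assumes the snapshot part files are gzip-compressed JSON Lines (one JSON object per line); it is not the implementation of rawEntities, and iter_json_lines is a hypothetical name.

```python
import gzip
import json
from pathlib import Path


def iter_json_lines(gz_paths):
    """Yield one dict per non-empty line from each gzip-compressed file."""
    for gz_path in gz_paths:
        with gzip.open(gz_path, mode="rt", encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:
                    yield json.loads(line)
```

Under the same assumption about the folder layout, the paths could be gathered with something like sorted(Path("openalex-snapshot/data/venues").rglob("*.gz")). Because the function is a generator, entities are decoded one at a time and memory use stays roughly constant regardless of file size.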

Download files

Source Distribution

openalex-raw-0.1.5.tar.gz (6.7 kB)

Uploaded Source

File details

Details for the file openalex-raw-0.1.5.tar.gz.

File metadata

  • Download URL: openalex-raw-0.1.5.tar.gz
  • Upload date:
  • Size: 6.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.0 CPython/3.8.12

File hashes

Hashes for openalex-raw-0.1.5.tar.gz
Algorithm Hash digest
SHA256 68d6a435632b583d1ec450f617b91e0438eedcebe4d36a3bc405bfebfd9383fe
MD5 2fe5b1720e8f9d0cbc3fc8dcdc71cb1e
BLAKE2b-256 d0ab278d4ccaad3f2ddbf0d5a79f582c37635ce2d982998a6253937d0812f65c