
OpenAlex Analysis

A Python library to download, analyse, or plot articles, institutions, and other entities from the OpenAlex API.

Install with:

pip install openalex-analysis

Documentation: https://romain894.github.io/openalex-analysis

Licence: GPL V3

Examples

Examples in the documentation: https://romain894.github.io/openalex-analysis/html/example_works_concepts.html

Get a dataset

You can use the library simply to download and work with datasets from OpenAlex. The library can download these datasets and cache them on the computer automatically.

These datasets can then be used in Python outside the library, as they are pandas DataFrame objects.

It is possible to save a dataset as a CSV file with the pandas method df.to_csv("my_dataset.csv").

Below are a few examples:

Get works from a concept

Get the works about regime shift and save them in a CSV file:

from openalex_analysis.analysis import WorksAnalysis

concept_regime_shift_id = 'C2780893879'

wplt = WorksAnalysis(concept_regime_shift_id)

my_dataset = wplt.entities_df

my_dataset.to_csv("dataset_regime_shift_works.csv")

Get the works about sustainability from the Stockholm Resilience Centre published in 2020

from openalex_analysis.analysis import WorksAnalysis

concept_sustainability = 'C66204764'
institution_src_id = "I138595864"
extra_filters = {
    'publication_year':2020,
    'authorships':{'institutions':{'id':institution_src_id}},
}

wplt = WorksAnalysis(concept_sustainability,
                     extra_filters = extra_filters)

my_dataset = wplt.entities_df
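
The result is a regular pandas DataFrame, so it can be inspected and filtered directly. A minimal sketch (the id and title columns appear in the analysis example below; the cited_by_count column is an assumption based on the standard OpenAlex work fields):

# inspect the downloaded works with plain pandas operations
print(f"{len(my_dataset)} works in the dataset")
print(my_dataset[['id', 'title']].head())

# rank the downloaded works by citation count
# ('cited_by_count' is a standard OpenAlex work field, assumed to be present as a column here)
print(my_dataset.sort_values('cited_by_count', ascending=False)[['title', 'cited_by_count']].head())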

Institutions collaborations plot

from openalex_analysis.plot import InstitutionsPlot

year = "2023"

entities_from = [
                     "I163151358",  # Cyprus University of Technology
                     "I107257983",  # Darmstadt University of Applied Sciences
                     "I201787326",  # Riga Technical University
                     "I4210144925", # Technological University Dublin
                     "I31151848",   # Technical University of Sofia
                     "I3123212020", # Universidad Politécnica de Cartagena
                     "I140494188",  # University of Technology of Troyes
                     "I158333966",  # Technical University of Cluj-Napoca
                     "I186995768",  # University of Cassino and Southern Lazio
                     ]

iplt = InstitutionsPlot()

# generate the DataFrame
iplt.get_collaborations_with_institutions(entities_from = entities_from,
                                          year = year,
                                         )

# generate the plot
fig = iplt.get_figure_collaborations_with_institutions()

fig.write_image("plot_collaborations_eut+_2023.svg")
fig.write_image("plot_collaborations_eut+_2023.png", scale=2)
fig.write_html("plot_collaborations_eut+_2023.html")
 
fig.show()

Plot of the collaborations of the EUt+ universities in 2023. Click on the plot to open the notebook with the interactive view.

Basic analysis

In this example, we create a dataset with the works about sustainability.

This dataset can be used as is: it is stored in a parquet file (more space-efficient than CSV) on the computer and can simply be imported as a DataFrame with pandas.
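
For instance, the cached parquet file can be opened directly with pandas outside the library. A minimal sketch, assuming a hypothetical file name inside the default data folder (the library stores one parquet file per request, see project_datas_folder_path in the settings below):

import pandas as pd

# hypothetical file name: the actual name depends on the request that produced the cache
df = pd.read_parquet("data/works_sustainability_example.parquet")
print(df.head())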

After getting this dataset, we continue by extracting the articles most cited by the works in the dataset. For that, we extract all the references of the articles present in the dataset and rank these references by how often they are cited.

from openalex_analysis.plot import WorksPlot

concept_sustainability_id = 'C66204764'

# get the works about sustainability
wplt = WorksPlot(concept_sustainability_id)

print("\nFirst entities in the dataset:")
print(wplt.entities_df[['id', 'title']].head(3))

# compute the most cited works by the dataset previously downloaded
wplt.create_element_used_count_array('reference')

print("\nMost cited work within the dataset:")
print(wplt.element_count_df.head(3))
Loading dataframe of works of the concept C66204764
Loading the list of entities from a parquet file...

First entities in the dataset:
                                 id                                              title
0  https://openalex.org/W2101946146  Asset Stock Accumulation and Sustainability of...
1  https://openalex.org/W1999167944  Planetary boundaries: Guiding human developmen...
2  https://openalex.org/W2122266551  Agricultural sustainability and intensive prod... 

Getting name of C66204764 from the OpenAlex API (cache disabled)...
Creating the works references count of works C66204764...

Most cited work within the dataset:
                                  C66204764 Sustainability
element                                                   
https://openalex.org/W2026816730                       262
https://openalex.org/W2096885696                       249
https://openalex.org/W2103847341                       203
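
The ranking in element_count_df is itself a pandas DataFrame indexed by the referenced work IDs, so it can be exported like any other dataset. A minimal sketch:

# save the 20 most cited references to a CSV file for further use
wplt.element_count_df.head(20).to_csv("most_cited_references_sustainability.csv")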

Concepts yearly count

In this example, we will create two datasets: one with the articles about sustainability of the SRC (Stockholm Resilience Centre) and one with the articles about sustainability of the UTT (University of Technology of Troyes).

We will then plot the yearly usage of the concept sustainability by these institutions (in this case it's equal to the number of articles in the dataset, as the dataset contains only the articles about sustainability).

We could also plot the yearly usage of other concepts or of the references by changing the parameters of the functions create_element_used_count_array() and get_figure_time_series_element_used_by_entities().

from openalex_analysis.plot import WorksPlot

concept_sustainability_id = 'C66204764'
# create the filter for the API to get only the articles about sustainability
sustainability_concept_filter = {"concepts": {"id": concept_sustainability_id}}

# set the years we want to count
count_years = list(range(2004, 2024))

institution_ids_list = ["I138595864", "I140494188"]
institution_names_list = ["Stockholm Resilience Centre", "University of Technology of Troyes"]

# create a list of dictionaries, with each dictionary containing the ID and the filter for each institution
entities_ref_to_count = [None] * len(institution_ids_list)
for i in range(len(institution_ids_list)):
    entities_ref_to_count[i] = {'entity_from_id': institution_ids_list[i],
                                'extra_filters': sustainability_concept_filter}

wplt = WorksPlot()
wplt.create_element_used_count_array('concept', entities_ref_to_count, count_years = count_years)

wplt.add_statistics_to_element_count_array(sort_by = 'sum_all_entities')

wplt.get_figure_time_series_element_used_by_entities().write_image("Plot_yearly_usage_sustainability_SRC_UTT.svg", width=900, height=350)

wplt.get_figure_time_series_element_used_by_entities()

Plot of the yearly usage of sustainability by SRC and UTT
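
As noted above, the same workflow can count other elements, such as the references, by changing the parameters. A sketch of the references variant, reusing entities_ref_to_count and count_years from the example above and assuming the call signature stays the same as in the concept case (the parameters of get_figure_time_series_element_used_by_entities() may also need adjusting, as mentioned above):

from openalex_analysis.plot import WorksPlot

# count the references used each year instead of the concept
# ('reference' is the element type used in the basic analysis example above)
wplt = WorksPlot()
wplt.create_element_used_count_array('reference', entities_ref_to_count, count_years = count_years)

wplt.add_statistics_to_element_count_array(sort_by = 'sum_all_entities')

wplt.get_figure_time_series_element_used_by_entities()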

Configure the library

By default, the library runs out of the box. Nevertheless, some optional configuration can be done to improve performance and to best fit your use case.

Setting the email address allows you to use the polite pool from OpenAlex, which is faster than the default one.

from openalex_analysis.plot import config, InstitutionsPlot

config.email = "email@example.com"

InstitutionsPlot()

The documentation and the notebook setup_and_settings.ipynb contain more setup examples.

Default settings

config.email = None
config.api_key = None
config.openalex_url = "https://api.openalex.org"
config.disable_tqdm_loading_bar = False
config.n_max_entities = 10000
config.project_datas_folder_path = "data"
config.parquet_compression = "brotli"
config.max_storage_percent = 95
  • email The email address is needed to access the polite pool from OpenAlex, which is faster than the default one.

  • api_key Optional, if you have one from OpenAlex

  • openalex_url OpenAlex URL

  • disable_tqdm_loading_bar If set to True, it will disable the loading bar in the terminal output when downloading data from the OpenAlex API.

  • n_max_entities When downloading a list of entities from the API (e.g. a list of works), the maximum number of entities to download. Set to None for no limit. This number must be a multiple of 200 (this is the number of elements per page used by the library).

  • project_datas_folder_path Path to store the data downloaded from the API. The data will be stored as parquet files, with each file corresponding to one request.

  • parquet_compression By default, the parquet files are compressed. Compression can be disabled by setting parquet_compression = None. For other parquet compression algorithms, see the pandas documentation. Compression reduces the file size by a factor of 2 to 10 while requiring negligible time to compress or decompress. Disabling compression is useful if you want to read the parquet files with external software.

  • max_storage_percent Maximum storage usage percentage on the disk before starting to delete data stored in project_datas_folder_path. The parquet files with the oldest last-read date will be deleted first.
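
As an example, several of these settings can be combined in one place before creating any analysis or plot object. A minimal sketch (the option names come from the default settings above; the values are only illustrative):

from openalex_analysis.plot import config, WorksPlot

config.email = "email@example.com"          # use the OpenAlex polite pool
config.n_max_entities = 2000                # download at most 2000 entities per list (multiple of 200)
config.project_datas_folder_path = "data"   # folder where the parquet caches are stored
config.parquet_compression = "brotli"       # default parquet compression
config.max_storage_percent = 95             # delete the oldest caches above this disk usage

WorksPlot()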

Tests

In the directory tests/ run:

pytest tests.py

Build the documentation

In the directory sphinx-doc/ run:

make html

Other resources

Romain Thomas 2024
