Python library to get networks from the OpenAlex API

OpenAlex Networks (openalexnet)

OpenAlex Networks is a helper library and standalone command-line application to process and obtain data from the OpenAlex dataset via API. It also provides functionality to generate citation and coauthorship networks from queries.

Installation

Install using pip

pip install openalexnet

or from source:

pip install git+https://github.com/filipinascimento/openalexnet.git

Usage as command-line application

After installing openalexnet, you can use the command:

python -m openalexnet

or simply

openalexnet

This should print a help message with the available commands and options.

You can make your first query by using:

openalexnet -t works -f "author.id:A2420755856,is_paratext:false,type:journal-article" -s "complex" -r "cited_by_count:desc" -o works.jsonl -c citation_network.gml -a coauthorship_network.gml

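This will get all the journal articles by H. Eugene Stanley (A2420755856) that contain the word "complex", sorted by the number of citations (in descending order). The works are saved to works.jsonl, and the citation and coauthorship networks are saved as GML files.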

For more details about the interface, check the following sections.

Querying the OpenAlex API

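The queries have four main parameters, all of which appear in the example above: the entity type (-t), the filter (-f), the search term (-s), and the sort order (-r).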

In addition to the query parameters, the user can provide the maximum number of entities to be retrieved by using the parameter maxentities (-m), set to 10000 by default. Use -1 to retrieve all entities. Example: -m 100 or -m -1.
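
As a rough illustration of how these parameters fit together, the sketch below assembles a request URL for the public OpenAlex REST API using only the standard library. The filter, search, sort and per-page URL parameters are the ones documented by OpenAlex; the helper names build_filter and build_query_url are hypothetical and are not part of openalexnet, whose internals may differ.

```python
from urllib.parse import urlencode

API_ROOT = "https://api.openalex.org"


def build_filter(filters):
    """Join a dict of filters into the 'key:value,key:value' form."""
    return ",".join(f"{key}:{value}" for key, value in filters.items())


def build_query_url(entity_type, filters=None, search=None, sort=None, per_page=25):
    """Assemble a query URL; parameters left as None are omitted."""
    params = {}
    if filters:
        params["filter"] = build_filter(filters)
    if search:
        params["search"] = search
    if sort:
        params["sort"] = sort
    params["per-page"] = per_page
    # urlencode percent-escapes characters such as ':' and ','
    return f"{API_ROOT}/{entity_type}?{urlencode(params)}"


# Hypothetical usage mirroring the command-line example above
url = build_query_url(
    "works",
    filters={"author.id": "A2420755856", "type": "journal-article"},
    search="complex",
    sort="cited_by_count:desc",
)
print(url)
```

The filter string produced by build_filter has the same shape as the value passed to -f on the command line.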

Note that OpenAlex recommends downloading and processing the snapshots of the dataset instead of using the API if you plan to download a large portion of the complete dataset.

JSON Lines output

The output can be saved to a JSON Lines file (each line containing a JSON entry) by passing the argument --outputfile (-o). Example: -o works.jsonl.
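
For readers who want to process such a file outside openalexnet, here is a minimal sketch of reading and writing JSON Lines with the standard library. It assumes the file is UTF-8 encoded with one JSON object per line; read_json_lines and write_json_lines are hypothetical helper names, and the two toy records stand in for real OpenAlex works.

```python
import json
import os
import tempfile


def write_json_lines(records, path):
    """Write each record as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")


def read_json_lines(path):
    """Yield one decoded JSON object per non-empty line of the file."""
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


# Round trip with toy records in a temporary directory
records = [{"id": "W1", "title": "First"}, {"id": "W2", "title": "Second"}]
path = os.path.join(tempfile.mkdtemp(), "works.jsonl")
write_json_lines(records, path)
loaded = list(read_json_lines(path))
print(len(loaded), "records loaded")
```

Because read_json_lines is a generator, a large file can be processed one record at a time without loading it fully into memory.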

Aggregating queries

It is also possible to combine several queries by providing a .csv or .tsv file with the queries. The file should have the following columns: filter, search, sort and maxentities. Missing columns will be filled with the default values. The output will contain the aggregated results of all the queries. Example: openalexnet -q queries.csv for a file queries.csv with the following content:

filter,search,sort,maximum_entities
"type:journal-article","""complex networks""","cited_by_count:desc",10000
"type:journal-article","""network science""","cited_by_count:desc",10000

This should retrieve up to 10000 of the most cited works for each of the two queries, one matching "complex networks" and the other "network science". The folder Examples/query_files/ provides more examples of query files.
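
The tripled double quotes in the search column are ordinary CSV escaping: a doubled quote inside a quoted field stands for one literal quote, so the search value itself is wrapped in quotes (which OpenAlex appears to treat as an exact-phrase search). The sketch below parses the example content with Python's csv module to show the decoded values; it only illustrates the file format and is not how openalexnet reads the file.

```python
import csv
import io

# The example query file from above, embedded as a string
QUERY_FILE = '''filter,search,sort,maximum_entities
"type:journal-article","""complex networks""","cited_by_count:desc",10000
"type:journal-article","""network science""","cited_by_count:desc",10000
'''

# DictReader maps each row to the column names in the header line
rows = list(csv.DictReader(io.StringIO(QUERY_FILE)))

for row in rows:
    # The decoded search value keeps its surrounding double quotes
    print(row["filter"], row["search"], row["sort"])
```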

Generating networks

The command-line application can also generate citation and coauthorship networks from the retrieved entities. The networks can be saved in 3 different formats: .edgelist, .gml, or .xnet. The citation network can be generated by providing the argument --citationfile (-c), with the parameter being the file path where the network should be saved. The extension of the file will determine the format. Example: -c citation_network.gml. Similarly, the coauthorship network can be generated by providing the argument --coauthorfile (-a). Example: -c citation_network.gml -a coauthorship_network.gml.
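
To make the idea of a citation network concrete, here is a minimal sketch of how directed edges could be derived from a list of works. It assumes each work carries an id and a referenced_works list, as OpenAlex work objects do, and that only citations between retrieved works are kept. The citation_edges function is hypothetical and is not openalexnet's implementation.

```python
def citation_edges(works):
    """Return (citing_id, cited_id) pairs among the given works only."""
    known_ids = {work["id"] for work in works}
    edges = []
    for work in works:
        for cited_id in work.get("referenced_works", []):
            # Skip references to works outside the retrieved set
            if cited_id in known_ids:
                edges.append((work["id"], cited_id))
    return edges


# Toy records; real OpenAlex IDs are longer
works = [
    {"id": "W1", "referenced_works": ["W2", "W9"]},
    {"id": "W2", "referenced_works": []},
    {"id": "W3", "referenced_works": ["W1", "W2"]},
]
edges = citation_edges(works)
print(edges)
```

The reference from W1 to W9 is dropped in this sketch because W9 is not among the retrieved works.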

Attributes of works can be selected to be exported in the network by providing the argument --keptattributes (-k). The attributes should be comma-separated. Example: -k "id,title,doi".

By default the following properties are exported in the network:

id, doi, title, display_name, publication_year, publication_date, type, authorships, concepts, host_venue

The parameter --ignoreattributes (-g) can be used to ignore some of the default attributes. Example: -g "authorships,concepts,host_venue".

For the case of coauthorship networks, the user can provide two extra parameters:

  • --no_simplenetworks (-n): If enabled, the coauthorship network edges will not be aggregated, so authors who share several papers are connected by multiple edges. Disabled by default.
  • --countweights (-w): If enabled, the coauthorship network will have non-normalized weights, i.e., each paper contributes 1.0 to the weight of a connection. Otherwise, each paper contributes the inverse of its number of authors. Disabled by default.
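
The two weighting modes can be sketched as follows. This is an illustration of the rule described above, not openalexnet's code: the coauthorship_weights function and its count_weights flag are hypothetical, and each paper is reduced to a plain list of author IDs.

```python
from collections import defaultdict
from itertools import combinations


def coauthorship_weights(papers, count_weights=False):
    """Aggregate edge weights between every pair of coauthors.

    papers: list of author-ID lists, one list per paper.
    """
    weights = defaultdict(float)
    for authors in papers:
        unique_authors = sorted(set(authors))
        if len(unique_authors) < 2:
            continue  # a single-author paper creates no edge
        # 1.0 per paper, or the inverse of the number of authors
        contribution = 1.0 if count_weights else 1.0 / len(unique_authors)
        for pair in combinations(unique_authors, 2):
            weights[pair] += contribution
    return dict(weights)


papers = [["A1", "A2"], ["A1", "A2", "A3", "A4"]]
normalized = coauthorship_weights(papers)
counted = coauthorship_weights(papers, count_weights=True)
print(normalized[("A1", "A2")], counted[("A1", "A2")])
```

In this toy input, A1 and A2 share a two-author paper (contribution 1/2) and a four-author paper (contribution 1/4), so their normalized weight is 0.75, while the count-based weight is 2.0.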

If the .edgelist format is used, extra CSV files with the node and edge attributes will be generated alongside the network file, using its name with the suffixes _nodes.csv and _edges.csv.

Loading from saved JSON Lines files

The command-line application can also load the JSON Lines files generated by the API and generate the networks. This can be done by providing the argument --inputfile (-i). Example: -i works.jsonl -c citation_network.gml -a coauthorship_network.gml.

Polite mode

Finally, users can use the polite mode by providing an email address using --email (-e). See https://docs.openalex.org/how-to-use-the-api/ for more information.
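
At the level of the raw API, OpenAlex identifies polite-pool requests by an email address, which can be passed in a mailto URL parameter. The sketch below shows that parameter being added to a query; with_mailto is a hypothetical helper, you@example.com is a placeholder address, and how openalexnet forwards the --email value internally is not shown here.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def with_mailto(params, email=None):
    """Return a copy of params, adding 'mailto' when an email is given."""
    merged = dict(params)
    if email:
        merged["mailto"] = email
    return merged


params = with_mailto({"search": "complex networks"}, email="you@example.com")
url = "https://api.openalex.org/works?" + urlencode(params)

# Decode the query string again to check what would be sent
sent = parse_qs(urlsplit(url).query)
print(sent["mailto"])
```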

Example usage

To obtain the works with the term "complex networks" (in abstracts, titles or full texts), sorted by the number of citations, and generate GML files for the citation and coauthorship networks:

openalexnet -t works -f "type:journal-article" -s "complex networks" -r "cited_by_count:desc" -o works.jsonl -c citation_network.gml -a coauthorship_network.gml

Note that because maxentities is not provided, only the 10000 most cited works will be obtained.

To load the saved works.jsonl file and generate the networks:

openalexnet -t works -i works.jsonl -c citation_network.edgelist -a coauthorship_network.edgelist

Use a query file to retrieve works and save them to a JSON Lines file:

openalexnet -t works -q query.csv -o works.jsonl

Python Library Usage

Obtaining works from a specific author:

    import openalexnet as oanet
    from tqdm.auto import tqdm

    filterData = {
        "author.id": "A2420755856", # H. Eugene Stanley
        "is_paratext": "false",  # Only works, no paratexts (https://en.wikipedia.org/wiki/Paratext)
        "type": "journal-article", # Only journal articles
        "from_publication_date": "2000-01-01" # Published on or after 2000-01-01
    }

    entityType = "works"

    # Add your email to accelerate the API calls. See https://openalex.org/api
    openalex = oanet.OpenAlexAPI()

    # Query the API for the entities matching the filter
    entities = openalex.getEntities(entityType, filter=filterData)

    # Collect the entities while showing a progress bar
    entitiesList = []
    for entity in tqdm(entities,desc="Retrieving entries"):
        entitiesList.append(entity)

    # Saving data as json lines (each line is a json object)
    oanet.saveJSONLines(entitiesList,"works_filtered.jsonl")
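
Once the entities are collected in a list, they can be summarized with plain Python. As a small sketch, the snippet below counts works per publication year, assuming each entity is a dict with a publication_year field (one of the default exported attributes). The works_per_year helper is hypothetical and the sample records are toy data, not real API output.

```python
from collections import Counter


def works_per_year(entities):
    """Count entities by publication year, skipping those without one."""
    return Counter(
        entity["publication_year"]
        for entity in entities
        if entity.get("publication_year") is not None
    )


# Toy stand-in for a list of retrieved works
sample = [
    {"id": "W1", "publication_year": 2001},
    {"id": "W2", "publication_year": 2001},
    {"id": "W3", "publication_year": 2010},
    {"id": "W4"},  # no year recorded
]
counts = works_per_year(sample)
for year, total in sorted(counts.items()):
    print(year, total)
```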

Check the Examples folder for more examples.

Coming soon

  • Full API documentation
  • More examples
  • Unit tests
  • Group count

Google Colaboratory Demo/Tutorial

You can access a Google Colab demo and tutorial by using the following link. Open In Colab

Thanks

Remember to cite the OpenAlex work:

@article{priem2022openalex,
  title={OpenAlex: A fully-open index of scholarly works, authors, venues, institutions, and concepts},
  author={Priem, Jason and Piwowar, Heather and Orr, Richard},
  journal={arXiv preprint arXiv:2205.01833},
  year={2022}
}

If you use this code, please give it a star and share it with your colleagues. Also stay tuned, as I plan to develop a web-based interface for dynamic visualization of OpenAlex networks. Check out Helios-Web to see the development progress of our network visualization tools.
