Skip to main content

Efficient relational database queries over the entire Crossref abnd ORCID data sets

Project description

Alexandria3k

The alexandria3k package supplies a library and a command-line tool providing efficient relational query access the entire Crossref data set (157 GB compressed, 1 TB uncompressed). This contains publication metadata from about 134 million publications from all major international publishers with full citation data for 60 million of them. The alexandria3k package installation contains all elements required to run it. It does not require the installation, configuration, and maintenance of a third party relational or graph database. It can therefore be used out-of-the-box for performing reproducible publication research on the desktop.

In addition, the Crossref data set can be linked with the ORCID summary data set (25 GB compressed, 435 GB uuncompressed), containing about 78 million author records, as well as data sets of funder bodies, journal names, open access journals, and research organizations.

Installation

The easiest way to install alexandria3k and its dependencies is through PyPI:

python3 -m pip install --use-pep517 alexandria3k

or

pip install --use-pep517 alexandria3k

Data downloading

Crossref data

The main data set on which alexandria3k operates is that of Crossref, comprising about 130 million publication metadata records with more than 1.7 billion reference records and 359 million author records. You can obtain the data from Acedemic Torrents using a torrent client of your choice. Below is an example of commands that download the April 2022 Crossref snapshot (the one on which alexandria3k has been tested) using the aria2c download utility.

# Download Crossref data (168 GB) through the torrent link
aria2c https://doi.org/10.13003/83b2gq

Currently, the Crossref data set is split into about 26 thousand compressed files, each containing JSON data for 3000 publications (works). Alexandria3k provides a relational view of these data, and also allows the sampling of a subset of the container files to quickly experiment with queries, before they are run on the complete set.

ORCID data

You can populate a database with data regarding authors (URLs, countries, external identifiers, education, employment, works, etc.) from the ORCID initiative. For this you need to download the summary file of the ORCID Public Data File (e.g. ORCID_2022_10_summaries.tar.gz — 25GB) through Figshare. Note that alexandria3k works on the compressed file; there is no need to expand it (it expands to about 0.5 PB).

Other data sources

The alexandria3k system can also add to a database the following tables. You can download the data and point alexandria3k to the corresponding file, or, because the data sets are relatively small, you can provide alexandria3k the URL from which it will directly stream the data to populate the database.

Use overview

After downloading the Crossref data you can use alexandria3k through its Python API or as a command-line tool. These are the things you can do with alexandria3k.

  • Directly run ad hoc SQL queries on the Crossref data
  • Populate SQLite databases with Crossref, ORCID, DOAJ, and other data
    • Select a horizontal subset of Crossref records
      • Through an SQL expression
      • By sampling a subset of the 26 thousand containers in the data set
    • Select a horizontal subset of ORCID records by only loading those associated with already populated Crossref records
    • Select a vertical subset of Crossref or ORCID columns
      • Using the Table.Column or Table.* notation

Populating a database can take minutes (for a small, e.g. experimental, subset), a few hours (to traverse the whole Crossref data set and obtain a few thousands of records), or a couple of days (to produce a large set, e.g. by selecting some columns).

After your populate an SQLite database and create suitable indexes, SQL queries often run in seconds.

Relational schema

The complete schema of a fully-populated database appears below. Through this link you can see the schemata of individual data sets. You can also follow the linked image to the "Raw" file to see it in full size. Direct SQL queries can also be performed on the Crossref data set.

Database schema

Queries involving multiple scans of the tables (e.g. relational joins) should be performed by directing alexandria3k to perform them separately in each partition. This however means that aggregation operations will not work as expected, because they will be run multiple times (once for every partition).

Command line execution

You can find many example of studies conducted with command-line invocations in the examples directory. Consider using the hello world example as a starting point.

Obtain list of command-line options

alexandria3k --help

Show DOI and title of all publications

alexandria3k --data-source Crossref 'April 2022 Public Data File from Crossref' \
   --query 'SELECT DOI, title FROM works'

Save DOI and title of 2021 publications in a CSV file suitable for Excel

alexandria3k --data-source Crossref 'April 2022 Public Data File from Crossref' \
  --query 'SELECT DOI, title FROM works WHERE published_year = 2021' \
  --output 2021.csv \
  --output-encoding use utf-8-sig

Count Crossref publications by year and type

This query performs a single pass through the data set to obtain the number of Crossref publications by year and publication type.

alexandria3k --data-source Crossref 'April 2022 Public Data File from Crossref' \
   --query-file count-year-type.sql >results.csv

where count-year-type.sql contains:

WITH counts AS (
  SELECT
    published_year AS year,
    type,
    Count(*) AS number
FROM   works
    GROUP by published_year, type)

SELECT year AS name, Sum(number) FROM counts
  GROUP BY year
UNION
SELECT type AS name, Sum(number) FROM counts
  GROUP BY type

Sampling

The following command counts the number of publication that have or do not have an abstract in an approximately 1% sample of the data set's containers. It uses a tab character (\t) to separate the output fields. Through sampling the data containers it runs in a couple of minutes, rather than hours.

alexandria3k --data-source Crossref 'April 2022 Public Data File from Crossref'  \
   --sample 'random.random() < 0.01' \
   --field-separator $'\t' \
   --query-file count-no-abstract.sql

where count-no-abstract.sql contains:

SELECT works.abstract is not null AS have_abstract, Count(*)
  FROM works GROUP BY have_abstract

For quick experiments, e.g. for verifying the queries of a full run, consider sampling just three containers with --sample 'random.random() < 0.0002'.

Database of COVID research

The following command creates an SQLite database with all Crossref data regarding publications that contain "COVID" in their title or abstract.

alexandria3k --data-source Crossref 'April 2022 Public Data File from Crossref' \
   --populate-db-path covid.db \
   --row-selection "title like '%COVID%' OR abstract like '%COVID%' "

Publications graph

The following command selects only a subset of columns of the complete Crossref data set to create a graph between navigable entities.

alexandria3k --data-source Crossref 'April 2022 Public Data File from Crossref' \
   --populate-db-path graph.db \
   --columns works.doi work_references.work_id work_references.doi work_funders.id \
    work_funders.work_id work_funders.doi funder_awards.funder_id funder_awards.name \
    author_affiliations.author_id author_affiliations.name work_subjects.work_id work_subjects.name \
    work_authors.id work_authors.work_id work_authors.orcid

Through this data set you can run on the database queries such as the following.

SELECT COUNT(*) FROM works;
SELECT COUNT(*) FROM (SELECT DISTINCT work_id FROM works_subjects);
SELECT COUNT(*) FROM (SELECT DISTINCT work_id FROM work_references);
SELECT COUNT(*) FROM affiliations_works;
SELECT COUNT(*) FROM (SELECT DISTINCT work_id FROM work_funders);

SELECT COUNT(*) FROM work_authors;
SELECT COUNT(*) FROM work_authors WHERE orcid is not null;
SELECT COUNT(*) FROM (SELECT DISTINCT orcid FROM work_authors);

SELECT COUNT(*) FROM authors_affiliations;
SELECT COUNT(*) FROM affiliation_names;

SELECT COUNT(*) FROM works_subjects;
SELECT COUNT(*) FROM subject_names;

SELECT COUNT(*) FROM work_funders;
SELECT COUNT(*) FROM funder_awards;

SELECT COUNT(*) FROM work_references;

Record selection from external database

The following command creates an SQLite database with all Crossref data of works whose DOI appears in the attached database named selected.db.

alexandria3k --data-source Crossref 'April 2022 Public Data File from Crossref' \
   --populate-db-path selected-works.db \
   --attach-databases 'attached:selected.db' \
   --row-selection "EXISTS (SELECT 1 FROM attached.selected_dois WHERE works.doi = selected_dois.doi)"

Populate the database with author records from ORCID

Only records of authors identified in the publications through an ORCID will be added.

alexandria3k --populate-db-path database.db \
  --data-source ORCID ORCID_2022_10_summaries.tar.gz \
  --linked-records persons

Populate the database with journal names

alexandria3k --populate-db-path database.db \
  --data-source journal-names http://ftp.crossref.org/titlelist/titleFile.csv

Populate the database with funder names

alexandria3k --populate-db-path database.db \
  --data-source funder-names https://doi.crossref.org/funderNames?mode=list

Work with Scopus All Science Journal Classification Codes (ASJC)

# Populate database with ASJCs
alexandria3k --populate-db-path database.db --data-source ASJC

# Link the (sometime previously populated works table) with ASJCs
alexandria3k --populate-db-path database.db --execute link-works-asjcs

Populate the database with data regarding open access journals

alexandria3k --populate-db-path database.db \
  --data-source DOAJ https://doaj.org/csv

Populate the database with the names of research organizations

Populate the research organization registry (ROR) tables.

# Fetch the ROR data file (~21 MB)
wget -O ror-v1.17.1.zip \
  "https://zenodo.org/record/7448410/files/v1.17.1-2022-12-16-ror-data.zip?download=1"

# Populate the database
alexandria3k --populate-db-path database.db \
  --data-source ROR ror-v1.17.1.zip

Link author affiliations with research organization names

Given a database already populated with work author affiliations and the research organization registry fill-in the table work_authors_rors linking the two.

# Link affiliations with best match
alexandria3k --populate-db-path database.db --execute link-aa-base-ror

# Link affiliations with top parent of best match
alexandria3k --populate-db-path database.db --execute link-aa-top-ror

After linking, the results' quality can be verified with queries such as the following.

-- Display affiliation matches
SELECT author_affiliations.name, research_organizations.name FROM
  work_authors
  INNER JOIN author_affiliations
    ON work_authors.id = author_affiliations.author_id
  INNER JOIN work_authors_rors
    ON work_authors_rors.work_author_id = work_authors.id
  INNER JOIN research_organizations
    ON research_organizations.id = work_authors_rors.ror_id;

-- Display unmatched affiliations
SELECT author_affiliations.name FROM
  work_authors
  INNER JOIN author_affiliations
    ON work_authors.id = author_affiliations.author_id
  LEFT JOIN work_authors_rors
    ON work_authors_rors.work_author_id = work_authors.id
  WHERE work_authors_rors.ror_id is null;

Command-line options reference

usage: alexandria3k [-h] [-a ATTACH_DATABASES [ATTACH_DATABASES ...]]
                    [-c COLUMNS [COLUMNS ...]] [-D DEBUG [DEBUG ...]]
                    [-d DATA_SOURCE [DATA_SOURCE ...]] [-E OUTPUT_ENCODING]
                    [-F FIELD_SEPARATOR] [-H] [-i [INDEX ...]]
                    [-L LIST_SCHEMA] [-l LINKED_RECORDS] [-n] [-o OUTPUT] [-P]
                    [-p POPULATE_DB_PATH] [-Q QUERY_FILE] [-q QUERY]
                    [-R ROW_SELECTION_FILE] [-r ROW_SELECTION] [-s SAMPLE]
                    [-x EXECUTE]

alexandria3k: Publication metadata interface

optional arguments:
  -h, --help            show this help message and exit
  -a ATTACH_DATABASES [ATTACH_DATABASES ...], --attach-databases ATTACH_DATABASES [ATTACH_DATABASES ...]
                        Databases to attach for the row selection query
  -c COLUMNS [COLUMNS ...], --columns COLUMNS [COLUMNS ...]
                        Columns to populate using table.column or table.*
  -D DEBUG [DEBUG ...], --debug DEBUG [DEBUG ...]
                        Output debuggging information as specfied by the
                        arguments. files-read: Output counts of data files
                        read; link: Record linking operations; log-sql: Output
                        executed SQL statements; perf: Output performance
                        timings; populated-counts: Dump counts of the
                        populated database; populated-data: Dump the data of
                        the populated database; populated-reports: Output
                        query results from the populated database; progress:
                        Report progress; stderr: Log to standard error;
                        virtual-counts: Dump counts of the virtual database;
                        virtual-data: Dump the data of the virtual database.
  -d DATA_SOURCE [DATA_SOURCE ...], --data-source DATA_SOURCE [DATA_SOURCE ...]
                        Specify data set to be processed and its source. The
                        following data sets are supported: ASJC [<CSV-file> |
                        <URL>] (defaults to internal table); Crossref
                        <container-directory>; DOAJ [<CSV-file> | <URL>]
                        (defaults to https://doaj.org/csv); funder-names
                        [<CSV-file> | <URL>] (defaults to
                        https://doi.crossref.org/funderNames?mode=list);
                        journal-names [<CSV-file> | <URL>] (defaults to
                        http://ftp.crossref.org/titlelist/titleFile.csv);
                        ORCID <summaries.tar.gz-file> ROR <zip-file>;
  -E OUTPUT_ENCODING, --output-encoding OUTPUT_ENCODING
                        Query output character encoding (use utf-8-sig for
                        Excel)
  -F FIELD_SEPARATOR, --field-separator FIELD_SEPARATOR
                        Character to use for separating query output fields
  -H, --header          Include a header in the query output
  -i [INDEX ...], --index [INDEX ...]
                        SQL expressions that select the populated rows
  -L LIST_SCHEMA, --list-schema LIST_SCHEMA
                        List the schema of the specified database. The
                        following names are supported: Crossref, ORCID, ROR,
                        other, all
  -l LINKED_RECORDS, --linked-records LINKED_RECORDS
                        Only add ORCID records that link to existing <persons>
                        or <works>
  -n, --normalize       Normalize relations in the populated Crossref database
  -o OUTPUT, --output OUTPUT
                        Output file for query results
  -P, --partition       Run the query over partitioned data slices. (Warning:
                        arguments are run per partition.)
  -p POPULATE_DB_PATH, --populate-db-path POPULATE_DB_PATH
                        Populate the SQLite database in the specified path
  -Q QUERY_FILE, --query-file QUERY_FILE
                        File containing query to run on the virtual tables
  -q QUERY, --query QUERY
                        Query to run on the virtual tables
  -R ROW_SELECTION_FILE, --row-selection-file ROW_SELECTION_FILE
                        File containing SQL expression that selects the
                        populated rows
  -r ROW_SELECTION, --row-selection ROW_SELECTION
                        SQL expression that selects the populated rows
  -s SAMPLE, --sample SAMPLE
                        Python expression to sample the Crossref tables (e.g.
                        random.random() < 0.0002)
  -x EXECUTE, --execute EXECUTE
                        Operation to execute on the data. This can be one of:
                        link-aa-base-ror (link author affiliations to base-
                        level research organizations); link-aa-top-ror (link
                        author affiliations to top-level research
                        organizations); link-works-asjcs (link works with
                        Scopus All Science Journal Classification Codes —
                        ASJCs).

Python API

After downloading the Crossref data, the functionality of alexandria3k can be accessed through its Python API, either interactively (for exploratory data analytics) or through Python scripts (for long-running jobs and for documenting research methods as repeatable processes).

Create a Crossref object

Crossref functionality is accessed by means of a corresponding object created by specifying the data directory.

from alexandria3k.crossref import Crossref

crossref_instance = Crossref('April 2022 Public Data File from Crossref')

You can also add a parameter indicating how to sample the containers.

from random import random, seed

from alexandria3k.crossref import Crossref

# Randomly (but deterministically) sample 1% of the containers
seed("alexandria3k")
crossref_instance = Crossref('April 2022 Public Data File from Crossref',
  lambda _name: random() < 0.01)

Iterate through the DOI and title of all publications

for (doi, title) in crossref_instance.query('SELECT DOI, title FROM works'):
    print(doi, title)

Create a dictionary of which 2021 publications were funded by each body

Here partition=True is passed to the query method in order to have the query run separately (and therefore efficiently) on each Crossref data container.

from collections import defaultdict

works_by_funder = defaultdict(list)

for (funder_doi, work_doi) in crossref_instance.query(
    """
   SELECT work_funders.doi, works.doi FROM works
       INNER JOIN work_funders on work_funders.work_id = works.id
       WHERE published_year = 2021
    """,
    partition=True,
):
    works_by_funder[funder_doi].append(work_doi)

Database of COVID research

The following command creates an SQLite database with all Crossref data regarding publications that contain "COVID" in their title or abstract.

crossref_instance.populate(
    "covid.db", condition="title like '%COVID%' OR abstract like '%COVID%'"
)

Reference graph

The following command populates an SQLite database by selecting only a subset of columns of the complete Crossref data set to create a navigable graph between publications and their references.

crossref_instance.populate(
    "references.db",
    columns=[
        "works.id",
        "works.doi",
        "work_references.work_id",
        "work_references.doi",
    ],
    condition="work_references.doi is not null",
)

Record selection from external database

The following command creates an SQLite database with all Crossref data of works whose DOI appears in the attached database named selected.db.

crossref_instance.populate(
    "selected-works.db",
    condition="EXISTS (SELECT 1 FROM attached.selected_dois WHERE works.doi = selected_dois.doi)",
    ["attached:selected.db"]
)

Populate the database from ORCID

Add tables containing author country and education organization. Only records of authors identified in the publications through an ORCID will be added.

from alexandria3k import orcid

orcid.populate(
    "database.db",
    "ORCID_2022_10_summaries.tar.gz",
    columns=[
        "person_countries.*",
        "person_educations.orcid",
        "person_educations.organization_name",
    ],
    authors_only=True,
    works_only=False
)

Populate the database with journal names

from alexandria3k import csv_sources

csv_sources.populate_journal_names(
    "database.db",
    "http://ftp.crossref.org/titlelist/titleFile.csv"
)

Populate the database with funder names

from alexandria3k import csv_sources

csv_sources.populate_funder_names(
    "database.db",
    "https://doi.crossref.org/funderNames?mode=list"
)

Populate the database with data regarding open access journals

from alexandria3k import csv_sources

csv_sources.populate_open_access_journals(
    "database.db",
    "https://doaj.org/csv"
)

Work with Scopus All Science Journal Classification Codes (ASJC)

from alexandria3k import csv_sources

# Populate database with ASJCs
csv_sources.populate_asjc("database.db", "resource:data/asjc.csv")

# Link the (sometime previously populated works table) with ASJCs
csv_sources.link_works_asjcs("database.db")

Populate the database with the names of research organizations

from alexandria3k import ror

ror.populate(
    "v1.17.1-2022-12-16-ror-data.zip",
    "database.db"
)

Link author affiliations with research organization names

from alexandria3k import ror

# Link affiliations with best match
ror.link_author_affiliations(args.populate_db_path, link_to_top=False)

# Link affiliations with top parent of best match
ror.link_author_affiliations(args.populate_db_path, link_to_top=True)

Development

For software development purposes alexandria3k can also be installed and used through its Github repository, rather than as a Python package.

Installation

git clone https://github.com/dspinellis/alexandria3k.git
cd alexandra3k/src
pipenv install

CLI use while developing

# While in alexandria3k/src directory

# Install development dependencies
pipenv install --dev

# Launch a shell in the virtual environment
pipenv shell

# Navigate to the top directory
cd ..

# Run the CLI
bin/alexandria3k --help

Testing

Python unit and integration tests

# While in the top-level directory
python3 -m unittest discover

SQL unit tests

To run SQL unit tests install rdbunit and the SQLite command-line tool.

# While in the top-level directory
for t in tests/*.rdbu; do rdbunit --database=sqlite $t | sqlite3 ; done

Formatting

# While in the top-level directory
black -l 79 src/alexandria3k/*.py

Linting

# While in the top-level directory
python -m pylint src/alexandria3k/*.py

Building

# While in the top-level directory
python3 -m build

Package name derivation

The alexandria3k package is named after the Library of Alexandria, indicating how publication data can be processed in the third millenium AD.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

alexandria3k-2.5.0.tar.gz (989.5 kB view hashes)

Uploaded Source

Built Distribution

alexandria3k-2.5.0-py3-none-any.whl (67.8 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page