Scientific literature data exploration analysis

Project description

sciterra: a python library for similarity-based scientometrics

Sciterra is a software library to support data-driven analyses of scientific literature, with a focus on unifying different bibliographic database APIs and document-embedding methods for systematic scientometrics research.

Overview

The main purpose of sciterra is to perform similarity-based retrieval of scientific publications for metascience/scientometrics research. While there are many services that can make the individual steps of this simple, this software library exists to:

  1. Unify the different APIs and vector-based retrieval methods

  2. Support scientometrics analyses of citation dynamics, especially with respect to a vectorized 'landscape' of literature.

Installing sciterra

First, set up a virtual environment (e.g. via miniconda: conda create -n sciterra, then conda activate sciterra).

  1. Install sciterra via git:

    python -m pip install 'sciterra @ git+https://github.com/nathimel/sciterra.git'

  2. Alternatively, download or clone this repository and navigate to the root folder, and install locally:

    pip install -e .

  3. It is not yet recommended because sciterra is still in development, but you can also install via pip from pypi:

    pip install sciterra

You will also need to download a trained pipeline for spacy:

python -m spacy download en_core_web_sm

Optional: If you plan on querying the NASA Astrophysics Data System (ADS), you must have an ADS API key saved at ~/.ads/dev_key. To generate an ADS API key, navigate to the ADS web interface, create and sign in to an ADS account, and go to Settings > API Token.
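Before running anything that queries ADS, it can help to confirm that the key file is where sciterra expects it. The following is a minimal sketch using only the standard library; the helper name ads_key_available is hypothetical and is not part of sciterra, and only the path ~/.ads/dev_key comes from the note above.

```python
from pathlib import Path

# Location stated in the installation note above.
ADS_KEY_PATH = Path.home() / ".ads" / "dev_key"


def ads_key_available(path: Path = ADS_KEY_PATH) -> bool:
    """Return True if the key file exists and contains non-whitespace text.

    Hypothetical helper for illustration; it does not validate the key with ADS.
    """
    return path.is_file() and bool(path.read_text().strip())


if __name__ == "__main__":
    if ads_key_available():
        print("ADS API key file found.")
    else:
        print(f"No usable ADS API key at {ADS_KEY_PATH}.")
```

This only checks that the file exists and is non-empty; whether the token is accepted can only be determined by an actual ADS request.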

Tests

To run all the unit tests for sciterra, found at src/tests, run the following command at the root of the repository:

pytest

This may take up to several hours in total, due to slow API calls in test_cartography and test_tracing.

Note: If you opted not to set up authentication for ADS during installation, the tests in test_librarian.TestADSLibrarian and the test test_tracing.TestExpansion.test_atlas_tracer_ads will fail.

Usage

Atlas

The central object in sciterra is the Atlas. This is a basic data structure for containing scientific publications that are returned from calls to various bibliographic database APIs.

An Atlas minimally requires a list of Publications.

Publication

A Publication object is a minimal wrapper around publication data, and should have a string identifier. It is designed to standardize the basic metadata contained in the results from some bibliographic database API.

from sciterra import Atlas, Publication

atl = Atlas([Publication({"identifier": "id"})])

Alternatively, you can construct an Atlas by passing in a .bib file. The entries in this BibTeX file are parsed for unique identifiers (e.g., DOIs), which are sent in an API call; the results are returned as Publications, which then populate an Atlas.

atl = crt.bibtex_to_atlas(bibtex_filepath)

In the line of code above, the variable crt is an instance of a Cartographer object, which encapsulates the bookkeeping involved in querying a bibliographic database for publications.
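The identifier-parsing step for .bib files described above can be illustrated with a small standalone sketch. This is not how bibtex_to_atlas is implemented; it is a regex-based approximation that assumes each entry stores its DOI in a doi field on its own line, and the name extract_dois is hypothetical.

```python
import re

# Matches fields such as:  doi = {10.1000/xyz}  or  DOI = "10.1000/xyz"
DOI_FIELD = re.compile(
    r'^\s*doi\s*=\s*[{"]([^}"]+)[}"]', re.IGNORECASE | re.MULTILINE
)


def extract_dois(bibtex_text: str) -> list:
    """Return the unique DOIs found in a BibTeX string, in order of appearance."""
    seen = {}  # dict preserves insertion order, so it doubles as an ordered set
    for match in DOI_FIELD.finditer(bibtex_text):
        seen.setdefault(match.group(1).strip(), None)
    return list(seen)


SAMPLE = """
@article{a,
  title = {First},
  doi = {10.1000/abc},
}
@article{b,
  title = {Second},
  DOI = "10.1000/def",
}
@article{c,
  title = {Duplicate of first},
  doi = {10.1000/abc},
}
"""

print(extract_dois(SAMPLE))  # ['10.1000/abc', '10.1000/def']
```

A real BibTeX parser handles nested braces and multi-line fields that this regex does not; the sketch only shows the idea of reducing entries to a deduplicated list of identifiers before querying an API.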

Cartographer

The Cartographer class is so named because it interfaces with an Atlas to build out a library of publications. Since it does so via similarity-based retrieval, the resulting Atlas can be considered a 'region' of publications.

To do this, a Cartographer needs two things: an API with which to interface, and a way of getting document embeddings. Both are encapsulated, respectively, by the Librarian and the Vectorizer classes.

from sciterra import Cartographer
from sciterra.librarians import SemanticScholarLibrarian # or ADSLibrarian
from sciterra.vectorization import SciBERTVectorizer # among others

crt = Cartographer(
    librarian=SemanticScholarLibrarian(),
    vectorizer=SciBERTVectorizer(),
)

Librarian

Each Librarian subclass is designed to be a wrapper for an existing python API service, such as the ads package or the semanticscholar client library.

A Librarian subclass also overrides two methods. The first is get_publications, which takes a list of identifiers, queries the specific API for that Librarian, and returns a list of Publications. Keyword arguments can be passed to specify the metadata that is kept for each publication (e.g. date, title, journal, authors, etc.). The second method is convert_publication, which defines how the result of an API call should be converted to a sciterra Publication object.
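The two-method contract described above can be sketched without any network access. Everything in this block is a stand-in for illustration: the Publication and Librarian classes here are simplified substitutes rather than sciterra's own, FAKE_DB is a hypothetical in-memory replacement for a remote API, and the fields keyword is an assumed way of selecting which metadata to keep.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Publication:  # stand-in, not sciterra's Publication
    identifier: str
    title: Optional[str] = None
    abstract: Optional[str] = None


class Librarian:  # stand-in base class with the two methods to override
    def get_publications(self, identifiers, **kwargs):
        raise NotImplementedError

    def convert_publication(self, record, **kwargs):
        raise NotImplementedError


# Hypothetical in-memory "database" standing in for a remote API.
FAKE_DB = {
    "id1": {"paperId": "id1", "title": "A paper", "abstract": "Some text."},
    "id2": {"paperId": "id2", "title": "Another paper", "abstract": None},
}


class InMemoryLibrarian(Librarian):
    def get_publications(self, identifiers, **kwargs):
        # Query the "API", silently skipping identifiers it does not know.
        records = [FAKE_DB[i] for i in identifiers if i in FAKE_DB]
        return [self.convert_publication(r, **kwargs) for r in records]

    def convert_publication(self, record, fields=("title", "abstract")):
        # Keep only the requested metadata fields from the raw record.
        kept = {f: record.get(f) for f in fields}
        return Publication(identifier=record["paperId"], **kept)


pubs = InMemoryLibrarian().get_publications(["id1", "missing", "id2"])
print([p.identifier for p in pubs])  # ['id1', 'id2']
```

Keeping the conversion logic in convert_publication means the querying code never needs to know the field names of a particular database's response.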

Contributions to sciterra in the form of new Librarian subclasses are encouraged and appreciated.

Vectorizer

Vectorizer subclasses override one method, embed_documents, which takes a list of strings, each representing the text of a publication (currently, just its abstract), and returns the embeddings as an np.ndarray, together with the indices of the documents that could and could not be embedded.

Under the hood, the project method of Cartographer, which is used during similarity-based retrieval, uses the vectorizer roughly as follows

# Get abstracts
docs = [atlas[identifier].abstract for identifier in identifiers]

# Embed abstracts
result = vectorizer.embed_documents(docs)
embeddings = result["embeddings"]

# depending on the vectorizer, sometimes not all embeddings can be obtained due to out-of-vocab issues
success_indices = result["success_indices"] # shape `(len(embeddings),)`
fail_indices = result["fail_indices"] # shape `(len(docs) - len(embeddings),)`

Currently, sciterra has vectorizers using SciBERT, SBERT, GPT-2, Word2Vec, and a simple bag-of-words (BOW) vectorizer that uses the same vocabulary as the Word2Vec vectorizer. Contributions to sciterra in the form of new Vectorizer subclasses are also encouraged and appreciated.
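For anyone considering a new Vectorizer, the following toy example shows the shape of the result that the snippet above reads from embed_documents. It is a sketch under assumptions, not sciterra's BOW vectorizer: the class ToyBOWVectorizer is hypothetical, it does not inherit from sciterra's Vectorizer, and plain Python lists stand in for np.ndarray to keep the example dependency-free.

```python
class ToyBOWVectorizer:
    """Hypothetical bag-of-words vectorizer over a fixed vocabulary."""

    def __init__(self, vocabulary):
        self.index = {word: i for i, word in enumerate(vocabulary)}

    def embed_documents(self, docs):
        embeddings, success, fail = [], [], []
        for i, doc in enumerate(docs):
            counts = [0.0] * len(self.index)
            for token in doc.lower().split():
                j = self.index.get(token)
                if j is not None:
                    counts[j] += 1.0
            if any(counts):
                embeddings.append(counts)
                success.append(i)
            else:
                fail.append(i)  # no in-vocabulary tokens, so no embedding
        return {
            "embeddings": embeddings,
            "success_indices": success,
            "fail_indices": fail,
        }


identifiers = ["p1", "p2", "p3"]
docs = ["dark matter halos", "zzz qqq", "matter matter"]

result = ToyBOWVectorizer(["dark", "matter", "halos"]).embed_documents(docs)

# Map each successful embedding back to the identifier it belongs to.
embedded = {
    identifiers[i]: vec
    for i, vec in zip(result["success_indices"], result["embeddings"])
}
print(embedded)                # {'p1': [1.0, 1.0, 1.0], 'p3': [0.0, 2.0, 0.0]}
print(result["fail_indices"])  # [1]
```

Because the embeddings only cover the successful documents, the success indices are what allow each row to be matched back to its publication.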

Putting it all together

The main use case for all of these ingredients is to iteratively build out a region of publications. This is done using iterate_expand:

from sciterra.mapping.tracing import iterate_expand

# Assuming the initial atlas contains just one publication
(atl.center, ) = atl.publications.keys()
# Build out an atlas to contain 10,000 publications, with increasing dissimilarity
# to the initial publication, saving progress in binary files to the directory "atlas".
iterate_expand(
    atl=atl,
    crt=crt,
    atlas_dir="atlas",
    target_size=10000,
    center=atl.center,
)

This function has a number of keyword arguments for tracking the Atlas expansion, limiting the number of publications per expansion, setting how many times to retry if there are connection issues, etc.

In practice, it may be helpful to use the sciterra.mapping.tracing.AtlasTracer data structure to reduce most of the loading/initialization boilerplate described above. For an example, see main.py.
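To make the idea of building outward from a center publication concrete, here is a self-contained toy version of the loop. It is a deliberately simplified sketch and not sciterra's iterate_expand: the function toy_expand, its n_per_expansion parameter, and the fixed candidate pool are all assumptions for illustration, whereas the real function obtains new candidates through API calls.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def toy_expand(pool, center, target_size, n_per_expansion=2):
    """Grow a list of identifiers outward from `center` by decreasing similarity."""
    atlas = [center]
    while len(atlas) < target_size:
        candidates = [k for k in pool if k not in atlas]
        if not candidates:
            break  # nothing left to add
        # Most similar to the center first.
        candidates.sort(key=lambda k: cosine(pool[center], pool[k]), reverse=True)
        room = target_size - len(atlas)
        atlas.extend(candidates[: min(n_per_expansion, room)])
    return atlas


# Hypothetical 2-d embeddings keyed by identifier.
pool = {
    "center": [1.0, 0.0],
    "near": [0.9, 0.1],
    "mid": [0.5, 0.5],
    "far": [0.0, 1.0],
}

print(toy_expand(pool, "center", target_size=3))  # ['center', 'near', 'mid']
```

Each pass adds the closest remaining candidates, so publications enter the result in order of increasing dissimilarity to the center, and the loop stops at the target size or when the pool is exhausted.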

Additional features

  • The topography submodule contains similarity-based metrics for publications, to support scientometrics analyses.

Acknowledgments

This software is a reimplementation of Zachary Hafen-Saavedra's library, cc.

To cite sciterra, please use the following workshop paper:

@inproceedings{Imel2023,
 author = {Imel, Nathaniel and Hafen, Zachary},
 title = {Citation-similarity relationships in astrophysics},
 booktitle = {AI for Scientific Discovery: From Theory to Practice Workshop (AI4Science @ NeurIPS)},
 year = {2023},
 url = {https://openreview.net/pdf?id=mISayy7DPI},
}
