
AI-powered literature discovery and review engine for medical/scientific papers



paperai is an AI-powered literature discovery and review engine for medical/scientific papers. paperai helps automate tedious literature reviews, allowing researchers to focus on their core work. Queries filter papers by specified criteria. Reports powered by extractive question-answering identify answers to key questions within sets of medical/scientific papers.

paperai was used to analyze the COVID-19 Open Research Dataset (CORD-19), winning multiple awards in the CORD-19 Kaggle challenge.

paperai and/or NeuML has been recognized in the following articles:


The easiest way to install is via pip and PyPI:

pip install paperai

Python 3.7+ is supported. Using a Python virtual environment is recommended.

paperai can also be installed directly from GitHub to access the latest, unreleased features.

pip install git+

See this link to help resolve environment-specific install issues.


A Dockerfile with commands to install paperai, all dependencies and scripts is available in this repository.

Clone this git repository and run the following to build and run the Docker image.

docker build -t paperai -f docker/Dockerfile .
docker run --name paperai --rm -it paperai

This will bring up a paperai command shell. Standard Docker commands can be used to copy files over or commands can be run directly in the shell to retrieve input content. All scripts in the following examples are available in this environment.

paperetl's Dockerfile can be combined with this Dockerfile to produce a single image that can both index and query content. The files from the paperetl project's scripts directory need to be placed in paperai's scripts directory. The paperetl Dockerfile also needs to be copied over (it is referenced as paperetl.Dockerfile here).

docker build -t base -f docker/Dockerfile .
docker build -t paperai --build-arg BASE_IMAGE=base -f docker/paperetl.Dockerfile .
docker run --name paperai --rm -it paperai


The following notebooks and applications demonstrate the capabilities provided by paperai.


Notebook | Description
CORD-19 Analysis with Sentence Embeddings | Builds paperai-based submissions for the CORD-19 Challenge
CORD-19 Report Builder | Template for building new reports


Application | Description
Search | Search a paperai index. Set query parameters, execute searches and display results.

Building a model

paperai indexes databases previously built with paperetl. paperai currently supports querying SQLite databases.
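Before indexing, it can help to confirm that the SQLite database contains what you expect. The following is a minimal sketch using only Python's standard sqlite3 module. The helper names list_tables and count_rows are hypothetical and not part of paperai, and the articles table in the demo is a stand-in rather than the actual schema that paperetl produces.

```python
import sqlite3


def list_tables(connection):
    """Return the names of all tables in an open SQLite connection, sorted."""
    rows = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


def count_rows(connection, table):
    """Count the rows in a table after checking the name against the schema."""
    if table not in list_tables(connection):
        raise ValueError(f"unknown table: {table}")
    return connection.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]


if __name__ == "__main__":
    # Stand-in database (assumption): replace ":memory:" with the path to
    # the SQLite file built by paperetl to inspect a real database.
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE articles (id TEXT PRIMARY KEY, title TEXT)")
    connection.execute("INSERT INTO articles VALUES ('a1', 'Example title')")

    for table in list_tables(connection):
        print(table, count_rows(connection, table))
```

An empty or missing table at this stage usually points at a problem in the upstream load, which is cheaper to catch here than after an index build.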

The following sections show how to build an embeddings index for a SQLite articles database. This example assumes that cord19/models is both the database path and the model path. Substitute as appropriate.

  1. Get vector model

    Run the following script to download CORD-19 fastText vectors.

    scripts/ cord19/vectors

    A full vector model build for fastText models can optionally be run with the following command.

    python -m paperai.vectors cord19/models
  2. Build embeddings index

    python -m paperai.index cord19/models cord19/vectors/cord19-300d.magnitude

The paperai.index process takes two required arguments: the model path and the vector model path. In this case, the vector model is a CORD-19 fastText model, but it can also be any supported transformers model.

Building a report file

Reports support generating output in multiple formats. An example report call:

python -m report.yml 50 md cord19/models

The following report formats are supported:

  • Markdown (Default) - Renders a Markdown report. Columns and answers are extracted from articles with the results stored in a Markdown file.
  • CSV - Renders a CSV report. Columns and answers are extracted from articles with the results stored in a CSV file.
  • Annotation - Columns and answers are extracted from articles with the results annotated over the original PDF files. Requires passing in a path with the original PDF files.

In the example above, a file named will be created. Example report configuration files can be found here.
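To illustrate how the Markdown and CSV formats differ, here is a minimal sketch that renders the same extracted columns and answers both ways. It is not paperai's report code: the functions render_markdown, render_csv and render are hypothetical, the "csv" format code is an assumption (only "md" appears in the example call above), and the Annotation format is not covered.

```python
import csv
import io


def render_markdown(columns, rows):
    """Render rows (a list of dicts) as a Markdown table string."""
    lines = [
        "| " + " | ".join(columns) + " |",
        "| " + " | ".join("---" for _ in columns) + " |",
    ]
    for row in rows:
        # Escape pipes so cell text cannot break the table layout
        cells = [str(row.get(column, "")).replace("|", "\\|") for column in columns]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines) + "\n"


def render_csv(columns, rows):
    """Render rows (a list of dicts) as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def render(fmt, columns, rows):
    """Dispatch on a format code; "md" and "csv" are the codes assumed here."""
    renderers = {"md": render_markdown, "csv": render_csv}
    if fmt not in renderers:
        raise ValueError(f"unsupported format: {fmt}")
    return renderers[fmt](columns, rows)


if __name__ == "__main__":
    # Made-up row purely for illustration
    columns = ["Title", "Answer"]
    rows = [{"Title": "Paper A", "Answer": "5 days"}]
    print(render("md", columns, rows))
    print(render("csv", columns, rows))
```

Keeping extraction separate from rendering, as in this sketch, means one set of answers can feed any output format.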

Running queries

The fastest way to run queries is to start a paperai shell:

paperai cord19/models

A prompt will come up. Queries can be typed directly into the console.

Tech Overview

The model is a combination of a sentence embeddings index and a SQLite database with the articles. Each article is parsed into sentences and stored in SQLite along with the article metadata. Sentence embeddings are built over the full corpus. The sentence embeddings index only uses tagged articles, which helps produce the most relevant results.
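The design described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not paperai's implementation: sentences are stored in an in-memory SQLite table, a bag-of-words count vector stands in for real sentence embeddings, and the names embed, cosine, build and search are all hypothetical.

```python
import math
import re
import sqlite3
from collections import Counter


def embed(text):
    """Toy embedding: word counts (a stand-in for real sentence vectors)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build(articles):
    """Split each article into sentences, store them in SQLite, embed each one."""
    connection = sqlite3.connect(":memory:")
    connection.execute(
        "CREATE TABLE sentences (id INTEGER PRIMARY KEY, article TEXT, text TEXT)"
    )
    index = {}
    for title, text in articles.items():
        for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
            if sentence:
                cursor = connection.execute(
                    "INSERT INTO sentences (article, text) VALUES (?, ?)",
                    (title, sentence),
                )
                index[cursor.lastrowid] = embed(sentence)
    return connection, index


def search(connection, index, query, limit=3):
    """Rank sentences by similarity to the query, then look each one up in SQLite."""
    query_vector = embed(query)
    scored = sorted(
        ((cosine(query_vector, vector), sid) for sid, vector in index.items()),
        reverse=True,
    )
    results = []
    for score, sid in scored[:limit]:
        if score > 0:
            article, text = connection.execute(
                "SELECT article, text FROM sentences WHERE id = ?", (sid,)
            ).fetchone()
            results.append((round(score, 3), article, text))
    return results


if __name__ == "__main__":
    # Made-up articles purely for illustration
    connection, index = build({
        "A": "Masks reduce transmission. Vaccines were developed quickly.",
        "B": "The incubation period is about five days.",
    })
    for result in search(connection, index, "incubation period"):
        print(result)
```

The split of responsibilities is the point of the sketch: the index only maps vectors to sentence ids, while SQLite remains the source of the text and article metadata returned to the user.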

Multiple entry points exist to interact with the model.

  • - Builds a Markdown report for a series of queries. For each query, the report shows the best articles, the top matches from those articles, and a highlights section with the most relevant sections from the embeddings search for the query.
  • paperai.query - Runs a single query from the terminal
  • - Allows running multiple queries from the terminal

