
Semantic search and workflows for medical/scientific papers




paperai is a semantic search and workflow application for medical/scientific papers.


Applications range from semantic search indexes that find matches for medical/scientific queries to full-fledged reporting applications powered by machine learning.


paperai and/or NeuML has been recognized in the following articles:

Installation

The easiest way to install is via pip and PyPI:

pip install paperai

Python 3.7+ is supported. Using a Python virtual environment is recommended.

paperai can also be installed directly from GitHub to access the latest, unreleased features.

pip install git+https://github.com/neuml/paperai

See this link to help resolve environment-specific install issues.

Docker

Run the steps below to build a docker image with paperai and all dependencies.

wget https://raw.githubusercontent.com/neuml/paperai/master/docker/Dockerfile
docker build -t paperai .
docker run --name paperai --rm -it paperai

paperetl can be added to build a single image that both indexes and queries content. Follow the instructions to build a paperetl docker image, then run the following.

docker build -t paperai --build-arg BASE_IMAGE=paperetl --build-arg START=/scripts/start.sh .
docker run --name paperai --rm -it paperai

Examples

The following notebooks and applications demonstrate the capabilities provided by paperai.

Notebooks

  • Introducing paperai - Overview of the functionality provided by paperai

Applications

  • Search - Search a paperai index. Set query parameters, execute searches and display results.

Building a model

paperai indexes databases previously built with paperetl. The following shows how to create a new paperai index.

  1. (Optional) Create an index.yml file

    paperai uses the default txtai embeddings configuration when not specified. Alternatively, an index.yml file can be specified that takes all the same options as a txtai embeddings instance. See the txtai documentation for more on the possible options. A simple example is shown below.

    path: sentence-transformers/all-MiniLM-L6-v2
    content: True
    
  2. Build embeddings index

    python -m paperai.index <path to input data> <optional index configuration>
    

The paperai.index process requires an input data path and optionally takes index configuration. This configuration can either be a vector model path or an index.yml configuration file.
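
The indexing step can also be scripted. The sketch below is a minimal illustration, not part of paperai: `build_index_command` is a hypothetical helper that only assembles the argument list for the `python -m paperai.index` call shown above, with the configuration argument left out when none is given. The `data` and `index.yml` paths are placeholders.

```python
import sys


def build_index_command(data_path, config=None):
    """Assemble the argv list for `python -m paperai.index` (hypothetical helper)."""
    command = [sys.executable, "-m", "paperai.index", str(data_path)]
    # The configuration is optional: a vector model path or an index.yml file
    if config is not None:
        command.append(str(config))
    return command


# Placeholder paths, for illustration only
default_command = build_index_command("data")
custom_command = build_index_command("data", "index.yml")
print(" ".join(custom_command[1:]))
```

The resulting list can be passed to `subprocess.run(command, check=True)` once paperai is installed and the input data exists.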

Running queries

The fastest way to run queries is to start a paperai shell:

paperai <path to model directory>

A prompt will come up. Queries can be typed directly into the console.

Building a report file

Reports support generating output in multiple formats. An example report call:

python -m paperai.report report.yml 50 md <path to model directory>

The following report formats are supported:

  • Markdown (Default) - Renders a Markdown report. Columns and answers are extracted from articles with the results stored in a Markdown file.
  • CSV - Renders a CSV report. Columns and answers are extracted from articles with the results stored in a CSV file.
  • Annotation - Columns and answers are extracted from articles with the results annotated over the original PDF files. Requires passing in a path with the original PDF files.

In the example above, a file named report.md will be created. Example report configuration files can be found here.
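
A report run can be wrapped the same way as any other command line call. The following is a sketch under stated assumptions: `build_report_command` is a hypothetical helper, and the meaning of the numeric argument (`50` in the example above) is not spelled out here, so it is treated as a result limit. Only the `md` format code appears in this document, so the format is passed through as-is.

```python
import sys


def build_report_command(report, model_path, limit=50, output_format="md"):
    """Assemble the argv list for `python -m paperai.report` (hypothetical helper)."""
    return [
        sys.executable, "-m", "paperai.report",
        str(report),        # report configuration file, e.g. report.yml
        str(limit),         # assumption: numeric limit, 50 in the example call
        output_format,      # format code, "md" in the example call
        str(model_path),    # path to the model directory
    ]


# "models" is a placeholder for the model directory
report_command = build_report_command("report.yml", "models")
print(" ".join(report_command[1:]))
```

As with indexing, the list can be handed to `subprocess.run(report_command, check=True)` in an environment where paperai is installed.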

Tech Overview

paperai is a combination of a txtai embeddings index and a SQLite database with the articles. Each article is parsed into sentences and stored in SQLite along with the article metadata. Embeddings are built over the full corpus.
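
To make the storage side concrete, here is a small sketch of sentence-level storage in SQLite. The schema is an assumption for illustration (the table and column names are hypothetical and not taken from paperai or paperetl), and no embeddings are built. It only shows how a sentence id, such as one returned by an embeddings search, can be resolved back to its text and article metadata.

```python
import sqlite3

# Hypothetical layout: one row per article, one row per sentence
SCHEMA = """
CREATE TABLE articles (id TEXT PRIMARY KEY, title TEXT, published TEXT);
CREATE TABLE sections (
    id INTEGER PRIMARY KEY,
    article TEXT REFERENCES articles(id),
    text TEXT
);
"""


def build_database(articles):
    """Load (id, title, published, sentences) tuples into an in-memory database."""
    connection = sqlite3.connect(":memory:")
    connection.executescript(SCHEMA)
    for uid, title, published, sentences in articles:
        connection.execute(
            "INSERT INTO articles VALUES (?, ?, ?)", (uid, title, published)
        )
        connection.executemany(
            "INSERT INTO sections (article, text) VALUES (?, ?)",
            [(uid, sentence) for sentence in sentences],
        )
    connection.commit()
    return connection


def lookup(connection, section_id):
    """Resolve a sentence id to its text and the title of its article."""
    return connection.execute(
        "SELECT s.text, a.title FROM sections s "
        "JOIN articles a ON a.id = s.article WHERE s.id = ?",
        (section_id,),
    ).fetchone()


# Made-up sample record, for illustration only
db = build_database(
    [("a1", "Example study", "2020-01-01", ["First sentence.", "Second sentence."])]
)
print(lookup(db, 2))  # ('Second sentence.', 'Example study')
```

Keeping sentences and article metadata in separate tables means a search index only has to store sentence ids, and the metadata is joined in at query time.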

Multiple entry points exist to interact with the model.

  • paperai.report - Builds a report for a series of queries. For each query, the top scoring articles are shown along with matches from those articles. There is also a highlights section showing the most relevant results.
  • paperai.query - Runs a single query from the terminal
  • paperai.shell - Allows running multiple queries from the terminal
