
Project description

refy

A scientific paper recommendation tool.

Overview

refy leverages Natural Language Processing (NLP) machine learning tools to find new papers that might be relevant given the ones that you've read already. There are a few software tools out there that facilitate the exploration of scientific literature, including:

  • meta.org which allows users to set up feeds that identify newly published papers that seem relevant given a set of keywords
  • inciteful and scite.ai let you explore the network of citations around a given paper of interest
  • connected papers lets you visualize papers related to a given paper of interest

Most currently available software is limited in two key ways:

  1. Tools like meta.org rely on keywords, but keywords (e.g. computational neuroscience, Parkinson's Disease) are often overly general. As a result, you have to sift through a lot of irrelevant literature before you find something interesting
  2. Other tools like connected papers only work with one input paper at a time: you give them the title of a paper you've read and they give you suggestions. This is limiting: software that can analyse all the papers you've read has much more information available to find new papers that closely match your interests.

This is what refy is for: refy analyzes the abstracts of several of your papers and matches them against a database of almost ONE MILLION paper abstracts. By using many papers at once, refy has a lot more information at its disposal, which (hopefully) means it can recommend relevant papers more accurately. More details about the database used by refy can be found at the bottom of this document.

Disclaimer: The database used by refy is focused on neuroscience papers and preprints published in the last 30 years. If you are interested in older papers or work in a different field, please read the instructions below about how to adjust the database to your needs.

Usage

Installation

If you have an environment with Python >= 3.6, you can install refy with:

pip install refy

You can check if everything went okay with:

refy example

which should print out a list of example recommendations.

Usage

refy provides a Command Line Interface (CLI) that exposes its functionality directly in your terminal.

Note: the first time you use refy it will have to download several files with the data it needs to work. This should only take a few minutes and requires about 3GB of disk space.

You can use refy in two modes:

  1. In query mode you can find papers relevant for a given input string (e.g. locomotion mouse brainstem)
  2. In suggest mode you give refy a .bib BibTeX file with metadata about as many publications as you want. refy will use this information to find papers relevant across all of your input papers

For query mode you use the command refy query STRING, while for suggest mode you use refy suggest PATH. In both cases you can pass optional arguments:

    --N INTEGER            number of recommendations to show  [default: 10]
    --since INTEGER        only keep papers published after SINCE
    --to INTEGER           only keep papers published before TO
    --save-path, --s TEXT  save suggestions to file
    --debug, --d           set debug mode ON/OFF  [default: False]

For example:

refy query "locomotion control brainstem" --N 100 --since 2015 --to 2018 --s refs.csv

will show 100 suggested papers published between 2015 and 2018 and save the results to refs.csv.
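
Suggest mode works analogously; for example (my_library.bib is a placeholder for your own bibliography file):

refy suggest my_library.bib --N 20 --since 2015 --s refs.csv

would recommend 20 papers published after 2015, based on all the entries in my_library.bib, and save them to refs.csv.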

Note: in suggest mode, the content of your .bib file must include the papers' abstracts. Only papers with abstracts will be used for the analysis. Your entries should look like this:

@ARTICLE{Claudi2020-tb,
  title    = "Brainrender. A python based software for visualisation of
              neuroanatomical and morphological data",
  author   = "Claudi, Federico and Tyson, Adam L and Branco, Tiago",
  abstract = "Abstract Here we present brainrender, an open source python
              package for rendering three-dimensional neuroanatomical data
              aligned to the Allen Mouse Atlas. Brainrender can be used to
              explore, visualise and compare data from publicly available
              datasets (e.g. from the Mouse Light project from Janelia) as well
              as data generated within individual laboratories. Brainrender
              facilitates the exploration of neuroanatomical data with
              three-dimensional renderings, aiding the design and
              interpretation of experiments and the dissemination of anatomical
              findings. Additionally, brainrender can also be used to generate
              high-quality, publication-ready, figures for scientific
              publications.",
  journal  = "Cold Spring Harbor Laboratory",
  pages    = "2020.02.23.961748",
  month    =  feb,
  year     =  2020,
  language = "en",
  doi      = "10.1101/2020.02.23.961748"
}

Hint: if you use reference managers like Zotero or Paperpile you can easily export BibTeX data about your papers.

The output of refy comes in two forms:

  1. It will print to the terminal a list of N recommended papers, sorted by their recommendation score. In addition to each paper's title and year of publication, a URL is shown. Clicking on the URL should open the paper's web page in your browser (if your terminal supports links)
  2. Optionally, refy can save the list of recommended papers to a .csv file so that you can explore them at your leisure.

Scripting

You can of course access all of refy's functionality from normal Python scripts. For instance:

from refy import suggest

# run suggest mode on a .bib file and print the recommended papers
rec = suggest('my_library.bib')
print(rec.suggestions)

Under the hood

This section explains how refy works. If you just want to use refy and don't care about what happens under the hood then feel free to skip this.

refy uses NLP algorithms to estimate semantic similarity across papers based on the content of their abstracts. In particular, it uses Doc2Vec, an adaptation of Word2Vec: a model that learns embeddings of words in which semantically similar words are closer together in the embedding space than semantically dissimilar ones. Doc2Vec extends Word2Vec to learn vector embeddings of entire documents.

The Doc2Vec model used here is trained on the entire corpus of almost one million papers. When it comes to finding recommendations, refy uses Doc2Vec to create a vector representation of each of your papers and finds the N closest vectors, which, hopefully, correspond to papers similar to yours.
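
As a rough illustration, here is a minimal sketch of the underlying idea using gensim's Doc2Vec (this is not refy's actual code; the corpus and parameters are toy placeholders):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess

# toy corpus standing in for the database of ~1 million abstracts
abstracts = [
    "locomotion is controlled by brainstem circuits in the mouse",
    "deep learning models for image classification",
    "neural circuits in the brainstem drive locomotor behaviour",
]
corpus = [
    TaggedDocument(words=simple_preprocess(text), tags=[i])
    for i, text in enumerate(abstracts)
]

# train a (tiny) Doc2Vec model; refy downloads a model pre-trained on its full corpus
model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)

# embed a new abstract and retrieve the closest documents in the corpus
# (gensim >= 4: document vectors live in model.dv)
query_vector = model.infer_vector(simple_preprocess("brainstem control of locomotion in mice"))
print(model.dv.most_similar([query_vector], topn=2))  # list of (tag, cosine similarity)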

This operation is repeated for each paper in your .bib file, and the recommendations are then pooled and scored: the papers that score highest for the largest number of input papers are the most strongly recommended ones.
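
The exact scoring is internal to refy, but conceptually the pooling step looks something like this (the paper IDs and scores below are made up):

from collections import defaultdict

# hypothetical per-input-paper suggestions: (suggested paper id, similarity score)
suggestions_per_input = {
    "my_paper_1": [("paper_A", 0.91), ("paper_B", 0.84)],
    "my_paper_2": [("paper_A", 0.88), ("paper_C", 0.80)],
}

# sum scores across input papers: papers suggested for many inputs rank highest
pooled = defaultdict(float)
for hits in suggestions_per_input.values():
    for paper_id, score in hits:
        pooled[paper_id] += score

ranked = sorted(pooled.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)  # paper_A comes out on top because both input papers matched it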

Database

refy uses a curated database of metadata for about one million papers to recommend literature for you. The data come from two sources:

Data from these two vast databases (several million publications) are filtered to selectively keep neuroscience and ML papers written in English and published in the last 30 years. This selection is necessary to keep the compute and memory requirements within reasonable bounds.

If you wish to create your custom database, these are the steps you'll need to follow:

  1. download the compressed data from Semantic Scholar's Open Corpus and save them in a folder. Note: the entire database is >100GB in size even when compressed, so downloading it might take a while
  2. clone this repository with git clone https://github.com/FedeClaudi/refy.git
  3. in the shell cd refy
  4. within the cloned repository, edit refy/settings.py. It contains a few settings that control the creation of the database (e.g. selecting papers based on year of publication). Set these to values that match your needs
  5. Install the edited version of refy with pip install . -U
  6. Create the edited database with refy update_database FOLDER where FOLDER is the path to where you saved the data downloaded from semantic scholar

Once the database update has finished (it should take <5 hours), you can re-train the Doc2Vec model to fit your database with refy train (use refy train --help to see the available options). This step will likely require ~12 hours, but the duration depends on the specs of your machine and the size of the database you've created.
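
For reference, steps 2-6 above plus the final re-training correspond to the following commands (FOLDER is a placeholder for wherever you saved the Open Corpus download):

git clone https://github.com/FedeClaudi/refy.git
cd refy
# edit refy/settings.py to match your field and year range of interest
pip install . -U
refy update_database FOLDER
refy train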
