
principledinvestigator

A paper recommendation tool

principledinvestigator compares papers in your library against a database of scientific papers to find new papers that you might be interested in. While there are a few services out there that try to do the same, principledinvestigator is unique in several ways:

  • principledinvestigator is completely open source: you can get the code and tweak it to improve the recommendation engine
  • principledinvestigator doesn't just use a single paper or a subset of (overly generic) keywords to find new papers; instead it compares all of your papers' abstracts against a database of paper metadata, producing much more relevant results

Disclaimer

The dataset used here is a subset of a larger dataset of scientific papers. It is focused on neuroscience papers published in the last 30 years. If you want to include older papers or are interested in another field, follow the instructions below to create your custom database.

(possible) future improvements

  • use SciBERT instead of TF-IDF for creating the embeddings. This should also make it possible to embed the database's papers ahead of time (unlike TF-IDF, which must be run on the entire corpus every time).

Overview

The core feature making principledinvestigator unique among paper recommendation systems is that it analyzes your entire library of papers and matches it against a vast database of scientific papers to find new, relevant papers. This is a clear improvement over, e.g., finding papers similar to a single paper you like. In addition, principledinvestigator doesn't rely only on fields like "title", "authors", and "keywords" to find new matches; instead it uses Term Frequency-Inverse Document Frequency (TF-IDF) to assess similarity across papers' abstracts, thus drawing on much more information about each paper's content.
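As a rough, self-contained sketch of the idea (not principledinvestigator's actual implementation), TF-IDF weighting plus cosine similarity over abstracts can be written in a few lines of plain Python:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute sparse TF-IDF vectors (dicts) for a list of tokenized documents."""
    n = len(docs)
    # document frequency: in how many documents each term appears
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {term: (count / len(doc)) * math.log(n / df[term])
               for term, count in tf.items()}
        vectors.append(vec)
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors represented as dicts."""
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

abstracts = [
    "neural circuits in the mouse cortex".split(),
    "cortical neural circuits of mice".split(),
    "bayesian inference for economic models".split(),
]
vecs = tfidf_vectors(abstracts)
# the two neuroscience abstracts score higher than the unrelated one
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))  # → True
```

A real system would additionally lowercase, strip stop words, and stem the abstracts before vectorizing.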

Usage

First, you need to get data about the papers you want to use for the search. The best way is to export your library (or a subset of it) directly to a .bib file using your reference manager of choice.

Then, you can use...
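For illustration only, here is a naive standard-library sketch of pulling abstract fields out of such a .bib export. It is not part of principledinvestigator's API, and a dedicated BibTeX parser handles the format's quirks (nested braces, multi-line fields) far better:

```python
import re

def read_bib_abstracts(bib_text):
    """Extract single-line abstract fields from a BibTeX export (naive)."""
    # matches lines such as:  abstract = {...},
    pattern = re.compile(r'abstract\s*=\s*[{"](.+?)[}"]\s*,?\s*$',
                         re.IGNORECASE | re.MULTILINE)
    return pattern.findall(bib_text)

sample = """@article{smith2020,
  title = {A paper},
  abstract = {We study neural circuits in the cortex.},
  year = {2020},
}"""
print(read_bib_abstracts(sample))  # → ['We study neural circuits in the cortex.']
```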

Making your own database

principledinvestigator uses a subset of the vast and excellent corpus of scientific publications' metadata from Semantic Scholar. The dataset used by principledinvestigator is focused on neuroscience papers written in English and published in the last 30 years. If you wish to include a different set of papers in your database, you can make your own custom database and use it with principledinvestigator by executing the following steps.

1. Download whole corpus

You'll first need to download the whole corpus from Semantic Scholar. You can find the data and download instructions here. Once the data are downloaded, save them in the folder you want to use as the base for the dataset-creation process.

2. Uncompressing data

The downloaded corpus is compressed. To uncompress the files, use principledinvestigator.database_preprocessing.upack_database, passing it the path to the folder where you've downloaded the data.

3. Specifying your parameters

The selection of a subset of papers from the corpus is based on a set of parameters (e.g. year of publication) matched against criteria specified (and described) in principledinvestigator.settings. Edit the criteria to adapt the dataset selection to your needs.
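The exact names and defaults live in principledinvestigator.settings; the following is only a hypothetical illustration of the kind of criteria involved:

```python
# Hypothetical illustration of selection criteria -- check
# principledinvestigator.settings for the actual names and defaults.
FIELDS_OF_STUDY = ["Neuroscience"]  # which Semantic Scholar fields to keep
LANGUAGE = "english"                # only keep papers written in English
MIN_YEAR = 1991                     # drop papers older than ~30 years
```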

4. Creating the dataset

Simply run principledinvestigator.database_preprocessing.make_database

5. Training doc2vec model

Papers' semantic similarity is estimated using a doc2vec model trained on the entire dataset. After modifying the dataset to your needs, you'll have to re-train the model by running principledinvestigator.doc2vec.train_doc2vec_model

Summary

Example code for creating your dataset (after having downloaded the corpus and edited the settings):

from pathlib import Path

from principledinvestigator.database_preprocessing import upack_database, make_database
from principledinvestigator.doc2vec import train_doc2vec_model

# folder where the raw Semantic Scholar corpus was downloaded
folder = Path('path to your data')

# unpack the compressed corpus and create the filtered database
upack_database(folder)
make_database(folder)

# train a new doc2vec model on the rebuilt database
train_doc2vec_model()
