Library for ranking relevant papers based on a set of seed papers

Project description

Jason Portenoy 2018

This is code and sample data accompanying the paper:

Supervised Learning for Automated Literature Review

published in the proceedings of the 4th Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2019), co-located with the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2019).

Starting with a list of seed papers, collect candidate papers by following in- and out-citations up to two degrees away from the seeds. Then train a classifier to rank the candidate papers. Repeat this process a number of times to get an aggregate ranking for many candidate papers.
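The candidate-collection step can be pictured with a small plain-Python sketch. This is an illustration only, not the library's implementation: it assumes citations are given as (citing, cited) pairs, and the names `build_neighbors` and `collect_candidates` are hypothetical.

```python
from collections import defaultdict


def build_neighbors(citations):
    """Map each paper ID to the papers it cites and the papers that cite it."""
    neighbors = defaultdict(set)
    for citing, cited in citations:
        neighbors[citing].add(cited)   # out-citation of `citing`
        neighbors[cited].add(citing)   # in-citation of `cited`
    return neighbors


def collect_candidates(seed_ids, citations, degrees=2):
    """Return papers reachable from the seeds within `degrees` citation hops.

    Hypothetical helper: citations are followed in both directions and the
    seed papers themselves are excluded from the result.
    """
    neighbors = build_neighbors(citations)
    seeds = set(seed_ids)
    seen = set(seeds)
    frontier = set(seeds)
    for _ in range(degrees):
        next_frontier = set()
        for paper in frontier:
            next_frontier |= neighbors.get(paper, set())
        next_frontier -= seen          # keep only newly discovered papers
        seen |= next_frontier
        frontier = next_frontier
    return seen - seeds


# Toy data: each pair is (citing paper, cited paper).
citations = [("A", "B"), ("C", "A"), ("B", "D"), ("E", "C"), ("D", "F")]
candidates = collect_candidates(["A"], citations)
print(sorted(candidates))  # ['B', 'C', 'D', 'E']
```

Paper F is three hops from the seed A, so it is left out when `degrees=2`.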

Example script in scripts/run_autoreview.py

Inputs:

  • List of paper IDs for the seed set.

  • Data for paper citations.

  • Paper data to be used as features for the classifiers (e.g., clusters, eigenfactor, titles, etc.)

Parameters:

  • Size of the initial split

  • Number of times to perform the overall process of collecting candidate papers and training a classifier

Output:

  • List of papers not in the seed set, ordered descending by relevance score.
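As a rough picture of how repeated runs could be combined into the output described above, here is a minimal sketch. It assumes each run produces a mapping from paper ID to relevance score and that runs are combined by averaging. Both are illustrative assumptions rather than the documented behavior of autoreview, and `aggregate_rankings` is a hypothetical name.

```python
from collections import defaultdict


def aggregate_rankings(runs, seed_ids):
    """Average per-run scores and rank non-seed papers by descending mean score.

    Hypothetical helper: `runs` is a list of {paper_id: score} dicts, one per
    repetition of the collect-and-classify process.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for scores in runs:
        for paper_id, score in scores.items():
            totals[paper_id] += score
            counts[paper_id] += 1
    seeds = set(seed_ids)
    mean_scores = {
        paper_id: totals[paper_id] / counts[paper_id]
        for paper_id in totals
        if paper_id not in seeds       # the output excludes seed papers
    }
    return sorted(mean_scores.items(), key=lambda item: item[1], reverse=True)


# Two toy runs; "S1" is a seed paper and is dropped from the ranking.
runs = [
    {"P1": 0.75, "P2": 0.25, "S1": 1.0},
    {"P1": 0.25, "P3": 0.75},
]
ranking = aggregate_rankings(runs, seed_ids=["S1"])
print(ranking)  # [('P3', 0.75), ('P1', 0.5), ('P2', 0.25)]
```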

Installation

Install via PyPI:

pip install autoreview

Example

  • Apache Spark (https://spark.apache.org/downloads.html) must be installed to run the example.

  • The environment variable SPARK_HOME must be set (preferably in a .env file) with the path to Spark.

    • Spark requires Java 8. Make sure Java 8 is installed and set the environment variable JAVA_HOME to its path.

    • Example .env file:

      SPARK_HOME=/home/spark-2.4.0-bin-hadoop2.7
      JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
  • Create a virtual environment and install the required libraries:

    virtualenv venv
    source venv/bin/activate
    pip install -r requirements.txt
  • Run the full autoreview pipeline using sample data:

    python scripts/run_autoreview.py \
        --id-list sample_data/sample_IDs_MAG.txt \
        --citations sample_data/MAG_citations_sample \
        --papers sample_data/MAG_papers_sample \
        --sample-size 15 \
        --random-seed 999 \
        --id-colname Paper_ID \
        --cited-colname Paper_Reference_ID \
        --outdir sample_data/sample_output \
        --debug
  • This is just meant to show how the system operates. It will not provide meaningful results with such a small sample of paper and citation data.

  • It will output the top predictions in sample_data/sample_output/predictions.tsv.
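To take a quick look at the predictions file after a run, something like the following sketch may help. It uses only the standard library and assumes nothing beyond the file being tab-separated, as the .tsv extension suggests. The column layout is not documented here, so rows are returned as plain lists of strings. `read_top_predictions` is a hypothetical helper, not part of autoreview.

```python
import csv
from itertools import islice


def read_top_predictions(path, n=10):
    """Return the first `n` rows of a tab-separated file as lists of strings.

    Hypothetical helper: if the file has a header row, it is returned as the
    first row, since no particular column layout is assumed.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t")
        return list(islice(reader, n))
```

For the sample run above, the call would be `read_top_predictions("sample_data/sample_output/predictions.tsv", n=5)`.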

Development

For new releases:

# increment the version number
bump2version patch

Replace patch with minor or major as needed. Then:

# push new release to github
git push --tags

# build and upload to PyPI
python setup.py sdist bdist_wheel
twine check dist/*
twine upload dist/*
