Library for ranking relevant papers based on a set of seed papers
Jason Portenoy 2018
This is code and sample data accompanying the paper:
Supervised Learning for Automated Literature Review
published in the proceedings of the 4th Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2019), co-located with the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2019).
Starting with a list of seed papers, collect candidate papers by following in- and out-citations (two degrees out). Then, train a classifier to rank the candidate papers by relevance to the seed set. Repeat this process a number of times to get an aggregate ranking over many candidate papers. A conceptual sketch of this loop follows the input/output summary below.
Example script in scripts/run_autoreview.py
Inputs:
List of paper IDs for the seed set.
Data for paper citations.
Paper data to be used as features for the classifiers (e.g., clusters, Eigenfactor scores, titles).
Parameters:
Size of the initial split of the seed papers
Number of times to perform the overall process of collecting candidate papers and training a classifier
Output:
List of papers not in the seed set, ordered by descending relevance score.
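The overall loop can be sketched roughly as follows. This is a conceptual illustration only, not the library's actual API: citation data is modeled as a simple dict mapping each paper ID to the IDs it cites, and the single toy feature (citation links to the seed sample) stands in for the richer features (clusters, Eigenfactor, titles) used by the real pipeline. It also assumes the sample size is smaller than the seed list, so the classifier sees both classes.

    # Conceptual sketch of the autoreview procedure -- not the library's API.
    import random
    from collections import defaultdict
    from sklearn.linear_model import LogisticRegression

    def expand_by_citations(papers, citations, degrees=2):
        """Follow in- and out-citations `degrees` hops out from `papers`."""
        cited_by = defaultdict(set)
        for src, tgts in citations.items():
            for tgt in tgts:
                cited_by[tgt].add(src)
        frontier, collected = set(papers), set(papers)
        for _ in range(degrees):
            nxt = set()
            for pid in frontier:
                nxt |= citations.get(pid, set()) | cited_by.get(pid, set())
            frontier = nxt - collected
            collected |= nxt
        return collected

    def run_autoreview(seed_ids, citations, sample_size, n_iterations=5, seed=999):
        rng = random.Random(seed)
        scores = defaultdict(list)
        for _ in range(n_iterations):
            # Split the seed list: a random sample plays the role of the seed set;
            # the held-out remainder are the known-relevant "target" papers.
            sample = set(rng.sample(sorted(seed_ids), sample_size))
            target = set(seed_ids) - sample
            candidates = sorted(expand_by_citations(sample, citations) - sample)

            # One toy feature per candidate: citation links to or from the sample.
            def links_to_sample(pid):
                out_links = len(citations.get(pid, set()) & sample)
                in_links = sum(1 for s in sample if pid in citations.get(s, set()))
                return out_links + in_links

            X = [[links_to_sample(pid)] for pid in candidates]
            y = [1 if pid in target else 0 for pid in candidates]
            clf = LogisticRegression().fit(X, y)
            for pid, score in zip(candidates, clf.predict_proba(X)[:, 1]):
                scores[pid].append(score)

        # Average scores across iterations; report papers outside the seed set.
        ranked = {pid: sum(s) / len(s) for pid, s in scores.items()
                  if pid not in seed_ids}
        return sorted(ranked.items(), key=lambda kv: kv[1], reverse=True)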
Installation
Install via PyPI:
pip install autoreview
Example
Apache Spark (https://spark.apache.org/downloads.html) must be installed to run the example.
The environment variable SPARK_HOME must be set (preferably in a .env file) to the path of the Spark installation.
Spark requires Java 8. Make sure Java 8 is installed and point the environment variable JAVA_HOME to its path.
Example .env file:
    SPARK_HOME=/home/spark-2.4.0-bin-hadoop2.7
    JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
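To check that these variables are visible to Python before launching Spark, a snippet along these lines can help; it assumes the python-dotenv package is installed, which the .env-file convention above suggests.

    # Sanity check that SPARK_HOME and JAVA_HOME resolve to real directories.
    # Assumes python-dotenv is installed (pip install python-dotenv).
    import os
    from dotenv import load_dotenv

    load_dotenv()  # read variables from a .env file in the working directory
    for var in ("SPARK_HOME", "JAVA_HOME"):
        path = os.environ.get(var)
        if not path or not os.path.isdir(path):
            raise RuntimeError(f"{var} is unset or not a directory: {path!r}")
        print(f"{var} = {path}")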
Create a virtual environment and install the required libraries:
    virtualenv venv
    source venv/bin/activate
    pip install -r requirements.txt
Run the full autoreview pipeline using sample data:
    python scripts/run_autoreview.py \
        --id-list sample_data/sample_IDs_MAG.txt \
        --citations sample_data/MAG_citations_sample \
        --papers sample_data/MAG_papers_sample \
        --sample-size 15 \
        --random-seed 999 \
        --id-colname Paper_ID \
        --cited-colname Paper_Reference_ID \
        --outdir sample_data/sample_output \
        --debug
This example is only meant to show how the system operates; it will not produce meaningful results with such a small sample of paper and citation data.
The top predictions are written to sample_data/sample_output/predictions.tsv.
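The output TSV can then be inspected with pandas, for example; the exact columns depend on the paper data supplied, so none are assumed here beyond the file being tab-separated.

    # Preview the ranked candidate papers (columns depend on your input data).
    import pandas as pd

    preds = pd.read_csv("sample_data/sample_output/predictions.tsv", sep="\t")
    print(preds.head(10))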
Development
For new releases:
    # increment the version number
    bump2version patch
Replace patch with minor or major as needed. Then:
    # push new release to github
    git push --tags
    # build and upload to PyPI
    python setup.py sdist bdist_wheel
    twine check dist/*
    twine upload dist/*