Filters RSS feeds, predicts interest, and notifies Slack with top academic articles.

PaperSorter

PaperSorter is an academic paper recommendation system that uses machine learning to match articles to a user's interests. It retrieves article alerts from RSS feeds and passes the title, authors, journal name, and abstract of each article to Upstage's Solar LLM to generate embedding vectors. These vectors serve as input to a regression model that predicts the user's level of interest in each paper. PaperSorter then sends notifications about high-scoring articles to a designated Slack channel, enabling timely discussion of relevant publications among colleagues. The prediction model can be trained incrementally as the user provides labels for new articles.

Installing

To install PaperSorter, use pip:

pip install papersorter

Preparing

TheOldReader

PaperSorter uses TheOldReader as its feed source. After signing up for TheOldReader, you can access its API with your account email and password. Before running PaperSorter, set the TOR_EMAIL and TOR_PASSWORD environment variables to your TheOldReader email and password, respectively, so that PaperSorter can authenticate and retrieve your feeds.

Upstage Solar LLM

Solar LLM's embedding API converts article titles and contents into numerical vectors. Sign up on the Upstage console and create an API key as per the documentation. Store the key securely and set the UPSTAGE_API_KEY environment variable before running PaperSorter.

Slack Incoming WebHook

To send notifications to a Slack channel, create an incoming webhook address as described in the Slack documentation. Store the address securely and set the PAPERSORTER_WEBHOOK_URL environment variable before running PaperSorter.
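The three sections above name four environment variables. As a minimal sketch, a check like the following could be run before PaperSorter to catch a missing one early. The variable names come from this README; the `missing_vars` function is a hypothetical helper and not part of PaperSorter.

```shell
#!/bin/sh
# Variable names are taken from the sections above.
REQUIRED_VARS="TOR_EMAIL TOR_PASSWORD UPSTAGE_API_KEY PAPERSORTER_WEBHOOK_URL"

# Hypothetical helper: print the name of every required variable that is
# unset, empty, or not exported. printenv only sees exported variables,
# which are the ones a child process such as papersorter would receive.
missing_vars() {
    for name in $REQUIRED_VARS; do
        if [ -z "$(printenv "$name")" ]; then
            echo "$name"
        fi
    done
}

missing=$(missing_vars)
if [ -n "$missing" ]; then
    echo "Missing environment variables:" $missing >&2
fi
```

Because the check relies on `printenv`, a variable assigned without `export` is reported as missing, which matches what PaperSorter would see when launched from the same shell.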

Initialization and Training

To train a predictor for your article interests, ensure your TheOldReader account contains at least 1000 articles, including at least 100 starred articles, which serve as the positive labels. Ideally, aim for around 5000 articles with 500 starred items for better performance.

After populating your TheOldReader account, initialize the feed and embedding databases using:

papersorter init

Next, train your first model with:

papersorter train

If the ROC AUC performance metric meets your expectations, you are ready to send notifications about new interesting articles.

Getting Updates and Sending Notifications

For regular updates, run the following command. It retrieves new feed items, converts them to embeddings, and finds interesting articles:

papersorter update

To send notifications for new interesting articles, run:

papersorter broadcast

You will receive formatted notifications in your Slack channel.

Running as a Cron Job

Here is an example of a shell script that runs PaperSorter's update and broadcast jobs in the background. The script always performs an update, but it sends notifications about new interesting articles only when the current hour is between 07 and 21 (roughly 7 am to 9 pm); at night it only updates.

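#!/bin/bash
# Hourly PaperSorter job: always update, broadcast only during the day.
PAPERSORTER_CMD=/path/to/papersorter
PAPERSORTER_DATADIR=/path/to/data
LOGFILE=background-updates.log
CURRENT_HOUR=$(date +%H)

# Stop if the data directory is missing rather than run in the wrong place.
cd "$PAPERSORTER_DATADIR" || exit 1
"$PAPERSORTER_CMD" update -q --log-file "$LOGFILE"

# Broadcast only when the local hour is 07 through 21.
if [ "$CURRENT_HOUR" -ge 7 ] && [ "$CURRENT_HOUR" -le 21 ]; then
    "$PAPERSORTER_CMD" broadcast -q --log-file "$LOGFILE"
fi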

Here is an example line for the crontab. It runs the update script every hour, at ten minutes past the hour.

10 * * * * /bin/bash /path/to/run-update.sh
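The hour check in the script above can also be factored into a small function, which makes the notification window easy to adjust and to test in isolation. This is only a sketch: the function name `in_broadcast_window` and its optional start and end arguments are assumptions for illustration, with defaults matching the 7 and 21 used in the script.

```shell
#!/bin/sh
# Hypothetical helper: succeed if HOUR lies in the inclusive range START..END.
# Usage: in_broadcast_window HOUR [START] [END]
# The [ ... ] test reads its operands as decimal integers, so zero-padded
# hours from `date +%H` such as 08 and 09 compare correctly.
in_broadcast_window() {
    hour=$1
    start=${2:-7}
    end=${3:-21}
    [ "$hour" -ge "$start" ] && [ "$hour" -le "$end" ]
}

if in_broadcast_window "$(date +%H)"; then
    echo "inside the notification window"
fi
```

With cron firing at ten minutes past each hour, an inclusive upper bound of 21 means the last broadcast of the day happens at 21:10.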

Feedback and Updating the Model

To improve the model, provide more labels for the articles. First, extract the list of articles with the following command:

papersorter train -o model-temporary.pkl -f feedback.xlsx

This generates an Excel file, feedback.xlsx, containing titles, authors, prediction scores, and other details. Review each row and fill in the label column with 1 (interesting) or 0 (not interesting). Leave it blank if unsure. Once you've labeled some articles, update the feed database with:

papersorter feedback -i feedback.xlsx

Retrain the predictor with the updated labels using:

papersorter train

The new predictor is stored as model.pkl, and your next feeds will be assessed with the updated model.
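Once the spreadsheet has been labeled, the last two steps above can be chained so that retraining only happens if the feedback import succeeds. The following is a minimal sketch: `apply_feedback` is a hypothetical wrapper, and `PAPERSORTER_CMD` follows the variable used in the cron script, here defaulting to `papersorter` on the PATH. Only the `feedback -i` and `train` invocations shown in this section are used.

```shell
#!/bin/sh
# Defaults to the installed command; set PAPERSORTER_CMD to use another path.
PAPERSORTER_CMD=${PAPERSORTER_CMD:-papersorter}

# Hypothetical wrapper: import labels from a feedback file, then retrain.
# Usage: apply_feedback [FILE]   (FILE defaults to feedback.xlsx)
apply_feedback() {
    feedback_file=${1:-feedback.xlsx}
    if [ ! -f "$feedback_file" ]; then
        echo "No such file: $feedback_file" >&2
        return 1
    fi
    # Skip retraining if the labels could not be imported.
    "$PAPERSORTER_CMD" feedback -i "$feedback_file" || return 1
    "$PAPERSORTER_CMD" train
}
```

Calling `apply_feedback feedback.xlsx` from the PaperSorter data directory would then run both steps in order.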

Author

Hyeshik Chang <hyeshik@snu.ac.kr>

