Filters RSS feeds, predicts interest, and notifies Slack with top academic articles.
PaperSorter
PaperSorter is an academic paper recommendation system that uses machine learning to match papers to a user's interests. The system retrieves article alerts from RSS feeds and processes the title, authors, journal name, and abstract of each article using Upstage's Solar LLM to generate embedding vectors. These vectors serve as input for a regression model that predicts the user's level of interest in each paper. PaperSorter sends notifications about high-scoring articles to a designated Slack channel, enabling timely discussion of relevant publications among colleagues. The prediction model can be trained incrementally as the user provides labels for new articles.
Installing
To install PaperSorter, use pip:
pip install papersorter
Preparing
TheOldReader
PaperSorter uses TheOldReader as its feed source. After signing up for TheOldReader, you will receive API access using your email and password. Before running PaperSorter, make sure to set the TOR_EMAIL and TOR_PASSWORD environment variables with your TheOldReader email and password, respectively. This will allow PaperSorter to authenticate and retrieve the necessary data from your feeds.
Upstage Solar LLM
Solar LLM's embedding API converts article titles and contents into numerical vectors. Sign up on the Upstage console and create an API key as per the documentation. Store the key securely and set the UPSTAGE_API_KEY environment variable before running PaperSorter.
Slack Incoming WebHook
To send notifications to a Slack channel, create an incoming webhook address as described in the Slack documentation. Store the address securely and set the PAPERSORTER_WEBHOOK_URL environment variable before running PaperSorter.
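Since all four variables above must be present before PaperSorter runs, a small pre-flight check can catch a missing one early. The following is a minimal sketch, not part of PaperSorter itself: the function name missing_papersorter_env is hypothetical, and it assumes the printenv utility is available, as it is on most Linux and macOS systems.

```shell
# Print the name of every required variable that is missing from the
# environment (unset, empty, or not exported). Succeeds only if none is missing.
# The function name is hypothetical; the variable names come from this README.
missing_papersorter_env() (
    status=0
    for name in TOR_EMAIL TOR_PASSWORD UPSTAGE_API_KEY PAPERSORTER_WEBHOOK_URL; do
        # printenv only sees exported variables, which is also all that a
        # child process such as papersorter would see.
        if [ -z "$(printenv "$name")" ]; then
            echo "$name"
            status=1
        fi
    done
    [ "$status" -eq 0 ]
)
```

Calling the function at the top of a wrapper script and exiting when it fails would stop a run before any request is made with incomplete credentials.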
Initialization and Training
To train a predictor for your article interests, ensure your TheOldReader account contains at least 1000 articles, including at least 100 positively labeled articles marked with stars. Ideally, aim for around 5000 articles with 500 starred items for optimal performance.
After populating your TheOldReader account, initialize the feed and embedding databases using:
papersorter init
Next, train your first model with:
papersorter train
If the ROCAUC performance metric meets your expectations, you're ready to send notifications about new interesting articles.
Getting Updates and Sending Notifications
For regular updates, run the following command, which retrieves new feed items, converts them to embeddings, and finds interesting articles:
papersorter update
To send notifications for new interesting articles, run:
papersorter broadcast
You will receive formatted notifications in your Slack channel.
Running as a Cron Job
Here is an example of a shell script that runs PaperSorter's update and broadcast jobs in the background. This script sends notifications about new interesting articles between 7 am and 9 pm, while only performing updates during the night.
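#!/bin/bash
PAPERSORTER_CMD=/path/to/papersorter
PAPERSORTER_DATADIR=/path/to/data
LOGFILE=background-updates.log
CURRENT_HOUR=$(date +%H)

# Work inside the data directory; stop if it cannot be entered.
cd "$PAPERSORTER_DATADIR" || exit 1

# Always retrieve updates, including at night.
"$PAPERSORTER_CMD" update -q --log-file "$LOGFILE"

# Send Slack notifications only when the hour is 07 through 21.
if [ "$CURRENT_HOUR" -ge 7 ] && [ "$CURRENT_HOUR" -le 21 ]; then
    "$PAPERSORTER_CMD" broadcast -q --log-file "$LOGFILE"
fi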
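The time window check in the script above can also be isolated into a function, which makes it easy to try against sample hours. This is only a sketch: the function name in_broadcast_window is hypothetical. The [ command compares values such as 08 as decimal numbers, but bash arithmetic contexts such as $(( )) and [[ ]] would reject 08 and 09 as invalid octal, so the function strips a leading zero to stay safe if the comparison is ever moved into one of those.

```shell
# Succeed when the given hour (00-23, as printed by `date +%H`) falls inside
# the broadcast window used by the script above: 7 through 21 inclusive.
# The function name is hypothetical.
in_broadcast_window() {
    # Strip one leading zero so that "08" and "09" are plain decimal numbers;
    # "00" becomes "0", and an empty result falls back to 0 below.
    hour=${1#0}
    [ "${hour:-0}" -ge 7 ] && [ "${hour:-0}" -le 21 ]
}
```

In the cron script, the condition of the if statement could then be replaced by a call such as in_broadcast_window "$(date +%H)".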
Here is an example line for the crontab. It runs the update script at ten minutes past every hour.
10 * * * * /bin/bash /path/to/run-update.sh
Feedback and Updating the Model
To improve the model, provide more labels for the articles. First, extract the list of articles with the following command:
papersorter train -o model-temporary.pkl -f feedback.xlsx
This generates an Excel file, feedback.xlsx, containing titles, authors, prediction scores, and other details. Review each row and fill in the label column with 1 (interesting) or 0 (not interesting). Leave it blank if unsure. Once you've labeled some articles, update the feed database with:
papersorter feedback -i feedback.xlsx
Retrain the predictor with the updated labels using:
papersorter train
The new predictor is stored as model.pkl, and your next feeds will be assessed with the updated model.
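Once the spreadsheet has been labeled, the import and retraining steps can be chained so that retraining only happens if the import succeeds. The sketch below uses only the two commands shown in this section. The function name apply_feedback is hypothetical, and it assumes the function is called from the directory where PaperSorter keeps its data.

```shell
# Import a labeled spreadsheet and retrain, stopping at the first failure.
# The function name is hypothetical; the papersorter commands are the ones
# shown in this section.
apply_feedback() {
    file=${1:-feedback.xlsx}
    if [ ! -f "$file" ]; then
        echo "feedback file not found: $file" >&2
        return 1
    fi
    # Retrain only if the labels were imported successfully.
    papersorter feedback -i "$file" && papersorter train
}
```

Called without arguments, the function looks for feedback.xlsx in the current directory.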
Author
Hyeshik Chang hyeshik@snu.ac.kr