🐝 PaperBee

PaperBee is a Python application designed to automatically search for new scientific papers and post them to your favorite channels.
Currently supported platforms:

  • 🟣 Slack
  • 🟢 Zulip
  • 🔵 Telegram

🚀 How Does It Work?

PaperBee queries scientific papers using user-specified keywords from PubMed and preprint services, relying on the findpapers library.
Papers are then filtered either manually via a command-line interface or automatically via an LLM.
The filtered papers are posted to a Google Sheet and, if desired, to Slack, Telegram, or Zulip channels. PaperBee is easy to set up and configure with a simple YAML file.


📦 Installation

1. Install PaperBee and Its Dependencies from PyPI

pip install paperbee

📝 Setup Guide

1. Google Sheets Integration

  1. Create a Google Service Account:
    Official guide
    Needed to write found papers to a Google Spreadsheet.
  2. Create a JSON Key:
    Official guide
    Download and store the JSON file securely.
  3. Enable Google Sheets API:
    In Google Cloud Console, enable the Google Sheets API for your service account.
  4. Create a Google Spreadsheet:
    You can copy this template.
    The sheet must have columns: DOI, Date, PostedDate, IsPreprint, Title, Keywords, Preprint, URL.
    The sheet name must be Papers.
  5. Share the Spreadsheet:
    Add the service account email as an Editor.
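
Before the first run, it can help to confirm that the header row of the Papers sheet contains every required column from step 4. The following is a minimal sketch and not part of PaperBee: `missing_columns` is a hypothetical helper, and the header row is passed in as a plain list (reading it from the spreadsheet is left out).

```python
# Hypothetical helper, not part of PaperBee: compares a header row with
# the column names required by the setup guide.
REQUIRED_COLUMNS = [
    "DOI", "Date", "PostedDate", "IsPreprint",
    "Title", "Keywords", "Preprint", "URL",
]


def missing_columns(header_row):
    """Return the required column names absent from header_row, in order."""
    present = {cell.strip() for cell in header_row}
    return [name for name in REQUIRED_COLUMNS if name not in present]


if __name__ == "__main__":
    header = ["DOI", "Date", "PostedDate", "Title", "Keywords", "URL"]
    print(missing_columns(header))  # prints ['IsPreprint', 'Preprint']
```

An empty result means the header row satisfies the column requirement; the sheet name (Papers) still has to be checked separately.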

2. 🔑 Get NCBI API Key

PaperBee uses the NCBI API to fetch papers and DOIs from PubMed.
Get your free API key here.


3. 📢 Setup Posting Channels

You must set up at least one of the three platforms below.

🟣 Slack (optional)

  1. Create a Slack App (choose "From an app manifest").
  2. Choose your workspace.
  3. Copy the contents of manifest.json into the manifest box.
  4. Review and create the app.
  5. Install to Workspace and allow permissions.
  6. In OAuth & Permissions, copy the Bot User OAuth Token.
  7. In Basic Information, create an app-level token with connections:write scope.
  8. Set SLACK_CHANNEL_ID to your desired channel's ID.

Update the SLACK variables in the config.yml file.

🔵 Telegram (optional)

  1. Create a Telegram bot. Follow the instructions here.
  2. Create a channel or group, add the bot as admin.
  3. Use @myidbot to get the channel ID.

Update the TELEGRAM variables in the config.yml file.

🟢 Zulip (optional)

  1. Create a Zulip bot and download the zuliprc file.
  2. Create a stream and subscribe the bot.

Update the ZULIP variables in the config.yml file.


4. 🤖 Setup LLM for Automated Filtering (optional, but recommended)

If you want to use LLM filtering, remember to add a filtering_prompt.txt file.
See Setup Query and LLM Filtering Prompt.

OpenAI API

Ollama (Open Source LLMs)

Update the LLM variables in the config.yml file. (LLM_PROVIDER, LANGUAGE_MODEL, OPENAI_API_KEY)


⚙️ Configuration

PaperBee uses a YAML configuration file to specify all arguments.
Copy and customize the template below as config.yml:

Example config.yml

GOOGLE_SPREADSHEET_ID: "your-google-spreadsheet-id"
GOOGLE_CREDENTIALS_JSON: "/path/to/your/google-credentials.json"
NCBI_API_KEY: "your-ncbi-api-key"

# path to the local root directory where query prompts and files are stored
LOCAL_ROOT_DIR: "/path/to/local/root/dir"

# Queries. You can set either only "query" to use in all databases or query_biorxiv and query_pubmed_arxiv.
# Note that bioRxiv only accepts the OR boolean operator, while PubMed and arXiv also accept AND and AND NOT; this is why the two queries are separated.
# More info: https://github.com/jonatasgrosman/findpapers?tab=readme-ov-file#search-query-construction
query: "[AI for cell trajectories] OR [machine learning for cell trajectories] OR [deep learning for cell trajectories] OR [AI for cell dynamics] OR [machine learning for cell dynamics] OR [deep learning for cell dynamics]"
query_biorxiv: "[AI for cell trajectories] OR [machine learning for cell trajectories] OR [deep learning for cell trajectories] OR [AI for cell dynamics] OR [machine learning for cell dynamics] OR [deep learning for cell dynamics]"
query_pubmed_arxiv: "([single-cell transcriptomics]) AND ([Cell Dynamics]) AND ([AI] OR [machine learning] OR [deep learning]) AND NOT ([proteomics])"

# LLM Filtering (optional)
LLM_FILTERING: true
LLM_PROVIDER: "openai"
LANGUAGE_MODEL: "gpt-4o-mini"
OPENAI_API_KEY: "your-openai-api-key"
# Describe your interests and what kind of papers are relevant to your lab.
# Change lab focus and interests to your own. Feel free to add more details and examples, but leave the last sentence as is.
FILTERING_PROMPT: "You are a lab manager at a research lab focusing on machine learning methods development for single-cell RNA sequencing. Lab members are interested in developing methods to model cell dynamics. You are reviewing a list of research papers to determine if they are relevant to your lab. Please answer 'yes' or 'no' to the following question: Is the following research paper relevant?"

# Slack configuration
SLACK:
  is_posting_on: true
  bot_token: "your-slack-bot-token"
  channel_id: "your-slack-channel-id"
  app_token: "your-slack-app-token"

# Telegram configuration
TELEGRAM:
  is_posting_on: true
  bot_token: "your-telegram-bot-token"
  channel_id: "your-telegram-channel-id"

# Zulip configuration
ZULIP:
  is_posting_on: false
  prc: "path-to-your-zulip-prc"
  stream: "your-zulip-stream"
  topic: "your-zulip-topic"




SLACK_TEST_CHANNEL_ID: "your-slack-test-channel-id" # not required so left outside of dictionary
TELEGRAM_TEST_CHANNEL_ID: "your-telegram-test-channel-id" # not required so left outside of dictionary
GOOGLE_TEST_SPREADSHEET_ID: "your-google-test-spreadsheet-id" # not required so left outside of dictionary
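
Since at least one posting platform has to be set up, a quick check of the `is_posting_on` flags can catch a configuration that posts nowhere. The sketch below is an illustration only: `enabled_platforms` is a hypothetical helper, and whether PaperBee performs an equivalent check is not stated here. The dictionary stands in for the parsed config.yml (for example, the result of a YAML parser such as PyYAML's `yaml.safe_load`).

```python
# Hypothetical helper, not part of PaperBee: lists the platform sections
# of a parsed config.yml that have posting switched on.
PLATFORM_SECTIONS = ("SLACK", "TELEGRAM", "ZULIP")


def enabled_platforms(config):
    """Return the names of platform sections with is_posting_on set to true."""
    enabled = []
    for name in PLATFORM_SECTIONS:
        section = config.get(name)
        if isinstance(section, dict) and section.get("is_posting_on") is True:
            enabled.append(name)
    return enabled


if __name__ == "__main__":
    # Stand-in for the parsed template above; tokens and IDs are omitted.
    config = {
        "SLACK": {"is_posting_on": True},
        "TELEGRAM": {"is_posting_on": True},
        "ZULIP": {"is_posting_on": False},
    }
    active = enabled_platforms(config)
    if not active:
        raise SystemExit("Enable at least one of SLACK, TELEGRAM, or ZULIP")
    print(active)  # prints ['SLACK', 'TELEGRAM']
```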

📄 Example Query and Prompt

query

If specifying a list of keywords is enough, you can simply use one query for all databases. Example:

[AI for cell trajectories] OR [machine learning for cell trajectories] OR [deep learning for cell trajectories] OR [AI for cell dynamics] OR [machine learning for cell dynamics] OR [deep learning for cell dynamics]

PubMed and arXiv both allow more refined queries. To use the AND and AND NOT boolean operators, which PubMed and arXiv support, you need to split the query in two (read below).

query_biorxiv

bioRxiv has stricter requirements, so if your query is complex, you have to set a separate simple query for bioRxiv and a complex query for everything else. See the findpapers documentation for more details. TL;DR:

  • Only 1-level grouping is supported: no round brackets inside round brackets
  • Only OR connectors between parentheses are allowed, no () AND ()!
  • AND NOT is not allowed
  • All connectors must be either OR or AND. No mixing!

Here's an example of a valid query:

[AI for cell trajectories] OR [machine learning for cell trajectories] OR [deep learning for cell trajectories] OR [AI for cell dynamics] OR [machine learning for cell dynamics] OR [deep learning for cell dynamics]
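
The four rules above can be checked mechanically before a query is saved to config.yml. The sketch below is a rough, conservative illustration and not the findpapers validator: `biorxiv_query_problems` is a hypothetical helper, and it reads the "no mixing" rule as applying to the whole query, which may be stricter than what findpapers enforces.

```python
import re


def biorxiv_query_problems(query):
    """Return a list of rule violations found in a bioRxiv query string."""
    problems = []

    # Rule 1: only one level of round brackets.
    depth = 0
    for char in query:
        if char == "(":
            depth += 1
            if depth > 1:
                problems.append("nested round brackets")
                break
        elif char == ")":
            depth -= 1

    # Blank out the [keyword] terms so that words inside them are not
    # mistaken for connectors.
    skeleton = re.sub(r"\[[^\]]*\]", "[]", query)

    # Rule 3: AND NOT is not allowed.
    if re.search(r"\bAND\s+NOT\b", skeleton):
        problems.append("AND NOT is not allowed")

    # Rule 2: only OR may connect parenthesized groups.
    if re.search(r"\)\s*AND\b", skeleton) or re.search(r"\bAND\s*\(", skeleton):
        problems.append("only OR may connect parenthesized groups")

    # Rule 4 (strict reading): the query uses either AND or OR, never both.
    if len(set(re.findall(r"\b(AND|OR)\b", skeleton))) > 1:
        problems.append("AND and OR are mixed")

    return problems


if __name__ == "__main__":
    simple = "[AI for cell dynamics] OR [machine learning for cell dynamics]"
    print(biorxiv_query_problems(simple))  # prints []
```

A query that returns an empty list is a reasonable candidate for query_biorxiv; anything else belongs in query_pubmed_arxiv.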

query_pubmed_arxiv

PubMed and arXiv don't have such requirements, so the query can be more complex:

([single-cell transcriptomics]) AND ([Cell Dynamics]) AND ([AI] OR [machine learning] OR [deep learning]) AND NOT ([proteomics])

filtering_prompt

Simply describe your lab interests, which type of papers you want to see and which you don't. The more details, the better! But always leave the last sentence with the question as is. Here is an example:

You are a lab manager at a research lab focusing on machine learning methods development for single-cell RNA sequencing. Lab members are interested in developing methods to model cell dynamics. You are reviewing a list of research papers to determine if they are relevant to your lab. Please answer 'yes' or 'no' to the following question: Is the following research paper relevant?
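
The closing question asks the model for a plain 'yes' or 'no', which is presumably why it should stay unchanged: a fixed answer format is easy to interpret automatically. The sketch below shows one way such a reply could be read. It is an assumption for illustration, not PaperBee's actual parsing code, and `parse_relevance` is a hypothetical name.

```python
import re


def parse_relevance(answer):
    """Map a model reply to True (yes), False (no), or None (unclear)."""
    # Look only at the first word, ignoring case, quotes, and punctuation.
    words = re.findall(r"[a-z]+", answer.lower())
    first = words[0] if words else ""
    if first == "yes":
        return True
    if first == "no":
        return False
    return None


if __name__ == "__main__":
    print(parse_relevance("Yes."))      # prints True
    print(parse_relevance(" 'no' "))    # prints False
    print(parse_relevance("Possibly"))  # prints None
```

Returning None for anything other than a clear yes or no keeps ambiguous replies visible instead of silently dropping or posting a paper.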

▶️ Running the Bot

When everything is set up, run the bot with:

paperbee post --config /path/to/config.yml --interactive --since 10
  • --config : Path to your YAML configuration file.
  • --interactive : (Optional) Use CLI for manual filtering.
  • --since : (Optional) How many days back to search for papers (default: last 24h).

See daily_posting.py for an example of running search from Python.
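
For scheduled runs (for example from a cron job), the command above can be assembled in Python and handed to `subprocess.run`. This is a minimal sketch that uses only the flags documented above; `build_post_command` is a hypothetical helper and not part of PaperBee.

```python
import shlex


def build_post_command(config_path, since=None, interactive=False):
    """Build the argument list for the `paperbee post` command."""
    command = ["paperbee", "post", "--config", str(config_path)]
    if interactive:
        command.append("--interactive")
    if since is not None:
        # Number of days back to search; omitted to use the default.
        command.extend(["--since", str(since)])
    return command


if __name__ == "__main__":
    cmd = build_post_command("/path/to/config.yml", since=10, interactive=True)
    print(shlex.join(cmd))
    # prints: paperbee post --config /path/to/config.yml --interactive --since 10
    # To execute it: import subprocess; subprocess.run(cmd, check=True)
```

For unattended runs, leave `interactive` off so that no manual filtering step blocks the job.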


🗂️ Project Structure

manifest.json

Configuration for Slack apps.
With a manifest, you can create or adjust an app with a pre-defined configuration.

src/PaperBee/papers

Classes to fetch, format, and post papers, and update the Google Sheet.

  • utils.py – Preprocess findpapers output, extract DOIs.
  • google_sheet.py – Update/check the Google Sheet.
  • llm_filtering.py – Filter papers with LLMs.
  • cli.py – Interactive CLI filtering.
  • slack_papers_formatter.py – Format and post to Slack.
  • zulip_papers_formatter.py – Format and post to Zulip.
  • telegram_papers_formatter.py – Format and post to Telegram.
  • papers_finder.py – Main wrapper class.
  • daily_posting.py – CLI entry point.

🧪 Running Tests (Optional)

You can set up test channels for Slack/Telegram or run tests in production channels.
Set the following variables in your config.yml:

  • TELEGRAM_TEST_CHANNEL_ID – Telegram test channel ID.
  • SLACK_TEST_CHANNEL_ID – Slack test channel ID.
  • GOOGLE_TEST_SPREADSHEET_ID – Test spreadsheet ID. Don't use a production spreadsheet!

Install extra dependencies:

pip install pytest-asyncio

or with Poetry:

poetry install --with dev

Run the tests:

pytest

Reference

We have submitted the paper to the Journal of Open Source Software (JOSS). You can check the paper text in the paper directory or find a rendered PDF in the GitHub Actions artifacts. While the paper is under revision, feel free to cite this repo as:

@misc{shitov_patpy_2024,
  author = {Lucarelli, Daniele and Shitov, Vladimir A. and Saur, Dieter and Zappia, Luke and Theis, Fabian J.},
  title = {PaperBee: An Automated Daily Digest Bot for Scientific Literature Monitoring},
  year = {2025},
  url = {https://github.com/theislab/paperbee},
  note = {Version 1.0.0}
}

Enjoy using 🐝 PaperBee!
