
Paper Firehose

Filter, rank, and summarize research-paper RSS feeds. Stores results in SQLite and can generate HTML pages or an email digest. Optional full‑text (paper-qa) summaries of preprints from arXiv.

How to use:

Install locally

  • pip install paper_firehose
  • CLI entrypoint: paper-firehose
  • After install, run paper-firehose --help for available command line options.
  • In Jupyter or a Python file: import paper_firehose as pf

Configuration is done using only YAML text files. On first run the default YAML configs are copied into your runtime data directory (defaults to ~/.paper_firehose, override with PAPER_FIREHOSE_DATA_DIR) from src/paper_firehose/system/config. Edit those files to customize feeds and topics.
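The data-dir resolution and first-run seeding described above can be pictured with a short sketch (illustrative only; the package's internal implementation may differ):

```python
import os
import shutil
from pathlib import Path

def resolve_data_dir() -> Path:
    """Resolve the runtime data dir, honoring PAPER_FIREHOSE_DATA_DIR."""
    return Path(os.environ.get("PAPER_FIREHOSE_DATA_DIR", "~/.paper_firehose")).expanduser()

def seed_configs(bundled_config: Path, data_dir: Path) -> None:
    """Copy the bundled YAML configs into the data dir on first run only.

    If config/ already exists, user edits are left untouched.
    """
    target = data_dir / "config"
    if not target.exists():
        shutil.copytree(bundled_config, target)
```

Because seeding only happens when `config/` is absent, your edited feeds and topics survive upgrades and repeat runs.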

For summarization to work, set the OPENAI_API_KEY environment variable (or the variable named by config.llm.api_key_env).

Automated run using GitHub Actions

  • Fork the repo.
  • Copy the example YAML config files as-is from src/paper_firehose/system/config to <repo root>/github_actions_config, then edit them to set up your own search terms and configuration.
  • Edit the schedule.cron entry in pages.yml to set when the automated job runs.
  • Set up GitHub Secrets under Settings / Secrets and variables / Actions. You can skip this step if you only run the filter and rank commands. For the summarization and email-alert functionality, you will need:
    • OPENAI_API_KEY: required only if you want LLM summaries or paper-qa full-text summarization.
    • MAILING_LISTS_YAML: the emails and other config that the email-alert functionality needs. Paste in the contents of your mailing_lists.yaml file; keeping it as a secret avoids exposing user info in the public repo.
    • SMTP_PASSWORD: the password for your email server.

The html command in GitHub Actions (see pages.yml) generates HTML files (whose names are set in the YAML config) with your results. The GitHub Actions runner then pushes these generated files to https://<your GH username>.github.io/paper-firehose/<your results>.html, where they can be accessed on the open web.

Quick Start

  1. Seed and inspect config
paper-firehose status
  2. Run the core pipeline for all topics
paper-firehose filter
paper-firehose rank
paper-firehose abstracts --mailto you@example.com --rps 1.0
paper-firehose pqa_summary    # optional (needs OpenAI key)
paper-firehose html           # write HTML from DB

You can restrict any of these commands to a single topic with the --topic YOUR_TOPIC option.

  3. Optional: full‑text summaries via paper‑qa and email digest
# Download arXiv PDFs for high‑ranked entries and summarize with paper‑qa
paper-firehose pqa_summary --topic perovskites --rps 0.33 --limit 20 --summarize

# Send a ranked email digest (SMTP config required)
paper-firehose email

CLI Reference

Global options

  • --config PATH use a specific YAML config (defaults to ~/.paper_firehose/config/config.yaml)
  • -v/--verbose enable debug logging

Commands

  • filter [--topic TOPIC]

    • Fetch RSS feeds, dedup by title, apply per‑topic regex, write matches to databases.
    • Backs up all_feed_entries.db and matched_entries_history.db, then clears the current papers.db working table.
  • rank [--topic TOPIC]

    • Compute rank_score using Sentence‑Transformers similarity to ranking.query.
    • Optional boosts: per‑topic ranking.preferred_authors (priority_author_boost) and global priority_journals (priority_journal_boost).
    • Models can be vendored under the data dir models/. The default alias all-MiniLM-L6-v2 is supported.
  • abstracts [--topic TOPIC] [--mailto EMAIL] [--limit N] [--rps FLOAT]

    • Fetch abstracts above a rank threshold (topic abstract_fetch.rank_threshold or global defaults.rank_threshold).
    • Uses polite rate limits; sets a descriptive arXiv/Crossref User‑Agent including your contact email.
  • summarize [--topic TOPIC] [--rps FLOAT]

    • LLM summaries of abstracts for top‑ranked entries using config.llm and per‑topic llm_summary settings.
    • Requires an OpenAI API key.
  • html [--topic TOPIC]

    • Generate HTML page(s) directly from papers.db. For a single topic, output.filename is used unless you override via the Python API (see below).
  • pqa_summary [--topic TOPIC] [--rps FLOAT] [--limit N] [--arxiv ID|URL ...] [--entry-id ID ...] [--use-history] [--history-date YYYY-MM-DD] [--history-feed-like STR] [--summarize]

    • Download arXiv PDFs for ranked entries (or explicit IDs/URLs) with polite rate limiting, archive them, optionally run paper‑qa, and write normalized JSON into the DBs. Old PDFs are discarded from the archive; no web scraping is performed.
    • Accepts --arxiv values like 2501.12345, 2501.12345v2, https://arxiv.org/abs/2501.12345, or https://arxiv.org/pdf/2501.12345.pdf.
  • email [--topic TOPIC] [--mode auto|ranked] [--limit N] [--recipients PATH] [--dry-run]

    • Send a compact HTML digest via SMTP (SSL). In dry‑run, writes a preview HTML to the data dir.
    • --recipients points to a YAML file with per‑recipient overrides (see Configuration).
  • purge (--days N | --all)

    • Remove entries by date from databases, or clear all and reinitialize schemas (--all).
  • status

    • Validate configuration and list available topics, enabled feeds, and database paths.
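The --arxiv input formats accepted by pqa_summary (bare IDs, versioned IDs, abs/pdf URLs) can be normalized with a small helper. This is an illustrative sketch, not the package's actual parser:

```python
import re

def normalize_arxiv_id(value: str) -> str:
    """Reduce an arXiv ID or URL to the bare 'YYMM.NNNNN[vN]' identifier (sketch)."""
    # Strip an abs/ or pdf/ URL prefix, if present.
    value = re.sub(r"^https?://arxiv\.org/(abs|pdf)/", "", value.strip())
    # Strip a trailing .pdf extension from pdf URLs.
    return re.sub(r"\.pdf$", "", value)
```

All four example inputs from the pqa_summary description reduce to the same identifier (modulo the version suffix), which makes deduplication straightforward.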

Python API

Import functions directly from the package for programmatic workflows:

from paper_firehose import (
    filter, rank, abstracts, summarize, pqa_summary, email, purge, status, html,
)

# Run steps
filter(topic="perovskites")
rank(topic="perovskites")
abstracts(topic="perovskites", mailto="you@example.com", rps=1.0)
summarize(topic="perovskites", rps=0.5)

# Generate HTML (single topic can override output path)
html(topic="perovskites")
html(topic="perovskites", output_path="results_perovskites.html")

# Paper‑QA download + summarize
pqa_summary(topic="perovskites", rps=0.33, limit=10)
pqa_summary(arxiv=["2501.12345", "https://arxiv.org/abs/2501.12345v2"], summarize=True)

# Email digest
email(limit=10, dry_run=True)

# Maintenance
purge(days=7)
info = status()
print(info["valid"], info["topics"])  # dict with config + paths

Configuration

Runtime data dir

  • Default: ~/.paper_firehose in your home folder on macOS or Linux; on Windows: C:\Users\<YourUser>\.paper_firehose.
  • Override with the PAPER_FIREHOSE_DATA_DIR environment variable.
  • First run seeds config/, templates/, and optional models/ from the bundled system/ directory.

Files to edit

  • config/config.yaml: global settings (DB paths, feeds, LLM, paper‑qa, defaults, optional email/SMTP)
  • config/topics/<topic>.yaml: topic name/description, feeds, regex filter, ranking, abstract fetch, LLM summary, and output filenames
  • config/secrets/: secret material that should not be committed. These secrets can be stored either as *.env files or as environment variables.
    • openaikulcs.env: OpenAI API key for summarize and pqa_summary
    • email_password.env: SMTP password (referenced by email.smtp.password_file)
    • mailing_lists.yaml: optional per‑recipient overrides for email:
      recipients:
        - to: person@example.com
          topics: [perovskites, batteries]   # subset of topics for this person
          mode: ranked                       # currently always renders ranked from DB
          limit: 10                          # per‑recipient cap
          min_rank_score: 0.3                # optional cutoff
      
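Assuming the recipient structure shown above, the per-recipient selection might be sketched like this (illustrative only; the field names are taken from the example, the function name is hypothetical):

```python
def select_for_recipient(recipient: dict, topic: str, papers: list[dict]) -> list[dict]:
    """Apply a recipient's topic subset, min_rank_score cutoff, and limit (sketch)."""
    # Skip topics the recipient did not subscribe to.
    if "topics" in recipient and topic not in recipient["topics"]:
        return []
    cutoff = recipient.get("min_rank_score", 0.0)
    kept = [p for p in papers if p["rank_score"] >= cutoff]
    # Highest-ranked first, then apply the per-recipient cap.
    kept.sort(key=lambda p: p["rank_score"], reverse=True)
    return kept[: recipient.get("limit", len(kept))]
```

With the example above (limit: 10, min_rank_score: 0.3), a recipient subscribed to perovskites and batteries would receive at most ten entries per topic, each scoring 0.3 or better.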

Key config fields

  • filter.pattern: the regular expression that does the heavy lifting of "casting a wide net", capturing papers from the RSS feeds related to your topic of interest. The point of using regular expressions is that they can capture the many ways certain terms can be written. For example, the regexp (scan[a-z]+ tunne[a-z]+ micr[a-z]+) matches "scanning tunneling microscopy" as well as "scanned tunneling microscopies", including both the British and US English spellings of 'tunnelling'/'tunneling'. The regexp matches can then be ranked by similarity to the keyword list under ranking.query. It takes a bit of thought to set this up, but it is powerful.
  • ranking.query: List of keywords that are used by an embedding model to rank the results. Asking an LLM to generate regex patterns from your keywords might be an easy way to set up filter.pattern.
  • feeds: mapping of feed keys to {name, url, enabled}. Feed keys are referenced in topic files; name is stored in DBs and used in HTML.
  • priority_journals and priority_journal_boost: optional global score boost by feed key.
  • Topic ranking: query, model, optional negative_queries, preferred_authors, priority_author_boost.
  • Topic output: filename, filename_ranked, optional filename_summary, archive: true|false.
  • Topic llm_summary: enabled, prompt (can reference {ranking_query}), score_cutoff, top_n.
  • paperqa: download_rank_threshold, rps (≤ 0.33 recommended), max_retries, and prompt for JSON‑only answers.
  • llm: model, model_fallback, api_key_env, default rps, max_retries, plus optional GPT‑5 verbosity and reasoning_effort.
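The filter.pattern example from the list above can be exercised directly with Python's re module:

```python
import re

# The example pattern from filter.pattern: one character class per word stem
# catches plural forms and both the 'tunneling'/'tunnelling' spellings.
pattern = re.compile(r"scan[a-z]+ tunne[a-z]+ micr[a-z]+", re.IGNORECASE)

for title in (
    "Scanning tunneling microscopy of graphene",
    "A scanned tunnelling microscope study",
    "Transmission electron microscopy",  # should NOT match
):
    print(bool(pattern.search(title)))
```

Testing candidate patterns against a handful of real and near-miss titles like this, before putting them in the topic YAML, saves a lot of trial and error.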

Environment variables

  • PAPER_FIREHOSE_DATA_DIR select/override the runtime data location
  • OPENAI_API_KEY (or config.llm.api_key_env) for summarize
  • MAILTO used for polite arXiv/Crossref User‑Agent when not specified on CLI

Data & Outputs

Databases (under the data dir unless absolute paths are used)

  • all_feed_entries.db (table feed_entries): every fetched item for deduplication
  • matched_entries_history.db (table matched_entries): historical archive of matches, optional JSON summaries
  • papers.db (table entries): current‑run working set with status, rank_score, llm_summary, paper_qa_summary
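The column names above come straight from the table descriptions; a hedged sketch of querying the current-run working set in papers.db (the title column and any schema details beyond the listed columns are assumptions):

```python
import sqlite3

def top_ranked(db_path: str, limit: int = 10) -> list[tuple]:
    """Return (title, rank_score) for the highest-ranked entries (sketch)."""
    with sqlite3.connect(db_path) as conn:
        return conn.execute(
            "SELECT title, rank_score FROM entries "
            "WHERE rank_score IS NOT NULL "
            "ORDER BY rank_score DESC LIMIT ?",
            (limit,),
        ).fetchall()
```

Since everything lives in plain SQLite, ad-hoc queries like this are an easy way to inspect results outside the HTML and email outputs.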

HTML

  • Generated by the html command from papers.db using templates in templates/. Ranked and LLM‑summary pages are produced when configured.

Email

  • Requires email.smtp config: host, port, username, and either password or password_file. Uses SSL.
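The SMTP settings map naturally onto Python's standard library; a minimal sketch of building and sending the digest (the function names, subject line, and config keys here are illustrative, mirroring the host/port/username/password fields above):

```python
import smtplib
from email.message import EmailMessage

def build_digest_message(from_addr: str, to_addr: str, html_body: str) -> EmailMessage:
    """Assemble a plain-text + HTML multipart digest message (sketch)."""
    msg = EmailMessage()
    msg["Subject"] = "Paper Firehose digest"
    msg["From"] = from_addr
    msg["To"] = to_addr
    msg.set_content("This digest is best viewed as HTML.")
    msg.add_alternative(html_body, subtype="html")
    return msg

def send_digest(cfg: dict, msg: EmailMessage) -> None:
    """Send the message over SMTP with SSL, as the email command does."""
    with smtplib.SMTP_SSL(cfg["host"], cfg["port"]) as server:
        server.login(cfg["username"], cfg["password"])
        server.send_message(msg)
```

The --dry-run flag sidesteps all of this and just writes the rendered HTML preview to the data dir, which is the safest way to check formatting before enabling real sends.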

Future dev

  • Improve the history browser HTML interface.
  • Run ranking on the historic database with a custom query, to search for specific papers.
  • The abstract summarizer adds little value at this point and may be removed in the future.

Final notes

  • Python 3.11+ recommended. See pyproject.toml for dependencies.
  • Thank you to arXiv for use of its open access interoperability. This project links to arXiv/publisher pages and does not serve PDFs.
