Filter, rank, and summarize research-paper RSS feeds.
Project description
Paper Firehose
- Fetches academic RSS feeds, filters entries with per-topic regex, and writes results into SQLite databases. HTML pages with the results are rendered directly from the database.
- Results are ranked by cosine similarity to a set of user-defined keywords. A configurable list of authors can receive a ranking boost, so papers from your friends (or competitors) rise in the results.
- The highest-ranked results are summarized by an LLM; for summarization to work, you need an OpenAI API key. Full-text summarization uses paper-qa.
- New search terms can be created by simply adding a YAML config file under `config/topics/your_topic_name.yaml`. Look at the existing topics for guidance.
- The repository ships a self-contained `system/` bundle (configs, templates, sample topics). On first run these files are copied into your runtime data directory (default `~/.paper_firehose`, or the path in `PAPER_FIREHOSE_DATA_DIR`) so you can customize them with your own search terms, RSS feed sources, etc.
- GitHub Actions-only YAML config lives in `github_actions_config/`. When the workflow sets `PAPER_FIREHOSE_DATA_DIR`, those files are synced into `${PAPER_FIREHOSE_DATA_DIR}/config/`, keeping the published package generic while letting GitHub Actions runs use repo-specific overrides. If you want to use this via GitHub Actions, set your preferences in this directory.
- Daily automation runs through GitHub Actions (`.github/workflows/pages.yml`). The workflow restores database snapshots from the `data` branch, executes the CLI pipeline (`filter` → `rank` → `abstracts` → `html`, plus optional `pqa_summary`/`email`), and publishes refreshed HTML/SQLite artifacts back to GitHub Pages and the `data` branch.
Written using Python 3.11.
For dependencies, see `requirements.txt`.
The OpenAI API key is read from the `openaikulcs.env` file in the repo root or from the `OPENAI_API_KEY` environment variable.
CLI
- Filter: fetch from the RSS feed list, dedup, match regex, write DBs, render HTML
  `python cli/main.py -v filter [--topic TOPIC]`
  - Backs up `all_feed_entries.db` and `matched_entries_history.db` (keeps the 3 latest backups).
  - Clears the `papers.db` working table before processing this run.
  - Fetches configured feeds for each topic, dedups by title against `all_feed_entries.db`, and filters by topic regex.
  - Writes matches to `papers.db` (`status='filtered'`); optionally archives to `matched_entries_history.db` if `output.archive: true`.
  - Saves ALL processed entries (matched and non-matched) to `all_feed_entries.db` for future dedup.
  - Renders per-topic HTML from `papers.db`.
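The per-topic regex match described above can be sketched in a few lines; the topic dict and field names here are illustrative stand-ins for a topic's `filter.pattern` and `filter.fields`:

```python
import re

# Hypothetical topic config: a regex pattern plus the fields it is matched against.
topic = {"pattern": r"graphene|topological insulator", "fields": ["title", "summary"]}
regex = re.compile(topic["pattern"], re.IGNORECASE)

def matches(entry: dict) -> bool:
    """Return True if the topic regex hits any configured field."""
    return any(regex.search(entry.get(field, "")) for field in topic["fields"])

entries = [
    {"title": "Transport in graphene ribbons", "summary": "..."},
    {"title": "Protein folding dynamics", "summary": "..."},
]
matched = [e for e in entries if matches(e)]
print(len(matched))  # → 1
```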
- Rank: rank fetched papers by title against the keywords in the topic's YAML config (optional)
  `python cli/main.py rank [--topic TOPIC]`
  - Computes and writes `rank_score` for `papers.db` entries using sentence-transformers; HTML files with ranked entries are generated.
  - Model selection: if `models/all-MiniLM-L6-v2` exists, it is used; otherwise it falls back to the Hugging Face repo id `all-MiniLM-L6-v2` and downloads once into cache. You can vendor the model with `python scripts/vendor_model.py`.
  - Scoring details: applies a small penalty for `ranking.negative_queries` matches (title/summary). Optional boosts: per-topic `ranking.preferred_authors` with `ranking.priority_author_boost`, and global `priority_journal_boost` for feeds listed in `priority_journals`.
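To illustrate cosine-similarity ranking, the sketch below uses simple bag-of-words vectors; the real command embeds titles with sentence-transformers, but the scoring idea is the same:

```python
import math
from collections import Counter

def vec(text: str) -> Counter:
    """Bag-of-words term counts (stand-in for a sentence embedding)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

query = "superconductivity in twisted bilayer graphene"
titles = [
    "Superconductivity observed in twisted bilayer graphene",
    "A survey of protein folding",
]
ranked = sorted(titles, key=lambda t: cosine(vec(query), vec(t)), reverse=True)
print(ranked[0])  # the graphene title ranks first
```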
- Abstracts: try to fetch abstracts
  `python cli/main.py abstracts [--topic TOPIC] [--mailto you@example.com] [--limit N] [--rps 1.0]`
  - Order: 1) fill arXiv/cond-mat abstracts from `summary` (no threshold); 2) above-threshold: Crossref (DOI, then title); 3) above-threshold: Semantic Scholar → OpenAlex → PubMed.
  - Threshold: topic `abstract_fetch.rank_threshold`, else global `defaults.rank_threshold`.
  - Only topics with `abstract_fetch.enabled: true` are processed.
  - Writes to both `papers.db.entries.abstract` and `matched_entries_history.db.matched_entries.abstract`.
  - Rate limiting: descriptive User-Agent (includes `--mailto` or `$MAILTO`), respects Retry-After; default ~1 req/sec via `--rps`.
  - Populates the `entries.abstract` column; leaves other fields unchanged.
  - Contact email: if `--mailto` is not provided, the command reads `$MAILTO` from the environment; if unset, it uses a safe default.
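The requests-per-second throttling can be pictured as a minimal rate limiter (a sketch, not the CLI's actual implementation; `rps=10` just keeps the demo fast, whereas the command defaults to ~1 req/sec):

```python
import time

class RateLimiter:
    """Minimal requests-per-second throttle: sleep so that consecutive
    calls are at least 1/rps seconds apart."""
    def __init__(self, rps: float):
        self.min_interval = 1.0 / rps
        self._last = 0.0

    def wait(self) -> None:
        now = time.monotonic()
        sleep_for = self._last + self.min_interval - now
        if sleep_for > 0:
            time.sleep(sleep_for)
        self._last = time.monotonic()

limiter = RateLimiter(rps=10.0)
t0 = time.monotonic()
for _ in range(3):
    limiter.wait()  # before each hypothetical API request
elapsed = time.monotonic() - t0
print(f"3 calls took {elapsed:.2f}s")  # at least two 0.1 s gaps enforced
```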
- Summarize (optional)
  `python cli/main.py summarize [--topic TOPIC] [--rps 0.5]`
  - Selects top entries per topic based on `llm_summary.score_cutoff` and `llm_summary.top_n`, builds input strictly from `title + abstract` (skips entries without an abstract), and calls the configured OpenAI chat model.
  - Writes summaries to `papers.db.entries.llm_summary` and, when present, `matched_entries_history.db.matched_entries.llm_summary`.
  - Uses `config.llm` settings. Supports JSON or plain-text responses; JSON is preferred and rendered with headings.
  - Note: this command only updates databases. Use `html` to render pages.
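The cutoff-plus-top-N selection can be sketched as follows (field names mirror the docs above; the cutoff and cap values are hypothetical):

```python
def select_for_summary(entries, score_cutoff=0.5, top_n=2):
    """Keep entries at or above the score cutoff that have an abstract
    (the summarize step skips abstract-less entries), sort by rank,
    and cap at top_n."""
    eligible = [
        e for e in entries
        if e["rank_score"] >= score_cutoff and e.get("abstract")
    ]
    eligible.sort(key=lambda e: e["rank_score"], reverse=True)
    return eligible[:top_n]

entries = [
    {"title": "A", "rank_score": 0.9, "abstract": "..."},
    {"title": "B", "rank_score": 0.7, "abstract": None},   # no abstract: skipped
    {"title": "C", "rank_score": 0.6, "abstract": "..."},
    {"title": "D", "rank_score": 0.4, "abstract": "..."},  # below cutoff
]
print([e["title"] for e in select_for_summary(entries)])  # → ['A', 'C']
```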
- pqa_summary (PDF-based summarization)
  `python cli/main.py pqa_summary [--topic TOPIC]`
  - Selects arXiv preprints in `papers.db` with `rank_score >= config.paperqa.download_rank_threshold`, detects arXiv IDs, and downloads PDFs (polite arXiv API usage).
  - Runs paper-qa to summarize the full text into JSON keys: `summary`, `methods`.
  - Writes summaries to `papers.db.entries.paper_qa_summary` only for the specific topic row the item was selected under (no longer cross-updating all topics for the same entry id), and to `matched_entries_history.db.matched_entries.paper_qa_summary`.
  - Prunes archived PDFs older than ~30 days from `assets/paperqa_archive/` after each run to keep storage manageable.
  - Note: this command only updates databases. Use `html` to render pages.
- HTML (render only; no fetching)
  `python cli/main.py html [--topic TOPIC]`
  - Reads from `papers.db` and generates, per topic:
    - Filtered page: `output.filename` (if configured)
    - Ranked page: `output.filename_ranked` (if configured and entries exist)
    - Summary page: `output.filename_summary` (if configured). Content priority: PDF summaries → LLM summaries → abstract-only fallback; always ordered by rank.
Summary pages
- When `output.filename_summary` is set for a topic, summary pages prefer content in this order: `paper_qa_summary` (PDF-based) → `llm_summary` → fallback to ranked fields (abstract → summary).
- Entries are ordered by descending `rank_score`.
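The content-priority fallback can be sketched as a single lookup over the documented columns:

```python
def summary_content(entry: dict) -> str:
    """Pick summary-page content in the documented priority order:
    paper_qa_summary → llm_summary → abstract → summary."""
    for key in ("paper_qa_summary", "llm_summary", "abstract", "summary"):
        value = entry.get(key)
        if value:
            return value
    return ""

# An entry with no PDF or LLM summary falls back to its abstract.
print(summary_content({"llm_summary": None, "abstract": "An abstract."}))
```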
- Purge
  `python cli/main.py purge --days N` removes entries with `published_date` within the most recent N days in the seen-entries DB. `python cli/main.py purge --all` deletes all DB files and reinitializes schemas (no confirmation prompt).
- Status
  `python cli/main.py status`
  - Validates config, lists topics/feeds, and shows DB paths.
- Email (Mailman digest)
  `python cli/main.py email [--topic TOPIC] [--mode auto|ranked] [--limit N] [--dry-run]`
  - Builds an email-friendly HTML digest from `papers.db` and sends it via SMTP (SSL).
  - `--mode auto` renders a ranked-style list directly from `papers.db`; `--mode ranked` embeds the pre-generated ranked HTML if present, otherwise falls back to the ranked-style list. `--dry-run` writes a preview HTML to `assets/` instead of sending.
  - Per-recipient routing: `python cli/main.py email --recipients config/secrets/mailing_lists.yaml`, or set `config.email.recipients_file`.
Email configuration
Add an `email` section in `config/config.yaml` (secrets in a separate file):

```yaml
email:
  to: "LIST_ADDRESS@yourdomain"      # Mailman list address
  subject_prefix: "Paper Firehose"   # optional
  from: "mail@xyz.com"               # defaults to smtp.username
  smtp:
    host: "mail.xyz.com"
    port: 465
    username: "youraccount"
    password_file: "config/secrets/email_password.txt"  # store only the password here
```
Notes
- Store the SMTP password in `config/secrets/email_password.txt` (gitignored).
- The command should be run after `filter`, `rank`, `abstracts`, and `summarize` so it can include LLM summaries when available.
Per-recipient YAML (`config/secrets/mailing_lists.yaml`) example:

```yaml
recipients:
  - to: "materials-list@nemeslab.com"
    topics: ["primary", "perovskites"]
    min_rank_score: 0.40
  - to: "2d-list@nemeslab.com"
    topics: ["rg", "2d_metals"]
    min_rank_score: 0.35
```
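The per-recipient routing might look roughly like this (a sketch; `route` is a hypothetical helper, not part of the package API):

```python
# Select entries for each recipient by topic membership and minimum rank
# score, mirroring the mailing_lists.yaml fields above.
def route(entries, recipients):
    digests = {}
    for r in recipients:
        digests[r["to"]] = [
            e for e in entries
            if e["topic"] in r["topics"] and e["rank_score"] >= r["min_rank_score"]
        ]
    return digests

entries = [
    {"topic": "primary", "rank_score": 0.55, "title": "A"},
    {"topic": "rg", "rank_score": 0.30, "title": "B"},
]
recipients = [
    {"to": "materials-list@nemeslab.com", "topics": ["primary", "perovskites"], "min_rank_score": 0.40},
    {"to": "2d-list@nemeslab.com", "topics": ["rg", "2d_metals"], "min_rank_score": 0.35},
]
digests = route(entries, recipients)
# The materials list receives "A"; the rg entry scores below its 0.35 cutoff.
```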
Python API
You can call the main steps programmatically via `paper_firehose`.

Basics
`import paper_firehose as pf`
- All functions default to `config/config.yaml`; override with `config_path="..."`.

Functions
- `pf.filter(topic=None, config_path=None)`: runs the filter step for one topic or all.
- `pf.rank(topic=None, config_path=None)`: computes and writes `rank_score` for entries.
- `pf.abstracts(topic=None, *, mailto=None, limit=None, rps=None, config_path=None)`: fetches abstracts for above-threshold entries and writes to DBs.
- `pf.summarize(topic=None, *, rps=None, config_path=None)`: runs LLM summarization for top-ranked entries per topic, writing JSON (or text) to `llm_summary`.
- `pf.pqa_summary(topic=None, *, rps=None, limit=None, arxiv=None, entry_ids=None, use_history=False, history_date=None, history_feed_like=None, config_path=None)`: runs the paper-qa PDF summarizer with the same parameters as the CLI command.
- `pf.html(topic=None, output_path=None, config_path=None)`: regenerates the HTML pages directly from `papers.db`. When `topic` is omitted, all topics are rendered to their configured filenames.
- `pf.email(topic=None, *, mode='auto', limit=None, recipients_file=None, dry_run=False, config_path=None)`: sends the digest email (or writes a preview on `dry_run`).
- `pf.purge(days=None, all_data=False, config_path=None)`: purges entries based on publication date. When `days` is provided, removes entries from the most recent N days (including today) across all DBs; when `all_data=True`, reinitializes all DBs.
- `pf.status(config_path=None) -> dict`: returns configuration validity, available topics, enabled feed count, and database paths.
History Viewer
- `history_viewer.html` is a static browser viewer for `assets/matched_entries_history.db` (table `matched_entries`).
- By default it auto-loads the latest history DB from GitHub:
  - Displayed: https://github.com/zrbyte/paper-firehose/tree/data/assets/matched_entries_history.latest.db
  - The viewer automatically normalizes GitHub page links to their raw content (e.g., `raw.githubusercontent.com`) before fetching.
- You can override with a query param or a local file:
  - `history_viewer.html?db=<url>` to load a specific remote DB
  - Use the file input or drag-and-drop a local `matched_entries_history.db`
- `history_viewer_cards.html` provides a cleaner, card-style view of history entries with just the key fields (title, authors, feed name, abstract, matched date). It supports the same controls and query params as `history_viewer.html` (topic, order, search, `?db=<url>`, and file drag-and-drop) but focuses on readability instead of tabular data.
Continuous delivery via GitHub Actions
The workflow Build and Deploy (Runtime Data Dir Test), defined in `.github/workflows/pages.yml`, drives the hosted pipeline:
- Checks out the repository and prepares a runtime directory at `$GITHUB_WORKSPACE/.paper_firehose` (matching local defaults).
- Restores cached SQLite databases from the `data` branch so runs build on prior state rather than starting from empty tables.
- Installs dependencies, seeds secrets, and runs the CLI sequence: `filter`, `rank`, `abstracts`, optional `pqa_summary`, optional `email`, and finally `html` to refresh HTML artifacts.
- Publishes generated HTML via GitHub Pages and commits the updated databases plus rotated history snapshots back to the `data` branch.
- Accepts workflow dispatch inputs (`run_pqa`, `run_email`) so heavier steps can be toggled without editing the workflow; useful for debugging.
Architecture overview
- Three-DB architecture:
  - `assets/all_feed_entries.db`: every fetched item (for deduplication).
  - `assets/matched_entries_history.db`: all matched items across topics and runs (historical archive).
  - `assets/papers.db`: current-run working set (filtered → ranked → summarized).
- YAML-driven configuration for feeds and topics.
- HTML generated from `papers.db`, so you can re-render without refetching.
- Optional LLM summarization writes JSON summaries to the DB and renders dedicated summary pages.
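The role of `all_feed_entries.db` in deduplication can be sketched with an in-memory SQLite table (real runs dedup by title against the on-disk DB):

```python
import sqlite3

# Stand-in for assets/all_feed_entries.db with one previously seen entry.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE feed_entries (entry_id TEXT PRIMARY KEY, title TEXT)")
con.execute("INSERT INTO feed_entries VALUES ('e1', 'Known paper')")

def is_new(title: str) -> bool:
    """True if this title has never been seen; only new titles go on to filtering."""
    row = con.execute(
        "SELECT 1 FROM feed_entries WHERE title = ?", (title,)
    ).fetchone()
    return row is None

print(is_new("Known paper"))  # → False
print(is_new("Fresh paper"))  # → True
```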
Bundled system assets
- The directory `src/paper_firehose/system/` is the canonical source for starter config files, topic examples, HTML templates, and vendored models.
- `paper_firehose.core.paths.ensure_data_dir()` provisions a runtime directory (default `~/.paper_firehose`). Missing files are copied from the system bundle exactly once, enabling local overrides that survive upgrades.
- To refresh a single asset, delete it from the runtime directory and rerun any CLI command; the latest version from `system/` will be re-seeded automatically.
Databases
- `all_feed_entries.db` (table `feed_entries`)
  - Keys: `entry_id` (pk), `feed_name` (display name from `config.yaml`), `title`, `link`.
  - Metadata: `summary`, `authors`, `published_date`, `first_seen`, `last_seen`, `raw_data` (JSON).
  - Used only for dedup; populated after filtering completes.
- `matched_entries_history.db` (table `matched_entries`)
  - Keys: `entry_id` (pk), `feed_name`, `topics` (CSV of topic names).
  - Metadata: `title`, `link`, `summary`, `authors`, `abstract` (nullable), `doi` (nullable), `published_date`, `matched_date`, `raw_data` (JSON), `llm_summary` (nullable), `paper_qa_summary` (nullable).
  - Written only when a topic's `output.archive: true`.
- `papers.db` (table `entries`)
  - Primary key: composite `PRIMARY KEY (id, topic)`, so the same entry can appear once per topic.
  - Columns: `id`, `topic`, `feed_name` (display name), `title`, `link`, `summary`, `authors`, `abstract` (nullable), `doi` (nullable), `published_date`, `discovered_date`, `status` (filtered|ranked|summarized), `rank_score`, `rank_reasoning`, `llm_summary`, `raw_data` (JSON).
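The composite key behaves as sketched below: the same `id` may appear under several topics, but only once per topic (a trimmed-down schema, not the full `entries` table):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE entries (
        id TEXT, topic TEXT, title TEXT,
        PRIMARY KEY (id, topic)
    )
""")
con.execute("INSERT INTO entries VALUES ('e1', 'primary', 'Paper A')")
con.execute("INSERT INTO entries VALUES ('e1', 'perovskites', 'Paper A')")  # same id, new topic: OK
try:
    con.execute("INSERT INTO entries VALUES ('e1', 'primary', 'Paper A')")
except sqlite3.IntegrityError:
    print("duplicate (id, topic) rejected")
```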
Notes
- `feed_name` is the human-readable name from `config.yaml -> feeds.<key>.name` (e.g., "Nature Physics").
- `doi` is best-effort, fetched from the RSS feed; it can be found in `doi`, `dc:identifier`, `prism:doi`, `id`, `link`, `summary`, `summary_detail.value`, or embedded `content[].value`. arXiv feeds may not include DOIs; no external lookup is performed (by design for now).
- `abstract` is populated via the Crossref API or publisher APIs.
Configuration
- `config/config.yaml` (feeds, DB paths, defaults)
  - Each feed has a key and a display `name`; the key is used in topic files, the name is stored in DBs.
  - `paperqa`: settings for the arXiv downloader (Phase 1)
    - `download_rank_threshold`: minimum `rank_score` to download (default 0.35)
    - `rps`: requests/second throttle (default 0.3; ~1 request/3.3 s per arXiv API guidance)
    - `max_retries`: per-item retry attempts on transient errors
    - `prompt`: paper-qa question used for summarization; should instruct the model to return only JSON with keys `summary`, `methods` (supports the `{ranking_query}` placeholder)
- `config/topics/<topic>.yaml`
  - `feeds`: list of feed keys from `config.yaml`.
  - `filter.pattern` and `filter.fields`: regex and fields to match (defaults include `title` and `summary`).
  - `ranking`: optional `query`, `model`, cutoffs, etc. (for the rank command).
    - Optional: `negative_queries` (list), `preferred_authors` (list of names), `priority_author_boost` (float, e.g., 0.1).
  - `output.filename` and `output.filename_ranked`: HTML output; `archive: true` enables history DB writes.
  - `llm_summary`: topic-level controls for LLM summarization.
    - `enabled: true|false`
    - `prompt`: instruction given to the model. You can reference `{ranking_query}` and it will be replaced with the topic's `ranking.query`.
    - `score_cutoff`: minimum `rank_score` to consider (0.0–1.0)
    - `top_n`: hard cap on the number of entries considered (after filtering by score)
    - Works together with the global `config.llm` below.
- `config.llm` (global model settings)
  - `model`: preferred chat model id
  - `model_fallback`: secondary model if the primary is unsupported/unavailable
  - `api_key_env`: environment variable name to read if `openaikulcs.env` is missing
  - `rps`: default requests/second throttle for summarization
  - `max_retries`: retry attempts per item on transient errors
  - Optional GPT-5 parameters: `verbosity`, `reasoning_effort` (used when the model id starts with `gpt-5`)
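Putting the topic-level fields together, a minimal topic file might look like this (all feed keys and values here are illustrative, not shipped defaults):

```yaml
# config/topics/example_topic.yaml -- hypothetical topic showing the fields above
feeds: [nature_physics, cond_mat]        # keys from config.yaml (illustrative)
filter:
  pattern: "graphene|2D material"
  fields: [title, summary]
ranking:
  query: "electronic properties of two-dimensional materials"
  preferred_authors: ["A. Author"]
  priority_author_boost: 0.1
output:
  filename: "example_topic.html"
  filename_ranked: "example_topic_ranked.html"
  archive: true
llm_summary:
  enabled: true
  score_cutoff: 0.4
  top_n: 10
```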
Thank you to arXiv for use of its open access interoperability. The script does not serve PDFs or arXiv preprints; it serves a link to the filtered article.
Project details
File details
Details for the file `paper_firehose-0.1.0.tar.gz`.

File metadata
- Download URL: paper_firehose-0.1.0.tar.gz
- Size: 79.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `0049c7f5a736825baad52d79565ec27ef70c42dd0a4e62dc98db128de0d66050` |
| MD5 | `cca7b0635b5be0b7700abed0b7bae01d` |
| BLAKE2b-256 | `a2076729536012e11fd3221c2c7eb25a560939c7a2284ee8914d447f30d98eb1` |
File details
Details for the file `paper_firehose-0.1.0-py3-none-any.whl`.

File metadata
- Download URL: paper_firehose-0.1.0-py3-none-any.whl
- Size: 81.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `ab26bb2ff6c114f2150f5f790573b2214c309c60b1d614451fb790f4855a0f50` |
| MD5 | `2aa84af67e682b3aab111c72331a5da2` |
| BLAKE2b-256 | `4a1453f95ad2d97bbf144e2f9cfbf00927a1c65e57e42e416b02d27aeda05ab9` |