A CLI research agent for AI-related paper search, code discovery, PDF collection, and bilingual reports.

PaperPilot

PaperPilot is a CLI research agent for AI-related literature review. It turns a natural-language research request into a verified paper corpus, code/PDF collection, evidence-grounded synthesis, and bilingual reports in Markdown, HTML, and PDF.

It is designed as a file-system-based research workflow, not a chatbot. Each run creates a self-contained run folder with state, logs, intermediate artifacts, evidence checks, and final reports.

Highlights

  • Natural-language research intake with LLM-assisted query understanding.
  • Rich interactive CLI with startup model/source status, /model, /sources, and structured confirmation panels.
  • Layered Source Registry with arXiv, Semantic Scholar, OpenAlex, Crossref, OpenReview, PubMed, Europe PMC, bioRxiv, medRxiv, DBLP, ACL Anthology, and optional API-key sources.
  • Local corpus import with --user-corpus for PDF, BibTeX, RIS, Markdown, and text files.
  • Research protocol generation with inclusion/exclusion criteria and negative keywords.
  • Corpus normalization, DOI/arXiv/title-similarity deduplication, ranking, and relevance screening.
  • Code repository detection for GitHub, GitLab, Hugging Face, and project pages.
  • Open-access PDF download only; no paywall bypassing.
  • Full-text extraction for downloaded PDFs.
  • Prompt Registry, Tool Registry, Capability Registry, and event logging.
  • Evidence ledger that maps report-level claims to numbered paper citations.
  • Review-agent checks for source verification, relevance, citation compliance, and overclaiming risk.
  • Canonical bilingual report model with aligned Chinese/English Markdown, HTML, and PDF outputs.
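
The DOI/arXiv/title-similarity deduplication mentioned above can be pictured roughly as follows. This is a minimal sketch, not PaperPilot's actual implementation: the record keys (`doi`, `arxiv_id`, `title`), the function names, and the 0.92 similarity threshold are all assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher


def normalize_title(title: str) -> str:
    """Lowercase and collapse punctuation/whitespace so near-identical titles compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def is_duplicate(a: dict, b: dict, threshold: float = 0.92) -> bool:
    """Match on a shared DOI or arXiv ID first, then fall back to fuzzy title similarity."""
    # Hypothetical record keys; the real corpus schema may differ.
    for key in ("doi", "arxiv_id"):
        if a.get(key) and b.get(key) and a[key].lower() == b[key].lower():
            return True
    title_a = normalize_title(a.get("title", ""))
    title_b = normalize_title(b.get("title", ""))
    if not title_a or not title_b:
        return False  # never merge records on the basis of an empty title
    return SequenceMatcher(None, title_a, title_b).ratio() >= threshold


def deduplicate(papers: list) -> list:
    """Keep the first occurrence of each paper and drop later duplicates."""
    unique = []
    for paper in papers:
        if not any(is_duplicate(paper, kept) for kept in unique):
            unique.append(paper)
    return unique
```

Comparing every candidate against every kept record is quadratic, which is acceptable for a corpus of a few hundred papers; a larger corpus would likely index by identifier first.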

Installation

From PyPI:

python -m pip install paperpilot -i https://pypi.org/simple

For local development:

git clone https://github.com/CHB-learner/PaperPilot.git
cd PaperPilot
python -m pip install -e .

LLM Configuration

PaperPilot requires an OpenAI-compatible LLM configuration for query understanding, planning, screening, synthesis, and report generation.

Interactive setup:

PaperPilot

On first run, PaperPilot creates an editable template at ~/.paperpilot/config.json if the file does not already exist:

{
  "active": "default",
  "profiles": {
    "default": {
      "api_key": "",
      "base_url": "",
      "model": "gpt-5.2"
    }
  },
  "sources": {
    "core": {"enabled": null, "api_key": "", "base_url": ""},
    "lens": {"enabled": null, "api_key": "", "base_url": ""},
    "ieee": {"enabled": null, "api_key": "", "base_url": ""},
    "springer": {"enabled": null, "api_key": "", "base_url": ""},
    "elsevier": {"enabled": null, "api_key": "", "base_url": ""},
    "dimensions": {"enabled": null, "api_key": "", "base_url": ""}
  }
}

You can edit this file directly. Leave optional source keys empty if you do not have access. enabled: null means PaperPilot will enable that source automatically only after a key is configured.
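
The `enabled: null` rule can be expressed in a few lines. The sketch below only illustrates the behaviour described above; the function names are hypothetical, and it assumes an explicit `true` or `false` overrides the automatic rule, which the text does not state.

```python
def source_is_enabled(entry: dict) -> bool:
    """Decide whether one optional source from the "sources" section is active."""
    enabled = entry.get("enabled")
    if enabled is None:
        # null: enable automatically only once an API key has been configured.
        return bool(entry.get("api_key"))
    # Assumption: an explicit true/false takes precedence over the automatic rule.
    return bool(enabled)


def enabled_sources(config: dict) -> list:
    """Return the names of all optional sources that would be active."""
    sources = config.get("sources", {})
    return [name for name, entry in sources.items() if source_is_enabled(entry)]


example = {
    "sources": {
        "core": {"enabled": None, "api_key": "example-key", "base_url": ""},
        "lens": {"enabled": None, "api_key": "", "base_url": ""},
        "ieee": {"enabled": False, "api_key": "example-key", "base_url": ""},
    }
}
print(enabled_sources(example))
```

In this example only `core` would be reported as active: `lens` has no key, and `ieee` is switched off explicitly.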

Manual setup:

PaperPilot config set --base-url https://api.deepseek.com --model deepseek-chat
PaperPilot config import ./api.json
PaperPilot config list
PaperPilot config use deepseek
PaperPilot config show
PaperPilot --doctor

Optional source API keys:

PaperPilot sources list
PaperPilot sources config core
PaperPilot sources config lens
PaperPilot sources enable core
PaperPilot sources test core

Inside interactive mode, use /sources to view the same source/API status table without leaving the session.

Health checks:

PaperPilot --doctor

The doctor command checks the active LLM connection and any optional paper sources that have API keys configured. Interactive mode also runs a compact doctor check on startup; use /doctor inside the shell to run it again.

Where to get optional source API keys:

  • CORE: Request a key from the CORE API page.
  • Lens.org: Request Scholarly API access or manage tokens from the Lens API documentation.
  • IEEE Xplore: Register and request an application key via IEEE Xplore API Getting Started.
  • Springer Nature: Use the Springer Nature developer portal for API documentation and keys.
  • Elsevier / Scopus: Start from the Elsevier Developer Portal and the Scopus APIs getting started guide.
  • Dimensions: See Dimensions API access. Dimensions API access usually requires an institutional subscription or eligible research access.

Configuration is stored in:

~/.paperpilot/config.json

Configuration priority:

  1. Environment variables: OPENAI_API_KEY, OPENAI_BASE_URL, OPENAI_MODEL
  2. User config: ~/.paperpilot/config.json
  3. Legacy project file: llmapi.txt
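
To make the priority order concrete, here is a minimal sketch of how such a lookup could work. It is not PaperPilot's code: `resolve_llm_settings` is a hypothetical name, the per-field fallback is an assumption, and the legacy `llmapi.txt` step is left out because its format is not described here.

```python
import json
import os
from pathlib import Path

# Environment variable that overrides each profile field.
ENV_VARS = {
    "api_key": "OPENAI_API_KEY",
    "base_url": "OPENAI_BASE_URL",
    "model": "OPENAI_MODEL",
}


def resolve_llm_settings(config_path=Path.home() / ".paperpilot" / "config.json", env=None):
    """Return LLM settings, preferring environment variables over the user config."""
    env = os.environ if env is None else env
    profile = {}
    if config_path.exists():
        config = json.loads(config_path.read_text(encoding="utf-8"))
        active = config.get("active", "default")
        profile = config.get("profiles", {}).get(active, {})
    # Assumption: the fallback is applied field by field, not all-or-nothing.
    return {
        field: env.get(var) or profile.get(field, "")
        for field, var in ENV_VARS.items()
    }
```

Under these assumptions, exporting only `OPENAI_MODEL` would change the model while the API key and base URL still come from the active profile.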

Do not commit ~/.paperpilot/config.json, api.json, llmapi.txt, .env, or any file containing API keys.

Quick Start

Interactive mode:

PaperPilot

The interactive shell shows the active LLM profile, model API status, free-source coverage, optional API-key source coverage, and quick commands:

/model      manage LLM profiles
/sources    inspect enabled and optional search sources
/doctor     check LLM and configured source APIs
/help       show the startup guide again
exit        quit

Command mode:

PaperPilot "RNA inverse folding sequence design" \
  --auto-confirm \
  --max-papers 50 \
  --since-year 2021 \
  --github-filter required \
  --sources auto \
  --mode apa \
  --quality balanced

Use local papers as seed corpus:

PaperPilot "RNA inverse folding sequence design" \
  --auto-confirm \
  --user-corpus ./papers \
  --user-corpus references.bib

Skip PDF downloads:

PaperPilot "vision language model" --auto-confirm --no-download

Inspect or resume a task:

PaperPilot inspect runs/<task-id>
PaperPilot resume runs/<task-id>

Architecture

PaperPilot follows a state-machine workflow:

Intake -> Protocol -> Search -> Corpus -> Screening -> Verification -> Synthesis -> Review -> Report

flowchart LR
  U[User request<br/>topic + params + local corpus] --> C[Run context<br/>task/state/events]
  C --> P[Prompt Registry]
  P --> QA[Query Understanding Agent]
  QA --> PL[Planner Agent]
  PL --> RP[Research Protocol Agent]
  RP --> ST[Source Registry<br/>arXiv / S2 / OpenAlex / Crossref / OpenReview<br/>PubMed / Europe PMC / bioRxiv / medRxiv / DBLP / ACL]
  U --> LC[Local Corpus Import]
  LC --> CB[Corpus Builder]
  ST --> CB
  CB --> RJ[Relevance Judge<br/>core / adjacent / exclude]
  RJ --> VF[Verification + PDF Tools]
  VF --> LM[Literature Matrix]
  LM --> SA[Synthesis Agent]
  SA --> QG[Quality Gate + Reflection]
  QG --> EL[Evidence Ledger<br/>claim -> citation]
  EL --> RA[Review Agents<br/>source / citation / overclaiming]
  RA --> CR[Canonical Report]
  CR --> OUT[ZH/EN Markdown<br/>ZH/EN HTML<br/>ZH/EN PDF]
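
As a rough illustration of the linear stage order, the sketch below picks the next stage to run given the stages already finished, which is the kind of decision a resume step has to make. The stage names come from the workflow line above; `next_stage` and the idea of tracking completed stages as a set are assumptions, and the real layout of state.json is not shown in this document.

```python
from typing import Optional

# Stage order taken from the workflow description.
STAGES = [
    "Intake", "Protocol", "Search", "Corpus", "Screening",
    "Verification", "Synthesis", "Review", "Report",
]


def next_stage(completed: set) -> Optional[str]:
    """Return the first stage that has not finished, or None when the run is done."""
    for stage in STAGES:
        if stage not in completed:
            return stage
    return None


print(next_stage({"Intake", "Protocol", "Search"}))
```

Because the order is fixed, a run interrupted after Search would continue at Corpus without repeating the earlier stages.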

Default free sources include arXiv, Semantic Scholar, OpenAlex, Crossref, OpenReview, PubMed, Europe PMC, bioRxiv, medRxiv, DBLP, and ACL Anthology. Optional API-key sources include CORE, Lens.org, IEEE Xplore, Springer Nature, Elsevier/Scopus, and Dimensions.

The repository also includes an HTML architecture overview:

  • paperpilot_agent_flow.html

Output Artifacts

Each run writes a folder under runs/<task-id>/ unless --output-dir is provided.

Core run files:

  • task.json: task metadata and parameters.
  • state.json: stage status.
  • events.jsonl: stage event stream.
  • manifest.json: generated artifact list.
  • prompt_manifest.json: versioned prompt roles and required JSON keys.
  • registries.json: built-in ToolRegistry and CapabilityRegistry.
  • source_diagnostics.json: enabled sources, returned counts, and source-level errors.

Search and corpus files:

  • query_understanding.md: keyword interpretation and ambiguity analysis.
  • plan.json: search plan and diversified queries.
  • protocol.json: research question, scope, inclusion/exclusion criteria, negative keywords.
  • metadata.json: normalized raw search candidates.
  • user_corpus_log.json: local corpus import log.
  • corpus.json: screened full corpus.
  • core_papers.json: core papers.
  • adjacent_papers.json: adjacent papers.
  • excluded_papers.json: excluded papers and reasons.
  • ranked_papers.json: final report-view papers.

Evidence and quality files:

  • verification.json: DOI, URL, PDF, and code status.
  • download_log.json: PDF download status.
  • fulltext/: extracted PDF text.
  • paper_notes.json: full-text extraction metadata.
  • literature_matrix.json: method/task/evidence matrix.
  • synthesis.json: field overview, method taxonomy, paper summaries, trends, gaps.
  • quality_gate.json: pass/retry/needs-user-attention verdict.
  • reflection.json: search quality reflection and retry hints.
  • evidence_ledger.json: claim-level evidence ledger.
  • review_agent_findings.json: review-agent checks.

Final reports:

  • report.canonical.json: shared bilingual report model and citation map.
  • report.zh.md
  • report.en.md
  • report.zh.html
  • report.en.html
  • report.zh.pdf
  • report.en.pdf
  • pdfs/: downloaded open-access PDFs.

GitHub / Code Filter

PaperPilot "retrieval augmented generation" --auto-confirm --github-filter required

Filter modes:

  • any: keep all papers and annotate code availability.
  • required: final report view keeps papers with detected public code links; full screened corpus is still saved.
  • none: final report view keeps papers without detected public code links.
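
The three modes amount to a simple filter over the screened papers. The following is a sketch under assumptions, not the tool's implementation: `apply_github_filter` and the `code_urls` key are hypothetical names for illustration.

```python
def apply_github_filter(papers: list, mode: str = "any") -> list:
    """Select the report-view papers for a given --github-filter mode."""
    # Hypothetical key: "code_urls" holds any detected public code links.
    if mode == "any":
        return list(papers)  # keep everything; code availability is only annotated
    if mode == "required":
        return [p for p in papers if p.get("code_urls")]
    if mode == "none":
        return [p for p in papers if not p.get("code_urls")]
    raise ValueError(f"unknown filter mode: {mode!r}")


papers = [
    {"title": "Paper A", "code_urls": ["https://github.com/example/a"]},
    {"title": "Paper B", "code_urls": []},
]
print([p["title"] for p in apply_github_filter(papers, "required")])
```

The filter returns a new list and leaves its input untouched, consistent with the note that the full screened corpus is still saved.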

CLI Options

--max-papers INT                 maximum papers in final report view
--since-year INT                 prefer papers since this year
--github-filter any|required|none
--github-search-limit INT        active GitHub search limit
--no-download                    skip PDF downloads
--pdf-limit INT                  maximum PDFs to download
--user-corpus PATH               import local corpus path; repeatable
--mode quick|apa|systematic
--interaction auto|gated
--quality fast|balanced|strict
--include-adjacent               include adjacent papers in matrix/appendix
--sources auto|all|core|biomed|cs|configured
--enable-source SOURCE           enable one additional source; repeatable
--disable-source SOURCE          disable one source; repeatable

Development

Run tests:

python -m unittest discover -s tests
python -m compileall literature_agent

Build locally:

python -m pip install build twine
python -m build
python -m twine check dist/*

Publish to PyPI:

python -m twine upload dist/*

Open Source Notes

Before pushing to GitHub:

  • Make sure .gitignore is present.
  • Do not commit API keys, local run outputs, build artifacts, or virtual environments.
  • Add a LICENSE file before calling the project open source in a strict legal sense.
  • If any PyPI or LLM token was ever committed, revoke it immediately and create a new one.

Suggested first push:

git init
git add README.md README.zh-CN.md pyproject.toml literature_agent tests paperpilot_agent_flow.html .gitignore LICENSE
git commit -m "Initial open source release"
git branch -M main
git remote add origin https://github.com/CHB-learner/PaperPilot.git
git push -u origin main
