A multi-source pipeline for searching, screening, downloading, converting, and summarizing academic papers from arXiv, Google Scholar, Semantic Scholar, OpenAlex, DBLP, and HuggingFace daily papers.

Maintainers

grammy.jiang

Project description

research-pipeline

research-pipeline is a deterministic Python 3.12+ workflow for finding, screening, downloading, converting, and synthesizing academic papers. It is useful when you need an auditable literature review, not just a one-off paper search.

It ships as both a Typer CLI and an MCP server for agent-driven research.

What It Does
Installation
Quick Start
Pipeline
CLI Commands
Readable Reports
MCP Server
AI Skill And Agents
Configuration
Artifacts
Development

What It Does

Searches arXiv, Google Scholar, Semantic Scholar, OpenAlex, DBLP, and HuggingFace daily papers, with cross-source deduplication.
Screens candidates with BM25 heuristics, optional SPECTER2 semantic reranking, optional LLM judging, diversity-aware selection, and feedback-adjusted weights.
Downloads PDFs politely with rate limits, retry, caching, and manifest tracking.
Converts PDFs to Markdown through local or cloud backends: Docling, Marker, PyMuPDF4LLM, MinerU, Mathpix, Datalab, LlamaParse, Mistral OCR, or OpenAI Vision.
Supports two-tier conversion: fast rough conversion for all papers and high-quality fine conversion for selected papers.
Extracts structured chunks, bibliography data, citation contexts, and retrieval indexes from converted papers.
Produces schema-first per-paper extraction records, design-neutral cross-paper synthesis, confidence scoring, evidence aggregation, BibTeX exports, templated Markdown reports, and self-contained HTML reports.
Adds research quality layers: citation expansion, quality scoring, claim decomposition, knowledge graph ingestion, report validation, multi-run comparison, coherence checks, memory consolidation, blinding audits, Pass@k / Pass[k] metrics, case-based strategy reuse, KG quality checks, adaptive stopping, and 4-layer confidence calibration.
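The BM25 heuristic used for screening can be illustrated with a minimal, self-contained sketch. This is not the package's implementation: the function names `tokenize` and `bm25_scores`, the whitespace tokenizer, and the defaults `k1=1.5` and `b=0.75` are assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase whitespace tokenizer (a deliberate simplification)."""
    return text.lower().split()


def bm25_scores(
    query: str, docs: list[str], k1: float = 1.5, b: float = 0.75
) -> list[float]:
    """Score each document against the query with Okapi BM25."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs if n_docs else 0.0
    # Document frequency: number of documents that contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization penalizes documents longer than average.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Applied to candidate abstracts, higher scores indicate closer lexical overlap with the query; semantic reranking or LLM judging would then refine this first-pass ordering.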

Installation

# Base package
pip install research-pipeline

# Recommended local converter
pip install 'research-pipeline[docling]'

# Other local converters
pip install 'research-pipeline[marker]'       # high accuracy, GPL-3.0
pip install 'research-pipeline[pymupdf4llm]'  # fast CPU conversion, AGPL
pip install 'research-pipeline[mineru]'       # scientific PDF parser

# Search and reranking extras
pip install 'research-pipeline[scholar]'      # Google Scholar via scholarly
pip install 'research-pipeline[serpapi]'      # Google Scholar via SerpAPI
pip install 'research-pipeline[reranker]'     # sentence-transformers reranker

# Cloud conversion extras
pip install 'research-pipeline[datalab]'
pip install 'research-pipeline[llamaparse]'
pip install 'research-pipeline[mistral-ocr]'
pip install 'research-pipeline[openai-vision]'

# Development checkout
uv sync --extra dev --extra docling --extra scholar --extra reranker

Quick Start

# Fast abstract-only pass
research-pipeline run --profile quick "transformer architectures for time series"

# Full evidence-backed pipeline
research-pipeline run "local memory systems for AI agents"

# Deep profile with quality, expansion, claim analysis, and TER gap filling
research-pipeline run --profile deep "comprehensive survey of AI memory systems"

# Search every configured source family
research-pipeline run --source all "long-context retrieval augmented generation"

Run stages independently when you want control over review points:

research-pipeline plan "multimodal RAG for long-document QA"
research-pipeline search --run-id <RUN_ID> --source all
research-pipeline screen --run-id <RUN_ID> --diversity
research-pipeline quality --run-id <RUN_ID>
research-pipeline download --run-id <RUN_ID>
research-pipeline convert-rough --run-id <RUN_ID>
research-pipeline convert-fine --run-id <RUN_ID> --paper-ids "2401.12345"
research-pipeline extract --run-id <RUN_ID>
research-pipeline summarize --run-id <RUN_ID>
research-pipeline report --run-id <RUN_ID> --template structured_synthesis
research-pipeline validate --run-id <RUN_ID>
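The staged commands above can also be driven from a script. The following sketch only builds and runs the command lines already shown; the helper names `stage_commands` and `run_stages` are hypothetical, and it assumes you have already run `plan` and know the run ID.

```python
import subprocess


def stage_commands(run_id: str, fine_paper_id: str) -> list[list[str]]:
    """Build argv lists for the post-plan stages, in the order shown above."""
    base = ["research-pipeline"]
    rid = ["--run-id", run_id]
    return [
        base + ["search", *rid, "--source", "all"],
        base + ["screen", *rid, "--diversity"],
        base + ["quality", *rid],
        base + ["download", *rid],
        base + ["convert-rough", *rid],
        base + ["convert-fine", *rid, "--paper-ids", fine_paper_id],
        base + ["extract", *rid],
        base + ["summarize", *rid],
        base + ["report", *rid, "--template", "structured_synthesis"],
        base + ["validate", *rid],
    ]


def run_stages(run_id: str, fine_paper_id: str) -> None:
    """Run each stage in order and stop at the first non-zero exit code."""
    for argv in stage_commands(run_id, fine_paper_id):
        print("running:", " ".join(argv))
        subprocess.run(argv, check=True)
```

Because `check=True` raises on failure, a broken stage halts the script at a natural review point instead of feeding bad inputs to later stages.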

Pipeline

flowchart TD
    A["Plan queries"] --> B["Search sources"]
    B --> C["Screen candidates"]
    C --> D["Quality and expansion"]
    D --> E["Download PDFs"]
    E --> F["Convert to Markdown"]
    F --> G["Extract evidence"]
    G --> H["Summarize papers"]
    H --> I["Report, validate, export"]

Profiles:

Profile	Stages	Use Case
`quick`	plan, search, screen, summarize	Fast abstract-only scan
`standard`	plan through summarize	Default full pipeline
`deep`	standard plus quality, expand, claim analysis, TER loop	Comprehensive literature review
`auto`	selected by query complexity	Mixed workloads

Search sources:

Source	Notes
`arxiv`	Polite arXiv API client with cache and rate limits
`scholar`	Google Scholar through `scholarly` or SerpAPI
`semantic_scholar`	Broad metadata, citations, and abstracts
`openalex`	Open bibliographic metadata
`dblp`	Computer science bibliography
`huggingface`	Recent HuggingFace daily papers
`all`	arXiv, Scholar, Semantic Scholar, OpenAlex, DBLP, HuggingFace
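When several of these sources return the same paper, the results need to be merged. A minimal sketch of one way cross-source deduplication could work is shown here; the record fields `doi`, `arxiv_id`, and `title` and the key priority are assumptions for illustration, not the package's actual schema or algorithm.

```python
import re


def dedup_key(record: dict) -> str:
    """Prefer stable identifiers, then fall back to a normalized title."""
    if record.get("doi"):
        return "doi:" + record["doi"].lower()
    if record.get("arxiv_id"):
        # Strip a trailing version suffix such as "v2".
        return "arxiv:" + re.sub(r"v\d+$", "", record["arxiv_id"])
    title = re.sub(r"[^a-z0-9]+", " ", record.get("title", "").lower()).strip()
    return "title:" + title


def deduplicate(records: list[dict]) -> list[dict]:
    """Keep the first record seen for each key, preserving input order."""
    seen: set[str] = set()
    unique = []
    for record in records:
        key = dedup_key(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique
```

This simple version does not link a record that has only a DOI to one that has only an arXiv ID; a fuller approach would merge identifiers across records before comparing.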

CLI Commands

Group	Commands
Core pipeline	`plan`, `search`, `screen`, `download`, `convert`, `extract`, `summarize`, `run`, `inspect`
Search expansion and organization	`quality`, `expand`, `cluster`, `enrich`, `watch`
Conversion and export	`convert-file`, `convert-rough`, `convert-fine`, `export-bibtex`, `export-html`, `report`
Analysis and validation	`analyze`, `analyze-claims`, `score-claims`, `confidence-layers`, `aggregate`, `validate`, `compare`, `evaluate`
Feedback and memory	`feedback`, `index`, `coherence`, `consolidate`, `memory-stats`, `memory-episodes`, `memory-search`
Knowledge graph	`kg-ingest`, `kg-stats`, `kg-query`, `kg-quality`, `cite-context`
Reliability checks	`blinding-audit`, `dual-metrics`, `adaptive-stopping`, `cbr-lookup`, `cbr-retain`
Setup	`setup` installs the bundled skill and paper-analysis agents

Useful examples:

# Citation graph expansion
research-pipeline expand --run-id <RUN_ID> --paper-ids "2401.12345" \
  --direction both --bfs-depth 2 --bfs-query "memory,agents"

# Evidence-only aggregation
research-pipeline aggregate --run-id <RUN_ID> --min-pointers 1

# Multi-run comparison and coherence
research-pipeline compare --run-a <RUN_A> --run-b <RUN_B>
research-pipeline coherence <RUN_A> <RUN_B> <RUN_C>

# Evaluation metrics (Deep Research Report gap closures)
# Unified Horizon Metric (A3-5): single scalar combining quality, difficulty,
# horizon length, stability, and Pass[k] reliability.
research-pipeline horizon --score 0.8 --achieved 40 --target 50 \
  --difficulty 0.6 --entropy-trend -0.1 --reliability 0.9

# Recall / Reasoning / Presentation diagnostic (Theme 16): localize the
# bottleneck axis of a synthesis report.
research-pipeline rrp --report report.md --shortlist shortlist.json

# Knowledge graph
research-pipeline kg-ingest --run-id <RUN_ID>
research-pipeline kg-stats
research-pipeline kg-query 2401.12345
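The citation graph expansion example above (`--direction both --bfs-depth 2`) suggests a depth-limited breadth-first traversal. The sketch below shows that idea on an in-memory graph. The function name `expand_bfs`, the `references` and `cited_by` mappings, and the `"backward"` and `"forward"` direction names are assumptions; only `both` appears in the examples, and the `--bfs-query` filter is not modeled here.

```python
from collections import deque


def expand_bfs(
    seeds: list[str],
    references: dict[str, list[str]],
    cited_by: dict[str, list[str]],
    max_depth: int,
    direction: str = "both",
) -> dict[str, int]:
    """Return each reachable paper ID mapped to its hop distance from a seed."""
    distance = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        paper = queue.popleft()
        if distance[paper] >= max_depth:
            continue  # Do not expand beyond the depth limit.
        neighbors: list[str] = []
        if direction in ("backward", "both"):
            neighbors += references.get(paper, [])
        if direction in ("forward", "both"):
            neighbors += cited_by.get(paper, [])
        for neighbor in neighbors:
            if neighbor not in distance:
                distance[neighbor] = distance[paper] + 1
                queue.append(neighbor)
    return distance
```

Tracking the hop distance per paper makes it easy to rank or filter expanded candidates by how far they sit from the seed papers.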

Readable Reports

The pipeline can produce machine-readable synthesis JSON and human-readable Markdown or HTML reports. For human-facing reports, prefer:

clear headings and a contents section with internal links;
Mermaid diagrams for process charts, usually vertical flowchart TD charts;
LaTeX for formulas, using $...$ inline and $$...$$ for display equations;
tables for comparisons and coverage matrices;
paper links that jump to references or evidence-map entries;
recommendations linked back to findings, gaps, and evidence.

# Render Markdown from structured synthesis JSON
research-pipeline report --run-id <RUN_ID> --template structured_synthesis

# Export self-contained HTML
research-pipeline export-html --run-id <RUN_ID>

# Validate report completeness and readability signals
research-pipeline validate --run-id <RUN_ID>

MCP Server

Run the MCP server with:

research-pipeline mcp serve
# or, from a development checkout
uv run research-pipeline mcp serve

Current MCP surface:

42 tools covering pipeline stages, conversion, quality, expansion, validation, reporting, memory, KG, reliability, and the server-driven research_workflow.
15 resources for run manifests, plans, candidates, shortlists, PDFs, Markdown, summaries, synthesis, quality scores, config, index, workflow state, telemetry, and budget.
6 prompts for topic planning, workflow orchestration, paper analysis, comparison, search refinement, and quality assessment.

The research_workflow tool adds harness engineering: telemetry, bounded context, governance gates, structural verification, doom-loop monitoring, and crash recovery.

AI Skill And Agents

Install the bundled skill for Claude Code / GitHub Copilot and Codex, plus Claude Code sub-agent definitions:

research-pipeline setup              # skills + agents + MCP config snippet
research-pipeline setup --symlink    # symlink for development
research-pipeline setup --force      # overwrite existing files
research-pipeline setup --skip-agents
research-pipeline setup --skip-skill
research-pipeline setup --skip-mcp

Installed files:

Claude/GitHub Copilot skill: ~/.claude/skills/research-pipeline/
Codex skill: ~/.codex/skills/research-pipeline/
Agents: ~/.claude/agents/paper-screener.md, ~/.claude/agents/paper-analyzer.md, ~/.claude/agents/paper-synthesizer.md
MCP config snippet: ~/.config/research-pipeline/mcp.json

The skill follows Anthropic's Skill-Building Guide: it declares explicit trigger phrases and negative triggers, license and compatibility frontmatter, concrete examples that map user prompts to actions, and progressive disclosure into references/. Behaviorally, every run:

Resumes on top of any prior same-topic report in the working directory — the prior file is snapshot-renamed, prior paper IDs seed the new run, and the new report fully replaces the old one.
Iterates up to 4 gap-closure rounds — each round extracts the report's academic and engineering gaps, fills them (new pipeline iteration or implementation knowledge), and regenerates the report from scratch. Stops early when the gap list empties, a search returns no new papers, or the user marks gaps out-of-scope.
Enforces human-report formatting: ## Contents, ## Round History, Mermaid for every chart, LaTeX for every formula, and per-section evidence citations validated by research-pipeline validate.
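The gap-closure behavior described above is a bounded loop with early exits. The sketch below shows that control flow only. Every callable passed in is a hypothetical stand-in, and the order in which the stop conditions are checked is an assumption; the skill itself may sequence them differently.

```python
from typing import Callable


def run_gap_closure(
    extract_gaps: Callable[[], list[str]],
    fill_gaps: Callable[[list[str]], int],
    regenerate_report: Callable[[], None],
    out_of_scope: Callable[[list[str]], bool] = lambda gaps: False,
    max_rounds: int = 4,
) -> tuple[int, str]:
    """Run up to max_rounds rounds; return (rounds completed, stop reason)."""
    for completed in range(max_rounds):
        gaps = extract_gaps()
        if not gaps:
            return completed, "no gaps left"
        if out_of_scope(gaps):
            return completed, "gaps marked out of scope"
        # fill_gaps returns how many new papers the follow-up search found.
        new_papers = fill_gaps(gaps)
        if new_papers == 0:
            return completed, "search returned no new papers"
        regenerate_report()
    return max_rounds, "round limit reached"
```

Returning the stop reason alongside the round count gives a report's Round History section something concrete to record.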

Configuration

Start from the example config:

cp config.example.toml config.toml

High-impact settings:

profile = "standard"          # quick, standard, deep, auto
workspace = "runs"

[sources]
enabled = ["arxiv"]           # or include scholar, semantic_scholar, openalex, dblp, huggingface
scholar_backend = "scholarly" # or "serpapi"

[screen]
diversity = false
use_semantic_reranking = false

[conversion]
backend = "docling"
fallback_backends = []

[llm]
enabled = false               # enables LLM screening/summarization when configured
provider = "ollama"           # ollama or openai-compatible

[gates]
enabled = false
auto_approve = true

Environment overrides:

Variable	Purpose
`RESEARCH_PIPELINE_CONFIG`	Config file path
`RESEARCH_PIPELINE_CACHE_DIR`	Override cache directory
`RESEARCH_PIPELINE_WORKSPACE`	Override workspace directory
`RESEARCH_PIPELINE_DISABLE_LLM`	Force LLM features off
`RESEARCH_PIPELINE_LLM_PROFILE`	Select LLM profile
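The table above implies that environment variables take precedence over the config file. A minimal sketch of that precedence follows. The helper names are hypothetical, the `"runs"` fallback is taken from the example config, and the set of values treated as "not set" for `RESEARCH_PIPELINE_DISABLE_LLM` is an assumption, since the accepted values are not documented here.

```python
import os
from collections.abc import Mapping


def resolve_workspace(config: Mapping, env: Mapping[str, str] = os.environ) -> str:
    """Return the workspace path, letting the environment win over the config."""
    return env.get("RESEARCH_PIPELINE_WORKSPACE") or config.get("workspace", "runs")


def llm_enabled(config: Mapping, env: Mapping[str, str] = os.environ) -> bool:
    """Return False whenever the disable flag is set, whatever the config says."""
    flag = env.get("RESEARCH_PIPELINE_DISABLE_LLM", "").strip().lower()
    # Assumption: empty, "0", "false", and "no" mean the flag is not set.
    if flag not in ("", "0", "false", "no"):
        return False
    return bool(config.get("llm", {}).get("enabled", False))
```

Passing `env` explicitly keeps the helpers easy to test without mutating the real process environment.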

Artifacts

Each run writes auditable outputs under runs/<run_id>/:

runs/<run_id>/
├── plan/query_plan.json
├── search/candidates.jsonl
├── screen/shortlist.json
├── download/pdf/*.pdf
├── convert/markdown/*.md
├── convert_rough/markdown/*.md
├── convert_fine/markdown/*.md
├── extract/*.extract.json
├── extract/*.bibliography.json
├── summarize/extractions/*.extraction.json
├── summarize/extractions/*.extraction.md
├── summarize/extractions/extraction_quality.json
├── summarize/*.summary.json
├── summarize/synthesis_report.json
├── summarize/synthesis_report.md
├── summarize/synthesis_traceability.json
├── summarize/synthesis_quality.json
├── summarize/synthesis.json
├── summarize/synthesis_confidence.json
├── quality/quality_scores.jsonl
├── expand/expanded_candidates.jsonl
├── analysis/
├── comparison/
└── logs/

The runs/ and workspace/ directories are generated outputs and are not tracked by git.
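Because the artifacts are plain JSON, JSONL, Markdown, and PDF files, a run can be audited with a few lines of standard-library Python. The sketch below counts records and files in a run directory using the paths from the tree above. The helper names are hypothetical, and no assumption is made about the fields inside each record.

```python
import json
from pathlib import Path


def read_jsonl(path: Path) -> list[dict]:
    """Parse one JSON object per non-empty line."""
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def count_files(directory: Path, pattern: str) -> int:
    """Count matching files, treating a missing directory as empty."""
    return len(list(directory.glob(pattern))) if directory.is_dir() else 0


def run_overview(run_dir: Path) -> dict[str, int]:
    """Summarize how far a run progressed, based on the files it produced."""
    candidates = run_dir / "search" / "candidates.jsonl"
    return {
        "candidates": len(read_jsonl(candidates)) if candidates.exists() else 0,
        "pdfs": count_files(run_dir / "download" / "pdf", "*.pdf"),
        "summaries": count_files(run_dir / "summarize", "*.summary.json"),
    }
```

For example, `run_overview(Path("runs") / "<run_id>")` with a real run ID substituted gives a quick view of where papers dropped out between search, download, and summarization.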

Development

uv sync --extra dev --extra docling --extra scholar --extra reranker
uv run pytest tests/unit/ -xvs
uv run ruff format .
uv run ruff check . --fix
uv run mypy src/
uv run pre-commit run --all-files

See docs/architecture.md for architecture details and docs/user-guide.md for the full user guide.

License

MIT

Release history

0.28.0

Jun 7, 2026

0.27.0

Jun 7, 2026

0.26.0

Jun 6, 2026

0.25.0

Jun 6, 2026

0.24.0

Jun 6, 2026

0.23.0

Jun 6, 2026

0.22.0

Jun 6, 2026

0.21.0

Jun 6, 2026

0.20.0

Jun 6, 2026

0.19.5

Jun 5, 2026

0.19.4

Jun 5, 2026

0.19.3

Jun 5, 2026

0.19.2

Jun 5, 2026

0.19.1

Jun 4, 2026

0.19.0

Jun 4, 2026

0.18.5

Jun 4, 2026

0.18.4

Jun 4, 2026

0.18.3

Jun 4, 2026

0.18.2

Jun 3, 2026

0.18.1

Jun 3, 2026

0.18.0

Jun 2, 2026

0.17.36

Jun 1, 2026

0.17.35

May 17, 2026

0.17.34

May 14, 2026

0.17.29

May 14, 2026

0.17.24

May 14, 2026

0.17.19

May 14, 2026

0.17.18

May 13, 2026

This version

0.17.17

May 13, 2026

0.17.16

May 13, 2026

0.17.15

May 13, 2026

0.17.14

May 13, 2026

0.17.13

May 12, 2026

0.17.12

May 12, 2026

0.17.11

May 12, 2026

0.17.10

May 12, 2026

0.17.9

May 12, 2026

0.17.8

May 12, 2026

0.17.7

May 12, 2026

0.17.6

May 12, 2026

0.17.5

May 12, 2026

0.17.4

May 12, 2026

0.17.3

May 11, 2026

0.17.2

May 11, 2026

0.17.1

May 11, 2026

0.17.0

Apr 27, 2026

0.16.2

Apr 27, 2026

0.16.1

Apr 21, 2026

0.16.0

Apr 20, 2026

0.15.1

Apr 20, 2026

0.15.0

Apr 20, 2026

0.14.4

Apr 19, 2026

0.14.3

Apr 18, 2026

0.14.2

Apr 18, 2026

0.14.1

Apr 18, 2026

0.14.0

Apr 18, 2026

0.13.52

Apr 18, 2026

0.13.51

Apr 17, 2026

0.13.50

Apr 17, 2026

0.13.49

Apr 17, 2026

0.13.48

Apr 17, 2026

0.13.47

Apr 17, 2026

0.13.46

Apr 17, 2026

0.13.45

Apr 17, 2026

0.13.44

Apr 17, 2026

0.13.43

Apr 17, 2026

0.13.42

Apr 17, 2026

0.13.41

Apr 17, 2026

0.13.40

Apr 17, 2026

0.13.39

Apr 17, 2026

0.13.38

Apr 17, 2026

0.13.37

Apr 17, 2026

0.13.36

Apr 17, 2026

0.13.35

Apr 17, 2026

0.13.34

Apr 17, 2026

0.13.33

Apr 17, 2026

0.13.32

Apr 17, 2026

0.13.31

Apr 17, 2026

0.13.30

Apr 17, 2026

0.13.29

Apr 17, 2026

0.13.28

Apr 17, 2026

0.13.27

Apr 17, 2026

0.13.26

Apr 17, 2026

0.13.25

Apr 17, 2026

0.13.24

Apr 17, 2026

0.13.23

Apr 17, 2026

0.13.22

Apr 17, 2026

0.13.21

Apr 17, 2026

0.13.20

Apr 17, 2026

0.13.19

Apr 17, 2026

0.13.18

Apr 17, 2026

0.13.17

Apr 17, 2026

0.13.16

Apr 17, 2026

0.13.15

Apr 17, 2026

0.13.14

Apr 17, 2026

0.13.13

Apr 17, 2026

0.13.12

Apr 17, 2026

0.13.11

Apr 16, 2026

0.13.10

Apr 16, 2026

0.13.9

Apr 16, 2026

0.13.8

Apr 16, 2026

0.13.7

Apr 16, 2026

0.13.6

Apr 16, 2026

0.13.5

Apr 16, 2026

0.13.4

Apr 16, 2026

0.13.3

Apr 16, 2026

0.13.2

Apr 16, 2026

0.13.1

Apr 16, 2026

0.13.0

Apr 15, 2026

0.12.14

Apr 15, 2026

0.12.13

Apr 15, 2026

0.12.12

Apr 15, 2026

0.12.11

Apr 15, 2026

0.12.10

Apr 15, 2026

0.12.9

Apr 15, 2026

0.12.8

Apr 15, 2026

0.12.7

Apr 14, 2026

0.12.6

Apr 14, 2026

0.12.5

Apr 14, 2026

0.12.4

Apr 14, 2026

0.12.3

Apr 14, 2026

0.12.2

Apr 14, 2026

0.12.1

Apr 14, 2026

0.12.0

Apr 14, 2026

0.11.0

Apr 14, 2026

0.10.0

Apr 14, 2026

0.9.0

Apr 14, 2026

0.8.1

Apr 14, 2026

0.8.0

Apr 14, 2026

0.7.1

Apr 14, 2026

0.7.0

Apr 14, 2026

0.6.0

Apr 14, 2026

0.5.0

Apr 13, 2026

0.4.0

Apr 8, 2026

0.3.0

Apr 5, 2026

0.2.0

Apr 5, 2026

0.1.0

Apr 5, 2026

Download files

Download the file for your platform.

Source Distribution

research_pipeline-0.17.17.tar.gz (599.0 kB)

Uploaded May 13, 2026 Source

Built Distribution

research_pipeline-0.17.17-py3-none-any.whl (793.5 kB)

Uploaded May 13, 2026 Python 3

File details

Details for the file research_pipeline-0.17.17.tar.gz.

File metadata

Download URL: research_pipeline-0.17.17.tar.gz
Upload date: May 13, 2026
Size: 599.0 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for research_pipeline-0.17.17.tar.gz
Algorithm	Hash digest
SHA256	`73be8dd2b15ffea493f3b4cf40b8e72571fd142a276eef3b975561b798091f08`
MD5	`804a3b447eaa43417ff41f48a9ae1b5a`
BLAKE2b-256	`bb7d2aeec55295e6999cd6466161e050e492a0936fffa85a5a23d445a9ed131a`
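A downloaded file can be checked against the SHA256 digest listed above with the standard-library `hashlib` module. This is a minimal sketch; the helper names `sha256_of` and `verify` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large archives do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path, expected_hex: str) -> bool:
    """Compare the computed digest with the published one, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For the source distribution, pass the path of the downloaded `research_pipeline-0.17.17.tar.gz` and the SHA256 value from the table; a `False` result means the file should not be installed.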

Provenance

The following attestation bundles were made for research_pipeline-0.17.17.tar.gz:

Publisher: publish.yml on grammy-jiang/research-pipeline

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: research_pipeline-0.17.17.tar.gz
- Subject digest: 73be8dd2b15ffea493f3b4cf40b8e72571fd142a276eef3b975561b798091f08
- Sigstore transparency entry: 1526610832
- Sigstore integration time: May 13, 2026
Source repository:
- Permalink: grammy-jiang/research-pipeline@e7937e1f10c22eddbffbd8e01c5bdf5f0f1404e0
- Branch / Tag: refs/tags/v0.17.17
- Owner: https://github.com/grammy-jiang
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@e7937e1f10c22eddbffbd8e01c5bdf5f0f1404e0
- Trigger Event: release

File details

Details for the file research_pipeline-0.17.17-py3-none-any.whl.

File metadata

Download URL: research_pipeline-0.17.17-py3-none-any.whl
Upload date: May 13, 2026
Size: 793.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for research_pipeline-0.17.17-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2b8c9a89c26f3fa321472575f789483a7de2e79041e43f79cdc36f65912e6ffb`
MD5	`7af19f58ff4d74fd3667cc2626b13d9b`
BLAKE2b-256	`c177002fe39f71fdbe12155c04e0a788d75b9caa981c842e3550cea5af2c3e91`

Provenance

The following attestation bundles were made for research_pipeline-0.17.17-py3-none-any.whl:

Publisher: publish.yml on grammy-jiang/research-pipeline

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: research_pipeline-0.17.17-py3-none-any.whl
- Subject digest: 2b8c9a89c26f3fa321472575f789483a7de2e79041e43f79cdc36f65912e6ffb
- Sigstore transparency entry: 1526610929
- Sigstore integration time: May 13, 2026
Source repository:
- Permalink: grammy-jiang/research-pipeline@e7937e1f10c22eddbffbd8e01c5bdf5f0f1404e0
- Branch / Tag: refs/tags/v0.17.17
- Owner: https://github.com/grammy-jiang
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@e7937e1f10c22eddbffbd8e01c5bdf5f0f1404e0
- Trigger Event: release
