
A multi-source pipeline for searching, screening, downloading, converting, and summarizing academic papers from arXiv, Google Scholar, and more.


research-pipeline

Python 3.12+ | License: MIT

A production-grade, deterministic Python pipeline for searching, screening, downloading, converting, and summarizing academic papers from arXiv, Google Scholar, Semantic Scholar, OpenAlex, and DBLP.

Features

  • Multi-stage pipeline: plan → search → screen → download → convert → extract → summarize
  • 5 new auxiliary commands: expand (citation graph), quality (evaluation scoring), convert-rough / convert-fine (two-tier conversion), index (incremental runs)
  • Modular CLI with independent, composable stage commands
  • MCP server for AI agent integration (16 tools, 12 resources, 5 prompts, completions, progress reporting)
  • Multi-source search: arXiv + Google Scholar + Semantic Scholar + OpenAlex + DBLP
  • Cross-source enrichment — fill missing abstracts via DOI lookup
  • Semantic re-ranking — optional SPECTER2 embeddings for similarity scoring
  • Citation graph expansion — discover related papers via Semantic Scholar citations
  • Quality evaluation — composite scoring: citation impact, venue reputation, author h-index, recency
  • Multi-backend PDF conversion: 3 local (Docling, Marker, PyMuPDF4LLM) + 5 cloud (Mathpix, Datalab, LlamaParse, Mistral OCR, OpenAI Vision)
  • Two-tier conversion — fast convert-rough for all papers, high-quality convert-fine for selected ones
  • Multi-account rotation — rotate between accounts per service on quota exhaustion
  • Cross-service fallback — automatic failover to next backend when all accounts are exhausted
  • Incremental runs — SQLite global index deduplicates papers across runs
  • Retry & error recovery: @retry decorator with exponential backoff, jitter, and Retry-After support
  • Idempotent & resumable — every stage can be re-run safely
  • arXiv polite-mode — strict rate limiting, single connection, caching
  • Deterministic tool chain with optional LLM judgment
  • Full artifact lineage — every run is reproducible and auditable via manifests
  • Offline-first testing — no live API calls in CI
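
The retry behaviour listed above (exponential backoff, jitter, Retry-After) can be sketched roughly as follows. This is an illustrative sketch, not the package's own `@retry` implementation: `RetryableError`, its `retry_after` attribute, and every parameter name are assumptions made for the example.

```python
import functools
import random
import time


class RetryableError(Exception):
    """Hypothetical error type carrying an optional Retry-After hint (seconds)."""

    def __init__(self, message, retry_after=None):
        super().__init__(message)
        self.retry_after = retry_after


def retry(max_attempts=4, base_delay=0.5, max_delay=30.0, sleep=time.sleep):
    """Retry a function on RetryableError with exponential backoff and jitter."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except RetryableError as exc:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the last error
                    # Exponential backoff: base, 2*base, 4*base, ... capped at max_delay.
                    backoff = min(max_delay, base_delay * 2 ** (attempt - 1))
                    # Full jitter spreads out clients that failed at the same moment.
                    delay = random.uniform(0, backoff)
                    # A server-provided Retry-After value is treated as a lower bound.
                    if exc.retry_after is not None:
                        delay = max(delay, exc.retry_after)
                    sleep(delay)

        return wrapper

    return decorator
```

Passing `sleep` as a parameter keeps the sketch testable offline: a test can substitute a recorder for `time.sleep` and check the waits without actually pausing.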

Installation

# From PyPI
pip install research-pipeline

# With local PDF conversion backends
pip install research-pipeline[docling]       # MIT license, great tables/equations
pip install research-pipeline[marker]        # Highest accuracy (95.7%), GPL-3.0
pip install research-pipeline[pymupdf4llm]   # Fastest (10-50x), AGPL

# With cloud PDF conversion backends (require API keys)
pip install research-pipeline[mathpix]       # Best LaTeX, 1K free pages/mo
pip install research-pipeline[datalab]       # Hosted Marker, $5 free credit
pip install research-pipeline[llamaparse]    # 1K free pages/day
pip install research-pipeline[mistral-ocr]   # Mistral OCR, free credits
pip install research-pipeline[openai-vision] # GPT-4o vision

# With Google Scholar support
pip install research-pipeline[scholar]

# With all local conversion backends plus Google Scholar support
pip install research-pipeline[docling,marker,pymupdf4llm,scholar]

Development install

# With uv (recommended)
uv sync --extra dev --extra docling --extra scholar

Quick start

# Full end-to-end pipeline
research-pipeline run "transformer architectures for time series forecasting"

# Or run stages individually
research-pipeline plan "transformer architectures for time series forecasting"
research-pipeline search --run-id <RUN_ID>
research-pipeline screen --run-id <RUN_ID>
research-pipeline download --run-id <RUN_ID>
research-pipeline convert --run-id <RUN_ID>
research-pipeline extract --run-id <RUN_ID>
research-pipeline summarize --run-id <RUN_ID>

# Inspect run status
research-pipeline inspect --run-id <RUN_ID>

# Standalone PDF conversion (no workspace required)
research-pipeline convert-file paper.pdf -o paper.md

# Use a specific conversion backend
research-pipeline convert --run-id <RUN_ID> --backend marker
research-pipeline convert-file paper.pdf --backend pymupdf4llm

# Two-tier conversion: rough (fast) then fine (high-quality)
research-pipeline convert-rough --run-id <RUN_ID>
research-pipeline convert-fine --run-id <RUN_ID>

# Evaluate paper quality (citation impact, venue, author)
research-pipeline quality --run-id <RUN_ID>

# Expand via citation graph (Semantic Scholar)
research-pipeline expand --run-id <RUN_ID> --direction both

# Manage global paper index (incremental dedup)
research-pipeline index --list
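
To illustrate how a SQLite index can deduplicate papers across runs, here is a minimal sketch using the standard-library `sqlite3` module. The table name, columns, and the `open_index` / `filter_new` helpers are hypothetical; the package's actual schema is not documented here.

```python
import sqlite3


def open_index(path=":memory:"):
    """Open (or create) a paper index. The schema is a hypothetical example."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS papers ("
        " paper_id TEXT PRIMARY KEY,"
        " title TEXT NOT NULL,"
        " first_run_id TEXT NOT NULL)"
    )
    return conn


def filter_new(conn, run_id, candidates):
    """Return only the candidates not seen in earlier runs, and record them."""
    fresh = []
    for paper_id, title in candidates:
        # INSERT OR IGNORE does nothing when the primary key already exists.
        cur = conn.execute(
            "INSERT OR IGNORE INTO papers (paper_id, title, first_run_id)"
            " VALUES (?, ?, ?)",
            (paper_id, title, run_id),
        )
        if cur.rowcount == 1:  # a row was actually inserted
            fresh.append((paper_id, title))
    conn.commit()
    return fresh


if __name__ == "__main__":
    conn = open_index()
    filter_new(conn, "run-1", [("paper-a", "A"), ("paper-b", "B")])
    # Only paper-c is new in the second run.
    print(filter_new(conn, "run-2", [("paper-b", "B"), ("paper-c", "C")]))
```

Using the paper identifier as the primary key leaves the uniqueness check to the database, so a re-run of the same stage stays idempotent.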

Commands

Command        Purpose
plan           Normalize topic → structured query plan
search         Execute multi-source search (arXiv, Scholar, Semantic Scholar, OpenAlex, DBLP)
screen         Two-stage relevance filtering (BM25 + optional SPECTER2 + optional LLM)
download       Download shortlisted PDFs with rate limiting and retry
convert        PDF → Markdown (8 backends, multi-account rotation, cross-service fallback)
convert-rough  Fast Tier 2 conversion (pymupdf4llm) for all downloaded PDFs
convert-fine   High-quality Tier 3 conversion for selected papers
extract        Structured content extraction & chunking
summarize      Per-paper summaries + cross-paper synthesis
expand         Citation graph expansion via Semantic Scholar API
quality        Composite quality evaluation (citations, venue, author, recency)
run            End-to-end orchestration of all stages
inspect        View run manifests and artifacts
convert-file   Standalone PDF → Markdown conversion
index          Manage the global paper index for incremental runs
install-skill  Install the Claude/Copilot skill to ~/.claude/skills/
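
The multi-account rotation and cross-service fallback mentioned for `convert` can be pictured as two nested loops. The sketch below is an assumption-laden illustration, not the package's code: `QuotaExhausted`, `convert_with_fallback`, and the `(name, accounts, convert_fn)` tuple shape are all invented for the example.

```python
class QuotaExhausted(Exception):
    """Hypothetical signal that one account has no remaining quota."""


def convert_with_fallback(pdf_path, backends):
    """Try each backend in order; within a backend, rotate through its accounts.

    `backends` is a list of (name, accounts, convert_fn) tuples, where
    convert_fn(pdf_path, account) returns Markdown or raises QuotaExhausted.
    Returns (backend_name, markdown) from the first successful conversion.
    """
    for name, accounts, convert_fn in backends:
        for account in accounts:
            try:
                return name, convert_fn(pdf_path, account)
            except QuotaExhausted:
                continue  # rotate to the next account of this service
        # Every account of this service is exhausted: fall through to the next one.
    raise RuntimeError(f"no backend could convert {pdf_path}")
```

Listing a local backend last (with a single placeholder account such as `None`) gives the chain a final option that has no quota to exhaust.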

MCP server

The MCP server provides full Model Context Protocol support for AI agent integration:

# Run via module
uv run python -m mcp_server

16 tools — all pipeline stages plus auxiliary commands:

plan_topic, search, screen_candidates, download_pdfs, convert_pdfs, extract_content, summarize_papers, run_pipeline, get_run_manifest, convert_file, list_backends, expand_citations, evaluate_quality, convert_rough, convert_fine, manage_index

12 resources — read pipeline artifacts via URI templates:

runs://list, runs://{run_id}/manifest, runs://{run_id}/plan, runs://{run_id}/candidates, runs://{run_id}/shortlist, runs://{run_id}/papers/{paper_id}, runs://{run_id}/markdown/{paper_id}, runs://{run_id}/summary/{paper_id}, runs://{run_id}/synthesis, runs://{run_id}/quality, config://current, index://papers

5 prompts — research workflow templates:

research_topic, analyze_paper, compare_papers, refine_search, quality_assessment

Plus: tool annotations, auto-completions, and progress reporting.

AI skill

Install the bundled Claude Code / GitHub Copilot skill:

# Copy skill files to ~/.claude/skills/research-pipeline/
research-pipeline install-skill

# Or create a symlink (for development)
research-pipeline install-skill --symlink

# Force overwrite existing
research-pipeline install-skill --force

Configuration

Copy config.example.toml to config.toml and adjust settings:

cp config.example.toml config.toml

Key environment variables:

Variable                          Purpose
ARXIV_PAPER_PIPELINE_CONFIG       Config file path
ARXIV_PAPER_PIPELINE_CACHE_DIR    Override cache directory
ARXIV_PAPER_PIPELINE_WORKSPACE    Override workspace directory
ARXIV_PAPER_PIPELINE_DISABLE_LLM  Force LLM off
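
As a rough picture of how these four variables could be resolved, the sketch below reads them into a small settings object. The variable names come from the table above; the `EnvSettings` class, the `config.toml` fallback, and the set of values treated as "on" for `ARXIV_PAPER_PIPELINE_DISABLE_LLM` are assumptions, and the package may interpret them differently.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping, Optional

PREFIX = "ARXIV_PAPER_PIPELINE_"
TRUTHY = {"1", "true", "yes", "on"}  # assumption: accepted spellings of "on"


@dataclass(frozen=True)
class EnvSettings:
    config_path: Path
    cache_dir: Optional[Path]
    workspace: Optional[Path]
    llm_disabled: bool


def settings_from_env(environ: Optional[Mapping[str, str]] = None) -> EnvSettings:
    """Resolve the four documented variables; unset overrides stay None."""
    env = os.environ if environ is None else environ

    def optional_path(name: str) -> Optional[Path]:
        value = env.get(PREFIX + name)
        return Path(value).expanduser() if value else None

    return EnvSettings(
        # Assumption: fall back to ./config.toml when no path is given.
        config_path=optional_path("CONFIG") or Path("config.toml"),
        cache_dir=optional_path("CACHE_DIR"),
        workspace=optional_path("WORKSPACE"),
        llm_disabled=env.get(PREFIX + "DISABLE_LLM", "").strip().lower() in TRUTHY,
    )
```

Accepting the environment as a mapping argument makes the resolution logic testable without mutating `os.environ`.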

Artifact layout

Each pipeline run produces outputs in runs/<run_id>/:

runs/<run_id>/
├── run_config.json            # Configuration snapshot
├── run_manifest.json          # Execution metadata & stage records
├── plan/query_plan.json       # Normalized query plan
├── search/
│   ├── raw/*.xml              # Raw API response pages
│   └── candidates.jsonl       # Deduplicated candidates
├── screen/
│   ├── cheap_scores.jsonl     # Heuristic scores
│   └── shortlist.json         # Papers selected for download
├── download/
│   ├── pdf/*.pdf              # Downloaded papers
│   └── download_manifest.jsonl
├── convert/
│   ├── markdown/*.md          # Converted Markdown
│   └── convert_manifest.jsonl
├── convert_rough/             # Tier 2: fast conversion (all PDFs)
│   ├── markdown/*.md
│   └── convert_manifest.jsonl
├── convert_fine/              # Tier 3: high-quality conversion (selected)
│   ├── markdown/*.md
│   └── convert_manifest.jsonl
├── quality/                   # Quality evaluation scores
│   └── quality_scores.jsonl
├── expand/                    # Citation graph expansion
│   └── expanded_candidates.jsonl
├── extract/*.extract.json     # Chunked & indexed extraction
├── summarize/
│   ├── *.summary.json         # Per-paper summaries
│   ├── synthesis.json         # Cross-paper synthesis
│   └── synthesis.md           # Human-readable synthesis
└── logs/pipeline.jsonl        # Structured logs
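
Because several artifacts are JSON Lines files, a run directory can be inspected with a few lines of standard-library Python. The sketch below only counts records per file; the file paths are taken from the layout above, while `read_jsonl` and `run_overview` are hypothetical helper names and no assumption is made about the fields inside each record.

```python
import json
from pathlib import Path

# JSONL artifacts named in the layout above, relative to runs/<run_id>/.
JSONL_ARTIFACTS = [
    "search/candidates.jsonl",
    "screen/cheap_scores.jsonl",
    "download/download_manifest.jsonl",
    "quality/quality_scores.jsonl",
    "expand/expanded_candidates.jsonl",
]


def read_jsonl(path):
    """Yield one parsed object per non-empty line of a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def run_overview(run_dir):
    """Count records in each JSONL artifact that exists under a run directory."""
    run_dir = Path(run_dir)
    return {
        name: sum(1 for _ in read_jsonl(run_dir / name))
        for name in JSONL_ARTIFACTS
        if (run_dir / name).exists()
    }
```

Stages that have not run yet simply do not appear in the result, which makes the overview usable on partially completed runs.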

Development

# Install dev dependencies
uv sync --extra dev

# Run unit tests
uv run pytest tests/unit/ -xvs

# Format, lint, type check
uv run isort . && uv run black . && uv run ruff check . --fix
uv run mypy src/

# Run all pre-commit hooks
uv run pre-commit run --all-files

See docs/architecture.md for detailed architecture documentation and docs/user-guide.md for the full user guide.

License

MIT


Download files


Source Distribution

research_pipeline-0.5.0.tar.gz (91.8 kB)

Uploaded Source

Built Distribution


research_pipeline-0.5.0-py3-none-any.whl (149.2 kB)

Uploaded Python 3

File details

Details for the file research_pipeline-0.5.0.tar.gz.

File metadata

  • Download URL: research_pipeline-0.5.0.tar.gz
  • Size: 91.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.10.12

File hashes

Hashes for research_pipeline-0.5.0.tar.gz
Algorithm    Hash digest
SHA256       ffc2391ac420663ff6884baca6cc6e4905cdf0b7465f7952b2b10e8a768ac50c
MD5          c57a40015c2300ec066606e11cbf3282
BLAKE2b-256  5c22bb60674f3638951e1089e01aad05a7d9e9d17cdad6f215965fec67644977

File details

Details for the file research_pipeline-0.5.0-py3-none-any.whl.

File metadata

  • Download URL: research_pipeline-0.5.0-py3-none-any.whl
  • Size: 149.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.10.12

File hashes

Hashes for research_pipeline-0.5.0-py3-none-any.whl
Algorithm    Hash digest
SHA256       efb529f73088be4c124af72cb5dc17de5421dd3beeff26f052f5a388d4e38ae0
MD5          70a313701d339cf736efa3609795b906
BLAKE2b-256  6b16cc2a507f0fcdd41a090a6392d66abe9cb1b735bd9be05173723bb50af013
