A multi-source pipeline for searching, screening, downloading, converting, and summarizing academic papers from arXiv, Google Scholar, and more.

research-pipeline

Requires Python 3.12+. License: MIT.

A production-grade, deterministic Python pipeline for searching, screening, downloading, converting, and summarizing academic papers from arXiv and Google Scholar.

Features

  • 7-stage pipeline: plan → search → screen → download → convert → extract → summarize
  • Modular CLI with independent, composable stage commands
  • MCP server for AI agent integration (10 tools via stdio transport)
  • Multi-source search: arXiv API + Google Scholar (free & SerpAPI)
  • Idempotent & resumable — every stage can be re-run safely
  • arXiv polite-mode — strict rate limiting, single connection, caching
  • Deterministic tool chain with optional LLM judgment
  • Full artifact lineage — every run is reproducible and auditable via manifests
  • Offline-first testing — no live API calls in CI
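The "polite-mode" bullet describes strict rate limiting over a single connection. A minimal sketch of that idea follows, assuming a fixed minimum interval between requests. The class name `PoliteLimiter`, the 3-second default, and the injectable `clock` and `sleep` parameters are illustrative assumptions, not the project's actual implementation.

```python
import time
from typing import Callable


class PoliteLimiter:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(
        self,
        min_interval: float = 3.0,  # assumed delay; not taken from the project
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: float | None = None  # time the previous request was allowed

    def wait(self) -> float:
        """Block until the next request is allowed; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            # Sleep only for the part of the interval that has not elapsed yet.
            delay = max(0.0, self.min_interval - (now - self._last))
            if delay > 0:
                self._sleep(delay)
        self._last = now + delay
        return delay
```

Calling `wait()` before each request from a single worker keeps requests serialized and spaced out. Injecting the clock and sleep functions lets the limiter be tested without real delays, which fits the offline-first testing goal.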

Installation

# From PyPI
pip install research-pipeline

# With PDF conversion support (Docling)
pip install research-pipeline[docling]

# With Google Scholar support
pip install research-pipeline[scholar]

# With all extras
pip install research-pipeline[docling,scholar]

Development install

# With uv (recommended)
uv sync --extra dev --extra docling --extra scholar

Quick start

# Full end-to-end pipeline
research-pipeline run "transformer architectures for time series forecasting"

# Or run stages individually
research-pipeline plan "transformer architectures for time series forecasting"
research-pipeline search --run-id <RUN_ID>
research-pipeline screen --run-id <RUN_ID>
research-pipeline download --run-id <RUN_ID>
research-pipeline convert --run-id <RUN_ID>
research-pipeline extract --run-id <RUN_ID>
research-pipeline summarize --run-id <RUN_ID>

# Inspect run status
research-pipeline inspect --run-id <RUN_ID>

# Standalone PDF conversion (no workspace required)
research-pipeline convert-file paper.pdf -o paper.md
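The individual stage commands can also be driven from a script. The sketch below builds the command lines shown above and runs them in order. It assumes the CLI exits with a non-zero status on failure and that the run ID from the `plan` step is already known; neither detail is confirmed here, and the helper names are hypothetical.

```python
import subprocess

# Stages that follow `plan`, in the order listed in the quick start.
STAGES = ["search", "screen", "download", "convert", "extract", "summarize"]


def stage_commands(run_id: str) -> list[list[str]]:
    """Build one argv list per stage that follows `plan`."""
    return [["research-pipeline", stage, "--run-id", run_id] for stage in STAGES]


def run_stages(run_id: str) -> None:
    """Run the stages in order, stopping at the first failure."""
    for argv in stage_commands(run_id):
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(argv, check=True)
```

Because every stage is described as idempotent and resumable, re-running `run_stages` after a failure should be safe.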

Commands

Command        Purpose
plan           Normalize topic → structured query plan
search         Execute multi-source search (arXiv + Scholar)
screen         Two-stage relevance filtering (BM25 + optional LLM)
download       Download shortlisted PDFs with rate limiting
convert        PDF → Markdown via Docling
extract        Structured content extraction & chunking
summarize      Per-paper summaries + cross-paper synthesis
run            End-to-end orchestration of all stages
inspect        View run manifests and artifacts
convert-file   Standalone PDF → Markdown conversion
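The `screen` stage uses BM25 as its first, inexpensive relevance filter. As an illustration of what such a scorer computes, here is a self-contained Okapi BM25 sketch. The whitespace tokenizer and the `k1` and `b` defaults are assumptions chosen for the example and may differ from what the pipeline uses.

```python
import math
from collections import Counter


def bm25_scores(
    query: str, docs: list[str], k1: float = 1.5, b: float = 0.75
) -> list[float]:
    """Score each document against the query with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / max(n_docs, 1)
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in set(query.lower().split()):
            if term not in tf:
                continue
            # Smoothed IDF that stays positive for very common terms.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization penalizes longer-than-average documents.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Scoring titles or abstracts this way gives a ranking that needs no model calls, so an optional LLM pass only has to look at the top candidates.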

MCP server

The MCP server exposes all pipeline stages as tools for AI agent integration:

# Run via module
uv run python -m mcp_server

# Available tools: plan_topic, search, screen_candidates, download_pdfs,
# convert_pdfs, extract_content, summarize_papers, run_pipeline,
# get_run_manifest, convert_file

Configuration

Copy config.example.toml to config.toml and adjust settings:

cp config.example.toml config.toml

Key environment variables:

Variable                           Purpose
ARXIV_PAPER_PIPELINE_CONFIG        Config file path
ARXIV_PAPER_PIPELINE_CACHE_DIR     Override cache directory
ARXIV_PAPER_PIPELINE_WORKSPACE     Override workspace directory
ARXIV_PAPER_PIPELINE_DISABLE_LLM   Force LLM off

Artifact layout

Each pipeline run produces outputs in runs/<run_id>/:

runs/<run_id>/
├── run_config.json            # Configuration snapshot
├── run_manifest.json          # Execution metadata & stage records
├── plan/query_plan.json       # Normalized query plan
├── search/
│   ├── raw/*.xml              # Raw API response pages
│   └── candidates.jsonl       # Deduplicated candidates
├── screen/
│   ├── cheap_scores.jsonl     # Heuristic scores
│   └── shortlist.json         # Papers selected for download
├── download/
│   ├── pdf/*.pdf              # Downloaded papers
│   └── download_manifest.jsonl
├── convert/
│   ├── markdown/*.md          # Converted Markdown
│   └── convert_manifest.jsonl
├── extract/*.extract.json     # Chunked & indexed extraction
├── summarize/
│   ├── *.summary.json         # Per-paper summaries
│   ├── synthesis.json         # Cross-paper synthesis
│   └── synthesis.md           # Human-readable synthesis
└── logs/pipeline.jsonl        # Structured logs

Development

# Install dev dependencies
uv sync --extra dev

# Run unit tests
uv run pytest tests/unit/ -xvs

# Format, lint, type check
uv run isort . && uv run black . && uv run ruff check . --fix
uv run mypy src/

# Run all pre-commit hooks
uv run pre-commit run --all-files

See docs/architecture.md for detailed architecture documentation and docs/user-guide.md for the full user guide.

License

MIT
