research-pipeline
A production-grade, deterministic Python pipeline for searching, screening, downloading, converting, and summarizing academic papers from arXiv and Google Scholar.
Features
- 7-stage pipeline: plan → search → screen → download → convert → extract → summarize
- Modular CLI with independent, composable stage commands
- MCP server for AI agent integration (10 tools via stdio transport)
- Multi-source search: arXiv API + Google Scholar (free & SerpAPI)
- Idempotent & resumable — every stage can be re-run safely
- arXiv polite-mode — strict rate limiting, single connection, caching
- Deterministic tool chain with optional LLM judgment
- Full artifact lineage — every run is reproducible and auditable via manifests
- Offline-first testing — no live API calls in CI
Installation
# From PyPI
pip install research-pipeline
# With PDF conversion support (Docling) — quote extras so zsh doesn't glob the brackets
pip install "research-pipeline[docling]"
# With Google Scholar support
pip install "research-pipeline[scholar]"
# With all extras
pip install "research-pipeline[docling,scholar]"
Development install
# With uv (recommended)
uv sync --extra dev --extra docling --extra scholar
Quick start
# Full end-to-end pipeline
research-pipeline run "transformer architectures for time series forecasting"
# Or run stages individually
research-pipeline plan "transformer architectures for time series forecasting"
research-pipeline search --run-id <RUN_ID>
research-pipeline screen --run-id <RUN_ID>
research-pipeline download --run-id <RUN_ID>
research-pipeline convert --run-id <RUN_ID>
research-pipeline extract --run-id <RUN_ID>
research-pipeline summarize --run-id <RUN_ID>
# Inspect run status
research-pipeline inspect --run-id <RUN_ID>
# Standalone PDF conversion (no workspace required)
research-pipeline convert-file paper.pdf -o paper.md
Commands
| Command | Purpose |
|---|---|
| `plan` | Normalize topic → structured query plan |
| `search` | Execute multi-source search (arXiv + Scholar) |
| `screen` | Two-stage relevance filtering (BM25 + optional LLM) |
| `download` | Download shortlisted PDFs with rate limiting |
| `convert` | PDF → Markdown via Docling |
| `extract` | Structured content extraction & chunking |
| `summarize` | Per-paper summaries + cross-paper synthesis |
| `run` | End-to-end orchestration of all stages |
| `inspect` | View run manifests and artifacts |
| `convert-file` | Standalone PDF → Markdown conversion |
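The `screen` stage's cheap pass ranks candidates with BM25 before any optional LLM is consulted. As a rough sketch of how BM25 scoring works (this is generic Okapi BM25, not the project's actual implementation; the tokenization and parameters here are assumptions):

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query with Okapi BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N  # average document length
    df = Counter()                          # document frequency per term
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)                     # term frequency in this document
        s = 0.0
        for t in query:
            if t not in tf:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores
```

Candidates whose title/abstract tokens score above a threshold would then move on to the (optional) LLM judgment pass.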
MCP server
The MCP server exposes all pipeline stages as tools for AI agent integration:
# Run via module
uv run python -m mcp_server
# Available tools: plan_topic, search, screen_candidates, download_pdfs,
# convert_pdfs, extract_content, summarize_papers, run_pipeline,
# get_run_manifest, convert_file
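Since the server speaks stdio, MCP clients typically launch it as a subprocess. A hedged sketch of what a client configuration entry might look like (the server name and key layout follow common MCP client conventions and are illustrative, not prescribed by this project):

```json
{
  "mcpServers": {
    "research-pipeline": {
      "command": "uv",
      "args": ["run", "python", "-m", "mcp_server"]
    }
  }
}
```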
Configuration
Copy config.example.toml to config.toml and adjust settings:
cp config.example.toml config.toml
Key environment variables:
| Variable | Purpose |
|---|---|
| `ARXIV_PAPER_PIPELINE_CONFIG` | Config file path |
| `ARXIV_PAPER_PIPELINE_CACHE_DIR` | Override cache directory |
| `ARXIV_PAPER_PIPELINE_WORKSPACE` | Override workspace directory |
| `ARXIV_PAPER_PIPELINE_DISABLE_LLM` | Force LLM off |
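These variables are read from the process environment, so they can be set per-invocation or from a wrapper script. A minimal sketch (the values shown are illustrative, not defaults):

```python
import os

# Illustrative only: values here are assumptions, not documented defaults.
os.environ["ARXIV_PAPER_PIPELINE_CONFIG"] = "./config.toml"  # point at a local config
os.environ["ARXIV_PAPER_PIPELINE_DISABLE_LLM"] = "1"         # force deterministic-only mode
```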
Artifact layout
Each pipeline run produces outputs in runs/<run_id>/:
runs/<run_id>/
├── run_config.json # Configuration snapshot
├── run_manifest.json # Execution metadata & stage records
├── plan/query_plan.json # Normalized query plan
├── search/
│ ├── raw/*.xml # Raw API response pages
│ └── candidates.jsonl # Deduplicated candidates
├── screen/
│ ├── cheap_scores.jsonl # Heuristic scores
│ └── shortlist.json # Papers selected for download
├── download/
│ ├── pdf/*.pdf # Downloaded papers
│ └── download_manifest.jsonl
├── convert/
│ ├── markdown/*.md # Converted Markdown
│ └── convert_manifest.jsonl
├── extract/*.extract.json # Chunked & indexed extraction
├── summarize/
│ ├── *.summary.json # Per-paper summaries
│ ├── synthesis.json # Cross-paper synthesis
│ └── synthesis.md # Human-readable synthesis
└── logs/pipeline.jsonl # Structured logs
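Because every stage writes plain JSON/JSONL, downstream tooling can consume artifacts directly. A minimal sketch of reading a run's deduplicated candidates (the field names in the sample data are hypothetical; consult the actual files for the real schema):

```python
import json
from pathlib import Path

def load_candidates(run_dir: Path) -> list[dict]:
    """Read deduplicated search candidates from a run directory, one JSON object per line."""
    path = run_dir / "search" / "candidates.jsonl"
    return [json.loads(line) for line in path.read_text().splitlines() if line.strip()]
```

The same pattern applies to `download_manifest.jsonl`, `convert_manifest.jsonl`, and `logs/pipeline.jsonl`.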
Development
# Install dev dependencies
uv sync --extra dev
# Run unit tests
uv run pytest tests/unit/ -xvs
# Format, lint, type check
uv run isort . && uv run black . && uv run ruff check . --fix
uv run mypy src/
# Run all pre-commit hooks
uv run pre-commit run --all-files
See docs/architecture.md for detailed architecture documentation and docs/user-guide.md for the full user guide.
License
MIT
File details
Details for the file research_pipeline-0.1.0.tar.gz.
File metadata
- Download URL: research_pipeline-0.1.0.tar.gz
- Upload date:
- Size: 45.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.10.12 (Fedora Linux 43)
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `b3afb818d723cddc9dd8de470a4508a310514a6a4603259ddd258e9f24be4e65` |
| MD5 | `3dcbab463016eef10ca6f1c2e0e33bac` |
| BLAKE2b-256 | `39588de7500fcfbad8bc49b03698bf4185e95292baee058e44d4db860499dfad` |
File details
Details for the file research_pipeline-0.1.0-py3-none-any.whl.
File metadata
- Download URL: research_pipeline-0.1.0-py3-none-any.whl
- Upload date:
- Size: 75.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.10.12 (Fedora Linux 43)
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `a885f6f541bee952450fb16df4fc8ee2132fd14114a1a2b95dd94b5ccb759d02` |
| MD5 | `fb7d01bffeb5802ff08e761a60eb9d79` |
| BLAKE2b-256 | `6c1ca3afd44792faed8f6f10af45e6f188dcc9d38e8fc9a24b6ac66d4f6f74f9` |