A multi-source pipeline for searching, screening, downloading, converting, and summarizing academic papers from arXiv, Google Scholar, and more.

research-pipeline

A production-grade, deterministic Python pipeline for searching, screening, downloading, converting, and summarizing academic papers from arXiv and Google Scholar.

Features

7-stage pipeline: plan → search → screen → download → convert → extract → summarize
Modular CLI with independent, composable stage commands
MCP server for AI agent integration (12 tools via stdio transport)
Multi-source search: arXiv API + Google Scholar (free & SerpAPI)
Multi-backend PDF conversion: Docling (MIT), Marker (highest accuracy), PyMuPDF4LLM (fastest)
Idempotent & resumable — every stage can be re-run safely
arXiv polite-mode — strict rate limiting, single connection, caching
Deterministic tool chain with optional LLM judgment
Full artifact lineage — every run is reproducible and auditable via manifests
Offline-first testing — no live API calls in CI
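The polite-mode bullet above can be illustrated with a minimal sketch of a single-connection rate limiter. This is not the package's implementation: the class name `PoliteLimiter` is hypothetical, and the 3-second default interval is an assumption based on arXiv's general API guidance rather than on this project's configuration. The clock and sleep functions are injectable so the logic can be exercised offline.

```python
import time
from typing import Callable, Optional


class PoliteLimiter:
    """Enforce a minimum delay between consecutive requests (hypothetical helper)."""

    def __init__(
        self,
        min_interval: float = 3.0,  # assumed default, not taken from the package
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # release time of the previous request

    def wait(self) -> float:
        """Block until the next request is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
        if delay > 0:
            self._sleep(delay)
        # Record the time at which this request is actually released.
        self._last = now + delay
        return delay
```

Calling `limiter.wait()` immediately before each HTTP request would keep a single client under the chosen interval; combining it with a response cache avoids repeating requests entirely.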

Installation

# From PyPI
pip install research-pipeline

# With PDF conversion backends (quotes stop shells such as zsh from globbing the brackets)
pip install "research-pipeline[docling]"       # MIT license, great tables/equations
pip install "research-pipeline[marker]"        # Highest accuracy (95.7%), GPL-3.0
pip install "research-pipeline[pymupdf4llm]"   # Fastest (10-50x), AGPL

# With Google Scholar support
pip install "research-pipeline[scholar]"

# With all extras
pip install "research-pipeline[docling,marker,pymupdf4llm,scholar]"

Development install

# With uv (recommended)
uv sync --extra dev --extra docling --extra scholar

Quick start

# Full end-to-end pipeline
research-pipeline run "transformer architectures for time series forecasting"

# Or run stages individually
research-pipeline plan "transformer architectures for time series forecasting"
research-pipeline search --run-id <RUN_ID>
research-pipeline screen --run-id <RUN_ID>
research-pipeline download --run-id <RUN_ID>
research-pipeline convert --run-id <RUN_ID>
research-pipeline extract --run-id <RUN_ID>
research-pipeline summarize --run-id <RUN_ID>

# Inspect run status
research-pipeline inspect --run-id <RUN_ID>

# Standalone PDF conversion (no workspace required)
research-pipeline convert-file paper.pdf -o paper.md

# Use a specific conversion backend
research-pipeline convert --run-id <RUN_ID> --backend marker
research-pipeline convert-file paper.pdf --backend pymupdf4llm
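Because each stage is a separate command that takes `--run-id`, the per-stage invocations above can be scripted. The following is a rough sketch, assuming the `research-pipeline` executable is on `PATH`; the helper names `stage_commands` and `run_stages` are hypothetical and not part of the package.

```python
import subprocess
from typing import List

# Stage order as listed in the feature overview; `plan` is excluded because
# it takes a topic rather than a run ID.
STAGES = ["search", "screen", "download", "convert", "extract", "summarize"]


def stage_commands(run_id: str, start_at: str = "search") -> List[List[str]]:
    """Build the CLI invocations for every stage from `start_at` onward."""
    if start_at not in STAGES:
        raise ValueError(f"unknown stage: {start_at}")
    remaining = STAGES[STAGES.index(start_at):]
    return [["research-pipeline", stage, "--run-id", run_id] for stage in remaining]


def run_stages(run_id: str, start_at: str = "search") -> None:
    """Run the stages in order, stopping at the first failure."""
    for cmd in stage_commands(run_id, start_at):
        subprocess.run(cmd, check=True)
```

For example, `run_stages("<RUN_ID>", start_at="convert")` would resume a run from the conversion stage, which relies on the stages being safe to re-run as the feature list states.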

Commands

Command	Purpose
`plan`	Normalize topic → structured query plan
`search`	Execute multi-source search (arXiv + Scholar)
`screen`	Two-stage relevance filtering (BM25 + optional LLM)
`download`	Download shortlisted PDFs with rate limiting
`convert`	PDF → Markdown (docling, marker, or pymupdf4llm)
`extract`	Structured content extraction & chunking
`summarize`	Per-paper summaries + cross-paper synthesis
`run`	End-to-end orchestration of all stages
`inspect`	View run manifests and artifacts
`convert-file`	Standalone PDF → Markdown conversion
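The `screen` row names BM25 as the cheap first filtering stage. As an illustration of what such a score looks like, here is a textbook Okapi BM25 sketch over pre-tokenized abstracts. It is a generic formulation with the common defaults `k1=1.5` and `b=0.75`, not the package's scoring code, and the function name `bm25_scores` is hypothetical.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs)
    if n_docs == 0:
        return []
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in set(query):
            if tf[term] == 0:
                continue
            # Smoothed IDF variant that stays non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Papers whose score falls below a chosen threshold can be dropped before any optional LLM judgment is spent on them.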

MCP server

The MCP server exposes all pipeline stages as tools for AI agent integration:

# Run via module
uv run python -m mcp_server

# Tools include: plan_topic, search, screen_candidates, download_pdfs,
# convert_pdfs, extract_content, summarize_papers, run_pipeline,
# get_run_manifest, convert_file, list_backends
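To show what an agent sends over the stdio transport, the sketch below builds an MCP `tools/call` request as a single JSON-RPC 2.0 line. It only constructs the message: a real client must complete the MCP `initialize` handshake first, and the argument name `topic` is a hypothetical placeholder, since the actual input schema is whatever the server reports via `tools/list`.

```python
import json
from typing import Any, Dict


def tool_call_message(request_id: int, tool: str, arguments: Dict[str, Any]) -> str:
    """Serialize an MCP `tools/call` request as one newline-terminated JSON line."""
    payload = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    # The stdio transport frames messages as single lines, so the JSON
    # must not contain embedded newlines; json.dumps escapes them.
    return json.dumps(payload) + "\n"


# `topic` is a hypothetical argument name used only for illustration.
line = tool_call_message(1, "plan_topic", {"topic": "transformer architectures"})
```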

Configuration

Copy config.example.toml to config.toml and adjust settings:

cp config.example.toml config.toml

Key environment variables:

Variable	Purpose
`ARXIV_PAPER_PIPELINE_CONFIG`	Config file path
`ARXIV_PAPER_PIPELINE_CACHE_DIR`	Override cache directory
`ARXIV_PAPER_PIPELINE_WORKSPACE`	Override workspace directory
`ARXIV_PAPER_PIPELINE_DISABLE_LLM`	Force LLM off
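One way these variables might be resolved is sketched below. The fallback to `./config.toml`, the accepted truthy spellings, and the function name `resolve_settings` are all assumptions made for illustration; the package's own loader may apply different rules or precedence.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

PREFIX = "ARXIV_PAPER_PIPELINE_"
TRUTHY = {"1", "true", "yes", "on"}  # assumed spellings for "force LLM off"


def resolve_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect the documented overrides from an environment mapping."""
    env = os.environ if env is None else env
    cache = env.get(PREFIX + "CACHE_DIR")
    workspace = env.get(PREFIX + "WORKSPACE")
    return {
        # Fall back to ./config.toml when no explicit path is given (assumption).
        "config_path": Path(env.get(PREFIX + "CONFIG", "config.toml")),
        "cache_dir": Path(cache) if cache else None,
        "workspace": Path(workspace) if workspace else None,
        "llm_disabled": env.get(PREFIX + "DISABLE_LLM", "").strip().lower() in TRUTHY,
    }
```

Passing a plain dictionary instead of `os.environ` keeps the resolution logic testable without touching the real environment.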

Artifact layout

Each pipeline run produces outputs in runs/<run_id>/:

runs/<run_id>/
├── run_config.json            # Configuration snapshot
├── run_manifest.json          # Execution metadata & stage records
├── plan/query_plan.json       # Normalized query plan
├── search/
│   ├── raw/*.xml              # Raw API response pages
│   └── candidates.jsonl       # Deduplicated candidates
├── screen/
│   ├── cheap_scores.jsonl     # Heuristic scores
│   └── shortlist.json         # Papers selected for download
├── download/
│   ├── pdf/*.pdf              # Downloaded papers
│   └── download_manifest.jsonl
├── convert/
│   ├── markdown/*.md          # Converted Markdown
│   └── convert_manifest.jsonl
├── extract/*.extract.json     # Chunked & indexed extraction
├── summarize/
│   ├── *.summary.json         # Per-paper summaries
│   ├── synthesis.json         # Cross-paper synthesis
│   └── synthesis.md           # Human-readable synthesis
└── logs/pipeline.jsonl        # Structured logs
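Given this layout, a script can tell which stages have already produced output and read the JSON Lines artifacts. The sketch below uses one marker file per stage, taken from the tree above (the `extract` stage is omitted because its outputs are per-paper files); the helper names are hypothetical, and no assumption is made about the fields inside each record.

```python
import json
from pathlib import Path
from typing import Dict, Iterator

# One marker artifact per stage, taken from the layout above.
STAGE_MARKERS = {
    "plan": "plan/query_plan.json",
    "search": "search/candidates.jsonl",
    "screen": "screen/shortlist.json",
    "download": "download/download_manifest.jsonl",
    "convert": "convert/convert_manifest.jsonl",
    "summarize": "summarize/synthesis.json",
}


def stage_status(run_dir: Path) -> Dict[str, bool]:
    """Report which stages have left their marker artifact on disk."""
    return {stage: (run_dir / rel).is_file() for stage, rel in STAGE_MARKERS.items()}


def read_jsonl(path: Path) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSON Lines file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)
```

For instance, `stage_status(Path("runs") / run_id)` gives a quick file-based view of progress; `research-pipeline inspect` remains the supported way to view manifests.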

Development

# Install dev dependencies
uv sync --extra dev

# Run unit tests
uv run pytest tests/unit/ -xvs

# Format, lint, type check
uv run isort . && uv run black . && uv run ruff check . --fix
uv run mypy src/

# Run all pre-commit hooks
uv run pre-commit run --all-files

See docs/architecture.md for detailed architecture documentation and docs/user-guide.md for the full user guide.

License

MIT

Release history

0.14.3

Apr 18, 2026

0.14.2

Apr 18, 2026

0.14.1

Apr 18, 2026

0.14.0

Apr 18, 2026

0.13.52

Apr 18, 2026

0.13.51

Apr 17, 2026

0.13.50

Apr 17, 2026

0.13.49

Apr 17, 2026

0.13.48

Apr 17, 2026

0.13.47

Apr 17, 2026

0.13.46

Apr 17, 2026

0.13.45

Apr 17, 2026

0.13.44

Apr 17, 2026

0.13.43

Apr 17, 2026

0.13.42

Apr 17, 2026

0.13.41

Apr 17, 2026

0.13.40

Apr 17, 2026

0.13.39

Apr 17, 2026

0.13.38

Apr 17, 2026

0.13.37

Apr 17, 2026

0.13.36

Apr 17, 2026

0.13.35

Apr 17, 2026

0.13.34

Apr 17, 2026

0.13.33

Apr 17, 2026

0.13.32

Apr 17, 2026

0.13.31

Apr 17, 2026

0.13.30

Apr 17, 2026

0.13.29

Apr 17, 2026

0.13.28

Apr 17, 2026

0.13.27

Apr 17, 2026

0.13.26

Apr 17, 2026

0.13.25

Apr 17, 2026

0.13.24

Apr 17, 2026

0.13.23

Apr 17, 2026

0.13.22

Apr 17, 2026

0.13.21

Apr 17, 2026

0.13.20

Apr 17, 2026

0.13.19

Apr 17, 2026

0.13.18

Apr 17, 2026

0.13.17

Apr 17, 2026

0.13.16

Apr 17, 2026

0.13.15

Apr 17, 2026

0.13.14

Apr 17, 2026

0.13.13

Apr 17, 2026

0.13.12

Apr 17, 2026

0.13.11

Apr 16, 2026

0.13.10

Apr 16, 2026

0.13.9

Apr 16, 2026

0.13.8

Apr 16, 2026

0.13.7

Apr 16, 2026

0.13.6

Apr 16, 2026

0.13.5

Apr 16, 2026

0.13.4

Apr 16, 2026

0.13.3

Apr 16, 2026

0.13.2

Apr 16, 2026

0.13.1

Apr 16, 2026

0.13.0

Apr 15, 2026

0.12.14

Apr 15, 2026

0.12.13

Apr 15, 2026

0.12.12

Apr 15, 2026

0.12.11

Apr 15, 2026

0.12.10

Apr 15, 2026

0.12.9

Apr 15, 2026

0.12.8

Apr 15, 2026

0.12.7

Apr 14, 2026

0.12.6

Apr 14, 2026

0.12.5

Apr 14, 2026

0.12.4

Apr 14, 2026

0.12.3

Apr 14, 2026

0.12.2

Apr 14, 2026

0.12.1

Apr 14, 2026

0.12.0

Apr 14, 2026

0.11.0

Apr 14, 2026

0.10.0

Apr 14, 2026

0.9.0

Apr 14, 2026

0.8.1

Apr 14, 2026

0.8.0

Apr 14, 2026

0.7.1

Apr 14, 2026

0.7.0

Apr 14, 2026

0.6.0

Apr 14, 2026

0.5.0

Apr 13, 2026

0.4.0

Apr 8, 2026

0.3.0

Apr 5, 2026

This version

0.2.0

Apr 5, 2026

0.1.0

Apr 5, 2026

Download files

Source Distribution

research_pipeline-0.2.0.tar.gz (48.1 kB)

Uploaded Apr 5, 2026 Source

Built Distribution

research_pipeline-0.2.0-py3-none-any.whl (80.8 kB)

Uploaded Apr 5, 2026 Python 3

File details

Details for the file research_pipeline-0.2.0.tar.gz.

File metadata

Download URL: research_pipeline-0.2.0.tar.gz
Upload date: Apr 5, 2026
Size: 48.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.10.12 (`uv publish`, Fedora Linux 43)

File hashes

Hashes for research_pipeline-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`d31c5f6f0b263dce680edc2aac62f75e7b9f94e31fa27ccec623becac289dfba`
MD5	`e19436ddf2937f24dc2a54c3f3d8c600`
BLAKE2b-256	`89621db9c69618dc233cefe2f9de8029160745a0cd96fe6381ffa8f99acc59f0`

File details

Details for the file research_pipeline-0.2.0-py3-none-any.whl.

File metadata

Download URL: research_pipeline-0.2.0-py3-none-any.whl
Upload date: Apr 5, 2026
Size: 80.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.10.12 (`uv publish`, Fedora Linux 43)

File hashes

Hashes for research_pipeline-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`da85298748f663a9ff6b97cfea6c714b1cb0dd49daba600d0a50594cfe391528`
MD5	`d0d99aa65f6ec5d2c1d1eb10ab6ab649`
BLAKE2b-256	`f206dd9f6f2755abb3bc2637bc65c061891b17bd8845ad1311a686c1c4b801b9`
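A downloaded file can be checked against the published SHA256 digest with the standard library. This is a minimal sketch; the helper names `sha256_of` and `matches_published` are hypothetical, and the expected digest should be copied from the table for the exact file being checked.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path: Path, expected_hex: str) -> bool:
    """Compare a local file against a published digest, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()
```

Reading in chunks keeps memory use flat regardless of file size, which matters little for an 80 kB wheel but makes the helper reusable for larger artifacts such as downloaded PDFs.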

research-pipeline 0.2.0
