An intelligent literature search and batch download tool with advanced query support and concurrent downloads


PDFGet - Intelligent Literature Search and Batch Download Tool


An intelligent literature search and batch download tool with advanced query support and concurrent downloads.

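What's new in 0.1.6

Unified public entry points: PDFDownloader exposes only download_paper(record) and PaperSearcher exposes only search_papers(query, limit, source). All internal helpers (per-source dispatch, file I/O, raw API calls) are now underscore-prefixed.
LocalPDFStore: a new standalone module, src/pdfget/storage.py, takes over the local PDF storage duties (path resolution, existence checks, streaming writes, listing, cleanup, statistics) that previously lived in PDFDownloader.
execute_download_plan: download_service.py now provides execute_download_plan, a pure function that takes a plan as input and performs the actual downloads. The CLI workflow no longer duplicates the façade logic and drives UnifiedDownloadManager through this function.
Injectable hook in the Python API: external scripts can call the low-level LocalPDFStore directly (paths, writes, listing, cleanup, cache_info()), which makes it easier to plug the PDF store into a larger workflow.
Test suite reorganization: test_searcher.py, test_downloader.py, test_arxiv_* and test_integration_pmc_oa.py were rewritten around the single public entry point; test_storage.py was added; a new public-surface sentinel test guards against regressions caused by adding new public methods.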
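
To make the storage responsibilities listed above concrete, here is a minimal sketch of a local PDF store. It is an illustration under stated assumptions, not pdfget's LocalPDFStore: the class name TinyPDFStore, the write_stream method, and the ".part" temporary suffix are all hypothetical, and only the general duties (path resolution, existence check, streaming write, cleanup, statistics) come from the notes above.

```python
import time
from pathlib import Path


class TinyPDFStore:
    """Hypothetical stand-in for a local PDF store; not pdfget's LocalPDFStore."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def path_for(self, paper_id: str) -> Path:
        # Replace "/" so that a DOI such as 10.1186/x maps to a single file name.
        return self.root / f"{paper_id.replace('/', '_')}.pdf"

    def has(self, paper_id: str) -> bool:
        return self.path_for(paper_id).is_file()

    def write_stream(self, paper_id: str, chunks) -> Path:
        # Write under a temporary name first, so an interrupted download
        # never leaves a file that looks like a finished PDF.
        target = self.path_for(paper_id)
        partial = target.with_suffix(".part")
        with partial.open("wb") as handle:
            for chunk in chunks:
                handle.write(chunk)
        partial.replace(target)
        return target

    def cleanup_older_than(self, days: float) -> int:
        # Delete PDFs whose modification time is older than the cutoff.
        cutoff = time.time() - days * 86400
        removed = 0
        for pdf in self.root.glob("*.pdf"):
            if pdf.stat().st_mtime < cutoff:
                pdf.unlink()
                removed += 1
        return removed

    def cache_info(self) -> dict:
        files = list(self.root.glob("*.pdf"))
        return {"count": len(files), "bytes": sum(f.stat().st_size for f in files)}
```

Keeping these operations in one small class means a downloader only needs to ask "do I already have this file?" and "where should I write it?", which is roughly the separation the release note describes.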

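What's new in 0.1.5

Added a complete arXiv search and download pipeline: search with -S arxiv, download arXiv PDFs directly, and trigger a download from input such as -m "2301.12345".
Added the -S all combined search mode, which queries PubMed, Europe PMC, and arXiv together.
Added a standardized paper record schema and schema-first JSON output for search and download, so that scripts, agents, and automation systems can consume the results reliably.
In --format json mode, all logs go to stderr and stdout carries a single JSON payload.
Concurrent download results are now returned in input order, and entries stay aligned even when PMCIDs, DOIs, or arXiv IDs are duplicated.
Added regression tests for arXiv, JSON output, and the unified input pipeline, and cleaned up the type-checking and test baselines.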
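
The ordering guarantee for concurrent downloads can be illustrated with a short sketch. This is one common way to get input-ordered results from a thread pool, not necessarily how pdfget implements it; download_all and the fetch callback are hypothetical names introduced for the example.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(identifiers, fetch, max_workers=3):
    """Run fetch(identifier) concurrently and return results in input order."""
    results = [None] * len(identifiers)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Key each future by its position, not by the identifier, so that
        # duplicate identifiers cannot overwrite each other's slot.
        futures = {
            pool.submit(fetch, ident): index
            for index, ident in enumerate(identifiers)
        }
        for future in as_completed(futures):
            index = futures[future]
            try:
                results[index] = {
                    "input": identifiers[index],
                    "ok": True,
                    "value": future.result(),
                }
            except Exception as exc:  # record the failure instead of aborting
                results[index] = {
                    "input": identifiers[index],
                    "ok": False,
                    "error": str(exc),
                }
    return results
```

Because each result is stored by index, the output list lines up with the input list even when the same PMCID, DOI, or arXiv ID appears more than once and tasks finish out of order.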

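JSON output examples

# Machine-readable search output
pdfget -s "vision transformer" -S arxiv --format json > search.json

# Machine-readable download output
pdfget -s "vision transformer" -S arxiv -d --format json > download.json

As of 0.1.5, the search.json and download.json files produced by these commands contain only valid JSON; log messages are written to stderr.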
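
Since stdout carries a single JSON payload and logs go to stderr, a script can call the CLI and parse the result directly. The sketch below assumes pdfget is installed and on PATH; build_command and run_json are hypothetical helper names, and the code makes no assumption about the fields inside the payload.

```python
import json
import subprocess
import sys


def build_command(query, source="arxiv", limit=None, download=False):
    """Assemble a pdfget command line using only flags shown in this README."""
    cmd = ["pdfget", "-s", query, "-S", source, "--format", "json"]
    if limit is not None:
        cmd += ["-l", str(limit)]
    if download:
        cmd.append("-d")
    return cmd


def run_json(cmd):
    """Run the command, forward its logs, and parse stdout as one JSON document."""
    done = subprocess.run(cmd, capture_output=True, text=True, check=True)
    sys.stderr.write(done.stderr)  # in JSON mode the logs arrive on stderr
    return json.loads(done.stdout)
```

A call such as run_json(build_command("vision transformer", limit=5)) would return the parsed payload; check=True raises CalledProcessError if the command exits with a non-zero status.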

New features at a glance

# Search PubMed + Europe PMC + arXiv together
pdfget -s "large language model" -S all -l 30

# Download directly by arXiv ID
pdfget -m "2301.12345" -d

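Overview

PDFGet is a literature retrieval tool designed for researchers. It integrates common academic data sources, including PubMed, Europe PMC, and arXiv.

Key features

🔍 Smart search - supports advanced query syntax for precise literature lookup
📥 Batch download - automatically downloads open-access PDFs, with concurrency
📋 Mixed input - accepts CSV files together with PMCID/PMID/DOI/arXiv ID values given on the command line
🎯 PMC filter - add pubmed pmc[sb] to the query so that results are 100% downloadable
🧠 arXiv support - search with -S arxiv and download arXiv PDFs directly
💾 Smart cache - avoids repeated downloads and saves time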

Quick start

Installation

# Install with pip
pip install pdfget

# Install with uv (recommended)
uv add pdfget

Up and running in 5 minutes

# 1. Search for and download 20 cancer-related papers
pdfget -s "cancer AND pubmed pmc[sb]" -l 20 -d

# 2. Batch download from a CSV file
pdfget -m pmcids.csv -c pmcid -d

# 3. Download a single paper
pdfget -m "PMC5764346" -d

# 4. Search for and download arXiv papers
pdfget -s "vision transformer" -S arxiv -l 20 -d

# 5. View statistics only (no download)
pdfget -s "machine learning" -l 100

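Common use cases

Case 1: Search for and download papers that are available

Use the PMC filter so that every result can be downloaded:

# Search for and download 50 machine learning papers indexed in PMC
pdfget -s "machine learning AND pubmed pmc[sb]" -l 50 -d -o my_papers

Case 2: Batch download from CSV

Download a batch of PMCIDs listed in a CSV file:

# CSV file format:
# pmcid
# PMC5764346
# PMC5761748
# ...

# Run the download
pdfget -m pmcids.csv -c pmcid -d -t 5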
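
If the identifiers come from another script, the CSV file in the format above can be generated with the standard library. This is a small sketch; write_id_csv is a hypothetical helper, and the only thing taken from the README is the single-column layout with a pmcid header.

```python
import csv
from pathlib import Path


def write_id_csv(path, identifiers, column="pmcid"):
    """Write one identifier per row under a single header column."""
    target = Path(path)
    with target.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow([column])
        for ident in identifiers:
            ident = ident.strip()
            if ident:  # skip blank entries
                writer.writerow([ident])
    return target
```

The resulting file can then be passed to the CLI with -m and its column name with -c, as in the command above.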

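Case 3: Measure open-access coverage

Find out what share of a field is open access:

# Report PMCID statistics for cancer immunotherapy papers
pdfget -s "cancer immunotherapy" -l 1000

Case 4: Download with mixed identifiers

PMCID, PMID, DOI, and arXiv ID values can be mixed in one input:

# One or more identifiers
pdfget -m "PMC123456"
pdfget -m "PMC123456,38238491,10.1186/s12916-020-01690-4,2301.12345" -d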
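Case 5: arXiv search and download

Useful for quickly fetching papers on machine learning, computer vision, LLMs, and similar topics:

# Search arXiv
pdfget -s "large language model reasoning" -S arxiv -l 20

# Download arXiv PDFs directly
pdfget -s "diffusion model" -S arxiv -l 10 -d

# Or download by arXiv ID
pdfget -m "2301.12345" -d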

Installation

System requirements

Python 3.12 or later
The uv package manager is recommended

Installation methods

# Install from PyPI
pip install pdfget

# Install from source
git clone https://github.com/gqy20/pdfget.git
cd pdfget
pip install -e .

Run with uv

uv run pdfget -s "machine learning" -l 20

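Core parameters

Required parameters (choose one of three)

-s QUERY - search for papers
-m INPUT - batch input (CSV file or identifiers)
--resume REPORT_OR_PLAN - retry failed items from a run report, or continue from a download plan

-S SOURCE - select the search data source (optional modifier, defaults to pubmed)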

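Common parameters

-d - download PDFs (the default is statistics mode)
-l NUM - number of records to process (default 200)
-t NUM - number of concurrent threads (default 3)
--delay SEC - delay between downloads in seconds (default 1.0)
--dry-run - only generate search results and a download plan, without downloading
-o DIR - output directory (default data/pdfs)
-v - verbose output

Data source selection

-S pubmed - PubMed (default)
-S europe_pmc - Europe PMC
-S arxiv - arXiv
-S both - PubMed + Europe PMC
-S all - PubMed + Europe PMC + arXiv

API configuration (optional)

-e EMAIL - NCBI API email
-k KEY - NCBI API key

To obtain an API key, visit your NCBI account settings.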

Usage examples

Basic search and download

# Search and show PMCID statistics
pdfget -s "cancer immunotherapy" -l 100

# Search and download PDFs
pdfget -s "cancer immunotherapy AND pubmed pmc[sb]" -l 20 -d

# Search and download arXiv PDFs
pdfget -s "vision transformer" -S arxiv -l 20 -d

# Combined search across all three data sources
pdfget -s "large language model" -S all -l 30

# Specify the output directory
pdfget -s "machine learning" -l 50 -d -o ~/papers

# Preview the download plan without downloading
pdfget -s "vision transformer" -S all -l 30 -d --dry-run

Batch download from CSV

# Detect the column name automatically
pdfget -m identifiers.csv -d

# Specify the column name
pdfget -m data.csv -c pmcid -d -t 5

# Download directly by arXiv ID
pdfget -m "2401.01234,2301.12345" -d

# Adjust the download speed
pdfget -m pmcids.csv -d --delay 0.5

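Resuming after failures

Every download run writes download_plan.json, run_summary.json, and timestamped archive copies to the output directory. The report records, for each entry, the input, the download result, the error message, the failure category, retry advice, the attempt details for each download source, and a retryable paper record. run_summary.json also includes entries that the download plan skipped, such as duplicates, entries with no download route, and identifiers that could not be parsed, and it provides stats aggregated by status, stage, failure category, skip reason, retry reason, and download source.

--resume accepts either run_summary.json or download_plan.json. With the former, it retries by default only the failures that the report marks as retryable. With the latter, it continues executing the plan and relies on existing-file checks to skip PDFs that are already complete.

# Retry the items that failed in the previous run
pdfget --resume data/pdfs/run_summary.json -o data/pdfs -t 3

# Continue from a download plan
pdfget --resume data/pdfs/download_plan.json -o data/pdfs -t 3

# Adjust the download source priority
pdfget -m identifiers.csv -d --source-priority europe_pmc,pmc,arxiv,direct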

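Python API

To run a unified-input download from a script, or to generate only a download plan, use the façade exported at the package top level:

from pdfget import (
    PaperFetcher,
    build_download_plan_from_unified_input,
    download_from_unified_input,
)
from pdfget.storage import LocalPDFStore

fetcher = PaperFetcher(output_dir="data/pdfs")

# Combined façade: build the plan and run the download in one step
results = download_from_unified_input(
    fetcher,
    "identifiers.csv",
    column="ID",
    max_workers=3,
)

# For finer control, work in two steps (the CLI does the same).
# Step 1: build the plan and keep the entries that are ready to download
plan = build_download_plan_from_unified_input(
    "PMC123456,10.1186/s12916-020-01690-4,2301.12345",
    resolver=fetcher,
    logger=fetcher.logger,
)
downloadable = [e for e in plan["entries"] if e["status"] == "ready"]
records = [e["paper"] for e in downloadable]

# Step 2: run the download. This façade call builds and executes its own plan
# from the CSV; to execute an existing plan object, use execute_download_plan(plan, ...).
results = download_from_unified_input(
    fetcher,
    "identifiers.csv",
    column="ID",
    max_workers=3,
)

# Work with the local PDF store directly (cleanup, listing, path lookup)
store = LocalPDFStore("data/pdfs")
print(store.cache_info())          # file count / byte count
if records:                        # the plan may contain no ready entries
    print(store.path_for(records[0]))   # expected PDF path
store.cleanup_older_than(days=30)  # remove PDFs older than 30 days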
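The download pipeline uses download_plan.v1 as its boundary protocol. UnifiedDownloadManager.download_batch() accepts only paper records produced by a plan and no longer accepts a list of bare DOI strings. To handle mixed CSV, PMCID, PMID, DOI, or arXiv input, first call build_download_plan_from_unified_input(), or call download_from_unified_input() directly. Search and statistics output can be controlled with --format console|json|markdown, or by passing format_type="console" | "json" | "markdown" to StatsFormatter.format(stats, format_type=...).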

Consolidated module entry points

Class	Public API (everything else is underscore-prefixed)
`PaperFetcher`	Still a façade; new Python entry points should use `download_from_unified_input`
`PaperSearcher`	`search_papers(query, limit, source, *, require_pmcid, include_arxiv)`
`PDFDownloader`	`download_paper(record)`
`LocalPDFStore`	`path_for / has / open_writer / list_records / cleanup_older_than / cache_info`
`download_service`	`execute_download_plan(plan, ...)` + `download_from_unified_input(...)`
`UnifiedDownloadManager`	`download_batch(papers, timeout=30)`

PMC filtering tips

# Make sure results are 100% downloadable (recommended)
pdfget -s "your-topic AND pubmed pmc[sb]" -l 50 -d

# Include all free full text (only some of it is downloadable)
pdfget -s "your-topic filter[free full text]" -l 100

# Filter by year
pdfget -s "machine learning AND pubmed pmc[sb] 2020:2023[pd]" -l 30 -d
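
When queries are assembled in a script, the filter fragments shown above can be composed with a small helper. This is a sketch; build_query is a hypothetical function that only concatenates the exact fragments used in these examples and does not validate them against PubMed's search syntax.

```python
def build_query(topic, pmc_only=True, years=None):
    """Compose a search string from the filter fragments used in the examples."""
    query = topic
    if pmc_only:
        query += " AND pubmed pmc[sb]"  # restrict results to PMC-indexed papers
    if years is not None:
        start, end = years
        query += f" {start}:{end}[pd]"  # year range, in the form used above
    return query
```

For instance, build_query("machine learning", years=(2020, 2023)) reproduces the query string of the year-filter example, which can then be passed to pdfget -s.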

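Detailed documentation

For the complete usage guide, see: 📚 Detailed user documentation

If you want to use PDFGet as a low-level tool for agents or scripts, see the structured output protocol documentation: 🧩 Schema Guide

The detailed documentation covers:

Advanced query syntax
Full parameter reference
Output format details
Troubleshooting

License

MIT License - see the LICENSE file for details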

Release history

0.1.6 - Jun 29, 2026 (this version)
0.1.5 - Apr 1, 2026
0.1.4 - Dec 24, 2025
0.1.3 - Dec 22, 2025
0.1.2 - Dec 9, 2025
0.1.0 - Dec 7, 2025
