Automated paper fetching and analysis platform.
pyPaperFlow
An automated literature processing platform for scientific researchers.
Batch retrieve, fetch, parse, and structure papers from PubMed, arXiv, bioRxiv, and DOI-based sources.
From paper retrieval to knowledge internalization, automate the heavy lifting and keep the judgment human.
If this project helps you, please consider giving it a Star ⭐. Thank you!
Overview
An automated literature processing platform for scientific researchers. This tool focuses on information extraction and knowledge discovery stages, enabling researchers to efficiently complete the entire workflow from literature retrieval to knowledge internalization through a 7-stage automated process.
Core Objectives
- Rapid Domain Entry: Batch retrieve and access all available literature in a specific field
- Batch Knowledge Extraction: Utilize AI long-text processing capabilities to extract structured knowledge from massive amounts of text
- Research Trend Tracking: Quickly grasp the latest research methods, conclusions, and core papers in a field
Positioning
This tool is designed to complement rather than replace reference management software like Zotero. We focus on the two key steps of "Information Extraction" and "Knowledge Discovery" to build a structured knowledge base for you, laying the foundation for subsequent semantic search, content analysis, and review generation.
Features
- Automated Retrieval from Multiple Sources: Automatically search and retrieve paper metadata and full-text records from PubMed/Medline, arXiv, medRxiv, chemRxiv, and bioRxiv. The repository focuses primarily on biomedical research and computational interdisciplinary fields (Biomedicine + Computational Biology).
- Full-Text Access: Automatically download open-access full texts in XML/Text format from PMC. For preprints and other publications without accessible PMC full texts, alternative acquisition modules fetch the original PDFs, with Sci-Hub set as the fallback provider.
- Structured Storage:
  - Metadata: Preserved in well-structured, detailed JSON files.
  - Full Text: Stored in multiple formats, including parsed JSON and Markdown, for versatile downstream usage: JSON for programmatic data analysis, and Markdown optimized for LLM comprehension and processing.
- Standardized Structured Parsing: All papers are parsed and organized into standardized JSON schemas. The schema classifies content into metadata fields (title, year, authors) and canonical academic sections (abstract, introduction, results, discussion, methods, conclusion, supplementary, availability, funding, acknowledgements, author contributions, references, other). Custom section parsing is supported, so users can apply self-defined JSON schemas to papers with special formatting structures. Dedicated modules extract designated sections from bulk topic-related papers and assemble them into source-verified Markdown literature corpora, which supports subsequent literature investigation and systematic review writing.
- LLM & Agent Empowerment: Integrate LLM skills and intelligent Agent capabilities to streamline the entire workflow of literature investigation and in-depth reading.
- CLI Tool: Provide a user-friendly command-line interface, paperflow, that supports all core operations out of the box.
Architecture Vision
See Design.md for more details about our design philosophy.
The project is designed around a 7-stage workflow:
flowchart TD
A[Retrieval &<br>Collection] --> B[Processing &<br>Parsing]
B --> C[Structured<br>Extraction]
C --> D[Deep Encoding &<br>Vectorization]
D --> E[Dynamic Knowledge<br>Base Storage]
E --> F[Intelligent Interaction &<br>Discovery]
F --> G[Final Output &<br>Internalization]
style A fill:#e1f5fe
style B fill:#f3e5f5
style C fill:#e8f5e8
style D fill:#fff3e0
style E fill:#ffebee
style F fill:#f1f8e9
subgraph A [Stage 1: Highly Automatable]
direction LR
A1[Requirement Analysis] --> A2[Platform Search]
A2 --> A3[Initial Screening]
end
subgraph B [Stage 2: Highly Automatable]
direction LR
B1[Batch Download] --> B2[Format Parsing<br>PDF/HTML/XML]
B2 --> B3[Text Preprocessing]
end
subgraph C [Stage 3: Human-AI Collaboration Core]
direction LR
C1[Metadata Extraction] --> C2[Core Content Extraction<br>Abstract/Methods/Conclusion]
C2 --> C3[Relation & Viewpoint Extraction]
end
subgraph D [Stage 4: Fully Automatable]
direction LR
D1[Text Slicing] --> D2[Vector Embedding]
end
subgraph E [Stage 5: Fully Automatable]
direction LR
E1[Database Storage] --> E2[Vector Indexing]
end
subgraph F [Stage 6: Human-AI Collaboration Core]
direction LR
F1[Semantic Search] --> F2[Association Rec.] --> F3[Knowledge Graph Analysis] --> F4[Review & QA]
end
subgraph G [Stage 7: Human-Led]
direction LR
G1[Critical Reading] --> G2[Inspiration Generation] --> G3[Exp. Design &<br>Paper Writing]
end
Installation
# 1. install from source
git clone https://github.com/MaybeBio/pyPaperFlow.git
cd pyPaperFlow
pip install -e .
# 2. install MinerU
# follow the official installation guide: https://github.com/opendatalab/MinerU
# verify installation: mineru --help
pip install --upgrade pip -i https://mirrors.aliyun.com/pypi/simple
pip install uv -i https://mirrors.aliyun.com/pypi/simple
uv pip install -U "mineru[all]" -i https://mirrors.aliyun.com/pypi/simple
# 3. install AI backend
pip install openai anthropic
# 4. install paperscraper backend
# follow the official installation guide: https://github.com/jannisborn/paperscraper
pip install paperscraper
⚠️ For typical usage, only steps 1 and 2 are required: installing the repository from source and installing MinerU.
Usage
We designed pyPaperFlow as a versatile academic research tool built around the real-world workflow of researchers: literature investigation, paper reading, comprehension and analysis, and corpus utilization.
The steps below therefore mirror a full literature research process. Working through them in order is the quickest way to understand both the design philosophy and the usage of this tool.
The platform provides a CLI tool named paperflow.
Module Overview
Currently available modules (the list is updated continuously):
❯ paperflow --help
Usage: paperflow [OPTIONS] COMMAND [ARGS]...
pyPaperFlow CLI
╭─ Options ────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --install-completion  Install completion for the current shell.                                                     │
│ --show-completion     Show completion for the current shell, to copy it or customize the installation.              │
│ --help                Show this message and exit.                                                                   │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Commands ───────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ pubmed-search      Search PubMed using Your customized query and return PMIDs.                                       │
│ pubmed-meta        Fetch paper metadata from PubMed using Your customized query, pmid list file and save to storage. │
│ pubmed-content     Download full text (PMC) for given PMIDs if the paper has a PMC ID.                               │
│ pubmed-all         Fetch BOTH metadata and full text (if available) for papers.                                      │
│                    Also extracts URLs from full text and updates metadata links.                                     │
│ pubmed-merge-json  Create a merged JSON (or JSONL) file from PubMed paper directories.                               │
│ pubmed-export-md   Export a single Markdown view from a merged JSON file using optional YAML config.                 │
│ arxiv-search       Search arXiv and write matching IDs to a text file.                                               │
│ arxiv-fetch        Fetch arXiv metadata and attempt to download PDFs.                                                │
│ biorxiv-search     Search bioRxiv and write matching IDs to a text file.                                             │
│ biorxiv-fetch      Fetch bioRxiv metadata and attempt to download PDFs.                                              │
│ paper-fetch        Fetch PDFs by DOI - passes through to the paper-fetch engine.                                     │
│ pdf-parse          Parse a PDF file using MinerU engine, and clean up the output directory.                          │
│ mineru-parse       Parse mineru output content_list_v2.json into canonical sectioned JSON.                           │
│ mineru-export-md   Export structured mineru JSON to a clean Markdown file for LLM processing.                        │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Grouped by data source, the modules are:
PubMed Modules:
- pubmed-search # search papers and return PMIDs
- pubmed-meta # fetch paper metadata from PubMed
- pubmed-content # download full text (PMC) for given PMIDs if the paper has a PMC ID
- pubmed-all # fetch BOTH metadata and full text (if available) for papers
- pubmed-merge-json # batch-merge a collection of PubMed papers on the same topic
- pubmed-export-md # export PubMed paper collections as Markdown files, supporting batch export of specific sections (e.g., batch export of introductions as your research background)
arXiv Modules:
- arxiv-search # search arXiv and return matching IDs
- arxiv-fetch # fetch arXiv metadata and attempt to download PDFs
bioRxiv Modules:
- biorxiv-search # search bioRxiv and return matching IDs
- biorxiv-fetch # fetch bioRxiv metadata and attempt to download PDFs
Third-party Modules:
- paper-fetch # fetch PDFs by DOI
- pdf-parse # parse PDF files into JSON and Markdown using the MinerU engine
- mineru-parse # based on your custom section configuration, re-parse the MinerU output file into structured JSON grouped by standard paper sections
- mineru-export-md # based on your custom section configuration, export the structured mineru JSON to a clean Markdown file for LLM processing (e.g., batch export of introductions as your research background)
⚠️ Modules for other preprint platforms are under development. Stay tuned!
1. Research Start Point
The first step in a literature review is collecting and organizing literature information. When your existing knowledge is insufficient, academic materials must be integrated to systematically grasp the current state of research, domestic and international, in the relevant fields.
First, define the intended research topic. At the initial stage, you may have only scattered preliminary ideas, a few papers, rough investigation drafts, or no prior materials at all, merely several core keywords.
In this phase, define the research direction and scope provisionally from all available information. Only broad research boundaries are needed here; the ultimate research objective does not have to be fixed in the first iteration.
Accordingly, brainstorming is required, either a priori or a posteriori. This tool provides dedicated built-in modules to help you organize existing ideas and information and refine them into well-defined research directions and scopes.
Inputs:
- Research Direction: The intended research topic or problem domain
- Existing Information: Related papers, investigation drafts, keywords, and other prior materials you have obtained (attachments supported)
Outputs:
- Research Scope: An explicit definition covering core topics and boundary constraints. More intuitively, it can be regarded as preliminary research questions or the overall research orientation, uniformly defined as the Starting Point of Research in this document.
- Output is mainly presented as a keyword list guiding subsequent literature retrieval or standardized research question statements. Constraints can be supplemented through multiple iterations according to research requirements.
Core Note:
The Starting Point of Research is not fixed once and for all. It can be updated and refined over multiple iterations as new information arrives and the research progresses.
You may use state-of-the-art large language models, together with all materials and information at hand, to repeatedly verify and refine the Starting Point of Research until it is sufficiently clear and specific, or meets the criteria for proceeding to literature retrieval.
Here we provide a few brainstorming skills for literature review: Skills List
2. Search Papers (and Fetch Metadata)
Once the starting point of research is settled (or at any intermediate brainstorming stage that needs supplementary literature), you may proceed with paper retrieval.
This tool does not generate search queries for you. Instead, we highly recommend crafting syntactically valid, high-relevance queries before using our search module.
Our literature database primarily covers biomedical research and computational interdisciplinary fields, with core data sources as follows:
- PubMed/Medline
- arXiv
- bioRxiv, medRxiv, chemRxiv
We recommend learning the search syntax of these databases, because our built-in search module behaves like the search bar on their official web portals.
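PubMed queries of this kind usually OR the synonyms of one concept together and then AND the concepts, optionally adding a publication-date range. The following is a minimal sketch of assembling such a query string in Python. It is not part of pyPaperFlow, and the helper names `or_block` and `build_query` are hypothetical.

```python
def or_block(mesh=(), tiab=()):
    """Build one concept block: MeSH terms and Title/Abstract terms joined by OR."""
    tagged = [f'"{term}"[Mesh]' for term in mesh]
    tagged += [f'"{term}"[Title/Abstract]' for term in tiab]
    if not tagged:
        raise ValueError("a concept block needs at least one term")
    return "(" + " OR ".join(tagged) + ")"


def build_query(blocks, date_range=None):
    """AND the concept blocks together; optionally append a publication-date range."""
    parts = list(blocks)
    if date_range is not None:
        start, end = date_range
        parts.append(
            f'("{start}"[Date - Publication] : "{end}"[Date - Publication])'
        )
    return " AND ".join(parts)


if __name__ == "__main__":
    query = build_query(
        [
            or_block(mesh=["Intrinsically Disordered Proteins"], tiab=["IDR", "IDP"]),
            or_block(mesh=["Deep Learning"], tiab=["Neural Network"]),
        ],
        date_range=("2023/01/01", "2026/12/31"),
    )
    print(query)
```

The resulting string can be passed as the QUERY argument of the search command. Keeping each concept in its own block makes it easier to add or drop synonyms between iterations.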
For instance, here is a typical complex query example tailored for PubMed:
"""
(
"Intrinsically Disordered Proteins"[Mesh] OR
"Intrinsically Disordered Protein"[Title/Abstract] OR
"Intrinsically Disordered Proteins"[Title/Abstract] OR
"Intrinsically Disordered Region"[Title/Abstract] OR
"Intrinsically Disordered Regions"[Title/Abstract] OR
"Natively Unfolded Protein"[Title/Abstract] OR
"Natively Unfolded Proteins"[Title/Abstract] OR
"Unstructured Protein"[Title/Abstract] OR
"Unstructured Proteins"[Title/Abstract] OR
"IDR"[Title/Abstract] OR
"IDP"[Title/Abstract]
)
AND
(
"Protein Interaction Maps"[Mesh] OR
"Protein Interaction Maps"[Title/Abstract] OR
"Protein Interaction Networks"[Title/Abstract] OR
"Protein-Protein Interaction Map"[Title/Abstract] OR
"Protein-Protein Interaction Network"[Title/Abstract] OR
"Protein Interaction Mapping"[Mesh] OR
"Protein Interaction Mapping"[Title/Abstract] OR
"Binding Sites"[Title/Abstract] OR
"Protein Binding"[Title/Abstract] OR
"Protein Interaction Domains and Motifs"[Title/Abstract] OR
"Protein Interaction Maps"[Title/Abstract] OR
"Protein Interaction Domains and Motifs"[Mesh] OR
"Protein Interaction"[Title/Abstract] OR
"Protein-Protein Interaction"[Title/Abstract] OR
"PPI"[Title/Abstract] OR
"Interaction"[Title/Abstract] OR
"Binding"[Title/Abstract] OR
"Interface"[Title/Abstract] OR
"Complex"[Title/Abstract]
)
AND
(
"Artificial Intelligence"[Mesh] OR
"Deep Learning"[Mesh] OR
"Machine Learning"[Mesh] OR
"Neural Networks, Computer"[Mesh] OR
"Artificial Intelligence"[Title/Abstract] OR
"Deep Learning"[Title/Abstract] OR
"Machine Learning"[Title/Abstract] OR
"Neural Network"[Title/Abstract]
)
AND (
"2023/01/01"[Date - Publication] : "2026/12/31"[Date - Publication]
)
"""
Once you finish constructing your search query, you can start searching for papers. We will use the PubMed-related API as an example.
❯ paperflow pubmed-search --help
Usage: paperflow pubmed-search [OPTIONS] QUERY
Search PubMed using Your customized query and return PMIDs.
Notes:
- 1, This command only searches and returns PMIDs, it does not fetch paper metadata.
- 2, This command will print the found PMIDs and also save them to 'pubmed_searched_ids.txt' in the specified output directory.
If --output-dir is not specified, it will default to the storage directory.
- 3, Note that storage_dir is used to initialize the fetcher for consistency, while output_dir is where the PMIDs are saved.
They are different parameters!
Example usage:
- 1. Search for papers related to "machine learning" and return up to 500 PMIDs/per batch:
paperflow pubmed-search "machine learning" --retmax 500 --output-dir ./MyPapers --email "YOUR_EMAIL@example.com" --api-key "YOUR_NCBI_API_KEY"
╭─ Arguments ───────────────────────────────────────────────────────────────────────────────────────────────────╮
│ *    query      TEXT  PubMed search query. [required]                                                         │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ─────────────────────────────────────────────────────────────────────────────────────────────────────╮
│   --retmax       -n  INTEGER  Max number of PMIDs to return every batch, must less than 10000. [default: 500] │
│ * --email            TEXT     Entrez Email. [required]                                                        │
│   --api-key          TEXT     NCBI API Key (recommended).                                                     │
│   --storage-dir  -s  TEXT     Directory in Repository-level to store paper data for Initialization.           │
│                               [default: ./Papers]                                                             │
│   --output-dir   -o  TEXT     Directory in result-level to store output IDs.                                  │
│   --max-retries      INTEGER  Maximum number of retries for Entrez API calls. [default: 3]                    │
│   --help                      Show this message and exit.                                                     │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
At this stage, we recommend retrieving paper metadata (primarily abstracts) via literature search.
Literature collection is an iterative process. You can often identify target papers using only abstracts, then proceed to download the required papers in the next step. In some cases, you may still need to download all retrieved papers.
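Abstract-based screening can be carried into the download step through plain PMID list files. The following is a minimal sketch, assuming the searched-ID file holds one PMID per line (the format the `--file` option expects). The function names and the sample PMIDs are illustrative and not part of pyPaperFlow.

```python
import tempfile
from pathlib import Path


def read_pmids(path):
    """Read one PMID per line, skipping blank lines and duplicates, keeping order."""
    seen, pmids = set(), []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        pmid = line.strip()
        if pmid and pmid not in seen:
            seen.add(pmid)
            pmids.append(pmid)
    return pmids


def write_shortlist(pmids, keep, path):
    """Write only the PMIDs selected during abstract screening, one per line."""
    selected = [pmid for pmid in pmids if pmid in keep]
    Path(path).write_text("".join(p + "\n" for p in selected), encoding="utf-8")
    return selected


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        searched = Path(tmp) / "pubmed_searched_ids.txt"
        # Placeholder PMIDs standing in for real search results.
        searched.write_text("10000001\n10000002\n10000001\n", encoding="utf-8")
        all_ids = read_pmids(searched)
        shortlist = write_shortlist(all_ids, {"10000002"}, Path(tmp) / "pmid_list.txt")
        print(all_ids, shortlist)
```

The shortlist file can then be handed to the later fetch commands through `--file`, so only the papers that passed screening are downloaded.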
It is important to emphasize that you can re-enter the brainstorming phase at any stage. The output of each phase can serve as the input for subsequent literature research. Based on the output of this phase, you can conduct further brainstorming to refine your research starting point and define your research questions more precisely.
❯ paperflow pubmed-meta --help
Usage: paperflow pubmed-meta [OPTIONS]
Fetch paper metadata from PubMed using Your customized query, pmid list file and save to storage.
Notes:
- 1, You must provide one of --query, or --file to specify which papers to fetch. Note that they are mutually exclusive.
- 2, -f can be used to fetch one or more PMIDs listed in a text file (one PMID per line).
Example usage:
- 1. Fetch papers for a query and save to storage:
paperflow pubmed-fetch --query "machine learning" --output-dir ./MyPapers --email "YOUR_EMAIL@example.com" --api-key "YOUR_NCBI_API_KEY"
- 2. Fetch papers from a list of PMIDs in a file:
paperflow pubmed-fetch --file ./pmid_list.txt --output-dir ./MyPapers --email "YOUR_EMAIL@example.com" --api-key "YOUR_NCBI_API_KEY"
╭─ Options ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│   --query        -q  TEXT     PubMed search query.                                                                                                  │
│   --file         -f  TEXT     Text file containing PMIDs (one per line), -q and -f are mutually exclusive.                                          │
│   --batch-size   -b  INTEGER  Batch size for fetching. [default: 50]                                                                                │
│ * --email            TEXT     Entrez Email. [required]                                                                                              │
│   --api-key          TEXT     NCBI API Key (recommended).                                                                                           │
│   --storage-dir  -s  TEXT     Directory in Repository-level to store paper data for Initialization. [default: ./Papers]                             │
│   --max-retries      INTEGER  Maximum number of retries for Entrez API calls. [default: 3]                                                          │
│   --output-dir   -o  TEXT     Directory in result-level to store output papers, default is current directory. If not specified, will be set to root │
│                               directory of the repository-level which is storage_dir. We will create a '/pubmed' subfolder under the output         │
│                               directory to save all pubmed related data                                                                             │
│                               [default: .]                                                                                                          │
│   --help                      Show this message and exit.                                                                                           │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
3. Fetch Papers (and Download Full Text)
Once you have confirmed your target papers, or, in the worst case, the metadata obtained during the search phase is insufficient for further evaluation and you need all full-text papers, you may start downloading.
Take PubMed as an example: for PubMed papers, we prioritize downloading full texts from PMC if available. If no PMC full text exists, we only retrieve PubMed metadata (mainly abstracts) and basic paper information.
Additionally, we provide a dedicated PDF-crawling module as a fallback strategy for paper acquisition. Manual retrieval of PDF files is recommended only when all of the above methods fail to obtain PubMed paper data.
Output files from the PubMed database are available in two formats: JSON and Markdown. JSON is recommended for subsequent analysis, while Markdown serves as input for Large Language Models (LLMs). The tool generates both formats together.
❯ paperflow pubmed-content --help
Usage: paperflow pubmed-content [OPTIONS]
Download full text (PMC) for given PMIDs if the paper has a PMC ID.
Notes:
- 1, This currently only supports PMC full text fetching if the paper has a PMC ID.
Example usage:
- 1. Download full text for PMIDs listed in a file:
paperflow download-fulltext --file ./pmid_list.txt --email "YOUR_EMAIL@example" --api-key "YOUR_NCBI_API_KEY" --output-dir ./MyPapers
╭─ Options ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│   --file         -f  TEXT     File containing PMIDs (one per line).                                                                                     │
│ * --email            TEXT     Entrez Email. [required]                                                                                                  │
│   --api-key          TEXT     NCBI API Key (recommended).                                                                                               │
│   --storage-dir  -s  TEXT     Directory in Repository-level to store paper data for Initialization. [default: ./Papers]                                 │
│   --max-retries      INTEGER  Maximum number of retries for Entrez API calls. [default: 3]                                                              │
│   --output-dir   -o  TEXT     Directory in result-level to store output full texts, default is current directory. If not specified, will be set to root │
│                               directory of the repository-level which is storage_dir. We will create a '/pubmed' subfolder under the output directory   │
│                               to save all pubmed related data                                                                                           │
│                               [default: .]                                                                                                              │
│   --pmid         -p  TEXT     Single PMID to download full text for, can be repeated.                                                                   │
│   --help                      Show this message and exit.                                                                                               │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Alternatively, you may fetch metadata and full text in a single step with pubmed-all, although we recommend handling the two steps separately.
❯ paperflow pubmed-all --help
Usage: paperflow pubmed-all [OPTIONS]
Fetch BOTH metadata and full text (if available) for papers. Also extracts URLs from full text and updates metadata links.
Example usage:
- 1. Fetch full papers for a query:
paperflow pubmed-all --query "machine learning" --output-dir ./MyPapers --email "YOUR_EMAIL"
╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│   --query        -q  TEXT     PubMed search query.                                                                         │
│   --file         -f  TEXT     Text file containing PMIDs (one per line), -q and -f are mutually exclusive.                 │
│   --pmid         -p  TEXT     Single PMID to download full text for, can be repeated.                                      │
│   --batch-size   -b  INTEGER  Batch size for fetching. [default: 50]                                                       │
│   --max-retries      INTEGER  Maximum number of retries for Entrez API calls. [default: 3]                                 │
│ * --email            TEXT     Entrez Email. [required]                                                                     │
│   --api-key          TEXT     NCBI API Key (recommended).                                                                  │
│   --storage-dir  -s  TEXT     Directory in Repository-level to store paper data for Initialization. [default: ./Papers]    │
│   --output-dir   -o  TEXT     Directory in result-level to store output papers. If not specified, defaults to storage-dir. │
│   --help                      Show this message and exit.                                                                  │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
For PubMed papers without PMC full texts, or papers from other databases where only the DOI is available (the pubmed-meta module guarantees DOI acquisition), you may download full texts directly by DOI (if open-access versions exist).
❯ paperflow paper-fetch --help
usage: paper-fetch [-h] [--title TITLE] [--batch FILE] [--out DIR] [--dry-run] [--format {json,text}] [--pretty] [--stream] [--overwrite]
[--idempotency-key KEY] [--timeout SECONDS] [--version]
[doi]
Fetch legal open-access PDFs by DOI via Unpaywall, Semantic Scholar, arXiv, PMC, and bioRxiv/medRxiv.
positional arguments:
doi DOI to fetch (e.g. 10.1038/s41586-020-2649-2). Use '-' to read from stdin.
options:
-h, --help show this help message and exit
--title TITLE paper title; resolved to a DOI via Crossref before download. Mutually exclusive with positional DOI / --batch.
--batch FILE file with one DOI per line for bulk download. Use '-' to read from stdin.
--out DIR output directory (default: pdfs)
--dry-run resolve sources without downloading; preview the PDF URL and filename
--format {json,text} output format. json for agents, text for humans. Default: json when stdout is not a TTY, text otherwise.
--pretty pretty-print JSON output (2-space indent)
--stream emit one NDJSON result per line on stdout as each DOI resolves (batch mode)
--overwrite re-download even if the destination file already exists
--idempotency-key KEY
safe-retry key; re-running with the same key replays the original envelope from <out>/.paper-fetch-idem/
--timeout SECONDS HTTP timeout in seconds per request (default: 30)
--version show program's version number and exit
exit codes:
0 all DOIs resolved successfully
1 unresolved (some DOIs had no OA copy; no transport failure)
3 validation error (bad arguments)
4 transport error (network / download / IO failure; retryable class)
subcommands:
schema print the machine-readable CLI schema and exit (no network)
stdin:
paper-fetch - read a single DOI from stdin
paper-fetch --batch - read DOIs line-by-line from stdin
output:
stdout emits one JSON object per invocation (NDJSON with --stream).
stderr emits NDJSON progress events when --format json, prose when --format text.
stdout format auto-detects TTY: json when piped/captured, text in a terminal.
examples:
paper-fetch 10.1038/s41586-020-2649-2
paper-fetch 10.1038/s41586-020-2649-2 --dry-run
paper-fetch --batch dois.txt --out ./papers --format text
echo 10.1038/s41586-020-2649-2 | paper-fetch --batch -
paper-fetch schema
We acknowledge the work of paper-fetch. We have modified, refactored, and encapsulated one of its core scripts for tailored integration into our pipeline.
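The exit codes listed in the help output above make batch runs scriptable. The following is a minimal sketch of building a batch command and interpreting its return code. The flags come from the help text, but the helper names are hypothetical, and invoking the engine as `paperflow paper-fetch` with the flags forwarded unchanged is an assumption.

```python
# Exit codes as documented in the paper-fetch help text:
# code -> (short label, whether a plain retry is reasonable).
EXIT_CODES = {
    0: ("resolved", False),          # all DOIs resolved successfully
    1: ("unresolved", False),        # some DOIs had no OA copy
    3: ("validation_error", False),  # bad arguments
    4: ("transport_error", True),    # network / download / IO failure
}


def classify_exit(returncode):
    """Map a paper-fetch return code to (label, retryable)."""
    return EXIT_CODES.get(returncode, ("unknown", False))


def build_batch_command(doi_file, out_dir):
    """Assemble the argument list for a bulk download (assumed pass-through CLI)."""
    return [
        "paperflow", "paper-fetch",
        "--batch", doi_file,
        "--out", out_dir,
        "--format", "json",
    ]


if __name__ == "__main__":
    print(build_batch_command("dois.txt", "./papers"))
    print(classify_exit(4))
```

A caller could pass the list to `subprocess.run` and retry only when `classify_exit` reports a retryable class. An unresolved result signals that another acquisition route is needed rather than a retry.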
The workflow of our paper acquisition module is outlined below:
┌──────────────────────────────────────────────────────────────────────┐
│ Input: DOI / Paper Title / Batch File                                │
└──────────────────────────────────────────────────────────────────────┘
                                   ↓
┌──────────────────────────────────────────────────────────────────────┐
│ Title-based Resolution? → Crossref → Semantic Scholar                │
│ (Resolves to DOI with confidence score)                              │
└──────────────────────────────────────────────────────────────────────┘
                                   ↓
┌──────────────────────────────────────────────────────────────────────┐
│ 1. Unpaywall (requires UNPAYWALL_EMAIL)                              │
│    → Fastest open-access (OA) links with metadata                    │
└──────────────────────────────────────────────────────────────────────┘
                    Failure / Skip ↓
┌──────────────────────────────────────────────────────────────────────┐
│ 2. Semantic Scholar                                                  │
│    → PDF URLs + external identifiers (arXiv / PMCID)                 │
└──────────────────────────────────────────────────────────────────────┘
                           Failure ↓
┌──────────────────────────────────────────────────────────────────────┐
│ 3. arXiv (via S2 externalIds.ArXiv)                                  │
│ 4. Europe PMC → PMC (via PMCID)                                      │
│ 5. bioRxiv / medRxiv (DOI prefix: 10.1101/)                          │
└──────────────────────────────────────────────────────────────────────┘
                     Total Failure ↓
┌──────────────────────────────────────────────────────────────────────┐
│ 6. Publisher Direct Links (Institutional Mode Only)                  │
│    Nature / Science / Elsevier / Springer, etc.                      │
│    Requires institutional IP / subscription / EZproxy authorization  │
└──────────────────────────────────────────────────────────────────────┘
                Persistent Failure ↓
┌──────────────────────────────────────────────────────────────────────┐
│ 7. Sci-Hub Mirror Fallback (enabled by default, configurable)        │
│    → 1 request-per-second rate-limiting to prevent CAPTCHA triggers  │
│    → Automatic discovery of active new mirrors                       │
└──────────────────────────────────────────────────────────────────────┘
Resolution Priority Sequence
1. Unpaywall: The preferred open-access source, covering the broadest range of publishers with the highest hit rate.
2. Semantic Scholar: Retrieves OA PDF links and cross-platform external identifiers.
3. arXiv: Activated when an arXiv identifier is available for the target paper.
4. PubMed Central (PMC) OA Subset: Activated when a PMCID is associated with the paper.
5. bioRxiv / medRxiv: Triggered for preprints with the DOI prefix 10.1101/.
6. Publisher Direct Links: Enabled only under institutional mode (PAPER_FETCH_INSTITUTIONAL=1), authorized via the caller's institutional subscription IP, cookies, or EZproxy access.
7. Sci-Hub Mirror Fallback: Enabled by default as the final retrieval backup.
  - Mirrors are attempted in the order specified by the environment variable PAPER_FETCH_SCIHUB_MIRRORS (default list: sci-hub.ru, sci-hub.st, sci-hub.su, sci-hub.box, sci-hub.red, sci-hub.al, sci-hub.mk, sci-hub.ee).
  - If all predefined mirrors fail, the module fetches the latest live mirror list from https://www.sci-hub.pub/ and retries.
  - Set PAPER_FETCH_NO_SCIHUB=1 to disable Sci-Hub retrieval.
If all sources fail, metadata is returned with a recommendation for interlibrary loan (ILL) acquisition.
⚠️ Before using the paper-fetch module, configure your Unpaywall contact email via an environment variable:
export UNPAYWALL_EMAIL=you@example.com
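The priority sequence and its environment switches can be pictured as an ordered fallback chain. The following is a simplified sketch, not the module's actual code: the source labels and resolver callables are hypothetical stand-ins, and skipping Unpaywall when UNPAYWALL_EMAIL is unset is an assumption drawn from the "Failure / Skip" step in the diagram.

```python
import os


def source_order(env=None):
    """Return the sources to try, in priority order, for a given environment."""
    env = os.environ if env is None else env
    order = []
    if env.get("UNPAYWALL_EMAIL"):  # assumption: skipped when no email is set
        order.append("unpaywall")
    order += ["semantic_scholar", "arxiv", "pmc", "biorxiv_medrxiv"]
    if env.get("PAPER_FETCH_INSTITUTIONAL") == "1":
        order.append("publisher_direct")
    if env.get("PAPER_FETCH_NO_SCIHUB") != "1":
        order.append("scihub")
    return order


def resolve(doi, resolvers, env=None):
    """Try each available resolver in order; return (source, url) or (None, None)."""
    for name in source_order(env):
        resolver = resolvers.get(name)
        if resolver is None:
            continue
        url = resolver(doi)
        if url:
            return name, url
    return None, None


if __name__ == "__main__":
    # Stub resolvers: each takes a DOI and returns a PDF URL or None.
    stubs = {
        "unpaywall": lambda doi: None,
        "arxiv": lambda doi: "https://example.org/paper.pdf",
    }
    print(resolve("10.1000/demo", stubs, {"UNPAYWALL_EMAIL": "you@example.com"}))
```

Modelling each source as a callable that returns either a URL or None keeps the chain easy to extend, and the `(None, None)` result corresponds to the case where only metadata and an ILL recommendation remain.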
Unlike PMC parsing, non-PubMed papers can be obtained only as PDF files via the paper-fetch module.
We recommend standardizing all paper information into Markdown or JSON.
Because later steps require paragraph segmentation and information extraction, JSON is the most suitable intermediate format for programmatic processing.
We provide a pdf-parse module that parses input PDFs into preliminary Markdown and JSON files using MinerU.
Refer to the official MinerU documentation for details. Since typical users lack sufficient GPU resources for acceleration, we use the basic parsing mode (pipeline backend) by default.
❯ paperflow pdf-parse --help
Usage: paperflow pdf-parse [OPTIONS]
Parse a PDF file using MinerU engine, and clean up the output directory.
Notes:
- 1, MinerU generates a subfolder /auto under --output with .md, .json, .pdf, and images/. Use --clear to strip anything unnecessary,
note that we only use .md files and _content_list_v2.json/_content_list.json files for further processing like structuring.
- 2, ⚠️ Remember to switch to domestic mirror source when you can not access huggingface.
Example usage:
paperflow pdf-parse -i paper.pdf -o ./output
╭─ Options ───────────────────────────────────────────────────────────────────────────╮
│ * --input   -i  TEXT  Input PDF file path. [required]                               │
│ * --output  -o  TEXT  Output directory for parsed output. [required]                │
│   --clear             After conversion, keep only the .md files and necessary .json │
│                       files(_content_list_v2.json/_content_list.json).              │
│   --help              Show this message and exit.                                   │
╰─────────────────────────────────────────────────────────────────────────────────────╯
Regarding the PDF paper retrieval module, we also provide a suite of reference scripts, which can be integrated into existing skills or used independently: Paper pdf fetch
4. Literature Content Extraction and Structured Processing
In the preceding stage, we acquired metadata and textual content of academic papers:
- For PubMed papers: Metadata is retrieved, and full-text content is downloaded from the PubMed Central (PMC) database (when available), then parsed into Markdown and JSON formats.
- For non-PubMed papers: PDF files are obtained via Digital Object Identifiers (DOIs) and parsed using the MinerU parsing engine, with outputs standardised to Markdown and JSON formats.
The Markdown files from both sources work as full-text substitutes for direct reading, but they do not support section-level extraction or standardised processing.
By contrast, the JSON files retain raw parsed outputs with intricate structures: they contain the complete text and positional metadata, but are not standardised enough for direct downstream use.
This stage parses the raw JSON files and classifies their text segments to produce standardised, section-organised JSON outputs.
Specifically, content is extracted and partitioned into the canonical academic sections listed below (section delineation is configurable, with minor variations):
metadata (title, year, authors)
abstract
introduction
results
discussion
methods
conclusion
supplementary
availability
funding
acknowledgements
author contributions
references
other
Our objective is to segment papers into fixed canonical sections that match both the structural conventions of individual publications and the core downstream analytical needs of researchers. This standardised partitioning lets scholars review and use literature knowledge within a consistent framework.
For PubMed papers, the text comes from the PMC database, so the parsing workflow starts from the JSON outputs generated from PMC responses.
To preserve complete data provenance (not all PubMed papers have PMC full-text access), we implement two modular components for structured extraction and representation of PubMed literature.
First, metadata and textual content (where PMC full text exists) are merged into a single JSON file containing the complete paper information:
❯ paperflow pubmed-merge-json --help
Usage: paperflow pubmed-merge-json [OPTIONS]
Create a merged JSON (or JSONL) file from PubMed paper directories.
This produces a canonical merged JSON representation per paper and is
intended as the first stage in a two-stage pipeline (merge-json -> export-md).
Example usage:
- 1. Merge JSON files for all papers in a directory:
paperflow pubmed-merge-json --input ./MyPapers --output ./MyPapers
- 2. Merge JSON files for PMIDs listed in a file:
paperflow pubmed-merge-json --input ./MyPapers --output ./MyPapers --pmid-file ./pmid_list.txt --jsonl
--stats-path ./MyPapers/stats
Options:
  * --input       -i  TEXT  Directory containing paper data
                            ({INPUT_PAPER_DIR_HERE}/pubmed/year/pmid/structure). [required]
  * --output      -o  TEXT  Output directory or file path. If a directory or path without
                            extension is given, the merged file is auto-named as
                            <input-directory-base-name>_<datetime>.json/.jsonl. [required]
    --pmid-file   -p  TEXT  File containing PMIDs to merge (one per line).
    --jsonl                 Write output as JSONL, one JSON per line.
    --stats-path  -s  TEXT  Optional path to save merge statistics file, defaults to current
                            directory. [default: .]
    --help                  Show this message and exit.
Designed primarily for batch-processing workflows that extract content in bulk from standardised section schemas, this module also supports single-paper processing via file-level specification.
By default, each PubMed paper in the input directory is merged independently, and the JSON files for the specified PMID list are then aggregated into a single consolidated JSON file. This workflow is particularly suited to compiling papers on one research topic into a preliminary literature knowledge base.
This aggregated JSON file serves as the input for subsequent structured classification and extraction:
❯ paperflow pubmed-export-md --help
Usage: paperflow pubmed-export-md [OPTIONS]
Export a single Markdown view from a merged JSON file using optional YAML config.
Notes:
- 1, The input merged JSON/JSONL should be produced by the pubmed-merge-json command, which
creates a canonical representation of paper metadata and content.
- 2, The optional YAML config can specify which metadata fields and content sections to include
in the Markdown output. If not provided, it defaults to including basic metadata and the FULL
content.
Example usage:
- 1. Export Markdown for all papers in a merged JSON:
paperflow pubmed-export-md --input ./MyPapers/merged.jsonl --output ./MyPapers/exported.md
--config ./config.yaml
- 2. Export Markdown for PMIDs listed in a file:
paperflow pubmed-export-md --input ./MyPapers/merged.jsonl --output ./MyPapers/exported.md
--config ./config.yaml --pmid-file ./pmid_list.txt
Options:
  * --input      -i  TEXT  Path to merged JSON or JSONL produced by pubmed-merge-json.
                           [required]
  * --output     -o  TEXT  Output Markdown file path. [required]
    --config     -c  TEXT  YAML config file specifying metadata_fields and
                           content_sections. If not provided, defaults to basic metadata
                           and FULL content.
    --pmid-file  -p  TEXT  Optional PMID file to filter exported papers.
    --help                 Show this message and exit.
Metadata key-value pairs for each paper follow a fixed schema:
content
  abstract  # abstract text (important)
  keywords  # keywords (important)
  mesh_terms  # MeSH terms (important)
  pub_types  # article or review, can be used for filtering (important)
contributors
  medline  # contributors parsed from MEDLINE format, MIXED PERSONS PER DICT, LESS DETAILED
    affiliations  # affiliations of contributors
    auids  # ORCID
    full_names  # full names of contributors
    short_names  # short names of contributors (important for citation)
  xml  # contributors parsed from XML format, ONE PERSON PER DICT, MORE DETAILED
    affiliations  # same as above
    full_name
    identifiers
    short_name
identity
  doi  # DOI of the paper (important), can be used by the DOI-based fetching module
  pmid  # PubMed ID (important)
  title  # title of the paper (important)
links
  cites  # papers that cite this paper (important)
  entrez  # other Entrez links
  external  # other external database links, ONE LINK PER DICT, MORE DETAILED (⚠️ may include a full-text source)
    attribute
    category
    linkname
    provider
    url  # URL of the external database link (important)
  pmc  # PMC ID used to download full text (important)
  refs  # (pmid) papers cited by this paper (important)
  review  # (pmid) review articles highly relevant to the theme of this paper (important)
  similar  # (pmid) topic-similar papers (important)
  text_mined  # links mined from PMC full text, if available (important; may include GitHub links or other sources)
metadata
  entrez_date  # date when the paper was added to PubMed
  fetched_at  # date when the paper was fetched by our tool
source
  journal_abbrev  # abbreviation of the journal
  journal_title  # full name of the journal
  pub_date  # publication date
  pub_types  # publication types, similar to pub_types in content above
  pub_year  # publication year (important for citation)
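As an illustration of how this schema can be consumed, the sketch below filters a merged JSONL file for review articles. It assumes that each line holds one paper record shaped like the schema above and that content.pub_types is a list of strings. The helper names iter_records and is_review and the sample records are hypothetical and not part of paperflow.

```python
import json
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one paper record per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def is_review(record: dict) -> bool:
    """True when any publication type mentions 'review' (case-insensitive)."""
    pub_types = record.get("content", {}).get("pub_types") or []
    return any("review" in str(t).lower() for t in pub_types)


# Hypothetical two-paper sample shaped like the schema above (assumption).
sample_lines = [
    json.dumps({"identity": {"pmid": "1"}, "content": {"pub_types": ["Journal Article"]}}),
    json.dumps({"identity": {"pmid": "2"}, "content": {"pub_types": ["Journal Article", "Review"]}}),
]
review_pmids = [r["identity"]["pmid"] for r in iter_records(sample_lines) if is_review(r)]
print(review_pmids)
```

With a real file, the same generator can be fed an open file handle instead of the in-memory list.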
Semantic segmentation and classification are applied exclusively to textual content.
Within the batch-export module pubmed-export-md, the -c parameter accepts a YAML configuration file for section extraction (pubmed export yaml), enabling bulk extraction of designated sections, for instance batch retrieval of introduction sections for background research.
⚠️ Keys within this YAML file are fixed; users may only comment out specific keys to extract targeted sections, or retain the default settings to export all sections.
metadata_fields:
- identity.title
- identity.pmid
- identity.doi
- content.keywords
- content.mesh_terms
- content.pub_types
- content.abstract # abstract in metadata first, fall back in content sections(deprecated)
- contributors.medline
- contributors.xml
- links.cites
- links.entrez
- links.external
- links.pmc
- links.refs
- links.review
- links.similar
- links.text_mined
- metadata.entrez_date
- metadata.fetched_at
- source.journal_abbrev
- source.journal_title
- source.pub_date
- source.pub_types
- source.pub_year
content_sections:
- abstract
- introduction
- methods
- results
- discussion
- conclusion
- supplementary
- availability
- funding
- acknowledgements
- author_contributions
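The dotted keys under metadata_fields read naturally as paths into the merged record. A minimal sketch of such a lookup follows, assuming nested dictionaries as in the schema above. get_path and select_metadata are illustrative names and the sample record is invented, so this is not necessarily how pubmed-export-md resolves fields internally.

```python
from typing import Any


def get_path(record: dict, dotted: str, default: Any = None) -> Any:
    """Walk a nested dict along a dotted key such as 'identity.pmid'."""
    node: Any = record
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


def select_metadata(record: dict, fields: list[str]) -> dict:
    """Keep only the configured fields that are present in the record."""
    selected = {}
    for field in fields:
        value = get_path(record, field)
        if value is not None:
            selected[field] = value
    return selected


# Invented record and field list for demonstration (assumption).
record = {
    "identity": {"pmid": "12345", "title": "An example paper"},
    "source": {"pub_year": 2024},
}
fields = ["identity.title", "identity.doi", "source.pub_year"]
print(select_metadata(record, fields))
```

Commenting out a key in the YAML file corresponds to dropping it from the fields list here.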
The core parsing logic is illustrated below:
flowchart TD
A[Initiate Markdown Export] --> B{YAML Config Provided?}
B -- Yes --> C[Load yaml_cfg]
C --> D[Parse metadata_fields / content_sections]
D --> E[Write paper-level title and metadata]
E --> F[Extract section tree from content.body]
F --> G[_extract_section_records: raw sections → structured records]
G --> H[_normalize_section_title: map to canonical_type]
H --> I[_order_section_records: sort per content_sections]
I --> J[_aggregate_section_records: merge identical canonical_type entries]
J --> K{canonical_type in content_sections?}
K -- No --> L[Skip section]
K -- Yes --> M[_render_section_records: format as Markdown headings]
M --> N[Insert paper separator]
L --> N
B -- No --> O[Omit section mapping]
O --> P[Write paper-level title and metadata]
P --> Q{content.body Exists?}
Q -- Yes --> R[Recursively expand raw section tree]
R --> S[render_raw_content_tree: output title/content/subsections directly]
Q -- No --> T[Supplement abstract from metadata]
T --> U[Output metadata fields + abstract]
S --> N
U --> N
N --> V[Process Next Paper]
V --> W[Terminate Export]
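The ordering, aggregation, and filtering steps in the configured branch of the flowchart can be sketched roughly as follows. This is a simplified stand-in under stated assumptions, not the project's _order_section_records or _aggregate_section_records: records are reduced to (canonical_type, text) pairs, and the function name aggregate_sections is hypothetical.

```python
def aggregate_sections(records, content_sections):
    """Merge records that share a canonical_type and order them per the config.

    records: list of (canonical_type, text) pairs in document order.
    content_sections: canonical types to keep, in the desired output order.
    """
    merged = {}
    for canonical_type, text in records:
        merged.setdefault(canonical_type, []).append(text)
    # Types missing from content_sections are dropped, like the "Skip section" branch.
    return [(c, "\n\n".join(merged[c])) for c in content_sections if c in merged]


# Invented records: two "results" fragments and one section that is not requested.
records = [("results", "R1"), ("introduction", "I1"), ("results", "R2"), ("funding", "F1")]
aggregated = aggregate_sections(records, ["introduction", "methods", "results"])
print(aggregated)
```

Iterating over content_sections rather than over the records is what makes the output order follow the configuration instead of the paper.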
The above workflow describes structured extraction for PubMed papers. For non-PubMed publications, parsing starts from the preliminary JSON output (content_list_v2.json) generated by the MinerU parsing engine.
The content_list_v2.json file that MinerU generates from a PDF organizes data page by page: the outer array represents all pages, and each element is a list of rendered blocks for that page. These blocks include paper titles, paragraphs, interline equations, images/charts, tables, page headers, footers, and footnotes. Because they are mixed together, they cannot be used directly for downstream semantic analysis or LLM input.
Our goal is to convert this raw JSON into a unified, structured JSON organized by standard sections in the literature domain.
Input JSON structure:
[
[ // page 0
{"type": "title", "content": {"title_content": [...], "level": 1}},
{"type": "paragraph", "content": {"paragraph_content": [...]}},
{"type": "title", "content": {"title_content": [...], "level": 2}},
{"type": "paragraph", "content": {"paragraph_content": [...]}},
{"type": "page_header", ...}, // noise
{"type": "page_footnote", ...}, // noise
...
],
[ // page 1
...
]
]
Common block types (categorized by content values):
| Type | Is Main Content | Text Extraction Path |
|---|---|---|
| title | Yes (Section Anchor) | content.title_content[*].content + level (1 = Paper Title, 2 = Primary Section) |
| paragraph | Yes (Main Text) | content.paragraph_content[*].content, supports equation_inline sub-items |
| equation_interline | Yes (Interline Equation) | content.math_content (LaTeX) |
| table | Partial | content.html (HTML Table) + content.table_caption |
| image / chart | No (Caption Preserved) | content.image_caption[*].content / content.chart_caption |
| page_header / page_footer / page_footnote | Noise (Discarded) | Used for metadata scanning (year/DOI/journal name) |
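To make the extraction paths in the table concrete, here is a minimal sketch that drops the noise blocks and pulls plain text out of title, paragraph, and interline-equation blocks. It assumes every item in title_content and paragraph_content is a dict with a string under "content", as the paths above suggest. The function names and the sample pages are hypothetical, and tables, images, and charts are left out on purpose.

```python
NOISE_TYPES = {"page_header", "page_footer", "page_footnote"}


def flatten(pages):
    """Concatenate the per-page block lists and drop the noise block types."""
    return [b for page in pages for b in page if b.get("type") not in NOISE_TYPES]


def block_text(block):
    """Return the plain text of a title, paragraph, or interline-equation block."""
    content = block.get("content", {})
    kind = block.get("type")
    if kind == "equation_interline":
        return content.get("math_content", "")
    if kind == "title":
        parts = content.get("title_content", [])
    elif kind == "paragraph":
        parts = content.get("paragraph_content", [])
    else:
        return ""  # tables, images and charts need their own handling
    return "".join(p.get("content", "") for p in parts)


# Hypothetical two-page input following the structure shown above.
pages = [
    [
        {"type": "page_header", "content": {}},
        {"type": "title", "content": {"title_content": [{"content": "2. Results"}], "level": 2}},
    ],
    [
        {"type": "paragraph",
         "content": {"paragraph_content": [{"content": "We observe "}, {"content": "x > 0."}]}},
        {"type": "page_footnote", "content": {}},
    ],
]
texts = [block_text(b) for b in flatten(pages)]
print(texts)
```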
Our parsing pipeline is as follows:
content_list_v2.json
  │
  ├─ Step 1: Flattening
  │    _flatten() → Remove noise blocks (page_header/footer/footnote)
  │    Preserve title / paragraph / table, etc.
  │
  ├─ Step 2: Metadata Extraction
  │    ├─ title   → First level=1 title block
  │    ├─ authors → First short line after title (contains commas, <400 characters)
  │    ├─ year    → Extract "2025" from page_footer
  │    ├─ doi     → Match "10.1002/..." from page_footnote
  │    └─ journal → Select all-uppercase short name from page_header
  │
  ├─ Step 3: Abstract Extraction
  │    _extract_abstract()
  │    Skip author lines → Collect all paragraphs before the first section
  │
  ├─ Step 4: Section Segmentation
  │    Split paragraphs by title blocks:
  │      level=1 → Skip (Paper Title)
  │      level=2 → New Primary Section
  │      level>=3 or numbered "2.1." → Subsection, grouped under parent section
  │
  ├─ Step 5: Title Normalization
  │    normalize_section_title()
  │    Remove numeric prefixes: "2.2. IDPFold..." → "IDPFold..."
  │    Match CANONICAL_TYPES table → "results"
  │
  ├─ Step 6: Section Aggregation
  │    _aggregate_sections()
  │    Merge content with the same canonical_type
  │    Preserve subsections list
  │
  ├─ Step 7: Table Extraction
  │    _extract_tables()
  │    Collect html + caption of all table blocks
  │
  ▼
Structured Output JSON
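Steps 4 and 5 of the pipeline can be approximated with the sketch below. It assumes the blocks have already been flattened into (kind, level, text) tuples, which is a simplification introduced here and not MinerU's format. The names segment and strip_number and both regular expressions are illustrative guesses, not the rules shipped with mineru-parse.

```python
import re

# Leading numbering such as "2." or "2.2." followed by whitespace (assumed pattern).
NUM_PREFIX = re.compile(r"^\s*\d+(?:\.\d+)*\.?\s+")
# Numbering with at least two levels, e.g. "2.1.", treated as a subsection marker.
SUB_PREFIX = re.compile(r"^\s*\d+\.\d+")


def strip_number(title):
    """Remove a numeric prefix: '2.2. IDPFold...' becomes 'IDPFold...'."""
    return NUM_PREFIX.sub("", title).strip()


def segment(blocks):
    """Group paragraphs under sections and subsections based on title blocks."""
    sections = []
    current = None  # the section or subsection currently receiving paragraphs
    for kind, level, text in blocks:
        if kind == "title":
            if level == 1:
                continue  # paper title, not a section
            is_sub = level >= 3 or bool(SUB_PREFIX.match(text))
            if is_sub and sections:
                current = {"raw_title": text, "paragraphs": []}
                sections[-1]["subsections"].append(current)
            else:
                current = {"raw_title": text, "display_title": strip_number(text),
                           "paragraphs": [], "subsections": []}
                sections.append(current)
        elif current is not None:
            current["paragraphs"].append(text)
        # Paragraphs before the first section are left to abstract extraction (Step 3).
    return sections


blocks = [
    ("title", 1, "Some Paper Title"), ("paragraph", 0, "abstract text"),
    ("title", 2, "2. Results"), ("paragraph", 0, "p1"),
    ("title", 2, "2.1. Global Features"), ("paragraph", 0, "p2"),
    ("title", 2, "3. Discussion"), ("paragraph", 0, "p3"),
]
sections = segment(blocks)
print([s["display_title"] for s in sections])
```

Note that "2.1. Global Features" is attached to its parent even though its level is 2, because the numbering check takes precedence in this sketch.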
This JSON schema is more complex and less straightforward to parse than PMC-derived JSON files.
Analogous to the PubMed processing pipeline, two sequential modules are deployed for structured extraction of non-PubMed JSON outputs.
The combination mineru-parse + mineru-export-md serves as an enhanced counterpart to pubmed-merge-json + pubmed-export-md.
❯ paperflow mineru-parse --help
Usage: paperflow mineru-parse [OPTIONS]
Parse mineru output content_list_v2.json into canonical sectioned JSON.
Extracts metadata (title, authors, year, DOI, journal),
and sections normalised to canonical types (abstract, introduction, results,
discussion, methods, etc.). Tables are preserved as HTML.
Notes:
- 1, Two backends: 'regex' (pattern + context, no API) and 'ai' (LLM batch classification).
- 2, AI backend supports Anthropic native, OpenAI native, and any OpenAI-compatible
endpoint via --base-url (DeepSeek, university proxies, self-hosted, etc.).
- 3, Set the appropriate API key env var (ANTHROPIC_API_KEY, OPENAI_API_KEY,
DEEPSEEK_API_KEY) or pass --api-key.
- 4, Configure provider/model via --model, --base-url, or a YAML config file.
Examples:
paperflow mineru-parse -i content_list_v2.json -o paper.json
paperflow mineru-parse -i content_list_v2.json -o paper.json --backend ai
paperflow mineru-parse -i content_list_v2.json -o paper.json --backend ai \
--base-url https://api.deepseek.com --model deepseek-v4-pro --api-key sk-xxx
paperflow mineru-parse -i content_list_v2.json -o paper.json --backend ai \
--base-url https://models.sjtu.edu.cn/api/v1 --model deepseek-chat
paperflow mineru-parse -i content_list_v2.json -o paper.json --backend regex --config custom.yaml
Options:
  * --input     -i  TEXT  Path to mineru content_list_v2.json. [required]
  * --output    -o  TEXT  Output path for the structured JSON file. [required]
    --backend   -b  TEXT  Section classification backend: 'regex' (default, no API needed) or 'ai'.
                          [default: regex]
    --config    -c  TEXT  Path to YAML config file for canonical types, aliases, and AI settings.
    --api-key       TEXT  API key for AI backend. Overrides config file and env var.
    --model         TEXT  Override AI model (e.g. 'deepseek-v4-pro', 'claude-haiku-4-5', 'gpt-4o-mini').
    --base-url      TEXT  Custom API base URL for OpenAI-compatible endpoints (e.g.
                          'https://api.deepseek.com').
    --help                Show this message and exit.
mineru-parse transforms flat JSON outputs from MinerU into standardised structured JSON, classifying each segment into canonical academic sections while extracting metadata (title, authors, year, DOI, journal) and figure captions.
Two backends are provided for textual segment parsing:
Two Backends
| Backend | How it works | API needed? | Best for |
|---|---|---|---|
| regex (default) | Pattern matching: exact string → regex → context keyword. Configurable via YAML. | No | Common papers, batch processing |
| ai | Sends all section titles + context to an LLM in one batch API call. | Yes | Non-standard titles, multi-publisher |
1. Regex matching layers:
1. strong (exact match) → "Introduction" == "introduction" ✓
2. weak (regex search) → "1. Introduction" matches r"introduction" ✓
3. context_keywords → "Overview" → check text for "we used..." → methods
4. fallback → classify as "other"
A sliding positional pointer tracks document sequence to minimise misclassification: subsequent section matching initiates from the endpoint of the preceding matched segment rather than the document start.
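The four layers and the sliding pointer can be sketched as below. The ALIASES table is a tiny invented stand-in for the rules in mineru_config.yaml, and the 200-character context window, the function names, and the way the pointer is advanced are assumptions made for illustration only.

```python
import re

# Invented miniature rule table; the real rules live in mineru_config.yaml.
ALIASES = {
    "introduction": {"strong": {"introduction", "background"},
                     "weak": [r"introduction"], "context": []},
    "methods": {"strong": {"methods", "materials and methods"},
                "weak": [r"\bmethods?\b"], "context": ["we used", "were performed"]},
}


def classify(title, context=""):
    """Apply the layers in order: strong, weak, context keywords, fallback."""
    key = title.strip().lower()
    for ctype, rules in ALIASES.items():
        if key in rules["strong"]:
            return ctype
    for ctype, rules in ALIASES.items():
        if any(re.search(pattern, key) for pattern in rules["weak"]):
            return ctype
    lowered = context.lower()
    for ctype, rules in ALIASES.items():
        if any(keyword in lowered for keyword in rules["context"]):
            return ctype
    return "other"


def classify_document(text, titles, window=200):
    """Classify titles in order, searching for each one only after the previous match."""
    pointer = 0
    results = []
    for title in titles:
        start = text.find(title, pointer)  # never searches before the pointer
        context = ""
        if start != -1:
            pointer = start + len(title)
            context = text[pointer:pointer + window]
        results.append((title, classify(title, context)))
    return results


text = "Introduction\nProteins matter.\nOverview\nWe used PyTorch.\nMisc\nNothing here."
results = classify_document(text, ["Introduction", "Overview", "Misc"])
print(results)
```

Because the search for each title starts at the pointer, a word such as "Overview" that also appears earlier in the running text would not be mistaken for the later heading.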
2. AI workflow:
content_list_v2.json
→ extract all titles + surrounding text (~200 chars)
→ build JSON payload: [{index, title, context_preview}, ...]
→ one API call → AI returns {classifications: [{index, canonical_type}]}
→ merge classifications into structured JSON
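The payload construction and merge steps of this workflow can be sketched without any network access. The LLM call itself is replaced by a hand-written response, and build_payload, merge_classifications, and the fallback to "other" for unknown labels are assumptions for illustration and not the tool's actual code.

```python
def build_payload(sections, preview_chars=200):
    """One entry per section: its index, raw title, and a short text preview."""
    return [
        {
            "index": i,
            "title": section["raw_title"],
            "context_preview": " ".join(section.get("paragraphs", []))[:preview_chars],
        }
        for i, section in enumerate(sections)
    ]


def merge_classifications(sections, response, allowed_types):
    """Attach the returned canonical_type to each section, guarding against bad labels."""
    by_index = {c["index"]: c["canonical_type"] for c in response.get("classifications", [])}
    merged = []
    for i, section in enumerate(sections):
        ctype = by_index.get(i, "other")
        if ctype not in allowed_types:
            ctype = "other"  # unknown or missing labels fall back to "other"
        merged.append({**section, "canonical_type": ctype})
    return merged


sections = [
    {"raw_title": "Overview", "paragraphs": ["We used X."]},
    {"raw_title": "Odds and Ends", "paragraphs": []},
]
payload = build_payload(sections)
# Stand-in for the single batch API call (no request is made in this sketch).
fake_response = {"classifications": [
    {"index": 0, "canonical_type": "methods"},
    {"index": 1, "canonical_type": "made_up_type"},
]}
merged = merge_classifications(sections, fake_response, {"methods", "other"})
print([s["canonical_type"] for s in merged])
```

Validating the returned labels against the allowed set keeps a misbehaving model from introducing section types that the export step would not recognise.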
⚠️ The regex backend is enabled by default; the AI backend is under active development. For the -c parameter of mineru-parse, refer to the provided template configuration file (mineru config file). The default settings suffice for general use without modification. The same configuration file serves both the regex and AI backends, and documentation and revision guidelines are embedded in it.
All matching rules are encapsulated within mineru_config.yaml, with sensible defaults preconfigured. Modifications are only required for journal-specific adaptation.
Users may globally customise section categorisation and individually classify arbitrary textual segments according to personal reading and downstream analytical requirements.
This enables highly personalised section parsing: in principle, custom section schemas and parsing logic can be tailored to any paper type.
Config file layout:
| Section | Purpose |
|---|---|
| ai | model, api_key, base_url for AI backend |
| canonical_order | Which types exist + their output order |
| display_names | Human-readable labels (can be Chinese, etc.) |
| aliases | Matching rules: strong (exact), weak (regex), context_keywords |
Common customization scenarios:
| Scenario | Where to edit |
|---|---|
| Title misclassified as "other" | Add to matching type's strong or weak |
| Need a new section type | Add to canonical_order + display_names + aliases |
| Switch AI model | Edit ai.model and ai.base_url |
| Chinese labels | Edit display_names |
A representative structured JSON output is provided below:
{
"source": "mineru",
"file": "paper_content_list_v2.json",
"backend": "regex",
"metadata": {
"title": "Accurate Generation of Conformational Ensembles...",
"authors": "Junjie Zhu, Zhengxin Li, ...",
"year": 2025,
"doi": "10.1002/advs.202511636",
"journal": "Advanced Science"
},
"sections": [
{
"canonical_type": "abstract",
"raw_title": "Abstract",
"display_title": "Abstract",
"level": 2,
"paragraphs": ["In this paper, we..."],
"subsections": []
},
{
"canonical_type": "introduction",
"raw_title": "1. Introduction",
"display_title": "Introduction",
"paragraphs": ["...", "[Figure: Figure 1. Architecture overview...]"],
"subsections": []
},
{
"canonical_type": "results",
"raw_title": "2. Results",
"display_title": "Results",
"subsections": [
{"raw_title": "2.1. Global Features", "paragraphs": ["..."]}
]
}
]
}
Approximately 15 standard section types are supported, consistent with conventional academic paper structure:
abstract introduction results discussion methods conclusion supplementary availability funding acknowledgements author_contributions keywords conflicts references other
Following generation of structured JSON files, targeted bulk section export can be performed on demand.
Functionally, the pubmed-export-md module for PubMed papers integrates the capabilities of mineru-parse and mineru-export-md.
❯ paperflow mineru-export-md --help
Usage: paperflow mineru-export-md [OPTIONS]
Export structured mineru JSON to a clean Markdown file for LLM processing.
Reads one or more JSON files produced by ``mineru-parse`` and writes a single Markdown file.
Metadata (title, authors, year, DOI, journal) is always included.
Content sections are included based on the optional YAML config.
YAML config format:
content_sections:
- abstract
- introduction
- methods
- results
- discussion
- conclusion
Examples:
paperflow mineru-export-md -i paper.json -o paper.md
paperflow mineru-export-md -i paper.json -o paper.md --config extract.yaml
paperflow mineru-export-md -i ./parsed_dir -o all_papers.md
Options:
  * --input   -i  TEXT  Path to structured JSON file (from mineru-parse), or a
                        directory of such files. [required]
  * --output  -o  TEXT  Output Markdown file path. [required]
    --config  -c  TEXT  YAML config specifying content_sections to include. If not
                        provided, all sections are included.
    --help              Show this message and exit.
Similarly, the -c parameter of mineru-export-md accepts a dedicated YAML configuration file (mineru export config file) for bulk section export, with embedded documentation and revision guidelines. ⚠️ Section types defined in this export configuration file must be pre-declared in canonical_order within mineru_config.yaml. Custom section types (e.g., ethics) defined during parsing can only be used in the export phase if they were registered upstream. In short, the mineru export config file and the mineru config file must be mutually consistent.
mineru_config.yaml                      mineru_export_config.yaml
canonical_order:                        content_sections:
  - abstract      <- defines the          - abstract
  - introduction     type pool            - introduction
  - results                               - results
  - ...                                   - discussion
  - ethics        <- custom type          - methods
                                          - ethics   <- references the custom type
For instance, if ethics is added to canonical_order with corresponding aliases in mineru_config.yaml, the heading "Ethics Statement" within papers will be classified under the ethics section type during parsing. This type may then be selected in the export configuration file to extract relevant content. Unregistered section types cannot be recognised in the export phase.
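A consistency check between the two files can be as small as the sketch below. It assumes both YAML files have already been loaded into plain Python lists. The function name unknown_sections and the sample values are hypothetical, and paperflow may or may not perform such a check itself.

```python
def unknown_sections(canonical_order, content_sections):
    """Return export sections that were never declared in canonical_order."""
    known = set(canonical_order)
    return [name for name in content_sections if name not in known]


# Invented values standing in for the two loaded YAML files (assumption).
canonical_order = ["abstract", "introduction", "results", "ethics"]
export_sections = ["abstract", "ethics", "limitations"]

missing = unknown_sections(canonical_order, export_sections)
if missing:
    print("Not declared in canonical_order:", ", ".join(missing))
```

Running a check like this before a large batch export surfaces typos in the export config early, instead of silently producing empty sections.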
Engineered for batch-processing workflows, mineru-export-md scans all .json files within a specified non-PubMed paper directory (store only mineru-parse outputs in an isolated directory to avoid picking up extraneous JSON files). Files are sorted by name, individual papers are separated by ---, and everything is consolidated into a single merged Markdown file.
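The sort-and-join behaviour described here can be sketched as follows, with papers passed in as an in-memory mapping from file name to parsed JSON so that the example stays self-contained. The heading levels, the helper names render_paper and export_all, and the sample papers are assumptions; the Markdown that mineru-export-md actually writes may differ.

```python
def render_paper(paper, wanted=None):
    """Render one structured paper; `wanted` optionally restricts canonical types."""
    meta = paper.get("metadata", {})
    lines = [f"# {meta.get('title', 'Untitled')}"]
    for section in paper.get("sections", []):
        if wanted is not None and section.get("canonical_type") not in wanted:
            continue
        lines.append(f"## {section.get('display_title') or section.get('raw_title', '')}")
        lines.extend(section.get("paragraphs", []))
        for sub in section.get("subsections", []):
            lines.append(f"### {sub.get('raw_title', '')}")
            lines.extend(sub.get("paragraphs", []))
    return "\n\n".join(lines)


def export_all(papers_by_name, wanted=None):
    """Sort papers by file name and join them with a '---' separator."""
    ordered = [papers_by_name[name] for name in sorted(papers_by_name)]
    return "\n\n---\n\n".join(render_paper(p, wanted) for p in ordered)


# Invented papers keyed by file name (assumption).
papers = {
    "b.json": {"metadata": {"title": "B"}, "sections": [
        {"canonical_type": "abstract", "display_title": "Abstract", "paragraphs": ["b-abs"]}]},
    "a.json": {"metadata": {"title": "A"}, "sections": [
        {"canonical_type": "abstract", "display_title": "Abstract", "paragraphs": ["a-abs"]},
        {"canonical_type": "methods", "display_title": "Methods", "paragraphs": ["a-met"]}]},
}
markdown = export_all(papers, wanted={"abstract"})
print(markdown)
```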
5. Processing for Other Literature Databases
The preceding Steps 1-4 are illustrated using PubMed as a representative literature database. The same processing logic applies to other academic platforms, such as arXiv, bioRxiv, medRxiv, chemRxiv, and more.
In theory, all DOI-driven literature workflows can be standardised following the pipeline described above:
Retrieve PDF via DOI → Preliminary PDF Parsing → Content Extraction and Structured Processing
Modules dedicated to the aforementioned preprint platforms are still under development and refinement. Preprint-related subcommands are provided for testing purposes only. For detailed test cases, refer to Cases.
6. Critical Reading and Knowledge Graph Analysis: Downstream End-Use
After completing literature retrieval, parsing, and structured processing as outlined above, users obtain chapter-organised Markdown files and structured JSON files, which serve as the fundamental inputs for subsequent critical reading and knowledge graph analysis.
Whether parsing individual cutting-edge papers as they appear or batch-processing literature for thematic research, Markdown files form the unified starting point. State-of-the-art (SOTA) text-processing and logical-analysis models can then assist with knowledge graph construction or straightforward real-time literature reading.
Although literature reading is the most subjective downstream task, it can still be turned into quantifiable, repeatable workflows. Highly customised reading skills are commonly adopted to facilitate paper analysis. Relevant references are provided at paper reading skill.
Test Cases
We provide a set of test cases in Test Documentation, covering multiple types of literature data including PubMed, arXiv, bioRxiv, and more.
It also contains highly detailed step-by-step execution logs of script workflows arranged in the logical order of literature research.
You may directly run the test scripts to verify the correctness and completeness of all functionalities.
By combining the aforementioned usage instructions with these test cases, users can quickly get started with our tool.
Future Maintenance & To-Do List
1. Starting Point for Research
- [ ] Extend the BrainStorm skill and explore programmable integration of background prior knowledge.
2. Literature Search (and Metadata Scraping)
- [ ] Supplement query syntax for various literature databases and implement skill-based support. Currently only partial MeSH-aware syntax priors for PubMed are integrated.
- [ ] Maintain and update the BioPython library (E-utilities API) for PubMed parsing from this stage onward. Current version: BioPython 1.87; see biopython Repository for details.
3. Literature Acquisition (and Full-Text Download)
- [ ] Refine and encapsulate the paper-fetch module. Refer to 2026-05-08 paper-fetch Encapsulation; evaluate integration or replacement with more robust modules offering higher hit rates.
- [ ] The pdf-parse module currently wraps basic MinerU parsing commands with the CPU backend (-b pipeline). GPU-accelerated features are planned for future integration; see MinerU Repository for details.
4. Literature Content Extraction and Structured Processing
- [ ] Improve JSON-structured parsing of PMC plain-text content within the pubmed-export-md module: enhance semantic boundary validation by expanding regular-expression matching ranges, or introduce an AI backend analogous to the mineru-export-md module.
- [ ] The mineru-parse module parses content_list_v2.json. Official documentation indicates this output format is still evolving; ongoing tracking and maintenance are required. See MinerU Output File Documentation.
- [ ] Enhance semantic boundary validation for the regex backend of mineru-parse by expanding regular-expression matching ranges.
- [ ] Deepen integration of the AI backend within the mineru-parse module.
- [ ] Optimize coordination between YAML configuration files for the mineru-parse and mineru-export-md modules to achieve efficient mapping.
- [ ] Design a standalone skill for segment extraction and structured processing of raw parsed Markdown content. Current workflows default to JSON files and underutilize Markdown outputs.
5. Processing for Other Literature Databases
- [ ] Develop a unified search-fetch-parse pipeline for non-PubMed databases and complete corresponding modules. Refer to open-source implementations such as paperscraper and paper-tracker.
6. Critical Reading and Knowledge Graph Analysis: Downstream End-Use
- [ ] Develop highly customized skills for in-depth literature analysis, preferably integrated into downstream workflows.
- [ ] Introduce persistent databases to scale and deepen functionality beyond a pure Python-based project.
File details
Details for the file pypaperflow-0.2.0-py3-none-any.whl.
File metadata
- Download URL: pypaperflow-0.2.0-py3-none-any.whl
- Upload date:
- Size: 133.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5b87cf29a443086bb490fa99c515e5e853a819d1b4e5c5d4af02f807c5501067 |
| MD5 | 2be908ca83495fd33827be9d597992a5 |
| BLAKE2b-256 | d5342bb321f13e3417b364a3ca6430cd3c3f9076a77c4d80b8a510d200a29620 |