Document processing and evidence extraction package for precision oncology
PrecisionDoc - Medical Precision Document Processing Tool
This project processes medical guideline PDF files, especially treatment guidelines from CSCO (Chinese Society of Clinical Oncology). It can:
- Process PDF files in a specified folder
- Split PDF files into individual pages
- Analyze each page using AI (OpenAI or Alibaba Cloud Qwen)
- Extract precision medicine evidence related to drug efficacy
- Save analysis results in JSON and Excel formats
- Generate Word reports containing precision medicine evidence
Installation
From Source
- Clone this repository
- Install dependencies:
pip install -r requirements.txt
Using pip
pip install precisiondoc
Configuration
Create a .env file (refer to env.example) and set API keys:
OPENAI_API_KEY=your_openai_api_key
OPENAI_BASE_URL=https://api.openai.com/v1
OPENAI_MODEL=gpt-4
QWEN_API_KEY=your_qwen_api_key
QWEN_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
QWEN_TEXT_MODEL=qwen-max
QWEN_MULTIMODAL_MODEL=qwen-vl-max
LOG_LEVEL=INFO
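Before starting a long run, it can help to confirm which provider keys are actually set. The following is a minimal sketch using only the standard library; the helper name `configured_providers` is hypothetical and not part of PrecisionDoc, and only the key names come from the sample above.

```python
import os

# Key names taken from the sample .env; the mapping itself is illustrative.
PROVIDER_KEYS = {"openai": "OPENAI_API_KEY", "qwen": "QWEN_API_KEY"}


def configured_providers(env=None):
    """Return the providers whose API key is set to a non-empty value."""
    env = os.environ if env is None else env
    return [name for name, key in PROVIDER_KEYS.items() if env.get(key, "").strip()]


# Example with an explicit mapping instead of the real environment.
print(configured_providers({"OPENAI_API_KEY": "sk-test", "QWEN_API_KEY": ""}))
```

Passing a plain dictionary keeps the check easy to try without touching the real environment; calling it with no argument inspects `os.environ`.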
Dependencies
The project requires the following main dependencies:
- PyMuPDF: PDF processing
- openai: OpenAI API client
- pandas and openpyxl: Data processing and Excel file handling
- python-docx: Word document generation
- python-dotenv: Environment variable management
- numpy: Numerical operations
- requests: HTTP requests
- tqdm: Progress bars
All dependencies are listed in requirements.txt.
Usage
Command Line Interface
After installation, you can use the precisiondoc command:
# Process PDF files
precisiondoc process-pdf --folder /path/to/pdfs --output-folder ./output
# Convert Excel to Word
precisiondoc excel-to-word --excel-file /path/to/evidence.xlsx --multi-line --show-borders
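If the command needs to be launched from a script, one option is to build the argument list in Python and hand it to `subprocess`. This is a sketch under assumptions: the helper `build_process_pdf_command` is hypothetical, and it assumes that the `--use-qwen` flag documented under Command Line Parameters applies to the `process-pdf` subcommand.

```python
import shlex


def build_process_pdf_command(folder, output_folder="./output", use_qwen=False):
    """Assemble the argument list for the `precisiondoc process-pdf` command."""
    cmd = ["precisiondoc", "process-pdf", "--folder", folder,
           "--output-folder", output_folder]
    if use_qwen:
        # Assumed to be accepted by process-pdf; see the lead-in.
        cmd.append("--use-qwen")
    return cmd


cmd = build_process_pdf_command("/path/to/pdfs")
print(shlex.join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Using a list rather than a single string avoids shell quoting problems when folder names contain spaces.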
Python API
You can also use PrecisionDoc as a Python package:
# Import the package
from precisiondoc import process_pdf, excel_to_word, process_single_pdf

# Process PDF files
results = process_pdf(
    folder_path="/path/to/pdfs",
    output_folder="./output",
    api_key="your-api-key",  # Optional, will use env var if not provided
    base_url="https://api.example.com/v1",  # Optional
    model="gpt-4",  # Optional
)

# Process a single PDF file
results = process_single_pdf(
    pdf_path="/path/to/document.pdf",
    doc_type="DocumentName",  # Optional, will use filename if not provided
    output_folder="./output",  # Optional
    api_key="your-api-key",  # Optional
    base_url="https://api.example.com/v1",  # Optional
    model="gpt-4",  # Optional
)

# Convert Excel evidence to Word
word_file = excel_to_word(
    excel_file="/path/to/evidence.xlsx",
    word_file="/path/to/output.docx",  # Optional
    multi_line_text=True,  # Optional
    show_borders=True,  # Optional
)
Advanced Usage
For more advanced usage, you can directly use the classes provided by the package:
from precisiondoc import PDFProcessor, WordUtils, DataUtils

# Create a PDF processor
processor = PDFProcessor(
    folder_path="/path/to/pdfs",
    output_folder="./output",
    api_key="your-api-key",
    base_url="https://api.example.com/v1",
    model="gpt-4",
)

# Process all PDFs
results = processor.process_all()

# Save results
processor.save_consolidated_results(results)

# Work with data utilities
data_utils = DataUtils()
df = data_utils.load_excel_file("/path/to/evidence.xlsx")

# Export to Word with custom formatting
WordUtils.export_evidence_to_word(
    excel_file=df,
    word_file="/path/to/output.docx",
    multi_line_text=True,
    show_borders=False,
    exclude_columns=["column1", "column2"],
)
Environment Variables
The package uses the following environment variables:
- API_KEY: API key for AI service
- BASE_URL: Base URL for API endpoint
- TEXT_MODEL: Model name for text processing
- MULTIMODAL_MODEL: Model name for image processing
- LOG_LEVEL: Logging level (default: INFO)
You can set these variables in a .env file or directly in your environment.
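The Python API comments above note that an omitted argument falls back to an environment variable. A minimal sketch of that precedence (explicit argument first, then environment, then a default) might look like this; `resolve_setting` is a hypothetical helper written for illustration, not a function exported by the package.

```python
import os


def resolve_setting(explicit, env_name, default=None, env=None):
    """Pick an explicit value if given, else the environment variable, else a default."""
    env = os.environ if env is None else env
    if explicit is not None:
        return explicit
    return env.get(env_name, default)


# Example with a stand-in environment mapping.
fake_env = {"API_KEY": "from-env"}
print(resolve_setting(None, "API_KEY", env=fake_env))
print(resolve_setting("from-argument", "API_KEY", env=fake_env))
print(resolve_setting(None, "LOG_LEVEL", default="INFO", env=fake_env))
```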
Parameters
Command Line Parameters
- --folder: Path to the folder containing PDF files (required)
- --api-key: API key for OpenAI or Qwen (if not provided, will be read from environment variables)
- --use-qwen: Use Qwen API instead of OpenAI (optional)
- --output-folder: Output folder path (optional, default: "./output")
Excel to Word Parameters
- --excel-file: Path to Excel file with evidence data (required)
- --word-file: Path to output Word file (optional)
- --output-folder: Output folder path, used to find images (optional)
- --multi-line: Use multi-line text format (default: True)
- --show-borders: Show table borders (default: True)
- --exclude-columns: Columns to exclude from evidence text (optional)
Output
The program creates the following in the output directory:
- pages/: Contains split single-page PDF files
- images/: (When using Qwen) Contains PDF page image files
- json/: JSON files with structured data and AI processing results
- excel/: Excel files with flattened analysis results
- word/: Word files with extracted precision medicine evidence reports
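After a run, a quick way to check what was produced is to count the files in each of these subfolders. The sketch below uses only `pathlib`; the helper `summarize_output` is hypothetical, and only the subfolder names are taken from the list above.

```python
import tempfile
from pathlib import Path

# Subfolder names as documented in the Output section.
SUBFOLDERS = ("pages", "images", "json", "excel", "word")


def summarize_output(output_folder):
    """Count the files in each expected subfolder; a missing subfolder counts as 0."""
    root = Path(output_folder)
    summary = {}
    for name in SUBFOLDERS:
        sub = root / name
        summary[name] = sum(1 for p in sub.iterdir() if p.is_file()) if sub.is_dir() else 0
    return summary


# Demonstration against a throwaway directory containing one JSON file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "json").mkdir()
    (Path(tmp) / "json" / "example.json").write_text("{}")
    demo_summary = summarize_output(tmp)

print(demo_summary)
```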
Word Export Features
The Word export functionality includes several advanced formatting options:
- Enhanced Table Layout:
  - Left side displays multiple rows of text fields (one field per row)
  - Right side shows images in a single vertically merged cell
  - Customizable table borders (can be shown or hidden)
  - Table continuation across pages for long evidence items
- Page Formatting:
  - Automatic page numbering in "Page X of Y" format
  - Support for both portrait and landscape orientations
  - Table continuation across page breaks
- Text Formatting:
  - Support for multi-line text display
  - Consistent font styling
- Image Handling:
  - Automatic resizing and centering
  - Fallback mechanism for missing images
- Customization Parameters:
  - multi_line_text: Controls text formatting in the left cell
    - True: Creates multiple rows, one for each key-value pair
    - False: Creates a single row with JSON-style dictionary
  - show_borders: Controls table border visibility
    - True: Shows all table borders
    - False: Hides table borders for a cleaner look
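To make the two `multi_line_text` modes concrete, the sketch below shows how one evidence record could be turned into the text rows of the left cell. It is an illustration only: `evidence_rows` and the sample field names are made up, and the exact text the package writes to the Word table may differ.

```python
import json


def evidence_rows(evidence, multi_line_text=True):
    """Return the text rows for one evidence record.

    multi_line_text=True  -> one "key: value" row per field
    multi_line_text=False -> a single row holding a JSON-style dictionary
    """
    if multi_line_text:
        return [f"{key}: {value}" for key, value in evidence.items()]
    return [json.dumps(evidence, ensure_ascii=False)]


# Placeholder record; real evidence fields come from the Excel file.
sample = {"field_a": "value 1", "field_b": "value 2"}
print(evidence_rows(sample, multi_line_text=True))
print(evidence_rows(sample, multi_line_text=False))
```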
Latest Features
1:1 PDF Processing Mapping
PrecisionDoc now ensures a strict 1:1 mapping between original PDF files and their output files (JSON, Excel, Word). This means:
- Each original PDF generates exactly one output file of each type
- Output files are initialized at the start of processing each PDF
- No redundant data accumulation on repeated runs
- Improved data organization and traceability
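One way to picture the 1:1 mapping is a function that derives exactly one path per output type from the PDF name. The sketch below is an assumption about naming, not the package's documented behavior: the helper `output_paths`, the reuse of the PDF stem as the file name, and the `.xlsx`/`.docx` extensions are all illustrative.

```python
from pathlib import Path

# One extension per output type; subfolder names follow the Output section.
EXTENSIONS = {"json": ".json", "excel": ".xlsx", "word": ".docx"}


def output_paths(pdf_path, output_folder="./output"):
    """Map one PDF to exactly one output path per type (hypothetical naming)."""
    stem = Path(pdf_path).stem
    root = Path(output_folder)
    return {kind: root / kind / f"{stem}{ext}" for kind, ext in EXTENSIONS.items()}


for kind, path in output_paths("/path/to/guideline.pdf").items():
    print(kind, path)
```

Because the paths depend only on the PDF name, a repeated run targets the same files, which fits the point above about avoiding redundant data accumulation.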
Page Metadata Enhancement
Each processed page now includes additional metadata:
- Current page number
- Total page count in the document
- Original PDF filename
This enriches the JSON output with useful pagination context for better organization and reference.
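As a rough picture of what such a page record could look like, the sketch below attaches the three metadata fields to a page result. The key names (`page_number`, `total_pages`, `source_file`) and the helper `with_page_metadata` are hypothetical; the actual keys in PrecisionDoc's JSON output are not specified here.

```python
import json


def with_page_metadata(page_result, page_number, total_pages, pdf_filename):
    """Return a copy of a page result with pagination metadata attached."""
    enriched = dict(page_result)  # copy so the caller's dict is not modified
    enriched.update(
        {
            "page_number": page_number,   # hypothetical key name
            "total_pages": total_pages,   # hypothetical key name
            "source_file": pdf_filename,  # hypothetical key name
        }
    )
    return enriched


record = with_page_metadata({"evidence": []}, 3, 120, "guideline.pdf")
print(json.dumps(record, indent=2))
```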
Modular PDF Processing
The PDF processing pipeline has been refactored into smaller, more maintainable functions:
- _initialize_output_files: Handles initialization of JSON, Excel, and Word output files
- _process_pdf_pages: Processes individual PDF pages and saves intermediate results
- _save_final_results: Saves final results to JSON, Excel, and Word files
Single PDF Processing
PrecisionDoc now supports processing individual PDF files directly:
- Process a specific PDF file without needing to place it in a dedicated folder
- Generate the same comprehensive outputs (JSON, Excel, Word) as with folder processing
- Maintain the same high-quality analysis and evidence extraction
- Useful for targeted processing of individual documents
Direct Excel-to-Word Conversion
Users can now convert Excel files to formatted Word documents without needing to process PDF files first:
- Supports various formatting options including multi-line text vs. JSON format
- Provides table borders control and column exclusion options
- Accessible via both command line and Python API
Future Plans
- Add support for additional PDF processing libraries for better handling of complex layouts
- Implement batch processing with multi-threading to improve performance
- Create a web-based user interface for easier interaction
- Add support for more languages and document types
- Enhance evidence extraction with more detailed categorization
- Improve image handling and OCR capabilities
- Add support for custom templates for Word export
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgements
- OpenAI and Alibaba Cloud for providing the AI APIs
- The open-source community for the various libraries used in this project