Document processing and evidence extraction package for precision oncology

These details have been verified by PyPI

Maintainers

KayChiao

These details have not been verified by PyPI

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

PrecisionDoc - Medical Precision Document Processing Tool

This project processes medical guideline PDF files, especially treatment guidelines from CSCO (Chinese Society of Clinical Oncology). It can:

- Process PDF files in a specified folder
- Split PDF files into individual pages
- Analyze each page using AI (OpenAI or Alibaba Cloud Qwen)
- Extract precision medicine evidence related to drug efficacy
- Save analysis results in JSON and Excel formats
- Generate Word reports containing precision medicine evidence

Installation

From Source

1. Clone this repository
2. Install dependencies:

pip install -r requirements.txt

Using pip

pip install precisiondoc

Configuration

Create a .env file (refer to env.example) and set API keys:

OPENAI_API_KEY=your_openai_api_key
OPENAI_BASE_URL=https://api.openai.com/v1
OPENAI_MODEL=gpt-4

QWEN_API_KEY=your_qwen_api_key
QWEN_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
QWEN_TEXT_MODEL=qwen-max
QWEN_MULTIMODAL_MODEL=qwen-vl-max

LOG_LEVEL=INFO
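
The .env file is a plain list of KEY=VALUE lines. The package lists python-dotenv for loading it; the sketch below is not that library, only a standard-library illustration of the format, and `parse_env_text` is a hypothetical helper name.

```python
def parse_env_text(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    settings: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first "=" only, so values may contain "=" themselves.
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a KEY=VALUE line
        settings[key.strip()] = value.strip()
    return settings


sample = """
OPENAI_API_KEY=your_openai_api_key
OPENAI_BASE_URL=https://api.openai.com/v1
OPENAI_MODEL=gpt-4

LOG_LEVEL=INFO
"""
config = parse_env_text(sample)
print(config["OPENAI_MODEL"])
```

In practice, python-dotenv's `load_dotenv()` would normally do this work and place the values in the process environment.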

Dependencies

The project requires the following main dependencies:

PyMuPDF: PDF processing
openai: OpenAI API client
pandas and openpyxl: Data processing and Excel file handling
python-docx: Word document generation
python-dotenv: Environment variable management
numpy: Numerical operations
requests: HTTP requests
tqdm: Progress bars

All dependencies are listed in requirements.txt.

Usage

Command Line Interface

After installation, you can use the precisiondoc command:

# Process PDF files
precisiondoc process-pdf --folder /path/to/pdfs --output-folder ./output

# Convert Excel to Word
precisiondoc excel-to-word --excel-file /path/to/evidence.xlsx --multi-line --show-borders

Python API

You can also use PrecisionDoc as a Python package:

# Import the package
from precisiondoc import process_pdf, excel_to_word, process_single_pdf

# Process PDF files
results = process_pdf(
    folder_path="/path/to/pdfs",
    output_folder="./output",
    api_key="your-api-key",  # Optional, will use env var if not provided
    base_url="https://api.example.com/v1",  # Optional
    model="gpt-4"  # Optional
)

# Process a single PDF file
results = process_single_pdf(
    pdf_path="/path/to/document.pdf",
    doc_type="DocumentName",  # Optional, will use filename if not provided
    output_folder="./output",  # Optional
    api_key="your-api-key",  # Optional
    base_url="https://api.example.com/v1",  # Optional
    model="gpt-4"  # Optional
)

# Convert Excel evidence to Word
word_file = excel_to_word(
    excel_file="/path/to/evidence.xlsx",
    word_file="/path/to/output.docx",  # Optional
    multi_line_text=True,  # Optional
    show_borders=True  # Optional
)

Advanced Usage

For more advanced usage, you can directly use the classes provided by the package:

from precisiondoc import PDFProcessor, WordUtils, DataUtils

# Create a PDF processor
processor = PDFProcessor(
    folder_path="/path/to/pdfs",
    output_folder="./output",
    api_key="your-api-key",
    base_url="https://api.example.com/v1",
    model="gpt-4"
)

# Process all PDFs
results = processor.process_all()

# Save results
processor.save_consolidated_results(results)

# Work with data utilities
data_utils = DataUtils()
df = data_utils.load_excel_file("/path/to/evidence.xlsx")

# Export to Word with custom formatting
WordUtils.export_evidence_to_word(
    excel_file=df,
    word_file="/path/to/output.docx",
    multi_line_text=True,
    show_borders=False,
    exclude_columns=["column1", "column2"]
)

Environment Variables

The package uses the following environment variables:

API_KEY: API key for AI service
BASE_URL: Base URL for API endpoint
TEXT_MODEL: Model name for text processing
MULTIMODAL_MODEL: Model name for image processing
LOG_LEVEL: Logging level (default: INFO)

You can set these variables in a .env file or directly in your environment.
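
The Python API comments say an omitted `api_key` falls back to an environment variable. A minimal sketch of that precedence, assuming an explicit argument wins over the environment and the environment wins over a default; `resolve_setting` is a hypothetical helper, not part of the package:

```python
import os
from typing import Mapping, Optional


def resolve_setting(
    explicit: Optional[str],
    env_name: str,
    env: Mapping[str, str] = os.environ,
    default: Optional[str] = None,
) -> Optional[str]:
    """Return the explicit value if given, else the environment value, else the default."""
    if explicit is not None:
        return explicit
    return env.get(env_name, default)


# A plain dict stands in for the real environment so the example is repeatable.
fake_env = {"API_KEY": "key-from-env"}
from_env = resolve_setting(None, "API_KEY", fake_env)
from_arg = resolve_setting("key-from-arg", "API_KEY", fake_env)
log_level = resolve_setting(None, "LOG_LEVEL", fake_env, default="INFO")
print(from_env, from_arg, log_level)
```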

Parameters

Command Line Parameters

--folder: Path to the folder containing PDF files (required)
--api-key: API key for OpenAI or Qwen (if not provided, will be read from environment variables)
--use-qwen: Use Qwen API instead of OpenAI (optional)
--output-folder: Output folder path (optional, default: "./output")
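
When driving the CLI from a script, the parameters above can be assembled into an argument list. This is a sketch under two assumptions: that all four flags belong to the `process-pdf` subcommand, and that `--use-qwen` is a switch that takes no value. `build_process_pdf_command` is a hypothetical helper.

```python
import shlex
from typing import List, Optional


def build_process_pdf_command(
    folder: str,
    output_folder: str = "./output",
    api_key: Optional[str] = None,
    use_qwen: bool = False,
) -> List[str]:
    """Assemble the argument list for `precisiondoc process-pdf`."""
    command = ["precisiondoc", "process-pdf", "--folder", folder,
               "--output-folder", output_folder]
    if api_key is not None:
        command += ["--api-key", api_key]
    if use_qwen:
        command.append("--use-qwen")  # assumed to be a value-less switch
    return command


command = build_process_pdf_command("/path/to/pdfs", use_qwen=True)
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)`; keeping it as a list avoids shell quoting problems with paths that contain spaces.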

Excel to Word Parameters

--excel-file: Path to Excel file with evidence data (required)
--word-file: Path to output Word file (optional)
--output-folder: Output folder path, used to find images (optional)
--multi-line: Use multi-line text format (default: True)
--show-borders: Show table borders (default: True)
--exclude-columns: Columns to exclude from evidence text (optional)

Output

The program creates the following in the output directory:

pages/: Contains split single-page PDF files
images/: (When using Qwen) Contains PDF page image files
json/: JSON files with structured data and AI processing results
excel/: Excel files with flattened analysis results
word/: Word files with extracted precision medicine evidence reports
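
The layout above can be written down as a small mapping, which is handy when checking that a run produced everything expected. `expected_output_dirs` is a hypothetical helper; it only computes paths and does not touch the disk.

```python
from pathlib import Path
from typing import Dict


def expected_output_dirs(output_folder: str, use_qwen: bool = False) -> Dict[str, Path]:
    """Map each output subfolder name to its path under output_folder."""
    root = Path(output_folder)
    names = ["pages", "json", "excel", "word"]
    if use_qwen:
        # images/ is only described for Qwen runs.
        names.insert(1, "images")
    return {name: root / name for name in names}


dirs = expected_output_dirs("./output", use_qwen=True)
for name, path in dirs.items():
    print(f"{name}: {path}")
```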

Word Export Features

The Word export functionality includes several advanced formatting options:

Enhanced Table Layout:
- Left side displays multiple rows of text fields (one field per row)
- Right side shows images in a single vertically merged cell
- Customizable table borders (can be shown or hidden)
- Table continuation across pages for long evidence items
Page Formatting:
- Automatic page numbering in "Page X of Y" format
- Support for both portrait and landscape orientations
- Table continuation across page breaks
Text Formatting:
- Support for multi-line text display
- Consistent font styling
Image Handling:
- Automatic resizing and centering
- Fallback mechanism for missing images
Customization Parameters:
- multi_line_text: Controls text formatting in the left cell
  - True: Creates multiple rows, one for each key-value pair
  - False: Creates a single row with JSON-style dictionary
- show_borders: Controls table border visibility
  - True: Shows all table borders
  - False: Hides table borders for a cleaner look
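
The difference between the two `multi_line_text` modes can be illustrated without Word at all. The sketch below is not the package's implementation; `evidence_text_rows` and the field names in `record` are made up for the example.

```python
import json
from typing import Dict, List


def evidence_text_rows(record: Dict[str, str], multi_line_text: bool = True) -> List[str]:
    """Return the text rows for the left side of one evidence table."""
    if multi_line_text:
        # One row per key-value pair.
        return [f"{key}: {value}" for key, value in record.items()]
    # A single row holding a JSON-style dictionary.
    return [json.dumps(record, ensure_ascii=False)]


record = {"drug": "example drug", "evidence_level": "example level"}
multi_rows = evidence_text_rows(record, multi_line_text=True)
single_row = evidence_text_rows(record, multi_line_text=False)
print(multi_rows)
print(single_row)
```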

Latest Features

1:1 PDF Processing Mapping

PrecisionDoc now ensures a strict 1:1 mapping between original PDF files and their output files (JSON, Excel, Word). This means:

- Each original PDF generates exactly one output file of each type
- Output files are initialized at the start of processing each PDF
- No redundant data accumulation on repeated runs
- Improved data organization and traceability
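
One way to get this behaviour is to derive each output name from the PDF's own name and to open the file in write mode at the start of a run, which truncates anything left over. The sketch shows this for the JSON output only; the `<stem>.json` naming and the `initialize_json_output` helper are assumptions, not the package's actual code.

```python
import json
import tempfile
from pathlib import Path


def initialize_json_output(pdf_path: str, output_folder: str) -> Path:
    """Create (or reset) the single JSON output that belongs to one PDF."""
    json_dir = Path(output_folder) / "json"
    json_dir.mkdir(parents=True, exist_ok=True)
    out_path = json_dir / f"{Path(pdf_path).stem}.json"
    # Mode "w" truncates any file left by a previous run.
    with out_path.open("w", encoding="utf-8") as handle:
        json.dump([], handle)
    return out_path


with tempfile.TemporaryDirectory() as tmp:
    first = initialize_json_output("/path/to/guideline.pdf", tmp)
    # Simulate results written during a first run.
    first.write_text(json.dumps([{"page": 1}]), encoding="utf-8")
    # A second run on the same PDF resets the same file instead of appending.
    second = initialize_json_output("/path/to/guideline.pdf", tmp)
    same_file = first == second
    contents_after_reset = json.loads(second.read_text(encoding="utf-8"))
    json_files = sorted(p.name for p in (Path(tmp) / "json").iterdir())

print(same_file, contents_after_reset, json_files)
```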

Page Metadata Enhancement

Each processed page now includes additional metadata:

- Current page number
- Total page count in the document
- Original PDF filename

This enriches the JSON output with useful pagination context for better organization and reference.
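
As an illustration of what such per-page metadata might look like, the sketch below builds one record per page. The key names (`page_number`, `total_pages`, `source_pdf`) are assumptions chosen for the example; the actual JSON keys are not documented here.

```python
from typing import Dict, List, Union


def build_page_metadata(pdf_filename: str, total_pages: int) -> List[Dict[str, Union[str, int]]]:
    """Return one metadata record per page, numbered from 1."""
    return [
        {"page_number": n, "total_pages": total_pages, "source_pdf": pdf_filename}
        for n in range(1, total_pages + 1)
    ]


pages = build_page_metadata("guideline.pdf", 3)
print(pages[0])
```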

Modular PDF Processing

The PDF processing pipeline has been refactored into smaller, more maintainable functions:

_initialize_output_files: Handles initialization of JSON, Excel, and Word output files
_process_pdf_pages: Processes individual PDF pages and saves intermediate results
_save_final_results: Saves final results to JSON, Excel, and Word files

Single PDF Processing

PrecisionDoc now supports processing individual PDF files directly:

- Process a specific PDF file without needing to place it in a dedicated folder
- Generate the same outputs (JSON, Excel, Word) as with folder processing
- Apply the same analysis and evidence extraction as folder processing
- Useful for targeted processing of individual documents

Direct Excel-to-Word Conversion

Users can now convert Excel files to formatted Word documents without needing to process PDF files first:

- Supports various formatting options including multi-line text vs. JSON format
- Provides table border control and column exclusion options
- Accessible via both command line and Python API

Future Plans

- Add support for additional PDF processing libraries for better handling of complex layouts
- Implement batch processing with multi-threading to improve performance
- Create a web-based user interface for easier interaction
- Add support for more languages and document types
- Enhance evidence extraction with more detailed categorization
- Improve image handling and OCR capabilities
- Add support for custom templates for Word export

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgements

- OpenAI and Alibaba Cloud for providing the AI APIs
- The open-source community for the various libraries used in this project

Project details

Release history

- 0.1.3 (Aug 14, 2025), this version
- 0.1.1 (Aug 14, 2025)
- 0.1.1rc1 (Aug 14, 2025), pre-release
- 0.1.0 (Aug 13, 2025)

Download files

Download the file for your platform.

Source Distribution

precisiondoc-0.1.3.tar.gz (38.4 kB)

Uploaded Aug 14, 2025 Source

Built Distribution

precisiondoc-0.1.3-py3-none-any.whl (39.7 kB)

Uploaded Aug 14, 2025 Python 3

File details

Details for the file precisiondoc-0.1.3.tar.gz.

File metadata

Download URL: precisiondoc-0.1.3.tar.gz
Upload date: Aug 14, 2025
Size: 38.4 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for precisiondoc-0.1.3.tar.gz
Algorithm	Hash digest
SHA256	`a429e6d13441130011c7e962a1ba6d023b89c8c7e2fa77d7b7365f115bfef28a`
MD5	`72f1936dba76b4a3ea7395ef3c5e8834`
BLAKE2b-256	`5b6d18891b7dff18943934c56786529dd2773419b9c839d2d58b27dd65fc9ece`

Provenance

The following attestation bundles were made for precisiondoc-0.1.3.tar.gz:

Publisher: python-publish.yml on kaychiao/PrecisionDoc

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: precisiondoc-0.1.3.tar.gz
- Subject digest: a429e6d13441130011c7e962a1ba6d023b89c8c7e2fa77d7b7365f115bfef28a
- Sigstore transparency entry: 393365653
- Sigstore integration time: Aug 14, 2025
Source repository:
- Permalink: kaychiao/PrecisionDoc@04efc626fd3a265f963ba27b4d3f83d4014314a0
- Branch / Tag: refs/tags/v0.1.3
- Owner: https://github.com/kaychiao
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@04efc626fd3a265f963ba27b4d3f83d4014314a0
- Trigger Event: push

File details

Details for the file precisiondoc-0.1.3-py3-none-any.whl.

File metadata

Download URL: precisiondoc-0.1.3-py3-none-any.whl
Upload date: Aug 14, 2025
Size: 39.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for precisiondoc-0.1.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d55d96a651c4606f5bc3f3dd2724c5e838ea56c07b151bb6c8d0464a73c75c0f`
MD5	`8de0e74bdede4e75beb68afebeb45dbe`
BLAKE2b-256	`b76db7843d17342a799eb963c0d4fef52e6885ceb80f2b81744efebf8eb0cdd3`

Provenance

The following attestation bundles were made for precisiondoc-0.1.3-py3-none-any.whl:

Publisher: python-publish.yml on kaychiao/PrecisionDoc

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: precisiondoc-0.1.3-py3-none-any.whl
- Subject digest: d55d96a651c4606f5bc3f3dd2724c5e838ea56c07b151bb6c8d0464a73c75c0f
- Sigstore transparency entry: 393365657
- Sigstore integration time: Aug 14, 2025
Source repository:
- Permalink: kaychiao/PrecisionDoc@04efc626fd3a265f963ba27b4d3f83d4014314a0
- Branch / Tag: refs/tags/v0.1.3
- Owner: https://github.com/kaychiao
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@04efc626fd3a265f963ba27b4d3f83d4014314a0
- Trigger Event: push
