Document processing and evidence extraction package for precision oncology

PrecisionDoc - Medical Precision Document Processing Tool

This project processes medical guideline PDF files, especially treatment guidelines from CSCO (Chinese Society of Clinical Oncology). It can:

  1. Process PDF files in a specified folder
  2. Split PDF files into individual pages
  3. Analyze each page using AI (OpenAI or Alibaba Cloud Qwen)
  4. Extract precision medicine evidence related to drug efficacy
  5. Save analysis results in JSON and Excel formats
  6. Generate Word reports containing precision medicine evidence

Installation

From Source

  1. Clone this repository
  2. Install dependencies:
pip install -r requirements.txt

Using pip

pip install precisiondoc

Configuration

Create a .env file (refer to env.example) and set API keys:

OPENAI_API_KEY=your_openai_api_key
OPENAI_BASE_URL=https://api.openai.com/v1
OPENAI_MODEL=gpt-4

QWEN_API_KEY=your_qwen_api_key
QWEN_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
QWEN_TEXT_MODEL=qwen-max
QWEN_MULTIMODAL_MODEL=qwen-vl-max

LOG_LEVEL=INFO
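
The settings above could be read at runtime roughly as follows. This is a minimal sketch, not PrecisionDoc's actual loader: the function name `resolve_provider_settings` and the keys of the returned dictionary are hypothetical, and it only assumes the `OPENAI_API_KEY`, `OPENAI_MODEL`, `QWEN_API_KEY`, `QWEN_TEXT_MODEL`, and `LOG_LEVEL` names from the example `.env` file.

```python
import os


def resolve_provider_settings(use_qwen=False, env=None):
    """Pick API settings for one provider from environment-style variables.

    Hypothetical helper for illustration; not part of the precisiondoc API.
    """
    env = os.environ if env is None else env
    if use_qwen:
        key_name, model_name = "QWEN_API_KEY", "QWEN_TEXT_MODEL"
    else:
        key_name, model_name = "OPENAI_API_KEY", "OPENAI_MODEL"

    api_key = env.get(key_name)
    if not api_key:
        # Fail early with the name of the variable the user needs to set.
        raise KeyError(f"{key_name} is not set")

    return {
        "api_key": api_key,
        "model": env.get(model_name),
        "log_level": env.get("LOG_LEVEL", "INFO"),
    }


# A plain dict stands in for the process environment in this example.
example_env = {"OPENAI_API_KEY": "your_openai_api_key", "OPENAI_MODEL": "gpt-4"}
settings = resolve_provider_settings(env=example_env)
print(settings["model"], settings["log_level"])  # prints "gpt-4 INFO"
```

Passing a mapping instead of reading `os.environ` directly keeps the lookup easy to test without touching real credentials.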

Dependencies

The project requires the following main dependencies:

  • PyMuPDF: PDF processing
  • openai: OpenAI API client
  • pandas and openpyxl: Data processing and Excel file handling
  • python-docx: Word document generation
  • python-dotenv: Environment variable management
  • numpy: Numerical operations
  • requests: HTTP requests
  • tqdm: Progress bars

All dependencies are listed in requirements.txt.

Usage

Command Line Interface

After installation, you can use the precisiondoc command:

# Process PDF files
precisiondoc process-pdf --folder /path/to/pdfs --output-folder ./output

# Convert Excel to Word
precisiondoc excel-to-word --excel-file /path/to/evidence.xlsx --multi-line --show-borders

Python API

You can also use PrecisionDoc as a Python package:

# Import the package
from precisiondoc import process_pdf, excel_to_word, process_single_pdf

# Process PDF files
results = process_pdf(
    folder_path="/path/to/pdfs",
    output_folder="./output",
    api_key="your-api-key",  # Optional, will use env var if not provided
    base_url="https://api.example.com/v1",  # Optional
    model="gpt-4"  # Optional
)

# Process a single PDF file
results = process_single_pdf(
    pdf_path="/path/to/document.pdf",
    doc_type="DocumentName",  # Optional, will use filename if not provided
    output_folder="./output",  # Optional
    api_key="your-api-key",  # Optional
    base_url="https://api.example.com/v1",  # Optional
    model="gpt-4"  # Optional
)

# Convert Excel evidence to Word
word_file = excel_to_word(
    excel_file="/path/to/evidence.xlsx",
    word_file="/path/to/output.docx",  # Optional
    multi_line_text=True,  # Optional
    show_borders=True  # Optional
)

Advanced Usage

For more advanced usage, you can directly use the classes provided by the package:

from precisiondoc import PDFProcessor, WordUtils, DataUtils

# Create a PDF processor
processor = PDFProcessor(
    folder_path="/path/to/pdfs",
    output_folder="./output",
    api_key="your-api-key",
    base_url="https://api.example.com/v1",
    model="gpt-4"
)

# Process all PDFs
results = processor.process_all()

# Save results
processor.save_consolidated_results(results)

# Work with data utilities
data_utils = DataUtils()
df = data_utils.load_excel_file("/path/to/evidence.xlsx")

# Export to Word with custom formatting
WordUtils.export_evidence_to_word(
    excel_file=df,
    word_file="/path/to/output.docx",
    multi_line_text=True,
    show_borders=False,
    exclude_columns=["column1", "column2"]
)

Environment Variables

The package uses the following environment variables:

  • API_KEY: API key for AI service
  • BASE_URL: Base URL for API endpoint
  • TEXT_MODEL: Model name for text processing
  • MULTIMODAL_MODEL: Model name for image processing
  • LOG_LEVEL: Logging level (default: INFO)

You can set these variables in a .env file or directly in your environment.

Parameters

Command Line Parameters

  • --folder: Path to the folder containing PDF files (required)
  • --api-key: API key for OpenAI or Qwen (if not provided, will be read from environment variables)
  • --use-qwen: Use Qwen API instead of OpenAI (optional)
  • --output-folder: Output folder path (optional, default: "./output")
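
To make the defaults above concrete, here is a small `argparse` sketch that mirrors the four documented flags. It is an illustration only: `build_process_pdf_parser` is a hypothetical name, and the package's real parser may be organized differently.

```python
import argparse


def build_process_pdf_parser():
    """Hypothetical parser mirroring the documented process-pdf flags."""
    parser = argparse.ArgumentParser(prog="precisiondoc process-pdf")
    parser.add_argument("--folder", required=True,
                        help="Folder containing PDF files")
    parser.add_argument("--api-key", default=None,
                        help="API key; read from environment variables if omitted")
    parser.add_argument("--use-qwen", action="store_true",
                        help="Use the Qwen API instead of OpenAI")
    parser.add_argument("--output-folder", default="./output",
                        help="Folder where results are written")
    return parser


# Parse an explicit argument list instead of sys.argv for the demonstration.
args = build_process_pdf_parser().parse_args(
    ["--folder", "/path/to/pdfs", "--use-qwen"]
)
print(args.folder, args.use_qwen, args.output_folder)
# prints "/path/to/pdfs True ./output"
```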

Excel to Word Parameters

  • --excel-file: Path to Excel file with evidence data (required)
  • --word-file: Path to output Word file (optional)
  • --output-folder: Output folder path, used to find images (optional)
  • --multi-line: Use multi-line text format (default: True)
  • --show-borders: Show table borders (default: True)
  • --exclude-columns: Columns to exclude from evidence text (optional)

Output

The program creates the following in the output directory:

  • pages/: Contains split single-page PDF files
  • images/: (When using Qwen) Contains PDF page image files
  • json/: JSON files with structured data and AI processing results
  • excel/: Excel files with flattened analysis results
  • word/: Word files with extracted precision medicine evidence reports
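
The layout above can be sketched with `pathlib`. The subfolder names come from the list; the helper `create_output_layout` and its `use_qwen` switch are assumptions made for this example, not the package's own code.

```python
import tempfile
from pathlib import Path

# Subfolder names taken from the list above; images/ is only used with Qwen.
BASE_SUBFOLDERS = ("pages", "json", "excel", "word")


def create_output_layout(output_folder, use_qwen=False):
    """Create the output subfolders and return a name -> Path mapping."""
    root = Path(output_folder)
    names = BASE_SUBFOLDERS + (("images",) if use_qwen else ())
    layout = {}
    for name in names:
        path = root / name
        # exist_ok=True makes repeated runs safe.
        path.mkdir(parents=True, exist_ok=True)
        layout[name] = path
    return layout


# Demonstrate in a throwaway directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    layout = create_output_layout(tmp, use_qwen=True)
    print(sorted(layout))  # prints ['excel', 'images', 'json', 'pages', 'word']
```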

Word Export Features

The Word export functionality includes several advanced formatting options:

  • Enhanced Table Layout:

    • Left side displays multiple rows of text fields (one field per row)
    • Right side shows images in a single vertically merged cell
    • Customizable table borders (can be shown or hidden)
    • Table continuation across pages for long evidence items
  • Page Formatting:

    • Automatic page numbering in "Page X of Y" format
    • Support for both portrait and landscape orientations
    • Table continuation across page breaks
  • Text Formatting:

    • Support for multi-line text display
    • Consistent font styling
  • Image Handling:

    • Automatic resizing and centering
    • Fallback mechanism for missing images
  • Customization Parameters:

    • multi_line_text: Controls text formatting in the left cell
      • True: Creates multiple rows, one for each key-value pair
      • False: Creates a single row with JSON-style dictionary
    • show_borders: Controls table border visibility
      • True: Shows all table borders
      • False: Hides table borders for a cleaner look
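
The difference between the two `multi_line_text` modes can be illustrated without Word at all. The sketch below only builds the text that would go into the left side of the table; `render_evidence` is a hypothetical helper, the sample record is made up, and the exact string format PrecisionDoc writes may differ.

```python
import json


def render_evidence(record, multi_line_text=True, exclude_columns=()):
    """Return the text rows for the left side of one evidence table.

    Hypothetical helper: one string per table row.
    """
    # Drop excluded columns first, as exclude_columns is described to do.
    fields = {k: v for k, v in record.items() if k not in exclude_columns}
    if multi_line_text:
        # One row per key-value pair.
        return [f"{key}: {value}" for key, value in fields.items()]
    # A single row holding a JSON-style dictionary.
    return [json.dumps(fields, ensure_ascii=False)]


# Made-up record for illustration only.
record = {"drug": "DrugA", "biomarker": "GeneX", "page": 12}

multi = render_evidence(record, multi_line_text=True, exclude_columns=["page"])
single = render_evidence(record, multi_line_text=False)
print(multi)   # ['drug: DrugA', 'biomarker: GeneX']
print(single)  # ['{"drug": "DrugA", "biomarker": "GeneX", "page": 12}']
```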

Latest Features

1:1 PDF Processing Mapping

PrecisionDoc now ensures a strict 1:1 mapping between original PDF files and their output files (JSON, Excel, Word). This means:

  • Each original PDF generates exactly one output file of each type
  • Output files are initialized at the start of processing each PDF
  • No redundant data accumulation on repeated runs
  • Improved data organization and traceability
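
One way to get this behavior is to derive the output path from the PDF name and overwrite the file at the start of each run. The sketch below shows the idea for the JSON output only; the naming scheme (`<pdf stem>.json` under `json/`) and both function names are assumptions, not the package's documented behavior.

```python
import json
import tempfile
from pathlib import Path


def output_json_path(pdf_path, output_folder):
    """One JSON file per source PDF, named after the PDF's stem (assumed naming)."""
    return Path(output_folder) / "json" / f"{Path(pdf_path).stem}.json"


def process_pdf_once(pdf_path, output_folder, page_results):
    """Write page results for one PDF, starting from an empty output file."""
    target = output_json_path(pdf_path, output_folder)
    target.parent.mkdir(parents=True, exist_ok=True)
    # Initialize: overwrite anything left over from an earlier run.
    target.write_text("[]", encoding="utf-8")
    collected = []
    for page in page_results:
        collected.append(page)
        # Intermediate save after every page.
        target.write_text(json.dumps(collected), encoding="utf-8")
    return target


with tempfile.TemporaryDirectory() as tmp:
    pages = [{"page": 1}, {"page": 2}]
    process_pdf_once("guideline.pdf", tmp, pages)
    path = process_pdf_once("guideline.pdf", tmp, pages)  # repeated run
    stored = json.loads(path.read_text(encoding="utf-8"))
    print(path.name, len(stored))  # prints "guideline.json 2"
```

Because the file is reset before processing, the second run ends with two entries rather than four.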

Page Metadata Enhancement

Each processed page now includes additional metadata:

  • Current page number
  • Total page count in the document
  • Original PDF filename

This enriches the JSON output with useful pagination context for better organization and reference.
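
The metadata fields listed above could be attached to each page result along these lines. The key names (`page_number`, `total_pages`, `source_pdf`) and the helper `build_page_metadata` are hypothetical; the actual keys in PrecisionDoc's JSON output are not specified here.

```python
from pathlib import Path


def build_page_metadata(pdf_path, page_number, total_pages):
    """Build pagination metadata for one page (hypothetical key names)."""
    if not 1 <= page_number <= total_pages:
        raise ValueError("page_number must be between 1 and total_pages")
    return {
        "page_number": page_number,         # current page, 1-based
        "total_pages": total_pages,         # page count of the document
        "source_pdf": Path(pdf_path).name,  # original filename, no directory
    }


# Attach the metadata to an otherwise empty page result.
page_result = {"evidence": []}
page_result["metadata"] = build_page_metadata("/path/to/document.pdf", 3, 120)
print(page_result["metadata"])
```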

Modular PDF Processing

The PDF processing pipeline has been refactored into smaller, more maintainable functions:

  • _initialize_output_files: Handles initialization of JSON, Excel, and Word output files
  • _process_pdf_pages: Processes individual PDF pages and saves intermediate results
  • _save_final_results: Saves final results to JSON, Excel, and Word files

Single PDF Processing

PrecisionDoc now supports processing individual PDF files directly:

  • Process a specific PDF file without needing to place it in a dedicated folder
  • Generate the same comprehensive outputs (JSON, Excel, Word) as with folder processing
  • Maintain the same high-quality analysis and evidence extraction
  • Useful for targeted processing of individual documents

Direct Excel-to-Word Conversion

Users can now convert Excel files to formatted Word documents without needing to process PDF files first:

  • Supports various formatting options including multi-line text vs. JSON format
  • Provides table borders control and column exclusion options
  • Accessible via both command line and Python API

Future Plans

  • Add support for additional PDF processing libraries for better handling of complex layouts
  • Implement batch processing with multi-threading to improve performance
  • Create a web-based user interface for easier interaction
  • Add support for more languages and document types
  • Enhance evidence extraction with more detailed categorization
  • Improve image handling and OCR capabilities
  • Add support for custom templates for Word export

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgements

  • OpenAI and Alibaba Cloud for providing the AI APIs
  • The open-source community for the various libraries used in this project
