Skip to main content

A tool for parsing PDF document layouts and chunking content

Project description

CV Document Chunker

A Python package for parsing PDF document layouts using YOLO models, chunking content based on layout, performing table parsing with Docling integration, and optionally performing OCR.

Features

  • Convert PDF documents to images for processing.
  • Detect document layout elements (e.g., paragraphs, tables, figures) using YOLO.
  • Process and refine bounding boxes.
  • Chunk document content based on detected layout.
  • Intelligent table detection and parsing using Docling integration.
  • Generate table summaries, identify key columns, and classify table content.
  • Perform OCR on detected elements using Azure Document Intelligence or Tesseract.
  • Save structured document data (layouts, chunks, OCR text, parsed tables) in JSON format.
  • Generate paragraph embeddings using OpenAI, Azure OpenAI, or Hugging Face APIs.
  • Build document hierarchy based on layout analysis.

Installation

Prerequisites

  • Python 3.10+
  • Pip package manager
  • (Optional but Recommended) CUDA-capable GPU for YOLO model inference acceleration.

Steps

  1. Install the Package:

    pip install kiwi-pdf-chunker
    
  2. Install models locally:

You must download the required models once to your machine. The library will then automatically find them in the default location (~/.kiwi_pdf_chunker/).

First, run these shell commands to create the directories and download the docling models:

# Create directories for all models
mkdir -p ~/.kiwi_pdf_chunker/docling_models
mkdir -p ~/.kiwi_pdf_chunker/easyocr_models

# Download Docling layout model
python -m docling.cli.models download layout -o ~/.kiwi_pdf_chunker/docling_models

# Download Docling table structure model (TableFormer)
python -m docling.cli.models download tableformer -o ~/.kiwi_pdf_chunker/docling_models

Next, run this short Python script to download and cache the EasyOCR models:

import easyocr
import os

# Define the target directory for EasyOCR models
target_dir = os.path.expanduser("~/.kiwi_pdf_chunker/easyocr_models")

# Initialize the reader, which will automatically download the models to the target directory
print("Downloading EasyOCR models...")
easyocr.Reader(
    ['en'],
    gpu=False,
    model_storage_directory=target_dir,
    download_enabled=True
)
print(f"EasyOCR models successfully downloaded to: {target_dir}")

You should end up with this directory structure in your home folder:

~/.kiwi_pdf_chunker/
├── docling_models/
│   ├── ... (docling model files)
└── easyocr_models/
    ├── craft_mlt_25k.pth
    ├── english_g2.pth
    └── dict/
        └── en.txt

By default, the package looks for models in the ~/.kiwi_pdf_chunker/ directory. You can override these paths using environment variables if you need to store the models elsewhere:

export DOCLING_ARTIFACTS_PATH=/path/to/your/docling_models
export EASYOCR_MODULE_PATH=/path/to/your/easyocr_models

User-Provided Data

This package requires the user to provide certain data externally:

  1. Input Directory (input/): Place the PDF documents you want to process in a directory (e.g., input/). You will need to provide the path to your input file(s) when using the package.
  2. Models Directory (models/): Download the necessary YOLO model(s) (e.g., doclayout_yolo_docstructbench_imgsz1024.pt) and place them in a dedicated directory (e.g., models/). The path to this directory (or the specific model file) will be needed by the parser.

Usage

Basic Usage:

from kiwi_pdf_chunker.main import PDFParser

# --- User Configuration ---
input_pdf_path = "path/to/your/input/document.pdf" # Path to user's PDF
model_path = "path/to/your/models/doclayout_yolo.pt" # Path to user's model
output_dir = "path/to/your/output/" # Directory to save results

# Basic parser with OCR
parser = PDFParser(
    yolo_model_path=model_path,
    ocr=True,
    azure_ocr_endpoint="your_azure_endpoint",
    azure_ocr_key="your_azure_key"
)

results = parser.parse_document(input_pdf_path, output_dir=output_dir)

Advanced Usage with Table Classification and Embeddings:

from kiwi_pdf_chunker.main import PDFParser

# Advanced parser with table classification and embeddings
parser = PDFParser(
    yolo_model_path=model_path,
    ocr=True,
    azure_ocr_endpoint="your_azure_endpoint",
    azure_ocr_key="your_azure_key",
    embed=True,
    classify_tables=True,
    hf_token="your_huggingface_token",  # For embeddings
    hf_endpoint="your_huggingface_endpoint",
    azure_openai_api_key="your_azure_openai_key",  # For table classification
    azure_openai_api_version="2024-02-15-preview",
    azure_openai_endpoint="your_azure_openai_endpoint"
)

results = parser.parse_document(
    input_pdf_path, 
    output_dir=output_dir,
    generate_annotations=True,
    save_bounding_boxes=True,
    use_tesseract=False  # Use Azure OCR instead of Tesseract
)

Constructor Parameters

The PDFParser class accepts the following parameters:

Core Parameters

  • yolo_model_path (str, optional): Path to the YOLO model file. If None, uses the default path from config.
  • debug_mode (bool, optional): Enable debug mode with additional logging and outputs. Defaults to False.
  • container_threshold (int, optional): Minimum number of contained boxes required to remove a container box.
  • hierarchy (bool, optional): Enable hierarchy generation. Defaults to True.

OCR Parameters

  • ocr (bool, optional): Enable OCR processing. Defaults to False.
  • azure_ocr_endpoint (str, optional): Azure Document Intelligence endpoint URL.
  • azure_ocr_key (str, optional): Azure Document Intelligence API key.

Embedding Parameters

  • embed (bool, optional): If True, generate embeddings for extracted text. Defaults to False.
  • embedding_model (str, optional): Name of the OpenAI model for embeddings. Defaults to "text-embedding-3-small".
  • openai_api_key (str, optional): API key for standard OpenAI service.
  • azure_openai_api_key (str, optional): API key for Azure OpenAI service.
  • azure_openai_api_version (str, optional): API version for Azure OpenAI service.
  • azure_openai_endpoint_embedding (str, optional): Endpoint URL for Azure OpenAI service for text embeddings.
  • hf_token (str, optional): Hugging Face API token for embeddings.
  • hf_endpoint (str, optional): Hugging Face endpoint URL for embeddings.

Table Classification Parameters

  • classify_tables (bool, optional): If True, classify tables in the document. Defaults to False.
  • table_categories (list, optional): List of table categories for classification.
  • azure_openai_endpoint (str, optional): Endpoint URL for Azure OpenAI service for table classification.
  • table_classification_system_prompt (str, optional): System prompt for table classification.
  • table_summary_system_prompt (str, optional): System prompt for table summary generation.
  • table_id_column_system_prompt (str, optional): System prompt for table ID column identification.

Understanding the Output

After running the parser, the following outputs will typically be available in the specified output_dir:

  1. boxes.json: JSON file containing the detected document structure (element labels, coordinates, confidence).
  2. tables.json: JSON file containing parsed table data with hierarchical structure for "good" tables and vision model output for "bad" tables.
  3. table_screenshots/: Directory containing screenshots of detected tables for debugging and verification.
  4. annotations/: Directory containing annotated images showing the detected elements for each page (if generate_annotations=True).
  5. boxes/: Directory containing individual images for each detected element, organized by page number (if save_bounding_boxes=True). This is required for OCR.
  6. text.json: (Only if ocr=True) JSON file containing the extracted text for each element, sorted according to the structure defined in boxes.json.
  7. embeddings.json: (Only if embed=True) JSON file containing embeddings for each text element.
  8. hierarchy.json: (Only if hierarchy=True) JSON file containing the document hierarchy structure.

Table Parsing Features

This package integrates Docling for PDF parsing.
It is configured to run entirely offline with the required models bundled inside the package.

The library provides advanced table parsing capabilities:

  • Automatic Table Detection: Uses Docling integration for superior table detection.
  • Table Structure Classification: Automatically classifies tables as "good" (hierarchically parseable) or "bad" (requiring vision models).
  • Table Parsing: For "good" tables, builds hierarchical tree structures preserving table row relationships. For "bad" tables, uses vision models to extract structured data.
  • Table Content Analysis: When classify_tables=True, provides:
    • Table content classification
    • Table summaries
    • Key column identification

If debug mode is enabled (debug_mode=True), additional debug images might be saved, typically in a debug/ subdirectory within the output_dir, showing intermediate steps of the parsing process.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

kiwi_pdf_chunker-0.3.3.tar.gz (45.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

kiwi_pdf_chunker-0.3.3-py3-none-any.whl (44.4 kB view details)

Uploaded Python 3

File details

Details for the file kiwi_pdf_chunker-0.3.3.tar.gz.

File metadata

  • Download URL: kiwi_pdf_chunker-0.3.3.tar.gz
  • Upload date:
  • Size: 45.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.13

File hashes

Hashes for kiwi_pdf_chunker-0.3.3.tar.gz
Algorithm Hash digest
SHA256 ec2536359fb67b3d4d8fa7c7ae8783c6e6553cff1f744217f857f210dcfde98e
MD5 cb85f2b143b95f2dbacbec4808587afc
BLAKE2b-256 35e182bf97167f8c553d32d2a87d84784720423e31b6e5e6d15a29a9d6189e97

See more details on using hashes here.

File details

Details for the file kiwi_pdf_chunker-0.3.3-py3-none-any.whl.

File metadata

File hashes

Hashes for kiwi_pdf_chunker-0.3.3-py3-none-any.whl
Algorithm Hash digest
SHA256 bd157de709eae25e12e53cf04489dbf441b763b1abb33880c20cf28512b015aa
MD5 a04d62b35cd3783fbc51e3b51f0ea89f
BLAKE2b-256 05d35675fc3f16f0dea4ca699d27eb58425091921f1c52d3e323752564385e52

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page