A tool for parsing PDF document layouts and chunking content

CV Document Chunker

A Python package for parsing PDF document layouts using YOLO models, chunking content based on layout, and optionally performing OCR.

Features

  • Convert PDF documents to images for processing.
  • Detect document layout elements (e.g., paragraphs, tables, figures) using YOLO.
  • Process and refine bounding boxes.
  • Chunk document content based on detected layout.
  • (Optional) Perform OCR on detected elements using Azure Document Intelligence.
  • Save structured document data (layouts, chunks, OCR text) in JSON format.
  • Generate paragraph embeddings using OpenAI or Azure OpenAI.
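
The "process and refine bounding boxes" step is not described further in this README. As a rough illustration only, one common refinement is to drop a lower-confidence box when it heavily overlaps a higher-confidence one. The sketch below is not the package's implementation; the box layout `(x1, y1, x2, y2, confidence)`, the function names, and the 0.5 threshold are all assumptions made for the example.

```python
# Hypothetical box layout: (x1, y1, x2, y2, confidence)
Box = tuple[float, float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def drop_overlaps(boxes: list[Box], threshold: float = 0.5) -> list[Box]:
    """Keep the most confident box among any group that overlaps above threshold."""
    kept: list[Box] = []
    # Visit boxes from most to least confident so the best one is kept first.
    for box in sorted(boxes, key=lambda b: b[4], reverse=True):
        if all(iou(box, other) <= threshold for other in kept):
            kept.append(box)
    return kept
```

A real layout pipeline would usually also take the element label into account (for example, a caption sitting inside a figure box), which this sketch ignores.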

Installation

Prerequisites

  • Python 3.10+
  • Pip package manager
  • (Optional but Recommended) CUDA-capable GPU for YOLO model inference acceleration.

Steps

  1. Install the Package:
    # pip install cv-doc-chunker
    

User-Provided Data

This package requires the user to provide certain data externally:

  1. Input Directory (input/): Place the PDF documents you want to process in a directory (e.g., input/). You will need to provide the path to your input file(s) when using the package.
  2. Models Directory (models/): Download the necessary YOLO model(s) (e.g., doclayout_yolo_docstructbench_imgsz1024.pt) and place them in a dedicated directory (e.g., models/). The path to this directory (or the specific model file) will be needed by the parser.
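
Because both the PDFs and the model file are supplied by the user, it can help to verify them before starting a long run. The helper below is a minimal sketch using only the standard library; `check_inputs` is a hypothetical name and is not part of the package.

```python
from pathlib import Path


def check_inputs(input_dir: str, model_file: str) -> list[Path]:
    """Return the PDFs found in input_dir, failing early if anything is missing."""
    model = Path(model_file)
    if not model.is_file():
        raise FileNotFoundError(f"YOLO model not found: {model}")

    folder = Path(input_dir)
    if not folder.is_dir():
        raise NotADirectoryError(f"Input directory not found: {folder}")

    # Match the extension case-insensitively so "scan.PDF" is also picked up.
    pdfs = sorted(
        p for p in folder.iterdir() if p.is_file() and p.suffix.lower() == ".pdf"
    )
    if not pdfs:
        raise FileNotFoundError(f"No PDF files found in {folder}")
    return pdfs
```

Calling `check_inputs("input/", "models/doclayout_yolo_docstructbench_imgsz1024.pt")` returns the sorted list of PDF paths, each of which can then be passed to the parser.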

Usage

The example below shows how to import the parser and process a single PDF from Python.

Example (Conceptual Python Usage):

import os

from cv_doc_chunker import PDFParser

# --- User Configuration ---
input_pdf_path = "path/to/your/input/document.pdf"  # Path to user's PDF
model_path = "path/to/your/models/doclayout_yolo.pt"  # Path to user's model
output_dir = "path/to/your/output/"  # Directory to save results

# Read credentials from the environment rather than hard-coding them.
# The variable names below are examples only; use whatever names you have set.
azure_key = os.environ["AZURE_OCR_KEY"]  # API key for Azure OCR
azure_endpoint = os.environ["AZURE_OCR_ENDPOINT"]  # API endpoint for Azure OCR
openai_api_key = os.environ["OPENAI_API_KEY"]

parser = PDFParser(
    ocr=True,
    embed=True,
    yolo_model_path=model_path,
    azure_key=azure_key,
    azure_endpoint=azure_endpoint,
    openai_api_key=openai_api_key,
)

# --- OR ---
# For Azure OpenAI embeddings, replace openai_api_key with these arguments:
# azure_openai_api_key=azure_openai_api_key,
# azure_openai_api_version=azure_openai_api_version,
# azure_openai_endpoint=azure_openai_endpoint

results = parser.parse_document(input_pdf_path, output_dir=output_dir, use_tesseract=True)
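
The choice between OpenAI and Azure OpenAI embeddings comes down to which keyword arguments are passed to the parser. One way to keep that choice out of the calling code is to build the arguments from environment variables, as in the sketch below. The helper name `embedding_kwargs` and the environment variable names are assumptions for illustration; only the keyword argument names come from the example above.

```python
import os
from typing import Mapping


def embedding_kwargs(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Choose OpenAI or Azure OpenAI keyword arguments from environment variables."""
    # Parser argument name -> (assumed) environment variable name.
    azure_names = {
        "azure_openai_api_key": "AZURE_OPENAI_API_KEY",
        "azure_openai_api_version": "AZURE_OPENAI_API_VERSION",
        "azure_openai_endpoint": "AZURE_OPENAI_ENDPOINT",
    }
    # Prefer Azure OpenAI only when all three of its settings are present.
    if all(env.get(var) for var in azure_names.values()):
        return {arg: env[var] for arg, var in azure_names.items()}
    if env.get("OPENAI_API_KEY"):
        return {"openai_api_key": env["OPENAI_API_KEY"]}
    raise KeyError("No OpenAI or Azure OpenAI credentials found in the environment")
```

The result can be unpacked into the constructor alongside the other arguments, for example `PDFParser(embed=True, yolo_model_path=model_path, **embedding_kwargs())`.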

Understanding the Output

After running the parser, the following outputs will typically be available in the specified output_dir:

  1. {your-document}_parsed.json: JSON file containing the detected document structure (element labels, coordinates, confidence).
  2. {your-document}_annotations/: Directory containing annotated images showing the detected elements for each page (if generate_annotations=True).
  3. {your-document}_boxes/: Directory containing individual images for each detected element, organized by page number (if save_bounding_boxes=True). This is required for OCR.
  4. {your-document}_sorted_text.json: (Only if ocr=True) JSON file containing the extracted text for each element, sorted according to the structure defined in _parsed.json.

If debug mode is enabled (debug_mode=True), additional debug images might be saved, typically in a debug/ subdirectory within the output_dir, showing intermediate steps of the parsing process.
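
To work with these outputs from a script, the paths can be derived from the input file name. The sketch below assumes that `{your-document}` is the PDF file name without its extension, which this README does not state explicitly; both helper names are hypothetical. It makes no assumption about the JSON schema and simply returns whatever the file contains.

```python
import json
from pathlib import Path


def expected_outputs(pdf_path: str, output_dir: str) -> dict[str, Path]:
    """Build the output paths listed above for one input PDF."""
    stem = Path(pdf_path).stem  # assumption: {your-document} is the PDF stem
    out = Path(output_dir)
    return {
        "parsed": out / f"{stem}_parsed.json",
        "annotations": out / f"{stem}_annotations",
        "boxes": out / f"{stem}_boxes",
        "sorted_text": out / f"{stem}_sorted_text.json",
    }


def load_json_if_present(path: Path):
    """Return the decoded JSON at path, or None if the file was not produced."""
    if not path.is_file():
        return None
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)
```

For example, `load_json_if_present(expected_outputs(input_pdf_path, output_dir)["sorted_text"])` returns `None` when OCR was disabled and the file was never written.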

License

This project is licensed under the MIT License - see the LICENSE file for details.
