A tool for parsing PDF document layouts and chunking content

PDF Parser

A Python package for parsing PDF document layouts using YOLO models, chunking content based on layout, and optionally performing OCR.

Features

  • Convert PDF documents to images for processing.
  • Detect document layout elements (e.g., paragraphs, tables, figures) using YOLO.
  • Process and refine bounding boxes.
  • Chunk document content based on detected layout.
  • (Optional) Perform OCR on detected elements using Azure Document Intelligence.
  • Save structured document data (layouts, chunks, OCR text) in JSON format.
  • Generate paragraph embeddings using an OpenAI embedding model.
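
The bounding-box refinement step is not specified in this description. As an illustration only, the sketch below shows one common approach: greedy non-maximum suppression by intersection over union (IoU), followed by a top-to-bottom, left-to-right sort. The `Box` type, the `refine` function, and the 0.5 threshold are hypothetical and are not part of the kiwi-pdf-chunker API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical axis-aligned box in pixel coordinates with a detector score."""
    x0: float
    y0: float
    x1: float
    y1: float
    label: str
    score: float

    @property
    def area(self) -> float:
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes (0.0 when they do not overlap)."""
    iw = min(a.x1, b.x1) - max(a.x0, b.x0)
    ih = min(a.y1, b.y1) - max(a.y0, b.y0)
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = a.area + b.area - inter
    return inter / union if union > 0 else 0.0


def refine(boxes: list[Box], iou_threshold: float = 0.5) -> list[Box]:
    """Drop lower-scoring near-duplicates, then sort into a simple reading order."""
    kept: list[Box] = []
    # Visit boxes from highest to lowest score so the best detection wins.
    for box in sorted(boxes, key=lambda b: b.score, reverse=True):
        if all(iou(box, k) < iou_threshold for k in kept):
            kept.append(box)
    # Assumed reading order: top-to-bottom, then left-to-right.
    return sorted(kept, key=lambda b: (b.y0, b.x0))


if __name__ == "__main__":
    detections = [
        Box(10, 10, 200, 60, "title", 0.95),
        Box(12, 11, 198, 62, "title", 0.60),  # near-duplicate of the title
        Box(10, 80, 200, 300, "paragraph", 0.90),
    ]
    for box in refine(detections):
        print(box.label, box.score)
```

A single-column sort like this is only a starting point; multi-column pages would need a column-aware ordering before chunking.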

Installation

Prerequisites

  • Python 3.10+
  • pip package manager
  • (Optional but recommended) A CUDA-capable GPU to accelerate YOLO model inference.
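
The prerequisites above can be checked before installing. The following is a minimal sketch using hypothetical helper names. It assumes PyTorch is the inference backend, which this description does not state; if PyTorch is not installed, the GPU check simply reports False.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 10)  # minimum version listed in the prerequisites


def python_ok(version: tuple = tuple(sys.version_info[:2])) -> bool:
    """Return True if the given (major, minor) version meets the minimum."""
    return tuple(version) >= MIN_PYTHON


def cuda_available() -> bool:
    """Report CUDA availability via PyTorch (assumed backend), else False."""
    if importlib.util.find_spec("torch") is None:
        return False
    try:
        import torch
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


if __name__ == "__main__":
    print("Python version OK:", python_ok())
    print("CUDA available:", cuda_available())
```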

Steps

  1. Install the package:
    pip install kiwi-pdf-chunker

User-Provided Data

You must supply the following data yourself; the package does not include it:

  1. Input Directory (input/): Place the PDF documents you want to process in a directory such as input/, and pass the path to your input file(s) when you use the package.
  2. Models Directory (models/): Download the required YOLO model(s) (e.g., doclayout_yolo_docstructbench_imgsz1024.pt) and place them in a dedicated directory such as models/. The parser needs the path to this directory or to the specific model file.
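
Before handing these paths to the parser, it can help to validate them. The sketch below uses only the standard library; `find_pdfs` and `resolve_model` are hypothetical helpers, not functions provided by kiwi-pdf-chunker, and the rule of accepting a directory only when it holds exactly one .pt file is an assumption made for the example.

```python
from pathlib import Path
from typing import List, Union

PathLike = Union[str, Path]


def find_pdfs(input_dir: PathLike) -> List[Path]:
    """Return the PDF files directly inside input_dir, sorted by name."""
    directory = Path(input_dir)
    if not directory.is_dir():
        raise NotADirectoryError(f"Input directory not found: {directory}")
    return sorted(
        p for p in directory.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def resolve_model(model_path: PathLike) -> Path:
    """Accept a model file, or a directory containing exactly one .pt file."""
    path = Path(model_path)
    if path.is_file():
        return path
    if path.is_dir():
        candidates = sorted(path.glob("*.pt"))
        if len(candidates) == 1:
            return candidates[0]
        raise FileNotFoundError(
            f"Expected exactly one .pt file in {path}, found {len(candidates)}"
        )
    raise FileNotFoundError(f"Model path not found: {path}")


if __name__ == "__main__":
    # Example directory names taken from the text above.
    print(find_pdfs("input"))
    print(resolve_model("models"))
```

Failing early with a clear message is usually cheaper than discovering a missing model file after the PDFs have already been rendered to images.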
