
A tool for parsing PDF document layouts and chunking content


PDF Parser

A Python package for parsing PDF document layouts using YOLO models, chunking content based on layout, and optionally performing OCR.

Features

  • Convert PDF documents to images for processing.
  • Detect document layout elements (e.g., paragraphs, tables, figures) using YOLO.
  • Process and refine bounding boxes.
  • Chunk document content based on detected layout.
  • (Optional) Perform OCR on detected elements using Azure Document Intelligence.
  • Save structured document data (layouts, chunks, OCR text) in JSON format.
  • Generate paragraph embeddings using an OpenAI embedder.
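
To illustrate what "processing and refining bounding boxes" can involve, the sketch below drops any detection that overlaps a higher-confidence detection above an intersection-over-union (IoU) threshold, then sorts the survivors top to bottom. This is a generic technique and not necessarily what the package does internally; the `Box` type, the `refine_boxes` function, and the 0.8 threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical detection: pixel corners, class label, confidence score."""
    x1: float
    y1: float
    x2: float
    y2: float
    label: str
    score: float


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a.x2, b.x2) - max(a.x1, b.x1))
    inter_h = max(0.0, min(a.y2, b.y2) - max(a.y1, b.y1))
    inter = inter_w * inter_h
    area_a = (a.x2 - a.x1) * (a.y2 - a.y1)
    area_b = (b.x2 - b.x1) * (b.y2 - b.y1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def refine_boxes(boxes: list[Box], iou_threshold: float = 0.8) -> list[Box]:
    """Keep the highest-scoring box among near-duplicates, in top-down order."""
    kept: list[Box] = []
    # Visit boxes from most to least confident so the best duplicate wins.
    for box in sorted(boxes, key=lambda b: b.score, reverse=True):
        if all(iou(box, other) < iou_threshold for other in kept):
            kept.append(box)
    # Approximate reading order for a single-column page: top edge, then left edge.
    return sorted(kept, key=lambda b: (b.y1, b.x1))
```

Sorting by the top edge only approximates reading order; multi-column pages would need column detection first.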

Installation

Prerequisites

  • Python 3.10+
  • pip package manager
  • (Optional but Recommended) CUDA-capable GPU for YOLO model inference acceleration.
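
A quick way to confirm these prerequisites is sketched below. It assumes the YOLO model runs on PyTorch, which the `.pt` model file suggests but the package description does not state; `check_environment` is a hypothetical helper, not part of the package.

```python
import sys


def check_environment(min_version: tuple[int, int] = (3, 10)) -> dict[str, bool]:
    """Report whether Python is recent enough and whether a CUDA GPU is visible."""
    python_ok = sys.version_info >= min_version
    cuda_available = False
    try:
        # Assumption: inference uses PyTorch. If torch is not installed,
        # the GPU check is skipped and reported as unavailable.
        import torch

        cuda_available = bool(torch.cuda.is_available())
    except ImportError:
        pass
    return {"python_ok": python_ok, "cuda_available": cuda_available}


if __name__ == "__main__":
    print(check_environment())
```

A `False` value for `cuda_available` does not prevent the parser from running; it only means inference would fall back to the CPU.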

Steps

  1. Install the package:
    pip install kiwi-pdf-chunker

User-Provided Data

The package does not ship with input documents or model weights. You must provide both:

  1. Input Directory (input/): Place the PDF documents you want to process in a directory (e.g., input/). You will need to provide the path to your input file(s) when using the package.
  2. Models Directory (models/): Download the necessary YOLO model(s) (e.g., doclayout_yolo_docstructbench_imgsz1024.pt) and place them in a dedicated directory (e.g., models/). The path to this directory (or the specific model file) will be needed by the parser.
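
Before running the parser, it can help to confirm that both locations are in place. The following is a minimal sketch using only the standard library; `collect_inputs` is a hypothetical helper and does not call the package's own API.

```python
from pathlib import Path


def collect_inputs(input_dir: str, model_path: str) -> tuple[list[Path], Path]:
    """Return the PDFs found in input_dir and the validated model file path."""
    pdf_dir = Path(input_dir)
    model = Path(model_path)
    if not pdf_dir.is_dir():
        raise FileNotFoundError(f"Input directory not found: {pdf_dir}")
    if not model.is_file():
        raise FileNotFoundError(f"Model file not found: {model}")
    # Match the extension case-insensitively so "REPORT.PDF" is also picked up.
    pdfs = sorted(p for p in pdf_dir.iterdir() if p.suffix.lower() == ".pdf")
    if not pdfs:
        raise FileNotFoundError(f"No PDF files found in: {pdf_dir}")
    return pdfs, model
```

For example, `collect_inputs("input/", "models/doclayout_yolo_docstructbench_imgsz1024.pt")` returns the sorted PDF paths and the model path, or raises `FileNotFoundError` naming whichever item is missing.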



