A tool for parsing PDF document layouts and chunking content

PDF Parser

A Python package for parsing PDF document layouts using YOLO models, chunking content based on layout, and optionally performing OCR.

Features

  • Convert PDF documents to images for processing.
  • Detect document layout elements (e.g., paragraphs, tables, figures) using YOLO.
  • Process and refine bounding boxes.
  • Chunk document content based on detected layout.
  • (Optional) Perform OCR on detected elements using Azure Document Intelligence.
  • Save structured document data (layouts, chunks, OCR text) in JSON format.
  • Get paragraph embeddings using the OpenAI embedder.
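
The detect, refine, and chunk steps listed above can be illustrated with a small, library-independent sketch. This is not the package's implementation or API: the `Box` type and the `iou`, `refine`, and `chunk_in_reading_order` helpers are hypothetical names, the detections are made-up sample values, and the 0.5 overlap threshold is an arbitrary choice.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Box:
    """Hypothetical layout element: pixel coordinates plus a label and score."""
    x0: float
    y0: float
    x1: float
    y1: float
    label: str
    score: float


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    iw = max(0.0, min(a.x1, b.x1) - max(a.x0, b.x0))
    ih = max(0.0, min(a.y1, b.y1) - max(a.y0, b.y0))
    inter = iw * ih
    union = (a.x1 - a.x0) * (a.y1 - a.y0) + (b.x1 - b.x0) * (b.y1 - b.y0) - inter
    return inter / union if union > 0 else 0.0


def refine(boxes: list[Box], threshold: float = 0.5) -> list[Box]:
    """Keep the highest-scoring box among heavily overlapping detections."""
    kept: list[Box] = []
    for box in sorted(boxes, key=lambda b: b.score, reverse=True):
        if all(iou(box, k) < threshold for k in kept):
            kept.append(box)
    return kept


def chunk_in_reading_order(boxes: list[Box]) -> list[dict]:
    """Order boxes top-to-bottom, then left-to-right, and emit one chunk per box."""
    ordered = sorted(boxes, key=lambda b: (b.y0, b.x0))
    return [{"chunk_id": i, **asdict(b)} for i, b in enumerate(ordered)]


# Made-up detections for a single page (illustrative values only).
detections = [
    Box(50, 300, 550, 500, "table", 0.88),
    Box(50, 100, 550, 250, "paragraph", 0.95),
    Box(52, 102, 548, 252, "paragraph", 0.60),  # near-duplicate of the box above
]
chunks = chunk_in_reading_order(refine(detections))
print(json.dumps(chunks, indent=2))
```

Sorting by the top-left corner is only adequate for single-column pages; multi-column layouts would need a column-aware ordering.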

Installation

Prerequisites

  • Python 3.10+
  • Pip package manager
  • (Optional but Recommended) CUDA-capable GPU for YOLO model inference acceleration.

Steps

  1. Install the package:
    pip install kiwi-pdf-chunker
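
To confirm that the installation succeeded, one option is to query the installed distribution metadata from Python. The sketch below uses only the standard library; `installed_version` is a hypothetical helper name, and the distribution name is taken from the install command above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Prints the version string if the package is installed, otherwise None.
print(installed_version("kiwi-pdf-chunker"))
```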
    

User-Provided Data

The package does not include input documents or model weights; you must supply both:

  1. Input Directory (input/): Place the PDF documents you want to process in a directory (e.g., input/), and pass the path to your input file(s) when using the package.
  2. Models Directory (models/): Download the required YOLO model(s) (e.g., doclayout_yolo_docstructbench_imgsz1024.pt) into a dedicated directory (e.g., models/). The parser needs the path to this directory or to the specific model file.
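
Before handing paths to the parser, it can help to check that both user-provided items exist. The following is a minimal sketch using only `pathlib`; `collect_inputs` is a hypothetical helper, not part of the package, and the directory and file names are the examples given above.

```python
from pathlib import Path


def collect_inputs(input_dir: str, model_path: str) -> tuple[list[Path], Path]:
    """Return the sorted PDF paths in input_dir and the validated model path."""
    model = Path(model_path)
    if not model.is_file():
        raise FileNotFoundError(f"YOLO model not found: {model}")
    # Note: "*.pdf" matches case-sensitively on most Linux file systems.
    pdfs = sorted(p for p in Path(input_dir).glob("*.pdf") if p.is_file())
    if not pdfs:
        raise FileNotFoundError(f"No PDF files found in: {input_dir}")
    return pdfs, model


# Example call, assuming the layout described above:
# pdfs, model = collect_inputs(
#     "input", "models/doclayout_yolo_docstructbench_imgsz1024.pt"
# )
```

Failing early with a clear message avoids starting a long layout-detection run that would only fail once the model or the first document is loaded.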

Download files

Source Distribution

kiwi_pdf_chunker-0.2.5.tar.gz (39.8 kB)

Uploaded Source

Built Distribution

kiwi_pdf_chunker-0.2.5-py3-none-any.whl (40.3 kB)

Uploaded Python 3

File details

Details for the file kiwi_pdf_chunker-0.2.5.tar.gz.

File metadata

  • Download URL: kiwi_pdf_chunker-0.2.5.tar.gz
  • Upload date:
  • Size: 39.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.14

File hashes

Hashes for kiwi_pdf_chunker-0.2.5.tar.gz
Algorithm    Hash digest
SHA256       34a044420c3c9fa804600e9a76034c99f63ffa27570f4741736753e006e7f869
MD5          99105d0cae35affe19ae83feda9f2833
BLAKE2b-256  14e5d06585429c5931fe85e9cb2ce4587111203d99014f412bc7b7ae451abf34
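
A downloaded archive can be checked against the SHA256 digest listed above with the standard library. This is a sketch: `sha256_of` and `matches` are hypothetical helper names, and the local file path in the commented call is an assumption about where the download was saved.

```python
import hashlib
from pathlib import Path

# SHA256 digest listed above for kiwi_pdf_chunker-0.2.5.tar.gz.
EXPECTED_SHA256 = "34a044420c3c9fa804600e9a76034c99f63ffa27570f4741736753e006e7f869"


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size blocks so large files are not read at once."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()


def matches(path: Path, expected: str = EXPECTED_SHA256) -> bool:
    """Return True if the file's SHA256 digest equals the expected hex digest."""
    return sha256_of(path) == expected.lower()


# Example call, assuming the archive sits in the current directory:
# print(matches(Path("kiwi_pdf_chunker-0.2.5.tar.gz")))
```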

File details

Details for the file kiwi_pdf_chunker-0.2.5-py3-none-any.whl.

File hashes

Hashes for kiwi_pdf_chunker-0.2.5-py3-none-any.whl
Algorithm    Hash digest
SHA256       285426bdabb6e6efc4cd086c7cb0ccae80af5e7873741ed976beef13d87af6be
MD5          086812da244c37a7f141bd3831882b37
BLAKE2b-256  f2889e389d70834515af502ee3eecfa01df23e9ee305150f8e1e54506249d06a
