A tool for parsing PDF document layouts and chunking content

PDF Parser

A Python package for parsing PDF document layouts using YOLO models, chunking content based on layout, and optionally performing OCR.

Features

  • Convert PDF documents to images for processing.
  • Detect document layout elements (e.g., paragraphs, tables, figures) using YOLO.
  • Process and refine bounding boxes.
  • Chunk document content based on detected layout.
  • (Optional) Perform OCR on detected elements using Azure Document Intelligence.
  • Save structured document data (layouts, chunks, OCR text) in JSON format.
  • Generate paragraph embeddings using an OpenAI embedding model.
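
The layout-driven steps above (refining bounding boxes, chunking by layout, saving JSON) can be pictured with a small, library-free sketch. It is not the package's implementation: the `Box` type, the overlap threshold of 0.5, the reading-order rule, and the field names are all assumptions chosen for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Box:
    """Hypothetical detection: pixel corners, a layout label, a confidence score."""
    x0: float
    y0: float
    x1: float
    y1: float
    label: str
    score: float


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a.x1, b.x1) - max(a.x0, b.x0))
    ih = max(0.0, min(a.y1, b.y1) - max(a.y0, b.y0))
    inter = iw * ih
    union = (a.x1 - a.x0) * (a.y1 - a.y0) + (b.x1 - b.x0) * (b.y1 - b.y0) - inter
    return inter / union if union > 0 else 0.0


def refine(boxes: list[Box], threshold: float = 0.5) -> list[Box]:
    """Keep only the highest-scoring box among heavily overlapping detections."""
    kept: list[Box] = []
    for box in sorted(boxes, key=lambda b: b.score, reverse=True):
        if all(iou(box, k) < threshold for k in kept):
            kept.append(box)
    return kept


def chunk(boxes: list[Box]) -> list[dict]:
    """Order boxes top-to-bottom, then left-to-right, and emit one chunk per box."""
    ordered = sorted(boxes, key=lambda b: (b.y0, b.x0))
    return [{"chunk_id": i, **asdict(b)} for i, b in enumerate(ordered)]


# Made-up detections for a single page.
detections = [
    Box(50, 300, 550, 500, "table", 0.91),
    Box(50, 40, 550, 120, "paragraph", 0.97),
    Box(52, 42, 548, 118, "paragraph", 0.60),  # near-duplicate of the box above
]
chunks = chunk(refine(detections))
print(json.dumps(chunks, indent=2))
```

The near-duplicate paragraph box is dropped during refinement, so two chunks remain, with the paragraph ordered before the table.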

Installation

Prerequisites

  • Python 3.10+
  • pip package manager
  • (Optional but recommended) A CUDA-capable GPU to accelerate YOLO model inference.
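
A quick way to check these prerequisites before installing is sketched below. It assumes the YOLO models run on PyTorch (suggested by the `.pt` model file, but not stated here), so treat the `torch` check as an assumption; `check_prerequisites` is a hypothetical helper, not part of the package.

```python
import sys

try:
    import torch  # optional: only present if PyTorch is already installed
except ImportError:
    torch = None


def check_prerequisites() -> dict:
    """Report whether the interpreter and the optional GPU stack look usable."""
    return {
        "python_ok": sys.version_info >= (3, 10),
        "torch_installed": torch is not None,
        "cuda_available": bool(torch is not None and torch.cuda.is_available()),
    }


print(check_prerequisites())
```

If `cuda_available` is false, inference should still be possible on the CPU, only slower.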

Steps

  1. Install the package:
    pip install kiwi-pdf-chunker
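
To confirm that the install succeeded, the standard library can report the installed distribution version. This is a minimal sketch using only `importlib.metadata`; `installed_version` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


print(installed_version("kiwi-pdf-chunker"))
```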
    

User-Provided Data

You must supply the following data yourself:

  1. Input directory (e.g., input/): Place the PDF documents you want to process here. You provide the path to your input file(s) when using the package.
  2. Models directory (e.g., models/): Download the required YOLO model(s) (e.g., doclayout_yolo_docstructbench_imgsz1024.pt) and place them here. The parser needs the path to this directory or to the specific model file.
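
Because both paths come from you, it can help to validate them before starting a long run. The sketch below uses only `pathlib`; `find_inputs` is a hypothetical helper, and the directory layout mirrors the examples above rather than anything the package enforces.

```python
from pathlib import Path


def find_inputs(input_dir: str, model_path: str) -> list[Path]:
    """Return the PDFs to process, failing early if the model or inputs are missing."""
    model = Path(model_path)
    if not model.is_file():
        raise FileNotFoundError(f"YOLO model not found: {model}")
    pdfs = sorted(Path(input_dir).glob("*.pdf"))
    if not pdfs:
        raise FileNotFoundError(f"No PDF files found in: {input_dir}")
    return pdfs
```

For example, `find_inputs("input", "models/doclayout_yolo_docstructbench_imgsz1024.pt")` returns the sorted list of PDF paths, or raises `FileNotFoundError` naming whichever piece is missing.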

Download files

Source Distribution

kiwi_pdf_chunker-0.2.4.tar.gz (39.8 kB)

Uploaded Source

Built Distribution

kiwi_pdf_chunker-0.2.4-py3-none-any.whl (40.3 kB)

Uploaded Python 3

File details

Details for the file kiwi_pdf_chunker-0.2.4.tar.gz.

File metadata

  • Download URL: kiwi_pdf_chunker-0.2.4.tar.gz
  • Size: 39.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.14

File hashes

Hashes for kiwi_pdf_chunker-0.2.4.tar.gz
Algorithm Hash digest
SHA256 cb4aa4a8aa55b594919346f49ed53c3dd2257f2cd43d148fff97ce22c91ccad9
MD5 a987d0adc96059564934afeeb8955996
BLAKE2b-256 b6f8bc37429e1880bdc469480e05c1959e0f58ee523676e7af11346526cee123
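
A downloaded file can be checked against the published SHA256 digest with the standard library. This is a minimal sketch; `sha256_matches` is a hypothetical helper, and the local file path is whatever location you downloaded the archive to.

```python
import hashlib
from pathlib import Path


def sha256_matches(path: str, expected_hex: str) -> bool:
    """Hash the file in 1 MiB blocks and compare with the expected hex digest."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for block in iter(lambda: fh.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest() == expected_hex.strip().lower()
```

For the source distribution, the call would be `sha256_matches("kiwi_pdf_chunker-0.2.4.tar.gz", "cb4aa4a8aa55b594919346f49ed53c3dd2257f2cd43d148fff97ce22c91ccad9")`.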

File details

Details for the file kiwi_pdf_chunker-0.2.4-py3-none-any.whl.

File hashes

Hashes for kiwi_pdf_chunker-0.2.4-py3-none-any.whl
Algorithm Hash digest
SHA256 cd449547bbbe5ebcf6b2bba1e90991bd9a877eb44b04806d2d732f14425edc26
MD5 88d8525933fe167cc202027adcad3e76
BLAKE2b-256 52f0de0654ea32e52e0a9be87e8d3c2a3e059ae2988a8a4ff913a0e352093487
