A tool for parsing PDF document layouts and chunking content

PDF Parser

A Python package for parsing PDF document layouts using YOLO models, chunking content based on layout, and optionally performing OCR.

Features

  • Convert PDF documents to images for processing.
  • Detect document layout elements (e.g., paragraphs, tables, figures) using YOLO.
  • Process and refine bounding boxes.
  • Chunk document content based on detected layout.
  • (Optional) Perform OCR on detected elements using Azure Document Intelligence.
  • Save structured document data (layouts, chunks, OCR text) in JSON format.
  • Generate paragraph embeddings using an OpenAI embedder.
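
To make the "refine bounding boxes" and "chunk by layout" steps concrete, here is a minimal, self-contained sketch. It does not use this package's API: the Box class, the refine and chunk helpers, the 0.5 overlap threshold, and the sample detections are all hypothetical, chosen only to illustrate one common approach (drop lower-scoring boxes that heavily overlap a higher-scoring one, then order the survivors top to bottom on each page).

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Box:
    """A detected layout element on one page (hypothetical structure)."""
    page: int
    label: str
    x0: float
    y0: float
    x1: float
    y1: float
    score: float


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    iw = max(0.0, min(a.x1, b.x1) - max(a.x0, b.x0))
    ih = max(0.0, min(a.y1, b.y1) - max(a.y0, b.y0))
    inter = iw * ih
    union = (a.x1 - a.x0) * (a.y1 - a.y0) + (b.x1 - b.x0) * (b.y1 - b.y0) - inter
    return inter / union if union > 0 else 0.0


def refine(boxes: list[Box], threshold: float = 0.5) -> list[Box]:
    """Keep the highest-scoring box among heavily overlapping boxes on a page."""
    kept: list[Box] = []
    for box in sorted(boxes, key=lambda b: b.score, reverse=True):
        if all(k.page != box.page or iou(k, box) < threshold for k in kept):
            kept.append(box)
    return kept


def chunk(boxes: list[Box]) -> list[dict]:
    """Order boxes by page, then top-to-bottom, left-to-right; one chunk per box."""
    ordered = sorted(boxes, key=lambda b: (b.page, b.y0, b.x0))
    return [{"chunk_id": i, **asdict(b)} for i, b in enumerate(ordered)]


# Made-up detections standing in for YOLO output.
detections = [
    Box(0, "paragraph", 50, 300, 500, 400, 0.90),
    Box(0, "paragraph", 52, 302, 498, 398, 0.60),  # near-duplicate of the box above
    Box(0, "title", 50, 50, 500, 100, 0.95),
    Box(1, "table", 40, 80, 520, 600, 0.88),
]
chunks = chunk(refine(detections))
print(json.dumps(chunks, indent=2))
```

The near-duplicate paragraph box is discarded, and the remaining elements are emitted in reading order as JSON-serializable dictionaries. Multi-column pages would need a more careful ordering rule than this simple sort.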

Installation

Prerequisites

  • Python 3.10+
  • pip package manager
  • (Optional but recommended) CUDA-capable GPU to accelerate YOLO model inference.

Steps

  1. Install the Package:
    pip install kiwi-pdf-chunker
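
To confirm that the install succeeded, one option is to ask Python for the installed distribution's version. This sketch uses only the standard library's importlib.metadata; the installed_version helper is a hypothetical name, and the distribution name is taken from the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Prints a version string if the package is installed, otherwise None.
print(installed_version("kiwi-pdf-chunker"))
```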
    

User-Provided Data

The package does not ship with input documents or model weights; you must provide both:

  1. Input Directory (input/): Place the PDF documents you want to process in a directory such as input/. You pass the path to your input file(s) when you use the package.
  2. Models Directory (models/): Download the required YOLO model(s) (e.g., doclayout_yolo_docstructbench_imgsz1024.pt) and place them in a dedicated directory such as models/. The parser needs the path to this directory or to the specific model file.
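
Before handing paths to the parser, it can help to check that both locations exist. The sketch below is not part of this package: find_pdfs and find_model are hypothetical helpers, and the temporary directory only stands in for a real project layout so the example runs anywhere.

```python
import tempfile
from pathlib import Path

MODEL_NAME = "doclayout_yolo_docstructbench_imgsz1024.pt"  # example name from the text above


def find_pdfs(input_dir: Path) -> list[Path]:
    """Return the PDF files directly inside input_dir, sorted by name."""
    if not input_dir.is_dir():
        raise NotADirectoryError(f"Input directory not found: {input_dir}")
    return sorted(
        p for p in input_dir.iterdir() if p.is_file() and p.suffix.lower() == ".pdf"
    )


def find_model(models_dir: Path, filename: str) -> Path:
    """Return the path to a model file, failing early if it is missing."""
    model = models_dir / filename
    if not model.is_file():
        raise FileNotFoundError(f"Model file not found: {model}")
    return model


# Build a throwaway input/ and models/ layout to demonstrate the checks.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "input").mkdir()
    (root / "models").mkdir()
    (root / "input" / "report.pdf").write_bytes(b"%PDF-1.4\n")
    (root / "input" / "notes.txt").write_text("not a pdf")
    (root / "models" / MODEL_NAME).write_bytes(b"")  # placeholder, not real weights

    pdf_names = [p.name for p in find_pdfs(root / "input")]
    model_name = find_model(root / "models", MODEL_NAME).name

print(pdf_names, model_name)
```

In a real project, replace the temporary directory with your own input/ and models/ paths.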

Download files

Source Distribution

kiwi_pdf_chunker-0.2.3.tar.gz (39.8 kB)

Uploaded Source

Built Distribution

kiwi_pdf_chunker-0.2.3-py3-none-any.whl (40.3 kB)

Uploaded Python 3

File details

Details for the file kiwi_pdf_chunker-0.2.3.tar.gz.

File metadata

  • Download URL: kiwi_pdf_chunker-0.2.3.tar.gz
  • Upload date:
  • Size: 39.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.14

File hashes

Hashes for kiwi_pdf_chunker-0.2.3.tar.gz
Algorithm    Hash digest
SHA256       1c5f76480b8bb96c89a4fd7ce27aca3a818345973e646ec1a91d2cda6e4eb1e3
MD5          6f052b06ec4ded27760b32e10f0c2264
BLAKE2b-256  2f0d3ffffcb4b08caffc95bf8d037c2e91238aa36162fb4257f7fd92ae1cf08d
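
If you download a file manually, you can compare its SHA256 digest with the published value. The following is a minimal sketch using only the standard library; sha256_of and verify are hypothetical helper names, and the temporary sample file stands in for a real download.

```python
import hashlib
import hmac
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA256 hex digest of a file, reading it in blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify(path: Path, expected_hex: str) -> bool:
    """Return True if the file's SHA256 digest equals the expected hex string."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())


# Demonstrate on a small temporary file rather than a real download.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"hello")
    expected = hashlib.sha256(b"hello").hexdigest()
    ok = verify(sample, expected)
    bad = verify(sample, "0" * 64)

print(ok, bad)
```

For a real check, pass the path of the downloaded archive and the SHA256 value listed for that file.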

File details

Details for the file kiwi_pdf_chunker-0.2.3-py3-none-any.whl.

File hashes

Hashes for kiwi_pdf_chunker-0.2.3-py3-none-any.whl
Algorithm    Hash digest
SHA256       36a1792a03d744af9bf64a7f3d54d7f271c8091e6a34cd39b04b4b9d1139fc56
MD5          cbc706bce2281b0a4f5c2e2c42024ed5
BLAKE2b-256  55696373da0316f71a9e0a169b46edcfed4550055f7cf4d8d385aac4fd543075
