A tool for parsing PDF document layouts and chunking content
PDF Parser
A Python package for parsing PDF document layouts using YOLO models, chunking content based on layout, and optionally performing OCR.
Features
- Convert PDF documents to images for processing.
- Detect document layout elements (e.g., paragraphs, tables, figures) using YOLO.
- Process and refine bounding boxes.
- Chunk document content based on detected layout.
- (Optional) Perform OCR on detected elements using Azure Document Intelligence.
- Save structured document data (layouts, chunks, OCR text) in JSON format.
- Generate paragraph embeddings using an OpenAI embedder.
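To make the "refine bounding boxes" and "chunk by layout" steps concrete, here is a minimal, dependency-free sketch. It does not use the package's actual API, which is not documented in this README: the `Box` type and the `iou`, `refine`, and `chunk` functions are hypothetical names introduced for illustration, and greedy non-maximum suppression is only one plausible refinement strategy, assumed here rather than confirmed as the package's method.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical detection: pixel coordinates, layout label, confidence score."""
    x0: float
    y0: float
    x1: float
    y1: float
    label: str
    score: float

    @property
    def area(self) -> float:
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = min(a.x1, b.x1) - max(a.x0, b.x0)
    ih = min(a.y1, b.y1) - max(a.y0, b.y0)
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = a.area + b.area - inter
    return inter / union if union > 0 else 0.0


def refine(boxes, iou_threshold=0.5):
    """Greedy non-maximum suppression: keep the best-scoring box of each overlapping group."""
    kept = []
    for box in sorted(boxes, key=lambda b: b.score, reverse=True):
        if all(iou(box, k) < iou_threshold for k in kept):
            kept.append(box)
    return kept


def chunk(boxes):
    """Emit one chunk per box, ordered top-to-bottom and then left-to-right."""
    ordered = sorted(boxes, key=lambda b: (b.y0, b.x0))
    return [
        {"index": i, "label": b.label, "bbox": [b.x0, b.y0, b.x1, b.y1]}
        for i, b in enumerate(ordered)
    ]


# Made-up detections for a single page, used only to exercise the sketch.
detections = [
    Box(10, 200, 300, 400, "table", 0.90),
    Box(12, 205, 298, 395, "table", 0.60),  # near-duplicate of the box above
    Box(10, 20, 300, 180, "paragraph", 0.95),
]
chunks = chunk(refine(detections))
print(json.dumps(chunks, indent=2))
```

The near-duplicate table box is suppressed because it overlaps heavily with a higher-scoring one, and the remaining boxes are serialized to JSON in reading order. A simple top-to-bottom sort only approximates reading order on single-column pages; multi-column layouts would need a more careful ordering.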
Installation
Prerequisites
- Python 3.10+
- pip package manager
- (Optional but recommended) CUDA-capable GPU to accelerate YOLO model inference.
Steps
- Install the package:
    pip install kiwi-pdf-chunker
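As a quick sanity check after the prerequisites and install step above, the following sketch uses only the standard library to confirm the interpreter version and look up the installed distribution. The helper names `python_ok` and `installed_version` are hypothetical; the distribution name `kiwi-pdf-chunker` is taken from the install command. GPU availability is not checked here, since that depends on which deep learning backend is installed.

```python
import sys
from importlib import metadata
from typing import Optional, Tuple


def python_ok(version: Tuple[int, int] = sys.version_info[:2]) -> bool:
    """Return True if the given (major, minor) version satisfies the 3.10+ prerequisite."""
    return version >= (3, 10)


def installed_version(dist_name: str = "kiwi-pdf-chunker") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("Python 3.10+:", python_ok())
    print("kiwi-pdf-chunker version:", installed_version())
```

If the second line prints `None`, the package is not visible to the interpreter you are running, which usually means pip installed it into a different environment.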
User-Provided Data
This package requires the user to provide certain data externally:
- Input Directory (input/): Place the PDF documents you want to process in a directory (e.g., input/). You will need to provide the path to your input file(s) when using the package.
- Models Directory (models/): Download the necessary YOLO model(s) (e.g., doclayout_yolo_docstructbench_imgsz1024.pt) and place them in a dedicated directory (e.g., models/). The path to this directory (or the specific model file) will be needed by the parser.
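Before handing paths to the parser, it can help to verify that both user-provided directories are in place. The sketch below is one way to do that with `pathlib`; `find_pdfs` and `find_model` are hypothetical helpers, not part of the package, and the default model file name is simply the example given above.

```python
from pathlib import Path


def find_pdfs(input_dir):
    """Return the PDF files directly inside input_dir, sorted by path."""
    root = Path(input_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"Input directory not found: {root}")
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def find_model(models_dir, filename="doclayout_yolo_docstructbench_imgsz1024.pt"):
    """Return the path to the model file, or raise if it has not been downloaded."""
    path = Path(models_dir) / filename
    if not path.is_file():
        raise FileNotFoundError(f"Model file not found: {path}")
    return path
```

A typical call would be `find_pdfs("input")` followed by `find_model("models")`, assuming the directory names suggested above. Failing early with a clear message is usually easier to debug than a missing-file error raised deep inside model loading.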
File details
Details for the file kiwi_pdf_chunker-0.2.2.tar.gz.
File metadata
- Download URL: kiwi_pdf_chunker-0.2.2.tar.gz
- Upload date:
- Size: 74.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.14
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 426c654d43f20bc6284d1e60f18467413c218ed453a42601c6516b8143cb0085 |
| MD5 | 05a2c6602825d9311d7c260c1f6a4d2a |
| BLAKE2b-256 | 38785069f5d1722afc1986da84fe8f6a2e7eb83a812e79ff333b2d6cfe8c406e |
File details
Details for the file kiwi_pdf_chunker-0.2.2-py3-none-any.whl.
File metadata
- Download URL: kiwi_pdf_chunker-0.2.2-py3-none-any.whl
- Upload date:
- Size: 78.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.14
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d6542dcc47840ef00bf98d31a6d4efb1e51cb77eed94a82a5aeeb3d79150a393 |
| MD5 | 38ecab96c6929c0507ecc07470d1181d |
| BLAKE2b-256 | a3304fdd6aa4063bb88fd98441ef871bf7c4e16353b8f9ed145f488422700b63 |
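If you download a release file manually, you can compare its SHA256 digest against the value listed above. This is a minimal sketch using `hashlib` from the standard library; `sha256_of` and `matches_sha256` are hypothetical helper names, and the file path you pass in is whatever location you saved the download to.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in fixed-size blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_sha256(path, expected_hex):
    """Return True if the file's SHA256 digest equals the expected hex string."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches_sha256("kiwi_pdf_chunker-0.2.2.tar.gz", "426c654d43f20bc6284d1e60f18467413c218ed453a42601c6516b8143cb0085")` should return `True` for an intact copy of the source distribution.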