A tool for parsing PDF document layouts and chunking content
Project description
PDF Parser
A Python package for parsing PDF document layouts using YOLO models, chunking content based on layout, and optionally performing OCR.
Features
- Convert PDF documents to images for processing.
- Detect document layout elements (e.g., paragraphs, tables, figures) using YOLO.
- Process and refine bounding boxes.
- Chunk document content based on detected layout.
- (Optional) Perform OCR on detected elements using Azure Document Intelligence.
- Save structured document data (layouts, chunks, OCR text) in JSON format.
- Get paragraph embeddings using an OpenAI embedder.
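To make the middle of this pipeline concrete, here is a minimal sketch of the bounding-box refinement, layout-based chunking, and JSON export steps. It is not the package's own implementation or API: the `Box` schema, the `refine` and `to_chunks` helpers, the 0.5 overlap threshold, and the sample detections are all hypothetical, and the refinement shown is a plain overlap filter (keep the highest-scoring box among near-duplicates), which may differ from what the package does.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Box:
    """A detected layout element in page pixel coordinates (hypothetical schema)."""
    label: str
    x0: float
    y0: float
    x1: float
    y1: float
    score: float


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a.x1, b.x1) - max(a.x0, b.x0))
    ih = max(0.0, min(a.y1, b.y1) - max(a.y0, b.y0))
    inter = iw * ih
    union = (a.x1 - a.x0) * (a.y1 - a.y0) + (b.x1 - b.x0) * (b.y1 - b.y0) - inter
    return inter / union if union > 0 else 0.0


def refine(boxes: list[Box], iou_threshold: float = 0.5) -> list[Box]:
    """Keep the highest-scoring box among heavily overlapping detections."""
    kept: list[Box] = []
    for box in sorted(boxes, key=lambda b: b.score, reverse=True):
        if all(iou(box, k) < iou_threshold for k in kept):
            kept.append(box)
    return kept


def to_chunks(boxes: list[Box]) -> list[dict]:
    """Order boxes top-to-bottom, then left-to-right, one chunk per box."""
    ordered = sorted(boxes, key=lambda b: (b.y0, b.x0))
    return [{"chunk_id": i, **asdict(b)} for i, b in enumerate(ordered)]


# Made-up detections for a single page, standing in for YOLO output.
detections = [
    Box("paragraph", 50, 300, 550, 420, 0.91),
    Box("title", 50, 40, 550, 90, 0.97),
    Box("paragraph", 52, 305, 548, 418, 0.60),  # near-duplicate of the first box
    Box("table", 50, 450, 550, 700, 0.88),
]
chunks = to_chunks(refine(detections))
print(json.dumps(chunks, indent=2))
```

Sorting by the top edge and then the left edge is only adequate for single-column pages; multi-column layouts would need a column-aware ordering.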
Installation
Prerequisites
- Python 3.10+
- pip package manager
- (Optional but Recommended) CUDA-capable GPU for YOLO model inference acceleration.
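A quick way to check these prerequisites from Python is sketched below. The `check_prerequisites` helper is hypothetical, and it assumes the YOLO model runs on PyTorch (suggested by the `.pt` model file, but not stated here); if PyTorch is not installed, the GPU status is reported as unknown rather than guessed.

```python
import sys


def check_prerequisites() -> dict:
    """Report whether the interpreter is new enough and whether CUDA is visible."""
    report = {"python_ok": sys.version_info >= (3, 10), "cuda_available": None}
    try:
        import torch  # assumption: the YOLO model runs on PyTorch
    except ImportError:
        return report  # PyTorch not installed, so GPU status stays unknown
    report["cuda_available"] = torch.cuda.is_available()
    return report


print(check_prerequisites())
```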
Steps
- Install the package:
pip install kiwi-pdf-chunker
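To confirm the install succeeded without assuming anything about the package's import name, you can query the installed distribution metadata through the standard library. The `installed_version` helper below is a hypothetical convenience wrapper; only the distribution name `kiwi-pdf-chunker` comes from the install command.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Prints the version string if the package is installed, otherwise None.
print(installed_version("kiwi-pdf-chunker"))
```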
User-Provided Data
This package requires the user to provide certain data externally:
- Input Directory (input/): Place the PDF documents you want to process in a directory (e.g., input/). You will need to provide the path to your input file(s) when using the package.
- Models Directory (models/): Download the necessary YOLO model(s) (e.g., doclayout_yolo_docstructbench_imgsz1024.pt) and place them in a dedicated directory (e.g., models/). The path to this directory (or the specific model file) will be needed by the parser.
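Because both locations are supplied by you, it can help to validate them before starting a run. The sketch below uses only the standard library; `find_inputs` is a hypothetical helper, not part of the package, and the paths simply mirror the example names above.

```python
from pathlib import Path


def find_inputs(input_dir: str, model_path: str) -> list[Path]:
    """Return the PDFs in input_dir after checking that the model file exists."""
    model = Path(model_path)
    if not model.is_file():
        raise FileNotFoundError(f"YOLO model not found: {model}")
    folder = Path(input_dir)
    if not folder.is_dir():
        raise NotADirectoryError(f"Input directory not found: {folder}")
    # Match the extension case-insensitively so report.PDF is picked up too.
    return sorted(p for p in folder.iterdir() if p.suffix.lower() == ".pdf")


try:
    pdfs = find_inputs("input", "models/doclayout_yolo_docstructbench_imgsz1024.pt")
    print(f"{len(pdfs)} PDF(s) ready for parsing")
except (FileNotFoundError, NotADirectoryError) as err:
    print(err)
```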
Download files
Source Distribution
kiwi_pdf_chunker-0.2.1.tar.gz (74.5 kB)
Built Distribution
File details
Details for the file kiwi_pdf_chunker-0.2.1.tar.gz.
File metadata
- Download URL: kiwi_pdf_chunker-0.2.1.tar.gz
- Upload date:
- Size: 74.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.14
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | dfff95b6e988c0f820c89f952105b80758ba3f909e64b245b50d937acb759cf5 |
| MD5 | 1f3f68613167dd76fd474d0cb9121db3 |
| BLAKE2b-256 | 5bcb445dc38c6f8d5f95d6cd8b426f36f828543b028981917bf71695d7180703 |
File details
Details for the file kiwi_pdf_chunker-0.2.1-py3-none-any.whl.
File metadata
- Download URL: kiwi_pdf_chunker-0.2.1-py3-none-any.whl
- Upload date:
- Size: 78.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.14
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9a7ef14ff6b5bf46f09d7ba7f51ab6e13a80a46078214fe617156566e32785cf |
| MD5 | 601ab26e49d0da8b2ec88dfddc510d44 |
| BLAKE2b-256 | 36c431ffb2fcc6c09c9827931230b162c29dbb72d4fab5e732ca1a128108ff31 |
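If you download one of these files manually, you can compare it against the published SHA256 digest. The sketch below uses the standard `hashlib` module; `sha256_of` is a hypothetical helper, the expected value is the source distribution's digest listed above, and the file is assumed to sit in the current directory.

```python
import hashlib
from pathlib import Path


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size blocks so large files are not loaded whole."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


EXPECTED = "dfff95b6e988c0f820c89f952105b80758ba3f909e64b245b50d937acb759cf5"
archive = Path("kiwi_pdf_chunker-0.2.1.tar.gz")  # assumed download location
if archive.exists():
    print("match" if sha256_of(str(archive)) == EXPECTED else "MISMATCH")
```

To check the wheel instead, substitute its file name and SHA256 digest.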