A tool for parsing PDF document layouts and chunking content
PDF Parser
A Python package for parsing PDF document layouts using YOLO models, chunking content based on layout, and optionally performing OCR.
Features
- Convert PDF documents to images for processing.
- Detect document layout elements (e.g., paragraphs, tables, figures) using YOLO.
- Process and refine bounding boxes.
- Chunk document content based on detected layout.
- (Optional) Perform OCR on detected elements using Azure Document Intelligence.
- Save structured document data (layouts, chunks, OCR text) in JSON format.
- Generate paragraph embeddings using an OpenAI embedder.
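The bounding-box refinement and layout-based chunking steps listed above can be illustrated without the package itself. The following is a minimal sketch, not this package's implementation: the `Box` schema, the `refine` and `chunk` helpers, the IoU threshold of 0.5, and the sample detections are all hypothetical. Refinement is approximated here by greedy non-maximum suppression, and chunk ordering assumes single-column pages.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Box:
    """A detected layout element in pixel coordinates (hypothetical schema)."""
    label: str
    x0: float
    y0: float
    x1: float
    y1: float
    score: float


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    iw = max(0.0, min(a.x1, b.x1) - max(a.x0, b.x0))
    ih = max(0.0, min(a.y1, b.y1) - max(a.y0, b.y0))
    inter = iw * ih
    union = (a.x1 - a.x0) * (a.y1 - a.y0) + (b.x1 - b.x0) * (b.y1 - b.y0) - inter
    return inter / union if union > 0 else 0.0


def refine(boxes: list[Box], iou_threshold: float = 0.5) -> list[Box]:
    """Greedy non-maximum suppression: keep the best-scoring box of each overlapping group."""
    kept: list[Box] = []
    for box in sorted(boxes, key=lambda b: b.score, reverse=True):
        if all(iou(box, k) < iou_threshold for k in kept):
            kept.append(box)
    return kept


def chunk(boxes: list[Box], page: int) -> list[dict]:
    """Order boxes top-to-bottom, then left-to-right, and emit one chunk per box."""
    ordered = sorted(boxes, key=lambda b: (b.y0, b.x0))
    return [{"page": page, "order": i, **asdict(b)} for i, b in enumerate(ordered)]


# Made-up detections standing in for YOLO output on one page.
detections = [
    Box("paragraph", 50, 300, 550, 420, 0.91),
    Box("title", 50, 40, 550, 90, 0.97),
    Box("paragraph", 55, 305, 545, 415, 0.62),  # near-duplicate of the first box
    Box("table", 50, 450, 550, 700, 0.88),
]
chunks = chunk(refine(detections), page=1)
print(json.dumps(chunks, indent=2))  # structured output, ready to save as JSON
```

The near-duplicate paragraph box is suppressed because it overlaps a higher-scoring box, and the remaining three elements are emitted in reading order. Multi-column layouts would need a more careful ordering rule than a plain sort on the top-left corner.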
Installation
Prerequisites
- Python 3.10+
- pip package manager
- (Optional but recommended) CUDA-capable GPU for YOLO model inference acceleration.
Steps
- Install the package:
pip install kiwi-pdf-chunker
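To confirm that the prerequisites and the installation are in place, a short check along these lines may help. It is a sketch using only the standard library; the helper name `installed_version` is hypothetical, and the snippet makes no assumption about the package's import name, only about its distribution name on PyPI.

```python
import sys
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# The prerequisites call for Python 3.10 or newer.
python_ok = sys.version_info >= (3, 10)
print("Python 3.10+:", python_ok)
print("kiwi-pdf-chunker:", installed_version("kiwi-pdf-chunker") or "not installed")
```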
User-Provided Data
This package requires the user to provide certain data externally:
- Input Directory (`input/`): Place the PDF documents you want to process in a directory (e.g., `input/`). You will need to provide the path to your input file(s) when using the package.
- Models Directory (`models/`): Download the necessary YOLO model(s) (e.g., `doclayout_yolo_docstructbench_imgsz1024.pt`) and place them in a dedicated directory (e.g., `models/`). The path to this directory (or the specific model file) will be needed by the parser.
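Before handing these paths to the parser, it can be useful to check that they point at real files. The helpers below are a sketch under stated assumptions: `find_pdfs` and `resolve_model` are hypothetical names, not part of this package, and the rule of picking the first `.pt` file in a directory is only one possible convention.

```python
from pathlib import Path


def find_pdfs(input_dir: str) -> list[Path]:
    """Return the PDF files directly inside input_dir, sorted by path."""
    root = Path(input_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"Input directory not found: {root}")
    return sorted(p for p in root.iterdir() if p.is_file() and p.suffix.lower() == ".pdf")


def resolve_model(models_path: str) -> Path:
    """Accept a model file, or a directory containing at least one .pt file."""
    path = Path(models_path)
    if path.is_file():
        return path
    if not path.is_dir():
        raise FileNotFoundError(f"Models path not found: {path}")
    candidates = sorted(path.glob("*.pt"))
    if not candidates:
        raise FileNotFoundError(f"No .pt model file found in {path}")
    return candidates[0]


# Example usage, assuming the directory layout described above:
#   pdfs = find_pdfs("input/")
#   model = resolve_model("models/")
```

Failing early with a clear `FileNotFoundError` tends to be easier to debug than an error raised deep inside model loading.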
Download files
Source Distribution
- kiwi_pdf_chunker-0.2.5.tar.gz (39.8 kB)
Built Distribution
- kiwi_pdf_chunker-0.2.5-py3-none-any.whl (40.3 kB)
File details
Details for the file kiwi_pdf_chunker-0.2.5.tar.gz.
File metadata
- Download URL: kiwi_pdf_chunker-0.2.5.tar.gz
- Upload date:
- Size: 39.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.14
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 34a044420c3c9fa804600e9a76034c99f63ffa27570f4741736753e006e7f869 |
| MD5 | 99105d0cae35affe19ae83feda9f2833 |
| BLAKE2b-256 | 14e5d06585429c5931fe85e9cb2ce4587111203d99014f412bc7b7ae451abf34 |
File details
Details for the file kiwi_pdf_chunker-0.2.5-py3-none-any.whl.
File metadata
- Download URL: kiwi_pdf_chunker-0.2.5-py3-none-any.whl
- Upload date:
- Size: 40.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.14
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 285426bdabb6e6efc4cd086c7cb0ccae80af5e7873741ed976beef13d87af6be |
| MD5 | 086812da244c37a7f141bd3831882b37 |
| BLAKE2b-256 | f2889e389d70834515af502ee3eecfa01df23e9ee305150f8e1e54506249d06a |
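A downloaded file can be checked against the digests listed above with Python's standard `hashlib`. This is a minimal sketch: the helper names `file_digests` and `matches` are hypothetical, and BLAKE2b-256 is computed as BLAKE2b with a 32-byte digest.

```python
import hashlib


def file_digests(path: str, chunk_size: int = 1 << 20) -> dict[str, str]:
    """Compute SHA256, MD5 and BLAKE2b-256 hex digests of a file in one pass."""
    hashers = {
        "SHA256": hashlib.sha256(),
        "MD5": hashlib.md5(),
        "BLAKE2b-256": hashlib.blake2b(digest_size=32),
    }
    with open(path, "rb") as fh:
        # Read in chunks so large files are never loaded into memory at once.
        while chunk := fh.read(chunk_size):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


def matches(path: str, expected: dict[str, str]) -> bool:
    """Return True if every expected digest equals the one computed from the file."""
    actual = file_digests(path)
    return all(actual[name] == digest.lower() for name, digest in expected.items())


# Example usage, assuming the wheel was downloaded to the current directory:
#   matches(
#       "kiwi_pdf_chunker-0.2.5-py3-none-any.whl",
#       {"SHA256": "285426bdabb6e6efc4cd086c7cb0ccae80af5e7873741ed976beef13d87af6be"},
#   )
```

Checking SHA256 alone is usually sufficient; MD5 is included only because it appears in the tables above.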