A tool for parsing PDF document layouts and chunking content
PDF Parser
A Python package for parsing PDF document layouts using YOLO models, chunking content based on layout, and optionally performing OCR.
Features
- Convert PDF documents to images for processing.
- Detect document layout elements (e.g., paragraphs, tables, figures) using YOLO.
- Process and refine bounding boxes.
- Chunk document content based on detected layout.
- (Optional) Perform OCR on detected elements using Azure Document Intelligence.
- Save structured document data (layouts, chunks, OCR text) in JSON format.
- Get paragraph embeddings using an OpenAI embedder.
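The box-refinement, chunking, and JSON steps in the list above can be illustrated with a small standalone sketch. It does not use this package's API, which is not documented here: the `Box` class, the `refine` and `to_chunks` helpers, the 0.5 overlap threshold, and the top-to-bottom reading order are all assumptions made for illustration. Refinement is shown as greedy non-maximum suppression, which may differ from what the package does internally.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Box:
    """A detected layout element. All field names are illustrative."""
    label: str
    x0: float
    y0: float
    x1: float
    y1: float
    score: float


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a.x1, b.x1) - max(a.x0, b.x0))
    ih = max(0.0, min(a.y1, b.y1) - max(a.y0, b.y0))
    inter = iw * ih
    union = (a.x1 - a.x0) * (a.y1 - a.y0) + (b.x1 - b.x0) * (b.y1 - b.y0) - inter
    return inter / union if union > 0 else 0.0


def refine(boxes: list[Box], iou_threshold: float = 0.5) -> list[Box]:
    """Greedy non-maximum suppression: keep the best-scoring box of each overlapping group."""
    kept: list[Box] = []
    for box in sorted(boxes, key=lambda b: b.score, reverse=True):
        if all(iou(box, k) < iou_threshold for k in kept):
            kept.append(box)
    return kept


def to_chunks(boxes: list[Box]) -> list[dict]:
    """Order boxes top-to-bottom, then left-to-right, and emit one chunk per box."""
    ordered = sorted(boxes, key=lambda b: (b.y0, b.x0))
    return [{"chunk_id": i, **asdict(b)} for i, b in enumerate(ordered)]


# Hypothetical detections for a single page, in pixel coordinates.
detections = [
    Box("paragraph", 50, 300, 550, 400, 0.91),
    Box("title", 50, 40, 550, 90, 0.97),
    Box("paragraph", 52, 302, 548, 398, 0.60),  # near-duplicate of the first box
]
chunks = to_chunks(refine(detections))
print(json.dumps(chunks, indent=2))
```

Sorting by the top edge is only adequate for single-column pages; multi-column layouts would need a column-aware ordering.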
Installation
Prerequisites
- Python 3.10+
- The pip package manager
- (Optional but recommended) A CUDA-capable GPU to accelerate YOLO model inference.
Steps
- Install the package:
  pip install kiwi-pdf-chunker
User-Provided Data
This package requires the user to provide certain data externally:
- Input Directory (input/): Place the PDF documents you want to process in a directory (e.g., input/). You will need to provide the path to your input file(s) when using the package.
- Models Directory (models/): Download the necessary YOLO model(s) (e.g., doclayout_yolo_docstructbench_imgsz1024.pt) and place them in a dedicated directory (e.g., models/). The path to this directory (or the specific model file) will be needed by the parser.
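Before running the parser, it can help to check that both locations are populated. The sketch below uses only the standard library; the function name `find_inputs` and the rule of picking the first `.pt` file when a directory is given are assumptions for illustration, not part of this package.

```python
from pathlib import Path


def find_inputs(input_dir: str, model_path: str) -> tuple[list[Path], Path]:
    """Return the PDFs under input_dir and the resolved model file, or raise."""
    model = Path(model_path)
    if model.is_dir():
        # Assumption: when a directory is given, use its first .pt file by name.
        candidates = sorted(model.glob("*.pt"))
        if not candidates:
            raise FileNotFoundError(f"no .pt model found in {model}")
        model = candidates[0]
    elif not model.is_file():
        raise FileNotFoundError(f"model not found: {model}")

    # Only files directly inside input_dir are considered, not subdirectories.
    pdfs = sorted(p for p in Path(input_dir).glob("*.pdf") if p.is_file())
    if not pdfs:
        raise FileNotFoundError(f"no PDF files found in {input_dir}")
    return pdfs, model
```

A call such as `find_inputs("input", "models")` would then return the PDF paths and the model file to hand to the parser, or fail early with a clear message.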
File details
Details for the file kiwi_pdf_chunker-0.2.4.tar.gz.
File metadata
- Download URL: kiwi_pdf_chunker-0.2.4.tar.gz
- Upload date:
- Size: 39.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.14
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | cb4aa4a8aa55b594919346f49ed53c3dd2257f2cd43d148fff97ce22c91ccad9 |
| MD5 | a987d0adc96059564934afeeb8955996 |
| BLAKE2b-256 | b6f8bc37429e1880bdc469480e05c1959e0f58ee523676e7af11346526cee123 |
File details
Details for the file kiwi_pdf_chunker-0.2.4-py3-none-any.whl.
File metadata
- Download URL: kiwi_pdf_chunker-0.2.4-py3-none-any.whl
- Upload date:
- Size: 40.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.14
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | cd449547bbbe5ebcf6b2bba1e90991bd9a877eb44b04806d2d732f14425edc26 |
| MD5 | 88d8525933fe167cc202027adcad3e76 |
| BLAKE2b-256 | 52f0de0654ea32e52e0a9be87e8d3c2a3e059ae2988a8a4ff913a0e352093487 |