A tool for parsing PDF document layouts and chunking content
PDF Parser
A Python package for parsing PDF document layouts using YOLO models, chunking content based on layout, and optionally performing OCR.
Features
- Convert PDF documents to images for processing.
- Detect document layout elements (e.g., paragraphs, tables, figures) using YOLO.
- Process and refine bounding boxes.
- Chunk document content based on detected layout.
- (Optional) Perform OCR on detected elements using Azure Document Intelligence.
- Save structured document data (layouts, chunks, OCR text) in JSON format.
- Get paragraph embeddings using an OpenAI embedder.
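To make the box refinement, chunking, and JSON steps listed above more concrete, here is a minimal sketch in plain Python. It is not the package's implementation or API: the `Box` dataclass, `refine_boxes`, `chunk_page`, the IoU threshold of 0.5, and the simple top-to-bottom reading order are all assumptions chosen for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Box:
    """Hypothetical detection record: a label, pixel coordinates, and a score."""
    label: str
    x0: float
    y0: float
    x1: float
    y1: float
    score: float


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a.x1, b.x1) - max(a.x0, b.x0))
    inter_h = max(0.0, min(a.y1, b.y1) - max(a.y0, b.y0))
    inter = inter_w * inter_h
    union = (a.x1 - a.x0) * (a.y1 - a.y0) + (b.x1 - b.x0) * (b.y1 - b.y0) - inter
    return inter / union if union > 0 else 0.0


def refine_boxes(boxes, iou_threshold=0.5):
    """Drop boxes that heavily overlap a higher-scoring box (assumed strategy)."""
    kept = []
    for box in sorted(boxes, key=lambda b: b.score, reverse=True):
        if all(iou(box, other) < iou_threshold for other in kept):
            kept.append(box)
    return kept


def chunk_page(boxes, page):
    """Emit one chunk per box, ordered top-to-bottom then left-to-right."""
    ordered = sorted(boxes, key=lambda b: (b.y0, b.x0))
    return [{"page": page, "order": i, **asdict(b)} for i, b in enumerate(ordered)]


# Made-up detections for a single page; the second paragraph box is a near-duplicate.
detections = [
    Box("paragraph", 50, 100, 500, 200, 0.95),
    Box("paragraph", 52, 102, 498, 205, 0.60),
    Box("table", 50, 300, 500, 600, 0.90),
    Box("title", 50, 20, 500, 80, 0.88),
]
chunks = chunk_page(refine_boxes(detections), page=1)
print(json.dumps(chunks, indent=2))
```

A real multi-column layout would need a smarter reading order than a sort on the top-left corner; the sort here only keeps the sketch short.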
Installation
Prerequisites
- Python 3.10+
- pip package manager
- (Optional but Recommended) CUDA-capable GPU for YOLO model inference acceleration.
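The prerequisites above can be checked with a short script. This is a sketch under one assumption: that GPU inference goes through PyTorch (suggested by the `.pt` model format, but not stated in this description). `check_prerequisites` is a hypothetical helper, not part of the package.

```python
import importlib.util
import sys


def check_prerequisites(min_python=(3, 10)):
    """Report whether the interpreter and (optionally) a CUDA GPU are usable."""
    report = {
        "python_ok": sys.version_info >= min_python,
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "cuda_available": False,
    }
    if report["torch_installed"]:
        try:
            import torch  # assumption: the YOLO model runs on PyTorch

            report["cuda_available"] = bool(torch.cuda.is_available())
        except Exception:
            # A broken torch install should not crash the check itself.
            report["cuda_available"] = False
    return report


print(check_prerequisites())
```

A missing GPU is not an error here, since the GPU is optional; inference would simply be slower on CPU.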
Steps
- Install the Package:
pip install kiwi-pdf-chunker
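After installing, one way to confirm that the distribution is visible to the current interpreter is to query its metadata with the standard library. The helper name `installed_version` is hypothetical; only the distribution name `kiwi-pdf-chunker` comes from the install command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name="kiwi-pdf-chunker"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


found = installed_version()
print(found if found else "kiwi-pdf-chunker is not installed in this environment")
```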
User-Provided Data
This package requires the user to provide certain data externally:
- Input Directory (input/): Place the PDF documents you want to process in a directory (e.g., input/). You will need to provide the path to your input file(s) when using the package.
- Models Directory (models/): Download the necessary YOLO model(s) (e.g., doclayout_yolo_docstructbench_imgsz1024.pt) and place them in a dedicated directory (e.g., models/). The path to this directory (or the specific model file) will be needed by the parser.
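Before running the parser, it can help to check that both user-provided directories are in place. The sketch below uses only `pathlib`; `collect_inputs` is a hypothetical helper, and the assumption that model files end in `.pt` is taken from the example file name above.

```python
from pathlib import Path


def collect_inputs(input_dir="input", models_dir="models"):
    """Return (pdf_paths, model_paths), raising if either set is missing."""
    input_path, models_path = Path(input_dir), Path(models_dir)
    for folder in (input_path, models_path):
        if not folder.is_dir():
            raise FileNotFoundError(f"Directory not found: {folder}")

    # Match the .pdf extension case-insensitively; ignore other files.
    pdfs = sorted(p for p in input_path.iterdir() if p.suffix.lower() == ".pdf")
    models = sorted(models_path.glob("*.pt"))  # assumption: YOLO weights use .pt

    if not pdfs:
        raise FileNotFoundError(f"No PDF files in {input_path}")
    if not models:
        raise FileNotFoundError(f"No .pt model files in {models_path}")
    return pdfs, models
```

The returned paths can then be passed to whatever entry point the package exposes; that entry point is not shown in this description, so it is left out here.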
File details
Details for the file kiwi_pdf_chunker-0.2.3.tar.gz.
File metadata
- Download URL: kiwi_pdf_chunker-0.2.3.tar.gz
- Upload date:
- Size: 39.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.14
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1c5f76480b8bb96c89a4fd7ce27aca3a818345973e646ec1a91d2cda6e4eb1e3 |
| MD5 | 6f052b06ec4ded27760b32e10f0c2264 |
| BLAKE2b-256 | 2f0d3ffffcb4b08caffc95bf8d037c2e91238aa36162fb4257f7fd92ae1cf08d |
File details
Details for the file kiwi_pdf_chunker-0.2.3-py3-none-any.whl.
File metadata
- Download URL: kiwi_pdf_chunker-0.2.3-py3-none-any.whl
- Upload date:
- Size: 40.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.14
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 36a1792a03d744af9bf64a7f3d54d7f271c8091e6a34cd39b04b4b9d1139fc56 |
| MD5 | cbc706bce2281b0a4f5c2e2c42024ed5 |
| BLAKE2b-256 | 55696373da0316f71a9e0a169b46edcfed4550055f7cf4d8d385aac4fd543075 |
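The SHA256 digests listed above can be used to check a downloaded file before installing it. The following is a minimal sketch with `hashlib`; the function names are hypothetical, and the expected values are copied from the hash tables in this section.

```python
import hashlib
from pathlib import Path

# Digests copied from the file details above.
EXPECTED_SHA256 = {
    "kiwi_pdf_chunker-0.2.3.tar.gz":
        "1c5f76480b8bb96c89a4fd7ce27aca3a818345973e646ec1a91d2cda6e4eb1e3",
    "kiwi_pdf_chunker-0.2.3-py3-none-any.whl":
        "36a1792a03d744af9bf64a7f3d54d7f271c8091e6a34cd39b04b4b9d1139fc56",
}


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in fixed-size blocks so large files need little memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_download(path):
    """Return True if the file's SHA256 matches the digest listed for its name."""
    name = Path(path).name
    if name not in EXPECTED_SHA256:
        raise KeyError(f"No known digest for {name}")
    return sha256_of(path) == EXPECTED_SHA256[name]
```

For example, `verify_download("kiwi_pdf_chunker-0.2.3.tar.gz")` returns `True` only when the local file is byte-identical to the published source distribution.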