A tool for parsing PDF document layouts and chunking content.

CV Document Chunker

A Python package for parsing PDF document layouts using YOLO models, chunking content based on layout, and optionally performing OCR.

Features

- Convert PDF documents to images for processing.
- Detect document layout elements (e.g., paragraphs, tables, figures) using YOLO.
- Process and refine bounding boxes.
- Chunk document content based on detected layout.
- Generate annotated images showing detected elements.
- (Optional) Perform OCR on detected elements using Azure Document Intelligence.
- Save structured document data (layouts, chunks, OCR text) in JSON format.
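
To make the "process and refine bounding boxes" step concrete, here is a minimal sketch of one common refinement: dropping near-duplicate detections by intersection over union (IoU) and ordering the survivors top to bottom. The `Box` record, the `refine_boxes` helper, and the 0.5 threshold are assumptions for illustration only; the package's own refinement logic may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical detection record: pixel coordinates, label, and score."""
    x1: float
    y1: float
    x2: float
    y2: float
    label: str
    confidence: float

    @property
    def area(self) -> float:
        return max(0.0, self.x2 - self.x1) * max(0.0, self.y2 - self.y1)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix = max(0.0, min(a.x2, b.x2) - max(a.x1, b.x1))
    iy = max(0.0, min(a.y2, b.y2) - max(a.y1, b.y1))
    inter = ix * iy
    union = a.area + b.area - inter
    return inter / union if union > 0 else 0.0


def refine_boxes(boxes: list[Box], iou_threshold: float = 0.5) -> list[Box]:
    """Keep the most confident box among heavily overlapping ones,
    then return the survivors sorted top-to-bottom, left-to-right."""
    kept: list[Box] = []
    for box in sorted(boxes, key=lambda b: b.confidence, reverse=True):
        if all(iou(box, k) < iou_threshold for k in kept):
            kept.append(box)
    return sorted(kept, key=lambda b: (b.y1, b.x1))


detections = [
    Box(10, 10, 200, 60, "title", 0.95),
    Box(12, 11, 198, 62, "paragraph", 0.40),  # near-duplicate of the title box
    Box(10, 80, 200, 300, "paragraph", 0.90),
]
refined = refine_boxes(detections)
print([b.label for b in refined])
```

The low-confidence duplicate overlaps the title box almost entirely, so only the title and the second paragraph remain.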

Installation

Prerequisites

- Python 3.10+
- pip package manager
- (Optional but recommended) CUDA-capable GPU to accelerate YOLO model inference.

Steps

Clone the Repository (for development or local install):
```
git clone <your-repository-url>
cd cv-doc-parser
```

Create and Activate a Virtual Environment:

```
python -m venv venv
source venv/bin/activate  # Linux/macOS
# venv\Scripts\activate  # Windows
```

Install the Package:
- Editable mode (recommended for development):
```
pip install -e .
```
- Regular install:
```
pip install .
```
- From PyPI:
```
pip install cv-doc-chunker
```

User-Provided Data

This package requires the user to provide certain data externally:

- Input Directory (input/): Place the PDF documents you want to process in a directory (e.g., input/). You will need to provide the path to your input file(s) when using the package.
- Models Directory (models/): Download the necessary YOLO model(s) (e.g., doclayout_yolo_docstructbench_imgsz1024.pt) and place them in a dedicated directory (e.g., models/). The path to this directory (or the specific model file) will be needed by the parser.
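
Because both paths are supplied by the user, it can help to validate them before starting a long run. The following is a small sketch using only the standard library; `check_user_data` is a hypothetical helper and not part of the package.

```python
from pathlib import Path


def check_user_data(input_dir: str, model_path: str) -> list[Path]:
    """Return the PDFs found in input_dir, or raise if the inputs are unusable.

    Hypothetical pre-flight check; not part of cv-doc-chunker.
    """
    model = Path(model_path)
    if not model.is_file():
        raise FileNotFoundError(f"YOLO model file not found: {model}")
    # Only regular files ending in .pdf are considered inputs.
    pdfs = sorted(p for p in Path(input_dir).glob("*.pdf") if p.is_file())
    if not pdfs:
        raise FileNotFoundError(f"No PDF files found in: {input_dir}")
    return pdfs
```

Calling `check_user_data("input/", "models/doclayout_yolo_docstructbench_imgsz1024.pt")` would then return the sorted list of PDFs to process, or fail early with a clear message.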

Usage

The examples below illustrate the intended workflow from Python and from the command line.

Example (Conceptual Python Usage):

```
from cv_doc_chunker import PDFProcessor

# --- User Configuration ---
input_pdf_path = "path/to/your/input/document.pdf" # Path to user's PDF
model_path = "path/to/your/models/doclayout_yolo.pt" # Path to user's model
output_dir = "path/to/your/output/" # Directory to save results

# --- Initialize and Run ---
processor = PDFProcessor(model_path=model_path, output_dir=output_dir)

# Process the document (layout detection, chunking, etc.)
results = processor.process_document(pdf_path=input_pdf_path)

# Optional: Perform OCR (requires Azure setup)
# ocr_results = processor.perform_ocr(results)

print(f"Processing complete. Results saved in {output_dir}")
```

Example (Conceptual Command-Line Usage):

(Assumes the cv-chunker entry point is configured)

```
cv-chunker --input path/to/your/input/document.pdf \
           --model path/to/your/models/doclayout_yolo.pt \
           --output path/to/your/output/ \
           [--ocr] [--azure-endpoint YOUR_ENDPOINT] [--azure-key YOUR_KEY]
```

Note: The examples above are conceptual. Replace the function names, class names, and command-line arguments with the ones your installed version of cv-doc-chunker actually provides.
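
For reference, the conceptual command line above could be parsed with the standard library's argparse roughly as follows. This is a sketch only: the flag names mirror the conceptual example, and the rule that `--ocr` needs both Azure options on the command line is an assumption, since the real tool might read credentials from elsewhere.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the conceptual cv-chunker flags (hypothetical)."""
    parser = argparse.ArgumentParser(prog="cv-chunker")
    parser.add_argument("--input", required=True, help="Path to the input PDF")
    parser.add_argument("--model", required=True, help="Path to the YOLO model file")
    parser.add_argument("--output", required=True, help="Directory for results")
    parser.add_argument("--ocr", action="store_true", help="Run OCR on detected elements")
    parser.add_argument("--azure-endpoint", help="Azure Document Intelligence endpoint")
    parser.add_argument("--azure-key", help="Azure Document Intelligence key")
    return parser


def parse_args(argv: list[str]) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    # Assumption: OCR needs both Azure options, so fail early if either is missing.
    if args.ocr and not (args.azure_endpoint and args.azure_key):
        parser.error("--ocr requires --azure-endpoint and --azure-key")
    return args


args = parse_args(["--input", "doc.pdf", "--model", "model.pt", "--output", "out/"])
print(args.input, args.ocr)
```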

Understanding the Output

After running the parser, the following outputs will typically be available in the specified output_dir:

- {your-document}_parsed.json: JSON file containing the detected document structure (element labels, coordinates, confidence).
- {your-document}_annotations/: Directory containing annotated images showing the detected elements for each page (if generate_annotations=True).
- {your-document}_boxes/: Directory containing individual images for each detected element, organized by page number (if save_bounding_boxes=True). This is required for OCR.
- {your-document}_sorted_text.json: (Only if ocr=True) JSON file containing the extracted text for each element, sorted according to the structure defined in _parsed.json.

If debug mode is enabled (debug_mode=True), additional debug images might be saved, typically in a debug/ subdirectory within the output_dir, showing intermediate steps of the parsing process.
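
The naming scheme above can be turned into a small lookup that tells you where to expect each artifact. This sketch assumes that {your-document} is the PDF file name without its extension; `expected_outputs` is a hypothetical helper, not a function of the package.

```python
from pathlib import Path


def expected_outputs(pdf_path: str, output_dir: str, ocr: bool = False) -> dict[str, Path]:
    """Map each documented output to the path it should have in output_dir."""
    stem = Path(pdf_path).stem  # assumption: {your-document} is the PDF's stem
    out = Path(output_dir)
    paths = {
        "parsed": out / f"{stem}_parsed.json",
        "annotations": out / f"{stem}_annotations",
        "boxes": out / f"{stem}_boxes",
    }
    if ocr:
        # The sorted text file is only produced when OCR is enabled.
        paths["sorted_text"] = out / f"{stem}_sorted_text.json"
    return paths


paths = expected_outputs("input/report.pdf", "output", ocr=True)
for name, path in paths.items():
    print(f"{name}: {path} (exists: {path.exists()})")
```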

License

This project is licensed under the MIT License - see the LICENSE file for details.

Release history

- 0.2.2 (Apr 30, 2025)
- 0.2.1 (Apr 29, 2025)
- 0.2.0 (Apr 29, 2025)
- 0.1.1 (Apr 28, 2025), this version
- 0.1.0 (Apr 22, 2025)

Download files

Source Distribution

cv_doc_chunker-0.1.1.tar.gz (25.5 kB), uploaded Apr 28, 2025, Source

Built Distribution

cv_doc_chunker-0.1.1-py3-none-any.whl (25.3 kB), uploaded Apr 28, 2025, Python 3

File details

Details for the file cv_doc_chunker-0.1.1.tar.gz.

File metadata

Download URL: cv_doc_chunker-0.1.1.tar.gz
Upload date: Apr 28, 2025
Size: 25.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.14

File hashes

Hashes for cv_doc_chunker-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`79c830d9d578a15a726f95014716f9556d990882d83f73627b55c7cad300100c`
MD5	`75719dcfb3f40bc15470759ee17e7cde`
BLAKE2b-256	`66a4624f58b1e0909e6df765a7656562621b07da0bf0450c8d0f19722011464c`
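
A downloaded file can be checked against the published SHA256 digest with Python's hashlib. The helper names below are illustrative, and the sketch assumes the archive has already been downloaded to a local path.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: str, expected_hex: str) -> bool:
    """Compare a file's digest with a published one, ignoring case and whitespace."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches_published_hash("cv_doc_chunker-0.1.1.tar.gz", "79c830d9d578a15a726f95014716f9556d990882d83f73627b55c7cad300100c")` should return True for an intact copy of the source distribution.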

File details

Details for the file cv_doc_chunker-0.1.1-py3-none-any.whl.

File metadata

Download URL: cv_doc_chunker-0.1.1-py3-none-any.whl
Upload date: Apr 28, 2025
Size: 25.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.14

File hashes

Hashes for cv_doc_chunker-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`bcb332b8884a0f63d267a62b6c56c2d2637f99df1bbf2de654b639061a3b23fd`
MD5	`afc4aa2a9e22a8a2f891c3aa4aad1770`
BLAKE2b-256	`771813d67634ef387be6df29c9296490f224c6637d8834a9e800c2645da20e95`
