A library for processing PDF documents and images, extracting text, parsing TSV files to JSON, and merging JSON files
PDF Processing Library
This library provides tools for processing PDF documents and images. It covers text extraction, conversion of PDF pages to images, table detection, object detection using YOLO, CVAT task XML generation, TSV-to-JSON parsing, and JSON merging with hash generation.
Installation
pip install pdf_processing_lib
Usage
from pdf_processing_lib import PDFProcessor, ImageProcessor, TextExtractor, TSVtoJSONParser, JSONMerger
# Process PDFs
pdf_processor = PDFProcessor('path/to/input/directory', 'path/to/output/directory')
pdf_processor.process_directory()
# Process images
image_processor = ImageProcessor('path/to/yolo/model.pt')
image_processor.process_directory('path/to/output/directory')
# Extract text and create CVAT task XML
text_extractor = TextExtractor('path/to/output/directory')
total_files, total_time, cvat_xml_path = text_extractor.process_directory_for_text_extraction()
# Parse TSV files to JSON
tsv_parser = TSVtoJSONParser('path/to/output/directory')
total_files, total_time, avg_time = tsv_parser.process_all_final_directories()
# Merge JSON files and add hash
json_merger = JSONMerger('path/to/json/directory')
output_file, total_entries = json_merger.run('path/to/output/merged.json')
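The last step above merges JSON files and adds a hash to each entry. The library's internals are not documented here, so the following is only a standard-library sketch of the general idea, not the actual `JSONMerger` implementation. The helper names `entry_hash` and `merge_json_files`, the `"hash"` key, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib
import json
from pathlib import Path


def entry_hash(entry):
    """Return a SHA-256 hex digest of an entry's canonical JSON form."""
    # sort_keys makes the digest independent of key order in the input file.
    canonical = json.dumps(entry, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def merge_json_files(json_dir, output_file):
    """Merge every *.json file in json_dir into one list and write it out.

    Hypothetical helper: each input file is assumed to hold either a single
    JSON object or a list of JSON objects. Returns (output path, entry count).
    """
    out = Path(output_file)
    merged = []
    for path in sorted(Path(json_dir).glob("*.json")):
        # Skip a previous merge result if it lives in the input directory.
        if path.resolve() == out.resolve():
            continue
        data = json.loads(path.read_text(encoding="utf-8"))
        entries = data if isinstance(data, list) else [data]
        for entry in entries:
            # Hash the entry before adding the key so the hash is reproducible.
            merged.append({**entry, "hash": entry_hash(entry)})
    out.write_text(json.dumps(merged, indent=2, ensure_ascii=False), encoding="utf-8")
    return str(out), len(merged)
```

Hashing a canonical serialization means that two entries with the same content get the same hash, which can help with deduplication after merging.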
Features
- Extract text and tables from PDFs
- Convert PDF pages to JPG images
- Create versions of PDFs with tables covered
- Process multiple PDFs in parallel
- Perform object detection on images using YOLO
- Process images in batches for efficient memory usage
- Extract text from specific regions in PDFs
- Generate CVAT task XML for annotation purposes
- Parse TSV files to structured JSON format
- Merge multiple JSON files into a single file with added hash keys
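As an illustration of the TSV-to-JSON feature listed above, here is a minimal sketch using only the Python standard library. It is not the library's `TSVtoJSONParser`. The function names and the assumption that the first TSV row is a header are hypothetical, and the actual column layout the library expects is not stated in this description.

```python
import csv
import io
import json


def tsv_to_records(tsv_text):
    """Parse TSV text with a header row into a list of dicts, one per row."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # All values stay strings; type conversion is left to the caller.
    return [dict(row) for row in reader]


def tsv_to_json(tsv_text):
    """Serialize the parsed rows as a JSON array of objects."""
    return json.dumps(tsv_to_records(tsv_text), indent=2, ensure_ascii=False)


if __name__ == "__main__":
    sample = "label\tpage\nintro\t1\ntable\t2\n"
    print(tsv_to_json(sample))
```

Each row becomes one JSON object keyed by the header names, which keeps the output easy to merge with other JSON files later.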
License
This project is licensed under the MIT License - see the LICENSE file for details.
File details
Details for the file pdf_processing_lib-0.5.0.tar.gz.
File metadata
- Download URL: pdf_processing_lib-0.5.0.tar.gz
- Upload date:
- Size: 11.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.11.10
File hashes
Algorithm | Hash digest
---|---
SHA256 | ffc6d0927fa8ee77713cd5cd9702f79f06295be21bbd462655f565e472744da2
MD5 | 00f61992c5a882b8181dce65f7c310a0
BLAKE2b-256 | 6c15077ba7446cddf0a69215e3f464c366a443996534c402668af493b523d5a0
File details
Details for the file pdf_processing_lib-0.5.0-py3-none-any.whl.
File metadata
- Download URL: pdf_processing_lib-0.5.0-py3-none-any.whl
- Upload date:
- Size: 14.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.11.10
File hashes
Algorithm | Hash digest
---|---
SHA256 | 4f9ac9d6f881ed4dfa51611ca9bfcff77bf7bf647608f9812c002cf2ba48e224
MD5 | 35826df576a7090c331d2783ea6626a1
BLAKE2b-256 | 3115d04041ca8557dea973c605378f4768e5957deabf79976a9317bfe5167935
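The published digests can be used to check a downloaded file. The sketch below computes a file's SHA256 with `hashlib` and compares it to an expected hex string. The helper names `sha256_of_file` and `matches_published_hash` are hypothetical, and the local file path in the usage note is an assumption about where the download was saved.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read until fh.read() returns an empty bytes object (end of file).
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's SHA-256 digest with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_published_hash("pdf_processing_lib-0.5.0-py3-none-any.whl", "4f9ac9d6f881ed4dfa51611ca9bfcff77bf7bf647608f9812c002cf2ba48e224")` should return `True` for an intact copy of the wheel saved in the current directory.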