A library for processing PDF documents and images, extracting text, parsing TSV to JSON, and merging JSON files

PDF Processing Library

This library provides tools for processing PDF documents and images, extracting text, parsing TSV files to JSON, and merging JSON files. It covers text extraction, PDF-to-image conversion, table detection, object detection using YOLO, CVAT task XML generation, TSV-to-JSON parsing, and JSON merging with hash generation.

Installation

pip install pdf_processing_lib

Usage

from pdf_processing_lib import PDFProcessor, ImageProcessor, TextExtractor, TSVtoJSONParser, JSONMerger

# Process PDFs
pdf_processor = PDFProcessor('path/to/input/directory', 'path/to/output/directory')
pdf_processor.process_directory()

# Process images
image_processor = ImageProcessor('path/to/yolo/model.pt')
image_processor.process_directory('path/to/output/directory')

# Extract text and create CVAT task XML
text_extractor = TextExtractor('path/to/output/directory')
total_files, total_time, cvat_xml_path = text_extractor.process_directory_for_text_extraction()

# Parse TSV files to JSON
tsv_parser = TSVtoJSONParser('path/to/output/directory')
total_files, total_time, avg_time = tsv_parser.process_all_final_directories()

# Merge JSON files and add hash
json_merger = JSONMerger('path/to/json/directory')
output_file, total_entries = json_merger.run('path/to/output/merged.json')
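
To make the TSV-to-JSON step above more concrete, here is a minimal standard-library sketch of turning tab-separated text into JSON records. It is not the library's implementation: the function name `tsv_to_records` and the sample columns `page` and `text` are assumptions chosen for illustration, and the actual output structure of `TSVtoJSONParser` may differ.

```python
import csv
import io
import json


def tsv_to_records(tsv_text):
    """Parse TSV text with a header row into a list of dicts.

    Hypothetical helper for illustration only; it is not part of
    pdf_processing_lib.
    """
    # DictReader uses the first row as keys; delimiter="\t" selects TSV.
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [dict(row) for row in reader]


# Sample data with assumed column names.
sample = "page\ttext\n1\tHello\n2\tWorld\n"
records = tsv_to_records(sample)

# Serialize the records as JSON; all values stay strings, as read by csv.
print(json.dumps(records, indent=2))
```

Note that `csv` reads every field as a string, so numeric columns would need an explicit conversion step if typed values are required.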

Features

  • Extract text and tables from PDFs
  • Convert PDF pages to JPG images
  • Create versions of PDFs with tables covered
  • Process multiple PDFs in parallel
  • Perform object detection on images using YOLO
  • Process images in batches for efficient memory usage
  • Extract text from specific regions in PDFs
  • Generate CVAT task XML for annotation purposes
  • Parse TSV files to structured JSON format
  • Merge multiple JSON files into a single file with added hash keys
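
The last feature, merging JSON files with added hash keys, can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions, not the code behind `JSONMerger`: the helper names `entry_hash` and `merge_json_files`, the choice of SHA-256, the `"hash"` key name, and the assumption that each file holds either one JSON object or a list of objects are all hypothetical.

```python
import hashlib
import json
from pathlib import Path


def entry_hash(entry):
    """Return a SHA-256 hex digest of an entry's canonical JSON form.

    Hypothetical helper; the library's real hashing scheme is not documented here.
    """
    # sort_keys makes the digest independent of key order.
    canonical = json.dumps(entry, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def merge_json_files(json_dir, output_path):
    """Merge all *.json files in json_dir into one list written to output_path.

    Assumes each file contains a JSON object or a list of JSON objects.
    Returns (output path as str, number of merged entries).
    """
    out = Path(output_path)
    merged = []
    for path in sorted(Path(json_dir).glob("*.json")):
        # Skip a previous merge result if it lives in the same directory.
        if path.resolve() == out.resolve():
            continue
        data = json.loads(path.read_text(encoding="utf-8"))
        entries = data if isinstance(data, list) else [data]
        for entry in entries:
            # Copy the entry and attach its hash under an assumed key name.
            merged.append({**entry, "hash": entry_hash(entry)})
    out.write_text(json.dumps(merged, indent=2, ensure_ascii=False), encoding="utf-8")
    return str(out), len(merged)
```

Hashing a canonical serialization (sorted keys) means two entries with the same content produce the same digest regardless of key order, which is one common way to detect duplicates after a merge.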

License

This project is licensed under the MIT License - see the LICENSE file for details.
