pdfcoordex

Extract text, images, and tables from PDF files with automatic analysis.

Installation

pip install pdfcoordex

Features

  • ✅ Extract all text content from PDFs
  • ✅ Analyze embedded images (size, format, type)
  • ✅ Extract tables with data preservation
  • ✅ Save results to a plain text file
  • ✅ Simple and easy to use

Quick Start

from pdfcoordex import PDFCoordExtractor

# Create extractor
extractor = PDFCoordExtractor("document.pdf")

# Save to text file
extractor.save_to_text_file("output.txt")

# Close when done
extractor.close()

Usage Examples

Extract Single Page

from pdfcoordex import PDFCoordExtractor

extractor = PDFCoordExtractor("document.pdf")

# Get page 1 data
page_data = extractor.extract_page(1)

print("Text:", page_data['text'])
print("Images:", len(page_data['images']))
print("Tables:", len(page_data['tables']))

extractor.close()

Extract All Pages

from pdfcoordex import PDFCoordExtractor

extractor = PDFCoordExtractor("document.pdf")

# Get all pages
all_pages = extractor.extract_all_pages()

for page_name, data in all_pages.items():
    print(f"{page_name}: {len(data['text'])} characters")

extractor.close()
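
The per-page dictionaries can also be condensed into a short report. The sketch below assumes each value returned by extract_all_pages() has the 'text', 'images' and 'tables' keys shown in the single-page example. summarize_pages is a hypothetical helper, not part of pdfcoordex, and the sample data (including the page_1 / page_2 key names) is made up so the snippet runs without a PDF.

```python
def summarize_pages(all_pages):
    """Return one summary line per page.

    Assumes each value is a dict with 'text', 'images' and 'tables' keys,
    as in the single-page example. Hypothetical helper, not part of pdfcoordex.
    """
    lines = []
    for page_name, data in all_pages.items():
        lines.append(
            f"{page_name}: {len(data['text'])} characters, "
            f"{len(data['images'])} images, {len(data['tables'])} tables"
        )
    return lines


# Made-up data standing in for extractor.extract_all_pages()
sample = {
    "page_1": {"text": "Hello", "images": [], "tables": [[["a", "b"]]]},
    "page_2": {"text": "", "images": [{}], "tables": []},
}

for line in summarize_pages(sample):
    print(line)
```

With a real document, pass the result of extractor.extract_all_pages() instead of sample.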

Using Context Manager

from pdfcoordex import PDFCoordExtractor

# Automatically closes the PDF
with PDFCoordExtractor("document.pdf") as extractor:
    extractor.save_to_text_file("output.txt")
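
For comparison, the examples that call close() by hand skip that call if an exception is raised first. Assuming the context manager simply calls close() on exit, the same guarantee can be had with try/finally. The sketch uses FakeExtractor, a stand-in class invented for this example so it runs without a PDF; it is not the real PDFCoordExtractor.

```python
class FakeExtractor:
    """Stand-in for PDFCoordExtractor so the sketch runs without a PDF."""

    def __init__(self):
        self.closed = False

    def save_to_text_file(self, path):
        # Simulate a failure part-way through the work
        raise OSError(f"cannot write {path}")

    def close(self):
        self.closed = True


extractor = FakeExtractor()
try:
    try:
        extractor.save_to_text_file("output.txt")
    finally:
        # Runs whether or not the call above raised
        extractor.close()
except OSError as exc:
    print("Extraction failed:", exc)

print("Closed:", extractor.closed)
```

The with form above is shorter and is usually the better choice; try/finally is mainly useful when the extractor has to outlive a single block.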

Output Format

The text file output includes:

Text Content

All text from each page as plain text.

Image Analysis

  • Image format (PNG, JPEG, etc.)
  • Image dimensions (width x height)
  • Orientation (landscape, portrait, square)
  • Type classification (diagram, icon, chart)
  • Color mode information
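
The orientation label can be reproduced from the reported dimensions alone. The rule below is an assumption for illustration: pdfcoordex does not document how it decides, and it may, for example, allow a tolerance before calling an image square. classify_orientation is a hypothetical helper.

```python
def classify_orientation(width, height):
    """Label an image by comparing its width and height (assumed rule)."""
    if width > height:
        return "landscape"
    if height > width:
        return "portrait"
    return "square"


for size in [(800, 600), (600, 800), (512, 512)]:
    print(size, classify_orientation(*size))
```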

Tables

Tables are formatted with pipe separators for easy reading:

Header1 | Header2 | Header3
Data1   | Data2   | Data3

Requirements

  • Python >= 3.8
  • pdfplumber >= 0.10.3
  • PyMuPDF >= 1.23.8
  • Pillow >= 10.0.0
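
To see what is installed in the current environment, the standard library's importlib.metadata can report each dependency's version. This sketch only prints the versions for comparison against the minimums above; it does not compare them, and installed_versions is a hypothetical helper name.

```python
import sys
from importlib import metadata

# Distribution names as listed under Requirements
REQUIRED = ("pdfplumber", "PyMuPDF", "Pillow")


def installed_versions(names=REQUIRED):
    """Map each distribution name to its installed version, or None."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


print("Python:", ".".join(map(str, sys.version_info[:3])))
for name, version in installed_versions().items():
    print(f"{name}: {version or 'not installed'}")
```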

License

MIT License

Support

For issues and questions, please visit the GitHub repository.
