pdfcoordex

Extract text, images, and tables from PDF files, with automatic analysis of embedded images.

Installation

pip install pdfcoordex

Features

  • ✅ Extract all text content from PDFs
  • ✅ Analyze embedded images (size, format, type)
  • ✅ Extract tables with data preservation
  • ✅ Save results to a plain text file
  • ✅ Simple and easy to use

Quick Start

from pdfcoordex import PDFCoordExtractor

# Create extractor
extractor = PDFCoordExtractor("document.pdf")

# Save to text file
extractor.save_to_text_file("output.txt")

# Close when done
extractor.close()

Usage Examples

Extract Single Page

from pdfcoordex import PDFCoordExtractor

extractor = PDFCoordExtractor("document.pdf")

# Get page 1 data
page_data = extractor.extract_page(1)

print("Text:", page_data['text'])
print("Images:", len(page_data['images']))
print("Tables:", len(page_data['tables']))

extractor.close()
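
The page dictionary can be inspected like any other Python mapping. The sketch below assumes only the three keys shown above (`text`, `images`, `tables`). The helper name `summarize_page` and the sample data are hypothetical and not part of pdfcoordex.

```python
def summarize_page(page_data):
    """Return a one-line summary for a page dictionary.

    Assumes the 'text', 'images', and 'tables' keys used in the example
    above; missing or None values are treated as empty.
    """
    text = page_data.get("text") or ""
    images = page_data.get("images") or []
    tables = page_data.get("tables") or []
    return f"{len(text)} characters, {len(images)} images, {len(tables)} tables"


# Hand-built sample standing in for the result of extractor.extract_page(1)
sample = {"text": "Hello PDF", "images": [], "tables": [[["a", "b"]]]}
print(summarize_page(sample))
```

Using `.get()` with a fallback keeps the helper from raising `KeyError` if a page has no images or tables entry.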

Extract All Pages

from pdfcoordex import PDFCoordExtractor

extractor = PDFCoordExtractor("document.pdf")

# Get all pages
all_pages = extractor.extract_all_pages()

for page_name, data in all_pages.items():
    print(f"{page_name}: {len(data['text'])} characters")

extractor.close()
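
Because the result is iterated with `.items()`, it can be filtered like a regular dictionary. A minimal sketch, assuming each value carries a `tables` list as in the single-page example. The page names (`page_1`, `page_2`) and the helper `pages_with_tables` are hypothetical.

```python
def pages_with_tables(all_pages):
    """Return the names of pages whose 'tables' entry is non-empty."""
    return [name for name, data in all_pages.items() if data.get("tables")]


# Hand-built sample standing in for extractor.extract_all_pages()
sample_pages = {
    "page_1": {"text": "Intro", "images": [], "tables": []},
    "page_2": {"text": "Results", "images": [], "tables": [[["x", "y"]]]},
}
print(pages_with_tables(sample_pages))
```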

Using Context Manager

from pdfcoordex import PDFCoordExtractor

# Automatically closes the PDF
with PDFCoordExtractor("document.pdf") as extractor:
    extractor.save_to_text_file("output.txt")
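
A `with` block calls the object's `__exit__` method even when the body raises, which is why no explicit `close()` is needed. The sketch below illustrates the protocol with a hypothetical stand-in class, `FakeExtractor`. It is not pdfcoordex's actual implementation.

```python
class FakeExtractor:
    """Hypothetical stand-in that records whether close() was called."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # do not suppress exceptions raised in the with-body


extractor = FakeExtractor()
try:
    with extractor:
        raise ValueError("simulated failure while extracting")
except ValueError:
    pass

print(extractor.closed)  # close() ran despite the exception
```

When calling `close()` manually instead, wrapping the work in `try` / `finally` gives the same guarantee.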

Output Format

The text file output includes:

Text Content

All text from each page as plain text.

Image Analysis

  • Image format (PNG, JPEG, etc.)
  • Image dimensions (width x height)
  • Orientation (landscape, portrait, square)
  • Type classification (diagram, icon, chart)
  • Color mode information
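
Orientation follows from comparing an image's width and height. A minimal sketch, assuming the straightforward rule (wider is landscape, taller is portrait, equal is square). The exact rule pdfcoordex applies is not documented here, and `classify_orientation` is a hypothetical name.

```python
def classify_orientation(width, height):
    """Classify an image as 'landscape', 'portrait', or 'square'."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if width > height:
        return "landscape"
    if height > width:
        return "portrait"
    return "square"


for size in [(1920, 1080), (600, 800), (64, 64)]:
    print(size, classify_orientation(*size))
```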

Tables

Tables are formatted with pipe separators for easy reading:

Header1 | Header2 | Header3
Data1   | Data2   | Data3
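
Output in this shape can be produced by padding each cell to its column's widest entry. The sketch below is one way to do that, assuming a table is a list of rows and each row is a list of cells. `format_table` is a hypothetical helper, not a pdfcoordex function.

```python
def format_table(rows):
    """Format rows (lists of cells) as pipe-separated, column-aligned text."""
    if not rows:
        return ""
    # Treat None cells as empty strings and convert everything to str
    cells = [["" if c is None else str(c) for c in row] for row in rows]
    n_cols = max(len(row) for row in cells)
    # Pad short rows so every row has the same number of columns
    cells = [row + [""] * (n_cols - len(row)) for row in cells]
    # Each column is as wide as its longest cell
    widths = [max(len(row[i]) for row in cells) for i in range(n_cols)]
    lines = []
    for row in cells:
        padded = [cell.ljust(widths[i]) for i, cell in enumerate(row)]
        lines.append(" | ".join(padded).rstrip())
    return "\n".join(lines)


rows = [["Header1", "Header2", "Header3"], ["Data1", "Data2", "Data3"]]
print(format_table(rows))
```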

Requirements

  • Python >= 3.8
  • pdfplumber >= 0.10.3
  • PyMuPDF >= 1.23.8
  • Pillow >= 10.0.0

License

MIT License

Support

For issues and questions, please visit the GitHub repository.
