pdfcoordex
Extract text, images, and tables from PDF files with automatic analysis.
Installation
pip install pdfcoordex
Features
- ✅ Extract all text content from PDFs
- ✅ Analyze embedded images (size, format, type)
- ✅ Extract tables with data preservation
- ✅ Save results to plain text file
- ✅ Simple and easy to use
Quick Start
from pdfcoordex import PDFCoordExtractor
# Create extractor
extractor = PDFCoordExtractor("document.pdf")
# Save to text file
extractor.save_to_text_file("output.txt")
# Close when done
extractor.close()
Usage Examples
Extract Single Page
from pdfcoordex import PDFCoordExtractor
extractor = PDFCoordExtractor("document.pdf")
# Get page 1 data
page_data = extractor.extract_page(1)
print("Text:", page_data['text'])
print("Images:", len(page_data['images']))
print("Tables:", len(page_data['tables']))
extractor.close()
Extract All Pages
from pdfcoordex import PDFCoordExtractor
extractor = PDFCoordExtractor("document.pdf")
# Get all pages
all_pages = extractor.extract_all_pages()
for page_name, data in all_pages.items():
    print(f"{page_name}: {len(data['text'])} characters")
extractor.close()
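As a rough illustration of post-processing, the sketch below summarizes a dictionary shaped like the one `extract_all_pages()` appears to return in the example above. The `summarize_pages` helper and the sample data are hypothetical and not part of pdfcoordex. It also assumes that `images` and `tables` are lists, which the `len()` calls in the single-page example suggest but do not confirm.

```python
from typing import Any, Dict


def summarize_pages(all_pages: Dict[str, Dict[str, Any]]) -> Dict[str, Dict[str, int]]:
    """Count characters, images, and tables per page (hypothetical helper)."""
    summary = {}
    for page_name, data in all_pages.items():
        summary[page_name] = {
            "characters": len(data.get("text", "")),
            "images": len(data.get("images", [])),
            "tables": len(data.get("tables", [])),
        }
    return summary


# Stand-in data: the key names mirror the examples above, the values are made up.
sample_pages = {
    "page_1": {"text": "Hello PDF", "images": [], "tables": [[["a", "b"]]]},
    "page_2": {"text": "", "images": [{"format": "PNG"}], "tables": []},
}

page_summary = summarize_pages(sample_pages)
for name, counts in page_summary.items():
    print(name, counts)
```

With a real document, `sample_pages` would be replaced by the value returned from `extractor.extract_all_pages()`, provided its structure matches these assumptions.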
Using Context Manager
from pdfcoordex import PDFCoordExtractor
# Automatically closes the PDF
with PDFCoordExtractor("document.pdf") as extractor:
    extractor.save_to_text_file("output.txt")
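For readers unfamiliar with the pattern, here is a minimal sketch of how a class can support `with` by calling `close()` on exit. `FakeExtractor` is a hypothetical stand-in written for this illustration. It does not open any file and is not how `PDFCoordExtractor` is necessarily implemented.

```python
class FakeExtractor:
    """Hypothetical stand-in, not the real PDFCoordExtractor."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.closed = False

    def close(self) -> None:
        # A real extractor would release the underlying PDF handle here.
        self.closed = True

    def __enter__(self) -> "FakeExtractor":
        return self

    def __exit__(self, exc_type, exc_value, traceback) -> bool:
        self.close()
        return False  # do not suppress exceptions raised inside the block


with FakeExtractor("document.pdf") as fake:
    inside_closed = fake.closed  # still open inside the block

after_closed = fake.closed  # closed once the block exits
print(inside_closed, after_closed)
```

Because `__exit__` runs even when the block raises, this form avoids the leaked handle that a forgotten `close()` call would cause.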
Output Format
The text file output includes:
Text Content
All text from each page as plain text.
Image Analysis
- Image format (PNG, JPEG, etc.)
- Image dimensions (width x height)
- Orientation (landscape, portrait, square)
- Type classification (diagram, icon, chart)
- Color mode information
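The orientation label can be pictured as a simple comparison of width and height. The function below is a sketch under that assumption. The name `classify_orientation` is hypothetical, and the library's actual rule (for example, any tolerance for near-square images) is not documented here.

```python
def classify_orientation(width: int, height: int) -> str:
    """Label an image by comparing its pixel dimensions (illustrative rule)."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if width > height:
        return "landscape"
    if height > width:
        return "portrait"
    return "square"


for size in [(800, 600), (600, 800), (512, 512)]:
    print(size, classify_orientation(*size))
```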
Tables
Tables are formatted with pipe separators for easy reading:
Header1 | Header2 | Header3
Data1 | Data2 | Data3
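If the pipe-separated rows need to be read back into Python, a small parser along these lines may be enough. `parse_pipe_table` is a hypothetical helper, not part of pdfcoordex, and it assumes that cell values never contain a `|` character themselves.

```python
from typing import List


def parse_pipe_table(lines: List[str]) -> List[List[str]]:
    """Split pipe-separated lines into rows of stripped cell strings."""
    rows = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines between tables
        rows.append([cell.strip() for cell in line.split("|")])
    return rows


table_lines = [
    "Header1 | Header2 | Header3",
    "Data1 | Data2 | Data3",
]

rows = parse_pipe_table(table_lines)
header, body = rows[0], rows[1:]
print(header)
print(body)
```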
Requirements
- Python >= 3.8
- pdfplumber >= 0.10.3
- PyMuPDF >= 1.23.8
- Pillow >= 10.0.0
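A script that depends on pdfcoordex could guard against unsupported interpreters before importing it. The check below is a sketch: `python_is_supported` is a hypothetical helper, and only the Python minimum from the list above is verified, not the package versions.

```python
import sys

MIN_PYTHON = (3, 8)  # taken from the requirements list above


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True if the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= MIN_PYTHON


if not python_is_supported():
    print("pdfcoordex requires Python 3.8 or newer")
```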
License
MIT License
Support
For issues and questions, please visit the GitHub repository.
File details
Details for the file pdfcoordex-0.1.2.tar.gz.
File metadata
- Download URL: pdfcoordex-0.1.2.tar.gz
- Upload date:
- Size: 4.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6c7279a0bd6e8d1ffa14175437fbf8af1aa57160535328ad5502309231cf1bbb |
| MD5 | 19e8367b236ad16036a6584ad5ee5d67 |
| BLAKE2b-256 | 84eeec7928bd21d1b18f637c384766f757cf0b7eb1a880da4a19c5c5b89cc540 |
File details
Details for the file pdfcoordex-0.1.2-py3-none-any.whl.
File metadata
- Download URL: pdfcoordex-0.1.2-py3-none-any.whl
- Upload date:
- Size: 4.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3fb33ea460f4b661ec6e67714d742762d7c5c6c6a3a2cb94a3cd9f02a99188ae |
| MD5 | 883854cc2cac353d0f077be25f31d7a5 |
| BLAKE2b-256 | 2a02046e28b281b676aa714ecfefaccf58bc9abab9ac4d52aa20af43a787b07b |