Structured document processor with diagram/image/text extraction and dataset output
📄 doxtract
doxtract is a high-level document preprocessing toolkit that extracts per-page structured metadata from PDF, DOCX, PPTX, or TXT files, with optional diagram/image detection and support for returning the data as a 🤗 HuggingFace datasets.Dataset.
✨ Features
- 🔍 Detects and skips repeating headers and footers
- 🧠 Heuristically filters out Table of Contents pages
- 🖼 Extracts vector diagrams and embedded raster images
- 📑 Reconstructs clean plain-text or Markdown layouts
- 🔁 Returns either:
  - A nested Python dictionary (dict[doc_name → list[pages]])
  - A 🤗 datasets.Dataset for ML/NLP pipelines
- 🚫 Warns on scanned PDFs without OCR — no extraction guesswork
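To give a feel for the header/footer feature listed above, here is a minimal sketch of one way repeated page edges could be detected. It is not doxtract's actual implementation: the function name `find_repeating_edges`, the `min_pages` argument, and the "first/last non-empty line" rule are all assumptions made for illustration.

```python
from collections import Counter

def find_repeating_edges(pages, min_pages=3):
    """Hypothetical helper: pages is a list of pages, each a list of text lines.

    A line counts as a header (footer) if it is the first (last) non-empty
    line on at least `min_pages` pages.
    """
    tops, bottoms = Counter(), Counter()
    for lines in pages:
        lines = [ln.strip() for ln in lines if ln.strip()]
        if not lines:
            continue  # blank page: nothing to count
        tops[lines[0]] += 1
        bottoms[lines[-1]] += 1
    headers = {text for text, n in tops.items() if n >= min_pages}
    footers = {text for text, n in bottoms.items() if n >= min_pages}
    return headers, footers

# Made-up sample pages for illustration only.
pages = [
    ["My Spec Sheet", "1 Introduction", "Some body text."],
    ["My Spec Sheet", "2 Pinout", "More body text."],
    ["My Spec Sheet", "3 Timing", "Even more text."],
]
headers, footers = find_repeating_edges(pages)
print(headers, footers)
```

A real detector would likely also tolerate small differences such as page numbers; this sketch only matches identical lines.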
📦 Installation
pip install doxtract
Or for local development:
git clone https://github.com/EthanRyne/Advanced_pdf_extractor
cd Advanced_pdf_extractor
pip install -e .
Make sure you have LibreOffice installed and available as soffice in your PATH (required for .docx, .pptx, .txt conversion).
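A quick way to check the LibreOffice requirement before processing office files is to look up soffice on your PATH with the standard library. This is only a convenience sketch; the helper name `find_soffice` is an assumption and is not part of doxtract.

```python
import shutil

def find_soffice():
    """Return the full path to `soffice`, or None if it is not on PATH."""
    return shutil.which("soffice")

soffice_path = find_soffice()
if soffice_path is None:
    print("soffice not found on PATH; .docx/.pptx/.txt conversion may fail.")
else:
    print(f"Found LibreOffice at {soffice_path}")
```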
🧪 Quick Example
from doxtract.processor import preprocess

output = preprocess(
    ["input/spec_sheet.pdf", "notes.docx"],
    markdown=True,               # Output GitHub-flavored Markdown
    extract_vectors=True,        # Extract vector diagrams
    extract_images=True,         # Extract raster images
    strip_headers_footers=True,  # Remove headers/footers from text
    preserve_layout=False,       # If True, use exact spacing from the PDF
    as_dataset=True,             # Return a HuggingFace Dataset
)
print(output)
⚙️ Parameters
| Name | Type | Description |
|---|---|---|
| paths | list[str] | List of input files (.pdf, .docx, .pptx, .txt) |
| markdown | bool | If True, output uses GitHub-flavored Markdown |
| extract_vectors | bool | Save and log bounding boxes of detected diagrams |
| extract_images | bool | Save visible images per page |
| output_root | str or Path | Directory to store outputs and extracted media |
| strip_headers_footers | bool | Remove recurring headers/footers from output text |
| preserve_layout | bool | If True, use exact spacing from the PDF |
| as_dataset | bool | Return as HuggingFace datasets.Dataset |
| (advanced tuning knobs) | | |
| vector_margin | int | Padding around diagrams (in px) |
| page_top_pct | float | % height for detecting headers |
| page_bottom_pct | float | % height for detecting footers |
| min_header_pages | int | Min pages with similar header/footer to consider valid |
| toc_threshold | int | TOC detection sensitivity |
| y_tol | int | Line grouping tolerance (vertical) |
| space_thresh | int | Horizontal gap → one space |
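The y_tol and space_thresh knobs are easier to picture with a small example. The sketch below shows one plausible reading: words whose vertical position differs by at most y_tol share a line, and a horizontal gap of at least space_thresh becomes one space. This is an assumption about what the knobs mean, not doxtract's code; `group_into_lines`, the word tuple layout, and the default values are hypothetical.

```python
def group_into_lines(words, y_tol=3, space_thresh=2):
    """Hypothetical helper. words: iterable of (x0, x1, y, text) tuples."""
    lines = []  # each entry: [reference_y, [words on that line]]
    for word in sorted(words, key=lambda w: (w[2], w[0])):
        if lines and abs(word[2] - lines[-1][0]) <= y_tol:
            lines[-1][1].append(word)       # close enough vertically: same line
        else:
            lines.append([word[2], [word]])  # start a new line
    out = []
    for _, line_words in lines:
        line_words.sort(key=lambda w: w[0])  # left to right
        text = line_words[0][3]
        for prev, cur in zip(line_words, line_words[1:]):
            gap = cur[0] - prev[1]           # space between neighbouring words
            text += (" " if gap >= space_thresh else "") + cur[3]
        out.append(text)
    return out

# Made-up word boxes for illustration only.
words = [
    (10, 40, 100, "Hello"),
    (45, 80, 101, "world"),
    (10, 30, 120, "Pin"),
    (30.5, 38, 120, "1"),
]
print(group_into_lines(words))  # ['Hello world', 'Pin1']
```

Raising y_tol merges lines that sit slightly off the baseline; raising space_thresh glues more fragments together without a space.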
🛑 OCR Handling
If a PDF is detected to be a scanned document with no embedded text, doxtract will abort the run with a warning:
⚠️ scanned_file.pdf looks like a scanned PDF with no text layer. Please run OCR first; aborting.
To preprocess such files, run OCR first using OCRmyPDF or similar tools.
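If you want to screen files yourself before calling doxtract, a simple check is whether any page carries text at all. The sketch below works on a list of per-page strings; how doxtract actually decides is not documented here, so treat `looks_scanned` as a hypothetical helper and the rule as an assumption.

```python
def looks_scanned(page_texts):
    """Hypothetical helper: True if no page has any non-whitespace text.

    With PyMuPDF, the per-page strings could be collected with
    [page.get_text() for page in fitz.open(path)].
    """
    return not any(text.strip() for text in page_texts)

# Made-up page texts for illustration only.
print(looks_scanned(["", "  \n"]))           # no text layer on any page
print(looks_scanned(["", "Section 1 ..."]))  # at least one page has text
```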
📁 Output Example (simplified)
Each output "page" is a dictionary with:
{
  "document_name": "spec.pdf",
  "page_number": 3,
  "page_content": "...",
  "is_toc_page": false,
  "headers": ["My Spec Sheet"],
  "footers": [],
  "diagrams": [
    {"path": "Doc Data/spec/diagrams/p003_1.png", "bbox": [12.1, 55.2, 430.6, 310.4]}
  ],
  "images_on_this_page": [
    "Doc Data/spec/images/p003_xref12.png"
  ]
}
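Once you have the dictionary output, the page records above can be walked with plain Python. The sketch below skips TOC pages and collects diagram paths. The sample data is hand-written to mirror the structure shown above, and `content_pages` is a hypothetical helper, not part of doxtract.

```python
# Hand-written sample mirroring dict[doc_name -> list[pages]]; not real output.
output = {
    "spec.pdf": [
        {
            "document_name": "spec.pdf", "page_number": 1,
            "page_content": "Contents ...", "is_toc_page": True,
            "headers": [], "footers": [],
            "diagrams": [], "images_on_this_page": [],
        },
        {
            "document_name": "spec.pdf", "page_number": 3,
            "page_content": "...", "is_toc_page": False,
            "headers": ["My Spec Sheet"], "footers": [],
            "diagrams": [
                {"path": "Doc Data/spec/diagrams/p003_1.png",
                 "bbox": [12.1, 55.2, 430.6, 310.4]}
            ],
            "images_on_this_page": ["Doc Data/spec/images/p003_xref12.png"],
        },
    ]
}

def content_pages(output):
    """Yield every page record that is not flagged as a Table of Contents page."""
    for pages in output.values():
        for page in pages:
            if not page["is_toc_page"]:
                yield page

diagram_paths = [d["path"] for page in content_pages(output) for d in page["diagrams"]]
print(diagram_paths)
```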
🤗 Dataset Mode
If as_dataset=True, the output is a HuggingFace-compatible datasets.Dataset, ideal for training/evaluation workflows:
from doxtract.processor import preprocess

ds = preprocess(["spec.pdf"], as_dataset=True)  # returns a datasets.Dataset
print(ds[0]["page_content"])                    # text of the first page record
🧱 Dependencies
- PyMuPDF (fitz)
- tqdm
- datasets (optional, for dataset output)
- LibreOffice (soffice) for office conversion
🧑‍💻 License
MIT License © 2025
📬 Contributing
Pull requests welcome! For major changes, please open an issue first to discuss what you’d like to change or improve.