Structured document processor with diagram/image/text extraction and dataset output

📄 doxtract

doxtract is a high-level document preprocessing toolkit that extracts per-page structured metadata from PDF, DOCX, PPTX, and TXT files, with optional diagram/image detection and the option to return the data as a 🤗 Hugging Face datasets.Dataset.


✨ Features

  • 🔍 Detects and skips repeating headers and footers
  • 🧠 Heuristically filters out Table of Contents pages
  • 🖼 Extracts vector diagrams and embedded raster images
  • 📑 Reconstructs clean plain-text or Markdown layouts
  • 🔁 Returns either:
    • A nested Python dictionary (dict[doc_name → list[pages]])
    • A 🤗 datasets.Dataset for ML/NLP pipelines
  • 🚫 Warns and aborts on scanned PDFs that have no text layer (no extraction guesswork)

📦 Installation

pip install doxtract

Or for local development:

git clone https://github.com/EthanRyne/Advanced_pdf_extractor
cd Advanced_pdf_extractor
pip install -e .

Make sure you have LibreOffice installed and available as soffice in your PATH (required for .docx, .pptx, .txt conversion).
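
Before processing office formats, it can help to confirm that soffice is actually reachable. The following is a minimal pre-flight sketch using only the standard library; the helper name find_soffice is hypothetical and not part of doxtract.

```python
import shutil
from typing import Optional


def find_soffice(executable: str = "soffice") -> Optional[str]:
    """Return the full path of the LibreOffice binary, or None if it is not on PATH.

    Hypothetical helper for illustration; doxtract does not expose this function.
    """
    return shutil.which(executable)


path = find_soffice()
if path is None:
    print("soffice not found on PATH; .docx/.pptx/.txt conversion will likely fail")
else:
    print(f"LibreOffice found at {path}")
```

If the check fails, adding the LibreOffice program directory to PATH is usually enough.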


🧪 Quick Example

from doxtract.processor import preprocess

output = preprocess(
    ["input/spec_sheet.pdf", "notes.docx"],
    markdown=True,               # Output GitHub-flavored Markdown
    extract_vectors=True,        # Extract vector diagrams
    extract_images=True,         # Extract raster images
    strip_headers_footers=True,  # Remove headers/footers from text
    preserve_layout=False,       # If True, use exact spacing from the PDF
    max_workers=None,            # If given, will be used for parallel doc processing
    as_dataset=True              # Return a HuggingFace Dataset
)
print(output)

⚙️ Parameters

| Name | Type | Description |
| --- | --- | --- |
| paths | list[str] | List of input files (.pdf, .docx, .pptx, .txt) |
| markdown | bool | If True, output uses GitHub-flavored Markdown |
| extract_vectors | bool | Save and log bounding boxes of detected diagrams |
| extract_images | bool | Save visible images per page |
| output_root | str or Path | Directory to store outputs and extracted media |
| strip_headers_footers | bool | Remove recurring headers/footers from output text |
| preserve_layout | bool | If True, use exact spacing from the PDF |
| max_workers | int or None | If given, used for parallel document processing |
| as_dataset | bool | Return a HuggingFace datasets.Dataset |

Advanced tuning knobs:

| Name | Type | Description |
| --- | --- | --- |
| vector_margin | int | Padding around diagrams (in px) |
| page_top_pct | float | % height for detecting headers |
| page_bottom_pct | float | % height for detecting footers |
| min_header_pages | int | Min pages with similar header/footer to consider valid |
| toc_threshold | int | TOC detection sensitivity |
| y_tol | int | Line grouping tolerance (vertical) |
| space_thresh | int | Horizontal gap → one space |
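
To give a feel for how page_top_pct and min_header_pages might interact, here is a rough sketch of one common header-detection heuristic: collect the lines that fall in the top band of each page and keep those that repeat on enough pages. This is not doxtract's actual implementation. The function name, the input shape, the default values, and the reading of page_top_pct as a fraction of page height are all assumptions made for illustration.

```python
from collections import Counter


def find_repeating_headers(pages, page_height, page_top_pct=0.1, min_header_pages=3):
    """Return texts that appear in the top band on at least min_header_pages pages.

    pages: list of pages, each a list of (y, text) tuples, y measured from the top.
    page_top_pct is treated as a fraction of page height (an assumption).
    """
    cutoff = page_height * page_top_pct
    counts = Counter()
    for lines in pages:
        # Use a set so each candidate is counted at most once per page.
        seen = {text.strip() for y, text in lines if y <= cutoff and text.strip()}
        counts.update(seen)
    return {text for text, n in counts.items() if n >= min_header_pages}


pages = [
    [(20, "My Spec Sheet"), (120, "Intro text")],
    [(20, "My Spec Sheet"), (300, "More text")],
    [(20, "My Spec Sheet"), (40, "Chapter 2"), (200, "Body")],
]
headers = find_repeating_headers(pages, page_height=800)
print(headers)
```

In this sketch, raising min_header_pages makes detection stricter, while raising page_top_pct widens the band that is scanned for candidates.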

🛑 OCR Handling

If a PDF is detected to be a scanned document with no embedded text, doxtract will abort the run with a warning:

⚠️ scanned_file.pdf looks like a scanned PDF with no text layer. Please run OCR first; aborting.

To preprocess such files, run OCR first using OCRmyPDF or similar tools.
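
If you want to screen files yourself before calling doxtract, a simple check is whether any page yields extractable text. The sketch below is an assumption about how such a check could look, not doxtract's own detection logic; looks_scanned and page_texts_from_pdf are hypothetical names. It relies only on PyMuPDF's fitz.open and page.get_text().

```python
def looks_scanned(page_texts):
    """Return True if the document has pages but none contains non-whitespace text."""
    return len(page_texts) > 0 and all(not text.strip() for text in page_texts)


def page_texts_from_pdf(path):
    """Extract plain text per page with PyMuPDF (imported lazily)."""
    import fitz  # PyMuPDF

    with fitz.open(path) as doc:
        return [page.get_text() for page in doc]


# Example with already-extracted page texts:
print(looks_scanned(["", "  \n"]))      # no text layer on any page
print(looks_scanned(["", "Section 1"])) # at least one page has text
```

Files flagged this way can be passed through an OCR tool first and then handed to preprocess.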


📁 Output Example (simplified)

Each output "page" is a dictionary with:

{
  "document_name": "spec.pdf",
  "page_number": 3,
  "page_content": "...",
  "is_toc_page": false,
  "headers": ["My Spec Sheet"],
  "footers": [],
  "diagrams": [
    {"path": "Doc Data/spec/diagrams/p003_1.png", "bbox": [12.1, 55.2, 430.6, 310.4]}
  ],
  "images_on_this_page": [
    "Doc Data/spec/images/p003_xref12.png"
  ]
}
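
As an illustration of consuming the dictionary output, the sketch below walks a nested dict[doc_name → list[pages]], skips Table of Contents pages, and collects diagram paths. The sample data is made up to match the fields shown above, and the helper content_pages is hypothetical. It is also an assumption that the dictionary key is the file name.

```python
# Made-up sample mirroring the documented page fields.
output = {
    "spec.pdf": [
        {
            "document_name": "spec.pdf",
            "page_number": 1,
            "page_content": "Contents ...",
            "is_toc_page": True,
            "headers": [],
            "footers": [],
            "diagrams": [],
            "images_on_this_page": [],
        },
        {
            "document_name": "spec.pdf",
            "page_number": 3,
            "page_content": "...",
            "is_toc_page": False,
            "headers": ["My Spec Sheet"],
            "footers": [],
            "diagrams": [
                {"path": "Doc Data/spec/diagrams/p003_1.png",
                 "bbox": [12.1, 55.2, 430.6, 310.4]}
            ],
            "images_on_this_page": ["Doc Data/spec/images/p003_xref12.png"],
        },
    ]
}


def content_pages(result):
    """Yield every page dict that is not flagged as a Table of Contents page."""
    for pages in result.values():
        for page in pages:
            if not page["is_toc_page"]:
                yield page


diagram_paths = [d["path"] for page in content_pages(output) for d in page["diagrams"]]
print(diagram_paths)
```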

🤗 Dataset Mode

If as_dataset=True, the output is a HuggingFace-compatible datasets.Dataset, ideal for training/evaluation workflows:

from doxtract.processor import preprocess

ds = preprocess(["spec.pdf"], as_dataset=True)  # returns a datasets.Dataset
print(ds[0]["page_content"])                    # text of the first page row

🧱 Dependencies

  • PyMuPDF (fitz)
  • tqdm
  • datasets (optional, for dataset output)
  • LibreOffice (soffice) for office conversion

🧑‍💻 License

MIT License © 2025


📬 Contributing

Pull requests welcome! For major changes, please open an issue first to discuss what you’d like to change or improve.
