Structured document processor with diagram/image/text extraction and optional LangChain or dataset output

📄 doxtract

doxtract is a high-level document preprocessing toolkit that extracts per-page structured metadata from PDF, DOCX, PPTX, or TXT files, with optional diagram/image detection and native support for RAG pipelines via 🦜 LangChain and 🤗 Hugging Face datasets.Dataset.


✨ Features

  • 🔍 Detects and skips repeating headers and footers
  • 🧠 Heuristically filters out Table of Contents pages
  • 🖼 Extracts vector diagrams and embedded raster images
  • 📑 Reconstructs clean plain-text or Markdown layouts
  • 🦜 LangChain Integration: Native export to langchain_core.documents.Document with enriched metadata.
  • 🔁 Flexible Return Formats:
    • A nested Python dictionary (dict[doc_name → list[pages]])
    • A 🤗 datasets.Dataset for ML/NLP pipelines.
    • A list of LangChain Document objects.
  • 🚫 Warns and aborts on scanned PDFs that have no text layer (run OCR first); no extraction guesswork
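
To give a feel for what a Table of Contents heuristic can look like, here is a minimal sketch. It is not doxtract's actual implementation: the function name `looks_like_toc`, the regular expression, and the default threshold are assumptions chosen for illustration. The idea is to flag a page when enough of its lines end in a page number preceded by dot leaders or a wide gap.

```python
import re

# Hypothetical pattern: some title text, then dot leaders or a wide gap,
# then a trailing page number.
TOC_LINE = re.compile(r"^\s*\S.*?(?:\.{3,}|\s{3,})\s*\d+\s*$")


def looks_like_toc(page_text: str, threshold: int = 5) -> bool:
    """Return True when at least `threshold` lines look like TOC entries."""
    hits = sum(1 for line in page_text.splitlines() if TOC_LINE.match(line))
    return hits >= threshold


toc_page = "\n".join(f"Section {i} ........ {i * 3}" for i in range(1, 8))
body_page = "This page describes the pump.\nIt was installed in 2019.\nSee section 4."

print(looks_like_toc(toc_page))   # seven matching lines, so the page is flagged
print(looks_like_toc(body_page))  # no matching lines, so it is kept
```

A count-based rule like this is easy to tune through a single number, which is presumably the role the toc_threshold parameter plays in doxtract.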

📦 Installation

pip install doxtract

Or for local development:

git clone https://github.com/EthanRyne/Advanced_pdf_extractor
cd Advanced_pdf_extractor
pip install -e .

Make sure you have LibreOffice installed and available as soffice in your PATH (required for .docx, .pptx, .txt conversion).
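
A quick way to confirm the soffice requirement before processing office files is to look the executable up on PATH. The snippet below uses only the Python standard library; the helper name `find_soffice` is hypothetical and is not part of doxtract.

```python
import shutil
from typing import Optional


def find_soffice() -> Optional[str]:
    """Return the full path to the soffice executable, or None if it is not on PATH."""
    return shutil.which("soffice")


path = find_soffice()
if path is None:
    print("soffice not found on PATH; .docx/.pptx/.txt conversion will likely fail")
else:
    print(f"soffice found at {path}")
```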


🧪 Quick Example

from doxtract.processor import preprocess

output = preprocess(
    ["input/spec_sheet.pdf", "notes.docx"],
    markdown=True,               # Output GitHub-flavored Markdown
    extract_vectors=True,        # Extract vector diagrams
    extract_images=True,         # Extract raster images
    strip_headers_footers=True,  # Remove headers/footers from text
    preserve_layout=True,        # If True, use exact spacing from the PDF
    max_workers=None,            # If given, will be used for parallel doc processing
    as_dataset=True              # Return a HuggingFace Dataset
)
print(output)

⚙️ Parameters

| Name | Type | Description |
|---|---|---|
| paths | list[str] | List of input files (.pdf, .docx, .pptx, .txt) |
| markdown | bool | If True, output uses GitHub-flavored Markdown |
| extract_vectors | bool | Save and log bounding boxes of detected diagrams |
| extract_images | bool | Save visible images per page |
| output_root | str or Path | Directory to store outputs and extracted media |
| strip_headers_footers | bool | Remove recurring headers/footers from output text |
| preserve_layout | bool | If True, use exact spacing from the PDF |
| max_workers | int | If given, will be used for parallel doc processing |
| as_langchain_docs | bool | Return as a list of langchain_core.documents.Document objects |
| as_dataset | bool | Return as HuggingFace datasets.Dataset |
| (advanced tuning knobs) | | |
| vector_margin | int | Padding around diagrams (in px) |
| page_top_pct | float | % height for detecting headers |
| page_bottom_pct | float | % height for detecting footers |
| min_header_pages | int | Min pages with similar header/footer to consider valid |
| toc_threshold | int | TOC detection sensitivity |
| y_tol | int | Line grouping tolerance (vertical) |
| space_thresh | int | Horizontal gap → one space |
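
The header/footer parameters above suggest a frequency-based approach: a line counts as a header only if it recurs on enough pages. The sketch below illustrates that idea under simplifying assumptions; it is not doxtract's code. It treats the first line of each page as the only header candidate, masks digits so that running page numbers compare equal, and uses hypothetical helper names (`normalize`, `repeating_headers`, `strip_headers`).

```python
import re
from collections import Counter


def normalize(line: str) -> str:
    """Collapse whitespace and mask digits so 'Page 3' and 'Page 4' compare equal."""
    return re.sub(r"\d+", "#", " ".join(line.split())).lower()


def repeating_headers(pages, min_header_pages=3):
    """Return normalized first lines that occur on at least `min_header_pages` pages."""
    counts = Counter(normalize(lines[0]) for lines in pages if lines)
    return {text for text, n in counts.items() if n >= min_header_pages}


def strip_headers(pages, min_header_pages=3):
    """Drop the first line of every page whose first line is a repeating header."""
    headers = repeating_headers(pages, min_header_pages)
    return [
        lines[1:] if lines and normalize(lines[0]) in headers else lines
        for lines in pages
    ]


pages = [
    ["My Spec Sheet", "1 Introduction", "This document ..."],
    ["My Spec Sheet", "2 Scope", "The scope is ..."],
    ["My Spec Sheet", "3 Terms", "A term is ..."],
    ["Appendix A", "Extra tables"],
]
cleaned = strip_headers(pages)
print(cleaned[0])  # ['1 Introduction', 'This document ...']
print(cleaned[3])  # ['Appendix A', 'Extra tables']
```

Raising min_header_pages makes the filter more conservative: a title that appears on only one or two pages is kept as body text.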

🛑 OCR Handling

If a PDF is detected to be a scanned document with no embedded text, doxtract will abort the run with a warning:

⚠️ scanned_file.pdf looks like a scanned PDF with no text layer. Please run OCR first; aborting.

To preprocess such files, run OCR first using OCRmyPDF or similar tools.
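
If you want to pre-screen files yourself before handing them to doxtract, a text-layer check can be sketched as follows. This is an illustration, not doxtract's own detection logic: the helper names and the `min_chars` cutoff are assumptions. The decision function is pure Python; `pdf_page_texts` shows one way to obtain the page texts with PyMuPDF, which doxtract already depends on.

```python
from typing import Iterable, List


def has_text_layer(page_texts: Iterable[str], min_chars: int = 1) -> bool:
    """Return True if any page carries at least `min_chars` non-whitespace characters."""
    return any(len("".join(text.split())) >= min_chars for text in page_texts)


def pdf_page_texts(path: str) -> List[str]:
    """Read the plain text of every page with PyMuPDF (imported lazily)."""
    import fitz  # PyMuPDF

    with fitz.open(path) as doc:
        return [page.get_text() for page in doc]


# Pure-Python demonstration with pre-extracted page texts.
print(has_text_layer(["", "  \n", ""]))          # False: nothing but whitespace
print(has_text_layer(["", "1 Introduction\n"]))  # True: the second page has text
```

For a real file, `has_text_layer(pdf_page_texts("scanned_file.pdf"))` returning False is the cue to run OCR first, for example with `ocrmypdf input.pdf output.pdf`.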


📁 Output Example (simplified)

Each output "page" is a dictionary with:

{
  "document_name": "spec.pdf",
  "page_number": 3,
  "page_content": "...",
  "is_toc_page": false,
  "headers": ["My Spec Sheet"],
  "footers": [],
  "diagrams": [
    {"path": "Doc Data/spec/diagrams/p003_1.png", "bbox": [12.1, 55.2, 430.6, 310.4]}
  ],
  "images_on_this_page": [
    "Doc Data/spec/images/p003_xref12.png"
  ]
}
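
When the default nested dictionary is returned, the page records can be walked with plain loops. The sketch below gathers every extracted diagram and image path. It assumes the outer dictionary is keyed by document name, as described under Flexible Return Formats; the exact form of that key and the helper name `media_paths` are assumptions.

```python
def media_paths(result: dict) -> list:
    """Collect (document, page_number, path) for every diagram and image."""
    found = []
    for doc_name, pages in result.items():
        for page in pages:
            for diagram in page.get("diagrams", []):
                found.append((doc_name, page["page_number"], diagram["path"]))
            for image_path in page.get("images_on_this_page", []):
                found.append((doc_name, page["page_number"], image_path))
    return found


# Sample data shaped like the page dictionary shown above.
sample = {
    "spec.pdf": [
        {
            "document_name": "spec.pdf",
            "page_number": 3,
            "page_content": "...",
            "is_toc_page": False,
            "headers": ["My Spec Sheet"],
            "footers": [],
            "diagrams": [
                {"path": "Doc Data/spec/diagrams/p003_1.png", "bbox": [12.1, 55.2, 430.6, 310.4]}
            ],
            "images_on_this_page": ["Doc Data/spec/images/p003_xref12.png"],
        }
    ]
}

for doc_name, page_number, path in media_paths(sample):
    print(doc_name, page_number, path)
```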

📑 Metadata & LangChain Compatibility

When using as_langchain_docs=True, doxtract attaches the following metadata to each Document, so that RAG citations can point back to the exact source file and page:

| Metadata Key | Description |
|---|---|
| source | The full path to the source file |
| page | The 0-indexed page number |
| total_pages | Total pages in the document |
| creationdate | Normalized PDF creation timestamp |
| moddate | Normalized PDF modification timestamp |
| title/author | Metadata extracted from the PDF header |
| is_toc_page | Boolean flag indicating if the page is a TOC |
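
As an illustration of how these keys can feed a citation, the sketch below turns one metadata dictionary into a short reference string. It works on a plain dict so it runs without LangChain installed; with real output you would pass each Document's `metadata` attribute. The helper name `format_citation`, the sample values, and the assumption that title is stored under its own `title` key are all hypothetical.

```python
from pathlib import PurePosixPath


def format_citation(metadata: dict) -> str:
    """Build a short citation string from the metadata keys listed above."""
    # Fall back to the file name when no title was extracted.
    name = metadata.get("title") or PurePosixPath(metadata["source"]).name
    page = metadata["page"] + 1  # convert the 0-indexed page to a human-readable number
    return f"{name}, p. {page} of {metadata['total_pages']}"


# Hypothetical metadata values, shaped like the table above.
meta = {"source": "input/spec_sheet.pdf", "page": 2, "total_pages": 40, "title": ""}
print(format_citation(meta))  # spec_sheet.pdf, p. 3 of 40
```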

🤗 Dataset Mode

If as_dataset=True, the output is a HuggingFace-compatible datasets.Dataset, ideal for training/evaluation workflows:

from doxtract.processor import preprocess

ds = preprocess(["spec.pdf"], as_dataset=True)
print(ds[0]["page_content"])
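
A common next step is to drop TOC and empty pages before training or indexing. The sketch below does this over any iterable of row dictionaries; iterating a datasets.Dataset yields such dictionaries, so the same function should apply to `ds`. It assumes the dataset columns mirror the page dictionary keys (`is_toc_page`, `page_content`), and the helper name `content_pages` is hypothetical.

```python
from typing import Iterable, Iterator


def content_pages(rows: Iterable[dict]) -> Iterator[dict]:
    """Yield only rows that are not flagged as TOC pages and contain some text."""
    for row in rows:
        if not row["is_toc_page"] and row["page_content"].strip():
            yield row


# Stand-in rows; with real output, pass the Dataset itself.
rows = [
    {"page_number": 1, "is_toc_page": True, "page_content": "Contents ..."},
    {"page_number": 2, "is_toc_page": False, "page_content": "1 Introduction"},
    {"page_number": 3, "is_toc_page": False, "page_content": "   "},
]
kept = [row["page_number"] for row in content_pages(rows)]
print(kept)  # [2]
```

To keep the result as a Dataset rather than a generator, `ds.filter(lambda row: not row["is_toc_page"])` expresses the same TOC filter with the datasets API.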

🦜 LangChain Mode

If as_langchain_docs=True, the output is a list of LangChain Document objects, ready for LangChain RAG pipelines:

from doxtract.processor import preprocess

output = preprocess(
    ["spec.pdf"],
    markdown=True,
    preserve_layout=True,
    as_langchain_docs=True        
)

# You can immediately plug into a Vector Store
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

vectorstore = Chroma.from_documents(documents=output, embedding=OpenAIEmbeddings())

🧱 Dependencies

  • PyMuPDF (fitz)
  • langchain (optional, for LangChain output)
  • datasets (optional, for dataset output)
  • LibreOffice (soffice) for office conversion

🧑‍💻 License

MIT License © 2025


📬 Contributing

Pull requests welcome! For major changes, please open an issue first to discuss what you’d like to change or improve.
