Structured document processor with diagram/image/text extraction with optional langchain or dataset output

These details have not been verified by PyPI

Project description

📄 doxtract

doxtract is a high-level document preprocessing toolkit that extracts per-page structured metadata from PDF, DOCX, PPTX, and TXT files, with optional diagram/image detection and native support for RAG pipelines via 🦜 LangChain and 🤗 HuggingFace datasets.Dataset.

✨ Features

🔍 Detects and skips repeating headers and footers
🧠 Heuristically filters out Table of Contents pages
🖼 Extracts vector diagrams and embedded raster images
📑 Reconstructs clean plain-text or Markdown layouts
🦜 LangChain Integration: Native export to langchain_core.documents.Document with enriched metadata.
🔁 Flexible Return Formats:
- A nested Python dictionary (dict[doc_name → list[pages]])
- A 🤗 datasets.Dataset for ML/NLP pipelines.
- A list of LangChain Document objects.
🚫 Warns on scanned PDFs without OCR — no extraction guesswork
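
To make the header/footer feature concrete, here is a minimal sketch of one way repeating lines could be detected: count the first and last non-empty line of every page and treat a line as a header or footer once it recurs on enough pages. The function names and the `min_pages` parameter are hypothetical, and this is not necessarily the heuristic doxtract uses internally.

```python
from collections import Counter


def find_repeating_lines(pages, min_pages=3):
    """Return (headers, footers): first/last lines seen on >= min_pages pages.

    `pages` is a list of page texts. Illustrative heuristic only; it is an
    assumption, not doxtract's actual algorithm.
    """
    top, bottom = Counter(), Counter()
    for text in pages:
        lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
        if not lines:
            continue  # blank page: nothing to count
        top[lines[0]] += 1
        bottom[lines[-1]] += 1
    headers = {ln for ln, n in top.items() if n >= min_pages}
    footers = {ln for ln, n in bottom.items() if n >= min_pages}
    return headers, footers


def strip_repeating_lines(text, headers, footers):
    """Drop a leading header line and a trailing footer line, if present."""
    lines = text.splitlines()
    if lines and lines[0].strip() in headers:
        lines = lines[1:]
    if lines and lines[-1].strip() in footers:
        lines = lines[:-1]
    return "\n".join(lines)


# Hypothetical sample pages
pages = [
    "My Spec Sheet\nIntroduction text\nConfidential",
    "My Spec Sheet\nPin layout\nConfidential",
    "My Spec Sheet\nElectrical limits\nConfidential",
]
headers, footers = find_repeating_lines(pages, min_pages=3)
cleaned = [strip_repeating_lines(p, headers, footers) for p in pages]
print(cleaned[1])
```

Exact matching like this would miss footers that change per page (for example "Page 1", "Page 2"); a real implementation would likely need fuzzy comparison.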

📦 Installation

pip install doxtract

Or for local development:

git clone https://github.com/EthanRyne/Advanced_pdf_extractor
cd Advanced_pdf_extractor
pip install -e .

Make sure you have LibreOffice installed and available as soffice in your PATH (required for .docx, .pptx, .txt conversion).
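
A quick way to confirm the LibreOffice requirement before processing office files is to look the binary up on PATH. This sketch uses only the standard library; the helper name `find_soffice` is hypothetical and not part of doxtract.

```python
import shutil


def find_soffice(binary="soffice"):
    """Return the full path to the LibreOffice binary, or None if it is not on PATH."""
    return shutil.which(binary)


path = find_soffice()
if path is None:
    print("soffice not found on PATH; .docx/.pptx/.txt inputs cannot be converted")
else:
    print(f"Using LibreOffice at {path}")
```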

🧪 Quick Example

from doxtract.processor import preprocess

output = preprocess(
    ["input/spec_sheet.pdf", "notes.docx"],
    markdown=True,               # Output GitHub-flavored Markdown
    extract_vectors=True,        # Extract vector diagrams
    extract_images=True,         # Extract raster images
    strip_headers_footers=True,  # Remove headers/footers from text
    preserve_layout=True,        # If True, use exact spacing from the PDF
    max_workers=None,            # Optional worker count for parallel document processing
    as_dataset=True              # Return a HuggingFace Dataset
)
print(output)

⚙️ Parameters

Name	Type	Description
`paths`	`list[str]`	List of input files (`.pdf`, `.docx`, `.pptx`, `.txt`)
`markdown`	`bool`	If `True`, output uses GitHub‑flavored Markdown
`extract_vectors`	`bool`	Save and log bounding boxes of detected diagrams
`extract_images`	`bool`	Save visible images per page
`output_root`	`str or Path`	Directory to store outputs and extracted media
`strip_headers_footers`	`bool`	Remove recurring headers/footers from output text
`preserve_layout`	`bool`	If `True`, use exact spacing from the PDF
`max_workers`	`int`	If given, the number of workers used to process documents in parallel
`as_langchain_docs`	`bool`	Return as a list of `langchain_core.documents.Document` objects
`as_dataset`	`bool`	Return as HuggingFace `datasets.Dataset`
(advanced tuning knobs)
`vector_margin`	`int`	Padding around diagrams (in px)
`page_top_pct`	`float`	Percentage of page height, from the top, searched for headers
`page_bottom_pct`	`float`	Percentage of page height, from the bottom, searched for footers
`min_header_pages`	`int`	Minimum number of pages sharing a similar header/footer for it to count as one
`toc_threshold`	`int`	TOC detection sensitivity
`y_tol`	`int`	Vertical tolerance for grouping text into lines
`space_thresh`	`int`	Horizontal gap treated as one space
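
The roles of `y_tol` and `space_thresh` are easier to see in code. The sketch below groups word boxes into lines using a vertical tolerance and inserts a space when the horizontal gap exceeds a threshold. The parameter names mirror the table, but the function, the `(x0, x1, y, text)` tuple layout, and the logic itself are assumptions for illustration, not doxtract's implementation.

```python
def group_into_lines(words, y_tol=3, space_thresh=2):
    """Group (x0, x1, y, text) word boxes into rendered text lines.

    A word joins the current line when its y is within `y_tol` of the line's
    first word; a horizontal gap larger than `space_thresh` between
    neighbouring words becomes one space. Illustrative only.
    """
    lines = []  # each entry: [anchor_y, [words...]]
    for word in sorted(words, key=lambda w: (w[2], w[0])):
        if lines and abs(word[2] - lines[-1][0]) <= y_tol:
            lines[-1][1].append(word)
        else:
            lines.append([word[2], [word]])

    rendered = []
    for _, line_words in lines:
        line_words.sort(key=lambda w: w[0])  # left-to-right within the line
        text = line_words[0][3]
        for prev, cur in zip(line_words, line_words[1:]):
            gap = cur[0] - prev[1]
            text += (" " if gap > space_thresh else "") + cur[3]
        rendered.append(text)
    return rendered


# Hypothetical word boxes
words = [
    (10, 40, 100.0, "Pin"),
    (45, 90, 101.5, "layout"),
    (10, 30, 120.0, "VCC"),
    (31, 35, 120.0, ":"),
    (40, 60, 120.5, "3.3V"),
]
result = group_into_lines(words)
print(result)
```

A larger `y_tol` merges slightly misaligned words (such as superscripts) into one line, at the risk of merging tightly spaced rows.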

🛑 OCR Handling

If a PDF is detected to be a scanned document with no embedded text, doxtract will abort the run with a warning:

⚠️ scanned_file.pdf looks like a scanned PDF with no text layer. Please run OCR first; aborting.

To preprocess such files, run OCR first using OCRmyPDF or similar tools.
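
As a rough illustration of the check described above, the sketch below flags a document as scanned when none of its pages contains any extractable text. It takes plain page strings (which could come from PyMuPDF's `page.get_text()`); the function names and the exception class are hypothetical, and raising an exception is a design choice of this sketch rather than confirmed doxtract behavior.

```python
class ScannedPDFError(RuntimeError):
    """Raised when a document appears to have no text layer (hypothetical)."""


def looks_scanned(page_texts, min_chars=1):
    """Return True if the document has pages but none holds min_chars of text."""
    if not page_texts:
        return False  # an empty document is not treated as scanned
    return all(len("".join(t.split())) < min_chars for t in page_texts)


def ensure_text_layer(name, page_texts):
    """Abort early, mirroring the warning text shown above."""
    if looks_scanned(page_texts):
        raise ScannedPDFError(
            f"{name} looks like a scanned PDF with no text layer. "
            "Please run OCR first; aborting."
        )


ensure_text_layer("spec.pdf", ["Intro", "  ", "Pin layout"])  # passes silently
try:
    ensure_text_layer("scanned_file.pdf", ["", "  \n", "\t"])
except ScannedPDFError as exc:
    message = str(exc)
    print(message)
```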

📁 Output Example (simplified)

Each output "page" is a dictionary with:

{
  "document_name": "spec.pdf",
  "page_number": 3,
  "page_content": "...",
  "is_toc_page": false,
  "headers": ["My Spec Sheet"],
  "footers": [],
  "diagrams": [
    {"path": "Doc Data/spec/diagrams/p003_1.png", "bbox": [12.1, 55.2, 430.6, 310.4]}
  ],
  "images_on_this_page": [
    "Doc Data/spec/images/p003_xref12.png"
  ]
}
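
Since the default return value is described as a nested dictionary (`dict[doc_name → list[pages]]`), extracted media can be gathered with a plain loop. The sketch below assumes each page carries the keys shown in the example above; the sample data and the `collect_media` helper are hypothetical.

```python
def collect_media(output):
    """Map each document name to the diagram and image paths found on its pages."""
    media = {}
    for doc_name, pages in output.items():
        paths = []
        for page in pages:
            paths += [d["path"] for d in page.get("diagrams", [])]
            paths += page.get("images_on_this_page", [])
        media[doc_name] = paths
    return media


# Hypothetical output shaped like the example page above
output = {
    "spec.pdf": [
        {
            "document_name": "spec.pdf",
            "page_number": 3,
            "page_content": "...",
            "is_toc_page": False,
            "headers": ["My Spec Sheet"],
            "footers": [],
            "diagrams": [
                {"path": "Doc Data/spec/diagrams/p003_1.png",
                 "bbox": [12.1, 55.2, 430.6, 310.4]}
            ],
            "images_on_this_page": ["Doc Data/spec/images/p003_xref12.png"],
        }
    ]
}
media = collect_media(output)
print(media["spec.pdf"])
```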

📑 Metadata & LangChain Compatibility

When using as_langchain_docs=True, doxtract populates each Document's metadata with the keys below, so RAG citations can point to the exact source file and page:

Metadata Key	Description
`source`	The full path to the source file
`page`	The 0-indexed page number
`total_pages`	Total pages in the document
`creationdate`	Normalized PDF creation timestamp
`moddate`	Normalized PDF modification timestamp
`title` / `author`	Title and author read from the PDF's document metadata
`is_toc_page`	Boolean flag indicating if the page is a TOC
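
Because `page` is 0-indexed, a citation shown to users usually needs `page + 1`. The sketch below builds a citation string from a plain metadata dictionary shaped like the table above; with a real LangChain Document the same dictionary would be read from `doc.metadata`. The helper name and the sample values are hypothetical.

```python
def format_citation(metadata):
    """Build a human-readable citation from doxtract-style metadata."""
    title = metadata.get("title") or metadata.get("source", "unknown source")
    page = metadata.get("page")
    total = metadata.get("total_pages")
    if page is None:
        return title
    human_page = page + 1  # metadata pages are 0-indexed
    if total:
        return f"{title}, p. {human_page} of {total}"
    return f"{title}, p. {human_page}"


# Hypothetical metadata values
meta = {
    "source": "input/spec_sheet.pdf",
    "page": 2,
    "total_pages": 14,
    "title": "My Spec Sheet",
    "is_toc_page": False,
}
citation = format_citation(meta)
print(citation)
```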

🤗 Dataset Mode

If as_dataset=True, the output is a HuggingFace-compatible datasets.Dataset, ideal for training/evaluation workflows:

from doxtract.processor import preprocess

ds = preprocess(["spec.pdf"], as_dataset=True)
print(ds[0]["page_content"])

🦜 LangChain Mode

If as_langchain_docs=True, the output is a list of LangChain Document objects, ready to feed into a LangChain RAG pipeline:

from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

from doxtract.processor import preprocess

# Returns a list of langchain_core.documents.Document objects
output = preprocess(
    ["spec.pdf"],
    markdown=True,
    preserve_layout=True,
    as_langchain_docs=True,
)

# The documents can be indexed directly in a vector store.
# OpenAIEmbeddings reads the API key from the OPENAI_API_KEY environment variable.
vectorstore = Chroma.from_documents(documents=output, embedding=OpenAIEmbeddings())

🧱 Dependencies

PyMuPDF (fitz)
langchain (optional, for LangChain output)
datasets (optional, for dataset output)
LibreOffice (soffice) for office conversion

🧑‍💻 License

📬 Contributing

Pull requests welcome! For major changes, please open an issue first to discuss what you’d like to change or improve.

Project details

Release history

- 0.0.9 (this version): Apr 13, 2026
- 0.0.8: Mar 3, 2026
- 0.0.7: Mar 3, 2026
- 0.0.6: Mar 3, 2026
- 0.0.5: Jul 7, 2025
- 0.0.4: Jul 7, 2025
- 0.0.3: Jul 7, 2025
- 0.0.2: Jul 7, 2025
- 0.0.1: Jul 6, 2025

Download files

Source Distribution

doxtract-0.0.9.tar.gz (15.8 kB view details)

Uploaded Apr 13, 2026 Source

Built Distribution

doxtract-0.0.9-py3-none-any.whl (14.8 kB view details)

Uploaded Apr 13, 2026 Python 3

File details

Details for the file doxtract-0.0.9.tar.gz.

File metadata

Download URL: doxtract-0.0.9.tar.gz
Upload date: Apr 13, 2026
Size: 15.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for doxtract-0.0.9.tar.gz
Algorithm	Hash digest
SHA256	`0fb993d21ebfae57b1da93d00c5d26e423ac470ac9567850ee17a18d4e9c24ef`
MD5	`e0167f8375ba72d02f3bc2bbd3ec4f38`
BLAKE2b-256	`1b4cf4b424ed7164e161f93f72afe7f426d08dc4e7ab9ce0efd95b09a2b79285`

File details

Details for the file doxtract-0.0.9-py3-none-any.whl.

File metadata

Download URL: doxtract-0.0.9-py3-none-any.whl
Upload date: Apr 13, 2026
Size: 14.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for doxtract-0.0.9-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0e6874dd11cfeffc95348c7548269a2ebe18d437fa5fb193d64ce65d0307a4be`
MD5	`0b0b604e7df9110d213fd489e7defa2e`
BLAKE2b-256	`8ba36873367ef25250694de732901b9a4a03d3cf5def7649c66d3553f240e4c1`
