Layout-aware document parser for structured LLM-ready JSON

DocuWeave

RAG document compiler for PDFs: layout-aware parsing -> section hierarchy -> token-aware chunks -> vector-ready outputs.

DocuWeave converts raw PDFs into structured, section-aware chunks so that retrieval pipelines can work with the document's structure instead of flat text.

Why DocuWeave

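Most basic PDF loaders return flat text and discard structure such as headings, section nesting, and page positions. DocuWeave keeps that structure and attaches it to every chunk, so retrieval can be more accurate and each result can be traced back to its section and pages.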

  • Layout-aware block parsing using PyMuPDF
  • Automatic hierarchy construction from heading signals
  • Token-aware chunking for embedding workflows
  • Rich chunk metadata (section_path, page span, chunk links)
  • Export paths for Pinecone, Weaviate, and FAISS-style JSONL
  • LangChain and LlamaIndex-friendly adapters
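
To make "token-aware chunking" concrete, here is a minimal sketch of the general idea; it is not DocuWeave's actual implementation. It greedily packs consecutive text blocks into chunks that stay within a token budget. The names `count_tokens` and `chunk_blocks` are hypothetical, and whitespace word counting stands in for a real tokenizer.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Stand-in tokenizer (assumption): counts whitespace-separated words."""
    return len(text.split())


def chunk_blocks(
    blocks: List[str],
    max_tokens: int,
    counter: Callable[[str], int] = count_tokens,
) -> List[str]:
    """Greedily pack consecutive blocks into chunks within max_tokens."""
    chunks: List[str] = []
    current: List[str] = []
    current_tokens = 0
    for block in blocks:
        n = counter(block)
        # Close the running chunk if adding this block would exceed the budget.
        if current and current_tokens + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(block)
        current_tokens += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


blocks = ["one two three", "four five", "six seven eight nine", "ten"]
chunks = chunk_blocks(blocks, max_tokens=5)
for chunk in chunks:
    print(count_tokens(chunk), repr(chunk))
```

In this sketch a single block that is larger than the budget becomes its own oversized chunk; a real chunker would usually split such a block further.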

Installation

pip install docuweave

Requires Python 3.9+.

Install optional integration dependencies:

pip install "docuweave[integrations]"

Quick Start (Python)

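from docuweave import parse

# Parse the PDF into a structured document
doc = parse("sample.pdf")

# Split the document into chunks of at most 500 tokens each
chunks = doc.to_chunks(max_tokens=500)

# Write the parsed result to a JSON file
doc.save_json("output.json")

# Report how many sections and chunks were produced
print(len(doc.get_sections()), len(chunks))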

Quick Start (CLI)

docuweave sample.pdf -o output.json --max-tokens 500

Vector export modes:

docuweave sample.pdf --export pinecone -o pinecone_records.json
docuweave sample.pdf --export weaviate -o weaviate_records.json
docuweave sample.pdf --export faiss-jsonl -o faiss_records.jsonl

Output Shape

Chunks include retrieval-friendly metadata:

{
  "id": "...",
  "text": "...",
  "tokens": 487,
  "section_title": "Chapter 1",
  "section_path": "Chapter 1 > Background",
  "section_level": 1,
  "page_start": 3,
  "page_end": 5,
  "previous_chunk_id": "...",
  "next_chunk_id": "..."
}
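
The metadata above lends itself to two common retrieval tasks: pulling every chunk under a section, and expanding a hit with its neighbours. The following is a minimal sketch using only the standard library. The sample records are hypothetical, the helper names `walk_forward` and `in_section` are not part of DocuWeave, and it assumes a missing link is stored as `None` (JSON `null`), which the example above does not confirm.

```python
from typing import Dict, Iterator, List

# Hypothetical records that follow the field names shown above.
chunks: List[dict] = [
    {"id": "c1", "text": "Intro", "section_path": "Chapter 1",
     "previous_chunk_id": None, "next_chunk_id": "c2"},
    {"id": "c2", "text": "History", "section_path": "Chapter 1 > Background",
     "previous_chunk_id": "c1", "next_chunk_id": "c3"},
    {"id": "c3", "text": "Methods", "section_path": "Chapter 2",
     "previous_chunk_id": "c2", "next_chunk_id": None},
]

by_id: Dict[str, dict] = {c["id"]: c for c in chunks}


def walk_forward(index: Dict[str, dict], start_id: str) -> Iterator[dict]:
    """Follow next_chunk_id links from start_id until the chain ends."""
    seen = set()
    current = index.get(start_id)
    while current is not None and current["id"] not in seen:
        seen.add(current["id"])  # guard against accidental cycles
        yield current
        next_id = current.get("next_chunk_id")
        current = index.get(next_id) if next_id else None


def in_section(records: List[dict], path: str) -> List[dict]:
    """Return chunks in the given section or any of its subsections."""
    return [
        r for r in records
        if r["section_path"] == path or r["section_path"].startswith(path + " > ")
    ]


print([c["id"] for c in walk_forward(by_id, "c1")])
print([c["id"] for c in in_section(chunks, "Chapter 1")])
```

Matching on `path + " > "` rather than on the bare prefix avoids treating "Chapter 10" as part of "Chapter 1".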

Integrations

Use adapters for common orchestration stacks:

langchain_docs = doc.to_langchain(max_tokens=500)
llama_nodes = doc.to_llamaindex(max_tokens=500)

Use vector payload exporters:

pinecone_records = doc.export_pinecone()
weaviate_records = doc.export_weaviate()
doc.export_faiss_jsonl("faiss_records.jsonl")
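
A JSONL file holds one JSON object per line, so an exported file can be read back with the standard library before building an index. The sketch below assumes nothing about the record schema; the `id` and `text` fields in the demo are hypothetical, and `read_jsonl` is an illustrative helper, not a DocuWeave function.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterator, Union


def read_jsonl(path: Union[str, Path]) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line."""
    with Path(path).open(encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_no}: invalid JSON") from exc


# Demo with hypothetical records; real exported fields may differ.
sample = [{"id": "c1", "text": "alpha"}, {"id": "c2", "text": "beta"}]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_records.jsonl"
    demo_path.write_text(
        "\n".join(json.dumps(r) for r in sample) + "\n", encoding="utf-8"
    )
    records = list(read_jsonl(demo_path))

print(len(records))
```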

Architecture

  • docuweave/parser.py -> layout block extraction and cleanup
  • docuweave/hierarchy.py -> section tree construction
  • docuweave/chunking.py -> token-aware chunk generation
  • docuweave/integrations.py -> LangChain/LlamaIndex adapters
  • docuweave/vector_exporters.py -> vector DB payload builders
  • docuweave/api.py -> public API facade
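
As an illustration of the section tree step, the sketch below builds a tree from a flat list of detected headings and derives `section_path` strings from it. It is a generic stack-based approach, not a copy of `docuweave/hierarchy.py`; the `Section` class, `build_tree`, and `section_paths` are hypothetical names, and heading levels are assumed to be integers starting at 1.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Section:
    title: str
    level: int
    children: List["Section"] = field(default_factory=list)


def build_tree(headings: List[Tuple[int, str]]) -> Section:
    """Build a section tree from (level, title) pairs in document order."""
    root = Section(title="", level=0)
    stack = [root]
    for level, title in headings:
        # Pop until the top of the stack is shallower: that is the parent.
        while stack[-1].level >= level:
            stack.pop()
        node = Section(title=title, level=level)
        stack[-1].children.append(node)
        stack.append(node)
    return root


def section_paths(node: Section, prefix: Tuple[str, ...] = ()) -> List[str]:
    """Return 'A > B' style paths for every section, depth first."""
    paths: List[str] = []
    for child in node.children:
        path = prefix + (child.title,)
        paths.append(" > ".join(path))
        paths.extend(section_paths(child, path))
    return paths


headings = [(1, "Chapter 1"), (2, "Background"), (2, "Scope"), (1, "Chapter 2")]
tree = build_tree(headings)
paths = section_paths(tree)
for p in paths:
    print(p)
```

A heading that skips a level (for example level 3 directly under level 1) is simply attached to the nearest shallower section in this sketch.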

Development

git clone https://github.com/venkateswararao18/docuweave.git
cd docuweave
pip install -e .

Run the core logic and integration tests:

python -m unittest tests/test_core_logic.py tests/test_integrations.py -v

Roadmap

  • DOCX and HTML support
  • More robust heading detection across noisy PDFs
  • Table extraction improvements
  • Optional semantic chunking mode
  • Benchmark suite and quality report

Contributing

Pull requests are welcome. Open an issue with a PDF sample when reporting parsing bugs.

Support

  • Report bugs or request features in GitHub Issues
  • When publishing releases, follow semantic versioning and keep README.md in sync with the shipped CLI and API features

License

MIT
