Layout-aware document parser for structured LLM-ready JSON
DocuWeave
RAG document compiler for PDFs: layout-aware parsing -> section hierarchy -> token-aware chunks -> vector-ready outputs.
DocuWeave helps you convert raw PDFs into structured context that performs better in retrieval pipelines.
Why DocuWeave
Most basic PDF loaders return flat text and lose structure. DocuWeave preserves document shape so retrieval can be more accurate and explainable.
- Layout-aware block parsing using PyMuPDF
- Automatic hierarchy construction from heading signals
- Token-aware chunking for embedding workflows
- Rich chunk metadata (section_path, page span, chunk links)
- Export paths for Pinecone, Weaviate, and FAISS-style JSONL
- LangChain and LlamaIndex-friendly adapters
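To make "token-aware chunking" concrete, here is a minimal sketch of greedy packing under a token budget. It is not DocuWeave's implementation: `count_tokens` and `pack_blocks` are hypothetical names, and the whitespace word count is only a stand-in for whatever tokenizer the library actually uses.

```python
from typing import Dict, List


def count_tokens(text: str) -> int:
    # Stand-in tokenizer (assumption): whitespace-separated words.
    return len(text.split())


def pack_blocks(blocks: List[str], max_tokens: int) -> List[Dict[str, object]]:
    """Greedily pack consecutive text blocks into chunks of at most max_tokens."""
    chunks: List[Dict[str, object]] = []
    current: List[str] = []
    current_tokens = 0
    for block in blocks:
        n = count_tokens(block)
        # Close the current chunk if adding this block would exceed the budget.
        if current and current_tokens + n > max_tokens:
            chunks.append({"text": "\n".join(current), "tokens": current_tokens})
            current, current_tokens = [], 0
        # A single block larger than the budget is kept whole, not split.
        current.append(block)
        current_tokens += n
    if current:
        chunks.append({"text": "\n".join(current), "tokens": current_tokens})
    return chunks


blocks = ["one two three", "four five", "six seven eight nine", "ten"]
chunks = pack_blocks(blocks, max_tokens=5)
for chunk in chunks:
    print(chunk["tokens"], repr(chunk["text"]))
```

Packing whole blocks keeps paragraph boundaries intact, at the cost of occasionally exceeding the budget when one block is larger than `max_tokens`.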
Installation
pip install docuweave
Requires Python 3.9+.
Install optional integration dependencies:
pip install "docuweave[integrations]"
Quick Start (Python)
from docuweave import parse
doc = parse("sample.pdf")
chunks = doc.to_chunks(max_tokens=500)
doc.save_json("output.json")
print(len(doc.get_sections()), len(chunks))
Quick Start (CLI)
docuweave sample.pdf -o output.json --max-tokens 500
Vector export modes:
docuweave sample.pdf --export pinecone -o pinecone_records.json
docuweave sample.pdf --export weaviate -o weaviate_records.json
docuweave sample.pdf --export faiss-jsonl -o faiss_records.jsonl
Output Shape
Chunks include retrieval-friendly metadata:
{
"id": "...",
"text": "...",
"tokens": 487,
"section_title": "Chapter 1",
"section_path": "Chapter 1 > Background",
"section_level": 1,
"page_start": 3,
"page_end": 5,
"previous_chunk_id": "...",
"next_chunk_id": "..."
}
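One use for the chunk links is widening a retrieved hit with its neighbors before passing context to a model. The sketch below works on plain dicts shaped like the record above. The helper `with_neighbors`, the sample ids, and the use of `None` for a missing link at either end are assumptions for illustration, not documented DocuWeave behavior.

```python
from typing import Dict, List

Chunk = Dict[str, object]


def with_neighbors(hit_id: str, chunks: List[Chunk], window: int = 1) -> List[Chunk]:
    """Return the hit plus up to `window` chunks on each side, in reading order."""
    by_id = {c["id"]: c for c in chunks}
    hit = by_id[hit_id]

    # Walk backwards through previous_chunk_id links.
    before: List[Chunk] = []
    cursor = hit.get("previous_chunk_id")
    while cursor in by_id and len(before) < window:
        before.append(by_id[cursor])
        cursor = by_id[cursor].get("previous_chunk_id")

    # Walk forwards through next_chunk_id links.
    after: List[Chunk] = []
    cursor = hit.get("next_chunk_id")
    while cursor in by_id and len(after) < window:
        after.append(by_id[cursor])
        cursor = by_id[cursor].get("next_chunk_id")

    return list(reversed(before)) + [hit] + after


# Made-up sample records; only the link fields matter here.
sample = [
    {"id": "c1", "text": "Intro.", "previous_chunk_id": None, "next_chunk_id": "c2"},
    {"id": "c2", "text": "Body.", "previous_chunk_id": "c1", "next_chunk_id": "c3"},
    {"id": "c3", "text": "End.", "previous_chunk_id": "c2", "next_chunk_id": None},
]
context = with_neighbors("c2", sample)
print([c["id"] for c in context])
```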
Integrations
Use adapters for common orchestration stacks:
langchain_docs = doc.to_langchain(max_tokens=500)
llama_nodes = doc.to_llamaindex(max_tokens=500)
Use vector payload exporters:
pinecone_records = doc.export_pinecone()
weaviate_records = doc.export_weaviate()
doc.export_faiss_jsonl("faiss_records.jsonl")
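A JSONL export can be read back one record per line before building an index. The sketch below assumes only that the file holds one JSON object per line; the field names in the demo records are made up and do not reflect the actual DocuWeave export schema. `read_jsonl` is a hypothetical helper, not part of the package.

```python
import json
import os
import tempfile
from typing import Dict, Iterator


def read_jsonl(path: str) -> Iterator[Dict[str, object]]:
    """Yield one parsed JSON object per non-empty line."""
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


# Demo on a throwaway file with made-up records (not real DocuWeave output).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_records.jsonl")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write('{"id": "a", "text": "first"}\n')
        handle.write('{"id": "b", "text": "second"}\n')
    records = list(read_jsonl(demo_path))

print(len(records))
```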
Architecture
- docuweave/parser.py -> layout block extraction and cleanup
- docuweave/hierarchy.py -> section tree construction
- docuweave/chunking.py -> token-aware chunk generation
- docuweave/integrations.py -> LangChain/LlamaIndex adapters
- docuweave/vector_exporters.py -> vector DB payload builders
- docuweave/api.py -> public API facade
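As an illustration of the section tree step, the sketch below turns detected headings into section_path strings with a stack of ancestors. It assumes headings arrive as (level, title) pairs in reading order; `section_paths` is a hypothetical helper and is not taken from docuweave/hierarchy.py.

```python
from typing import List, Tuple


def section_paths(headings: List[Tuple[int, str]]) -> List[str]:
    """Turn (level, title) pairs in reading order into 'A > B' style paths."""
    stack: List[Tuple[int, str]] = []
    paths: List[str] = []
    for level, title in headings:
        # Pop siblings and deeper sections so the stack holds only ancestors.
        while stack and stack[-1][0] >= level:
            stack.pop()
        stack.append((level, title))
        paths.append(" > ".join(t for _, t in stack))
    return paths


headings = [(1, "Chapter 1"), (2, "Background"), (2, "Scope"), (1, "Chapter 2")]
paths = section_paths(headings)
for path in paths:
    print(path)
```

The stack approach also tolerates skipped levels (for example a level-3 heading directly under level 1), which is common in noisy PDFs.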
Development
git clone https://github.com/venkateswararao18/docuweave.git
cd docuweave
pip install -e .
Run logic-focused tests:
python -m unittest tests/test_core_logic.py tests/test_integrations.py -v
Roadmap
- DOCX and HTML support
- More robust heading detection across noisy PDFs
- Table extraction improvements
- Optional semantic chunking mode
- Benchmark suite and quality report
Contributing
Pull requests are welcome. Open an issue with a PDF sample when reporting parsing bugs.
Author
- Venkateswara Rao Jannegorla
- GitHub: VenkateswaraRao18
- Email: mrvenky18@gmail.com
Support
- Report bugs or request features in GitHub Issues
- For package publishing and releases, use semantic versioning and keep README.md synced with shipped CLI/API features
License
MIT
File details
Details for the file docuweave-0.1.3.tar.gz.
File metadata
- Download URL: docuweave-0.1.3.tar.gz
- Upload date:
- Size: 14.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ebb0f4b9833821139ef865ea51ac07e02a3f43dda5a1ee86833f809e01a44ab3 |
| MD5 | ef483ef1d2eed22302701e8def66e47e |
| BLAKE2b-256 | 4cabe933be5dff581a919cafddd7873bbddf6f901ad48728b001a0dca067d78a |
File details
Details for the file docuweave-0.1.3-py3-none-any.whl.
File metadata
- Download URL: docuweave-0.1.3-py3-none-any.whl
- Upload date:
- Size: 14.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ad55a20d83ca37063ebac00dce5ad0e3d701ed2a8ad302f5feef4319cd5567c9 |
| MD5 | 9d31f493a48d19be50a6739adad338ee |
| BLAKE2b-256 | 5db82d0f131a39bd9f17194cb7d4a27443f316893e57884a7dd2a9c2e0033a8d |