PDFStract - The Extraction and Chunking Layer in Your RAG Pipeline - Available as CLI - WEBUI - API
PDFStract
The Data Preparation Layer for RAG — Extract. Chunk. Embed.
One unified API. Switch between 10+ extraction libraries, 10+ chunking methods, and multiple embedding providers with a single parameter change. Focus on your RAG outcomes, not library dependencies.
Installation
pip install pdfstract # Base - pymupdf4llm, markitdown
pip install "pdfstract[standard]" # + OCR (pytesseract, unstructured)
pip install "pdfstract[advanced]" # + ML-powered (marker, docling, paddleocr)
pip install "pdfstract[all]" # Everything (quotes keep shells such as zsh from globbing the brackets)
Python API
from pdfstract import PDFStract
pdfstract = PDFStract()
# Extract
text = pdfstract.convert('document.pdf', library='auto')
# Chunk
chunks = pdfstract.chunk(text, chunker='semantic', chunk_size=512)
# Embed
vectors = pdfstract.embed_texts([c['text'] for c in chunks['chunks']])
# Combined pipelines
result = pdfstract.convert_chunk('document.pdf', library='marker', chunker='token')
result = pdfstract.convert_chunk_embed('document.pdf', embedding='sentence-transformers')
Extract Examples
# Auto-select best available library
text = pdfstract.convert('document.pdf', library='auto')
# Use specific library
text = pdfstract.convert('document.pdf', library='marker')
text = pdfstract.convert('document.pdf', library='docling', output_format='json')
# Batch processing
results = pdfstract.batch_convert('./pdfs', library='pymupdf4llm', parallel_workers=4)
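# Async (await must run inside a coroutine; asyncio.run drives it from synchronous code)
import asyncio
async def convert_one():
    return await pdfstract.convert_async('document.pdf', library='marker')
text = asyncio.run(convert_one())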
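The `library='auto'` option is described here only as selecting the best available library. One plausible way such a fallback could work is to probe for installed packages in a preference order. The sketch below illustrates that idea with the standard library alone. The function name `pick_first_available`, the `PREFERRED` list, and its ordering are hypothetical and are not taken from PDFStract's source.

```python
from importlib.util import find_spec

# Hypothetical preference order (assumed import names, heavier ML backends first).
PREFERRED = ["marker", "docling", "pymupdf4llm", "markitdown"]


def pick_first_available(candidates):
    """Return the first importable top-level module name, or None if none is installed."""
    for name in candidates:
        # find_spec returns None when a top-level module cannot be located.
        if find_spec(name) is not None:
            return name
    return None


if __name__ == "__main__":
    # Prints whichever backend is installed in the current environment, or None.
    print(pick_first_available(PREFERRED))
```

Probing with `find_spec` avoids importing heavy packages just to learn whether they exist.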
Chunk Examples
# Token-based chunking
chunks = pdfstract.chunk(text, chunker='token', chunk_size=512, chunk_overlap=50)
# Semantic chunking
chunks = pdfstract.chunk(text, chunker='semantic', chunk_size=1024)
# Code-aware chunking
chunks = pdfstract.chunk(code_text, chunker='code')
# Access results
for chunk in chunks['chunks']:
    print(f"Chunk {chunk['chunk_id']}: {chunk['token_count']} tokens")
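To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a minimal sketch of sliding-window chunking with overlap. It is not PDFStract's implementation: whitespace-separated words stand in for real tokenizer tokens, and the helper name `chunk_by_tokens` is hypothetical. The returned dictionary mirrors the `chunks['chunks']` shape and the `chunk_id`, `text`, and `token_count` keys used in the examples above.

```python
def chunk_by_tokens(text, chunk_size=512, chunk_overlap=50):
    """Split text into windows of chunk_size tokens that share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    tokens = text.split()  # assumption: whitespace words stand in for real tokens
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append({
            "chunk_id": len(chunks),
            "text": " ".join(window),
            "token_count": len(window),
        })
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return {"chunks": chunks}
```

With `chunk_size=4` and `chunk_overlap=1`, each window starts three tokens after the previous one, so consecutive chunks share exactly one token.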
Embed Examples
# Embed multiple texts
vectors = pdfstract.embed_texts(["First text", "Second text"], model='sentence-transformers')
# Embed single text
vector = pdfstract.embed_text("Hello world", model='openai')
# List available providers
providers = pdfstract.list_available_embeddings()
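In a RAG pipeline, embedding vectors are usually compared by cosine similarity at retrieval time. The sketch below shows that comparison in plain Python. It assumes each vector is a list of floats, which this page does not confirm for `embed_texts`, and the helper names `cosine_similarity` and `rank_by_similarity` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, vectors):
    """Return indices of vectors ordered from most to least similar to query_vec."""
    scores = [cosine_similarity(query_vec, v) for v in vectors]
    return sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)
```

For collections of more than a few thousand chunks, a vector database or an approximate nearest-neighbor index would normally replace this linear scan.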
CLI
pdfstract convert document.pdf --library marker
pdfstract convert-chunk document.pdf --chunker semantic
pdfstract convert-chunk-embed document.pdf --embedding sentence-transformers
pdfstract batch ./pdfs --parallel 4
What's Included
| Tier | Libraries |
|---|---|
| Base | pymupdf4llm, markitdown |
| Standard | + pytesseract, unstructured |
| Advanced | + marker, docling, paddleocr, deepseek |
Chunkers: token, sentence, semantic, recursive, code, and more
Embeddings: OpenAI, Azure, Google, Ollama, Sentence Transformers
Documentation
📖 pdfstract.com — Full docs, guides, and API reference
GitHub: github.com/aksarav/pdfstract · Issues · MIT License
Download files
File details
Details for the file pdfstract-1.1.1.tar.gz.
File metadata
- Download URL: pdfstract-1.1.1.tar.gz
- Upload date:
- Size: 68.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b960da81b616f84f34e3bcaf87e988afd104b7ba7012d6eddcaf8af054750533 |
| MD5 | 9940875a925d296d510132314aafcc8b |
| BLAKE2b-256 | c6d3fff99f2c9c0ec9f168bea29cbd268e2ac1d766c91cf3376ff52df22482a7 |
File details
Details for the file pdfstract-1.1.1-py3-none-any.whl.
File metadata
- Download URL: pdfstract-1.1.1-py3-none-any.whl
- Upload date:
- Size: 78.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f861e442a557c20d4fd6d4eec06d023762b0e9a74cd56b4d4ebbd7bb7b2b769a |
| MD5 | 2bf473b985b6127fdfa4064ec4aec795 |
| BLAKE2b-256 | 9324da3d6e491884f1f7d067b0ac4caf5bfc670a442819c3af837b6d73f6c50a |
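The published digests can be used to check a manually downloaded file before installing it. The following is a minimal sketch using only `hashlib` from the standard library. The helper names `sha256_of_file` and `matches_published_hash` are hypothetical, and the local file path in the usage note is an assumption about where the download was saved.

```python
import hashlib


def sha256_of_file(path, block_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in fixed-size blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Return True when the file's SHA256 digest equals the published value."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_published_hash("pdfstract-1.1.1.tar.gz", "b960da81b616f84f34e3bcaf87e988afd104b7ba7012d6eddcaf8af054750533")` should return `True` for an intact copy of the source distribution saved in the current directory.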