A comprehensive document parser for RAG applications with support for PDF, DOCX, PPTX, XLSX, and more

These details have not been verified by PyPI

Project links

Project description

RAG Parser

A comprehensive Python library for parsing documents into RAG-ready format. Supports PDF, DOCX, PPTX, XLSX, HTML, Markdown, and more with intelligent chunking strategies.

🚀 Features

Universal Document Parsing: Support for PDF, DOCX, PPTX, XLSX, HTML, MD, CSV, JSON, and images
Intelligent Chunking: Multiple strategies (Fixed, Semantic, Adaptive) optimized for RAG
Metadata Extraction: Rich metadata including author, creation date, structure info
Content Structure Preservation: Maintains headers, tables, images, and formatting context
Async Support: Full async/await support for high-performance processing
RAG-Optimized Output: Ready-to-embed chunks with proper citations and context
Framework Integration: Built-in adapters for LangChain and LlamaIndex
Extensible Architecture: Easy to add custom parsers and chunking strategies

📦 Installation

Basic Installation

pip install ragparser

With Specific Format Support

# PDF support
pip install ragparser[pdf]

# Office documents (DOCX, PPTX, XLSX)
pip install ragparser[office]

# HTML parsing
pip install ragparser[html]

# OCR for images
pip install ragparser[ocr]

# All formats
pip install ragparser[all]

Development Installation

git clone https://github.com/shubham7995/ragparser.git
cd ragparser
pip install -e ".[dev]"

🎯 Quick Start

Basic Usage

from ragparser import RagParser
from ragparser.core.models import ParserConfig

# Initialize parser
parser = RagParser()

# Parse a document
result = parser.parse("document.pdf")

if result.success:
    document = result.document
    print(f"Extracted {len(document.content)} characters")
    print(f"Created {len(document.chunks)} chunks")
    print(f"Found {len(document.tables)} tables")
else:
    print(f"Error: {result.error}")

Advanced Configuration

from ragparser import RagParser, ParserConfig, ChunkingStrategy

# Custom configuration
config = ParserConfig(
    chunking_strategy=ChunkingStrategy.SEMANTIC,
    chunk_size=1000,
    chunk_overlap=200,
    extract_tables=True,
    extract_images=True,
    clean_text=True
)

parser = RagParser(config)
result = parser.parse("complex_document.pdf")

Async Processing

import asyncio
from ragparser import RagParser

async def process_documents():
    parser = RagParser()
    
    # Process single document
    result = await parser.parse_async("document.pdf")
    
    # Process multiple documents concurrently
    files = ["doc1.pdf", "doc2.docx", "doc3.pptx"]
    results = await parser.parse_multiple_async(files)
    
    for result in results:
        if result.success:
            print(f"Processed: {result.document.metadata.file_name}")

asyncio.run(process_documents())

Processing from Bytes

# Parse document from bytes (e.g., from web upload)
with open("document.pdf", "rb") as f:
    data = f.read()

result = parser.parse_from_bytes(data, "document.pdf")

📚 Supported Formats

Format	Extensions	Features
PDF	`.pdf`	Text, images, tables, metadata, OCR
Word	`.docx`	Text, formatting, tables, images, comments
PowerPoint	`.pptx`	Slides, speaker notes, images, tables
Excel	`.xlsx`	Sheets, formulas, charts, named ranges
HTML	`.html`, `.htm`	Structure, links, images, tables
Markdown	`.md`, `.markdown`	Headers, code blocks, tables, links
Text	`.txt`	Plain text with encoding detection
CSV	`.csv`	Structured data with header detection
JSON	`.json`	Structured data parsing
Images	`.png`, `.jpg`, `.gif`, etc.	OCR text extraction

🔧 Chunking Strategies

Fixed Chunking

config = ParserConfig(
    chunking_strategy=ChunkingStrategy.FIXED,
    chunk_size=1000,
    chunk_overlap=200
)

Splits text into fixed-size chunks
Preserves sentence boundaries
Configurable overlap for context

Semantic Chunking

config = ParserConfig(
    chunking_strategy=ChunkingStrategy.SEMANTIC,
    chunk_size=1000
)

Groups content by semantic meaning
Respects document structure (headers, paragraphs)
Maintains topic coherence

Adaptive Chunking

config = ParserConfig(
    chunking_strategy=ChunkingStrategy.ADAPTIVE,
    chunk_size=1000
)

Dynamically adjusts chunk size based on content
Optimizes for embedding model context windows
Balances size and semantic coherence

🔍 Content Extraction

Text and Structure

# Access extracted content
document = result.document

# Full text content
print(document.content)

# Structured content blocks
for block in document.content_blocks:
    print(f"{block.block_type}: {block.content}")

# Chunked content ready for RAG
for chunk in document.chunks:
    print(f"Chunk {chunk.chunk_id}: {len(chunk.content)} chars")

Tables and Data

# Extract tables
for table in document.tables:
    print(f"Table with {len(table['data'])} rows")
    headers = table.get('headers', [])
    print(f"Headers: {headers}")

Metadata

meta = document.metadata
print(f"Title: {meta.title}")
print(f"Author: {meta.author}")
print(f"Pages: {meta.page_count}")
print(f"Words: {meta.word_count}")

🔗 Framework Integration

LangChain Integration

from ragparser.integrations.langchain_adapter import RagParserLoader

# Use as a LangChain document loader
loader = RagParserLoader("documents/")
documents = loader.load()

# With custom config
config = ParserConfig(chunking_strategy=ChunkingStrategy.SEMANTIC)
loader = RagParserLoader("documents/", config=config)
documents = loader.load()

LlamaIndex Integration

from ragparser.integrations.llamaindex_adapter import RagParserReader

# Use as a LlamaIndex reader
reader = RagParserReader()
documents = reader.load_data("document.pdf")

⚙️ Configuration Options

Parser Configuration

config = ParserConfig(
    # Chunking settings
    chunking_strategy=ChunkingStrategy.SEMANTIC,
    chunk_size=1000,
    chunk_overlap=200,
    
    # Content extraction
    extract_tables=True,
    extract_images=True,
    extract_metadata=True,
    extract_links=True,
    
    # Text processing
    clean_text=True,
    preserve_formatting=False,
    merge_paragraphs=True,
    
    # OCR settings
    enable_ocr=True,
    ocr_language="eng",
    ocr_confidence_threshold=0.7,
    
    # Performance
    max_file_size=100 * 1024 * 1024,  # 100MB
    timeout_seconds=300,
)

Runtime Configuration Updates

parser = RagParser()

# Update specific settings
parser.update_config(
    chunk_size=1500,
    extract_tables=False
)

# Add custom settings
parser.update_config(
    custom_ocr_model="my_model",
    special_processing=True
)

🚀 Performance Features

Async Processing: Non-blocking document processing
Concurrent Parsing: Process multiple documents simultaneously
Memory Efficient: Streaming for large files
Caching: Avoid reprocessing identical content
Lazy Loading: Only load parsers for formats you use

📊 Monitoring and Quality

Processing Statistics

result = parser.parse("document.pdf")

stats = result.processing_stats
print(f"Processing time: {stats['processing_time']:.2f}s")
print(f"File size: {stats['file_size']} bytes")
print(f"Chunks created: {stats['chunk_count']}")

Quality Metrics

document = result.document

# Content quality indicators
print(f"Quality score: {document.quality_score}")
print(f"Extraction notes: {document.extraction_notes}")

# Chunk quality
for chunk in document.chunks:
    print(f"Chunk tokens: {chunk.token_count}")
    print(f"Embedding ready: {chunk.embedding_ready}")

🧪 Testing

# Run all tests
pytest

# Run with coverage
pytest --cov=ragparser

# Run only fast tests
pytest -m "not slow"

# Run integration tests
pytest -m integration

🤝 Contributing

Contributions are welcome! Please see our Contributing Guide for details.

Development Setup

# Clone repository
git clone https://github.com/shubham7995/ragparser.git
cd ragparser

# Install in development mode
pip install -e ".[dev]"

# Install pre-commit hooks
pre-commit install

# Run tests
pytest

Adding New Parsers

from ragparser.parsers.base import BaseParser
from ragparser.core.models import ParsedDocument, FileType

class MyCustomParser(BaseParser):
    def __init__(self):
        super().__init__()
        self.supported_formats = [FileType.CUSTOM]
    
    async def parse_async(self, file_path, config):
        # Implement parsing logic
        return ParsedDocument(...)

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

🔗 Links

GitHub: https://github.com/shubham7995/ragparser
PyPI: https://pypi.org/project/ragparser/
Documentation: https://ragparser.readthedocs.io/
Issues: https://github.com/shubham7995/ragparser/issues

🏷️ Keywords

RAG, document parsing, PDF, DOCX, PPTX, XLSX, chunking, embedding, LangChain, LlamaIndex, async, OCR, metadata extraction

Built with ❤️ for the RAG and LLM community

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

1.0.5

Jul 25, 2025

1.0.4

Jul 25, 2025

1.0.3

Jul 24, 2025

This version

1.0.2

Jul 24, 2025

1.0.1

Jul 24, 2025

1.0.0

Jul 24, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ragparser-1.0.2.tar.gz (57.2 kB view details)

Uploaded Jul 24, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

ragparser-1.0.2-py3-none-any.whl (60.6 kB view details)

Uploaded Jul 24, 2025 Python 3

File details

Details for the file ragparser-1.0.2.tar.gz.

File metadata

Download URL: ragparser-1.0.2.tar.gz
Upload date: Jul 24, 2025
Size: 57.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.11.9

File hashes

Hashes for ragparser-1.0.2.tar.gz
Algorithm	Hash digest
SHA256	`c98246a5bef783f1c217aa6490de5192637454f058d13af4f9aa052ef04424b9`
MD5	`978c10e823022d99fb77d0eecd6bfa87`
BLAKE2b-256	`9c11571a2bd86ae0978a831925c640d349d205b75edebbc62ae889e90d4f8919`

See more details on using hashes here.

File details

Details for the file ragparser-1.0.2-py3-none-any.whl.

File metadata

Download URL: ragparser-1.0.2-py3-none-any.whl
Upload date: Jul 24, 2025
Size: 60.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.11.9

File hashes

Hashes for ragparser-1.0.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a52a3bf4058c87bafab5664b37a6341674e8b518322df0c019a521ad127e3abe`
MD5	`88dbdeb58bcd6b807eccd2425e8ae9a9`
BLAKE2b-256	`960166b79d3895d697165877f977cbdc31366e768bb4174be9d2d99a74b204ee`

See more details on using hashes here.

ragparser 1.0.2

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

RAG Parser

🚀 Features

📦 Installation

Basic Installation

With Specific Format Support

Development Installation

🎯 Quick Start

Basic Usage

Advanced Configuration

Async Processing

Processing from Bytes

📚 Supported Formats

🔧 Chunking Strategies

Fixed Chunking

Semantic Chunking

Adaptive Chunking

🔍 Content Extraction

Text and Structure

Tables and Data

Metadata

🔗 Framework Integration

LangChain Integration

LlamaIndex Integration

⚙️ Configuration Options

Parser Configuration

Runtime Configuration Updates

🚀 Performance Features

📊 Monitoring and Quality

Processing Statistics

Quality Metrics

🧪 Testing

🤝 Contributing

Development Setup

Adding New Parsers

📝 License

🔗 Links

🏷️ Keywords

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes