A comprehensive document parser for RAG applications with support for PDF, DOCX, PPTX, XLSX, and more

RAG Parser

A comprehensive Python library for parsing documents into a RAG-ready format. It supports PDF, DOCX, PPTX, XLSX, HTML, Markdown, and more, with intelligent chunking strategies.

🚀 Features

  • Universal Document Parsing: Support for PDF, DOCX, PPTX, XLSX, HTML, MD, CSV, JSON, and images
  • Intelligent Chunking: Multiple strategies (Fixed, Semantic, Adaptive) optimized for RAG
  • Metadata Extraction: Rich metadata including author, creation date, structure info
  • Content Structure Preservation: Maintains headers, tables, images, and formatting context
  • Async Support: Full async/await support for high-performance processing
  • RAG-Optimized Output: Ready-to-embed chunks with proper citations and context
  • Framework Integration: Built-in adapters for LangChain and LlamaIndex
  • Extensible Architecture: Easy to add custom parsers and chunking strategies

📦 Installation

Basic Installation

pip install ragparser

With Specific Format Support

# Quote the extras so shells such as zsh do not expand the brackets

# PDF support
pip install "ragparser[pdf]"

# Office documents (DOCX, PPTX, XLSX)
pip install "ragparser[office]"

# HTML parsing
pip install "ragparser[html]"

# OCR for images
pip install "ragparser[ocr]"

# All formats
pip install "ragparser[all]"

Development Installation

git clone https://github.com/shubham7995/ragparser.git
cd ragparser
pip install -e ".[dev]"

🎯 Quick Start

Basic Usage

from ragparser import RagParser

# Initialize parser
parser = RagParser()

# Parse a document
result = parser.parse("document.pdf")

if result.success:
    document = result.document
    print(f"Extracted {len(document.content)} characters")
    print(f"Created {len(document.chunks)} chunks")
    print(f"Found {len(document.tables)} tables")
else:
    print(f"Error: {result.error}")

Advanced Configuration

from ragparser import RagParser, ParserConfig, ChunkingStrategy

# Custom configuration
config = ParserConfig(
    chunking_strategy=ChunkingStrategy.SEMANTIC,
    chunk_size=1000,
    chunk_overlap=200,
    extract_tables=True,
    extract_images=True,
    clean_text=True
)

parser = RagParser(config)
result = parser.parse("complex_document.pdf")

Async Processing

import asyncio
from ragparser import RagParser

async def process_documents():
    parser = RagParser()
    
    # Process single document
    result = await parser.parse_async("document.pdf")
    
    # Process multiple documents concurrently
    files = ["doc1.pdf", "doc2.docx", "doc3.pptx"]
    results = await parser.parse_multiple_async(files)
    
    # Use a separate loop variable so the single-document `result` above is not shadowed
    for item in results:
        if item.success:
            print(f"Processed: {item.document.metadata.file_name}")

asyncio.run(process_documents())
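
When many files are parsed at once, it can help to cap how many are in flight at the same time. The sketch below shows that general pattern with `asyncio.Semaphore` and `asyncio.gather`. `fake_parse` and `parse_bounded` are hypothetical names used only for illustration; `fake_parse` stands in for `parser.parse_async`, and nothing here describes how `parse_multiple_async` works internally.

```python
import asyncio


async def fake_parse(path: str) -> dict:
    """Hypothetical stand-in for parser.parse_async(path)."""
    await asyncio.sleep(0.01)  # simulate I/O-bound parsing work
    return {"file": path, "success": not path.endswith(".bad")}


async def parse_bounded(paths: list[str], limit: int = 2) -> list[dict]:
    """Parse all paths concurrently with at most `limit` tasks running at once."""
    semaphore = asyncio.Semaphore(limit)

    async def worker(path: str) -> dict:
        async with semaphore:
            return await fake_parse(path)

    # gather() returns results in the same order as the input paths
    return await asyncio.gather(*(worker(p) for p in paths))


results = asyncio.run(parse_bounded(["doc1.pdf", "doc2.docx", "doc3.bad"]))
for r in results:
    print(r["file"], "ok" if r["success"] else "failed")
```

Because `gather` preserves input order, each result can be matched back to its file by position.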

Processing from Bytes

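from ragparser import RagParser

parser = RagParser()

# Parse a document from bytes (e.g., from a web upload)
with open("document.pdf", "rb") as f:
    data = f.read()

# The second argument is the original file name
result = parser.parse_from_bytes(data, "document.pdf")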
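
With uploaded bytes, the file name may be missing or wrong, so a caller might want to cross-check it against the leading bytes before parsing. The following is a minimal sketch of that idea using well-known file signatures. `sniff_format` is a hypothetical helper, not part of ragparser, and the library may identify formats differently.

```python
from pathlib import Path

# Well-known leading byte signatures for a few common formats.
MAGIC = [
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip-based office file"),  # DOCX, PPTX and XLSX are ZIP containers
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"GIF8", "gif"),  # covers GIF87a and GIF89a
]


def sniff_format(data: bytes, filename: str) -> str:
    """Guess a format from the leading bytes, falling back to the file extension."""
    for signature, label in MAGIC:
        if data.startswith(signature):
            return label
    # No signature matched: use the lowercased extension, if there is one
    suffix = Path(filename).suffix.lower().lstrip(".")
    return suffix or "unknown"


print(sniff_format(b"%PDF-1.7\n", "upload.bin"))
print(sniff_format(b"just some text", "notes.TXT"))
```

A ZIP signature alone cannot distinguish DOCX from PPTX or XLSX, so in that case the extension is still needed.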

📚 Supported Formats

| Format | Extensions | Features |
|---|---|---|
| PDF | .pdf | Text, images, tables, metadata, OCR |
| Word | .docx | Text, formatting, tables, images, comments |
| PowerPoint | .pptx | Slides, speaker notes, images, tables |
| Excel | .xlsx | Sheets, formulas, charts, named ranges |
| HTML | .html, .htm | Structure, links, images, tables |
| Markdown | .md, .markdown | Headers, code blocks, tables, links |
| Text | .txt | Plain text with encoding detection |
| CSV | .csv | Structured data with header detection |
| JSON | .json | Structured data parsing |
| Images | .png, .jpg, .gif, etc. | OCR text extraction |

🔧 Chunking Strategies

Fixed Chunking

config = ParserConfig(
    chunking_strategy=ChunkingStrategy.FIXED,
    chunk_size=1000,
    chunk_overlap=200
)

  • Splits text into fixed-size chunks
  • Preserves sentence boundaries
  • Configurable overlap for context
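
To make the interaction between `chunk_size`, `chunk_overlap`, and sentence boundaries concrete, here is a minimal stdlib sketch of one way such a strategy could work. `fixed_chunks` is a hypothetical function and the sentence splitter is deliberately naive; it is not the library's implementation.

```python
import re


def fixed_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Pack whole sentences into chunks of at most chunk_size characters.

    A single sentence longer than chunk_size is kept whole rather than cut.
    """
    # Naive sentence split: break after ., ! or ? followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > chunk_size:
            chunks.append(" ".join(current))
            # Carry trailing sentences forward as overlap, up to chunk_overlap characters
            carried: list[str] = []
            for prev in reversed(current):
                if len(" ".join([prev] + carried)) > chunk_overlap:
                    break
                carried.insert(0, prev)
            current = carried
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "One two three. Four five six. Seven eight nine. Ten eleven twelve."
for chunk in fixed_chunks(text, chunk_size=32, chunk_overlap=16):
    print(repr(chunk))
```

In this sketch the overlap is measured in whole sentences, so the repeated context never starts mid-sentence.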

Semantic Chunking

config = ParserConfig(
    chunking_strategy=ChunkingStrategy.SEMANTIC,
    chunk_size=1000
)

  • Groups content by semantic meaning
  • Respects document structure (headers, paragraphs)
  • Maintains topic coherence
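
The structure-respecting part of this strategy can be illustrated without any embedding model. The sketch below groups paragraphs under their nearest heading and starts a new chunk when `chunk_size` would be exceeded. `structure_chunks` is a hypothetical function that assumes Markdown-style `#` headings separated from body text by blank lines; the library's semantic chunker may work quite differently.

```python
import re


def structure_chunks(text: str, chunk_size: int = 1000) -> list[dict]:
    """Group paragraphs by heading, never mixing content from two sections."""
    chunks: list[dict] = []
    heading = None
    buffer: list[str] = []

    def flush() -> None:
        # Emit the buffered paragraphs as one chunk under the current heading
        if buffer:
            chunks.append({"heading": heading, "content": "\n\n".join(buffer)})
            buffer.clear()

    for block in re.split(r"\n\s*\n", text.strip()):
        block = block.strip()
        if not block:
            continue
        if block.startswith("#"):
            flush()  # close the previous section before switching heading
            heading = block.lstrip("#").strip()
            continue
        if buffer and len("\n\n".join(buffer + [block])) > chunk_size:
            flush()
        buffer.append(block)
    flush()
    return chunks


sample = "# Intro\n\nFirst paragraph.\n\nSecond paragraph.\n\n# Usage\n\nThird paragraph."
for chunk in structure_chunks(sample):
    print(chunk["heading"], "->", len(chunk["content"]), "chars")
```

Keeping the heading alongside each chunk gives the retriever some topical context even when a section is split across several chunks.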

Adaptive Chunking

config = ParserConfig(
    chunking_strategy=ChunkingStrategy.ADAPTIVE,
    chunk_size=1000
)

  • Dynamically adjusts chunk size based on content
  • Optimizes for embedding model context windows
  • Balances size and semantic coherence
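
One plausible reading of "adaptive" is a soft target size that is allowed to stretch up to a hard limit when that avoids splitting a paragraph. The sketch below illustrates that reading only. `adaptive_chunks`, `target`, and `hard_max` are hypothetical names, with `hard_max` playing the role of an embedding model's context limit; the library's actual heuristics are not documented here.

```python
def adaptive_chunks(paragraphs: list[str], target: int = 1000, hard_max: int = 1500) -> list[str]:
    """Pack paragraphs toward `target` characters without exceeding `hard_max`."""
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        if len(para) > hard_max:
            # Oversized paragraph: emit what we have, then hard-split the paragraph
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(para[i:i + hard_max] for i in range(0, len(para), hard_max))
            continue
        joined = f"{current}\n\n{para}" if current else para
        if current and (len(current) >= target or len(joined) > hard_max):
            # Current chunk is already big enough, or adding this paragraph would overflow
            chunks.append(current)
            current = para
        else:
            current = joined
    if current:
        chunks.append(current)
    return chunks


paragraphs = ["a" * 40, "b" * 40, "c" * 40, "d" * 200]
print([len(c) for c in adaptive_chunks(paragraphs, target=60, hard_max=100)])
```

The resulting chunk sizes vary with the paragraph lengths instead of being cut at a fixed offset.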

🔍 Content Extraction

Text and Structure

# Access extracted content
document = result.document

# Full text content
print(document.content)

# Structured content blocks
for block in document.content_blocks:
    print(f"{block.block_type}: {block.content}")

# Chunked content ready for RAG
for chunk in document.chunks:
    print(f"Chunk {chunk.chunk_id}: {len(chunk.content)} chars")

Tables and Data

# Extract tables
for table in document.tables:
    print(f"Table with {len(table['data'])} rows")
    headers = table.get('headers', [])
    print(f"Headers: {headers}")
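
For downstream use, it is often convenient to turn a table into one dictionary per row. The sketch below assumes the shape suggested by the snippet above: a dict with an optional `headers` list and a `data` list of rows that does not repeat the header row. `table_to_records` is a hypothetical helper, and the sample table is made up for illustration.

```python
def table_to_records(table: dict) -> list[dict]:
    """Convert {'headers': [...], 'data': [[...], ...]} into a list of row dicts."""
    headers = table.get("headers", [])
    rows = table.get("data", [])
    if not headers:
        # No header row available: fall back to positional column names
        width = max((len(row) for row in rows), default=0)
        headers = [f"col_{i}" for i in range(width)]
    # zip() stops at the shorter sequence, so ragged rows simply omit missing columns
    return [dict(zip(headers, row)) for row in rows]


sample_table = {"headers": ["name", "qty"], "data": [["bolt", 4], ["nut", 10]]}
for record in table_to_records(sample_table):
    print(record)
```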

Metadata

meta = document.metadata
print(f"Title: {meta.title}")
print(f"Author: {meta.author}")
print(f"Pages: {meta.page_count}")
print(f"Words: {meta.word_count}")

🔗 Framework Integration

LangChain Integration

from ragparser import ParserConfig, ChunkingStrategy
from ragparser.integrations.langchain_adapter import RagParserLoader

# Use as a LangChain document loader
loader = RagParserLoader("documents/")
documents = loader.load()

# With custom config
config = ParserConfig(chunking_strategy=ChunkingStrategy.SEMANTIC)
loader = RagParserLoader("documents/", config=config)
documents = loader.load()

LlamaIndex Integration

from ragparser.integrations.llamaindex_adapter import RagParserReader

# Use as a LlamaIndex reader
reader = RagParserReader()
documents = reader.load_data("document.pdf")

⚙️ Configuration Options

Parser Configuration

config = ParserConfig(
    # Chunking settings
    chunking_strategy=ChunkingStrategy.SEMANTIC,
    chunk_size=1000,
    chunk_overlap=200,
    
    # Content extraction
    extract_tables=True,
    extract_images=True,
    extract_metadata=True,
    extract_links=True,
    
    # Text processing
    clean_text=True,
    preserve_formatting=False,
    merge_paragraphs=True,
    
    # OCR settings
    enable_ocr=True,
    ocr_language="eng",
    ocr_confidence_threshold=0.7,
    
    # Performance
    max_file_size=100 * 1024 * 1024,  # 100 MiB
    timeout_seconds=300,
)

Runtime Configuration Updates

parser = RagParser()

# Update specific settings
parser.update_config(
    chunk_size=1500,
    extract_tables=False
)

# Add custom settings
parser.update_config(
    custom_ocr_model="my_model",
    special_processing=True
)

🚀 Performance Features

  • Async Processing: Non-blocking document processing
  • Concurrent Parsing: Process multiple documents simultaneously
  • Memory Efficient: Streaming for large files
  • Caching: Avoid reprocessing identical content
  • Lazy Loading: Only load parsers for formats you use
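
The caching idea above can be sketched as a lookup keyed by a hash of the file contents, so that identical bytes are parsed only once regardless of file name. `ParseCache` and `parse_fn` are hypothetical names for illustration; this is not a description of the library's cache.

```python
import hashlib


class ParseCache:
    """In-memory cache keyed by the SHA-256 digest of the document bytes."""

    def __init__(self) -> None:
        self._store: dict[str, object] = {}
        self.hits = 0

    def get_or_parse(self, data: bytes, parse_fn):
        key = hashlib.sha256(data).hexdigest()
        if key in self._store:
            self.hits += 1  # identical content seen before: skip parsing
            return self._store[key]
        result = parse_fn(data)
        self._store[key] = result
        return result


cache = ParseCache()
first = cache.get_or_parse(b"same bytes", lambda d: {"length": len(d)})
second = cache.get_or_parse(b"same bytes", lambda d: {"length": len(d)})
print(first is second, cache.hits)
```

Hashing the content instead of the path means renamed or re-uploaded copies of the same file still hit the cache.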

📊 Monitoring and Quality

Processing Statistics

result = parser.parse("document.pdf")

stats = result.processing_stats
print(f"Processing time: {stats['processing_time']:.2f}s")
print(f"File size: {stats['file_size']} bytes")
print(f"Chunks created: {stats['chunk_count']}")

Quality Metrics

document = result.document

# Content quality indicators
print(f"Quality score: {document.quality_score}")
print(f"Extraction notes: {document.extraction_notes}")

# Chunk quality
for chunk in document.chunks:
    print(f"Chunk tokens: {chunk.token_count}")
    print(f"Embedding ready: {chunk.embedding_ready}")

🧪 Testing

# Run all tests
pytest

# Run with coverage
pytest --cov=ragparser

# Run only fast tests
pytest -m "not slow"

# Run integration tests
pytest -m integration

🤝 Contributing

Contributions are welcome! Please see our Contributing Guide for details.

Development Setup

# Clone repository
git clone https://github.com/shubham7995/ragparser.git
cd ragparser

# Install in development mode
pip install -e ".[dev]"

# Install pre-commit hooks
pre-commit install

# Run tests
pytest

Adding New Parsers

from ragparser.parsers.base import BaseParser
from ragparser.core.models import ParsedDocument, FileType

class MyCustomParser(BaseParser):
    def __init__(self):
        super().__init__()
        self.supported_formats = [FileType.CUSTOM]
    
    async def parse_async(self, file_path, config):
        # Implement parsing logic
        return ParsedDocument(...)

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

🏷️ Keywords

RAG, document parsing, PDF, DOCX, PPTX, XLSX, chunking, embedding, LangChain, LlamaIndex, async, OCR, metadata extraction


Built with ❤️ for the RAG and LLM community
