A comprehensive document parser for RAG applications with support for PDF, DOCX, PPTX, XLSX, and more

RAG Parser

A comprehensive Python library for parsing documents into RAG-ready format. Supports PDF, DOCX, PPTX, XLSX, HTML, Markdown, and more with intelligent chunking strategies.

🚀 Features

  • Universal Document Parsing: Support for PDF, DOCX, PPTX, XLSX, HTML, MD, CSV, JSON, and images
  • Intelligent Chunking: Multiple strategies (Fixed, Semantic, Adaptive) optimized for RAG
  • Metadata Extraction: Rich metadata including author, creation date, structure info
  • Content Structure Preservation: Maintains headers, tables, images, and formatting context
  • Async Support: Full async/await support for high-performance processing
  • RAG-Optimized Output: Ready-to-embed chunks with proper citations and context
  • Framework Integration: Built-in adapters for LangChain and LlamaIndex
  • Extensible Architecture: Easy to add custom parsers and chunking strategies

📦 Installation

Basic Installation

pip install ragparser

With Specific Format Support

# PDF support
pip install ragparser[pdf]

# Office documents (DOCX, PPTX, XLSX)
pip install ragparser[office]

# HTML parsing
pip install ragparser[html]

# OCR for images
pip install ragparser[ocr]

# All formats
pip install ragparser[all]

Development Installation

git clone https://github.com/shubham7995/ragparser.git
cd ragparser
pip install -e ".[dev]"

🎯 Quick Start

Basic Usage

from ragparser import RagParser

# Initialize parser
parser = RagParser()

# Parse a document
result = parser.parse("document.pdf")

if result.success:
    document = result.document
    print(f"Extracted {len(document.content)} characters")
    print(f"Created {len(document.chunks)} chunks")
    print(f"Found {len(document.tables)} tables")
else:
    print(f"Error: {result.error}")

Advanced Configuration

from ragparser import RagParser, ParserConfig, ChunkingStrategy

# Custom configuration
config = ParserConfig(
    chunking_strategy=ChunkingStrategy.SEMANTIC,
    chunk_size=1000,
    chunk_overlap=200,
    extract_tables=True,
    extract_images=True,
    clean_text=True
)

parser = RagParser(config)
result = parser.parse("complex_document.pdf")

Async Processing

import asyncio
from ragparser import RagParser

async def process_documents():
    parser = RagParser()
    
    # Process single document
    result = await parser.parse_async("document.pdf")
    
    # Process multiple documents concurrently
    files = ["doc1.pdf", "doc2.docx", "doc3.pptx"]
    results = await parser.parse_multiple_async(files)
    
    for result in results:
        if result.success:
            print(f"Processed: {result.document.metadata.file_name}")

asyncio.run(process_documents())
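
When the file list is large, it can help to cap how many documents are in flight at once. The sketch below shows one way to do that with `asyncio.Semaphore`. Here `fake_parse` is a hypothetical stand-in for `parser.parse_async`, and the limit of 2 is an arbitrary value, not a library default.

```python
import asyncio


async def fake_parse(path: str) -> str:
    # Hypothetical stand-in for parser.parse_async(path); swap in the real call.
    await asyncio.sleep(0.01)
    return f"parsed:{path}"


async def parse_bounded(paths, limit=2):
    # Allow at most `limit` parses to run at the same time.
    semaphore = asyncio.Semaphore(limit)

    async def worker(path):
        async with semaphore:
            return await fake_parse(path)

    # gather() returns results in the same order as `paths`.
    return await asyncio.gather(*(worker(p) for p in paths))


results = asyncio.run(parse_bounded(["doc1.pdf", "doc2.docx", "doc3.pptx"]))
```

Keeping the limit small bounds peak memory when individual documents are large.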

Processing from Bytes

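from ragparser import RagParser

parser = RagParser()

# Parse a document from bytes (e.g., from a web upload)
with open("document.pdf", "rb") as f:
    data = f.read()

# The file name is passed alongside the raw bytes
result = parser.parse_from_bytes(data, "document.pdf")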
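
`parse_from_bytes` takes a file name next to the raw data. For uploads where that name cannot be trusted, one option is to look at the leading bytes before calling the parser. The helper below is a minimal sketch and is not part of ragparser: `sniff_format` and its labels are hypothetical names, and only a handful of well-known signatures are covered.

```python
def sniff_format(data: bytes) -> str:
    """Guess a coarse file type from the leading bytes of a payload."""
    signatures = [
        (b"%PDF-", "pdf"),
        (b"PK\x03\x04", "zip-container"),  # DOCX, PPTX and XLSX are ZIP archives
        (b"\x89PNG\r\n\x1a\n", "png"),
        (b"\xff\xd8\xff", "jpeg"),
        (b"GIF87a", "gif"),
        (b"GIF89a", "gif"),
    ]
    for magic, name in signatures:
        if data.startswith(magic):
            return name
    # Text formats (HTML, Markdown, CSV, JSON) have no fixed signature.
    return "unknown"
```

A mismatch between the sniffed type and the file extension is a reasonable cue to reject the upload or log a warning.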

📚 Supported Formats

Format      Extensions              Features
PDF         .pdf                    Text, images, tables, metadata, OCR
Word        .docx                   Text, formatting, tables, images, comments
PowerPoint  .pptx                   Slides, speaker notes, images, tables
Excel       .xlsx                   Sheets, formulas, charts, named ranges
HTML        .html, .htm             Structure, links, images, tables
Markdown    .md, .markdown          Headers, code blocks, tables, links
Text        .txt                    Plain text with encoding detection
CSV         .csv                    Structured data with header detection
JSON        .json                   Structured data parsing
Images      .png, .jpg, .gif, etc.  OCR text extraction
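
Before handing a mixed folder to the parser, it can be useful to separate files whose extensions appear in the table above from those that do not. The following is a small sketch of that pre-filter. `SUPPORTED` and `split_supported` are hypothetical names, and the image entry only lists the three extensions the table spells out.

```python
from pathlib import Path

# Extensions copied from the table above; other image types are not listed there.
SUPPORTED = {
    ".pdf": "PDF", ".docx": "Word", ".pptx": "PowerPoint", ".xlsx": "Excel",
    ".html": "HTML", ".htm": "HTML", ".md": "Markdown", ".markdown": "Markdown",
    ".txt": "Text", ".csv": "CSV", ".json": "JSON",
    ".png": "Images", ".jpg": "Images", ".gif": "Images",
}


def split_supported(paths):
    """Return (supported, skipped) lists, comparing extensions case-insensitively."""
    supported, skipped = [], []
    for p in paths:
        target = supported if Path(p).suffix.lower() in SUPPORTED else skipped
        target.append(p)
    return supported, skipped
```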

🔧 Chunking Strategies

Fixed Chunking

config = ParserConfig(
    chunking_strategy=ChunkingStrategy.FIXED,
    chunk_size=1000,
    chunk_overlap=200
)

  • Splits text into fixed-size chunks
  • Preserves sentence boundaries
  • Configurable overlap for context
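
To make the three points above concrete, here is a minimal sketch of fixed-size chunking with overlap that prefers to cut at a sentence end. It illustrates the general technique and is not ragparser's implementation. `fixed_chunks` is a hypothetical name, and sizes are measured in characters as an assumption.

```python
def fixed_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200):
    """Split text into windows of at most chunk_size characters with overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Prefer to cut just after the last sentence end inside the window.
            cut = max(text.rfind(mark, start, end) for mark in (". ", "! ", "? "))
            if cut > start + chunk_overlap:
                end = cut + 1  # keep the punctuation, drop the following space
        chunks.append(text[start:end].strip())
        if end == len(text):
            break
        # Step back by the overlap so neighbouring chunks share context.
        # The overlap is character-based, so it may begin mid-word.
        start = end - chunk_overlap
    return chunks
```

The overlap check on `cut` guarantees that each iteration moves forward, so the loop terminates even when sentences are very short.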

Semantic Chunking

config = ParserConfig(
    chunking_strategy=ChunkingStrategy.SEMANTIC,
    chunk_size=1000
)

  • Groups content by semantic meaning
  • Respects document structure (headers, paragraphs)
  • Maintains topic coherence
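
The structural part of this idea can be sketched without any model: pack paragraphs together, but never let a chunk cross a heading. The example below assumes Markdown-style `#` headings and is only an approximation. The library's semantic strategy may work differently, and `structure_chunks` is a hypothetical name.

```python
import re


def structure_chunks(markdown_text: str, chunk_size: int = 1000):
    """Pack paragraphs into chunks without crossing a heading boundary."""
    chunks, current = [], ""
    # Paragraphs are separated by one or more blank lines.
    for para in re.split(r"\n\s*\n", markdown_text.strip()):
        para = para.strip()
        if not para:
            continue
        starts_section = para.startswith("#")
        too_big = len(current) + len(para) + 2 > chunk_size
        # Close the current chunk at a new heading or when it would overflow.
        if current and (starts_section or too_big):
            chunks.append(current)
            current = ""
        current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```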

Adaptive Chunking

config = ParserConfig(
    chunking_strategy=ChunkingStrategy.ADAPTIVE,
    chunk_size=1000
)

  • Dynamically adjusts chunk size based on content
  • Optimizes for embedding model context windows
  • Balances size and semantic coherence
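
One way to picture this trade-off: derive a character budget from the embedding model's token limit, merge short paragraphs up to that budget, and split only the paragraphs that exceed it. The sketch below is an illustration under stated assumptions, not the library's algorithm. `adaptive_chunks` is a hypothetical name, and 4 characters per token is a rough rule of thumb for English text that varies by tokenizer.

```python
import re


def adaptive_chunks(paragraphs, max_tokens=256, chars_per_token=4):
    """Merge short paragraphs and split long ones to fit a token budget."""
    budget = max_tokens * chars_per_token  # approximate character budget
    chunks, current = [], ""
    for para in paragraphs:
        # Paragraphs over budget are broken into sentences; others stay whole.
        if len(para) <= budget:
            pieces = [para]
        else:
            pieces = re.split(r"(?<=[.!?])\s+", para)
        for piece in pieces:
            if current and len(current) + 1 + len(piece) > budget:
                chunks.append(current)
                current = ""
            current = f"{current} {piece}" if current else piece
    if current:
        chunks.append(current)
    # Note: a single sentence longer than the budget is kept as one oversized chunk.
    return chunks
```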

🔍 Content Extraction

Text and Structure

# Access extracted content
document = result.document

# Full text content
print(document.content)

# Structured content blocks
for block in document.content_blocks:
    print(f"{block.block_type}: {block.content}")

# Chunked content ready for RAG
for chunk in document.chunks:
    print(f"Chunk {chunk.chunk_id}: {len(chunk.content)} chars")

Tables and Data

# Extract tables
for table in document.tables:
    print(f"Table with {len(table['data'])} rows")
    headers = table.get('headers', [])
    print(f"Headers: {headers}")
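
Tables usually need to be flattened to text before they can be embedded. The sketch below turns one table dictionary into one line per row, pairing each cell with its header. It relies on the `data` and `headers` keys used above and assumes each row is a list of cell values, which is not confirmed here. `table_to_text` is a hypothetical helper.

```python
def table_to_text(table: dict) -> str:
    """Render a table dict as one 'header: value' line per row."""
    headers = table.get("headers", [])
    lines = []
    for row in table.get("data", []):
        if headers:
            # zip() stops at the shorter sequence, so ragged rows do not raise.
            cells = [f"{h}: {v}" for h, v in zip(headers, row)]
        else:
            cells = [str(v) for v in row]
        lines.append("; ".join(cells))
    return "\n".join(lines)
```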

Metadata

meta = document.metadata
print(f"Title: {meta.title}")
print(f"Author: {meta.author}")
print(f"Pages: {meta.page_count}")
print(f"Words: {meta.word_count}")

🔗 Framework Integration

LangChain Integration

from ragparser import ParserConfig, ChunkingStrategy
from ragparser.integrations.langchain_adapter import RagParserLoader

# Use as a LangChain document loader
loader = RagParserLoader("documents/")
documents = loader.load()

# With custom config
config = ParserConfig(chunking_strategy=ChunkingStrategy.SEMANTIC)
loader = RagParserLoader("documents/", config=config)
documents = loader.load()

LlamaIndex Integration

from ragparser.integrations.llamaindex_adapter import RagParserReader

# Use as a LlamaIndex reader
reader = RagParserReader()
documents = reader.load_data("document.pdf")

⚙️ Configuration Options

Parser Configuration

from ragparser import ParserConfig, ChunkingStrategy

config = ParserConfig(
    # Chunking settings
    chunking_strategy=ChunkingStrategy.SEMANTIC,
    chunk_size=1000,
    chunk_overlap=200,
    
    # Content extraction
    extract_tables=True,
    extract_images=True,
    extract_metadata=True,
    extract_links=True,
    
    # Text processing
    clean_text=True,
    preserve_formatting=False,
    merge_paragraphs=True,
    
    # OCR settings
    enable_ocr=True,
    ocr_language="eng",
    ocr_confidence_threshold=0.7,
    
    # Performance
    max_file_size=100 * 1024 * 1024,  # 100MB
    timeout_seconds=300,
)
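
For a rough sense of how `chunk_size` and `chunk_overlap` interact: a plain sliding window advances by `chunk_size - chunk_overlap` each step. The sketch below estimates the number of chunks for a given text length under that assumption. The real count will differ when a strategy snaps to sentence or section boundaries, and whether the library measures sizes in characters or tokens is not stated here. `estimate_chunk_count` is a hypothetical helper.

```python
import math


def estimate_chunk_count(text_length: int, chunk_size: int = 1000,
                         chunk_overlap: int = 200) -> int:
    """Estimate how many sliding-window chunks cover text_length units."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    if text_length <= chunk_size:
        return 1 if text_length > 0 else 0
    stride = chunk_size - chunk_overlap  # how far each new window advances
    return math.ceil((text_length - chunk_overlap) / stride)
```

With the values shown above (1000 and 200), a text of 10,000 units works out to 13 chunks, since each window after the first adds 800 new units.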

Runtime Configuration Updates

parser = RagParser()

# Update specific settings
parser.update_config(
    chunk_size=1500,
    extract_tables=False
)

# Add custom settings
parser.update_config(
    custom_ocr_model="my_model",
    special_processing=True
)

🚀 Performance Features

  • Async Processing: Non-blocking document processing
  • Concurrent Parsing: Process multiple documents simultaneously
  • Memory Efficient: Streaming for large files
  • Caching: Avoid reprocessing identical content
  • Lazy Loading: Only load parsers for formats you use
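
A common way to avoid reprocessing identical content is to key results by a hash of the file bytes. The sketch below illustrates that idea only. It is an assumption about how such a cache could look, not a description of ragparser's internals, and `ContentCache` and `parse_fn` are hypothetical names.

```python
import hashlib


class ContentCache:
    """In-memory cache keyed by the SHA-256 digest of the input bytes."""

    def __init__(self):
        self._results = {}

    def get_or_parse(self, data: bytes, parse_fn):
        key = hashlib.sha256(data).hexdigest()
        if key not in self._results:
            # Only call the (possibly expensive) parser on unseen content.
            self._results[key] = parse_fn(data)
        return self._results[key]
```

Because the key depends only on content, the same file uploaded under two different names is parsed once.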

📊 Monitoring and Quality

Processing Statistics

result = parser.parse("document.pdf")

stats = result.processing_stats
print(f"Processing time: {stats['processing_time']:.2f}s")
print(f"File size: {stats['file_size']} bytes")
print(f"Chunks created: {stats['chunk_count']}")

Quality Metrics

document = result.document

# Content quality indicators
print(f"Quality score: {document.quality_score}")
print(f"Extraction notes: {document.extraction_notes}")

# Chunk quality
for chunk in document.chunks:
    print(f"Chunk tokens: {chunk.token_count}")
    print(f"Embedding ready: {chunk.embedding_ready}")
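
These per-chunk fields can drive a simple filter before embedding. The sketch below keeps chunks that are flagged as ready and fall inside a token range. `ChunkStub` is a hypothetical stand-in for the library's chunk objects (only the `token_count` and `embedding_ready` attributes come from the snippet above), and the thresholds are arbitrary.

```python
from dataclasses import dataclass


@dataclass
class ChunkStub:
    # Stand-in for a parsed chunk; the real class may carry more fields.
    content: str
    token_count: int
    embedding_ready: bool


def select_for_embedding(chunks, min_tokens=5, max_tokens=512):
    """Keep chunks that are marked ready and fall inside the token range."""
    return [
        c for c in chunks
        if c.embedding_ready and min_tokens <= c.token_count <= max_tokens
    ]
```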

🧪 Testing

# Run all tests
pytest

# Run with coverage
pytest --cov=ragparser

# Run only fast tests
pytest -m "not slow"

# Run integration tests
pytest -m integration

🤝 Contributing

Contributions are welcome! Please see our Contributing Guide for details.

Development Setup

# Clone repository
git clone https://github.com/shubham7995/ragparser.git
cd ragparser

# Install in development mode
pip install -e ".[dev]"

# Install pre-commit hooks
pre-commit install

# Run tests
pytest

Adding New Parsers

from ragparser.parsers.base import BaseParser
from ragparser.core.models import ParsedDocument, FileType

class MyCustomParser(BaseParser):
    def __init__(self):
        super().__init__()
        self.supported_formats = [FileType.CUSTOM]
    
    async def parse_async(self, file_path, config):
        # Implement parsing logic
        return ParsedDocument(...)

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

🏷️ Keywords

RAG, document parsing, PDF, DOCX, PPTX, XLSX, chunking, embedding, LangChain, LlamaIndex, async, OCR, metadata extraction


Built with ❤️ for the RAG and LLM community
