🦀 Blazingly fast text parsing library with Rust backend - 10x faster than pure Python with support for PDF, DOCX, CSV and 12+ programming languages

Project description

CrabParser 🦀

High-performance text parsing library written in Rust with Python bindings

CrabParser is a blazingly fast text parsing library that splits documents and code files into semantic chunks. Built with Rust for maximum performance and memory efficiency, it provides Python bindings for easy integration into your projects.

Key Features

  • 🚀 Pure Rust Performance - 10x faster than pure Python implementations
  • 📄 Multi-Format Support - Handles TXT, PDF, DOCX, CSV, and 12+ programming languages
  • 🛡️ Bulletproof Encoding - Decoding never raises; files with unknown or malformed encodings are handled gracefully
  • 💾 Memory Efficient - ChunkedText keeps data in Rust memory with Python access
  • ⚡ Parallel Processing - Leverages Rayon for concurrent operations
  • 🧩 Semantic Chunking - Respects document structure (paragraphs, sentences, code blocks)
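
The semantic-chunking idea can be shown with a short pure-Python sketch (this is an illustration of the behavior, not CrabParser's Rust implementation): split on blank-line paragraph breaks, then greedily pack whole paragraphs into chunks of at most `chunk_size` characters.

```python
def chunk_paragraphs(text: str, chunk_size: int) -> list[str]:
    """Greedy paragraph-respecting chunker: packs whole paragraphs into
    chunks of at most chunk_size characters. A single paragraph longer
    than chunk_size becomes its own oversized chunk."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= chunk_size:
            current = candidate  # paragraph still fits in this chunk
        else:
            if current:
                chunks.append(current)  # close the current chunk
            current = para
    if current:
        chunks.append(current)
    return chunks
```

Because chunk boundaries only ever fall between paragraphs, no paragraph is split in half, which is the property the `respect_paragraphs` option refers to.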

Installation

pip install crabparser

Quick Start

from crabparser import TextParser, ChunkedText

# Create a parser instance
parser = TextParser(
    chunk_size=1000,          # Maximum characters per chunk
    respect_paragraphs=True,  # Keep paragraphs together
    respect_sentences=True    # Split at sentence boundaries
)

# Parse text
text = "Your long document text here..."
chunks = parser.parse(text)
print(f"Split into {len(chunks)} chunks")

# Parse with memory-efficient ChunkedText
chunked = parser.parse_chunked(text)
print(f"First chunk: {chunked[0]}")
print(f"Total size: {chunked.total_size} bytes")

# Parse files directly (auto-detects format)
chunks = parser.parse_file("document.pdf")  # Works with PDF, DOCX, CSV, TXT, and code files

# Save chunks to files
parser.save_chunks(chunks, "output_dir", "document")

Supported Formats

Documents

  • PDF - Extracts and preserves text content
  • DOCX - Full support for Word documents
  • CSV - Intelligent handling of structured data
  • TXT - Universal text files with encoding detection
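
A "never fail" decoding strategy of the kind described above can be sketched in pure Python (CrabParser's actual detection logic lives in Rust and is not shown here): try UTF-8 first, then fall back to Latin-1, which maps every possible byte to some character, so the function cannot raise.

```python
def decode_any(data: bytes) -> str:
    """Decode bytes without ever raising: try UTF-8 first, then fall
    back to Latin-1, which accepts every byte value."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode("latin-1")
```

Real detection libraries weigh more candidate encodings, but the shape is the same: ordered attempts with a fallback that is guaranteed to succeed.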

Programming Languages

Semantic code parsing that respects function and class boundaries:

  • Python, JavaScript, TypeScript, Rust
  • Go, Java, C#, C++
  • Ruby, PHP, Swift, Kotlin
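
For Python source specifically, function- and class-boundary detection can be sketched with the standard-library `ast` module; this is only an illustration of the idea (CrabParser's per-language parsing is implemented in Rust), and it ignores details like decorators:

```python
import ast

def split_at_definitions(source: str) -> list[str]:
    """Split Python source into segments at top-level statement
    boundaries, so each def or class body stays in one piece."""
    tree = ast.parse(source)
    lines = source.splitlines(keepends=True)
    # 0-based start line of each top-level statement
    starts = [node.lineno - 1 for node in tree.body]
    segments = []
    for i, start in enumerate(starts):
        end = starts[i + 1] if i + 1 < len(starts) else len(lines)
        segments.append("".join(lines[start:end]))
    return segments
```

A semantic chunker can then pack these segments into size-limited chunks, exactly as with paragraphs, without ever cutting a function in half.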

API Reference

TextParser

The main parser class for processing text and files.

parser = TextParser(
    chunk_size=1000,          # Maximum size of each chunk
    respect_paragraphs=True,  # Maintain paragraph boundaries
    respect_sentences=True    # Maintain sentence boundaries
)

Methods:

  • parse(text: str) -> List[str] - Parse text into chunks
  • parse_chunked(text: str) -> ChunkedText - Memory-efficient parsing
  • parse_file(path: str) -> List[str] - Parse any supported file
  • parse_file_chunked(path: str) -> ChunkedText - Memory-efficient file parsing
  • save_chunks(chunks, output_dir, base_name) -> int - Save chunks to files

ChunkedText

Memory-efficient container that keeps chunks in Rust memory.

# Access chunks without loading all into Python memory
chunked[0]            # First chunk
chunked[-1]           # Last chunk
len(chunked)          # Number of chunks
chunked.total_size    # Total size in bytes
chunked.source_file   # Source file path (if applicable)

# Iteration
for chunk in chunked:
    process(chunk)

# Get slice of chunks
batch = chunked.get_slice(0, 10)  # Get first 10 chunks
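
The access pattern above can be mimicked by a toy pure-Python container (the class name below is made up for illustration; in the real `ChunkedText` the chunks live in Rust memory rather than a Python list):

```python
class LazyChunks:
    """Toy stand-in for ChunkedText: supports len(), indexing
    (including negative indices), iteration, and slicing."""

    def __init__(self, chunks: list[str]):
        self._chunks = chunks
        # Total size in bytes of all chunks (UTF-8 encoded)
        self.total_size = sum(len(c.encode("utf-8")) for c in chunks)

    def __len__(self) -> int:
        return len(self._chunks)

    def __getitem__(self, index: int) -> str:
        return self._chunks[index]

    def __iter__(self):
        return iter(self._chunks)

    def get_slice(self, start: int, end: int) -> list[str]:
        return self._chunks[start:end]
```

The point of the real container is that `chunked[i]` materializes only one chunk as a Python string at a time, instead of converting everything up front.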

Advanced Examples

Processing Large PDFs

from crabparser import TextParser

parser = TextParser(chunk_size=2000)

# Parse a large PDF file
chunks = parser.parse_file("research_paper.pdf")

# Process chunks
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}: {chunk[:100]}...")

Code File Parsing

# Parse Python code while respecting function boundaries
parser = TextParser(
    chunk_size=1500,
    respect_paragraphs=True  # Keeps functions/classes together
)

chunks = parser.parse_file("main.py")

Batch Processing with Memory Efficiency

from pathlib import Path
from crabparser import TextParser

parser = TextParser(chunk_size=1000)
output_base = Path("output")

for file_path in Path("documents").glob("*.pdf"):
    # Use memory-efficient parsing
    chunked = parser.parse_file_chunked(str(file_path))

    # Process without loading all chunks into memory
    for i in range(len(chunked)):
        chunk = chunked[i]  # Only loads this chunk
        # Process chunk...

    # Save results
    parser.save_chunks(chunked, str(output_base), file_path.stem)

Performance

CrabParser is designed for speed and efficiency:

  • 10x faster than pure Python text processing
  • Parallel chunk processing using Rayon
  • Zero-copy operations where possible
  • Memory-efficient chunk streaming

License

MIT License - see the LICENSE file for details.


Made with 🦀 and ❤️ by the open-source community

Download files

Source Distribution

crabparser-0.1.1.tar.gz (23.6 kB)

Built Distribution

crabparser-0.1.1-cp39-abi3-manylinux_2_34_x86_64.whl (1.7 MB)

Uploaded for CPython 3.9+ on manylinux (glibc 2.34+), x86-64

File details

Details for the file crabparser-0.1.1.tar.gz.

File metadata

  • Download URL: crabparser-0.1.1.tar.gz
  • Size: 23.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for crabparser-0.1.1.tar.gz:

  • SHA256: 00a2b16f28e54ecb96455ab9d74f77c3ac8e67fa485855b00eae102e0d70b68c
  • MD5: 1cd9e4aeae36860d0a6bc467872908c3
  • BLAKE2b-256: 590b8a06e74d1f8b2759e8f735b2114841ab5f2f2c6222ab27c3cca0bcea8b21

File details

Details for the file crabparser-0.1.1-cp39-abi3-manylinux_2_34_x86_64.whl.

File hashes

Hashes for crabparser-0.1.1-cp39-abi3-manylinux_2_34_x86_64.whl:

  • SHA256: 6b8d70bd8f2096ab3ad97193433bb2eaa9b84d3c8dfe338a6874a0992bf4a5d9
  • MD5: fce92ee9e2ce6b80893a375c9834ed9a
  • BLAKE2b-256: f8033197ef3e8e6cc9fad9785116b47e92d6dfc6625c13fe1aa29621675fc4d3
