🦀 Blazingly fast text parsing library with Rust backend - 10x faster than pure Python with support for PDF, DOCX, CSV and 12+ programming languages

Project description

CrabParser 🦀

High-performance text parsing library written in Rust with Python bindings

CrabParser is a blazingly fast text parsing library that splits documents and code files into semantic chunks. Built with Rust for maximum performance and memory efficiency, it provides Python bindings for easy integration into your projects.

Key Features

  • 🚀 Pure Rust Performance - 10x faster than pure Python implementations
  • 📄 Multi-Format Support - Handles TXT, PDF, DOCX, CSV, and 12+ programming languages
  • 🛡️ Bulletproof Encoding - Decoding never raises; files with unknown or malformed encodings are handled gracefully
  • 💾 Memory Efficient - ChunkedText keeps data in Rust memory with Python access
  • ⚡ Parallel Processing - Leverages Rayon for concurrent operations
  • 🧩 Semantic Chunking - Respects document structure (paragraphs, sentences, code blocks)
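
The semantic-chunking idea can be shown with a short pure-Python sketch (this is an illustration of the behavior, not CrabParser's Rust implementation): split on blank-line paragraph breaks, then greedily pack whole paragraphs into chunks of at most `chunk_size` characters.

```python
def chunk_paragraphs(text: str, chunk_size: int) -> list[str]:
    """Greedy paragraph-respecting chunker: packs whole paragraphs into
    chunks of at most chunk_size characters. A single paragraph longer
    than chunk_size becomes its own oversized chunk."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= chunk_size:
            current = candidate  # paragraph still fits in this chunk
        else:
            if current:
                chunks.append(current)  # close the current chunk
            current = para
    if current:
        chunks.append(current)
    return chunks
```

Because chunk boundaries only ever fall between paragraphs, no paragraph is split in half, which is the property the `respect_paragraphs` option refers to.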

Installation

pip install crabparser

Quick Start

from crabparser import TextParser, ChunkedText

# Create a parser instance
parser = TextParser(
    chunk_size=1000,          # Maximum characters per chunk
    respect_paragraphs=True,  # Keep paragraphs together
    respect_sentences=True    # Split at sentence boundaries
)

# Parse text
text = "Your long document text here..."
chunks = parser.parse(text)
print(f"Split into {len(chunks)} chunks")

# Parse with memory-efficient ChunkedText
chunked = parser.parse_chunked(text)
print(f"First chunk: {chunked[0]}")
print(f"Total size: {chunked.total_size} bytes")

# Parse files directly (auto-detects format)
chunks = parser.parse_file("document.pdf")  # Works with PDF, DOCX, CSV, TXT, and code files

# Save chunks to files
parser.save_chunks(chunks, "output_dir", "document")

Supported Formats

Documents

  • PDF - Extracts and preserves text content
  • DOCX - Full support for Word documents
  • CSV - Intelligent handling of structured data
  • TXT - Universal text files with encoding detection
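
A "never fail" decoding strategy of the kind described above can be sketched in pure Python (CrabParser's actual detection logic lives in Rust and is not shown here): try UTF-8 first, then fall back to Latin-1, which maps every possible byte to some character, so the function cannot raise.

```python
def decode_any(data: bytes) -> str:
    """Decode bytes without ever raising: try UTF-8 first, then fall
    back to Latin-1, which accepts every byte value."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode("latin-1")
```

Real detection libraries weigh more candidate encodings, but the shape is the same: ordered attempts with a fallback that is guaranteed to succeed.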

Programming Languages

Semantic code parsing that respects function and class boundaries:

  • Python, JavaScript, TypeScript, Rust
  • Go, Java, C#, C++
  • Ruby, PHP, Swift, Kotlin
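
For Python source specifically, function- and class-boundary detection can be sketched with the standard-library `ast` module; this is only an illustration of the idea (CrabParser's per-language parsing is implemented in Rust), and it ignores details like decorators:

```python
import ast

def split_at_definitions(source: str) -> list[str]:
    """Split Python source into segments at top-level statement
    boundaries, so each def or class body stays in one piece."""
    tree = ast.parse(source)
    lines = source.splitlines(keepends=True)
    # 0-based start line of each top-level statement
    starts = [node.lineno - 1 for node in tree.body]
    segments = []
    for i, start in enumerate(starts):
        end = starts[i + 1] if i + 1 < len(starts) else len(lines)
        segments.append("".join(lines[start:end]))
    return segments
```

A semantic chunker can then pack these segments into size-limited chunks, exactly as with paragraphs, without ever cutting a function in half.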

API Reference

TextParser

The main parser class for processing text and files.

parser = TextParser(
    chunk_size=1000,          # Maximum size of each chunk
    respect_paragraphs=True,  # Maintain paragraph boundaries
    respect_sentences=True    # Maintain sentence boundaries
)

Methods:

  • parse(text: str) -> List[str] - Parse text into chunks
  • parse_chunked(text: str) -> ChunkedText - Memory-efficient parsing
  • parse_file(path: str) -> List[str] - Parse any supported file
  • parse_file_chunked(path: str) -> ChunkedText - Memory-efficient file parsing
  • save_chunks(chunks, output_dir, base_name) -> int - Save chunks to files

ChunkedText

Memory-efficient container that keeps chunks in Rust memory.

# Access chunks without loading all into Python memory
chunked[0]            # First chunk
chunked[-1]           # Last chunk
len(chunked)          # Number of chunks
chunked.total_size    # Total size in bytes
chunked.source_file   # Source file path (if applicable)

# Iteration
for chunk in chunked:
    process(chunk)

# Get slice of chunks
batch = chunked.get_slice(0, 10)  # Get first 10 chunks
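
The access pattern above can be mimicked by a toy pure-Python container (the class name below is made up for illustration; in the real `ChunkedText` the chunks live in Rust memory rather than a Python list):

```python
class LazyChunks:
    """Toy stand-in for ChunkedText: supports len(), indexing
    (including negative indices), iteration, and slicing."""

    def __init__(self, chunks: list[str]):
        self._chunks = chunks
        # Total size in bytes of all chunks (UTF-8 encoded)
        self.total_size = sum(len(c.encode("utf-8")) for c in chunks)

    def __len__(self) -> int:
        return len(self._chunks)

    def __getitem__(self, index: int) -> str:
        return self._chunks[index]

    def __iter__(self):
        return iter(self._chunks)

    def get_slice(self, start: int, end: int) -> list[str]:
        return self._chunks[start:end]
```

The point of the real container is that `chunked[i]` materializes only one chunk as a Python string at a time, instead of converting everything up front.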

Advanced Examples

Processing Large PDFs

from crabparser import TextParser

parser = TextParser(chunk_size=2000)

# Parse a large PDF file
chunks = parser.parse_file("research_paper.pdf")

# Process chunks
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}: {chunk[:100]}...")

Code File Parsing

# Parse Python code while respecting function boundaries
parser = TextParser(
    chunk_size=1500,
    respect_paragraphs=True  # Keeps functions/classes together
)

chunks = parser.parse_file("main.py")

Batch Processing with Memory Efficiency

from pathlib import Path
from crabparser import TextParser

parser = TextParser(chunk_size=1000)
output_base = Path("output")

for file_path in Path("documents").glob("*.pdf"):
    # Use memory-efficient parsing
    chunked = parser.parse_file_chunked(str(file_path))

    # Process without loading all chunks into memory
    for i in range(len(chunked)):
        chunk = chunked[i]  # Only loads this chunk
        # Process chunk...

    # Save results
    parser.save_chunks(chunked, str(output_base), file_path.stem)

Performance

CrabParser is designed for speed and efficiency:

  • 10x faster than pure Python text processing
  • Parallel chunk processing using Rayon
  • Zero-copy operations where possible
  • Memory-efficient chunk streaming

License

MIT License - see the LICENSE file for details.


Made with 🦀 and ❤️ by the open-source community

Download files

Source Distribution

crabparser-0.1.1.tar.gz (23.6 kB)

Built Distribution

crabparser-0.1.1-cp39-abi3-manylinux_2_34_x86_64.whl (1.7 MB)

Uploaded for CPython 3.9+ on manylinux (glibc 2.34+), x86-64

File details

Details for the file crabparser-0.1.1.tar.gz.

File metadata

  • Download URL: crabparser-0.1.1.tar.gz
  • Size: 23.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for crabparser-0.1.1.tar.gz:

  • SHA256: 00a2b16f28e54ecb96455ab9d74f77c3ac8e67fa485855b00eae102e0d70b68c
  • MD5: 1cd9e4aeae36860d0a6bc467872908c3
  • BLAKE2b-256: 590b8a06e74d1f8b2759e8f735b2114841ab5f2f2c6222ab27c3cca0bcea8b21

File details

Details for the file crabparser-0.1.1-cp39-abi3-manylinux_2_34_x86_64.whl.

File hashes

Hashes for crabparser-0.1.1-cp39-abi3-manylinux_2_34_x86_64.whl:

  • SHA256: 6b8d70bd8f2096ab3ad97193433bb2eaa9b84d3c8dfe338a6874a0992bf4a5d9
  • MD5: fce92ee9e2ce6b80893a375c9834ed9a
  • BLAKE2b-256: f8033197ef3e8e6cc9fad9785116b47e92d6dfc6625c13fe1aa29621675fc4d3
