🦀 Blazingly fast text parsing library with Rust backend - 10x faster than pure Python with support for PDF, DOCX, CSV and 12+ programming languages
CrabParser 🦀
High-performance text parsing library written in Rust with Python bindings
CrabParser is a blazingly fast text parsing library that splits documents and code files into semantic chunks. Built with Rust for maximum performance and memory efficiency, it provides Python bindings for easy integration into your projects.
Key Features
- 🚀 Pure Rust Performance - 10x faster than pure Python implementations
- 📄 Multi-Format Support - Handles TXT, PDF, DOCX, CSV, and 12+ programming languages
- 🛡️ Bulletproof Encoding - Recovers text from unknown or malformed encodings instead of raising errors (see the sketch after this list)
- 💾 Memory Efficient - ChunkedText keeps data in Rust memory with Python access
- ⚡ Parallel Processing - Leverages Rayon for concurrent operations
- 🧩 Semantic Chunking - Respects document structure (paragraphs, sentences, code blocks)
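To illustrate the encoding claim: parse_file should read a file whose encoding is not UTF-8 without raising, recovering what text it can. A minimal sketch, assuming encoding detection happens inside parse_file (the file name and contents here are illustrative):

from pathlib import Path
from crabparser import TextParser

# Write a Latin-1 file; a naive UTF-8 reader would raise UnicodeDecodeError
Path("legacy.txt").write_bytes("café, résumé, naïve".encode("latin-1"))

parser = TextParser(chunk_size=1000)
chunks = parser.parse_file("legacy.txt")  # decoded without an explicit encoding
print(chunks[0])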
Installation
pip install crabparser
Quick Start
from crabparser import TextParser, ChunkedText
# Create a parser instance
parser = TextParser(
    chunk_size=1000,          # Maximum characters per chunk
    respect_paragraphs=True,  # Keep paragraphs together
    respect_sentences=True    # Split at sentence boundaries
)
# Parse text
text = "Your long document text here..."
chunks = parser.parse(text)
print(f"Split into {len(chunks)} chunks")
# Parse with memory-efficient ChunkedText
chunked = parser.parse_chunked(text)
print(f"First chunk: {chunked[0]}")
print(f"Total size: {chunked.total_size} bytes")
# Parse files directly (auto-detects format)
chunks = parser.parse_file("document.pdf") # Works with PDF, DOCX, CSV, TXT, and code files
# Save chunks to files
parser.save_chunks(chunks, "output_dir", "document")
Supported Formats
Documents
- PDF - Extracts and preserves text content
- DOCX - Full support for Word documents
- CSV - Intelligent handling of structured data
- TXT - Universal text files with encoding detection
Programming Languages
Semantic code parsing respects function and class boundaries (see the sketch after the list):
- Python, JavaScript, TypeScript, Rust
- Go, Java, C#, C++
- Ruby, PHP, Swift, Kotlin
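As a quick check, parsing a source file with a deliberately small chunk size should produce splits that land between top-level definitions rather than inside them. A minimal sketch, assuming a main.py exists in the working directory:

from crabparser import TextParser

# Small chunk size to force several splits; boundaries should still fall
# between functions and classes rather than through them
parser = TextParser(chunk_size=500)
chunks = parser.parse_file("main.py")

for i, chunk in enumerate(chunks):
    print(f"chunk {i} starts with: {chunk[:60]!r}")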
API Reference
TextParser
The main parser class for processing text and files.
parser = TextParser(
    chunk_size=1000,          # Maximum size of each chunk
    respect_paragraphs=True,  # Maintain paragraph boundaries
    respect_sentences=True    # Maintain sentence boundaries
)
Methods:
- parse(text: str) -> List[str] - Parse text into chunks
- parse_chunked(text: str) -> ChunkedText - Memory-efficient parsing
- parse_file(path: str) -> List[str] - Parse any supported file
- parse_file_chunked(path: str) -> ChunkedText - Memory-efficient file parsing
- save_chunks(chunks, output_dir, base_name) -> int - Save chunks to files (see the example below)
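The int returned by save_chunks is presumably the count of chunk files written; a minimal sketch under that assumption (the directory and base name are illustrative):

from crabparser import TextParser

parser = TextParser(chunk_size=1000)
chunks = parser.parse("Your long document text here...")

# Assumed: the return value is the number of chunk files written
written = parser.save_chunks(chunks, "output_dir", "document")
print(f"Wrote {written} chunk files")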
ChunkedText
Memory-efficient container that keeps chunks in Rust memory.
# Access chunks without loading all into Python memory
chunked[0] # First chunk
chunked[-1] # Last chunk
len(chunked) # Number of chunks
chunked.total_size # Total size in bytes
chunked.source_file # Source file path (if applicable)
# Iteration
for chunk in chunked:
    process(chunk)
# Get slice of chunks
batch = chunked.get_slice(0, 10) # Get first 10 chunks
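get_slice pairs naturally with batch processing, pulling only a window of chunks out of Rust memory at a time. A sketch, assuming get_slice(start, end) treats end as exclusive (consistent with the example above) and clamps it to the chunk count:

batch_size = 10
for start in range(0, len(chunked), batch_size):
    # Materialize at most batch_size chunks in Python at a time
    batch = chunked.get_slice(start, start + batch_size)
    for chunk in batch:
        process(chunk)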
Advanced Examples
Processing Large PDFs
from crabparser import TextParser
parser = TextParser(chunk_size=2000)
# Parse a large PDF file
chunks = parser.parse_file("research_paper.pdf")
# Process chunks
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}: {chunk[:100]}...")
Code File Parsing
# Parse Python code while respecting function boundaries
parser = TextParser(
    chunk_size=1500,
    respect_paragraphs=True  # Keeps functions/classes together
)
chunks = parser.parse_file("main.py")
Batch Processing with Memory Efficiency
from pathlib import Path
from crabparser import TextParser
parser = TextParser(chunk_size=1000)
output_base = Path("output")
for file_path in Path("documents").glob("*.pdf"):
    # Use memory-efficient parsing
    chunked = parser.parse_file_chunked(str(file_path))

    # Process without loading all chunks into memory
    for i in range(len(chunked)):
        chunk = chunked[i]  # Only loads this chunk
        # Process chunk...

    # Save results
    parser.save_chunks(chunked, str(output_base), file_path.stem)
Performance
CrabParser is designed for speed and efficiency:
- 10x faster than pure Python text processing (see the timing sketch after this list)
- Parallel chunk processing using Rayon
- Zero-copy operations where possible
- Memory-efficient chunk streaming
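Throughput varies with document shape and parser settings, so it is worth measuring on your own data. A minimal, illustrative timing harness using only the API shown above (the synthetic text stands in for a real document):

import time
from crabparser import TextParser

text = "The quick brown fox jumps over the lazy dog. " * 50_000
parser = TextParser(chunk_size=1000)

start = time.perf_counter()
chunks = parser.parse(text)
elapsed = time.perf_counter() - start

print(f"Split {len(text):,} characters into {len(chunks)} chunks in {elapsed:.3f}s")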
License
MIT License - see the LICENSE file for details.
Made with 🦀 and ❤️ by the open-source community
File details
Details for the file crabparser-0.1.1.tar.gz.
File metadata
- Download URL: crabparser-0.1.1.tar.gz
- Upload date:
- Size: 23.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 00a2b16f28e54ecb96455ab9d74f77c3ac8e67fa485855b00eae102e0d70b68c |
| MD5 | 1cd9e4aeae36860d0a6bc467872908c3 |
| BLAKE2b-256 | 590b8a06e74d1f8b2759e8f735b2114841ab5f2f2c6222ab27c3cca0bcea8b21 |
File details
Details for the file crabparser-0.1.1-cp39-abi3-manylinux_2_34_x86_64.whl.
File metadata
- Download URL: crabparser-0.1.1-cp39-abi3-manylinux_2_34_x86_64.whl
- Upload date:
- Size: 1.7 MB
- Tags: CPython 3.9+, manylinux: glibc 2.34+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6b8d70bd8f2096ab3ad97193433bb2eaa9b84d3c8dfe338a6874a0992bf4a5d9 |
| MD5 | fce92ee9e2ce6b80893a375c9834ed9a |
| BLAKE2b-256 | f8033197ef3e8e6cc9fad9785116b47e92d6dfc6625c13fe1aa29621675fc4d3 |