# Sparrow Parse
A powerful Python library for parsing and extracting structured information from documents using Vision Language Models (VLMs). Part of the Sparrow ecosystem for intelligent document processing.
## ✨ Features
- 🔍 Document Data Extraction: Extract structured data from invoices, forms, tables, and complex documents
- 🤖 Multiple Backend Support: MLX (Apple Silicon), Ollama, Docker, Hugging Face Cloud GPU, and local GPU inference
- 📄 Multi-format Support: Images (PNG, JPG, JPEG) and multi-page PDFs
- 🎯 Schema Validation: JSON schema-based extraction with automatic validation
- 📊 Table Processing: Specialized table detection and extraction capabilities
- 🖼️ Image Annotation: Bounding box annotations for extracted data
- 💬 Text Instructions: Support for instruction-based text processing
- ⚡ Optimized Processing: Image cropping, resizing, and preprocessing capabilities
## 🚀 Quick Start

### Installation

To run with MLX on macOS (Apple Silicon):

```shell
pip install sparrow-parse[mlx]
```

To run with Ollama on Linux/Windows:

```shell
pip install sparrow-parse
```
Additional requirements:

- For PDF processing: `brew install poppler` (macOS) or `apt-get install poppler-utils` (Linux)
- For the MLX backend: an Apple Silicon Mac is required
- For Hugging Face: a valid HF token with GPU access
### Basic Usage
```python
from sparrow_parse.vllm.inference_factory import InferenceFactory
from sparrow_parse.extractors.vllm_extractor import VLLMExtractor

# Initialize extractor
extractor = VLLMExtractor()

# Configure backend (MLX example)
config = {
    "method": "mlx",
    "model_name": "mlx-community/Mistral-Small-3.1-24B-Instruct-2503-8bit"
}

# Create inference instance
factory = InferenceFactory(config)
model_inference_instance = factory.get_inference_instance()

# Prepare input data
input_data = [{
    "file_path": "path/to/your/document.png",
    "text_input": "retrieve [{\"field_name\": \"str\", \"amount\": 0}]. return response in JSON format"
}]

# Run inference
results, num_pages = extractor.run_inference(
    model_inference_instance,
    input_data,
    debug=True
)

print(f"Extracted data: {results[0]}")
```
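Per the API reference below, `run_inference` returns each result as a JSON string, so downstream code typically parses it before use. A minimal sketch, assuming the model returned the schema requested above (the raw string here is an illustrative example, not real model output):

```python
import json

# Example raw result, shaped like the schema requested in text_input above
raw_result = '[{"field_name": "Total Due", "amount": 1250}]'

# Parse the JSON string into Python objects for downstream use
records = json.loads(raw_result)
for record in records:
    print(record["field_name"], record["amount"])
```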
## 📖 Detailed Usage

### Backend Configuration

#### MLX Backend (Apple Silicon)

```python
config = {
    "method": "mlx",
    "model_name": "mlx-community/Qwen2.5-VL-72B-Instruct-4bit"
}
```
#### Ollama Backend

```python
config = {
    "method": "ollama",
    "model_name": "mistral-small3.2:24b-instruct-2506-q8_0"
}
```
#### Hugging Face Backend

```python
import os

config = {
    "method": "huggingface",
    "hf_space": "your-username/your-space",
    "hf_token": os.getenv('HF_TOKEN')
}
```
#### Local GPU Backend

```python
config = {
    "method": "local_gpu",
    "device": "cuda",
    "model_path": "path/to/model.pth"
}
```
### Input Data Formats

#### Document Processing

```python
input_data = [{
    "file_path": "invoice.pdf",
    "text_input": "extract invoice data: {\"invoice_number\": \"str\", \"total\": 0, \"date\": \"str\"}"
}]
```
#### Text-Only Processing

```python
input_data = [{
    "file_path": None,
    "text_input": "Summarize the key points about renewable energy."
}]
```
### Advanced Options

#### Table Extraction Only

```python
results, num_pages = extractor.run_inference(
    model_inference_instance,
    input_data,
    tables_only=True  # Extract only tables from document
)
```
#### Image Cropping

```python
results, num_pages = extractor.run_inference(
    model_inference_instance,
    input_data,
    crop_size=60  # Crop 60 pixels from all borders
)
```
#### Bounding Box Annotations

```python
results, num_pages = extractor.run_inference(
    model_inference_instance,
    input_data,
    apply_annotation=True  # Include bounding box coordinates
)
```
#### Generic Data Extraction

```python
results, num_pages = extractor.run_inference(
    model_inference_instance,
    input_data,
    generic_query=True  # Extract all available data
)
```
## 🛠️ Utility Functions

### PDF Processing

```python
from sparrow_parse.helpers.pdf_optimizer import PDFOptimizer

pdf_optimizer = PDFOptimizer()

num_pages, output_files, temp_dir = pdf_optimizer.split_pdf_to_pages(
    file_path="document.pdf",
    debug_dir="./debug",
    convert_to_images=True
)
```
### Image Optimization

```python
from sparrow_parse.helpers.image_optimizer import ImageOptimizer

image_optimizer = ImageOptimizer()

cropped_path = image_optimizer.crop_image_borders(
    file_path="image.jpg",
    temp_dir="./temp",
    debug_dir="./debug",
    crop_size=50
)
```
### Table Detection

```python
from sparrow_parse.processors.table_structure_processor import TableDetector

detector = TableDetector()

cropped_tables = detector.detect_tables(
    file_path="document.png",
    local=True,
    debug=True
)
```
## 🎯 Use Cases & Examples

### Invoice Processing

```python
import json

invoice_schema = {
    "invoice_number": "str",
    "date": "str",
    "vendor_name": "str",
    "total_amount": 0,
    "line_items": [{
        "description": "str",
        "quantity": 0,
        "price": 0.0
    }]
}

input_data = [{
    "file_path": "invoice.pdf",
    "text_input": f"extract invoice data: {json.dumps(invoice_schema)}"
}]
```
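Because the schema doubles as the expected response shape, a lightweight post-hoc check can confirm a parsed response carries the schema's keys. This is a sketch, not the library's built-in validation, and the response string is an illustrative example:

```python
import json

invoice_schema = {"invoice_number": "str", "date": "str", "vendor_name": "str", "total_amount": 0}

# Example model response, assumed already parsed from run_inference output
response = json.loads('{"invoice_number": "INV-001", "date": "2024-03-01", "vendor_name": "Acme", "total_amount": 99}')

# Any schema key absent from the response signals an incomplete extraction
missing = [key for key in invoice_schema if key not in response]
assert not missing, f"response missing keys: {missing}"
```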
### Financial Tables

```python
import json

table_schema = [{
    "instrument_name": "str",
    "valuation": 0,
    "currency": "str or null"
}]

input_data = [{
    "file_path": "financial_report.png",
    "text_input": f"retrieve {json.dumps(table_schema)}. return response in JSON format"
}]
```
### Form Processing

```python
form_schema = {
    "applicant_name": "str",
    "application_date": "str",
    "fields": [{
        "field_name": "str",
        "field_value": "str or null"
    }]
}
```
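Following the pattern of the previous examples, the form schema would be wired into `input_data` the same way (the file name below is hypothetical):

```python
import json

form_schema = {
    "applicant_name": "str",
    "application_date": "str",
    "fields": [{"field_name": "str", "field_value": "str or null"}]
}

# Hypothetical file name for illustration
input_data = [{
    "file_path": "application_form.png",
    "text_input": f"extract form data: {json.dumps(form_schema)}"
}]
```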
## ⚙️ Configuration Options

| Parameter | Type | Default | Description |
|---|---|---|---|
| `tables_only` | bool | False | Extract only tables from documents |
| `generic_query` | bool | False | Extract all available data without a schema |
| `crop_size` | int | None | Pixels to crop from image borders |
| `apply_annotation` | bool | False | Include bounding box coordinates |
| `ocr_callback` | str | None | Callback for OCR |
| `debug_dir` | str | None | Directory to save debug images |
| `debug` | bool | False | Enable debug logging |
| `mode` | str | None | Set to "static" for mock responses |
## 🔧 Troubleshooting

### Common Issues

Import Errors:

```shell
# For the MLX backend on non-Apple Silicon machines, skip the mlx-vlm dependency.
# Note: pip install has no --exclude flag, so filter the requirements file instead:
pip install sparrow-parse --no-deps
grep -v mlx-vlm requirements.txt | xargs pip install

# For missing poppler
brew install poppler                # macOS
sudo apt-get install poppler-utils  # Ubuntu/Debian
```
Memory Issues:
- Use smaller models or reduce image resolution
- Enable image cropping to reduce processing load
- Process single pages instead of entire PDFs
Model Loading Errors:
- Verify model name and availability
- Check HF token permissions for private models
- Ensure sufficient disk space for model downloads
### Performance Tips
- Image Size: Resize large images before processing
- Batch Processing: Process multiple pages together when possible
- Model Selection: Choose appropriate model size for your hardware
- Caching: Models are cached after first load
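The image-size tip above can be sketched as a pure calculation: cap the longest edge while preserving aspect ratio, then resize with your imaging library of choice. The `max_edge` default is an assumption, and the library's own preprocessing may differ:

```python
def resize_target(width, height, max_edge=1536):
    """Compute a downscaled (width, height) whose longest edge is
    at most max_edge, preserving the original aspect ratio."""
    longest = max(width, height)
    if longest <= max_edge:
        return width, height  # already small enough
    scale = max_edge / longest
    return round(width * scale), round(height * scale)

print(resize_target(3000, 2000))  # e.g. a large scan -> (1536, 1024)
```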
## 📚 API Reference

### VLLMExtractor Class

```python
class VLLMExtractor:
    def run_inference(
        self,
        model_inference_instance,
        input_data: List[Dict],
        tables_only: bool = False,
        generic_query: bool = False,
        crop_size: Optional[int] = None,
        apply_annotation: bool = False,
        ocr_callback: Optional[str] = None,
        debug_dir: Optional[str] = None,
        debug: bool = False,
        mode: Optional[str] = None
    ) -> Tuple[List[str], int]: ...
```
### InferenceFactory Class

```python
class InferenceFactory:
    def __init__(self, config: Dict): ...
    def get_inference_instance(self) -> ModelInference: ...
```
## 🏗️ Development

### Building from Source

```shell
# Clone repository
git clone https://github.com/katanaml/sparrow.git
cd sparrow/sparrow-data/parse

# Create virtual environment
python -m venv .env_sparrow_parse
source .env_sparrow_parse/bin/activate  # Linux/macOS
# or
.env_sparrow_parse\Scripts\activate     # Windows

# Install dependencies
pip install -r requirements.txt

# Build package
pip install setuptools wheel
python setup.py sdist bdist_wheel

# Install locally
pip install -e .
```
### Running Tests

```shell
python -m pytest tests/
```
## 📄 Supported File Formats

| Format | Extension | Multi-page | Notes |
|---|---|---|---|
| PNG | .png | ❌ | Recommended for tables/forms |
| JPEG | .jpg, .jpeg | ❌ | Good for photos/scanned docs |
| PDF | .pdf | ✅ | Automatically split into pages |
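Based on the formats above, a small dispatcher can decide whether a file needs PDF page-splitting (e.g. via `PDFOptimizer`) before inference. This is a hypothetical helper, not part of the library:

```python
from pathlib import Path

MULTI_PAGE_EXTENSIONS = {".pdf"}
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}

def needs_page_split(file_path):
    """Return True for multi-page formats that must be split into
    pages first; False for single-image formats."""
    suffix = Path(file_path).suffix.lower()
    if suffix in MULTI_PAGE_EXTENSIONS:
        return True
    if suffix in IMAGE_EXTENSIONS:
        return False
    raise ValueError(f"Unsupported format: {suffix}")

print(needs_page_split("report.PDF"))  # True
```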
## 🤝 Contributing

We welcome contributions! Please see our Contributing Guidelines for details.
## 📞 Support
## 📜 License

Licensed under GPL 3.0. Copyright 2020-2025 Katana ML, Andrej Baranovskij.

Commercial Licensing: Free for organizations with revenue under $5M USD annually. Contact us for commercial licensing options.
## 👥 Authors

⭐ Star us on GitHub if you find Sparrow Parse useful!