
Sparrow Parse

Python 3.12+ · License: GPL v3

A powerful Python library for parsing and extracting structured information from documents using Vision Language Models (VLMs). Part of the Sparrow ecosystem for intelligent document processing.

✨ Features

  • 🔍 Document Data Extraction: Extract structured data from invoices, forms, tables, and complex documents
  • 🤖 Multiple Backend Support: MLX (Apple Silicon), Ollama, Docker, Hugging Face Cloud GPU, and local GPU inference
  • 📄 Multi-format Support: Images (PNG, JPG, JPEG) and multi-page PDFs
  • 🎯 Schema Validation: JSON schema-based extraction with automatic validation
  • 📊 Table Processing: Specialized table detection and extraction capabilities
  • 🖼️ Image Annotation: Bounding box annotations for extracted data
  • 💬 Text Instructions: Support for instruction-based text processing
  • ⚡ Optimized Processing: Image cropping, resizing, and preprocessing capabilities

🚀 Quick Start

Installation

To run with MLX on Apple Silicon (macOS):

pip install sparrow-parse[mlx]

To run with Ollama on Linux/Windows:

pip install sparrow-parse

Additional Requirements:

  • For PDF processing: brew install poppler (macOS) or apt-get install poppler-utils (Linux)
  • For MLX backend: Apple Silicon Mac required
  • For Hugging Face: Valid HF token with GPU access

Basic Usage

from sparrow_parse.vllm.inference_factory import InferenceFactory
from sparrow_parse.extractors.vllm_extractor import VLLMExtractor

# Initialize extractor
extractor = VLLMExtractor()

# Configure backend (MLX example)
config = {
    "method": "mlx",
    "model_name": "mlx-community/Mistral-Small-3.1-24B-Instruct-2503-8bit"
}

# Create inference instance
factory = InferenceFactory(config)
model_inference_instance = factory.get_inference_instance()

# Prepare input data
input_data = [{
    "file_path": "path/to/your/document.png",
    "text_input": "retrieve [{\"field_name\": \"str\", \"amount\": 0}]. return response in JSON format"
}]

# Run inference
results, num_pages = extractor.run_inference(
    model_inference_instance,
    input_data,
    debug=True
)

print(f"Extracted data: {results[0]}")
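run_inference returns one response string per page. Assuming the model followed the JSON instruction in the prompt, each entry can be parsed with the standard library; a sketch using a hand-written stand-in for results[0]:

```python
import json

# Stand-in for results[0] -- the shape the prompt above asks the model for.
raw_response = '[{"field_name": "Subtotal", "amount": 1200}]'

records = json.loads(raw_response)
for record in records:
    print(record["field_name"], record["amount"])
```

In practice, wrap the json.loads call in a try/except, since a model response is not guaranteed to be valid JSON.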

📖 Detailed Usage

Backend Configuration

MLX Backend (Apple Silicon)

config = {
    "method": "mlx",
    "model_name": "mlx-community/Qwen2.5-VL-72B-Instruct-4bit"
}

Ollama Backend

config = {
    "method": "ollama",
    "model_name": "mistral-small3.2:24b-instruct-2506-q8_0"
}

Hugging Face Backend

import os
config = {
    "method": "huggingface",
    "hf_space": "your-username/your-space",
    "hf_token": os.getenv('HF_TOKEN')
}
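Because the token comes from the environment, it helps to fail fast when it is missing rather than hit a confusing error deep inside inference. A small guard sketch (the helper name is illustrative, not part of the library):

```python
import os


def build_hf_config(space: str) -> dict:
    """Assemble the Hugging Face backend config, failing fast if HF_TOKEN is unset."""
    token = os.getenv("HF_TOKEN")
    if not token:
        raise RuntimeError("Set the HF_TOKEN environment variable before using the Hugging Face backend.")
    return {"method": "huggingface", "hf_space": space, "hf_token": token}
```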

Local GPU Backend

config = {
    "method": "local_gpu",
    "device": "cuda",
    "model_path": "path/to/model.pth"
}

Input Data Formats

Document Processing

input_data = [{
    "file_path": "invoice.pdf",
    "text_input": "extract invoice data: {\"invoice_number\": \"str\", \"total\": 0, \"date\": \"str\"}"
}]

Text-Only Processing

input_data = [{
    "file_path": None,
    "text_input": "Summarize the key points about renewable energy."
}]

Advanced Options

Table Extraction Only

results, num_pages = extractor.run_inference(
    model_inference_instance,
    input_data,
    tables_only=True  # Extract only tables from document
)

Image Cropping

results, num_pages = extractor.run_inference(
    model_inference_instance,
    input_data,
    crop_size=60  # Crop 60 pixels from all borders
)

Bounding Box Annotations

results, num_pages = extractor.run_inference(
    model_inference_instance,
    input_data,
    apply_annotation=True  # Include bounding box coordinates
)

Generic Data Extraction

results, num_pages = extractor.run_inference(
    model_inference_instance,
    input_data,
    generic_query=True  # Extract all available data
)

🛠️ Utility Functions

PDF Processing

from sparrow_parse.helpers.pdf_optimizer import PDFOptimizer

pdf_optimizer = PDFOptimizer()
num_pages, output_files, temp_dir = pdf_optimizer.split_pdf_to_pages(
    file_path="document.pdf",
    debug_dir="./debug",
    convert_to_images=True
)
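split_pdf_to_pages hands back a temp directory that the caller owns and should remove when done. A runnable sketch of that consume-and-clean-up pattern, with the returned tuple simulated by stand-in files so the snippet is self-contained:

```python
import os
import shutil
import tempfile

# Stand-in for the (num_pages, output_files, temp_dir) tuple returned by
# split_pdf_to_pages -- simulated here so the cleanup pattern is runnable.
temp_dir = tempfile.mkdtemp()
output_files = [os.path.join(temp_dir, f"page_{i}.png") for i in range(1, 4)]
for path in output_files:
    open(path, "wb").close()
num_pages = len(output_files)

try:
    for page_number, path in enumerate(output_files, start=1):
        # Feed each page image to the extractor here.
        assert os.path.exists(path)
finally:
    shutil.rmtree(temp_dir)  # the caller is responsible for removing temp_dir
```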

Image Optimization

from sparrow_parse.helpers.image_optimizer import ImageOptimizer

image_optimizer = ImageOptimizer()
cropped_path = image_optimizer.crop_image_borders(
    file_path="image.jpg",
    temp_dir="./temp",
    debug_dir="./debug",
    crop_size=50
)

Table Detection

from sparrow_parse.processors.table_structure_processor import TableDetector

detector = TableDetector()
cropped_tables = detector.detect_tables(
    file_path="document.png",
    local=True,
    debug=True
)

🎯 Use Cases & Examples

Invoice Processing

import json

invoice_schema = {
    "invoice_number": "str",
    "date": "str", 
    "vendor_name": "str",
    "total_amount": 0,
    "line_items": [{
        "description": "str",
        "quantity": 0,
        "price": 0.0
    }]
}

input_data = [{
    "file_path": "invoice.pdf",
    "text_input": f"extract invoice data: {json.dumps(invoice_schema)}"
}]

Financial Tables

import json

table_schema = [{
    "instrument_name": "str",
    "valuation": 0,
    "currency": "str or null"
}]

input_data = [{
    "file_path": "financial_report.png", 
    "text_input": f"retrieve {json.dumps(table_schema)}. return response in JSON format"
}]

Form Processing

form_schema = {
    "applicant_name": "str",
    "application_date": "str",
    "fields": [{
        "field_name": "str",
        "field_value": "str or null"
    }]
}
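The form schema plugs into input_data the same way as the invoice and table examples, serialized into the prompt with json.dumps; a sketch (the file name is illustrative):

```python
import json

form_schema = {  # the schema defined above
    "applicant_name": "str",
    "application_date": "str",
    "fields": [{"field_name": "str", "field_value": "str or null"}],
}

input_data = [{
    "file_path": "application_form.pdf",  # illustrative file name
    "text_input": f"extract form data: {json.dumps(form_schema)}",
}]
```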

⚙️ Configuration Options

Parameter         Type  Default  Description
tables_only       bool  False    Extract only tables from documents
generic_query     bool  False    Extract all available data without a schema
crop_size         int   None     Pixels to crop from image borders
apply_annotation  bool  False    Include bounding box coordinates
ocr_callback      str   None     Callback for OCR
debug_dir         str   None     Directory to save debug images
debug             bool  False    Enable debug logging
mode              str   None     Set to "static" for mock responses
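These parameters are independent keyword arguments and can be combined in one call; a sketch, assuming the extractor, model instance, and input_data from the Quick Start:

```python
# Crop borders first, then extract only tables, with verbose logging.
options = {
    "tables_only": True,
    "crop_size": 60,
    "debug": True,
}

# results, num_pages = extractor.run_inference(
#     model_inference_instance, input_data, **options)
```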

🔧 Troubleshooting

Common Issues

Import Errors:

# For MLX backend on non-Apple Silicon (mlx-vlm only installs on Apple Silicon)
pip install sparrow-parse --no-deps
pip install -r <(grep -v mlx-vlm requirements.txt)

# For missing poppler
brew install poppler  # macOS
sudo apt-get install poppler-utils  # Ubuntu/Debian

Memory Issues:

  • Use smaller models or reduce image resolution
  • Enable image cropping to reduce processing load
  • Process single pages instead of entire PDFs

Model Loading Errors:

  • Verify model name and availability
  • Check HF token permissions for private models
  • Ensure sufficient disk space for model downloads

Performance Tips

  • Image Size: Resize large images before processing
  • Batch Processing: Process multiple pages together when possible
  • Model Selection: Choose appropriate model size for your hardware
  • Caching: Models are cached after first load

📚 API Reference

VLLMExtractor Class

class VLLMExtractor:
    def run_inference(
        self,
        model_inference_instance,
        input_data: List[Dict],
        tables_only: bool = False,
        generic_query: bool = False, 
        crop_size: Optional[int] = None,
        apply_annotation: bool = False,
        ocr_callback: Optional[str] = None, 
        debug_dir: Optional[str] = None,
        debug: bool = False,
        mode: Optional[str] = None
    ) -> Tuple[List[str], int]

InferenceFactory Class

class InferenceFactory:
    def __init__(self, config: Dict)
    def get_inference_instance(self) -> ModelInference
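The factory resolves config["method"] to a backend implementation. A minimal, runnable illustration of that pattern (the class names here are hypothetical stand-ins, not the library's internals):

```python
class MLXInference:
    """Hypothetical stand-in for the MLX backend."""
    def __init__(self, model_name: str):
        self.model_name = model_name


class OllamaInference:
    """Hypothetical stand-in for the Ollama backend."""
    def __init__(self, model_name: str):
        self.model_name = model_name


_BACKENDS = {"mlx": MLXInference, "ollama": OllamaInference}


def get_inference_instance(config: dict):
    """Map config['method'] to a backend class, as InferenceFactory does conceptually."""
    try:
        backend_cls = _BACKENDS[config["method"]]
    except KeyError as exc:
        raise ValueError(f"unsupported method: {config['method']}") from exc
    return backend_cls(config["model_name"])
```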

🏗️ Development

Building from Source

# Clone repository
git clone https://github.com/katanaml/sparrow.git
cd sparrow/sparrow-data/parse

# Create virtual environment
python -m venv .env_sparrow_parse
source .env_sparrow_parse/bin/activate  # Linux/Mac
# or
.env_sparrow_parse\Scripts\activate  # Windows

# Install dependencies
pip install -r requirements.txt

# Build package
pip install setuptools wheel
python setup.py sdist bdist_wheel

# Install locally
pip install -e .

Running Tests

python -m pytest tests/

📄 Supported File Formats

Format  Extension    Multi-page  Notes
PNG     .png         No          Recommended for tables/forms
JPEG    .jpg, .jpeg  No          Good for photos/scanned docs
PDF     .pdf         Yes         Automatically split into pages
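Routing inputs by extension, as the table implies, can be sketched like this (the library performs this dispatch internally; the helper name here is illustrative):

```python
from pathlib import Path

SINGLE_PAGE_FORMATS = {".png", ".jpg", ".jpeg"}


def needs_pdf_split(file_path: str) -> bool:
    """True when the input is a PDF that must first be split into page images."""
    suffix = Path(file_path).suffix.lower()
    if suffix == ".pdf":
        return True
    if suffix in SINGLE_PAGE_FORMATS:
        return False
    raise ValueError(f"unsupported file format: {suffix}")
```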

🤝 Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

📞 Support

📜 License

Licensed under the GPL 3.0. Copyright 2020-2025 Katana ML, Andrej Baranovskij.

Commercial Licensing: Free for organizations with revenue under $5M USD annually. Contact us for commercial licensing options.

👥 Authors


Star us on GitHub if you find Sparrow Parse useful!
