
Sparrow Parse

Python 3.12+ · License: GPL v3

A powerful Python library for parsing and extracting structured information from documents using Vision Language Models (VLMs). Part of the Sparrow ecosystem for intelligent document processing.

✨ Features

  • 🔍 Document Data Extraction: Extract structured data from invoices, forms, tables, and complex documents
  • 🤖 Multiple Backend Support: MLX (Apple Silicon), Ollama, Docker, Hugging Face Cloud GPU, and local GPU inference
  • 📄 Multi-format Support: Images (PNG, JPG, JPEG) and multi-page PDFs
  • 🎯 Schema Validation: JSON schema-based extraction with automatic validation
  • 📊 Table Processing: Specialized table detection and extraction capabilities
  • 🖼️ Image Annotation: Bounding box annotations for extracted data
  • 💬 Text Instructions: Support for instruction-based text processing
  • ⚡ Optimized Processing: Image cropping, resizing, and preprocessing capabilities

🚀 Quick Start

Installation

To run with MLX on Apple Silicon (macOS):

pip install sparrow-parse[mlx]

To run with Ollama on Linux/Windows:

pip install sparrow-parse

Additional Requirements:

  • For PDF processing: brew install poppler (macOS) or apt-get install poppler-utils (Linux)
  • For MLX backend: Apple Silicon Mac required
  • For Hugging Face: Valid HF token with GPU access

Basic Usage

from sparrow_parse.vllm.inference_factory import InferenceFactory
from sparrow_parse.extractors.vllm_extractor import VLLMExtractor

# Initialize extractor
extractor = VLLMExtractor()

# Configure backend (MLX example)
config = {
    "method": "mlx",
    "model_name": "mlx-community/Mistral-Small-3.1-24B-Instruct-2503-8bit"
}

# Create inference instance
factory = InferenceFactory(config)
model_inference_instance = factory.get_inference_instance()

# Prepare input data
input_data = [{
    "file_path": "path/to/your/document.png",
    "text_input": "retrieve [{\"field_name\": \"str\", \"amount\": 0}]. return response in JSON format"
}]

# Run inference
results, num_pages = extractor.run_inference(
    model_inference_instance,
    input_data,
    debug=True
)

print(f"Extracted data: {results[0]}")
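run_inference returns one response string per page. Assuming the model followed the JSON instruction in the prompt, each entry can be parsed with the standard library; a sketch using a hand-written stand-in for results[0]:

```python
import json

# Stand-in for results[0] -- the shape the prompt above asks the model for.
raw_response = '[{"field_name": "Subtotal", "amount": 1200}]'

records = json.loads(raw_response)
for record in records:
    print(record["field_name"], record["amount"])
```

In practice, wrap the json.loads call in a try/except, since a model response is not guaranteed to be valid JSON.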

📖 Detailed Usage

Backend Configuration

MLX Backend (Apple Silicon)

config = {
    "method": "mlx",
    "model_name": "mlx-community/Qwen2.5-VL-72B-Instruct-4bit"
}

Ollama Backend

config = {
    "method": "ollama",
    "model_name": "mistral-small3.2:24b-instruct-2506-q8_0"
}

Hugging Face Backend

import os
config = {
    "method": "huggingface",
    "hf_space": "your-username/your-space",
    "hf_token": os.getenv('HF_TOKEN')
}
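Because the token comes from the environment, it helps to fail fast when it is missing rather than hit a confusing error deep inside inference. A small guard sketch (the helper name is illustrative, not part of the library):

```python
import os


def build_hf_config(space: str) -> dict:
    """Assemble the Hugging Face backend config, failing fast if HF_TOKEN is unset."""
    token = os.getenv("HF_TOKEN")
    if not token:
        raise RuntimeError("Set the HF_TOKEN environment variable before using the Hugging Face backend.")
    return {"method": "huggingface", "hf_space": space, "hf_token": token}
```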

Local GPU Backend

config = {
    "method": "local_gpu",
    "device": "cuda",
    "model_path": "path/to/model.pth"
}

Input Data Formats

Document Processing

input_data = [{
    "file_path": "invoice.pdf",
    "text_input": "extract invoice data: {\"invoice_number\": \"str\", \"total\": 0, \"date\": \"str\"}"
}]

Text-Only Processing

input_data = [{
    "file_path": None,
    "text_input": "Summarize the key points about renewable energy."
}]

Advanced Options

Table Extraction Only

results, num_pages = extractor.run_inference(
    model_inference_instance,
    input_data,
    tables_only=True  # Extract only tables from document
)

Image Cropping

results, num_pages = extractor.run_inference(
    model_inference_instance,
    input_data,
    crop_size=60  # Crop 60 pixels from all borders
)

Bounding Box Annotations

results, num_pages = extractor.run_inference(
    model_inference_instance,
    input_data,
    apply_annotation=True  # Include bounding box coordinates
)

Generic Data Extraction

results, num_pages = extractor.run_inference(
    model_inference_instance,
    input_data,
    generic_query=True  # Extract all available data
)

🛠️ Utility Functions

PDF Processing

from sparrow_parse.helpers.pdf_optimizer import PDFOptimizer

pdf_optimizer = PDFOptimizer()
num_pages, output_files, temp_dir = pdf_optimizer.split_pdf_to_pages(
    file_path="document.pdf",
    debug_dir="./debug",
    convert_to_images=True
)
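split_pdf_to_pages hands back a temp directory that the caller owns and should remove when done. A runnable sketch of that consume-and-clean-up pattern, with the returned tuple simulated by stand-in files so the snippet is self-contained:

```python
import os
import shutil
import tempfile

# Stand-in for the (num_pages, output_files, temp_dir) tuple returned by
# split_pdf_to_pages -- simulated here so the cleanup pattern is runnable.
temp_dir = tempfile.mkdtemp()
output_files = [os.path.join(temp_dir, f"page_{i}.png") for i in range(1, 4)]
for path in output_files:
    open(path, "wb").close()
num_pages = len(output_files)

try:
    for page_number, path in enumerate(output_files, start=1):
        # Feed each page image to the extractor here.
        assert os.path.exists(path)
finally:
    shutil.rmtree(temp_dir)  # the caller is responsible for removing temp_dir
```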

Image Optimization

from sparrow_parse.helpers.image_optimizer import ImageOptimizer

image_optimizer = ImageOptimizer()
cropped_path = image_optimizer.crop_image_borders(
    file_path="image.jpg",
    temp_dir="./temp",
    debug_dir="./debug",
    crop_size=50
)

Table Detection

from sparrow_parse.processors.table_structure_processor import TableDetector

detector = TableDetector()
cropped_tables = detector.detect_tables(
    file_path="document.png",
    local=True,
    debug=True
)

🎯 Use Cases & Examples

Invoice Processing

import json

invoice_schema = {
    "invoice_number": "str",
    "date": "str", 
    "vendor_name": "str",
    "total_amount": 0,
    "line_items": [{
        "description": "str",
        "quantity": 0,
        "price": 0.0
    }]
}

input_data = [{
    "file_path": "invoice.pdf",
    "text_input": f"extract invoice data: {json.dumps(invoice_schema)}"
}]

Financial Tables

import json

table_schema = [{
    "instrument_name": "str",
    "valuation": 0,
    "currency": "str or null"
}]

input_data = [{
    "file_path": "financial_report.png", 
    "text_input": f"retrieve {json.dumps(table_schema)}. return response in JSON format"
}]

Form Processing

form_schema = {
    "applicant_name": "str",
    "application_date": "str",
    "fields": [{
        "field_name": "str",
        "field_value": "str or null"
    }]
}
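The form schema plugs into input_data the same way as the invoice and table examples, serialized into the prompt with json.dumps; a sketch (the file name is illustrative):

```python
import json

form_schema = {  # the schema defined above
    "applicant_name": "str",
    "application_date": "str",
    "fields": [{"field_name": "str", "field_value": "str or null"}],
}

input_data = [{
    "file_path": "application_form.pdf",  # illustrative file name
    "text_input": f"extract form data: {json.dumps(form_schema)}",
}]
```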

⚙️ Configuration Options

Parameter         Type  Default  Description
tables_only       bool  False    Extract only tables from documents
generic_query     bool  False    Extract all available data without a schema
crop_size         int   None     Pixels to crop from image borders
apply_annotation  bool  False    Include bounding box coordinates
ocr_callback      str   None     Callback for OCR
debug_dir         str   None     Directory to save debug images
debug             bool  False    Enable debug logging
mode              str   None     Set to "static" for mock responses
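These parameters are independent keyword arguments and can be combined in one call; a sketch, assuming the extractor, model instance, and input_data from the Quick Start:

```python
# Crop borders first, then extract only tables, with verbose logging.
options = {
    "tables_only": True,
    "crop_size": 60,
    "debug": True,
}

# results, num_pages = extractor.run_inference(
#     model_inference_instance, input_data, **options)
```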

🔧 Troubleshooting

Common Issues

Import Errors:

# For MLX backend on non-Apple Silicon (mlx-vlm only installs on Apple Silicon)
pip install sparrow-parse --no-deps
pip install -r <(grep -v mlx-vlm requirements.txt)

# For missing poppler
brew install poppler  # macOS
sudo apt-get install poppler-utils  # Ubuntu/Debian

Memory Issues:

  • Use smaller models or reduce image resolution
  • Enable image cropping to reduce processing load
  • Process single pages instead of entire PDFs

Model Loading Errors:

  • Verify model name and availability
  • Check HF token permissions for private models
  • Ensure sufficient disk space for model downloads

Performance Tips

  • Image Size: Resize large images before processing
  • Batch Processing: Process multiple pages together when possible
  • Model Selection: Choose appropriate model size for your hardware
  • Caching: Models are cached after first load

📚 API Reference

VLLMExtractor Class

class VLLMExtractor:
    def run_inference(
        self,
        model_inference_instance,
        input_data: List[Dict],
        tables_only: bool = False,
        generic_query: bool = False, 
        crop_size: Optional[int] = None,
        apply_annotation: bool = False,
        ocr_callback: Optional[str] = None, 
        debug_dir: Optional[str] = None,
        debug: bool = False,
        mode: Optional[str] = None
    ) -> Tuple[List[str], int]

InferenceFactory Class

class InferenceFactory:
    def __init__(self, config: Dict)
    def get_inference_instance(self) -> ModelInference
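The factory resolves config["method"] to a backend implementation. A minimal, runnable illustration of that pattern (the class names here are hypothetical stand-ins, not the library's internals):

```python
class MLXInference:
    """Hypothetical stand-in for the MLX backend."""
    def __init__(self, model_name: str):
        self.model_name = model_name


class OllamaInference:
    """Hypothetical stand-in for the Ollama backend."""
    def __init__(self, model_name: str):
        self.model_name = model_name


_BACKENDS = {"mlx": MLXInference, "ollama": OllamaInference}


def get_inference_instance(config: dict):
    """Map config['method'] to a backend class, as InferenceFactory does conceptually."""
    try:
        backend_cls = _BACKENDS[config["method"]]
    except KeyError as exc:
        raise ValueError(f"unsupported method: {config['method']}") from exc
    return backend_cls(config["model_name"])
```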

🏗️ Development

Building from Source

# Clone repository
git clone https://github.com/katanaml/sparrow.git
cd sparrow/sparrow-data/parse

# Create virtual environment
python -m venv .env_sparrow_parse
source .env_sparrow_parse/bin/activate  # Linux/Mac
# or
.env_sparrow_parse\Scripts\activate  # Windows

# Install dependencies
pip install -r requirements.txt

# Build package
pip install setuptools wheel
python setup.py sdist bdist_wheel

# Install locally
pip install -e .

Running Tests

python -m pytest tests/

📄 Supported File Formats

Format  Extension    Multi-page  Notes
PNG     .png         No          Recommended for tables/forms
JPEG    .jpg, .jpeg  No          Good for photos/scanned docs
PDF     .pdf         Yes         Automatically split into pages
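Routing inputs by extension, as the table implies, can be sketched like this (the library performs this dispatch internally; the helper name here is illustrative):

```python
from pathlib import Path

SINGLE_PAGE_FORMATS = {".png", ".jpg", ".jpeg"}


def needs_pdf_split(file_path: str) -> bool:
    """True when the input is a PDF that must first be split into page images."""
    suffix = Path(file_path).suffix.lower()
    if suffix == ".pdf":
        return True
    if suffix in SINGLE_PAGE_FORMATS:
        return False
    raise ValueError(f"unsupported file format: {suffix}")
```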

🤝 Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

📞 Support

📜 License

Licensed under the GPL 3.0. Copyright 2020-2025 Katana ML, Andrej Baranovskij.

Commercial Licensing: Free for organizations with revenue under $5M USD annually. Contact us for commercial licensing options.

👥 Authors


Star us on GitHub if you find Sparrow Parse useful!
