
Sparrow Parse is a Python package (part of Sparrow) for parsing and extracting information from documents.


Sparrow Parse

Description

This module implements the Sparrow Parse library, which provides helper methods for data pre-processing, parsing, and information extraction. The library relies on vision LLM functionality and Table Transformers, and is part of Sparrow. See the main README.

Install

pip install sparrow-parse

Parsing and extraction

Sparrow Parse VL (vision-language model) extractor with Hugging Face GPU infra

# run locally: python -m sparrow_parse.extractors.vllm_extractor

import os  # needed for os.getenv below

from sparrow_parse.vllm.inference_factory import InferenceFactory
from sparrow_parse.extractors.vllm_extractor import VLLMExtractor

extractor = VLLMExtractor()

# export HF_TOKEN="hf_"
config = {
    "method": "huggingface",  # Could be 'huggingface' or 'local_gpu'
    "hf_space": "katanaml/sparrow-qwen2-vl-7b",
    "hf_token": os.getenv('HF_TOKEN'),
    # Additional fields for local GPU inference
    # "device": "cuda", "model_path": "model.pth"
}

# Use the factory to get the correct instance
factory = InferenceFactory(config)
model_inference_instance = factory.get_inference_instance()

input_data = [
    {
        "file_path": "/data/oracle_10k_2014_q1_small.pdf",
        # Single quotes outside, so the double quotes of the embedded JSON schema stay valid Python
        "text_input": 'retrieve {"table": [{"description": "str", "latest_amount": 0, "previous_amount": 0}]}. return response in JSON format'
    }
]

# Now you can run inference without knowing which implementation is used
results_array, num_pages = extractor.run_inference(model_inference_instance, input_data, generic_query=False,
                                 debug_dir="/data/",
                                 debug=True,
                                 mode=None)

for i, result in enumerate(results_array):
    print(f"Result for page {i + 1}:", result)
print(f"Number of pages: {num_pages}")
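
The text_input prompt above embeds a JSON schema inside a Python string, which is easy to get wrong when quoting by hand. A minimal sketch of one way to build it, assuming the prompt wording from the example; build_query is a hypothetical helper and not part of Sparrow Parse:

```python
import json


def build_query(schema: dict) -> str:
    """Embed a schema in the prompt wording used in the example above.

    Hypothetical helper: json.dumps takes care of the quoting.
    """
    return f"retrieve {json.dumps(schema)}. return response in JSON format"


schema = {"table": [{"description": "str", "latest_amount": 0, "previous_amount": 0}]}
query = build_query(schema)
print(query)
```

The resulting string can then be passed as the "text_input" value in input_data.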

Use mode="static" to simulate an LLM call without executing the LLM backend.

The run_inference method returns the results and the number of pages processed.
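Since the query asks for a response in JSON format, the per-page results will usually need decoding before use. The sketch below assumes each entry of results_array is a JSON string, which this document does not confirm; parse_page_results is a hypothetical helper and the sample data is made up for illustration:

```python
import json


def parse_page_results(results_array):
    """Decode each page result; keep the raw value when it is not valid JSON."""
    parsed = []
    for page_number, raw in enumerate(results_array, start=1):
        try:
            parsed.append({"page": page_number, "data": json.loads(raw), "error": None})
        except (TypeError, json.JSONDecodeError) as exc:
            # Keep the original value so it can be inspected later
            parsed.append({"page": page_number, "data": raw, "error": str(exc)})
    return parsed


# Hypothetical sample output, not produced by a real run
sample_results = [
    '{"table": [{"description": "Revenue", "latest_amount": 10, "previous_amount": 8}]}',
    "not json",
]

for entry in parse_page_results(sample_results):
    print(entry["page"], "ok" if entry["error"] is None else "failed")
```

Keeping failed pages in the output, rather than raising, lets the remaining pages of a document still be used.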

Note: The GPU backend katanaml/sparrow-qwen2-vl-7b is private. To run the code above, create your own backend in a Hugging Face Space using the code from Sparrow Parse.
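When pointing the config at your own Space, it can help to fail early if HF_TOKEN is missing instead of passing None to the backend. A minimal sketch, using the config keys from the example above; build_hf_config is a hypothetical helper and "your-account/your-space" is a placeholder name:

```python
import os


def build_hf_config(hf_space: str) -> dict:
    """Build the Hugging Face config dict, failing early if HF_TOKEN is unset.

    Hypothetical helper; the keys mirror the config shown in the example above.
    """
    token = os.getenv("HF_TOKEN")
    if not token:
        raise RuntimeError("HF_TOKEN is not set; export it before running inference")
    return {"method": "huggingface", "hf_space": hf_space, "hf_token": token}


# Usage (placeholder Space name):
# config = build_hf_config("your-account/your-space")
```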

PDF pre-processing

from sparrow_parse.extractor.pdf_optimizer import PDFOptimizer

pdf_optimizer = PDFOptimizer()

num_pages, output_files, temp_dir = pdf_optimizer.split_pdf_to_pages(file_path,
                                                                     output_directory,
                                                                     convert_to_images)

Example:

file_path - path to the input PDF, for example /data/invoice_1.pdf

output_directory - set to a value other than None for debugging purposes only

convert_to_images - defaults to False, which splits the document into PDF files
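The call above returns a temp_dir alongside the page files. The sketch below shows one way to process the pages and then clean up, assuming temp_dir is a temporary directory that the caller is free to delete (not confirmed by this document). process_split_pages is a hypothetical helper; the split function is passed in as an argument so the sketch does not depend on the library being installed:

```python
import shutil


def process_split_pages(split_fn, file_path, handle_page):
    """Split a PDF, apply handle_page to every page file, then remove temp_dir.

    split_fn is expected to behave like pdf_optimizer.split_pdf_to_pages:
    it takes (file_path, output_directory, convert_to_images) and returns
    (num_pages, output_files, temp_dir).
    """
    num_pages, output_files, temp_dir = split_fn(file_path, None, False)
    try:
        return num_pages, [handle_page(path) for path in output_files]
    finally:
        # Assumption: temp_dir holds only the split output and is safe to delete
        if temp_dir:
            shutil.rmtree(temp_dir, ignore_errors=True)


# Usage (requires sparrow-parse to be installed):
# num_pages, sizes = process_split_pages(
#     pdf_optimizer.split_pdf_to_pages, "/data/invoice_1.pdf", os.path.getsize)
```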

Library build

Create Python virtual environment

python -m venv .env_sparrow_parse

Install Python libraries

pip install -r requirements.txt

Build package

pip install setuptools wheel
python setup.py sdist bdist_wheel

Upload to PyPI

pip install twine
twine upload dist/*

Commercial usage

Sparrow is available under the GPL 3.0 license, promoting freedom to use, modify, and distribute the software while ensuring any modifications remain open source under the same license. This aligns with our commitment to supporting the open-source community and fostering collaboration.

Additionally, we recognize the diverse needs of organizations, including small to medium-sized enterprises (SMEs). Therefore, Sparrow is also offered for free commercial use to organizations with gross revenue below $5 million USD in the past 12 months, enabling them to leverage Sparrow without the financial burden often associated with high-quality software solutions.

For businesses that exceed this revenue threshold or require usage terms not accommodated by the GPL 3.0 license—such as integrating Sparrow into proprietary software without the obligation to disclose source code modifications—we offer dual licensing options. Dual licensing allows Sparrow to be used under a separate proprietary license, offering greater flexibility for commercial applications and proprietary integrations. This model supports both the project's sustainability and the business's needs for confidentiality and customization.

If your organization is seeking to utilize Sparrow under a proprietary license, or if you are interested in custom workflows, consulting services, or dedicated support and maintenance options, please contact us at abaranovskis@redsamuraiconsulting.com. We're here to provide tailored solutions that meet your unique requirements, ensuring you can maximize the benefits of Sparrow for your projects and workflows.

Author

Katana ML, Andrej Baranovskij

License

Licensed under the GPL 3.0. Copyright 2020-2024 Katana ML, Andrej Baranovskij. Copy of the license.
