Sparrow Parse is a Python package for parsing and extracting information from documents.

Sparrow Parse

Description

This module implements the Sparrow Parse library, which provides helper methods for data pre-processing, parsing, and information extraction.

Install

pip install sparrow-parse

Pre-processing

Unstructured

from sparrow_parse.extractor.unstructured_processor import UnstructuredProcessor

processor = UnstructuredProcessor()

content, table_content = processor.extract_data(
    file_path,   # path of the file to process
    strategy,    # data processing strategy supported by unstructured
    model_name,  # model supported by unstructured
    options,     # extraction options, e.g. table extraction into HTML format
    local,       # True if running from CLI, False if running from FastAPI
    debug,       # True to enable debug mode
)

Example:

file_path - /Users/andrejb/infra/shared/katana-git/sparrow/sparrow-ml/llm/data/invoice_1.pdf

strategy - hi_res

model_name - yolox

options - ['tables', 'unstructured']

local - True

debug - True
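
The example values above can be bundled into a small helper so the positional order is written down in one place. This is only a sketch and not part of Sparrow Parse: `extract_with_defaults` and `EchoProcessor` are hypothetical names introduced here, and the helper assumes nothing beyond an object exposing `extract_data` with the positional order shown above. The stand-in class exists only so the sketch runs without the library installed.

```python
def extract_with_defaults(processor, file_path, strategy="hi_res",
                          model_name="yolox", options=None,
                          local=True, debug=True):
    """Call processor.extract_data with the example values as defaults.

    Hypothetical helper, not part of sparrow-parse. Arguments are passed
    positionally, in the order shown in the snippet above.
    """
    if options is None:
        options = ["tables", "unstructured"]
    return processor.extract_data(
        file_path, strategy, model_name, options, local, debug
    )


class EchoProcessor:
    """Stand-in used only so this sketch runs without the library."""

    def extract_data(self, *args):
        # Echo the received arguments in place of (content, table_content).
        return args, []


received, tables = extract_with_defaults(EchoProcessor(), "invoice_1.pdf")
print(received)
```

With the library installed, an `UnstructuredProcessor()` instance would presumably be passed in place of the stand-in.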

Markdown

from sparrow_parse.extractor.markdown_processor import MarkdownProcessor

processor = MarkdownProcessor()

content, table_content = processor.extract_data(
    file_path,  # path of the file to process
    options,    # extraction options, e.g. table extraction into HTML format
    local,      # True if running from CLI, False if running from FastAPI
    debug,      # True to enable debug mode
)

Example:

file_path - /Users/andrejb/infra/shared/katana-git/sparrow/sparrow-ml/llm/data/invoice_1.pdf

options - ['tables', 'markdown']

local - True

debug - True

Parsing and extraction

from sparrow_parse.extractor.html_extractor import HTMLExtractor

extractor = HTMLExtractor()

answer, targets_unprocessed = extractor.read_data(
    target_columns,   # list of table columns whose data to fetch
    data,             # list of HTML tables
    column_keywords,  # list of valid column names, can be empty; useful for filtering junk content
    group_by_rows,    # JSON result grouping
    update_targets,   # set to True if the page contains multiple tables with the same columns
    local,            # True if running from CLI, False if running from FastAPI
    debug,            # True to enable debug mode
)

Example:

target_columns - ['description', 'qty', 'net_price', 'net_worth', 'vat', 'gross_worth']

data - list of HTML tables

column_keywords - None

group_by_rows - True

update_targets - True

local - True

debug - True
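
To make the inputs more concrete, the sketch below shows one way target columns could be pulled out of an HTML table using only the Python standard library. It is not the `HTMLExtractor` implementation, and its output format is not claimed to match what `read_data` returns. `TableRows`, `normalize`, and `pick_columns` are hypothetical names, the header matching rule (lowercase, spaces to underscores) is an assumption, and reading `group_by_rows` as "one record per row" versus "one list per column" is also an assumption.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <th>/<td> cell, one list per <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row, self._cell = [], None
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row, self._cell = None, None


def normalize(name):
    # Assumed matching rule: lowercase, spaces become underscores.
    return name.strip().lower().replace(" ", "_")


def pick_columns(html_table, target_columns, group_by_rows=True):
    """Return values of the target columns found in the table header."""
    parser = TableRows()
    parser.feed(html_table)
    if not parser.rows:
        return [] if group_by_rows else {}
    header = [normalize(h) for h in parser.rows[0]]
    wanted = [c for c in target_columns if c in header]
    records = [
        {c: row[header.index(c)] for c in wanted if header.index(c) < len(row)}
        for row in parser.rows[1:]
    ]
    if group_by_rows:
        return records
    # Column-wise grouping: one list of values per target column.
    return {c: [r[c] for r in records if c in r] for c in wanted}


html = (
    "<table><tr><th>Description</th><th>Qty</th><th>Net price</th></tr>"
    "<tr><td>Widget</td><td>2</td><td>10.00</td></tr>"
    "<tr><td>Gadget</td><td>1</td><td>5.50</td></tr></table>"
)
rows = pick_columns(html, ["description", "qty", "net_price", "vat"])
columns = pick_columns(html, ["description", "qty"], group_by_rows=False)
```

In this sketch a target column that is absent from the table (here `vat`) is simply skipped.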

PDF optimization

from sparrow_parse.extractor.pdf_optimizer import PDFOptimizer

pdf_optimizer = PDFOptimizer()

num_pages, output_files, temp_dir = pdf_optimizer.split_pdf_to_pages(
    file_path,
    output_directory,
    convert_to_images,
)

Example:

file_path - /Users/andrejb/infra/shared/katana-git/sparrow/sparrow-ml/llm/data/invoice_1.pdf

output_directory - leave as None in normal use; set to a directory path for debug purposes only

convert_to_images - False by default, which splits the document into per-page PDF files
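
Because the call returns a `temp_dir` alongside the page files, callers will probably want to clean it up once the pages have been processed. The sketch below wraps that in a context manager. It assumes `temp_dir` is a filesystem path that the caller is responsible for removing, which the text above does not state; `split_pages` and `FakeOptimizer` are hypothetical names, and the stand-in class exists only so the sketch runs without the library or a real PDF.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def split_pages(optimizer, file_path, output_directory=None,
                convert_to_images=False):
    """Yield (num_pages, output_files), then delete the temporary directory.

    Hypothetical helper. Assumes temp_dir is a directory path that the
    caller is expected to remove.
    """
    num_pages, output_files, temp_dir = optimizer.split_pdf_to_pages(
        file_path, output_directory, convert_to_images
    )
    try:
        yield num_pages, output_files
    finally:
        shutil.rmtree(temp_dir, ignore_errors=True)


class FakeOptimizer:
    """Stand-in so the sketch runs without sparrow-parse or a real PDF."""

    def split_pdf_to_pages(self, file_path, output_directory, convert_to_images):
        temp_dir = tempfile.mkdtemp()
        suffix = ".png" if convert_to_images else ".pdf"
        page = os.path.join(temp_dir, "page_1" + suffix)
        open(page, "wb").close()  # create an empty placeholder page file
        return 1, [page], temp_dir


with split_pages(FakeOptimizer(), "invoice_1.pdf") as (num_pages, output_files):
    # Process the per-page files here, while they still exist.
    existed_inside = all(os.path.exists(p) for p in output_files)

removed_after = not any(os.path.exists(p) for p in output_files)
```

The `try`/`finally` ensures the directory is removed even if page processing raises an exception.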

Library build

Create Python virtual environment

python -m venv .env_sparrow_parse

Install Python libraries

pip install -r requirements.txt

Build package

pip install setuptools wheel
python setup.py sdist bdist_wheel

Upload to PyPI

pip install twine
twine upload dist/*

Commercial usage

Sparrow is available under the GPL 3.0 license, promoting freedom to use, modify, and distribute the software while ensuring any modifications remain open source under the same license. This aligns with our commitment to supporting the open-source community and fostering collaboration.

Additionally, we recognize the diverse needs of organizations, including small to medium-sized enterprises (SMEs). Therefore, Sparrow is also offered for free commercial use to organizations with gross revenue below $5 million USD in the past 12 months, enabling them to leverage Sparrow without the financial burden often associated with high-quality software solutions.

For businesses that exceed this revenue threshold or require usage terms not accommodated by the GPL 3.0 license—such as integrating Sparrow into proprietary software without the obligation to disclose source code modifications—we offer dual licensing options. Dual licensing allows Sparrow to be used under a separate proprietary license, offering greater flexibility for commercial applications and proprietary integrations. This model supports both the project's sustainability and the business's needs for confidentiality and customization.

If your organization is seeking to utilize Sparrow under a proprietary license, or if you are interested in custom workflows, consulting services, or dedicated support and maintenance options, please contact us at abaranovskis@redsamuraiconsulting.com. We're here to provide tailored solutions that meet your unique requirements, ensuring you can maximize the benefits of Sparrow for your projects and workflows.

Author

Katana ML, Andrej Baranovskij

License

Licensed under the GPL 3.0. Copyright 2020-2024 Katana ML, Andrej Baranovskij. Copy of the license.
