A library that prepares raw documents for downstream ML tasks.

Project description

Open-Source Pre-Processing Tools for Unstructured Data

The unstructured_api_tools library includes utilities for converting pipeline notebooks into REST API applications. unstructured_api_tools is intended for use in conjunction with pipeline repos. See pipeline-sec-filings for an example of a repo that uses unstructured_api_tools.

Installation

To install the library, run pip install unstructured_api_tools.

Developer Quick Start

  • Using pyenv to manage virtualenvs is recommended

    • Mac install instructions. See here for more detailed instructions.
      • brew install pyenv-virtualenv
      • pyenv install 3.8.15
    • Linux instructions are available here.
  • Create a virtualenv to work in and activate it, e.g. for one named unstructured_api_tools:

    pyenv virtualenv 3.8.15 unstructured_api_tools
    pyenv activate unstructured_api_tools

  • Run make install-project-local

Usage

Use the CLI command to convert pipeline notebooks to scripts, for example:

unstructured_api_tools convert-pipeline-notebooks \
  --input-directory pipeline-family-sec-filings/pipeline-notebooks \
  --output-directory pipeline-family-sec-filings/prepline_sec_filings/api \
  --pipeline-family sec-filings \
  --semver 0.2.1

If you do not provide the pipeline-family and semver arguments, those values are parsed from preprocessing-pipeline-family.yaml. You can provide the preprocessing-pipeline-family.yaml file explicitly with --config-filename or the PIPELINE_FAMILY_CONFIG environment variable. If neither of those is specified, the fallback is to use the preprocessing-pipeline-family.yaml file in the current working directory.
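The lookup order for the config file can be sketched in Python as follows. This is a minimal illustration, not the library's actual code: resolve_config_path is a hypothetical helper, and it assumes that an explicit --config-filename value takes precedence over the PIPELINE_FAMILY_CONFIG environment variable, which the text above does not state.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

DEFAULT_CONFIG_NAME = "preprocessing-pipeline-family.yaml"


def resolve_config_path(
    config_filename: Optional[str] = None,
    environ: Mapping[str, str] = os.environ,
    cwd: Optional[Path] = None,
) -> Path:
    """Hypothetical helper mirroring the fallback order described above."""
    # 1. An explicit --config-filename value (assumed to win over the env var).
    if config_filename:
        return Path(config_filename)
    # 2. The PIPELINE_FAMILY_CONFIG environment variable.
    env_value = environ.get("PIPELINE_FAMILY_CONFIG")
    if env_value:
        return Path(env_value)
    # 3. Fall back to the default file name in the current working directory.
    return (cwd or Path.cwd()) / DEFAULT_CONFIG_NAME


# With no flag and no environment variable, the default file is used.
print(resolve_config_path(None, {}, Path("pipeline-family-sec-filings")))
```

Passing the environment and working directory as arguments keeps the sketch easy to exercise without touching the real process state.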

The API file undergoes black, flake8 and mypy checks after being generated. If you want flake8 to ignore specific errors, you can specify them through the CLI with --flake8-ignore F401, E402. See the flake8 docs for a full list of error codes.

Conversion from pipeline_api to FastAPI

The command described in Usage generates a FastAPI route for each pipeline_api function defined in the notebook. The signature of the pipeline_api function determines which parameters the generated route accepts.

Currently, only plain-text file uploads are supported, so the first argument must always be text. Support for multiple files and binary files is coming soon!

In addition, any number of string array parameters may be specified. Any kwarg beginning with m_ indicates a multi-value string parameter that is accepted by the FastAPI API.

For example, in a notebook containing:

def pipeline_api(text, m_subject=[], m_name=[]):

text represents the content of a file posted to the FastAPI API, and the m_subject and m_name keyword args represent optional parameters that may be posted to the API as well, both allowing multiple string parameters. A curl request against such an API could look like this:

curl -X 'POST' \
  'https://<hostname>/<pipeline-family-name>/<pipeline-family-version>/<api-name>' \
  -H 'accept: application/json'  \
  -H 'Content-Type: multipart/form-data' \
  -F 'file=@file-to-process.txt' \
  -F 'subject=art' \
  -F 'subject=history' \
  -F 'subject=math' \
  -F 'name=feynman'
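To make the relationship between the function signature and the request concrete, the sketch below maps each m_-prefixed keyword argument to a form field name with the prefix removed, as in the curl request above. It is an illustration only: multi_value_fields and the body of pipeline_api are hypothetical, and the generated FastAPI code may do this differently.

```python
import inspect
from typing import Dict


def pipeline_api(text, m_subject=[], m_name=[]):
    """Toy pipeline body (hypothetical): report what was received."""
    return {
        "characters": len(text),
        "subjects": list(m_subject),
        "names": list(m_name),
    }


def multi_value_fields(func) -> Dict[str, str]:
    """Map each m_-prefixed parameter to its assumed form field name."""
    fields = {}
    for name in inspect.signature(func).parameters:
        if name.startswith("m_"):
            fields[name] = name[len("m_"):]
    return fields


# Simulate the form fields sent by the curl request above.
form = {"subject": ["art", "history", "math"], "name": ["feynman"]}
kwargs = {
    param: form.get(field, [])
    for param, field in multi_value_fields(pipeline_api).items()
}
result = pipeline_api("contents of file-to-process.txt", **kwargs)
print(result)
```

Repeating a form field, as the curl request does with subject, is what produces a multi-value list for the matching m_ parameter.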

In addition, you can specify the response type if pipeline_api can support both "application/json" and "text/csv" as return types.

For example, in a notebook containing a kwarg response_type:

def pipeline_api(text, response_type="text/csv", m_subject=[], m_name=[]):

The consumer of the API may then request either content type with the usual HTTP Accept header, e.g. Accept: application/json or Accept: text/csv.
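A pipeline_api function that supports both return types might look like the sketch below. The function body is hypothetical (it simply numbers the lines of the input), and it assumes the value of the Accept header reaches the function as response_type.

```python
import csv
import io
import json


def pipeline_api(text, response_type="text/csv", m_subject=[], m_name=[]):
    """Hypothetical pipeline: number each line of the input text."""
    rows = [
        {"line_number": number, "text": line}
        for number, line in enumerate(text.splitlines(), start=1)
    ]
    if response_type == "text/csv":
        # Serialize the rows as CSV with a header row.
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=["line_number", "text"])
        writer.writeheader()
        writer.writerows(rows)
        return buffer.getvalue()
    # Otherwise return the rows as a JSON string.
    return json.dumps(rows)


print(pipeline_api("first line\nsecond line", response_type="application/json"))
print(pipeline_api("first line\nsecond line", response_type="text/csv"))
```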

Security Policy

See our security policy for information on how to report security vulnerabilities.

Learn more

Company Website: Unstructured.io product and company info
