Konfuzio Software Development Kit

Konfuzio SDK

The Konfuzio Software Development Kit (Konfuzio SDK) provides a Python API to interact with the Konfuzio Server.

Features

The SDK allows you to retrieve visual and text features to build your own document models. Konfuzio Server serves as a UI to define the data structure, manage training and test data, and deploy your models as an API.

| Function | Public Host (Free*) | On-Site (Paid) |
|---|---|---|
| OCR Text | ✔ | ✔ |
| OCR Handwriting | ✔ | ✔ |
| Text Annotation | ✔ | ✔ |
| PDF Annotation | ✔ | ✔ |
| Image Annotation | ✔ | ✔ |
| Table Annotation | ✔ | ✔ |
| Download HOCR | ✔ | ✔ |
| Download Images | ✔ | ✔ |
| Download PDF with OCR | ✔ | ✔ |
| Deploy AI models | ✖ | ✔ |

* Subject to a fair use policy: usage will eventually be throttled to 10 pages per hour.

Installation

As a developer, register for free on our public host: https://app.konfuzio.com

Then you can use pip to install Konfuzio SDK and run init:

pip install konfuzio_sdk

konfuzio_sdk init

The init command creates a Token to connect to the Konfuzio Server. It stores the variables KONFUZIO_USER, KONFUZIO_TOKEN and KONFUZIO_HOST in a .env file in your working directory.
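
Before running any of the examples, it can help to check that the .env file really contains all three variables. The following is a minimal sketch using only the standard library. It assumes the file uses plain KEY=VALUE lines, which is an assumption about the file format; `read_env_file` and `missing_keys` are hypothetical helpers and not part of the SDK, and the values written to the temporary file are placeholders.

```python
import os
import tempfile

# The three variable names mentioned above.
REQUIRED_KEYS = ("KONFUZIO_USER", "KONFUZIO_TOKEN", "KONFUZIO_HOST")


def read_env_file(path):
    """Parse simple KEY=VALUE lines (hypothetical helper, not part of the SDK)."""
    values = {}
    with open(path, encoding="utf-8") as handle:
        for raw_line in handle:
            line = raw_line.strip()
            # Skip blank lines, comments and anything without an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(values):
    """Return the required variable names that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]


# Demo on a throwaway file with placeholder values; the token is left out on purpose.
with tempfile.TemporaryDirectory() as folder:
    env_path = os.path.join(folder, ".env")
    with open(env_path, "w", encoding="utf-8") as handle:
        handle.write("KONFUZIO_USER=me@example.com\n")
        handle.write("KONFUZIO_HOST=https://app.konfuzio.com\n")
    demo_values = read_env_file(env_path)

demo_missing = missing_keys(demo_values)
print(demo_missing)  # ['KONFUZIO_TOKEN']
```

To check a real setup, point `read_env_file` at the .env file in your working directory instead of the temporary one.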

Find the full installation guide here or set up PyCharm as described here.

Basics

from konfuzio_sdk.data import Project, Document

# Initialize the project:
my_project = Project(id_='YOUR_PROJECT_ID')

# Get any Document that is available online
doc: Document = my_project.get_document_by_id('DOCUMENT_ID_ONLINE')

# Get the Annotations in a Document
doc.annotations()

# Filter Annotations by Label
label = my_project.get_label_by_name('MY_OWN_LABEL_NAME')
doc.annotations(label=label)

# Or get all Annotations that belong to one Label
label.annotations

# Force a Project update. To save time, Documents are only updated if they have changed.
my_project.update()

Find more explanations in the Examples.

Regex

Pro Tip: Read our technical blog post Automated Regex to find out how we use Regex to detect outliers in our annotated data.

from konfuzio_sdk.regex import suggest_regex_for_string
from konfuzio_sdk.data import Project, Label

my_project = Project(id_='YOUR_PROJECT_ID')
label: Label = my_project.get_label_by_name('MY_OWN_LABEL_NAME')

# Get Regex tokens to capture (nearly) all annotations of this Label
tokens = label.tokens()
assert tokens == [
    "(?P<GesamtBrutto_N_4420363_2111>\\d\\.\\d\\d\\d\\,\\d\\d)",
    "(?P<GesamtBrutto_N_9812334_1498>\\d\\d\\d\\,\\d\\d)"
]

# Get the optimized regexes (there can be several if together they achieve a higher accuracy than a single one)
label_regex = label.regex()
assert label_regex == [
    "[ ]+(?:(?P<GesamtBrutto_N_4420363_2111>\\d\\.\\d\\d\\d\\,\\d\\d)|(?P<GesamtBrutto_N_9812334_1498>\\d\\d\\d\\,\\d\\d))\n"
]

# Suggest a RegEx for a string without optimization
regex = suggest_regex_for_string('Date: 20.05.2022')
assert regex == 'Date:[ ]+\\d\\d\\.\\d\\d\\.\\d\\d\\d\\d'
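
To see what the patterns shown above actually capture, they can be tried out with Python's built-in `re` module, without any connection to the Server. This is only an illustration: the pattern strings are copied from the asserts above, while `sample_text` is an invented example and not taken from a real Document.

```python
import re

# Pattern as returned by suggest_regex_for_string in the example above.
suggested = 'Date:[ ]+\\d\\d\\.\\d\\d\\.\\d\\d\\d\\d'

# It matches the original string and tolerates extra spaces after the colon ...
assert re.fullmatch(suggested, 'Date: 20.05.2022')
assert re.fullmatch(suggested, 'Date:    01.01.1999')
# ... but not a different date layout.
assert re.fullmatch(suggested, 'Date: 2022-05-20') is None

# Optimized Label regex as shown above: one named group per token.
label_regex = (
    "[ ]+(?:(?P<GesamtBrutto_N_4420363_2111>\\d\\.\\d\\d\\d\\,\\d\\d)"
    "|(?P<GesamtBrutto_N_9812334_1498>\\d\\d\\d\\,\\d\\d))\n"
)

# Invented sample text with amounts in German number format.
sample_text = "Gesamt Brutto:   1.234,56\nNetto:  987,65\nSumme: 12.345,67\n"

# For each match, report which named group (token) fired and what it captured.
found = [
    (match.lastgroup, match.group(match.lastgroup))
    for match in re.finditer(label_regex, sample_text)
]
print(found)
```

In this sample the amount 12.345,67 is not captured, because neither token covers a five-digit integer part. Gaps like this are why the tokens capture "(nearly) all" Annotations rather than all of them.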

Tokenizer

Create a Tokenizer based on a Regex and evaluate it on a Document level.

from konfuzio_sdk.data import Project
from konfuzio_sdk.tokenizer.regex import RegexTokenizer

my_project = Project(id_='YOUR_PROJECT_ID')
document = my_project.get_document_by_id(document_id='YOUR_DOCUMENT_ID')

# Define the regular expression
regex = r'[^ \n\t\f]+'

# Build a Tokenizer based on Regex 
tokenizer = RegexTokenizer(regex=regex)
assert tokenizer.regex == regex

# Evaluate the Tokenizer in a Document
evaluation = tokenizer.evaluate(document)

# Ratio of correct Spans found by the Tokenizer in the Document
ratio_of_spans_found = evaluation.is_found_by_tokenizer.sum() / evaluation.is_correct.sum()
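
To make the idea behind this evaluation more concrete, here is a standard-library sketch of what a whitespace regex such as the one above does to a text, and how a "ratio of Spans found" can be computed. It is not the SDK's implementation: `whitespace_spans`, `ratio_found`, the sample text and the annotated offsets are all hypothetical, and a Span counts as found only if its start and end offsets match a token exactly.

```python
import re


def whitespace_spans(text, pattern=r'[^ \n\t\f]+'):
    """Return (start_offset, end_offset) pairs for every regex match in the text."""
    return [(match.start(), match.end()) for match in re.finditer(pattern, text)]


def ratio_found(annotated_spans, token_spans):
    """Fraction of annotated spans whose offsets exactly match a token span."""
    if not annotated_spans:
        return 0.0
    token_set = set(token_spans)
    hits = sum(1 for span in annotated_spans if span in token_set)
    return hits / len(annotated_spans)


# Invented text and hand-made "annotations" given as character offsets.
text = "Date: 20.05.2022\nTotal 1.234,56"
token_spans = whitespace_spans(text)
print(token_spans)  # [(0, 5), (6, 16), (17, 22), (23, 31)]

annotated = [
    (6, 16),   # '20.05.2022' is a whole token, so it is found
    (23, 31),  # '1.234,56' is a whole token, so it is found
    (0, 4),    # 'Date' without the colon is not a whole token, so it is missed
]
ratio = ratio_found(annotated, token_spans)
print(ratio)
```

The missed Span shows the typical weakness of a pure whitespace pattern: punctuation attached to a word ends up inside the token, so Annotations that exclude it cannot be reproduced exactly.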

Add visual features to text

Calculate the bounding box of a Span using the start and end character.

from pprint import pprint

from konfuzio_sdk.data import Project, LabelSet, Label, AnnotationSet, Annotation, Span
import os

OFFLINE_PROJECT = os.path.join("tests", "example_project_data")

my_project = Project(id_=None, project_folder=OFFLINE_PROJECT)  # use offline data and don't connect to Server
document = my_project.get_document_by_id(44823)
label = Label(project=my_project)
label_set = LabelSet(project=my_project, categories=[document.category])
annotation_set = AnnotationSet(document=document, label_set=label_set)

span = Span(start_offset=60, end_offset=65)

annotation = Annotation(
    label=label,
    annotation_set=annotation_set,
    label_set=label_set,
    document=document,
    spans=[span],
)

span_with_bbox_information = span.bbox()

pprint(span_with_bbox_information.__dict__)
# Output:
# {'_line_index': None,
#  '_page_index': 0,
#  'annotation': Annotation (None) None (60, 65),
#  'bottom': 32.849,
#  'end_offset': 65,
#  'id_local': 74,
#  'start_offset': 60,
#  'top': 23.849,
#  'x0': 426.0,
#  'x1': 442.8,
#  'y0': 808.831,
#  'y1': 817.831}
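
The numbers in this output hint at how the fields relate to each other: y0/y1 appear to be measured from the bottom of the page and top/bottom from the top of the page, since y1 + top and y0 + bottom both add up to 841.68. The sketch below follows that reading; it is an interpretation of the printed values, not a documented guarantee. The per-character boxes, the page height of 841.68 and the helpers `union_bbox` and `add_top_bottom` are hypothetical, chosen so that the result lines up with the output above.

```python
import math

# Hypothetical boxes for the five characters of the Span (offsets 60 to 64).
char_boxes = [
    {"x0": 426.0, "x1": 429.36, "y0": 808.831, "y1": 817.831},
    {"x0": 429.36, "x1": 432.72, "y0": 808.831, "y1": 817.831},
    {"x0": 432.72, "x1": 436.08, "y0": 808.831, "y1": 817.831},
    {"x0": 436.08, "x1": 439.44, "y0": 808.831, "y1": 817.831},
    {"x0": 439.44, "x1": 442.8, "y0": 808.831, "y1": 817.831},
]


def union_bbox(boxes):
    """Smallest box that encloses all given character boxes."""
    return {
        "x0": min(box["x0"] for box in boxes),
        "x1": max(box["x1"] for box in boxes),
        "y0": min(box["y0"] for box in boxes),
        "y1": max(box["y1"] for box in boxes),
    }


def add_top_bottom(bbox, page_height):
    """Add distances measured from the top edge of the page (assumed convention)."""
    flipped = dict(bbox)
    flipped["top"] = page_height - bbox["y1"]
    flipped["bottom"] = page_height - bbox["y0"]
    return flipped


PAGE_HEIGHT = 841.68  # assumed, inferred from y1 + top in the output above

span_bbox = add_top_bottom(union_bbox(char_boxes), PAGE_HEIGHT)
# Round for display only; the raw values carry tiny floating point errors.
print({key: round(value, 3) for key, value in span_bbox.items()})
```

Taking the union of the character boxes is one straightforward way to obtain a box for a Span that stays on a single line. A Span that wraps across lines would need one box per line instead.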

CLI

The CLI provides a basic command to create a new Project:

konfuzio_sdk create_project YOUR_PROJECT_NAME

You will see "Project {YOUR_PROJECT_NAME} (ID {YOUR_PROJECT_ID}) was created successfully!" printed.

You can also download any Project by its ID:

konfuzio_sdk download_data YOUR_PROJECT_ID

Tutorials

The Konfuzio Server Tutorial:

Watch the video

References

Supported CRUD Operations

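| Data structure | Create/Upload | Edit | Update (sync) | Delete |
|---|---|---|---|---|
| Project | yes | x | yes | x |
| Document | yes | yes | yes | only local |
| Label | yes | x | x | x |
| Annotation | yes | x | x | yes |
| Label set | x | x | x | x |
| Annotation set | x | x | x | x |
| Category | x | x | x | x |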


