Konfuzio Software Development Kit

Konfuzio SDK

The Konfuzio Software Development Kit (Konfuzio SDK) provides a Python API to interact with the Konfuzio Server.

Features

The SDK allows you to retrieve visual and text features to build your own document models. Konfuzio Server serves as a UI to define the data structure, manage training and test data, and deploy your models as an API.

Function                 Public Host (Free*)    On-Site (Paid)
OCR Text                 yes                    yes
OCR Handwriting          yes                    yes
Text Annotation          yes                    yes
PDF Annotation           yes                    yes
Image Annotation         yes                    yes
Table Annotation         yes                    yes
Download HOCR            yes                    yes
Download Images          yes                    yes
Download PDF with OCR    yes                    yes
Deploy AI models         no                     yes

* Under fair use policy: We will impose 10 pages/hour throttling eventually.

Installation

As a developer, register for free on our public host: https://app.konfuzio.com

Then you can use pip to install Konfuzio SDK and run init:

pip install konfuzio_sdk

konfuzio_sdk init

The init command creates a Token to connect to the Konfuzio Server. It stores the variables KONFUZIO_USER, KONFUZIO_TOKEN and KONFUZIO_HOST in a .env file in your working directory.
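
To check that the three variables were written, a small standard-library sketch like the following can parse the file. The helper names `parse_env` and `missing_keys` are hypothetical and not part of the SDK, and the parser assumes plain `KEY=VALUE` lines; it does not cover every `.env` syntax variant.

```python
from pathlib import Path

# The three variable names come from the text above; everything else is illustrative.
EXPECTED_KEYS = ("KONFUZIO_USER", "KONFUZIO_TOKEN", "KONFUZIO_HOST")


def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(text: str) -> list:
    """Return the expected variables that are absent or empty."""
    values = parse_env(text)
    return [key for key in EXPECTED_KEYS if not values.get(key)]


if __name__ == "__main__":
    env_file = Path(".env")  # assumed location: the current working directory
    if env_file.exists():
        print("Missing variables:", missing_keys(env_file.read_text()))
    else:
        print("No .env file found; run `konfuzio_sdk init` first.")
```

An empty list means all three variables are present and non-empty.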

Find the full installation guide here, or set up PyCharm as described here.

Basics

from konfuzio_sdk.data import Project, Document

# Initialize the project:
my_project = Project(id_='YOUR_PROJECT_ID')

# Get any Document online
doc: Document = my_project.get_document_by_id('DOCUMENT_ID_ONLINE')

# Get the Annotations in a Document
doc.annotations()

# Filter Annotations by Label
label = my_project.get_label_by_name('MY_OWN_LABEL_NAME')
doc.annotations(label=label)

# Or get all Annotations that belong to one Label
label.annotations

# Force a Project update. To save time, Documents are only updated if they have changed.
my_project.update()

Find more explanations in the Examples.

Regex

Pro Tip: Read our technical blog post Automated Regex to find out how we use Regex to detect outliers in our annotated data.

from konfuzio_sdk.regex import suggest_regex_for_string
from konfuzio_sdk.data import Project, Label

my_project = Project(id_='YOUR_PROJECT_ID')
label: Label = my_project.get_label_by_name('MY_OWN_LABEL_NAME')

# Get Regex tokens to capture (nearly) all annotations of this Label
tokens = label.tokens()
assert tokens == [
    "(?P<GesamtBrutto_N_4420363_2111>\\d\\.\\d\\d\\d\\,\\d\\d)",
    "(?P<GesamtBrutto_N_9812334_1498>\\d\\d\\d\\,\\d\\d)"
]

# Get the optimized regexes (there can be several if a combination yields higher accuracy than a single one)
label_regex = label.regex()
assert label_regex == [
    "[ ]+(?:(?P<GesamtBrutto_N_4420363_2111>\\d\\.\\d\\d\\d\\,\\d\\d)|(?P<GesamtBrutto_N_9812334_1498>\\d\\d\\d\\,\\d\\d))\n"
]

# Suggest a RegEx for a string without optimization
regex = suggest_regex_for_string('Date: 20.05.2022')
assert regex == 'Date:[ ]+\\d\\d\\.\\d\\d\\.\\d\\d\\d\\d'
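
The generated expressions are ordinary regular expressions, so they can be tried out with Python's built-in `re` module. The sketch below copies the patterns from the asserts above (written as raw strings) and runs them on a made-up sample text; the sample text is an assumption for illustration and is not taken from a real Document.

```python
import re

# Patterns copied from the asserts above, written as raw strings.
DATE_REGEX = r'Date:[ ]+\d\d\.\d\d\.\d\d\d\d'
LABEL_REGEX = (
    r'[ ]+(?:(?P<GesamtBrutto_N_4420363_2111>\d\.\d\d\d\,\d\d)'
    r'|(?P<GesamtBrutto_N_9812334_1498>\d\d\d\,\d\d))\n'
)

# Made-up sample text, for illustration only.
text = 'Date:   01.02.2023\nGesamt Brutto   1.234,56\nNetto 987,65\n'

# The suggested regex tolerates any number of spaces after "Date:".
date_match = re.search(DATE_REGEX, text)
print(date_match.group(0))

# Each match reports which named token captured the value.
found = [
    (match.lastgroup, match.group(match.lastgroup))
    for match in re.finditer(LABEL_REGEX, text)
]
for group_name, value in found:
    print(group_name, value)
```

The named groups make it possible to trace a match back to the token that produced it.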

Tokenizer

Create a Tokenizer based on a Regex and evaluate it on a Document level.

from konfuzio_sdk.data import Project
from konfuzio_sdk.tokenizer.regex import RegexTokenizer

my_project = Project(id_='YOUR_PROJECT_ID')
document = my_project.get_document_by_id(document_id='YOUR_DOCUMENT_ID')

# Define the Regex expression
regex = r'[^ \n\t\f]+'

# Build a Tokenizer based on Regex 
tokenizer = RegexTokenizer(regex=regex)
assert tokenizer.regex == regex

# Evaluate the Tokenizer in a Document
evaluation = tokenizer.evaluate(document)

# Ratio of correct Spans found by the Tokenizer in the Document
ratio_of_spans_found = evaluation.is_found_by_tokenizer.sum() / evaluation.is_correct.sum()
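
To see what this regex does without a Project, the following standard-library sketch applies the same pattern to a made-up string and computes a comparable ratio. It is only an approximation of the idea: `tokenize` and `ratio_found` are hypothetical helpers, the annotated spans are invented, and the SDK's own evaluation may count matches differently.

```python
import re

REGEX = r'[^ \n\t\f]+'  # same whitespace-splitting pattern as above


def tokenize(text):
    """Return (start_offset, end_offset) for every regex match."""
    return [match.span() for match in re.finditer(REGEX, text)]


def ratio_found(annotated_spans, token_spans):
    """Share of annotated spans that exactly match a tokenizer span."""
    tokens = set(token_spans)
    found = sum(1 for span in annotated_spans if span in tokens)
    return found / len(annotated_spans)


text = 'Total:  1.234,56\nDate 20.05.2022'
token_spans = tokenize(text)

# Hypothetical annotations: the amount, the date, and "Date 20.05.2022" as one span.
annotated = [(8, 16), (22, 32), (17, 32)]
ratio = ratio_found(annotated, token_spans)

print(token_spans)
print(ratio)
```

The third annotation contains a space, so a whitespace tokenizer cannot produce it as a single span, which lowers the ratio.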

Add visual features to text

Calculate the bounding box of a Span using the start and end character.

from pprint import pprint

from konfuzio_sdk.data import Project, LabelSet, Label, AnnotationSet, Annotation, Span
import os

OFFLINE_PROJECT = os.path.join("tests", "example_project_data")

my_project = Project(id_=None, project_folder=OFFLINE_PROJECT)  # use offline data and don't connect to Server
document = my_project.get_document_by_id(44823)
label = Label(project=my_project)
label_set = LabelSet(project=my_project, categories=[document.category])
annotation_set = AnnotationSet(document=document, label_set=label_set)

span = Span(start_offset=60, end_offset=65)

annotation = Annotation(
    label=label,
    annotation_set=annotation_set,
    label_set=label_set,
    document=document,
    spans=[span],
)

span_with_bbox_information = span.bbox()

pprint(span_with_bbox_information.__dict__)
# Output:
# {'_line_index': None,
#  '_page_index': 0,
#  'annotation': Annotation (None) None (60, 65),
#  'bottom': 32.849,
#  'end_offset': 65,
#  'id_local': 74,
#  'start_offset': 60,
#  'top': 23.849,
#  'x0': 426.0,
#  'x1': 442.8,
#  'y0': 808.831,
#  'y1': 817.831}
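
One way to picture the calculation is as the union of the per-character boxes between the start and end offsets. The sketch below is not the SDK's implementation: the `Box` class, the helper functions and the per-character coordinates are hypothetical. The page height of 841.68 is an assumption inferred from the output above, where top + y1 and bottom + y0 both equal 841.68.

```python
from dataclasses import dataclass


@dataclass
class Box:
    x0: float
    x1: float
    top: float     # distance from the top edge of the page
    bottom: float  # distance from the top edge of the page


def merge_boxes(boxes):
    """Smallest box that encloses all character boxes."""
    return Box(
        x0=min(b.x0 for b in boxes),
        x1=max(b.x1 for b in boxes),
        top=min(b.top for b in boxes),
        bottom=max(b.bottom for b in boxes),
    )


def to_bottom_origin(box, page_height):
    """Convert top/bottom (measured from the top) to y0/y1 (measured from the bottom)."""
    return round(page_height - box.bottom, 3), round(page_height - box.top, 3)


# Hypothetical boxes for the five characters at offsets 60 to 64.
char_boxes = [
    Box(426.0, 429.36, 23.849, 32.849),
    Box(429.36, 432.72, 23.849, 32.849),
    Box(432.72, 436.08, 23.849, 32.849),
    Box(436.08, 439.44, 23.849, 32.849),
    Box(439.44, 442.8, 23.849, 32.849),
]
PAGE_HEIGHT = 841.68  # assumption, see the lead-in

span_box = merge_boxes(char_boxes)
y0, y1 = to_bottom_origin(span_box, PAGE_HEIGHT)
print(span_box, y0, y1)
```

Under these assumptions the merged box reproduces the x0, x1, top, bottom, y0 and y1 values shown in the output.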

CLI

The CLI provides a basic command to create a new Project:

konfuzio_sdk create_project YOUR_PROJECT_NAME

You will see "Project {YOUR_PROJECT_NAME} (ID {YOUR_PROJECT_ID}) was created successfully!" printed.

You can also download any Project by its ID:

konfuzio_sdk download_data YOUR_PROJECT_ID
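
When the download is part of a larger script, the same commands can be started from Python. This is a minimal sketch using only the standard library: `build_command` and `run_cli` are hypothetical helper names, and the exit-code behaviour of the CLI is not described here, so the return value is passed on unchanged.

```python
import shutil
import subprocess


def build_command(subcommand, argument):
    """Build the argument list for one of the two CLI calls shown above."""
    return ["konfuzio_sdk", subcommand, str(argument)]


def run_cli(subcommand, argument):
    """Run the command if the CLI is installed; return its exit code, else None."""
    if shutil.which("konfuzio_sdk") is None:
        return None
    completed = subprocess.run(build_command(subcommand, argument), check=False)
    return completed.returncode


# Only builds the argument list; nothing is executed here.
print(build_command("download_data", "YOUR_PROJECT_ID"))
```

Passing the arguments as a list avoids shell quoting issues with Project names that contain spaces.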

Tutorials

The Konfuzio Server Tutorial:

Watch the video

References

Supported CRUD Operations

Data structure    Create/Upload    Edit    Update (sync)    Delete
Project           yes              x       yes              x
Document          yes              yes     yes              only local
Label             yes              x       x                x
Annotation        yes              x       x                yes
Label set         x                x       x                x
Annotation set    x                x       x                x
Category          x                x       x                x
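
For scripts that need to check up front whether an operation is available, the table above can be transcribed into a lookup. This is only an illustrative sketch: the dictionary and the `is_supported` helper are hypothetical and not part of the SDK, and the values are copied from the table as they stand.

```python
OPERATIONS = ("create", "edit", "update", "delete")

# Values transcribed from the table above: "yes", "x" (not supported) or "only local".
SUPPORTED = {
    "Project": ("yes", "x", "yes", "x"),
    "Document": ("yes", "yes", "yes", "only local"),
    "Label": ("yes", "x", "x", "x"),
    "Annotation": ("yes", "x", "x", "yes"),
    "Label set": ("x", "x", "x", "x"),
    "Annotation set": ("x", "x", "x", "x"),
    "Category": ("x", "x", "x", "x"),
}


def support_level(structure, operation):
    """Return the table entry for a data structure and an operation."""
    return SUPPORTED[structure][OPERATIONS.index(operation)]


def is_supported(structure, operation):
    """True for any entry other than "x", including "only local"."""
    return support_level(structure, operation) != "x"


print(support_level("Document", "delete"))
print(is_supported("Label", "edit"))
```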
