Konfuzio Software Development Kit

The Konfuzio Software Development Kit (Konfuzio SDK) provides a Python API to interact with the Konfuzio Server.

Features

The SDK allows you to retrieve visual and text features to build your own document models. Konfuzio Server serves as a UI to define the data structure, manage training and test data, and deploy your models as an API.

| Function              | Public Host (Free*) | On-Site (Paid) |
|-----------------------|---------------------|----------------|
| OCR Text              | yes                 | yes            |
| OCR Handwriting       | yes                 | yes            |
| Text Annotation       | yes                 | yes            |
| PDF Annotation        | yes                 | yes            |
| Image Annotation      | yes                 | yes            |
| Table Annotation      | yes                 | yes            |
| Download HOCR         | yes                 | yes            |
| Download Images       | yes                 | yes            |
| Download PDF with OCR | yes                 | yes            |
| Deploy AI models      | no                  | yes            |

* Subject to a fair use policy: we will eventually throttle usage to 10 pages per hour.

Installation

As a developer, register for free on our public host: https://app.konfuzio.com

Then you can use pip to install Konfuzio SDK and run init:

pip install konfuzio_sdk

konfuzio_sdk init

The init command creates a Token to connect to the Konfuzio Server. It stores the variables KONFUZIO_USER, KONFUZIO_TOKEN and KONFUZIO_HOST in a .env file in your working directory.
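To check that init wrote all three variables, a small standard-library script can read the file. This is a minimal sketch, not part of the SDK: the helper names `read_env_file` and `missing_keys` are hypothetical, and it assumes the .env file uses plain KEY=VALUE lines.

```python
from pathlib import Path

# Variable names taken from the text above.
EXPECTED_KEYS = ("KONFUZIO_USER", "KONFUZIO_TOKEN", "KONFUZIO_HOST")


def read_env_file(path):
    """Parse KEY=VALUE lines (assumed format); skip blank lines and comments."""
    values = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values):
    """Return the expected keys that are absent or empty."""
    return [key for key in EXPECTED_KEYS if not values.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")
    if env_path.exists():
        # Only report key names, never print the token itself.
        print("Missing keys:", missing_keys(read_env_file(env_path)))
    else:
        print("No .env file found; run `konfuzio_sdk init` first.")
```

An empty list of missing keys suggests the file is complete. It does not verify that the Token is accepted by the Server.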

Find the full installation guide here or set up PyCharm as described here.

Basics

from konfuzio_sdk.data import Project, Document

# Initialize the project:
my_project = Project(id_='YOUR_PROJECT_ID')

# Get any online Document by its ID
doc: Document = my_project.get_document_by_id('DOCUMENT_ID_ONLINE')

# Get the Annotations in a Document
doc.annotations()

# Filter Annotations by Label
label = my_project.get_label_by_name('MY_OWN_LABEL_NAME')
doc.annotations(label=label)

# Or get all Annotations that belong to one Label
label.annotations

# Force a Project update. To save time, Documents are only updated if they have changed.
my_project.update()

Find more explanations in the Examples.

Regex

Pro Tip: Read our technical blog post Automated Regex to find out how we use Regex to detect outliers in our annotated data.

from konfuzio_sdk.regex import suggest_regex_for_string
from konfuzio_sdk.data import Project, Label

my_project = Project(id_='YOUR_PROJECT_ID')
label: Label = my_project.get_label_by_name('MY_OWN_LABEL_NAME')

# Get Regex tokens to capture (nearly) all annotations of this Label
tokens = label.tokens()
assert tokens == [
    "(?P<GesamtBrutto_N_4420363_2111>\\d\\.\\d\\d\\d\\,\\d\\d)",
    "(?P<GesamtBrutto_N_9812334_1498>\\d\\d\\d\\,\\d\\d)"
]

# Get the optimized regex (there can be several if together they are more accurate than a single one)
label_regex = label.regex()
assert label_regex == [
    "[ ]+(?:(?P<GesamtBrutto_N_4420363_2111>\\d\\.\\d\\d\\d\\,\\d\\d)|(?P<GesamtBrutto_N_9812334_1498>\\d\\d\\d\\,\\d\\d))\n"
]

# Suggest a RegEx for a string without optimization
regex = suggest_regex_for_string('Date: 20.05.2022')
assert regex == 'Date:[ ]+\\d\\d\\.\\d\\d\\.\\d\\d\\d\\d'
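The regex strings shown above are ordinary Python patterns, so they can be tried with the standard `re` module alone. The following is a minimal sketch: the sample texts and the helper `find_amount` are made up for illustration, and only the pattern strings are taken from the example above.

```python
import re

# Pattern copied from the label.regex() example above.
label_regex = (
    "[ ]+(?:(?P<GesamtBrutto_N_4420363_2111>\\d\\.\\d\\d\\d\\,\\d\\d)"
    "|(?P<GesamtBrutto_N_9812334_1498>\\d\\d\\d\\,\\d\\d))\n"
)

# Pattern copied from the suggest_regex_for_string() example above.
date_regex = 'Date:[ ]+\\d\\d\\.\\d\\d\\.\\d\\d\\d\\d'


def find_amount(text):
    """Return (group name, matched text) of the first hit, or None."""
    match = re.search(label_regex, text)
    if match is None:
        return None
    # Only one branch of the alternation can match, so lastgroup names it.
    return match.lastgroup, match.group(match.lastgroup)


# Hypothetical sample lines, not taken from a real Document.
print(find_amount("Gesamt-Brutto   1.234,56\n"))
print(find_amount("Gesamt-Brutto 987,65\n"))
print(find_amount("no amount here\n"))
print(re.fullmatch(date_regex, 'Date: 20.05.2022') is not None)
```

The named group in each result shows which token produced the match, which helps when tracing a hit back to the annotations the token was derived from.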

Tokenizer

Create a Tokenizer based on a Regex and evaluate it at the Document level.

from konfuzio_sdk.data import Project
from konfuzio_sdk.tokenizer.regex import RegexTokenizer

my_project = Project(id_='YOUR_PROJECT_ID')
document = my_project.get_document_by_id(document_id='YOUR_DOCUMENT_ID')

# Define the regular expression (matches runs of non-whitespace characters)
regex = r'[^ \n\t\f]+'

# Build a Tokenizer based on the Regex
tokenizer = RegexTokenizer(regex=regex)
assert tokenizer.regex == regex

# Evaluate the Tokenizer in a Document
evaluation = tokenizer.evaluate(document)

# Ratio of correct Spans found by the Tokenizer in the Document
ratio_of_spans_found = evaluation.is_found_by_tokenizer.sum() / evaluation.is_correct.sum()
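To see what the regex `[^ \n\t\f]+` picks out, the sketch below applies it with the standard `re` module and returns character offsets. It illustrates the pattern only and is not how `RegexTokenizer` is implemented; the helper `whitespace_spans` and the sample text are hypothetical.

```python
import re


def whitespace_spans(text, regex=r'[^ \n\t\f]+'):
    """Return (start_offset, end_offset, token) for every regex match."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(regex, text)]


# Hypothetical sample text, not taken from a real Document.
sample = "Date: 20.05.2022\nTotal 1.234,56"
for start, end, token in whitespace_spans(sample):
    print(start, end, repr(token))
```

Each tuple holds a start offset, an exclusive end offset, and the matched text, so `sample[start:end]` always equals the token.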

Add visual features to text

Calculate the bounding box of a Span from its start and end character offsets.

from pprint import pprint

from konfuzio_sdk.data import Project, LabelSet, Label, AnnotationSet, Annotation, Span
import os

OFFLINE_PROJECT = os.path.join("tests", "example_project_data")

my_project = Project(id_=None, project_folder=OFFLINE_PROJECT)  # use offline data and don't connect to Server
document = my_project.get_document_by_id(44823)
label = Label(project=my_project)
label_set = LabelSet(project=my_project, categories=[document.category])
annotation_set = AnnotationSet(document=document, label_set=label_set)

span = Span(start_offset=60, end_offset=65)

annotation = Annotation(
    label=label,
    annotation_set=annotation_set,
    label_set=label_set,
    document=document,
    spans=[span],
)

span_with_bbox_information = span.bbox()

pprint(span_with_bbox_information.__dict__)
{'_line_index': None,
 '_page_index': 0,
 'annotation': Annotation (None) None (60, 65),
 'bottom': 32.849,
 'end_offset': 65,
 'id_local': 74,
 'start_offset': 60,
 'top': 23.849,
 'x0': 426.0,
 'x1': 442.8,
 'y0': 808.831,
 'y1': 817.831}
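The printed values can be cross-checked with a few lines of arithmetic. The sketch below copies the numbers from the output above. Its reading of the fields is an inference from those numbers only, not documented SDK behaviour: `y0` and `y1` appear to be measured from the bottom of the page, and `top` and `bottom` from the top.

```python
import math

# Values copied from the printed output above.
bbox = {
    'top': 23.849, 'bottom': 32.849,
    'x0': 426.0, 'x1': 442.8,
    'y0': 808.831, 'y1': 817.831,
}

width = bbox['x1'] - bbox['x0']
height = bbox['y1'] - bbox['y0']

# If top/bottom count from the top edge and y0/y1 from the bottom edge
# (an assumption), both sums should give the same page height.
page_height_a = bbox['top'] + bbox['y1']
page_height_b = bbox['bottom'] + bbox['y0']

print(round(width, 3), round(height, 3))
print(round(page_height_a, 3), round(page_height_b, 3))
print(math.isclose(page_height_a, page_height_b, abs_tol=1e-6))
```

Both sums agree, which is consistent with the two vertical coordinate pairs describing the same box from opposite page edges.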

CLI

The CLI provides a basic command to create a new Project:

konfuzio_sdk create_project YOUR_PROJECT_NAME

You will see "Project {YOUR_PROJECT_NAME} (ID {YOUR_PROJECT_ID}) was created successfully!" printed.

You can also download any Project by its ID:

konfuzio_sdk download_data YOUR_PROJECT_ID

Tutorials

The Konfuzio Server Tutorial:

Watch the video

References

Supported CRUD Operations

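| Data structure | Create/Upload | Edit | Update (sync) | Delete     |
|----------------|---------------|------|---------------|------------|
| Project        | yes           | no   | yes           | no         |
| Document       | yes           | yes  | yes           | only local |
| Label          | yes           | no   | no            | no         |
| Annotation     | yes           | no   | no            | yes        |
| Label set      | no            | no   | no            | no         |
| Annotation set | no            | no   | no            | no         |
| Category       | no            | no   | no            | no         |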


