Konfuzio Software Development Kit

Konfuzio SDK

The Konfuzio Software Development Kit (Konfuzio SDK) provides a Python API to interact with the Konfuzio Server.

Features

The SDK allows you to retrieve visual and text features to build your own document models. Konfuzio Server serves as a UI to define the data structure, manage training and test data, and deploy your models as an API.

Function                Public Host (Free*)   On-Site (Paid)
OCR Text                yes                   yes
OCR Handwriting         yes                   yes
Text Annotation         yes                   yes
PDF Annotation          yes                   yes
Image Annotation        yes                   yes
Table Annotation        yes                   yes
Download HOCR           yes                   yes
Download Images         yes                   yes
Download PDF with OCR   yes                   yes
Deploy AI models        no                    yes

* Subject to a fair use policy: we will eventually throttle usage to 10 pages/hour.

Installation

As a developer, register for free on our public host: https://app.konfuzio.com

Then use pip to install the Konfuzio SDK and run init:

pip install konfuzio_sdk

konfuzio_sdk init

Running init creates a Token for connecting to the Konfuzio Server and stores the variables KONFUZIO_USER, KONFUZIO_TOKEN and KONFUZIO_HOST in an .env file in your working directory.
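
To check that the three variables are present, a file like this can be read with the standard library alone. The sketch below is only an illustration: it assumes the common one `KEY=VALUE` per line layout, the helper name `read_env_file` and the sample values are hypothetical, and the SDK itself may load the file differently.

```python
import os
import tempfile


def read_env_file(path):
    """Parse KEY=VALUE lines into a dict, skipping blank lines and # comments."""
    values = {}
    with open(path, encoding="utf-8") as handle:
        for raw_line in handle:
            line = raw_line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip().strip('"').strip("'")
    return values


# Write a throwaway .env with placeholder values (not real credentials).
with tempfile.TemporaryDirectory() as tmp_dir:
    env_path = os.path.join(tmp_dir, ".env")
    with open(env_path, "w", encoding="utf-8") as handle:
        handle.write("KONFUZIO_HOST=https://app.konfuzio.com\n")
        handle.write("KONFUZIO_USER=me@example.com\n")
        handle.write("KONFUZIO_TOKEN=placeholder-token\n")
    settings = read_env_file(env_path)

# Report any of the three expected variables that are absent.
expected = ("KONFUZIO_USER", "KONFUZIO_TOKEN", "KONFUZIO_HOST")
missing = [name for name in expected if name not in settings]
print("missing variables:", missing)
```

Pointing `read_env_file` at the .env file in your working directory gives a quick sanity check after running init.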

Find the full installation guide here or set up PyCharm as described here.

Basics

from konfuzio_sdk.data import Project, Document

# Initialize the project:
my_project = Project(id_='YOUR_PROJECT_ID')

# Get any Document online
doc: Document = my_project.get_document_by_id('DOCUMENT_ID_ONLINE')

# Get the Annotations in a Document
doc.annotations()

# Filter Annotations by Label
label = my_project.get_label_by_name('MY_OWN_LABEL_NAME')
doc.annotations(label=label)

# Or get all Annotations that belong to one Label
label.annotations

# Force a Project update. To save time, Documents are only updated if they have changed.
my_project.update()

Find more explanations in the Examples.

Regex

Pro Tip: Read our technical blog post Automated Regex to find out how we use Regex to detect outliers in our annotated data.

from konfuzio_sdk.regex import suggest_regex_for_string
from konfuzio_sdk.data import Project, Label

my_project = Project(id_='YOUR_PROJECT_ID')
label: Label = my_project.get_label_by_name('MY_OWN_LABEL_NAME')

# Get Regex tokens to capture (nearly) all annotations of this Label
tokens = label.tokens()
assert tokens == [
    "(?P<GesamtBrutto_N_4420363_2111>\\d\\.\\d\\d\\d\\,\\d\\d)",
    "(?P<GesamtBrutto_N_9812334_1498>\\d\\d\\d\\,\\d\\d)"
]

# Get the optimized regexes (there can be several if together they are more accurate than a single one)
label_regex = label.regex()
assert label_regex == [
    "[ ]+(?:(?P<GesamtBrutto_N_4420363_2111>\\d\\.\\d\\d\\d\\,\\d\\d)|(?P<GesamtBrutto_N_9812334_1498>\\d\\d\\d\\,\\d\\d))\n"
]

# Suggest a RegEx for a string without optimization
regex = suggest_regex_for_string('Date: 20.05.2022')
assert regex == 'Date:[ ]+\\d\\d\\.\\d\\d\\.\\d\\d\\d\\d'
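
The patterns shown above are ordinary regular expressions, so they can be tried out with Python's built-in `re` module without a Server connection. The following sketch copies the two patterns from the example verbatim; the sample texts are invented for illustration and do not come from a real Project.

```python
import re

# Pattern copied from the suggest_regex_for_string example.
date_regex = 'Date:[ ]+\\d\\d\\.\\d\\d\\.\\d\\d\\d\\d'
date_match = re.search(date_regex, 'Invoice Date:   01.02.2021 (due later)')

# Pattern copied from the label.regex() example: two named alternatives,
# preceded by spaces and followed by a line break.
label_regex = (
    "[ ]+(?:(?P<GesamtBrutto_N_4420363_2111>\\d\\.\\d\\d\\d\\,\\d\\d)"
    "|(?P<GesamtBrutto_N_9812334_1498>\\d\\d\\d\\,\\d\\d))\n"
)

# Hypothetical document text with one amount per line.
text = "Gesamt-Brutto   1.234,56\nNetto-Bezug   987,65\n"

# For each match, keep only the named group that actually captured something.
found = [
    {name: value for name, value in match.groupdict().items() if value is not None}
    for match in re.finditer(label_regex, text)
]

print(date_match.group(0))
print(found)
```

Each named group corresponds to one token, so the group name tells you which token produced a given match.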

Tokenizer

Create a Tokenizer based on a Regex and evaluate it on a Document level.

from konfuzio_sdk.data import Project
from konfuzio_sdk.tokenizer.regex import RegexTokenizer

my_project = Project(id_='YOUR_PROJECT_ID')
document = my_project.get_document_by_id(document_id='YOUR_DOCUMENT_ID')

# Define the Regex expression
regex = r'[^ \n\t\f]+'

# Build a Tokenizer based on Regex 
tokenizer = RegexTokenizer(regex=regex)
assert tokenizer.regex == regex

# Evaluate the Tokenizer in a Document
evaluation = tokenizer.evaluate(document)

# Ratio of correct Spans found by the Tokenizer in the Document
ratio_of_spans_found = evaluation.is_found_by_tokenizer.sum() / evaluation.is_correct.sum()
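
To make the ratio concrete, here is a minimal stand-alone sketch that uses only the `re` module. It assumes that a Span counts as found when a token has exactly the same start and end offset; the sample text and the annotated offsets are hypothetical, and the SDK's `evaluate` may apply different matching rules.

```python
import re

regex = r'[^ \n\t\f]+'  # same whitespace-splitting pattern as above
text = "Total:  1.234,56 EUR\nDate: 20.05.2022"

# Character offsets (start, end) of every token the regex yields.
token_spans = {(m.start(), m.end()) for m in re.finditer(regex, text)}

# Hypothetical annotated Spans. The last one covers "Total" without the
# colon, so no whitespace-delimited token has exactly these offsets.
annotated_spans = [(8, 16), (27, 37), (0, 5)]

is_found = [span in token_spans for span in annotated_spans]
ratio_found = sum(is_found) / len(annotated_spans)

for (start, end), hit in zip(annotated_spans, is_found):
    print(repr(text[start:end]), "found" if hit else "missed")
print("ratio:", ratio_found)
```

A missed Span such as "Total" shows where a purely whitespace-based Tokenizer is too coarse for the annotated data.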

Add visual features to text

Calculate the bounding box of a Span using the start and end character.

from pprint import pprint

from konfuzio_sdk.data import Project, LabelSet, Label, AnnotationSet, Annotation, Span
import os

OFFLINE_PROJECT = os.path.join("tests", "example_project_data")

my_project = Project(id_=None, project_folder=OFFLINE_PROJECT)  # use offline data and don't connect to Server
document = my_project.get_document_by_id(44823)
label = Label(project=my_project)
label_set = LabelSet(project=my_project, categories=[document.category])
annotation_set = AnnotationSet(document=document, label_set=label_set)

span = Span(start_offset=60, end_offset=65)

annotation = Annotation(
    label=label,
    annotation_set=annotation_set,
    label_set=label_set,
    document=document,
    spans=[span],
)

span_with_bbox_information = span.bbox()

pprint(span_with_bbox_information.__dict__)
# Example output:
# {'_line_index': None,
#  '_page_index': 0,
#  'annotation': Annotation (None) None (60, 65),
#  'bottom': 32.849,
#  'end_offset': 65,
#  'id_local': 74,
#  'start_offset': 60,
#  'top': 23.849,
#  'x0': 426.0,
#  'x1': 442.8,
#  'y0': 808.831,
#  'y1': 817.831}
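
One way to picture this calculation is as the union of the boxes of all characters between the start and end offset. The sketch below rests on stated assumptions rather than on the SDK's actual implementation: the per-character boxes, the helper `span_bbox` and the page height are hypothetical. The page height of 841.68 is inferred from the sample output, where 23.849 + 817.831 and 32.849 + 808.831 both equal 841.68, which suggests that `top` and `bottom` are measured from the top of the page while `y0` and `y1` are measured from the bottom.

```python
# Assumed page height, inferred from the sample output (not confirmed by the SDK).
PAGE_HEIGHT = 841.68

# Hypothetical character boxes for offsets 60..64, all on the same line.
x_edges = [426.0, 429.36, 432.72, 436.08, 439.44, 442.8]
char_boxes = {
    60 + i: {"x0": x_edges[i], "x1": x_edges[i + 1], "y0": 808.831, "y1": 817.831}
    for i in range(5)
}


def span_bbox(boxes_by_offset, start_offset, end_offset, page_height):
    """Union the boxes of all characters in [start_offset, end_offset)."""
    boxes = [boxes_by_offset[i] for i in range(start_offset, end_offset)]
    x0 = min(box["x0"] for box in boxes)
    x1 = max(box["x1"] for box in boxes)
    y0 = min(box["y0"] for box in boxes)
    y1 = max(box["y1"] for box in boxes)
    return {
        "x0": x0,
        "x1": x1,
        "y0": y0,
        "y1": y1,
        # Distances from the top edge of the page (assumed convention).
        "top": round(page_height - y1, 3),
        "bottom": round(page_height - y0, 3),
    }


bbox = span_bbox(char_boxes, start_offset=60, end_offset=65, page_height=PAGE_HEIGHT)
print(bbox)
```

With these invented character boxes, the union reproduces the coordinates of the sample output.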

CLI

The CLI provides a basic command to create a new Project:

konfuzio_sdk create_project YOUR_PROJECT_NAME

You will see "Project {YOUR_PROJECT_NAME} (ID {YOUR_PROJECT_ID}) was created successfully!" printed.

You can also download any Project by its ID:

konfuzio_sdk download_data YOUR_PROJECT_ID

Tutorials

The Konfuzio Server Tutorial:

Watch the video

References

Supported CRUD Operations

Data structure   Create/Upload   Edit   Update (sync)   Delete
Project          yes             x      yes             x
Document         yes             yes    yes             only local
Label            yes             x      x               x
Annotation       yes             x      x               yes
Label set        x               x      x               x
Annotation set   x               x      x               x
Category         x               x      x               x
