MediaPipe Tasks API wrapper for Python computer vision

Project description

OpenVisionKit

OpenVisionKit is a high-level Python computer vision library built on top of MediaPipe and OpenCV. It provides production-ready detectors and segmentation utilities for face detection, face mesh, hand tracking, pose estimation, object detection, and background segmentation — wrapped in clean, developer-friendly APIs that eliminate boilerplate and let you focus on building.

Whether you are prototyping a gesture-controlled application, building a fitness tracker, adding AR effects, or conducting research, OpenVisionKit gives you the tools to go from camera frame to structured detections in a few lines of code.

Features

Module	Capability
`FaceDetector`	Bounding boxes, 6-point keypoints, confidence filtering, IoU, face cropping
`FaceMeshDetector`	478 landmarks, blendshapes, head pose (yaw/pitch/roll), gaze direction, emotion, AR overlays
`HandDetector`	21 landmarks, gesture recognition, finger-join detection, distance estimation, palm width
`PoseDetector`	33 body landmarks, joint angle calculation, exercise detection, workout rep counter, segmentation
`ObjectDetector`	EfficientDet-based multi-class detection with bounding boxes and labels
`SelfieSegmentation`	Background removal, blur, replacement, virtual backgrounds, alpha blending
`HairSegmentation`	Hair region segmentation and recoloring
`ScreenCapture`	High-performance screen grabbing via `mss`
`video_capture_template`	Drop-in webcam loop with FPS overlay, recording, and screenshot support
`image_template`	Single-image processing template with auto-centering, resize, and custom logic hook
`TextDetector`	Tesseract OCR with character/word/digit/table detection, NLP entity extraction, image matching, handwriting support

Requirements

Python >= 3.11.8
A .tflite / .task model file for each MediaPipe detector (see Model Downloads)

TextDetector additional requirements

TextDetector uses Tesseract OCR and optional NLP tooling that are not bundled with MediaPipe.

1. Install Tesseract binary (system-level):

# macOS
brew install tesseract

# Ubuntu / Debian
sudo apt-get install tesseract-ocr

# Windows
# Download installer from https://github.com/UB-Mannheim/tesseract/wiki

2. Install Python packages:

# pip
pip install pytesseract imutils pandas scikit-image Pillow

# uv
uv add pytesseract imutils pandas scikit-image Pillow

3. (Optional) spaCy for NLP features (entity extraction, keyword extraction, summarization, relation extraction):

pip install spacy
python -m spacy download en_core_web_sm

Without spaCy, all NLP methods return empty results gracefully — the rest of TextDetector works without it.

Installation

pip

pip install openvisionkit

Or install directly from source:

pip install git+https://github.com/your-org/openvisionkit.git

uv

uv add openvisionkit

Or from source:

uv add git+https://github.com/your-org/openvisionkit.git

For development (editable install with all dev dependencies):

git clone https://github.com/your-org/openvisionkit.git
cd openvisionkit
make setup

Model Downloads

OpenVisionKit delegates inference to MediaPipe .tflite / .task model files. Download the models you need and place them in a models/ directory at your project root.

Detector	Model file	Download
`FaceDetector`	`face_detector.tflite`	MediaPipe Face Detector
`FaceMeshDetector`	`face_landmarker_v2_with_blendshapes.task`	MediaPipe Face Landmarker
`HandDetector`	`hand_landmarker.task`	MediaPipe Hand Landmarker
`PoseDetector`	`pose_landmarker.task`	MediaPipe Pose Landmarker
`ObjectDetector`	`efficientdet_lite.tflite`	MediaPipe Object Detector
`SelfieSegmentation`	`deeplab_v3.tflite`	MediaPipe Image Segmenter
`HairSegmentation`	`hair_segmenter.tflite`	MediaPipe Hair Segmenter

Quick Start

import cv2
from openvisionkit.capture.video_template import video_capture_template
from openvisionkit.lib.hand_detector import HandDetector

detector = HandDetector(model_path="./models/hand_landmarker.task")

def process(frame):
    frame = detector.draw_landmarks(frame)
    return frame

video_capture_template(custom_logic=process, window_name="Hand Tracking")

Usage

FaceDetector

Detects faces in an image or video stream and returns bounding boxes, keypoints, and confidence scores.

import cv2
from openvisionkit.lib.face_detector import FaceDetector

detector = FaceDetector(
    model_path="./models/face_detector.tflite",
    max_faces=5,
    running_mode="IMAGE",           # "IMAGE" | "VIDEO"
    min_detection_confidence=0.5,
    min_suppression_threshold=0.3,
)

frame = cv2.imread("photo.jpg")

# Returns annotated frame + list of detection dicts
annotated, detections = detector.detect_faces(frame, to_draw_bounding_box=True, to_draw_landmarks=True)

for det in detections:
    print(det["id"])                    # face index
    print(det["score"])                 # confidence 0–1
    print(det["bbox"])                  # (x, y, w, h)
    print(det["bbox_xyxy"])             # (x1, y1, x2, y2)
    print(det["center"])                # (cx, cy)
    print(det["normalized_keypoints"]) # list of (x, y) pixel coords for 6 landmarks

cv2.imshow("Faces", annotated)
cv2.waitKey(0)

Utility methods:

# Filter detections below a confidence threshold
confident = detector.filter_by_confidence(detections, threshold=0.7)

# Get the largest face by bounding-box area
biggest = detector.get_largest_face(detections)

# Crop face regions out of the image (optional pixel margin)
face_crops = detector.crop_faces(frame, detections, margin=10)

# Sort by area (descending) or any other detection key
sorted_faces = detector.sort_faces(detections, by="area")

# Intersection over Union, useful for NMS or tracking (needs at least two faces)
if len(detections) >= 2:
    iou = detector.get_iou(detections[0]["bbox_xyxy"], detections[1]["bbox_xyxy"])
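
For readers new to the metric, here is a minimal sketch of how IoU is conventionally computed for two boxes in `(x1, y1, x2, y2)` form. `box_iou` is an illustrative helper written for this README, not OpenVisionKit's implementation, and `get_iou` may differ in detail.

```python
def box_iou(box_a, box_b):
    """Intersection over Union for two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Overlap rectangle; width and height clamp to 0 when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0


# Two 10x10 boxes that share half their area.
print(box_iou((0, 0, 10, 10), (5, 0, 15, 10)))
```

A value near 1 means the boxes almost coincide, which is why a fixed IoU threshold is a common way to decide whether two detections refer to the same face.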

FaceMeshDetector

Detects 478 facial landmarks per face along with blendshape expressions and head-pose matrices.

import cv2
from openvisionkit.lib.face_mesh_detector import FaceMeshDetector

detector = FaceMeshDetector(
    model_path="./models/face_landmarker_v2_with_blendshapes.task",
    num_faces=2,
    min_face_detection_confidence=0.5,
    output_face_blendshapes=True,
    output_facial_transformation_matrixes=True,
)

frame = cv2.imread("face.jpg")

annotated, faces, blendshapes, matrices, bboxes = detector.face_mesh_detection(frame, drawLandMarks=True)

# faces[i]       -> list of [x, y] pixel coords for 478 landmarks
# blendshapes[i] -> dict of {blendshape_name: score}  (52 expressions)
# matrices[i]    -> 4x4 numpy head-pose matrix
# bboxes[i]      -> [min_x, min_y, max_x, max_y]

for i, blend in enumerate(blendshapes):
    # Rule-based emotion from blendshapes
    emotion = detector.get_emotion(blend)
    print(f"Face {i}: {emotion}")

    # Gaze direction (left eye shown; pass is_left_eye=False for the right eye)
    gaze = detector.get_eye_gaze_direction(faces[i], is_left_eye=True)
    print(f"Left gaze: {gaze}")   # "Left" | "Center" | "Right"

    # Mouth openness ratio (0 = closed, 0.5+ = wide open)
    ratio = detector.get_mouth_openness_ratio(faces[i])
    print(f"Mouth ratio: {ratio:.2f}")

    # Head pose angles from transformation matrix
    if matrices[i] is not None:
        yaw, pitch, roll = detector.get_head_pose_angles(matrices[i])
        print(f"Yaw: {yaw:.1f}  Pitch: {pitch:.1f}  Roll: {roll:.1f}")

    # Inter-pupillary distance
    ipd = detector.get_inter_pupillary_distance(faces[i], normalized=False)
    print(f"IPD: {ipd:.1f}px")
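
To make the head-pose step less opaque, the sketch below shows one common way to read yaw, pitch, and roll out of the rotation block of a 4x4 transformation matrix. The axis convention (`R = Rz(roll) · Ry(yaw) · Rx(pitch)`) and the helper name `euler_angles_deg` are assumptions made for illustration; `get_head_pose_angles` may use a different convention and return different numbers for the same matrix.

```python
import math


def euler_angles_deg(matrix):
    """Return (yaw, pitch, roll) in degrees from a 3x3 or 4x4 matrix.

    Assumes R = Rz(roll) @ Ry(yaw) @ Rx(pitch); other conventions give
    different numbers for the same matrix.
    """
    r = [row[:3] for row in matrix[:3]]  # upper-left 3x3 rotation block

    yaw = math.atan2(-r[2][0], math.hypot(r[0][0], r[1][0]))  # about Y
    pitch = math.atan2(r[2][1], r[2][2])                      # about X
    roll = math.atan2(r[1][0], r[0][0])                       # about Z
    return tuple(math.degrees(a) for a in (yaw, pitch, roll))


# A pure 30-degree rotation about the Y axis, embedded in a 4x4 matrix.
c, s = math.cos(math.radians(30)), math.sin(math.radians(30))
turn_y = [
    [c, 0.0, s, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [-s, 0.0, c, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
yaw, pitch, roll = euler_angles_deg(turn_y)
print(f"Yaw: {yaw:.1f}  Pitch: {pitch:.1f}  Roll: {roll:.1f}")
```

Near a yaw of ±90 degrees this decomposition becomes ill-conditioned (gimbal lock), so pitch and roll read from it should be treated with care at extreme head turns.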

AR overlay example:

# Overlay a PNG glasses filter (must have alpha channel)
glasses = cv2.imread("glasses.png", cv2.IMREAD_UNCHANGED)   # RGBA
frame_with_glasses = detector.overlay_ar_filter(frame, faces[0], glasses, filter_type="glasses")

HandDetector

Tracks up to `max_hands` hands with 21 landmarks each, and provides gesture recognition, finger-join detection, and distance estimation.

import cv2
from openvisionkit.lib.hand_detector import HandDetector

detector = HandDetector(
    model_path="./models/hand_landmarker.task",
    running_mode="IMAGE",       # "IMAGE" | "VIDEO"
    max_hands=2,
    detection_confidence=0.5,
    tracking_confidence=0.5,
    smoothing_window=8,
)

frame = cv2.imread("hand.jpg")

# Draw landmarks, bounding box, and handedness label
annotated = detector.draw_landmarks(
    frame,
    to_draw_landmark=True,
    to_draw_bounding_box=True,
    to_put_handle_label=True,
)

# Get structured landmark data for all detected hands
all_hands = detector.get_landmarks(frame)

for hand in all_hands:
    print(hand["hand_type"])          # "Left" or "Right"
    print(hand["bounding_box"])       # (x, y, w, h)
    print(hand["center_point"])       # (cx, cy)
    lm = hand["landmarks_list"]       # list of [id, x, y, z]

    # Which fingers are raised?
    fingers = detector.fingers_up(lm)
    # [thumb, index, middle, ring, little] — 1=up, 0=down

    # Gesture shortcuts
    print(detector.is_fist())
    print(detector.is_thumbs_up())
    print(detector.is_peace_sign())
    print(detector.is_open_hand())

    # Distance between any two landmarks with visual feedback
    p1 = (lm[4][1], lm[4][2])   # thumb tip
    p2 = (lm[8][1], lm[8][2])   # index tip
    length, annotated, coords = detector.get_distance(p1, p2, annotated)
    print(f"Thumb-index distance: {length:.1f}px")

    # Detect if two finger tips are touching
    joined = detector.is_fingers_joined(4, 8, annotated, lm, threshold=0.25)

    # Palm width in pixels (stable reference)
    palm_px, idx_mcp, pinky_mcp = detector.palm_width_px(frame, lm)
    print(f"Palm width: {palm_px:.1f}px")

Distance estimation (calibration-based):

# Provide (palm_width_px, distance_cm) pairs to calibrate
calibration = [(180, 20), (120, 35), (80, 55), (60, 75)]
detector_calibrated = HandDetector(
    model_path="./models/hand_landmarker.task",
    calibration_samples=calibration,
)

# After calibration, estimate distance from a new palm width
distance_cm = detector_calibrated.estimate_distance_cm(palm_width_px=110)
print(f"Estimated distance: {distance_cm:.1f} cm")
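
How the calibration pairs are turned into an estimate is not documented here. One simple approach, sketched below purely as an assumption, is piecewise-linear interpolation between the calibration samples, clamped to the calibrated range. `interpolate_distance_cm` is a hypothetical helper; `estimate_distance_cm` may fit a curve instead.

```python
def interpolate_distance_cm(palm_width_px, samples):
    """Piecewise-linear lookup of distance from (palm_width_px, distance_cm) pairs."""
    pts = sorted(samples)  # ascending palm width
    if len(pts) < 2:
        raise ValueError("need at least two calibration samples")

    # Clamp to the calibrated range instead of extrapolating.
    if palm_width_px <= pts[0][0]:
        return float(pts[0][1])
    if palm_width_px >= pts[-1][0]:
        return float(pts[-1][1])

    # Find the pair of neighbouring samples that brackets the query width.
    for (w0, d0), (w1, d1) in zip(pts, pts[1:]):
        if w0 <= palm_width_px <= w1:
            t = (palm_width_px - w0) / (w1 - w0)
            return d0 + t * (d1 - d0)


calibration = [(180, 20), (120, 35), (80, 55), (60, 75)]
print(interpolate_distance_cm(110, calibration))
```

Interpolation only behaves well inside the sampled range, so calibration samples should cover the nearest and farthest distances you expect in use.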

PoseDetector

Detects 33 body landmarks. Supports joint angle calculation, exercise classification, workout rep counting, and body segmentation.

import cv2
from openvisionkit.lib.pose_detector import PoseDetector
from mediapipe.tasks.python import vision

detector = PoseDetector(
    model_path="./models/pose_landmarker.task",
    running_mode=vision.RunningMode.VIDEO,   # VIDEO for webcam streams
    num_poses=1,
    min_pose_detection_confidence=0.5,
    output_segmentation_masks=True,
)

cap = cv2.VideoCapture(0)
while True:
    ret, frame = cap.read()
    if not ret:
        break

    # Detect and annotate
    annotated, result = detector.detect(frame, draw_landmarks=True)

    # All landmark positions as pixel dicts
    landmarks = detector.get_all_postion(frame, result)

    # Get a specific landmark (e.g. nose = id 0)
    nose = detector.get_landmark(result, pose_index=0, landmark_id=0)
    print(nose["x"], nose["y"], nose["visibility"])

    # Calculate joint angle — e.g. left elbow (shoulder=11, elbow=13, wrist=15)
    annotated, angle = detector.calculate_angle(annotated, result, p1=11, p2=13, p3=15)
    print(f"Left elbow angle: {angle:.1f} degrees")

    # Classify current exercise
    exercise = detector.detect_exercise(annotated, result)
    print(f"Exercise: {exercise}")

    # Workout rep counter (tracks bicep curls automatically)
    angle, percent, reps = detector.calculate_workout_percentage()
    stats = detector.get_workout_stats(annotated)
    print(f"Reps: {stats['reps']}  Calories: {stats['calories']:.1f}")

    # Body segmentation overlay (requires output_segmentation_masks=True)
    annotated = detector.draw_segmentation_mask(annotated, result, alpha=0.5, color=(0, 255, 0))

    cv2.imshow("Pose", annotated)
    if cv2.waitKey(1) & 0xFF == 27:
        break

cap.release()
cv2.destroyAllWindows()
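
The joint angle in the loop above is the angle at the middle landmark of a three-point chain. A minimal sketch of that geometry on plain 2D pixel coordinates follows; `joint_angle_deg` is an illustrative helper, not the body of `calculate_angle`, which also draws on the frame.

```python
import math


def joint_angle_deg(a, b, c):
    """Angle ABC in degrees (0-180), measured at the middle point b."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])

    # atan2(|cross|, dot) is numerically more stable than acos(dot / norms).
    cross = v1[0] * v2[1] - v1[1] * v2[0]
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    return math.degrees(math.atan2(abs(cross), dot))


# Made-up pixel positions forming a right angle at the elbow.
shoulder, elbow, wrist = (100, 100), (100, 200), (200, 200)
print(f"Elbow angle: {joint_angle_deg(shoulder, elbow, wrist):.1f} degrees")
```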

Auto-select the most visible arm for curl tracking:

p1, p2, p3 = detector.select_active_arm(result)
annotated, angle = detector.calculate_angle(annotated, result, p1, p2, p3)
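
Rep counting from an elbow angle is typically a small state machine: map the angle to a 0-100 % curl value, then count one rep each time the arm goes from fully flexed back to fully extended. The sketch below illustrates that idea; the class name `CurlRepCounter` and the 160/40 degree thresholds are assumptions, not values taken from OpenVisionKit.

```python
class CurlRepCounter:
    """Counts one rep per full extend -> flex -> extend cycle of the elbow angle."""

    def __init__(self, extended_deg=160.0, flexed_deg=40.0):
        # Illustrative thresholds, not values taken from OpenVisionKit.
        self.extended_deg = extended_deg
        self.flexed_deg = flexed_deg
        self.reps = 0
        self._flexed = False  # True once the arm has reached full flexion

    def percent(self, angle_deg):
        """Map the angle to 0 (extended) .. 100 (fully flexed), clamped."""
        span = self.extended_deg - self.flexed_deg
        raw = (self.extended_deg - angle_deg) / span * 100.0
        return max(0.0, min(100.0, raw))

    def update(self, angle_deg):
        pct = self.percent(angle_deg)
        if pct >= 100.0:
            self._flexed = True
        elif pct <= 0.0 and self._flexed:
            self.reps += 1  # arm returned to full extension
            self._flexed = False
        return pct, self.reps


counter = CurlRepCounter()
for angle in [170, 120, 60, 35, 90, 165, 100, 38, 150, 175]:
    pct, reps = counter.update(angle)
print(f"Reps: {reps}")
```

Requiring both extremes before counting keeps small jitters around a single threshold from registering as extra reps.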

ObjectDetector

Detects multiple object classes in a frame using EfficientDet Lite.

import cv2
from openvisionkit.lib.object_detector import ObjectDetector

detector = ObjectDetector(
    model_path="./models/efficientdet_lite.tflite",
    max_results=5,
    running_mode="IMAGE",           # "IMAGE" | "VIDEO"
    category_allowlist=None,        # e.g. ["person", "car"] to restrict classes
    category_denylist=None,
)

frame = cv2.imread("street.jpg")

# Returns annotated image with bounding boxes and labels drawn
annotated = detector.detect_objects(frame)

cv2.imshow("Objects", annotated)
cv2.waitKey(0)

# Or get raw detection result for custom processing
result, mp_image = detector.detect(frame)
for detection in result.detections:
    label = detection.categories[0].category_name
    score = detection.categories[0].score
    bbox  = detection.bounding_box
    print(f"{label}: {score:.2f} @ ({bbox.origin_x}, {bbox.origin_y})")

SelfieSegmentation

Separates people from the background using DeepLab V3 and offers several compositing modes.

import cv2
from openvisionkit.lib.selfie_segmentation import SelfieSegmentation

seg = SelfieSegmentation(
    model_path="./models/deeplab_v3.tflite",
    output_category_mask=True,
)

frame = cv2.imread("selfie.jpg")

# Remove background (black fill)
no_bg = seg.remove_background(frame)

# Blur background
blurred = seg.blur_background(frame, blur_strength=(55, 55))

# Replace background with an image
replaced = seg.replace_background(frame, background_path="./bg.jpg")

# Solid color background
colored = seg.color_background(frame, color=(0, 120, 255))

# Alpha-blend foreground over a custom background array
bg = cv2.imread("./bg.jpg")
blended = seg.alpha_blend(frame, bg)

# Optimized virtual background with temporal smoothing + edge refinement
# (best for real-time webcam use)
output = seg.optimize_virtual_background(frame, bg)

# Single-person isolation — removes other people in the background
output = seg.optimize_virtual_background_improved(frame, bg)

# Debug: visualize the raw segmentation heatmap
heatmap = seg.overlay_mask(frame)

cv2.imshow("Segmented", output)
cv2.waitKey(0)
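
Most of the compositing modes above presumably reduce to the same per-pixel operation: weight the foreground by the mask and the background by its complement. The NumPy sketch below shows that blend in isolation. `alpha_blend` here is a standalone illustration with an explicit `mask` argument, which is an assumption; `seg.alpha_blend` computes its mask internally and may post-process it.

```python
import numpy as np


def alpha_blend(foreground, background, mask):
    """Blend two HxWx3 uint8 images using an HxW mask in [0, 1] (1 = keep foreground)."""
    alpha = np.clip(mask.astype(np.float32), 0.0, 1.0)[..., None]  # HxWx1 for broadcasting
    blended = alpha * foreground.astype(np.float32) + (1.0 - alpha) * background.astype(np.float32)
    return np.clip(np.rint(blended), 0, 255).astype(np.uint8)


# Tiny synthetic example: white foreground, black background, soft mask.
fg = np.full((1, 3, 3), 255, dtype=np.uint8)
bg = np.zeros((1, 3, 3), dtype=np.uint8)
mask = np.array([[0.0, 0.5, 1.0]], dtype=np.float32)
print(alpha_blend(fg, bg, mask)[0, :, 0])
```

Both images must have the same height and width; with OpenCV that usually means calling `cv2.resize` on the background first.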

HairSegmentation

Segments hair regions for recoloring or styling effects.

import cv2
import numpy as np
from openvisionkit.lib.hair_segmentation import HairSegmentation

seg = HairSegmentation(model_path="./models/hair_segmenter.tflite")

frame = cv2.imread("portrait.jpg")
rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

result = seg.process(rgb_frame)
mask = result.category_mask.numpy_view()    # shape (H, W), values 0–1

# Recolor hair to blue
hair_color = np.zeros_like(frame)
hair_color[:] = (255, 0, 0)                 # BGR blue
hair_region = (mask > 0.5)[..., None]
output = np.where(hair_region, hair_color, frame)

cv2.imshow("Hair", output)
cv2.waitKey(0)

ScreenCapture

Captures live frames from a monitor — useful for screen-based CV pipelines.

from openvisionkit.capture.screen_capture import ScreenCapture
import cv2

cap = ScreenCapture(monitor_index=1)  # 1 = primary monitor

while True:
    frame = cap.grab()                # returns BGR numpy array
    cv2.imshow("Screen", frame)
    if cv2.waitKey(1) & 0xFF == 27:
        break

cv2.destroyAllWindows()

video_capture_template

A reusable webcam loop that handles window setup, FPS display, recording, and screenshots. Pass a custom_logic callback for your processing.

import cv2
from openvisionkit.capture.video_template import video_capture_template
from openvisionkit.lib.face_detector import FaceDetector

detector = FaceDetector(model_path="./models/face_detector.tflite", running_mode="VIDEO")

def process(frame):
    annotated, _ = detector.detect_faces(frame)
    return annotated

video_capture_template(
    video_source=0,                       # webcam index or path to video file
    custom_logic=process,
    window_name="Face Detection",
    resolution=(1280, 720),
    draw_fps=True,
    enable_auto_recording=True,           # auto-saves .mp4 from first frame
    record_format="mp4",                  # "mp4" or "gif"
    enable_screenshot=True,               # press 's' to capture a frame
    auto_screenshot_after_seconds=10.0,   # also auto-capture after 10 s
    auto_screenshot_repeat=False,         # True = repeat every 10 s
)

Key bindings (built-in):

Key	Action	Condition
`ESC`	Exit loop	always
`s` / `S`	Save screenshot	`enable_screenshot=True`
`r` / `R`	Toggle manual recording on/off	`enable_manual_recording=True`

Stateful key handlers with KeyEventManager:

from openvisionkit.capture.video_template import KeyEventManager, video_capture_template

state = {"score": 0}
km = KeyEventManager()
km.register(ord("p"), lambda frame, s: print(f"Score: {s['score']}"))
km.register(ord("+"), lambda frame, s: s.update({"score": s["score"] + 1}))

video_capture_template(
    video_source=0,
    state=state,
    key_manager=km,
    custom_logic=lambda frame: frame,
)
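
Conceptually, a key-event manager is a dictionary from key codes to handlers that all receive the same frame and state. The sketch below shows that pattern without a camera. `MiniKeyManager` and its `dispatch` method are hypothetical stand-ins, and the real `KeyEventManager` may expose a different interface beyond `register`.

```python
class MiniKeyManager:
    """Hypothetical stand-in: maps key codes to handler(frame, state) callables."""

    def __init__(self):
        self._handlers = {}

    def register(self, key_code, handler):
        self._handlers[key_code] = handler

    def dispatch(self, key_code, frame, state):
        """Run the handler for key_code, if any; return True when one ran."""
        handler = self._handlers.get(key_code)
        if handler is None:
            return False
        handler(frame, state)
        return True


state = {"score": 0}
keys = MiniKeyManager()
keys.register(ord("+"), lambda frame, s: s.update({"score": s["score"] + 1}))

# In a capture loop the code would come from cv2.waitKey(1) & 0xFF.
for pressed in (ord("+"), ord("+"), ord("x")):
    keys.dispatch(pressed, frame=None, state=state)
print(state)
```

Because handlers mutate the shared `state` dict rather than returning values, the per-frame callback can read the same dict to react to key presses.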

Manual recording:

video_capture_template(
    video_source=0,
    enable_manual_recording=True,   # press R to start, R again to stop and save
    record_format="gif",
)

Parameter reference:

Parameter	Type	Default	Description
`video_source`	`int \| str`	`0`	Camera index or path to video file
`loop_forever`	`bool`	`True`	Loop video file when it ends
`custom_logic`	`Callable[[ndarray], ndarray]`	`None`	Per-frame processing; receives and returns BGR image
`state`	`dict`	`None`	Shared state dict passed to every key handler
`key_manager`	`KeyEventManager`	`None`	Custom key-event dispatcher
`window_name`	`str`	`"Demo"`	OpenCV window title
`show_window`	`bool`	`True`	Display the OpenCV window
`resolution`	`tuple[int, int]`	`(1280, 720)`	Camera resolution `(width, height)`
`center_window`	`bool`	`True`	Auto-center window on screen via pyautogui
`draw_fps`	`bool`	`True`	Overlay FPS counter on frame
`fps`	`int`	`15`	Recording frame rate (auto-recording only)
`mouse_callback`	`Callable`	`None`	OpenCV mouse-event callback
`mouse_callback_params`	`dict`	`None`	Extra params passed to mouse callback
`enable_auto_recording`	`bool`	`False`	Record every frame automatically from start
`enable_manual_recording`	`bool`	`False`	Allow toggling recording with `R` key
`record_format`	`str`	`"mp4"`	`"mp4"` or `"gif"`
`enable_screenshot`	`bool`	`False`	Enable `s`-key and auto-screenshot
`screenshot_output_dir`	`str`	`"screenshots"`	Directory for saved screenshots
`screenshot_prefix`	`str`	`"capture"`	Filename prefix before timestamp
`auto_screenshot_after_seconds`	`float`	`None`	Trigger first screenshot after N seconds
`auto_screenshot_repeat`	`bool`	`False`	Repeat auto-screenshot every N seconds
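
For context on what `draw_fps` has to measure, here is a minimal sketch of a rolling FPS estimate over the most recent frame timestamps. `RollingFps` is an illustrative class written for this README; it is not the library's `FPSCounter`, whose interface is not documented here.

```python
from collections import deque


class RollingFps:
    """Average frames-per-second over the last `window` frame timestamps."""

    def __init__(self, window=30):
        self._stamps = deque(maxlen=window)

    def tick(self, now):
        """Record a frame at time `now` (seconds) and return the current FPS."""
        self._stamps.append(now)
        if len(self._stamps) < 2:
            return 0.0
        elapsed = self._stamps[-1] - self._stamps[0]
        return (len(self._stamps) - 1) / elapsed if elapsed > 0 else 0.0


fps = RollingFps(window=5)
for i in range(10):
    # Simulated frames 50 ms apart; pass time.perf_counter() in a live loop.
    current = fps.tick(i * 0.05)
print(f"FPS: {current:.1f}")
```

Averaging over a window rather than a single frame interval keeps the overlay from flickering when individual frames arrive early or late.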

image_template

A single-image equivalent of video_capture_template. Loads one image from disk, applies an optional processing callback, resizes to the target resolution, auto-centers the window on screen, and displays it.

import cv2
from openvisionkit.capture.image_template import image_template
from openvisionkit.lib.face_detector import FaceDetector

detector = FaceDetector(model_path="./models/face_detector.tflite", running_mode="IMAGE")

def process(frame):
    annotated, _ = detector.detect_faces(frame)
    return annotated

image_template(
    image_path="photo.jpg",
    custom_logic=process,       # receives the loaded BGR image, must return BGR image
    window_name="Face Demo",
    resolution=(1280, 720),     # image is resized to this before display
    center_window=True,         # auto-centers window on screen via pyautogui
    show_window=True,           # set False to run headless (e.g. save to disk instead)
)

Without a custom_logic callback the image is loaded, resized, and displayed as-is:

image_template(image_path="photo.jpg")

Parameter reference:

Parameter	Type	Default	Description
`image_path`	`str`	required	Path to the image file
`custom_logic`	`Callable[[ndarray], ndarray]`	`None`	Processing function applied before display
`window_name`	`str`	`"Demo"`	OpenCV window title
`resolution`	`tuple[int, int]`	`(1280, 720)`	`(width, height)` to resize the image
`center_window`	`bool`	`True`	Move window to screen center via pyautogui
`show_window`	`bool`	`True`	Display the OpenCV window

TextDetector

Tesseract-backed OCR class with per-character, per-word, and per-digit detection, document boundary detection, table extraction, image-to-image feature matching, cursive/handwriting OCR, and optional NLP post-processing via spaCy.

Installation prerequisites

See TextDetector additional requirements above before using this class.

Basic OCR

import cv2
from openvisionkit.lib.text_detector import TextDetector

image = cv2.imread("document.jpg")

detector = TextDetector(
    image=image,
    lang="eng",         # Tesseract language code(s); multi-language: "eng+chi_sim"
    oem=3,              # OCR Engine Mode — 3 = default (LSTM preferred)
    psm=6,              # Page Segmentation Mode — 6 = single uniform text block
    preprocess=True,    # apply grayscale + histogram equalization + adaptive threshold
    use_gpu=False,      # enable OpenCL GPU acceleration for OpenCV ops
)

# Full text string from the image
text = detector.detect_text()
print(text)

# Switch language at runtime (no need to reinstantiate)
detector.set_language("eng+fra")

# Replace the image on an existing instance
new_image = cv2.imread("page2.jpg")
detector.set_image(new_image)

Word-level detection

words, annotated = detector.detect_words(
    draw_boxes=True,
    bounding_box_color=(255, 0, 0),   # BGR
    text_color=(255, 0, 0),
    font_scale=1,
    font_thickness=2,
)

for word in words:
    print(word["text"])   # recognized word string
    print(word["conf"])   # Tesseract confidence 0–100
    print(word["x"], word["y"], word["w"], word["h"])  # bounding box

cv2.imshow("Words", annotated)
cv2.waitKey(0)

# Convenience accessors
word_strings = detector.get_words()          # List[str]
lines         = detector.get_lines()          # List[str] — full lines
avg_conf      = detector.get_confidence()     # float — mean confidence across all words
df            = detector.to_dataframe()       # pandas DataFrame of word detections
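
Since each word is a plain dict, post-filtering needs no library support. The sketch below drops low-confidence words and averages what remains; the helper names, the threshold of 60, and the sample data are illustrative assumptions rather than part of OpenVisionKit.

```python
def filter_words(words, min_conf=60):
    """Keep detections with non-empty text and confidence >= min_conf."""
    return [w for w in words if w["text"].strip() and float(w["conf"]) >= min_conf]


def mean_confidence(words):
    """Mean confidence across detections, or 0.0 for an empty list."""
    return sum(float(w["conf"]) for w in words) / len(words) if words else 0.0


# Hand-written sample in the shape shown above (not real OCR output).
sample = [
    {"text": "Invoice", "conf": 96, "x": 10, "y": 12, "w": 80, "h": 20},
    {"text": "#", "conf": 41, "x": 95, "y": 12, "w": 10, "h": 20},
    {"text": "2031", "conf": 88, "x": 110, "y": 12, "w": 50, "h": 20},
]
kept = filter_words(sample)
print([w["text"] for w in kept], mean_confidence(kept))
```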

Character-level detection

chars, annotated = detector.detect_characters(
    draw_boxes=True,
    is_dark_background=False,    # set True to invert image before OCR
    adjust_text_height=20,       # vertical offset for label above bounding box
    bounding_box_color=(255, 0, 0),
    text_color=(255, 0, 0),
)

for c in chars:
    print(c["char"])               # single character string
    print(c["x1"], c["y1"])        # top-left (OpenCV coords)
    print(c["x2"], c["y2"])        # bottom-right (OpenCV coords)

Digit-only detection

digits, annotated = detector.detect_digits(image, draw_boxes=True)
print(digits)   # e.g. ['4', '2', '0']

Document & table detection

# Detect document boundary (returns 4-corner numpy array, or None)
corners = detector.detect_document()
if corners is not None:
    print("Document corners:", corners)

# Extract text from table regions using morphological line detection
tables = detector.detect_tables()
for table_text in tables:
    print(table_text)

Orientation & script detection

osd = detector.image_to_osd()
print(osd["Orientation in degrees"])   # e.g. '90'
print(osd["Script"])                   # e.g. 'Latin'
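
Tesseract reports orientation and script as plain `Key: value` text, which is why the dict values above are strings. The sketch below shows how such a report could be parsed into that shape. `parse_osd` is a hypothetical helper and the sample text is illustrative; whether `image_to_osd` parses the report this way internally is an assumption.

```python
def parse_osd(osd_text):
    """Turn 'Key: value' lines into a dict of strings."""
    fields = {}
    for line in osd_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon
            fields[key.strip()] = value.strip()
    return fields


# Illustrative report text, not output captured from a real run.
sample = """Page number: 0
Orientation in degrees: 90
Rotate: 270
Orientation confidence: 2.40
Script: Latin
Script confidence: 1.11"""

osd = parse_osd(sample)
print(osd["Orientation in degrees"], osd["Script"])
```

Convert numeric fields explicitly, for example `int(osd["Rotate"])`, before using them to rotate an image.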

Export formats

# PDF bytes
pdf_bytes = detector.image_to_pdf_or_hocr(extension="pdf")
with open("output.pdf", "wb") as f:
    f.write(pdf_bytes)

# hOCR HTML bytes
hocr_bytes = detector.image_to_pdf_or_hocr(extension="hocr")

# ALTO XML string (structured layout format for digital libraries)
alto_xml = detector.image_to_alto_xml()

Handwriting / cursive OCR

text, preprocessed = detector.extract_cursive_text(image)
print(text)
# preprocessed is the adaptive-threshold binary image used for OCR

Image preprocessing utilities

# Resize (uses imutils to preserve aspect ratio)
resized = detector.resize(width=800)

# Rotate (may clip corners)
rotated = detector.rotate(angle=45)

# Rotate without clipping
rotated_bound = detector.rotate_bound(angle=45)

# Auto deskew (corrects small rotation from skewed scans)
deskewed = detector.deskew()

# Auto Canny edge detection with sigma-based threshold
edges = detector.auto_canny(sigma=0.33)
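
The usual sigma-based rule for automatic Canny thresholds, popularized by `imutils`, places the lower and upper thresholds a fraction `sigma` below and above the median pixel intensity. The sketch below computes only the thresholds, in plain Python. That `detector.auto_canny` follows exactly this rule is an assumption, and `auto_canny_thresholds` is an illustrative name.

```python
import statistics


def auto_canny_thresholds(pixels, sigma=0.33):
    """Return (lower, upper) Canny thresholds centered on the median intensity."""
    median = statistics.median(pixels)
    lower = int(max(0, (1.0 - sigma) * median))
    upper = int(min(255, (1.0 + sigma) * median))
    return lower, upper


# Flattened grayscale intensities; with OpenCV this would be gray.ravel().
pixels = [40, 180, 200, 200, 220, 250, 90]
print(auto_canny_thresholds(pixels, sigma=0.25))
```

The two values would then be passed to `cv2.Canny(gray, lower, upper)`. A smaller `sigma` narrows the band and keeps only stronger edges.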

ORB keypoint detection and image matching

These methods are useful for comparing a scanned form against a template to detect alignment, tampering, or form type.

# Detect ORB keypoints and descriptors
keypoints, descriptors, annotated = detector.detect_keypoints(
    features=500,
    draw_keypoints=True,
    keypoint_color=(0, 255, 0),
)

# Compare two images using KNN feature matching + RANSAC homography
# Falls back to SSIM if not enough features are found
template = cv2.imread("template.jpg")
result = detector.compare_matches_knn_matcher(
    image2=template,
    form_name="Invoice",
    no_of_feature=500,
    matched_amount=50,
    percentage_of_matches=20,
    draw_matches=False,
    draw_aligned=False,
)
print(result["matches"])          # number of good matches
print(result["homography"])       # 3x3 transformation matrix
# result["aligned_image"]         # template warped to match the query
# result["matched_image"]         # side-by-side match visualization

# Brute-force matcher variant (no ratio test, faster but less selective)
result_bf = detector.compare_matches_bf_matcher(image2=template, form_name="Invoice")

# SSIM-based fallback (used automatically, also callable directly)
ssim_result = TextDetector.fallback_ssim(image, template, "Invoice")
print(ssim_result["ssim_score"])  # structural similarity 0.0–1.0
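
The note that the brute-force variant applies no ratio test suggests the KNN variant filters matches with Lowe's ratio test: a match is kept only when its best descriptor distance is clearly smaller than the second best. The sketch below shows that filter on plain numbers. The 0.75 ratio and the helper name are assumptions, not values read from OpenVisionKit.

```python
def ratio_test(knn_distances, ratio=0.75):
    """Keep index i when the best distance is clearly below the second best."""
    return [
        i
        for i, (best, second) in enumerate(knn_distances)
        if best < ratio * second
    ]


# (best, second-best) descriptor distances for four keypoints; made-up numbers.
pairs = [(20.0, 80.0), (55.0, 60.0), (10.0, 40.0), (70.0, 71.0)]
good = ratio_test(pairs)
print(f"{len(good)} of {len(pairs)} matches kept: {good}")
```

Ambiguous keypoints, whose two nearest neighbours are almost equally close, are discarded, which tends to make the later homography estimate more reliable.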

NLP methods (requires spaCy `en_core_web_sm`)

raw_text = detector.detect_text()

# Clean whitespace and newlines
clean = detector.clean_text(raw_text)

# Named entity recognition — returns list of {text, label} dicts
entities = detector.extract_entities(raw_text)
# e.g. [{"text": "Singapore", "label": "GPE"}, {"text": "2026", "label": "DATE"}]

# Group entities by label
grouped = detector.group_entities(raw_text)
# e.g. {"GPE": ["Singapore"], "DATE": ["2026"]}

# Keyword extraction (nouns and proper nouns, stop-words filtered)
keywords = detector.extract_keywords(raw_text)

# Extractive summarization (top N sentences)
summary = detector.summarize(raw_text, max_sentences=3)

# Subject-verb-object relation extraction
relations = detector.extract_relations(raw_text)
# e.g. [{"subject": ["John"], "verb": "signed", "object": ["contract"]}]
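
The grouped form is a straightforward regrouping of the entity list. As a sketch of the transformation (not the library's code), the helper below works on an already-extracted entity list rather than raw text and drops repeated mentions. Both choices are assumptions.

```python
from collections import defaultdict


def group_by_label(entities):
    """Group {'text', 'label'} dicts into {label: [texts]}, dropping duplicates."""
    grouped = defaultdict(list)
    for ent in entities:
        if ent["text"] not in grouped[ent["label"]]:
            grouped[ent["label"]].append(ent["text"])
    return dict(grouped)


entities = [
    {"text": "Singapore", "label": "GPE"},
    {"text": "2026", "label": "DATE"},
    {"text": "Singapore", "label": "GPE"},
]
print(group_by_label(entities))
```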

GPU acceleration

detector.enable_gpu()    # enables OpenCV OpenCL (requires compatible GPU)
detector.disable_gpu()   # revert to CPU

Project Structure

openvisionkit/
├── __init__.py               # package version (__version__)
├── lib/
│   ├── face_detector.py          # FaceDetector
│   ├── face_mesh_detector.py     # FaceMeshDetector (478 landmarks)
│   ├── hand_detector.py          # HandDetector (21 landmarks)
│   ├── pose_detector.py          # PoseDetector (33 landmarks)
│   ├── object_detector.py        # ObjectDetector (EfficientDet)
│   ├── selfie_segmentation.py    # SelfieSegmentation
│   ├── hair_segmentation.py      # HairSegmentation
│   ├── fps_counter.py            # FPSCounter utility
│   ├── classifier.py             # Generic classifier
│   ├── form_detector.py          # Form / document detector
│   ├── form_roi_detector.py      # Form region-of-interest detector
│   ├── form_roi_annotator.py     # Form annotation utilities
│   ├── image_detector.py         # Image-based detector
│   ├── image_hsv_detector.py     # HSV color-range detector
│   └── text_detector.py          # Text detection
├── capture/
│   ├── video_template.py         # video_capture_template loop
│   ├── screen_capture.py         # ScreenCapture
│   ├── video_recorder.py         # VideoRecorder
│   ├── image_template.py         # Single-image processing template
│   └── draw_object.py            # Drawing helpers
└── utility/
    ├── vision_utilis.py          # Shared image utilities
    └── live_plot.py              # Real-time matplotlib plotting

Running Modes

The MediaPipe-based detectors support three running modes:

Mode	Use case	Notes
`IMAGE`	Static images	No timestamp needed
`VIDEO`	Webcam / pre-recorded video	Pass `timestamp_ms` or let the detector auto-increment it
`LIVE_STREAM`	Async streaming	Results delivered via callback

Contributing

Dev setup

git clone https://github.com/your-org/openvisionkit.git
cd openvisionkit
make setup          # uv sync + install pre-commit hooks

Useful Make targets

Target	What it does
`make setup`	Install all deps + pre-commit hooks (run once after clone)
`make format`	Auto-format with black + isort
`make lint`	Run ruff + flake8
`make lint-fix`	Auto-fix ruff-fixable issues
`make test`	Run all non-integration tests
`make test-cov`	Run tests with HTML coverage report
`make typecheck`	mypy static analysis
`make check`	format-check + lint + typecheck (pre-push sanity)

Commit convention

All commits must follow Conventional Commits. The pre-commit hook enforces this.

Prefix	Effect
`fix:`, `perf:`, `refactor:`	patch release
`feat:`	minor release
`feat!:` or `BREAKING CHANGE:` footer	major release
`chore:`, `docs:`, `test:`, `ci:`	no release
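
The table above can be read as a small decision function. The sketch below is an illustration of that mapping, not the release tooling's own logic. It also treats a `!` after any type as breaking, which follows the Conventional Commits specification but goes slightly beyond the `feat!:` row listed here.

```python
import re

PATCH_TYPES = {"fix", "perf", "refactor"}


def release_type(commit_message):
    """Return 'major', 'minor', 'patch', or None for a commit message."""
    header, _, body = commit_message.partition("\n")
    match = re.match(r"^(?P<type>\w+)(\([^)]*\))?(?P<bang>!)?: ", header)
    if match is None:
        return None  # not a Conventional Commit header

    if match["bang"] or re.search(r"^BREAKING CHANGE: ", body, flags=re.MULTILINE):
        return "major"
    if match["type"] == "feat":
        return "minor"
    if match["type"] in PATCH_TYPES:
        return "patch"
    return None  # chore, docs, test, ci, ...


for msg in ("fix: handle empty frame", "feat(hand): add pinch gesture",
            "feat!: rename detect API", "docs: update README"):
    print(f"{msg!r} -> {release_type(msg)}")
```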

CI/CD

Workflow	Trigger	Purpose
`ci-unit.yml`	push / PR	Unit tests on Python 3.11 + 3.12
`ci-integration.yml`	push/PR to main, manual	Integration tests (requires model files)
`ci-security.yml`	push/PR to main, daily 02:00 UTC	pip-audit, Trivy, CodeQL
`renovate.yml`	weekly Monday 01:00 UTC	Automated dependency updates
`semantic-release.yml`	push to main	Semantic version bump + GitHub Release
`publish.yml`	GitHub Release published	Build + publish to PyPI via OIDC

Releases are fully automated — push commits to main and the semantic-release workflow handles version bumping, tagging, changelog generation, and PyPI publishing.

License

MIT

Release history

Version	Released
0.5.1	Jun 30, 2026
0.5.0 (this version)	Jun 30, 2026
0.4.0	Jun 30, 2026

Download files

Source Distribution

openvisionkit-0.5.0.tar.gz (332.6 kB view details)

Uploaded Jun 30, 2026 Source

Built Distribution

openvisionkit-0.5.0-py3-none-any.whl (130.3 kB view details)

Uploaded Jun 30, 2026 Python 3

File details

Details for the file openvisionkit-0.5.0.tar.gz.

File metadata

Download URL: openvisionkit-0.5.0.tar.gz
Upload date: Jun 30, 2026
Size: 332.6 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for openvisionkit-0.5.0.tar.gz
Algorithm	Hash digest
SHA256	`ad098751730c345a40346637a5b62f9a6763128a1336035075ee0319cb582d96`
MD5	`87b9d26144016fa76a576142eb72765c`
BLAKE2b-256	`980f21df3af36964e2a74c843ce2d226c32dcafcdea7c18c0d1124454875f645`

Provenance

The following attestation bundles were made for openvisionkit-0.5.0.tar.gz:

Publisher: semantic-versioning.yml on anurupborah2001/openvisionkit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: openvisionkit-0.5.0.tar.gz
- Subject digest: ad098751730c345a40346637a5b62f9a6763128a1336035075ee0319cb582d96
- Sigstore transparency entry: 2021121541
- Sigstore integration time: Jun 30, 2026
Source repository:
- Permalink: anurupborah2001/openvisionkit@d6cf67e8238ce33363b7302834352792d61400d2
- Branch / Tag: refs/heads/master
- Owner: https://github.com/anurupborah2001
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: semantic-versioning.yml@d6cf67e8238ce33363b7302834352792d61400d2
- Trigger Event: push

File details

Details for the file openvisionkit-0.5.0-py3-none-any.whl.

File metadata

Download URL: openvisionkit-0.5.0-py3-none-any.whl
Upload date: Jun 30, 2026
Size: 130.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for openvisionkit-0.5.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e747dd711abf15a9b9d6837ec45737dc7f85ec0122de96e00f3ebcc821bcf44c`
MD5	`a6ab8377241cba8bd8b5b6405fa1bfbd`
BLAKE2b-256	`62465137370962a90a5acf37070ea0aed1033d4b48d60951ffce7f4c6d9e8515`

Provenance

The following attestation bundles were made for openvisionkit-0.5.0-py3-none-any.whl:

Publisher: semantic-versioning.yml on anurupborah2001/openvisionkit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: openvisionkit-0.5.0-py3-none-any.whl
- Subject digest: e747dd711abf15a9b9d6837ec45737dc7f85ec0122de96e00f3ebcc821bcf44c
- Sigstore transparency entry: 2021121595
- Sigstore integration time: Jun 30, 2026
Source repository:
- Permalink: anurupborah2001/openvisionkit@d6cf67e8238ce33363b7302834352792d61400d2
- Branch / Tag: refs/heads/master
- Owner: https://github.com/anurupborah2001
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: semantic-versioning.yml@d6cf67e8238ce33363b7302834352792d61400d2
- Trigger Event: push
