Skip to main content

MediaPipe Tasks API wrapper for Python computer vision

Project description

OpenVisionKit

CI — Unit Tests Security Scan PyPI version Python 3.11+ License: MIT

OpenVisionKit is a high-level Python computer vision library built on top of MediaPipe and OpenCV. It provides production-ready detectors and segmentation utilities for face detection, face mesh, hand tracking, pose estimation, object detection, and background segmentation — wrapped in clean, developer-friendly APIs that eliminate boilerplate and let you focus on building.

Whether you are prototyping a gesture-controlled application, building a fitness tracker, adding AR effects, or conducting research, OpenVisionKit gives you the tools to go from camera frame to structured detections in a few lines of code.


Features

Module Capability
FaceDetector Bounding boxes, 6-point keypoints, confidence filtering, IoU, face cropping
FaceMeshDetector 478 landmarks, blendshapes, head pose (yaw/pitch/roll), gaze direction, emotion, AR overlays
HandDetector 21 landmarks, gesture recognition, finger-join detection, distance estimation, palm width
PoseDetector 33 body landmarks, joint angle calculation, exercise detection, workout rep counter, segmentation
ObjectDetector EfficientDet-based multi-class detection with bounding boxes and labels
SelfieSegmentation Background removal, blur, replacement, virtual backgrounds, alpha blending
HairSegmentation Hair region segmentation and recoloring
ScreenCapture High-performance screen grabbing via mss
video_capture_template Drop-in webcam loop with FPS overlay, recording, and screenshot support
image_template Single-image processing template with auto-centering, resize, and custom logic hook
TextDetector Tesseract OCR with character/word/digit/table detection, NLP entity extraction, image matching, handwriting support

Requirements

  • Python >= 3.11.8
  • A .tflite / .task model file for each MediaPipe detector (see Model Downloads)

TextDetector additional requirements

TextDetector uses Tesseract OCR and optional NLP tooling that are not bundled with MediaPipe.

1. Install Tesseract binary (system-level):

# macOS
brew install tesseract

# Ubuntu / Debian
sudo apt-get install tesseract-ocr

# Windows
# Download installer from https://github.com/UB-Mannheim/tesseract/wiki

2. Install Python packages:

# pip
pip install pytesseract imutils pandas scikit-image Pillow

# uv
uv add pytesseract imutils pandas scikit-image Pillow

3. (Optional) spaCy for NLP features (entity extraction, keyword extraction, summarization, relation extraction):

pip install spacy
python -m spacy download en_core_web_sm

Without spaCy, all NLP methods return empty results gracefully — the rest of TextDetector works without it.


Installation

pip

pip install openvisionkit

Or install directly from source:

pip install git+https://github.com/your-org/openvisionkit.git

uv

uv add openvisionkit

Or from source:

uv add git+https://github.com/your-org/openvisionkit.git

For development (editable install with all dev dependencies):

git clone https://github.com/your-org/openvisionkit.git
cd openvisionkit
make setup

Model Downloads

OpenVisionKit delegates inference to MediaPipe .tflite / .task model files. Download the models you need and place them in a models/ directory at your project root.

Detector Model file Download
FaceDetector face_detector.tflite MediaPipe Face Detector
FaceMeshDetector face_landmarker_v2_with_blendshapes.task MediaPipe Face Landmarker
HandDetector hand_landmarker.task MediaPipe Hand Landmarker
PoseDetector pose_landmarker.task MediaPipe Pose Landmarker
ObjectDetector efficientdet_lite.tflite MediaPipe Object Detector
SelfieSegmentation deeplab_v3.tflite MediaPipe Image Segmenter
HairSegmentation hair_segmenter.tflite MediaPipe Hair Segmenter

Quick Start

import cv2
from openvisionkit.capture.video_template import video_capture_template
from openvisionkit.lib.hand_detector import HandDetector

detector = HandDetector(model_path="./models/hand_landmarker.task")

def process(frame):
    frame = detector.draw_landmarks(frame)
    return frame

video_capture_template(custom_logic=process, window_name="Hand Tracking")

Usage

FaceDetector

Detects faces in an image or video stream and returns bounding boxes, keypoints, and confidence scores.

import cv2
from openvisionkit.lib.face_detector import FaceDetector

detector = FaceDetector(
    model_path="./models/face_detector.tflite",
    max_faces=5,
    running_mode="IMAGE",           # "IMAGE" | "VIDEO"
    min_detection_confidence=0.5,
    min_suppression_threshold=0.3,
)

frame = cv2.imread("photo.jpg")

# Returns annotated frame + list of detection dicts
annotated, detections = detector.detect_faces(frame, to_draw_bounding_box=True, to_draw_landmarks=True)

for det in detections:
    print(det["id"])                    # face index
    print(det["score"])                 # confidence 0–1
    print(det["bbox"])                  # (x, y, w, h)
    print(det["bbox_xyxy"])             # (x1, y1, x2, y2)
    print(det["center"])                # (cx, cy)
    print(det["normalized_keypoints"]) # list of (x, y) pixel coords for 6 landmarks

cv2.imshow("Faces", annotated)
cv2.waitKey(0)

Utility methods:

# Filter detections below a confidence threshold
confident = detector.filter_by_confidence(detections, threshold=0.7)

# Get the largest face by bounding-box area
biggest = detector.get_largest_face(detections)

# Crop face regions out of the image (optional pixel margin)
face_crops = detector.crop_faces(frame, detections, margin=10)

# Sort by area (descending) or any other detection key
sorted_faces = detector.sort_faces(detections, by="area")

# Intersection over Union — useful for NMS or tracking
iou = detector.get_iou(detections[0]["bbox_xyxy"], detections[1]["bbox_xyxy"])

FaceMeshDetector

Detects 478 facial landmarks per face along with blendshape expressions and head-pose matrices.

import cv2
from openvisionkit.lib.face_mesh_detector import FaceMeshDetector

detector = FaceMeshDetector(
    model_path="./models/face_landmarker_v2_with_blendshapes.task",
    num_faces=2,
    min_face_detection_confidence=0.5,
    output_face_blendshapes=True,
    output_facial_transformation_matrixes=True,
)

frame = cv2.imread("face.jpg")

annotated, faces, blendshapes, matrices, bboxes = detector.face_mesh_detection(frame, drawLandMarks=True)

# faces[i]       -> list of [x, y] pixel coords for 478 landmarks
# blendshapes[i] -> dict of {blendshape_name: score}  (52 expressions)
# matrices[i]    -> 4x4 numpy head-pose matrix
# bboxes[i]      -> [min_x, min_y, max_x, max_y]

for i, blend in enumerate(blendshapes):
    # Rule-based emotion from blendshapes
    emotion = detector.get_emotion(blend)
    print(f"Face {i}: {emotion}")

    # Gaze direction for each eye
    gaze = detector.get_eye_gaze_direction(faces[i], is_left_eye=True)
    print(f"Left gaze: {gaze}")   # "Left" | "Center" | "Right"

    # Mouth openness ratio (0 = closed, 0.5+ = wide open)
    ratio = detector.get_mouth_openness_ratio(faces[i])
    print(f"Mouth ratio: {ratio:.2f}")

    # Head pose angles from transformation matrix
    if matrices[i] is not None:
        yaw, pitch, roll = detector.get_head_pose_angles(matrices[i])
        print(f"Yaw: {yaw:.1f}  Pitch: {pitch:.1f}  Roll: {roll:.1f}")

    # Inter-pupillary distance
    ipd = detector.get_inter_pupillary_distance(faces[i], normalized=False)
    print(f"IPD: {ipd:.1f}px")

AR overlay example:

# Overlay a PNG glasses filter (must have alpha channel)
glasses = cv2.imread("glasses.png", cv2.IMREAD_UNCHANGED)   # RGBA
frame_with_glasses = detector.overlay_ar_filter(frame, faces[0], glasses, filter_type="glasses")

HandDetector

Tracks up to N hands with 21 landmarks each. Provides gesture recognition, finger-join detection, and distance estimation.

import cv2
from openvisionkit.lib.hand_detector import HandDetector

detector = HandDetector(
    model_path="./models/hand_landmarker.task",
    running_mode="IMAGE",       # "IMAGE" | "VIDEO"
    max_hands=2,
    detection_confidence=0.5,
    tracking_confidence=0.5,
    smoothing_window=8,
)

frame = cv2.imread("hand.jpg")

# Draw landmarks, bounding box, and handedness label
annotated = detector.draw_landmarks(
    frame,
    to_draw_landmark=True,
    to_draw_bounding_box=True,
    to_put_handle_label=True,
)

# Get structured landmark data for all detected hands
all_hands = detector.get_landmarks(frame)

for hand in all_hands:
    print(hand["hand_type"])          # "Left" or "Right"
    print(hand["bounding_box"])       # (x, y, w, h)
    print(hand["center_point"])       # (cx, cy)
    lm = hand["landmarks_list"]       # list of [id, x, y, z]

    # Which fingers are raised?
    fingers = detector.fingers_up(lm)
    # [thumb, index, middle, ring, little] — 1=up, 0=down

    # Gesture shortcuts
    print(detector.is_fist())
    print(detector.is_thumbs_up())
    print(detector.is_peace_sign())
    print(detector.is_open_hand())

    # Distance between any two landmarks with visual feedback
    p1 = (lm[4][1], lm[4][2])   # thumb tip
    p2 = (lm[8][1], lm[8][2])   # index tip
    length, annotated, coords = detector.get_distance(p1, p2, annotated)
    print(f"Thumb-index distance: {length:.1f}px")

    # Detect if two finger tips are touching
    joined = detector.is_fingers_joined(4, 8, annotated, lm, threshold=0.25)

    # Palm width in pixels (stable reference)
    palm_px, idx_mcp, pinky_mcp = detector.palm_width_px(frame, lm)
    print(f"Palm width: {palm_px:.1f}px")

Distance estimation (calibration-based):

# Provide (palm_width_px, distance_cm) pairs to calibrate
calibration = [(180, 20), (120, 35), (80, 55), (60, 75)]
detector_calibrated = HandDetector(
    model_path="./models/hand_landmarker.task",
    calibration_samples=calibration,
)

# After calibration, estimate distance from a new palm width
distance_cm = detector_calibrated.estimate_distance_cm(palm_width_px=110)
print(f"Estimated distance: {distance_cm:.1f} cm")

PoseDetector

Detects 33 body landmarks. Supports joint angle calculation, exercise classification, workout rep counting, and body segmentation.

import cv2
from openvisionkit.lib.pose_detector import PoseDetector
from mediapipe.tasks.python import vision

detector = PoseDetector(
    model_path="./models/pose_landmarker.task",
    running_mode=vision.RunningMode.VIDEO,   # VIDEO for webcam streams
    num_poses=1,
    min_pose_detection_confidence=0.5,
    output_segmentation_masks=True,
)

cap = cv2.VideoCapture(0)
while True:
    ret, frame = cap.read()
    if not ret:
        break

    # Detect and annotate
    annotated, result = detector.detect(frame, draw_landmarks=True)

    # All landmark positions as pixel dicts
    landmarks = detector.get_all_postion(frame, result)

    # Get a specific landmark (e.g. nose = id 0)
    nose = detector.get_landmark(result, pose_index=0, landmark_id=0)
    print(nose["x"], nose["y"], nose["visibility"])

    # Calculate joint angle — e.g. left elbow (shoulder=11, elbow=13, wrist=15)
    annotated, angle = detector.calculate_angle(annotated, result, p1=11, p2=13, p3=15)
    print(f"Left elbow angle: {angle:.1f} degrees")

    # Classify current exercise
    exercise = detector.detect_exercise(annotated, result)
    print(f"Exercise: {exercise}")

    # Workout rep counter (tracks bicep curls automatically)
    angle, percent, reps = detector.calculate_workout_percentage()
    stats = detector.get_workout_stats(annotated)
    print(f"Reps: {stats['reps']}  Calories: {stats['calories']:.1f}")

    # Body segmentation overlay (requires output_segmentation_masks=True)
    annotated = detector.draw_segmentation_mask(annotated, result, alpha=0.5, color=(0, 255, 0))

    cv2.imshow("Pose", annotated)
    if cv2.waitKey(1) & 0xFF == 27:
        break

cap.release()
cv2.destroyAllWindows()

Auto-select the most visible arm for curl tracking:

p1, p2, p3 = detector.select_active_arm(result)
annotated, angle = detector.calculate_angle(annotated, result, p1, p2, p3)

ObjectDetector

Detects multiple object classes in a frame using EfficientDet Lite.

import cv2
from openvisionkit.lib.object_detector import ObjectDetector

detector = ObjectDetector(
    model_path="./models/efficientdet_lite.tflite",
    max_results=5,
    running_mode="IMAGE",           # "IMAGE" | "VIDEO"
    category_allowlist=None,        # e.g. ["person", "car"] to restrict classes
    category_denylist=None,
)

frame = cv2.imread("street.jpg")

# Returns annotated image with bounding boxes and labels drawn
annotated = detector.detect_objects(frame)

cv2.imshow("Objects", annotated)
cv2.waitKey(0)

# Or get raw detection result for custom processing
result, mp_image = detector.detect(frame)
for detection in result.detections:
    label = detection.categories[0].category_name
    score = detection.categories[0].score
    bbox  = detection.bounding_box
    print(f"{label}: {score:.2f} @ ({bbox.origin_x}, {bbox.origin_y})")

SelfieSegmentation

Separates people from backgrounds using DeepLab V3. Multiple compositing modes available.

import cv2
from openvisionkit.lib.selfie_segmentation import SelfieSegmentation

seg = SelfieSegmentation(
    model_path="./models/deeplab_v3.tflite",
    output_category_mask=True,
)

frame = cv2.imread("selfie.jpg")

# Remove background (black fill)
no_bg = seg.remove_background(frame)

# Blur background
blurred = seg.blur_background(frame, blur_strength=(55, 55))

# Replace background with an image
replaced = seg.replace_background(frame, background_path="./bg.jpg")

# Solid color background
colored = seg.color_background(frame, color=(0, 120, 255))

# Alpha-blend foreground over a custom background array
bg = cv2.imread("./bg.jpg")
blended = seg.alpha_blend(frame, bg)

# Optimized virtual background with temporal smoothing + edge refinement
# (best for real-time webcam use)
output = seg.optimize_virtual_background(frame, bg)

# Single-person isolation — removes other people in the background
output = seg.optimize_virtual_background_improved(frame, bg)

# Debug: visualize the raw segmentation heatmap
heatmap = seg.overlay_mask(frame)

cv2.imshow("Segmented", output)
cv2.waitKey(0)

HairSegmentation

Segments hair regions for recoloring or styling effects.

import cv2
import numpy as np
from openvisionkit.lib.hair_segmentation import HairSegmentation

seg = HairSegmentation(model_path="./models/hair_segmenter.tflite")

frame = cv2.imread("portrait.jpg")
rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

result = seg.process(rgb_frame)
mask = result.category_mask.numpy_view()    # shape (H, W), values 0–1

# Recolor hair to blue
hair_color = np.zeros_like(frame)
hair_color[:] = (255, 0, 0)                 # BGR blue
hair_region = (mask > 0.5)[..., None]
output = np.where(hair_region, hair_color, frame)

cv2.imshow("Hair", output)
cv2.waitKey(0)

ScreenCapture

Captures live frames from a monitor — useful for screen-based CV pipelines.

from openvisionkit.capture.screen_capture import ScreenCapture
import cv2

cap = ScreenCapture(monitor_index=1)  # 1 = primary monitor

while True:
    frame = cap.grab()                # returns BGR numpy array
    cv2.imshow("Screen", frame)
    if cv2.waitKey(1) & 0xFF == 27:
        break

cv2.destroyAllWindows()

video_capture_template

A reusable webcam loop that handles window setup, FPS display, recording, and screenshots. Pass a custom_logic callback for your processing.

import cv2
from openvisionkit.capture.video_template import video_capture_template
from openvisionkit.lib.face_detector import FaceDetector

detector = FaceDetector(model_path="./models/face_detector.tflite", running_mode="VIDEO")

def process(frame):
    annotated, _ = detector.detect_faces(frame)
    return annotated

video_capture_template(
    video_source=0,                       # webcam index or path to video file
    custom_logic=process,
    window_name="Face Detection",
    resolution=(1280, 720),
    draw_fps=True,
    enable_auto_recording=True,           # auto-saves .mp4 from first frame
    record_format="mp4",                  # "mp4" or "gif"
    enable_screenshot=True,               # press 's' to capture a frame
    auto_screenshot_after_seconds=10.0,   # also auto-capture after 10 s
    auto_screenshot_repeat=False,         # True = repeat every 10 s
)

Key bindings (built-in):

Key Action Condition
ESC Exit loop always
s / S Save screenshot enable_screenshot=True
r / R Toggle manual recording on/off enable_manual_recording=True

Stateful key handlers with KeyEventManager:

from openvisionkit.capture.video_template import KeyEventManager, video_capture_template

state = {"score": 0}
km = KeyEventManager()
km.register(ord("p"), lambda frame, s: print(f"Score: {s['score']}"))
km.register(ord("+"), lambda frame, s: s.update({"score": s["score"] + 1}))

video_capture_template(
    video_source=0,
    state=state,
    key_manager=km,
    custom_logic=lambda frame: frame,
)

Manual recording:

video_capture_template(
    video_source=0,
    enable_manual_recording=True,   # press R to start, R again to stop and save
    record_format="gif",
)

Parameter reference:

Parameter Type Default Description
video_source int | str 0 Camera index or path to video file
loop_forever bool True Loop video file when it ends
custom_logic Callable[[ndarray], ndarray] None Per-frame processing; receives and returns BGR image
state dict None Shared state dict passed to every key handler
key_manager KeyEventManager None Custom key-event dispatcher
window_name str "Demo" OpenCV window title
show_window bool True Display the OpenCV window
resolution tuple[int, int] (1280, 720) Camera resolution (width, height)
center_window bool True Auto-center window on screen via pyautogui
draw_fps bool True Overlay FPS counter on frame
fps int 15 Recording frame rate (auto-recording only)
mouse_callback Callable None OpenCV mouse-event callback
mouse_callback_params dict None Extra params passed to mouse callback
enable_auto_recording bool False Record every frame automatically from start
enable_manual_recording bool False Allow toggling recording with R key
record_format str "mp4" "mp4" or "gif"
enable_screenshot bool False Enable s-key and auto-screenshot
screenshot_output_dir str "screenshots" Directory for saved screenshots
screenshot_prefix str "capture" Filename prefix before timestamp
auto_screenshot_after_seconds float None Trigger first screenshot after N seconds
auto_screenshot_repeat bool False Repeat auto-screenshot every N seconds

image_template

A single-image equivalent of video_capture_template. Loads one image from disk, applies an optional processing callback, resizes to the target resolution, auto-centers the window on screen, and displays it.

import cv2
from openvisionkit.capture.image_template import image_template
from openvisionkit.lib.face_detector import FaceDetector

detector = FaceDetector(model_path="./models/face_detector.tflite", running_mode="IMAGE")

def process(frame):
    annotated, _ = detector.detect_faces(frame)
    return annotated

image_template(
    image_path="photo.jpg",
    custom_logic=process,       # receives the loaded BGR image, must return BGR image
    window_name="Face Demo",
    resolution=(1280, 720),     # image is resized to this before display
    center_window=True,         # auto-centers window on screen via pyautogui
    show_window=True,           # set False to run headless (e.g. save to disk instead)
)

Without a custom_logic callback the image is loaded, resized, and displayed as-is:

image_template(image_path="photo.jpg")

Parameter reference:

Parameter Type Default Description
image_path str required Path to the image file
custom_logic Callable[[ndarray], ndarray] None Processing function applied before display
window_name str "Demo" OpenCV window title
resolution tuple[int, int] (1280, 720) (width, height) to resize the image
center_window bool True Move window to screen center via pyautogui
show_window bool True Display the OpenCV window

TextDetector

Tesseract-backed OCR class with per-character, per-word, and per-digit detection, document boundary detection, table extraction, image-to-image feature matching, cursive/handwriting OCR, and optional NLP post-processing via spaCy.

Installation prerequisites

See TextDetector additional requirements above before using this class.

Basic OCR

import cv2
from openvisionkit.lib.text_detector import TextDetector

image = cv2.imread("document.jpg")

detector = TextDetector(
    image=image,
    lang="eng",         # Tesseract language code(s); multi-language: "eng+chi_sim"
    oem=3,              # OCR Engine Mode — 3 = default (LSTM preferred)
    psm=6,              # Page Segmentation Mode — 6 = single uniform text block
    preprocess=True,    # apply grayscale + histogram equalization + adaptive threshold
    use_gpu=False,      # enable OpenCL GPU acceleration for OpenCV ops
)

# Full text string from the image
text = detector.detect_text()
print(text)

# Switch language at runtime (no need to reinstantiate)
detector.set_language("eng+fra")

# Replace the image on an existing instance
new_image = cv2.imread("page2.jpg")
detector.set_image(new_image)

Word-level detection

words, annotated = detector.detect_words(
    draw_boxes=True,
    bounding_box_color=(255, 0, 0),   # BGR
    text_color=(255, 0, 0),
    font_scale=1,
    font_thickness=2,
)

for word in words:
    print(word["text"])   # recognized word string
    print(word["conf"])   # Tesseract confidence 0–100
    print(word["x"], word["y"], word["w"], word["h"])  # bounding box

cv2.imshow("Words", annotated)
cv2.waitKey(0)

# Convenience accessors
word_strings = detector.get_words()          # List[str]
lines         = detector.get_lines()          # List[str] — full lines
avg_conf      = detector.get_confidence()     # float — mean confidence across all words
df            = detector.to_dataframe()       # pandas DataFrame of word detections

Character-level detection

chars, annotated = detector.detect_characters(
    draw_boxes=True,
    is_dark_background=False,    # set True to invert image before OCR
    adjust_text_height=20,       # vertical offset for label above bounding box
    bounding_box_color=(255, 0, 0),
    text_color=(255, 0, 0),
)

for c in chars:
    print(c["char"])               # single character string
    print(c["x1"], c["y1"])        # top-left (OpenCV coords)
    print(c["x2"], c["y2"])        # bottom-right (OpenCV coords)

Digit-only detection

digits, annotated = detector.detect_digits(image, draw_boxes=True)
print(digits)   # e.g. ['4', '2', '0']

Document & table detection

# Detect document boundary (returns 4-corner numpy array, or None)
corners = detector.detect_document()
if corners is not None:
    print("Document corners:", corners)

# Extract text from table regions using morphological line detection
tables = detector.detect_tables()
for table_text in tables:
    print(table_text)

Orientation & script detection

osd = detector.image_to_osd()
print(osd["Orientation in degrees"])   # e.g. '90'
print(osd["Script"])                   # e.g. 'Latin'

Export formats

# PDF bytes
pdf_bytes = detector.image_to_pdf_or_hocr(extension="pdf")
with open("output.pdf", "wb") as f:
    f.write(pdf_bytes)

# hOCR HTML bytes
hocr_bytes = detector.image_to_pdf_or_hocr(extension="hocr")

# ALTO XML string (structured layout format for digital libraries)
alto_xml = detector.image_to_alto_xml()

Handwriting / cursive OCR

text, preprocessed = detector.extract_cursive_text(image)
print(text)
# preprocessed is the adaptive-threshold binary image used for OCR

Image preprocessing utilities

# Resize (uses imutils to preserve aspect ratio)
resized = detector.resize(width=800)

# Rotate (may clip corners)
rotated = detector.rotate(angle=45)

# Rotate without clipping
rotated_bound = detector.rotate_bound(angle=45)

# Auto deskew (corrects small rotation from skewed scans)
deskewed = detector.deskew()

# Auto Canny edge detection with sigma-based threshold
edges = detector.auto_canny(sigma=0.33)

ORB keypoint detection and image matching

These methods are useful for comparing a scanned form against a template to detect alignment, tampering, or form type.

# Detect ORB keypoints and descriptors
keypoints, descriptors, annotated = detector.detect_keypoints(
    features=500,
    draw_keypoints=True,
    keypoint_color=(0, 255, 0),
)

# Compare two images using KNN feature matching + RANSAC homography
# Falls back to SSIM if not enough features are found
template = cv2.imread("template.jpg")
result = detector.compare_matches_knn_matcher(
    image2=template,
    form_name="Invoice",
    no_of_feature=500,
    matched_amount=50,
    percentage_of_matches=20,
    draw_matches=False,
    draw_aligned=False,
)
print(result["matches"])          # number of good matches
print(result["homography"])       # 3x3 transformation matrix
# result["aligned_image"]         # template warped to match the query
# result["matched_image"]         # side-by-side match visualization

# Brute-force matcher variant (no ratio test, faster but less selective)
result_bf = detector.compare_matches_bf_matcher(image2=template, form_name="Invoice")

# SSIM-based fallback (used automatically, also callable directly)
ssim_result = TextDetector.fallback_ssim(image, template, "Invoice")
print(ssim_result["ssim_score"])  # structural similarity 0.0–1.0

NLP methods (requires spaCy en_core_web_sm)

raw_text = detector.detect_text()

# Clean whitespace and newlines
clean = detector.clean_text(raw_text)

# Named entity recognition — returns list of {text, label} dicts
entities = detector.extract_entities(raw_text)
# e.g. [{"text": "Singapore", "label": "GPE"}, {"text": "2026", "label": "DATE"}]

# Group entities by label
grouped = detector.group_entities(raw_text)
# e.g. {"GPE": ["Singapore"], "DATE": ["2026"]}

# Keyword extraction (nouns and proper nouns, stop-words filtered)
keywords = detector.extract_keywords(raw_text)

# Extractive summarization (top N sentences)
summary = detector.summarize(raw_text, max_sentences=3)

# Subject-verb-object relation extraction
relations = detector.extract_relations(raw_text)
# e.g. [{"subject": ["John"], "verb": "signed", "object": ["contract"]}]

GPU acceleration

detector.enable_gpu()    # enables OpenCV OpenCL (requires compatible GPU)
detector.disable_gpu()   # revert to CPU

Project Structure

openvisionkit/
├── __init__.py               # package version (__version__)
├── lib/
│   ├── face_detector.py          # FaceDetector
│   ├── face_mesh_detector.py     # FaceMeshDetector (478 landmarks)
│   ├── hand_detector.py          # HandDetector (21 landmarks)
│   ├── pose_detector.py          # PoseDetector (33 landmarks)
│   ├── object_detector.py        # ObjectDetector (EfficientDet)
│   ├── selfie_segmentation.py    # SelfieSegmentation
│   ├── hair_segmentation.py      # HairSegmentation
│   ├── fps_counter.py            # FPSCounter utility
│   ├── classifier.py             # Generic classifier
│   ├── form_detector.py          # Form / document detector
│   ├── form_roi_detector.py      # Form region-of-interest detector
│   ├── form_roi_annotator.py     # Form annotation utilities
│   ├── image_detector.py         # Image-based detector
│   ├── image_hsv_detector.py     # HSV color-range detector
│   └── text_detector.py          # Text detection
├── capture/
│   ├── video_template.py         # video_capture_template loop
│   ├── screen_capture.py         # ScreenCapture
│   ├── video_recorder.py         # VideoRecorder
│   ├── image_template.py         # Single-image processing template
│   └── draw_object.py            # Drawing helpers
└── utility/
    ├── vision_utilis.py          # Shared image utilities
    └── live_plot.py              # Real-time matplotlib plotting

Running Modes

All detectors support three MediaPipe running modes:

Mode Use case Notes
IMAGE Static images No timestamp needed
VIDEO Webcam / pre-recorded video Pass timestamp_ms or let detector auto-increment
LIVE_STREAM Async streaming Results delivered via callback

Contributing

Dev setup

git clone https://github.com/your-org/openvisionkit.git
cd openvisionkit
make setup          # uv sync + install pre-commit hooks

Useful Make targets

Target What it does
make setup Install all deps + pre-commit hooks (run once after clone)
make format Auto-format with black + isort
make lint Run ruff + flake8
make lint-fix Auto-fix ruff-fixable issues
make test Run all non-integration tests
make test-cov Run tests with HTML coverage report
make typecheck mypy static analysis
make check format-check + lint + typecheck (pre-push sanity)

Commit convention

All commits must follow Conventional Commits. The pre-commit hook enforces this.

Prefix Effect
fix:, perf:, refactor: patch release
feat: minor release
feat!: or BREAKING CHANGE: footer major release
chore:, docs:, test:, ci: no release

CI/CD

Workflow Trigger Purpose
ci-unit.yml push / PR Unit tests on Python 3.11 + 3.12
ci-integration.yml push/PR to main, manual Integration tests (requires model files)
ci-security.yml push/PR to main, daily 02:00 UTC pip-audit, Trivy, CodeQL
renovate.yml weekly Monday 01:00 UTC Automated dependency updates
semantic-release.yml push to main Semantic version bump + GitHub Release
publish.yml GitHub Release published Build + publish to PyPI via OIDC

Releases are fully automated — push commits to main and the semantic-release workflow handles version bumping, tagging, changelog generation, and PyPI publishing.


License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

openvisionkit-0.5.0.tar.gz (332.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

openvisionkit-0.5.0-py3-none-any.whl (130.3 kB view details)

Uploaded Python 3

File details

Details for the file openvisionkit-0.5.0.tar.gz.

File metadata

  • Download URL: openvisionkit-0.5.0.tar.gz
  • Upload date:
  • Size: 332.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for openvisionkit-0.5.0.tar.gz
Algorithm Hash digest
SHA256 ad098751730c345a40346637a5b62f9a6763128a1336035075ee0319cb582d96
MD5 87b9d26144016fa76a576142eb72765c
BLAKE2b-256 980f21df3af36964e2a74c843ce2d226c32dcafcdea7c18c0d1124454875f645

See more details on using hashes here.

Provenance

The following attestation bundles were made for openvisionkit-0.5.0.tar.gz:

Publisher: semantic-versioning.yml on anurupborah2001/openvisionkit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file openvisionkit-0.5.0-py3-none-any.whl.

File metadata

  • Download URL: openvisionkit-0.5.0-py3-none-any.whl
  • Upload date:
  • Size: 130.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for openvisionkit-0.5.0-py3-none-any.whl
Algorithm Hash digest
SHA256 e747dd711abf15a9b9d6837ec45737dc7f85ec0122de96e00f3ebcc821bcf44c
MD5 a6ab8377241cba8bd8b5b6405fa1bfbd
BLAKE2b-256 62465137370962a90a5acf37070ea0aed1033d4b48d60951ffce7f4c6d9e8515

See more details on using hashes here.

Provenance

The following attestation bundles were made for openvisionkit-0.5.0-py3-none-any.whl:

Publisher: semantic-versioning.yml on anurupborah2001/openvisionkit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page