Sign Language Toolkit for sign language research

Sign Language Toolkit (SLTK)

A research toolkit for sign language video analysis. SLTK provides a complete pipeline from raw video to linguistic annotations: pose extraction, automatic segmentation, gloss spotting, and corpus analysis — all accessible via Python, CLI, or a web interface.

Installation

# Core library (data loading, ELAN I/O, CLI)
pip install signlangtk

# With web interface
pip install "signlangtk[api]"

# With GPU-accelerated pose extraction (WiLoR hand model)
pip install "signlangtk[wilor]"

# Everything
pip install "signlangtk[all]"

The PyPI package is called signlangtk, but the Python import is sltk:

import sltk
from sltk.data import PoseSequence, Segment

Requires Python 3.10+. For development from source:

git clone https://github.com/ed-fish/Sign-Language-Toolkit.git
cd Sign-Language-Toolkit
pip install -e ".[dev,api]"

The Processing Pipeline

SLTK's core workflow is a three-stage pipeline that turns raw sign language video into searchable, annotated ELAN files:

Video (.mp4)
    │
    ├── Stage 1: Pose Extraction ──► {stem}_wilor.h5
    │       WiLoR hand model: MANO rotation matrices + 3D keypoints
    │
    ├── Stage 2: Segmentation ──► {stem}_segments.eaf
    │       Transformer model predicts sign boundaries (BIO labels)
    │
    └── Stage 3: Spotting ──► {stem}_spotted.eaf
            SignRep model matches segments to a dictionary of known signs

Each stage can be run independently. If you already have H5 pose files, start at Stage 2. If you already have segment boundaries, start at Stage 3.


Stage 1: Pose Extraction

Extract hand poses from video using the WiLoR hand model. This produces an HDF5 file containing MANO rotation matrices and 21 3D keypoints per detected hand, per frame.

Python

from sltk.extraction.wilor import WiLoRExtractor

# Weights are auto-downloaded from HuggingFace Hub on first use
extractor = WiLoRExtractor()
extractor.load_model()  # downloads to ~/.cache/sltk/weights/ if needed
try:
    result = extractor.extract_from_video("video.mp4")  # saves to video_wilor.h5
finally:
    extractor.close()  # always close the extractor, even if extraction fails

API

# Start extraction job (runs in background on GPU)
curl -X POST http://localhost:8000/api/extraction/start \
  -H "Content-Type: application/json" \
  -d '{
    "video_path": "/data/video.mp4",
    "output_root": "/data/output",
    "config": {"enable_wilor": true, "device": "cuda"}
  }'
# Returns: {"job_id": "abc123", ...}

# Poll progress
curl http://localhost:8000/api/extraction/status/abc123

Output format

The WiLoR H5 file has this structure:

video_wilor.h5
├── attrs: fps, num_frames, resolution, extractor
├── frame_idx      (num_frames, 2)           # (start_idx, count) per frame
├── kpts_3d        (num_detections, 21, 3)   # 3D hand keypoints
├── right          (num_detections,)          # True = right hand
└── mano/
    ├── hand_pose      (num_detections, 15, 3, 3)   # joint rotations
    └── global_orient  (num_detections, 1, 3, 3)     # wrist rotation
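Because frame_idx stores a (start_idx, count) pair per frame, the per-detection datasets are flat and must be sliced to recover each frame's hands. The sketch below assumes exactly the layout shown above; detections_for_frame is a hypothetical helper, not part of sltk, and reading a real file additionally requires h5py.

```python
import numpy as np


def detections_for_frame(frame_idx, kpts_3d, right, frame):
    """Return {"left": (21, 3) or None, "right": (21, 3) or None} for one frame.

    Assumes frame_idx[frame] = (start_idx, count), indexing the flat
    per-detection arrays kpts_3d (num_detections, 21, 3) and right (num_detections,).
    """
    start, count = (int(v) for v in frame_idx[frame])
    hands = {"left": None, "right": None}
    for det in range(start, start + count):
        side = "right" if bool(right[det]) else "left"
        hands[side] = kpts_3d[det]  # if a side is detected twice, the last one wins
    return hands


# Toy arrays: frame 0 has a right hand, frame 1 has no hands, frame 2 has both.
frame_idx = np.array([[0, 1], [1, 0], [1, 2]])
kpts_3d = np.arange(3 * 21 * 3, dtype=np.float32).reshape(3, 21, 3)
right = np.array([True, False, True])
frame2 = detections_for_frame(frame_idx, kpts_3d, right, 2)

# With a real file (requires h5py):
# import h5py
# with h5py.File("video_wilor.h5", "r") as f:
#     hands = detections_for_frame(f["frame_idx"][:], f["kpts_3d"][:], f["right"][:], 0)
```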

Model weights

All model weights are automatically downloaded from HuggingFace Hub on first use and cached at ~/.cache/sltk/weights/. No manual download needed.

Resolution order (first match wins):

  1. Explicit path in config (e.g. WiLoRConfig(checkpoint_path="..."))
  2. Environment variable (e.g. SLTK_WILOR_CHECKPOINT)
  3. Bundled at sltk/weights/
  4. HF Hub cache at ~/.cache/sltk/weights/
  5. Auto-download from HuggingFace Hub

To override the cache location, set SLTK_WEIGHTS_DIR.
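The resolution order can be expressed as a chain of fallbacks. This is a minimal sketch, not sltk's own resolver: resolve_weights, its parameters and the injected download callable are hypothetical, and an explicit config path is returned as-is without an existence check (an assumption).

```python
import os
from pathlib import Path
from typing import Callable, Optional


def resolve_weights(filename: str,
                    explicit_path: Optional[str],
                    env_var: str,
                    bundled_dir: Path,
                    download: Callable[[Path], Path]) -> Path:
    """Return a checkpoint path following the documented first-match-wins order."""
    # 1. Explicit path from the config object
    if explicit_path:
        return Path(explicit_path)
    # 2. Per-model environment variable, e.g. SLTK_WILOR_CHECKPOINT
    env_value = os.environ.get(env_var)
    if env_value:
        return Path(env_value)
    # 3. Weights bundled inside the package
    bundled = bundled_dir / filename
    if bundled.exists():
        return bundled
    # 4. Local cache; SLTK_WEIGHTS_DIR overrides the default location
    cache_dir = Path(os.environ.get("SLTK_WEIGHTS_DIR",
                                    Path.home() / ".cache" / "sltk" / "weights"))
    cached = cache_dir / filename
    if cached.exists():
        return cached
    # 5. Nothing found locally: download into the cache directory
    return download(cached)
```

Injecting download keeps the sketch testable offline; in practice it would fetch the file from HuggingFace Hub.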

Other extractors (MediaPipe, NLF/SMPL-X, TEASER, RTMPose) are also available — see sltk/extraction/.


Stage 2: Segmentation

The segmenter is a 4-layer Transformer that reads WiLoR H5 files and predicts per-frame BIO labels (0=OUT, 1=IN_SIGN, 2=BEGIN), identifying where individual signs start and end.

Python — high level

from sltk.segmentation.runner import segment_h5
from sltk.segmentation.output import OutputFormat

# Segment a single file → ELAN output
segment_h5(
    "video_wilor.h5",
    output_path="video_segments.eaf",
    output_format=OutputFormat.ELAN,
    fps=25.0,
    media_path="video.mp4",  # links the video in the EAF
)

# Segment a single file → JSON output
segment_h5(
    "video_wilor.h5",
    output_path="video_segments.json",
    output_format=OutputFormat.JSON,
    fps=25.0,
)

# Segment an entire directory
segment_h5(
    "/data/poses/",
    output_path="/data/segments/output.json",
    output_format=OutputFormat.JSON,
    fps=25.0,
)

Python — low level

from sltk.segmentation.runner import get_runner
from sltk.segmentation.h5_loader import h5_to_features
from sltk.segmentation.postprocess import extract_segments

# Load H5 → 192-dim feature vectors (MANO rotations as axis-angle)
features = h5_to_features("video_wilor.h5")  # shape: (num_frames, 192)

# Run the Transformer model
runner = get_runner()  # singleton, loads checkpoint once
labels = runner.predict(features)  # shape: (num_frames,) values 0/1/2

# Extract segment boundaries
segments = extract_segments(labels)
# [(12, 45), (50, 82), (90, 120), ...]
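Turning per-frame BIO labels into boundaries is a single left-to-right scan. The sketch below is one plausible decoding, not necessarily what extract_segments does (it may, for example, use inclusive ends or drop very short segments): bio_to_segments is hypothetical and returns half-open (start, end) frame spans.

```python
from typing import List, Optional, Sequence, Tuple

OUT, IN_SIGN, BEGIN = 0, 1, 2  # label values documented above


def bio_to_segments(labels: Sequence[int]) -> List[Tuple[int, int]]:
    """Decode per-frame BIO labels into half-open (start, end) frame spans.

    BEGIN always opens a new segment; IN_SIGN with no open segment (a missed
    BEGIN) also opens one; OUT closes the current segment.
    """
    segments: List[Tuple[int, int]] = []
    start: Optional[int] = None
    for i, label in enumerate(labels):
        if label == BEGIN:
            if start is not None:  # back-to-back signs with no OUT gap
                segments.append((start, i))
            start = i
        elif label == IN_SIGN:
            if start is None:
                start = i
        elif start is not None:  # OUT ends the open segment
            segments.append((start, i))
            start = None
    if start is not None:  # sign continues to the last frame
        segments.append((start, len(labels)))
    return segments
```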

API

# Segment a single file
curl -X POST http://localhost:8000/api/segmentation/segment \
  -H "Content-Type: application/json" \
  -d '{"h5_path": "/data/video_wilor.h5", "fps": 25.0}'

# Batch segment a directory
curl -X POST http://localhost:8000/api/segmentation/segment/batch \
  -H "Content-Type: application/json" \
  -d '{
    "directory": "/data/poses/",
    "fps": 25.0,
    "output_path": "/data/segments/",
    "output_format": "json"
  }'

JSON output

{
  "video_name": {
    "fps": 25.0,
    "num_frames": 3000,
    "segments": [
      {"start_frame": 12, "end_frame": 45, "start_sec": 0.48, "end_sec": 1.80},
      {"start_frame": 50, "end_frame": 82, "start_sec": 2.00, "end_sec": 3.28}
    ]
  }
}
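The seconds fields follow from frame / fps, which reproduces the example values above (12 / 25 = 0.48 s, 45 / 25 = 1.80 s). A sketch that builds a dict of the same shape; segments_to_json is hypothetical, and rounding to two decimals is only for readability.

```python
import json
from typing import Dict, List, Tuple


def segments_to_json(video_name: str, segments: List[Tuple[int, int]],
                     fps: float, num_frames: int) -> Dict:
    """Build a dict shaped like the JSON output above (seconds = frame / fps)."""
    return {
        video_name: {
            "fps": fps,
            "num_frames": num_frames,
            "segments": [
                {
                    "start_frame": start,
                    "end_frame": end,
                    "start_sec": round(start / fps, 2),
                    "end_sec": round(end / fps, 2),
                }
                for start, end in segments
            ],
        }
    }


doc = segments_to_json("video_name", [(12, 45), (50, 82)], fps=25.0, num_frames=3000)
print(json.dumps(doc, indent=2))
```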

ELAN output

Creates a tier named {video_name}_segmentation with each segment labelled SIGN, authored by segmenter_v2.

Model checkpoint

Auto-downloaded from HuggingFace Hub on first use. Override with SLTK_SEGMENTOR_CHECKPOINT env var if needed.


Stage 3: Gloss Spotting

The spotter uses SignRep (a ViT-based model) to extract 768-dim visual features from 16-frame sliding windows, then matches each detected segment against a dictionary of known sign features using cosine similarity.
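The matching step itself is a cosine-similarity ranking. Below is a NumPy sketch, assuming the segment has already been pooled to one 768-dim vector; top_k_glosses is hypothetical, though it returns dicts with the same rank/gloss/similarity keys that the full-pipeline example prints.

```python
import numpy as np


def top_k_glosses(query: np.ndarray, dict_feats: np.ndarray,
                  glosses: list, k: int = 10) -> list:
    """Rank dictionary glosses by cosine similarity to one segment feature."""
    q = query / np.linalg.norm(query)
    d = dict_feats / np.linalg.norm(dict_feats, axis=1, keepdims=True)
    sims = d @ q                   # (num_signs,) cosine similarities
    order = np.argsort(-sims)[:k]  # highest similarity first
    return [
        {"rank": r + 1, "gloss": glosses[i], "similarity": float(sims[i])}
        for r, i in enumerate(order)
    ]
```

If both sides are already L2-normalized, as the continuous features are, the two normalization lines are no-ops and the ranking reduces to a dot product.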

Prerequisites

  • Segment boundaries from Stage 2 (or your own)
  • A dictionary: a folder of .npz files (one per sign), each containing a best_latent key with a 768-dim feature vector
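To check a dictionary folder outside sltk, it can be loaded with NumPy alone. This sketch assumes the gloss name is the file stem (HELLO.npz gives HELLO, matching the dictionary-building example); load_npz_dictionary is hypothetical and not the pipeline's load_dictionary.

```python
from pathlib import Path

import numpy as np


def load_npz_dictionary(directory, feature_key: str = "best_latent"):
    """Load every <GLOSS>.npz in a folder into (glosses, features[num_signs, 768])."""
    glosses, feats = [], []
    for path in sorted(Path(directory).glob("*.npz")):
        with np.load(path) as data:
            vec = np.asarray(data[feature_key], dtype=np.float32).reshape(-1)
        glosses.append(path.stem)  # assumed: file name carries the gloss
        feats.append(vec)
    features = np.stack(feats) if feats else np.zeros((0, 768), dtype=np.float32)
    return glosses, features
```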

Python — full pipeline

from sltk.embedding.pipeline import SignRepPipeline

pipeline = SignRepPipeline()

# 1. Extract dense features from the full video
continuous = pipeline.extract_continuous("video.mp4", stride=4)
# continuous.features shape: (num_windows, 768), L2-normalized

# 2. Load dictionary
dictionary = pipeline.load_dictionary(
    ["/data/dictionaries/bsldict/signrep/"],
    feature_key="best_latent",
)

# 3. Define segments (from Stage 2, or load from JSON/EAF)
segments = [
    {"segment_id": 0, "start_frame": 12, "end_frame": 45},
    {"segment_id": 1, "start_frame": 50, "end_frame": 82},
]

# 4. Match each segment against the dictionary
result = pipeline.spot(
    features=continuous,
    segments=segments,
    dictionary=dictionary,
    top_k=10,
    segment_pooling="max",  # or "mean", "softmax_weighted"
)

# 5. Inspect results
for seg in result.segments:
    print(f"Segment {seg.start_ms}ms–{seg.end_ms}ms:")
    for gl in seg.top_glosses:
        print(f"  Rank {gl['rank']}: {gl['gloss']} ({gl['similarity']:.3f})")

# 6. Save as ELAN (creates Rank-1..N and Score-1..N tiers)
result.save_eaf("video_spotted.eaf", media_path="video.mp4")
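How the sliding windows are mapped onto a segment before matching is not spelled out here. The sketch below assumes window w covers frames [w × stride, w × stride + 16) and pools the windows whose centre frame falls inside the segment. Only "max" and "mean" are implemented, since the exact definition of "softmax_weighted" is not given; pool_segment is hypothetical.

```python
import numpy as np

WINDOW = 16  # SignRep window length in frames, as stated above


def pool_segment(features: np.ndarray, start_frame: int, end_frame: int,
                 stride: int = 4, pooling: str = "max") -> np.ndarray:
    """Pool the window features covering [start_frame, end_frame) into a unit vector."""
    num_windows = features.shape[0]
    centres = np.arange(num_windows) * stride + WINDOW // 2
    mask = (centres >= start_frame) & (centres < end_frame)
    if not mask.any():
        # Segment shorter than the stride: use the window nearest its midpoint
        mid = (start_frame + end_frame) / 2
        mask = np.zeros(num_windows, dtype=bool)
        mask[np.argmin(np.abs(centres - mid))] = True
    selected = features[mask]
    if pooling == "max":
        pooled = selected.max(axis=0)
    elif pooling == "mean":
        pooled = selected.mean(axis=0)
    else:
        raise ValueError(f"unsupported pooling: {pooling}")
    return pooled / np.linalg.norm(pooled)  # re-normalize for cosine matching
```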

Python — one-shot

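from sltk.embedding.pipeline import SignRepPipeline

pipeline = SignRepPipeline()

# One call: extract features, load the dictionary and spot each segment
result = pipeline.spot_from_video(
    video_path="video.mp4",
    segments_json="video_segments.json",
    dictionary_dirs=["/data/dictionaries/bsldict/signrep/"],
    top_k=20,
    stride=4,
)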

API

# Extract continuous features (cached server-side)
curl -X POST http://localhost:8000/api/signrep/continuous/extract \
  -H "Content-Type: application/json" \
  -d '{"video_path": "/data/video.mp4", "stride": 4}'
# Returns: {"features_id": "abc123", "num_windows": 500, ...}

# Spot glosses
curl -X POST http://localhost:8000/api/signrep/spot \
  -H "Content-Type: application/json" \
  -d '{
    "features_id": "abc123",
    "segments": [{"segment_id": 0, "start_frame": 12, "end_frame": 45}],
    "dictionary_dirs": ["/data/dictionaries/bsldict/signrep/"],
    "top_k": 10
  }'

Building a dictionary

Before spotting, you need a dictionary. Extract one feature per isolated sign video:

from sltk.embedding.pipeline import SignRepPipeline

pipeline = SignRepPipeline()
result = pipeline.extract_dictionary("isolated_sign.mp4", method="middle")
result.save_npz("dictionary/HELLO.npz")
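If you already have a 768-dim feature for a sign from elsewhere, you can write it in the dictionary format described under Prerequisites (one .npz per sign with a best_latent key). save_dictionary_entry is hypothetical, and the L2 normalization is an assumption chosen to match the continuous features.

```python
from pathlib import Path

import numpy as np


def save_dictionary_entry(out_dir, gloss: str, feature) -> Path:
    """Write <out_dir>/<gloss>.npz containing one 768-dim 'best_latent' vector."""
    vec = np.asarray(feature, dtype=np.float32).reshape(-1)
    if vec.shape != (768,):
        raise ValueError(f"expected a 768-dim vector, got shape {vec.shape}")
    vec = vec / np.linalg.norm(vec)  # assumed: unit length, like the continuous features
    path = Path(out_dir) / f"{gloss}.npz"
    path.parent.mkdir(parents=True, exist_ok=True)
    np.savez(path, best_latent=vec)
    return path
```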

Batch extraction via the API:

curl -X POST http://localhost:8000/api/signrep/dictionary/batch/job \
  -H "Content-Type: application/json" \
  -d '{
    "video_dir": "/data/isolated_signs/",
    "output_dir": "/data/dictionary/",
    "method": "middle"
  }'

Model checkpoint

Auto-downloaded from HuggingFace Hub on first use. Override with SLTK_SIGNREP_CHECKPOINT env var if needed.


End-to-End Processing

The processing API combines Stages 2 and 3 into a single background job. It expects WiLoR H5 files to already exist alongside the videos ({stem}_wilor.h5).

API

# Segmentation only
curl -X POST http://localhost:8000/api/processing/submit \
  -H "Content-Type: application/json" \
  -d '{
    "video_paths": ["/data/video1.mp4", "/data/video2.mp4"],
    "type": "segments",
    "fps": 25.0
  }'

# Segmentation + spotting
curl -X POST http://localhost:8000/api/processing/submit \
  -H "Content-Type: application/json" \
  -d '{
    "video_paths": ["/data/video1.mp4"],
    "type": "spots",
    "dictionary_dirs": ["/data/dictionaries/bsldict/signrep/"],
    "top_k": 5,
    "fps": 25.0,
    "workspace": "my_workspace"
  }'

# Poll job
curl http://localhost:8000/api/processing/status/{job_id}

# Download result
curl -O http://localhost:8000/api/processing/output/{job_id}/video1_spotted.eaf
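For scripted batch runs, the poll step can be wrapped in a small standard-library helper. This is a sketch only: the status response body is not documented above, so the "status" field and its "pending"/"queued"/"running" values are assumptions to check against the interactive API docs at /docs. wait_for_job and http_get_json are hypothetical names.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8000"


def http_get_json(url: str) -> dict:
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def wait_for_job(job_id: str, get_json=http_get_json,
                 poll_seconds: float = 5.0, timeout: float = 3600.0) -> dict:
    """Poll /api/processing/status/{job_id} until the job leaves a running state.

    Assumes the response carries a 'status' field; adjust to the real schema.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = get_json(f"{BASE_URL}/api/processing/status/{job_id}")
        if status.get("status") not in {"pending", "queued", "running"}:
            return status
        if time.monotonic() > deadline:
            raise TimeoutError(f"job {job_id} still running after {timeout}s")
        time.sleep(poll_seconds)
```

Passing get_json as a parameter lets the loop be exercised without a running server.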

Output files

Job type   Output                                     Description
segments   {stem}_segments.eaf                        Sign boundaries (SIGN labels)
spots      {stem}_segments.eaf + {stem}_spotted.eaf   Boundaries + ranked gloss labels

When a workspace is specified, output EAF files are automatically ingested into the corpus database for search and analysis.


Feature Extraction

SLTK computes several feature representations from H5 pose data (and, for SignRep embeddings, from raw video). Some are used by the segmenter, and all are available for your own research.

WiLoR segmenter features (192-dim)

Used by the Transformer segmenter (Stage 2). Converts MANO rotation matrices to axis-angle and concatenates the left and right hands:

from sltk.segmentation.h5_loader import h5_to_features

features = h5_to_features("video_wilor.h5")
# shape: (num_frames, 192)
# = 2 hands × 96 dims (16 joints × 6 values; 16 = 15 hand_pose + 1 global_orient)

Angle features (104-dim)

Body joint angles and hand Euler angles from MANO rotation matrices:

from sltk.processing.features import compute_angle_features

angles = compute_angle_features(body_poses, left_hand_poses, right_hand_poses)
# shape: (num_frames, 104)
# = 22 body angles + 41 left hand + 41 right hand

HaMeR features (288-dim)

Flattened MANO rotation matrices:

from sltk.processing.features import load_features_from_h5

angles, hamer = load_features_from_h5("video_mediapipe.h5", "video_wilor.h5")
# angles: (T, 104)
# hamer:  (T, 288) = 2 × (135 hand_pose + 9 global_orient)
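The 288 dimensions are the rotation matrices flattened per hand: 15 × 9 = 135 hand_pose entries plus 9 global_orient entries, times two hands. A NumPy sketch of that arithmetic, assuming per-frame arrays already aligned to video frames (how sltk fills frames with a missing hand is not stated) and left-before-right ordering; flatten_mano and hamer_features are hypothetical.

```python
import numpy as np


def flatten_mano(hand_pose: np.ndarray, global_orient: np.ndarray) -> np.ndarray:
    """(T, 15, 3, 3) + (T, 1, 3, 3) rotation matrices -> (T, 144) for one hand."""
    T = hand_pose.shape[0]
    return np.concatenate(
        [hand_pose.reshape(T, 15 * 9), global_orient.reshape(T, 9)], axis=1
    )


def hamer_features(left_pose, left_orient, right_pose, right_orient) -> np.ndarray:
    """Concatenate both hands: (T, 2 * (135 + 9)) = (T, 288), left hand first."""
    return np.concatenate(
        [flatten_mano(left_pose, left_orient), flatten_mano(right_pose, right_orient)],
        axis=1,
    )
```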

SignRep embeddings (768-dim)

Dense visual features from the SignRep ViT model:

from sltk.embedding.pipeline import SignRepPipeline

pipeline = SignRepPipeline()
continuous = pipeline.extract_continuous("video.mp4", stride=4)
# continuous.features: (num_windows, 768), L2-normalized
# 16-frame windows at stride 4
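With 16-frame windows at stride 4, a video of T frames yields (T - 16) // 4 + 1 full windows, e.g. 747 for a 3000-frame video. This assumes the final partial window is dropped rather than padded; sltk may handle the tail differently. window_starts is a hypothetical helper.

```python
def window_starts(num_frames: int, window: int = 16, stride: int = 4) -> list:
    """Start frames of full-length sliding windows (no padding assumed)."""
    if num_frames < window:
        return []
    return list(range(0, num_frames - window + 1, stride))
```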

Running the Web Interface

SLTK includes a React frontend for browsing workspaces, running processing jobs, and exploring corpus data.

Development mode

# Start FastAPI backend (port 8000) + Vite dev server (port 5173)
bash scripts/run_dev.sh

This launches both servers. The frontend is available at http://localhost:5173 and proxies API requests to the backend. Interactive API docs are at http://localhost:8000/docs.

Production mode

# Build the frontend
cd frontend && npm ci && npm run build && cd ..

# Serve everything from FastAPI
sltk serve --host 0.0.0.0 --port 8000

The built frontend is served as static files from FastAPI at http://localhost:8000.

Backend only

uvicorn sltk.api.main:app --host 0.0.0.0 --port 8000

Frontend pages

Route         Page         Purpose
/             Workspaces   Create/switch workspaces, scan directories
/process      Process      Submit segmentation and spotting jobs
/explore      Explore      Search glosses, view video clips, corpus statistics
/viewer       Viewer       Video playback with annotation overlay
/analysis/*   Analysis     Vocabulary, concordance, n-grams, collocations, durations

NMS Detection (Non-Manual Signals)

Detect blinks, head nods, shakes, tilts, and other non-manual signals from TEASER/FLAME face-tracking H5 files.

Python

from sltk.nms.runner import detect_nms, export_results

# Detect all NMS events
blinks, nms_events, quality = detect_nms(
    "video_teaser.h5",
    detectors={"all"},  # or {"blink", "nod", "shake", "tilt", "mouth", "eyebrow"}
)

# Export to ELAN, JSON and CSV
export_results(blinks, nms_events, "video_teaser.h5",
    output_dir="output/",
    formats=["elan", "json", "csv"],
    participant_id="P01",
)

Available detectors

Detector   Tier         Signal
blink      BLINK        Eye closures from eyelid parameters
nod        HEAD-NOD     Vertical head oscillation (pitch)
shake      HEAD-SHAKE   Horizontal head oscillation (yaw)
tilt       HEAD-TILT    Side-to-side head tilt (roll)
mouth      MOUTH        Lip/mouth movement from FLAME expression
eyebrow    EYEBROW      Eyebrow raise/furrow from FLAME expression
gaze       EYE-GAZE     Gaze direction (requires NLF/SMPL file)
squint     EYE-SQUINT   Partial eye closure

API

curl -X POST http://localhost:8000/api/nms/detect \
  -H "Content-Type: application/json" \
  -d '{
    "h5_path": "/data/video_teaser.h5",
    "detectors": ["all"],
    "format": ["elan"],
    "output_dir": "/data/output/"
  }'

CLI Reference

sltk convert input.npy output.h5 --from mediapipe --to wilor --fps 25
sltk evaluate predictions.txt references.txt --task translation
sltk to-elan segments.json --video source.mp4 --output annotations.eaf
sltk from-elan annotations.eaf --output segments.json --tier Gloss
sltk info video_wilor.h5
sltk serve --host 0.0.0.0 --port 8000 --reload
sltk formats

Data Types

from sltk.data import PoseSequence, Segment, SegmentList
from sltk.io import read_eaf, write_eaf

# Load poses
poses = PoseSequence.load("video_wilor.h5", format="wilor", fps=25)
poses.data     # (num_frames, num_keypoints, 3)
poses.fps      # 25.0
poses.format   # "wilor"

# Load ELAN annotations
segments = read_eaf("annotations.eaf", tiers=["Gloss"])

# Create and export segments
new_segments = SegmentList([
    Segment(start=0.0, end=1.5, label="HELLO", tier="Gloss"),
    Segment(start=1.5, end=3.0, label="WORLD", tier="Gloss"),
])
write_eaf(new_segments, "output.eaf", video_path="source.mp4")
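To inspect an EAF file without sltk, a time-aligned tier can be read with the standard library alone. ELAN stores times in TIME_SLOT elements (milliseconds) that ALIGNABLE_ANNOTATION elements reference. read_tier is a hypothetical helper, not sltk.io.read_eaf; it skips annotations anchored to unaligned slots and ignores reference (symbolic) annotations.

```python
import xml.etree.ElementTree as ET


def read_tier(eaf_path, tier_id):
    """Return [(start_sec, end_sec, label), ...] for one time-aligned EAF tier."""
    root = ET.parse(eaf_path).getroot()
    # TIME_SLOT values are in milliseconds; unaligned slots have no TIME_VALUE
    slots = {
        ts.get("TIME_SLOT_ID"): int(ts.get("TIME_VALUE"))
        for ts in root.iter("TIME_SLOT")
        if ts.get("TIME_VALUE") is not None
    }
    rows = []
    for tier in root.iter("TIER"):
        if tier.get("TIER_ID") != tier_id:
            continue
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            ref1, ref2 = ann.get("TIME_SLOT_REF1"), ann.get("TIME_SLOT_REF2")
            if ref1 not in slots or ref2 not in slots:
                continue  # skip annotations anchored to unaligned slots
            label = ann.findtext("ANNOTATION_VALUE", default="")
            rows.append((slots[ref1] / 1000, slots[ref2] / 1000, label))
    return rows
```

For example, read_tier("output.eaf", "Gloss") on the file written above should list the HELLO and WORLD segments, assuming write_eaf stores them in a tier named Gloss.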

Supported pose formats

Format       Keypoints                          Description
MediaPipe    33 body + 21x2 hands + 468 face    Holistic pose estimation
WiLoR        21 per hand                        MANO hand model with rotation matrices
NLF/SMPL-X   55 joints                          Full body with axis-angle rotations
All formats are stored as HDF5 (.h5) files.

Configuration

Variable                    Description                    Default
SLTK_CORS_ORIGINS           Allowed CORS origins           http://localhost:5173,http://localhost:3000
SLTK_ALLOWED_PATHS          Filesystem whitelist for API   /vol/research,/home
SLTK_WILOR_CHECKPOINT       WiLoR model checkpoint         auto-resolved
SLTK_WILOR_DETECTOR         WiLoR hand detector            auto-resolved
SLTK_SIGNREP_CHECKPOINT     SignRep model checkpoint       auto-resolved
SLTK_SEGMENTOR_CHECKPOINT   Segmenter checkpoint           auto-resolved
SLTK_NLF_MODEL_PATH         NLF model path
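The comma-separated default of SLTK_ALLOWED_PATHS suggests how a filesystem whitelist is expressed. The check below is a hedged sketch of how such a whitelist can be enforced safely, not the check sltk.api actually performs; is_path_allowed is hypothetical.

```python
import os
from pathlib import Path
from typing import Optional


def is_path_allowed(path: str, allowed: Optional[str] = None) -> bool:
    """Return True if `path` lies under one of the comma-separated allowed roots.

    Paths are resolved first, so ".." and symlinks cannot escape a root, and
    matching is per path component, so "/home" does not allow "/homework".
    """
    if allowed is None:
        allowed = os.environ.get("SLTK_ALLOWED_PATHS", "/vol/research,/home")
    target = Path(path).resolve()
    for root in allowed.split(","):
        root = root.strip()
        if root and target.is_relative_to(Path(root).resolve()):
            return True
    return False
```

Path.is_relative_to is available from Python 3.9, within the toolkit's 3.10+ requirement.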

Testing

pytest                       # Full suite (1500+ tests)
pytest -m "not slow"         # Skip slow tests
pytest -m api                # API tests only
pytest --cov=sltk            # With coverage report

License

CC-BY-NC-4.0

Download files

Source Distribution

signlangtk-0.1.3.tar.gz (1.2 MB)

Built Distribution

signlangtk-0.1.3-py3-none-any.whl (1.2 MB)

File details

Details for the file signlangtk-0.1.3.tar.gz.

File metadata

  • Download URL: signlangtk-0.1.3.tar.gz
  • Size: 1.2 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for signlangtk-0.1.3.tar.gz
Algorithm Hash digest
SHA256 d57ae280fe039902580f48c6627161b1887b3e7ec89af54dba358602ebaf2a6c
MD5 c7993c9d4f7c82997a62b81f4c152273
BLAKE2b-256 03b0d5adf6ef4facab56bb80f450d003bf03107f09f816d725a4c544da3a7d7d

Provenance

The following attestation bundles were made for signlangtk-0.1.3.tar.gz:

Publisher: publish.yml on ed-fish/Sign-Language-Toolkit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file signlangtk-0.1.3-py3-none-any.whl.

File metadata

  • Download URL: signlangtk-0.1.3-py3-none-any.whl
  • Size: 1.2 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for signlangtk-0.1.3-py3-none-any.whl
Algorithm Hash digest
SHA256 f1628a19b1b0271480690b7bcecffa92bafe80c6e638503caf64ea63c5505c53
MD5 49f2c007082dc89da56206cfc6e3cae1
BLAKE2b-256 3937b8f15fbce16043dfb1281c331edde68caa350cedf3cd1a0c6a175ac55438

Provenance

The following attestation bundles were made for signlangtk-0.1.3-py3-none-any.whl:

Publisher: publish.yml on ed-fish/Sign-Language-Toolkit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.