Cued Speech Processing Tools - Decode and Generate cued speech videos

Cued Speech Processing Tools

A Python package for decoding and generating cued speech videos. The decoder turns a cued speech video into a subtitled video; the generator overlays hand cues on a speaker video, using text you supply or text transcribed from the audio.

Features

Decoder Features

Real-time Video Processing: Process cued speech videos using MediaPipe Tasks or MediaPipe Holistic for landmark extraction
TFLite Model Support: Native support for MediaPipe .task files (float16, latest models)
Flexible Model Loading: Automatically detects and uses either .task (MediaPipe Tasks API) or .tflite (TFLite Interpreter) files
Neural Network Inference: Use trained CTC models for phoneme recognition
French Language Correction: Apply KenLM language models and homophone correction
Subtitle Generation: Generate subtitled videos with French sentences

Generator Features

Text-to-Cued Speech: Generate cued speech videos from French text input
Whisper Integration: Automatic speech recognition for accurate alignment
MFA Alignment: Montreal Forced Alignment for precise phoneme timing
Hand Gesture Overlay: Realistic hand shape and position rendering
Automatic Synchronization: Perfect alignment between speech and visual cues

Data Management Features

Automatic Data Download: Automatically download required model files and data
GitHub Release Integration: Seamless download from GitHub releases
Smart Caching: Avoid re-downloading existing files
Easy Cleanup: Simple commands to manage downloaded data

General Features

Command Line Interface: Easy-to-use CLI for both decoding and generation
Organized Output Structure: Separate folders for decoder and generator outputs
Extensible Architecture: Modular design for future enhancements
PyPI Package: Installable with pip from PyPI

Installation

Prerequisites

Python 3.11.*
Pixi (to install Montreal Forced Aligner)

Install with Pixi (Recommended)

Use Pixi to install MFA, then install cued_speech via pip inside the Pixi environment.

1) Install Pixi

macOS/Linux:

curl -fsSL https://pixi.sh/install.sh | bash

Windows (PowerShell):

irm https://pixi.sh/install.ps1 | iex

More options: https://pixi.sh/installation/

2) Create a clean Pixi environment and install MFA

mkdir cued-speech-env && cd cued-speech-env
pixi init
pixi add montreal-forced-aligner=3.3.4
pixi run mfa --version

3) Install the cued_speech package (pip inside Pixi)

pixi run python -m pip install cued-speech

4) Prepare French MFA Models (Required for Generation)

The cued speech generator requires French MFA models (acoustic + dictionary). These are now bundled with the data downloaded by the package. Just download the data, then save the models with MFA:

# Download all required data (includes the French MFA models under ./download/)
pixi run cued-speech download-data

# Save the French acoustic model to MFA's model store (zip file)
pixi run mfa models save acoustic download/french_mfa.zip --overwrite

# Save the French dictionary model to MFA's model store (.dict file)
pixi run mfa models save dictionary download/french_mfa.dict --overwrite

Note:

You can run the above inside a Pixi shell (pixi shell) or prefix with pixi run as shown.
After saving, MFA will manage models in its own cache (e.g., ~/.local/share/mfa/models/).

5) Verify installation and see available options

pixi shell
cued-speech

Data Setup

The package requires several model files and data for operation. These are automatically downloaded on first use, but you can also manage them manually.

Manual Data Management

You can manage data files manually using the provided commands:

# Download all required data files (run this inside the Pixi environment)
cued-speech download-data

# List available data files
cued-speech list-data

# Clean up downloaded data files
cued-speech cleanup-data --confirm

Required Data Files

The following files are automatically downloaded to a download/ folder in your current working directory:

Core Decoder Files:

cuedspeech-model.pt - Pre-trained neural network model for phoneme recognition
phonelist.csv - Phoneme vocabulary
lexicon.txt - French lexicon
kenlm_fr.bin - French language model
homophones_dico.jsonl - Homophone dictionary
kenlm_ipa.binary - IPA language model
ipa_to_french.csv - IPA to French mapping

MediaPipe TFLite Models (float16, latest):

face_landmarker.task - Face landmark detection model (478 landmarks, 3.6 MB)
hand_landmarker.task - Hand landmark detection model (21 landmarks per hand, 7.5 MB)
pose_landmarker_full.task - Pose landmark detection model, FULL complexity (33 landmarks, 9.0 MB)

Generator Files:

rotated_images/ - Directory containing hand shape images for generation
french_mfa.dict - MFA dictionary
french_mfa.zip - MFA acoustic model

Test Files:

test_decode.mp4 - Sample video for testing decoder
test_generate.mp4 - Sample video for testing generator

Note: All data files (including TFLite models) are stored in ./download/ relative to where you run the commands, making them easy to find and manage.
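
To illustrate the caching idea, here is a minimal sketch that reports which of the files listed above are absent from a download folder, so that only those would need fetching. The helper name missing_data_files and the REQUIRED_FILES tuple are assumptions made for this example and are not part of the cued_speech package.

```python
from pathlib import Path

# File names copied from the "Required Data Files" list above.
REQUIRED_FILES = (
    "cuedspeech-model.pt",
    "phonelist.csv",
    "lexicon.txt",
    "kenlm_fr.bin",
    "homophones_dico.jsonl",
    "kenlm_ipa.binary",
    "ipa_to_french.csv",
    "face_landmarker.task",
    "hand_landmarker.task",
    "pose_landmarker_full.task",
    "french_mfa.dict",
    "french_mfa.zip",
)


def missing_data_files(download_dir="download", required=REQUIRED_FILES):
    """Return the required file names that are not present in download_dir."""
    root = Path(download_dir)
    return [name for name in required if not (root / name).is_file()]
```

Calling missing_data_files() from the directory where you run the CLI returns an empty list once cued-speech download-data has completed.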

Usage

Command Line Interface

The package provides a comprehensive command-line interface for both decoding and generating cued speech videos:

Note: The models are designed for videos at 30 FPS. For best results, use input videos that are 30 FPS.

Decoding (Cued Speech → Text)

Decode a cued speech video into a subtitled video. The decoder uses MediaPipe Tasks API with the latest float16 models for optimal accuracy.

Core Options:

--video_path PATH (default: download/test_decode.mp4): Input cued-speech video
--right_speaker [True|False] (default: True): Whether the speaker uses the right hand
--output_path PATH (default: output/decoder/decoded_video.mp4): Output subtitled video
--auto_download [True|False] (default: True): Auto-download missing data files

Model Paths:

--model_path PATH (default: download/cuedspeech-model.pt): Pretrained neural network model
--vocab_path PATH (default: download/phonelist.csv): Vocabulary file
--lexicon_path PATH (default: download/lexicon.txt): Lexicon file
--kenlm_fr PATH (default: download/kenlm_fr.bin): KenLM model file
--homophones_path PATH (default: download/homophones_dico.jsonl): Homophones dictionary
--kenlm_ipa PATH (default: download/kenlm_ipa.binary): IPA language model

TFLite Model Paths (MediaPipe Tasks):

--face_tflite PATH (default: download/face_landmarker.task): Face landmark model (.task or .tflite)
--hand_tflite PATH (default: download/hand_landmarker.task): Hand landmark model (.task or .tflite)
--pose_tflite PATH (default: download/pose_landmarker_full.task): Pose landmark model (.task or .tflite)

# Basic usage (uses default paths, automatically downloads data if needed)
cued-speech decode

# With custom video path
cued-speech decode --video_path /path/to/your/video.mp4

# Disable automatic data download
cued-speech decode --auto_download False

# Advanced usage with custom TFLite models
cued-speech decode \
    --video_path /path/to/your/video.mp4 \
    --face_tflite /path/to/face_model.task \
    --hand_tflite /path/to/hand_model.task \
    --pose_tflite /path/to/pose_model.task

# Full custom configuration
cued-speech decode \
    --video_path /path/to/your/video.mp4 \
    --output_path output/decoder/my_decoded_video.mp4 \
    --model_path /path/to/custom_model.pt \
    --vocab_path /path/to/custom_vocab.csv \
    --lexicon_path /path/to/custom_lexicon.txt \
    --kenlm_fr /path/to/custom_kenlm.bin \
    --homophones_path /path/to/custom_homophones.jsonl \
    --kenlm_ipa /path/to/custom_lm.binary \
    --face_tflite /path/to/face_model.task \
    --hand_tflite /path/to/hand_model.task \
    --pose_tflite /path/to/pose_model.task \
    --right_speaker True
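
If you want to drive the decoder CLI from a script, the following sketch assembles the argument list from keyword options and rejects any flag that is not documented above. The function names build_decode_command and run_decode are hypothetical helpers written for this example; only the flag names come from the option list.

```python
import subprocess

# Flag names taken from the decoder options documented above.
DECODE_FLAGS = {
    "video_path", "right_speaker", "output_path", "auto_download",
    "model_path", "vocab_path", "lexicon_path", "kenlm_fr",
    "homophones_path", "kenlm_ipa",
    "face_tflite", "hand_tflite", "pose_tflite",
}


def build_decode_command(**options):
    """Build the argument list for `cued-speech decode` from keyword options."""
    unknown = set(options) - DECODE_FLAGS
    if unknown:
        raise ValueError(f"Unknown decode option(s): {sorted(unknown)}")
    command = ["cued-speech", "decode"]
    for name, value in options.items():
        # Booleans become the literal strings "True" / "False", as in the examples.
        command += [f"--{name}", str(value)]
    return command


def run_decode(**options):
    """Run the decoder CLI; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_decode_command(**options), check=True)
```

Passing a list to subprocess.run avoids shell quoting problems with paths that contain spaces.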

Note on TFLite Models:

The decoder automatically detects file extensions: .task files use MediaPipe Tasks API, .tflite files use TFLite Interpreter
If TFLite models fail to load, the decoder automatically falls back to MediaPipe Holistic
Models are downloaded automatically with cued-speech download-data
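
The detection and fallback behaviour described in this note can be pictured with the sketch below. It is not the package's implementation: the backend labels, pick_backend, and load_landmarker are names invented for illustration, and the loaders are passed in as plain callables so the example stays independent of MediaPipe.

```python
from pathlib import Path

# Hypothetical backend labels keyed by file extension.
BACKEND_BY_SUFFIX = {
    ".task": "mediapipe_tasks",
    ".tflite": "tflite_interpreter",
}


def pick_backend(model_path):
    """Return the backend label for a model file, or None if unrecognised."""
    return BACKEND_BY_SUFFIX.get(Path(model_path).suffix.lower())


def load_landmarker(model_path, loaders, fallback):
    """Try the loader matching the file extension; use fallback() on failure.

    loaders maps a backend label to a callable taking the model path.
    fallback is a zero-argument callable standing in for MediaPipe Holistic.
    """
    backend = pick_backend(model_path)
    if backend is not None:
        try:
            return backend, loaders[backend](model_path)
        except Exception:
            # Any loading error triggers the fallback in this sketch.
            pass
    return "mediapipe_holistic", fallback()
```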

Generation (Video → Cued Speech)

Generate a cued speech video from a video file. Text is transcribed with Whisper unless you supply it with --text; add --skip-whisper to bypass the Whisper download and transcription entirely (this requires --text).

Arguments:

VIDEO_PATH (positional): Path to input video file

Options:

--text TEXT (default: None): Provide text manually (otherwise Whisper extracts it)
--output_path PATH (default: output/generator/generated_cued_speech.mp4): Output video path
--audio_path PATH (default: None): Optional audio file (extracted from video if not provided)
--language [french|...] (default: french): Processing language
--skip-whisper (flag): Skip Whisper download/transcription (requires --text)
--easing [linear|ease_in_out_cubic|ease_out_elastic|ease_in_out_back] (default: ease_in_out_cubic): Gesture easing
--morphing/--no-morphing (default: --morphing): Hand shape morphing
--transparency/--no-transparency (default: --transparency): Transparency effects during transitions
--curving/--no-curving (default: --curving): Curved trajectories
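
For readers unfamiliar with easing, the sketch below shows the conventional formulas for linear and ease_in_out_cubic and how an easing curve can shape a hand movement between two screen positions. These are the standard textbook definitions; whether the package computes them exactly this way is an assumption, and the interpolate helper is hypothetical.

```python
def linear(t):
    """Constant-speed progress."""
    return t


def ease_in_out_cubic(t):
    """Slow start, fast middle, slow end (standard cubic ease-in-out)."""
    if t < 0.5:
        return 4 * t ** 3
    return 1 - (-2 * t + 2) ** 3 / 2


def interpolate(start, end, t, easing=ease_in_out_cubic):
    """Blend two (x, y) points at progress t, clamped to [0, 1]."""
    t = min(max(t, 0.0), 1.0)
    w = easing(t)
    return (
        start[0] + (end[0] - start[0]) * w,
        start[1] + (end[1] - start[1]) * w,
    )
```

With ease_in_out_cubic, a quarter of the way through a transition the hand has covered only 6.25% of the distance, which is what makes the motion start and stop gently.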

# Basic usage (text extracted automatically from video)
cued-speech generate input_video.mp4

# With custom output path
cued-speech generate speaker_video.mp4 --output_path output/generator/my_generated_video.mp4

# With custom audio file
cued-speech generate speaker_video.mp4 --audio_path custom_audio.wav

# With different language
cued-speech generate speaker_video.mp4 --language english

# With manual text (optional)
cued-speech generate speaker_video.mp4 --text "Merci beaucoup pour votre attention"

# Skip Whisper if you have SSL issues
cued-speech generate speaker_video.mp4 --skip-whisper --text "Merci beaucoup pour votre attention"
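
The one constraint among these options is that --skip-whisper needs --text. A small wrapper can check this before the CLI is launched; build_generate_command below is a hypothetical helper for illustration and covers only a few of the options listed above.

```python
def build_generate_command(video_path, text=None, skip_whisper=False, output_path=None):
    """Build the argument list for `cued-speech generate`.

    Raises ValueError if skip_whisper is set without text, mirroring the
    documented requirement.
    """
    if skip_whisper and not text:
        raise ValueError("--skip-whisper requires --text")
    command = ["cued-speech", "generate", str(video_path)]
    if text:
        command += ["--text", text]
    if skip_whisper:
        command.append("--skip-whisper")
    if output_path:
        command += ["--output_path", str(output_path)]
    return command
```

The resulting list can be passed directly to subprocess.run(command, check=True).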

Output Structure

The package organizes outputs in a structured way:

output/
├── decoder/           # Decoded videos with subtitles
│   └── decoded_video.mp4
└── generator/         # Generated cued speech videos
    ├── audio.wav           # Extracted/processed audio
    ├── audio.TextGrid      # MFA alignment results
    ├── rendered_video.mp4  # Video with hand cues (no audio)
    ├── final_rendered_video.mp4  # Final output with audio
    └── mfa_input/          # MFA temporary files

Python API

You can also use the package programmatically:

Decoder API

from cued_speech import decode_video

# Decode a cued speech video
decode_video(
    video_path="input.mp4",
    right_speaker=True,
    model_path="/path/to/model.pt",
    output_path="output/decoder/decoded.mp4",
    vocab_path="/path/to/vocab.csv",
    lexicon_path="/path/to/lexicon.txt",
    kenlm_model_path="/path/to/kenlm.bin",
    homophones_path="/path/to/homophones.jsonl",
    lm_path="/path/to/lm.binary"
)

Generator API

from cued_speech import generate_cue

# Generate a cued speech video (text extracted automatically)
result_path = generate_cue(
    text=None,  # Will be extracted from video using Whisper
    video_path="speaker_video.mp4",
    output_path="output/generator/generated.mp4",
    audio_path=None,  # Will extract from video
    config={
        "language": "french",
        "hand_scale_factor": 0.75,
        "video_codec": "libx264",
        "audio_codec": "aac"
    }
)
print(f"Generated video saved to: {result_path}")

# Or with manual text
result_path = generate_cue(
    text="Bonjour tout le monde",
    video_path="speaker_video.mp4",
    output_path="output/generator/generated.mp4"
)
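
When several calls share most settings, it can help to keep one base configuration and override individual keys per call. The sketch below does this with the four keys shown in the example above; the values are copied from that example and are not confirmed to be the package's own defaults, and make_config is a hypothetical helper.

```python
# Values copied from the generate_cue example above (not verified defaults).
BASE_CONFIG = {
    "language": "french",
    "hand_scale_factor": 0.75,
    "video_codec": "libx264",
    "audio_codec": "aac",
}


def make_config(**overrides):
    """Return a new config dict: BASE_CONFIG with the given keys replaced."""
    return {**BASE_CONFIG, **overrides}
```

The result can be passed as the config argument, for example config=make_config(hand_scale_factor=0.6). BASE_CONFIG itself is never modified.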

Architecture

Core Components

Decoder Components

MediaPipe Integration:
- MediaPipe Tasks API (default): Uses latest float16 models with native .task file support
- MediaPipe Holistic (fallback): Traditional MediaPipe solution
- Automatic model detection and loading based on file extension
Feature Extraction: Processes landmarks into hand shape, position, and lip features
Neural Network: Three-stream fusion encoder with CTC output
Language Model: KenLM-based beam search for French sentence correction
Video Processing: Generates subtitled output with synchronized audio

Generator Components

Whisper Integration: Automatic speech recognition for transcription
MFA Alignment: Montreal Forced Alignment for precise phoneme timing
Cue Mapping: Maps phonemes to hand shapes and positions using cued speech rules
Hand Rendering: Overlays realistic hand gestures onto video frames
Synchronization: Ensures perfect timing between speech and visual cues

Model Architecture

Decoder Architecture

The decoder uses a three-stream fusion encoder:

Hand Shape Stream: Processes hand landmark positions and geometric features
Hand Position Stream: Analyzes hand movement and positioning
Lips Stream: Extracts lip movement and facial features

Generator Architecture

The generator follows a multi-stage pipeline:

Audio Processing: Whisper-based transcription and feature extraction
Phoneme Alignment: MFA-based precise timing alignment
Cue Generation: Rule-based mapping from phonemes to hand configurations
Video Rendering: Real-time hand overlay with facial landmark tracking
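
In cued speech, consonants are conveyed by hand shape and vowels by hand position, so the cue generation stage first has to group phonemes into consonant-vowel units. The sketch below shows one simple way to do that grouping. The VOWELS set is a small illustrative inventory and group_into_cues is a hypothetical helper; neither reflects the package's actual phoneme set or rules.

```python
# Illustrative vowel inventory (IPA); an assumption for this sketch only.
VOWELS = {"a", "e", "i", "o", "u", "y", "ɛ", "ɔ", "ø", "œ", "ə"}


def group_into_cues(phonemes):
    """Group a phoneme list into (consonant, vowel) units.

    A consonant followed by a vowel forms one unit. A consonant with no
    following vowel gets vowel=None; a vowel with no preceding consonant
    gets consonant=None.
    """
    cues = []
    i = 0
    while i < len(phonemes):
        current = phonemes[i]
        if current in VOWELS:
            cues.append((None, current))
            i += 1
        elif i + 1 < len(phonemes) and phonemes[i + 1] in VOWELS:
            cues.append((current, phonemes[i + 1]))
            i += 2
        else:
            cues.append((current, None))
            i += 1
    return cues
```

Each resulting unit would then be looked up in a hand shape table (by consonant) and a hand position table (by vowel), with the MFA timings deciding when each cue is shown.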

Processing Pipeline

Decoding Pipeline

Video Input: Load and process video frames
Landmark Extraction: Use MediaPipe to extract hand and face landmarks
Feature Computation: Calculate geometric and temporal features
Model Inference: Run CTC model to predict phonemes
Language Correction: Apply beam search with language models
Subtitle Generation: Create output video with French subtitles
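
To make the model inference step more concrete: a CTC model emits one label per frame, including a special blank, and the frame labels must be collapsed into a phoneme sequence. The package is described as using beam search with language models; the sketch below shows the simpler greedy collapse instead, purely to illustrate the CTC rule. The blank token name is an assumption.

```python
def ctc_greedy_collapse(frame_labels, blank="<blank>"):
    """Collapse per-frame labels: merge consecutive repeats, then drop blanks."""
    output = []
    previous = None
    for label in frame_labels:
        # Emit a label only when it differs from the previous frame
        # and is not the blank symbol.
        if label != previous and label != blank:
            output.append(label)
        previous = label
    return output
```

A blank between two identical labels keeps them separate, which is how CTC represents a genuinely repeated phoneme.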

Generation Pipeline

Text Input: Process French text for cued speech generation
Audio Extraction: Extract or use provided audio track
Speech Recognition: Use Whisper for accurate transcription
Phoneme Alignment: Apply MFA for precise timing
Cue Mapping: Map phonemes to hand shapes and positions
Video Rendering: Overlay hand cues with perfect synchronization

License

This project is licensed under the MIT License - see the LICENSE file for details.

TFLite Models Information

Model Details

The decoder uses the latest MediaPipe float16 models for optimal accuracy:

Model	Landmarks	Size	Precision	Complexity
Face Landmarker	478 points	3.6 MB	float16	Standard
Hand Landmarker	21 points/hand	7.5 MB	float16	Standard
Pose Landmarker FULL	33 points	9.0 MB	float16	Highest

Model Sources

Models are automatically downloaded from official MediaPipe repositories:

Face: mediapipe-models/face_landmarker
Hand: mediapipe-models/hand_landmarker
Pose: mediapipe-models/pose_landmarker_full

Advantages Over MediaPipe Holistic

Higher Quality: Float16 precision with latest model versions
More Landmarks: Face model provides 478 landmarks (vs 468 in older models)
Better Pose Estimation: FULL complexity model for more accurate body tracking
Mobile-Ready: Same .task files work seamlessly in Flutter mobile apps
Future-Proof: Direct access to latest MediaPipe models as they're updated

Manual Model Management

If you need to download models separately:

# Create the target folder if it does not exist yet
mkdir -p download

# Download individual models
curl -L -o download/face_landmarker.task \
  https://storage.googleapis.com/mediapipe-models/face_landmarker/face_landmarker/float16/latest/face_landmarker.task

curl -L -o download/hand_landmarker.task \
  https://storage.googleapis.com/mediapipe-models/hand_landmarker/hand_landmarker/float16/latest/hand_landmarker.task

curl -L -o download/pose_landmarker_full.task \
  https://storage.googleapis.com/mediapipe-models/pose_landmarker/pose_landmarker_full/float16/latest/pose_landmarker_full.task

Or use the provided script (legacy, for separate downloads):

bash download_tflite_models.sh
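
The same downloads can be scripted in Python using only the standard library. The sketch below rebuilds the three URLs from the pattern used in the curl commands and skips files that already exist. The names MODELS, model_url, and download_models are hypothetical; only the URLs themselves come from this document.

```python
import urllib.request
from pathlib import Path

BASE_URL = "https://storage.googleapis.com/mediapipe-models"

# file name -> (task folder, model folder), matching the curl commands above.
MODELS = {
    "face_landmarker.task": ("face_landmarker", "face_landmarker"),
    "hand_landmarker.task": ("hand_landmarker", "hand_landmarker"),
    "pose_landmarker_full.task": ("pose_landmarker", "pose_landmarker_full"),
}


def model_url(file_name):
    """Return the float16/latest download URL for one model file."""
    task, model = MODELS[file_name]
    return f"{BASE_URL}/{task}/{model}/float16/latest/{file_name}"


def download_models(download_dir="download"):
    """Fetch every model file not already present (needs network access)."""
    target = Path(download_dir)
    target.mkdir(parents=True, exist_ok=True)
    for file_name in MODELS:
        destination = target / file_name
        if not destination.exists():
            urllib.request.urlretrieve(model_url(file_name), str(destination))
```

Calling download_models() from your working directory places the files in ./download/, where the decoder looks for them by default.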

Acknowledgments

MediaPipe and MediaPipe Tasks API for landmark extraction
Google for providing high-quality TFLite models
PyTorch for deep learning framework
KenLM for language modeling
The cued speech research community

Support

For questions and support:

Contact: boubasow.pro@gmail.com
