
Sign Language Translator ⠎⠇⠞


Build Custom Translators and Translate between Sign Language & Text with AI.


Support Us ❤️ PayPal

  1. Overview
    1. Solution
    2. Major Components
    3. Goals
  2. Installation 🛠️
  3. Usage
    1. Web 🌐
    2. Python 🐍
    3. Command Line >_
  4. Languages
  5. Models
  6. How to Build a Translator for your Sign Language
  7. Module Hierarchy
  8. How to Contribute
  9. Citation, License & Research Papers
  10. Credits and Gratitude

Overview

Sign language consists of gestures and expressions used mainly by the hearing-impaired to communicate. This project is an effort to bridge the communication gap between the hearing and the hearing-impaired communities using Artificial Intelligence.

This Python library provides a user-friendly translation API and a framework for building sign language translators that can easily adapt to any regional sign language.
A big hurdle is the lack of datasets (global & regional) and frameworks that deep learning engineers and software developers can use to build useful products for the target community. This project aims to empower sign language translation by providing robust components, tools, datasets and models for both sign language to text and text to sign language conversion. It aims to facilitate the creation of sign language translators for any region, while paving the way towards sign language standardization.
Unlike most other projects, this Python library can translate full sentences and not just the alphabet.

Solution

This package comes with an extensible rule-based text-to-sign translation system that can be used to generate training data for Deep Learning models for both sign to text & text to sign translation.

[!Tip] To create a rule-based translation system for your regional language, you can inherit the TextLanguage and SignLanguage classes and pass them as arguments to the ConcatenativeSynthesis class. To write sample texts of supported words, you can use our language models. Then, you can use that system to fine-tune our deep learning models.

See the documentation and our datasets for details.
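
For instance, here is a minimal sketch of synthetic-dataset generation with the bundled Urdu / Pakistan Sign Language assets. The hard-coded sentence stands in for text produced by the language models, and the output filename is only illustrative; see the Usage section below for the full API.

import sign_language_translator as slt

# rule-based text-to-sign translator (details in the Usage section below)
model = slt.models.ConcatenativeSynthesis(
    text_language="urdu", sign_language="pk-sl", sign_format="video"
)
model.sign_format = slt.SignFormatCodes.LANDMARKS  # output pose landmarks instead of video
model.sign_embedding_model = "mediapipe-world"

sentences = ["یہ بہت اچھا ہے۔"]  # "this-very-good-is"; in practice, generate many such sentences
for i, sentence in enumerate(sentences):
    sign = model.translate(sentence)                     # tokenize, map & concatenate signs
    sign.save(f"synthetic_{i:05d}.csv", overwrite=True)  # landmark sequence as a training target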

Major Components

  1. Sign language to Text
    1. Extract features from sign language videos
      1. See the slt.models.video_embedding sub-package and the $ slt embed command.
      2. Currently Mediapipe 3D landmarks are being used for deep learning.
    2. Transcribe and translate signs into multiple text languages to generalize the model.
    3. To train for the word-for-word gloss writing task, also use a synthetic dataset made by concatenating signs for each word in a text. (See slt.models.ConcatenativeSynthesis)
    4. Fine-tune a neural network, such as one from slt.models.sign_to_text or the encoder of any multilingual seq2seq model, on your dataset.
  2. Text to Sign Language

    There are two approaches to this problem:

    1. Rule Based Concatenation

      1. Label a Sign Language Dictionary with all word tokens that can be mapped to those signs. See our mapping format here.
      2. Parse the input text and play appropriate video clips for each token.
        1. Build a text processor by inheriting slt.languages.TextLanguage (see slt.languages.text sub-package for details)
        2. Map the text grammar & words to sign language by inheriting slt.languages.SignLanguage (see slt.languages.sign sub-package for details)
        3. Use our rule-based model slt.models.ConcatenativeSynthesis for translation.
      3. It is faster but the word sense has to be disambiguated in the input. See the deep learning approach to automatically handle ambiguous words & words not in the dictionary.
    2. Deep learning (seq2seq)

      1. Either generate the sequence of filenames that should be concatenated
        1. you will need a parallel corpus of normal text sentences against sign language gloss (sign sequence written word-for-word)
      2. Or synthesize the signs directly by using a pre-trained multilingual text encoder and
        1. a GAN or diffusion model or decoder to synthesize a sequence of pose vectors (shape = (time, num_landmarks * num_coordinates)); see the pose-array sketch after this list
          1. Move an Avatar with those pose vectors (Easy)
          2. Use motion transfer to generate a video (Medium)
          3. Synthesize a video frame for each vector (Difficult)
        2. a video synthesis model (Very Difficult)
  3. Language Processing
    1. Sign Processing
      • 3D world landmarks extraction with Mediapipe.
      • Pose Visualization with matplotlib.
      • Pose transformations (data augmentation) with scipy.
    2. Text Processing
      • Normalize text input by substituting unknown characters/spellings with supported words.
      • Disambiguate context-dependent words to ensure accurate translation. "spring" -> ["spring(water-spring)", "spring(metal-coil)"]
      • Tokenize text (word & sentence level).
      • Classify tokens and mark them with Tags.
  4. Datasets

    For our datasets & conventions, see the sign-language-datasets repo and its releases. See this documentation for more on building a dataset of Sign Language videos (or motion capture gloves' output features).

    Your data should include:

    1. A word level Dictionary (Videos of individual signs & corresponding Text tokens (words & phrases))
    2. Replications of the dictionary. (Set up multiple synchronized cameras and record random people performing the dictionary videos. (notebook))
    3. Parallel sentences
      1. Normal text language sentences against sign language videos. (Use our Language Models to generate sentences composed of dictionary words.)
      2. Normal text language sentences against the text gloss of the corresponding sign language sentence.
      3. Sign language sentences against their text gloss
      4. Sign language sentences against translations in multiple text languages
    4. Grammatical rules of the sign language
      1. Word order (e.g. SUBJECT OBJECT VERB TIME)
      2. Meaningless words (e.g. "the", "am", "are")
      3. Ambiguous words (e.g. spring(coil) & spring(water-fountain))

    Try to incorporate:

    1. Multiple camera angles
    2. Diverse performers to capture all accents of the signs
    3. Uniqueness in labeling of word tokens
    4. Variations in signs for the same concept

    Try to capture variations in signs in a scalable and diversity-accommodating way, and help advance sign language standardization efforts.
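
To make the pose-vector representation mentioned in the text-to-sign approach above concrete, here is a small sketch. The 75 landmarks x 5 values layout matches the "mediapipe-world" embedding used later in this README (treat the exact numbers as an assumption for other models), and the data is random, purely to illustrate the shapes.

import numpy as np
import sign_language_translator as slt

# flat layout that pose-generation models would emit: (time, num_landmarks * num_coordinates)
flat = np.random.rand(30, 75 * 5)

# unflatten and wrap for visualization / further processing
landmarks = slt.Landmarks(flat.reshape((-1, 75, 5)), connections="mediapipe-world")
landmarks.show()  # renders a (here: meaningless, random) 3D pose animation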

Goals

  1. Enable integration of sign language into existing applications.
  2. Assist the construction of custom solutions for resource-poor sign languages.
  3. Improve education quality for the deaf and elevate literacy rates.
  4. Promote communication inclusivity of the hearing impaired.
  5. Establish a framework for sign language standardization.

How to install the package

pip install sign-language-translator

The package ships with some optional dependencies as well (e.g. deep_translator for synonym finding and mediapipe for a pretrained pose extraction model). Install them by appending [all], [full], [mediapipe] or [synonyms] to the project name in the command (e.g. pip install "sign-language-translator[full]").

Editable mode (git clone):

git clone https://github.com/sign-language-translator/sign-language-translator.git
cd sign-language-translator
pip install -e ".[all]"

pip install -e git+https://github.com/sign-language-translator/sign-language-translator.git#egg=sign_language_translator

Usage

Head over to slt.readthedocs.io to see the detailed usage in Python, CLI and gradio GUI. See the test cases or the notebooks repo to see the internal code in action.

Web GUI

Individual models are deployed on HuggingFace Spaces.


Python

import sign_language_translator as slt

# The core model of the project (rule-based text-to-sign translator)
# which enables us to generate synthetic training datasets
model = slt.models.ConcatenativeSynthesis(
    text_language="urdu", sign_language="pk-sl", sign_format="video"
)

text = "یہ بہت اچھا ہے۔" # "this-very-good-is"
sign = model.translate(text) # tokenize, map, download & concatenate
sign.show()

model.sign_format = slt.SignFormatCodes.LANDMARKS
model.sign_embedding_model = "mediapipe-world"

# ==== English ==== #
model.text_language = slt.languages.text.English()
sign_2 = model.translate("This is an apple.")
sign_2.save("this-is-an-apple.csv", overwrite=True)

# ==== Hindi ==== #
model.text_language = slt.TextLanguageCodes.HINDI
sign_3 = model.translate("कैसे हैं आप?") # "how-are-you"
sign_3.save_animation("how-are-you.gif", overwrite=True)
(Figure: 3D landmark plots of the translations — "یہ بہت اچھا ہے۔" (this-very-good-is) and "कैसे हैं आप?" (how-are-you))

import sign_language_translator as slt

# sign = slt.Video("path/to/video.mp4")
sign = slt.Video.load_asset("pk-hfad-1_aap-ka-nam-kya(what)-hy")  # your name what is? (auto-downloaded)
sign.show_frames_grid()

# Extract Pose Vector for feature reduction
embedding_model = slt.models.MediaPipeLandmarksModel()      # pip install "sign_language_translator[mediapipe]"  # (or [all])
embedding = embedding_model.embed(sign.iter_frames())

slt.Landmarks(embedding.reshape((-1, 75, 5)), connections="mediapipe-world").show()

# # Load sign-to-text model (pytorch) (COMING SOON!)
# translation_model = slt.get_model(slt.ModelCodes.Gesture)
# text = translation_model.translate(embedding)
# print(text)
# custom translator (https://slt.readthedocs.io/en/latest/#building-custom-translators)
help(slt.languages.SignLanguage)
help(slt.languages.text.Urdu)
help(slt.models.ConcatenativeSynthesis)

Command Line

$ slt

Usage: slt [OPTIONS] COMMAND [ARGS]...
   Sign Language Translator (SLT) command line interface.
   Documentation: https://sign-language-translator.readthedocs.io
Options:
  --version  Show the version and exit.
  --help     Show this message and exit.
Commands:
  assets     Assets manager to download & display Datasets & Models.
  complete   Complete a sequence using Language Models.
  embed      Embed Videos Using Selected Model.
  translate  Translate text into sign language or vice versa.

Generate training examples: write a sentence with a language model and synthesize a sign language video from it with a single command:

slt translate --model-code rule-based --text-lang urdu --sign-lang pk-sl --sign-format video \
"$(slt complete '<' --model-code urdu-mixed-ngram --join '')"

Languages

Text Languages

Available Functions:

  • Text Normalization
  • Tokenization (word, phrase & sentence)
  • Token Classification (Tagging)
  • Word Sense Disambiguation

| Name | Vocabulary | Ambiguous tokens | Signs |
|------|------------|------------------|-------|
| English | 1591 words+phrases | 167 | 776 |
| Urdu | 2080 words+phrases | 227 | 776 |
| Hindi | 137 words+phrases | 5 | 84 |
Sign Languages

Available Functions:

  • Word & phrase mapping to signs
  • Sentence restructuring according to grammar
  • Sentence simplification (drop stopwords)

| Name | Vocabulary | Dataset | Parallel Corpus |
|------|------------|---------|-----------------|
| Pakistan Sign Language | 776 | 23 hours | details |

Models

Translation: Text to Sign Language

| Name | Architecture | Description | Input | Output | Web Demo |
|------|--------------|-------------|-------|--------|----------|
| Concatenative Synthesis | Rules + Hash Tables | The core rule-based translator, mainly used to synthesize the translation dataset. Initialize it using TextLanguage, SignLanguage & SignFormat objects. | string | slt.Video or slt.Landmarks | Open in Spaces |

Sign Embedding/Feature extraction:

| Name | Architecture | Description | Input format | Output format |
|------|--------------|-------------|--------------|---------------|
| MediaPipe Landmarks (Pose + Hands) | CNN-based pipelines (see: Pose, Hands) | Encodes videos into pose vectors (3D world or 2D image) depicting the movements of the performer. | List of numpy images (n_frames, height, width, channels) | torch.Tensor (n_frames, n_landmarks * 5) |

Data generation: Language Models

Available trained models:

| Name | Architecture | Description | Input format | Output format |
|------|--------------|-------------|--------------|---------------|
| N-Gram Language Model | Hash Tables | Predicts the next token based on learned statistics about the previous N tokens. | List of tokens | (token, probability) |
| Transformer Language Model | Decoder-only Transformer (GPT) | Predicts the next token using query-key-value attention, linear transformations and soft probabilities. | torch.Tensor (batch, token_ids) or list of tokens | torch.Tensor (batch, token_ids, vocab_size) or (token, probability) |
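
For intuition, here is a tiny, self-contained sketch of the hash-table idea behind an n-gram next-token predictor. It is not the library's own implementation; the class and method names are illustrative only.

from collections import Counter, defaultdict

class TinyNgramModel:
    """Counts which token follows each (n-1)-token context; predicts the most frequent one."""

    def __init__(self, n: int = 2):
        self.n = n
        self.counts = defaultdict(Counter)  # context tuple -> Counter of next tokens

    def fit(self, corpus: list[list[str]]) -> None:
        for tokens in corpus:
            for i in range(len(tokens) - self.n + 1):
                context = tuple(tokens[i : i + self.n - 1])
                self.counts[context][tokens[i + self.n - 1]] += 1

    def next(self, tokens: list[str]) -> tuple[str, float]:
        counter = self.counts[tuple(tokens[-(self.n - 1):])]
        token, count = counter.most_common(1)[0]
        return token, count / sum(counter.values())

model = TinyNgramModel(n=2)
model.fit([["this", "is", "good"], ["this", "is", "fine"], ["this", "is", "good"]])
print(model.next(["this", "is"]))  # ('good', 0.666...)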

Text Embedding:

Available trained models:

| Name | Architecture | Description | Input format | Output format |
|------|--------------|-------------|--------------|---------------|
| Vector Lookup | Hash Table | Finds the token index and returns the corresponding vector. Tokenizes sentences and computes the average vector of known tokens. | string | torch.Tensor (n_dim,) |
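
The vector-lookup idea above (tokenize, look up known tokens, average their vectors) can be sketched in a few lines; the toy vocabulary and function name below are purely illustrative, not the library's API.

import numpy as np

vectors = {  # hypothetical token -> embedding table
    "apple": np.array([1.0, 0.0, 0.0]),
    "good": np.array([0.0, 1.0, 0.0]),
}

def embed_sentence(sentence: str) -> np.ndarray:
    tokens = sentence.lower().split()                     # naive whitespace tokenization
    known = [vectors[t] for t in tokens if t in vectors]  # drop out-of-vocabulary tokens
    return np.mean(known, axis=0) if known else np.zeros(3)

print(embed_sentence("An apple is good"))  # [0.5 0.5 0. ]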

How to Build a Translator for Sign Language

To create your own sign language translator, you'll need these essential components:

  1. Data Collection
    1. Gather a collection of dictionary videos (word level) featuring individuals performing sign language gestures. These can be obtained from schools & organizations for the deaf. You should record multiple people performing the same sign to capture various accents of the sign. Set up multiple cameras in different locations recording in parallel to further augment the data.
    2. Prepare a JSON file that maps dictionary video file names to corresponding text language words & phrases that are synonymous with the gestures (a hypothetical example is sketched after this list).
    3. Prepare a synthetic parallel corpus containing text language sentences and the corresponding sequences of sign language video filenames. You can use language models to generate these sentences & sequences.
    4. Prepare a dataset of sign language sentence videos that are labeled with translations & glosses in multiple text languages.
  2. Language Processing
    1. Implement a subclass of slt.languages.TextLanguage:
      • Tokenize your text language and assign appropriate tags to the tokens for streamlined processing.
    2. Create a subclass of slt.languages.SignLanguage:
      • Map text tokens to video filenames using the provided JSON data.
      • Rearrange the sequence of video filenames to align with the grammar and structure of sign language.
  3. Rule-Based Translation
    1. Pass instances of your classes from the previous step to the slt.models.ConcatenativeSynthesis class to obtain a rule-based translator object.
    2. Construct sentences in your text language and use the rule-based translator to generate sign language translations. (You can use our language models to generate such texts.)
  4. Deep Learning Model Fine-Tuning
    1. Utilize the (synthetic & real) sign language videos and corresponding text sentences from the previous step.
    2. Apply our training pipeline to fine-tune a chosen model for improved accuracy and translation quality.
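
As a rough illustration of the word-mapping file from step 1 (the real schema is defined in the sign-language-datasets repo and differs in detail; the filenames and structure below are hypothetical):

import json

# hypothetical mapping: dictionary video filename -> synonymous words/phrases per text language
mapping = {
    "pk-hfad-1_apple.mp4": {"en": ["apple"], "ur": ["سیب"]},
    "pk-hfad-1_good.mp4": {"en": ["good", "nice"], "ur": ["اچھا"]},
}

with open("word_to_sign_mapping.json", "w", encoding="utf-8") as f:
    json.dump(mapping, f, ensure_ascii=False, indent=2)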

Remember to contribute back to the community:

  • Share your data, code, and models by creating a pull request, allowing others to benefit from your efforts.
  • Create your own sign language translator (e.g. as your university thesis) and contribute to a more inclusive and accessible world.

See the code at the Build Custom Translator section on ReadTheDocs or in this notebook (Open in Colab).
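
The overall wiring then looks roughly like the skeleton below; the subclass bodies are placeholders (inspect the base classes, e.g. help(slt.languages.TextLanguage), for the actual abstract methods that must be implemented before the classes can be instantiated):

import sign_language_translator as slt

class MyTextLanguage(slt.languages.TextLanguage):
    ...  # normalization, tokenization & tagging for your text language

class MySignLanguage(slt.languages.SignLanguage):
    ...  # token-to-sign mapping & sentence restructuring rules for your region

# plug the language objects into the rule-based translator
model = slt.models.ConcatenativeSynthesis(
    text_language=MyTextLanguage(),
    sign_language=MySignLanguage(),
    sign_format="video",
)
sign = model.translate("a sentence made of words from your dictionary")
sign.show()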

Module Hierarchy

sign-language-translator
├── README.md
├── pyproject.toml
├── requirements.txt
├── docs
│   └── *
├── tests
│   └── *
│
└── sign_language_translator
    ├── cli.py `> slt` command line interface
    ├── assets (auto-downloaded)
    │   └── *
    │
    ├── config
    │   ├── assets.py download, extract and remove models & datasets
    │   ├── colors.py named RGB tuples for visualization
    │   ├── enums.py string short codes to identify models & classes
    │   ├── settings.py global variables in repository design-pattern
    │   ├── urls.json
    │   └── utils.py
    │
    ├── languages
    │   ├── utils.py
    │   ├── vocab.py reads word mapping datasets
    │   ├── sign
    │   │   ├── mapping_rules.py strategy design-pattern for word to sign mapping
    │   │   ├── pakistan_sign_language.py
    │   │   └── sign_language.py Base class for text to sign mapping and sentence restructuring
    │   │
    │   └── text
    │       ├── english.py
    │       ├── hindi.py
    │       ├── text_language.py Base class for text normalization, tokenization & tagging
    │       └── urdu.py
    │
    ├── models
    │   ├── _utils.py
    │   ├── utils.py
    │   ├── language_models
    │   │   ├── abstract_language_model.py
    │   │   ├── beam_sampling.py
    │   │   ├── mixer.py wrap multiple language models into a single object
    │   │   ├── ngram_language_model.py uses hash-tables & frequency to predict next token
    │   │   └── transformer_language_model
    │   │       ├── layers.py
    │   │       ├── model.py decoder-only transformer with controllable vocabulary
    │   │       └── train.py
    │   │
    │   ├── sign_to_text
    │   ├── text_to_sign
    │   │   ├── concatenative_synthesis.py join sign clip of each word in text using rules
    │   │   └── t2s_model.py Base class
    │   │
    │   ├── text_embedding
    │   │   ├── text_embedding_model.py Base class
    │   │   └── vector_lookup_model.py retrieves word embedding from a vector database
    │   │
    │   └── video_embedding
    │       ├── mediapipe_landmarks_model.py 2D & 3D coordinates of points on body
    │       └── video_embedding_model.py Base class
    │
    ├── text
    │   ├── metrics.py numeric score techniques
    │   ├── preprocess.py
    │   ├── subtitles.py WebVTT
    │   ├── synonyms.py
    │   ├── tagger.py classify tokens to assist in mapping
    │   ├── tokenizer.py break text into words, phrases, sentences etc
    │   └── utils.py
    │
    ├── utils
    │   ├── archive.py zip datasets
    │   ├── arrays.py common interface & operations for numpy.ndarray and torch.Tensor
    │   ├── download.py
    │   ├── parallel.py multi-threading
    │   ├── tree.py print file hierarchy
    │   └── utils.py
    │
    └── vision
        ├── _utils.py
        ├── utils.py
        ├── landmarks
        │   ├── connections.py drawing configurations for different landmarks models
        │   ├── display.py visualize points & lines on 3D plot
        │   └── landmarks.py wrapper for sequence of collection of points on body
        │
        ├── sign
        │   └── sign.py Base class to wrap around sign clips
        │
        └── video
            ├── display.py jupyter notebooks inline video & pop-up in CLI
            ├── transformations.py strategy design-pattern for image augmentation
            ├── video_iterators.py adapter design-pattern for video reading
            └── video.py

How to Contribute

Datasets:

See our datasets & conventions here.

  • Contribute by scraping, compiling, and centralizing video datasets.
  • Help with labeling word mapping datasets.
  • Establish connections with Academies for the Deaf to collaboratively develop standardized sign language grammar and integrate it into the rule-based translators.
New Code:
  • Create dedicated sign language classes catering to various regions.
  • Develop text language processing classes for diverse languages.
  • Experiment with training models using diverse hyper-parameters.
  • Don't forget to integrate string short codes of your classes and models into enums.py, and be sure to update factory functions like get_model() and get_.*_language().
  • Enhance the codebase with comprehensive docstrings, exemplary usage cases, and thorough test cases.
Existing Code:
  • Implement the # ToDo comments in the code, fix # bug / # type: ignore markers, or pick up anything from the roadmap.
  • Optimize the codebase by implementing techniques like parallel processing and batching.
  • Strengthen the project with clear docstrings containing illustrative examples and with robust test coverage.
  • Contribute to the documentation for sign-language-translator ReadTheDocs to empower users with comprehensive insights. Currently it needs a better template for the auto-generated pages.
Product Development:
  • Engage in the development efforts across MLOps & web-frontend domains, depending on your expertise and interests.

Upcoming/Roadmap

LANDMARKS_WRAPPER: v0.8
# 0.8.2: landmark augmentation (zoom, rotate, move, noise, duration, rectify, stabilize, __repr__)
# 0.8.3: trim signs before concatenation, insert transition frames
# 0.8.4: plotly & three.js/mixamo display, pass matplotlib kwargs all the way down

# 0.8.5: subtitles/captions
# 0.8.6: stabilize video batch using landmarks, draw/overlay 2D landmarks on video/image
CLEAN_UP: v0.9
# mock test cases which require internet when internet isn't available / test for dummy languages
# improve language classes architecture (for easy customization via inheritance) | clean-up slt.languages.text.* code
# ? add a generic SignedTextLanguage class which just maps text lang to signs based on mapping.json ?
# add progress bar to slt.models.MediaPipeLandmarksModel

# rename 'country' to 'region' & rename wordless_wordless to wordless.mp4 # insert video type to archives: .*.videos-`(dictionary|sentences)(-replication)?`-mp4.zip
# decide mediapipe-all = world & image concatenated in landmark dim or feature dim?
# expand dictionary video data by scraping everything
# upload the 12 person dictionary replication landmark dataset
DEEP_TRANSLATION: v0.9 - v1.2
# 0.9.1: TransformerLanguageModel - Drop space tokens & bidirectional prediction. infer on specific vocab only .... pretrain on max vocab and mixed data. finetune on balanced data (wiki==news==novels==poetry==reviews) .... then RLHF on coherent generations (Comparison data: generate 100 examples (at high temperature) and cut them at random points and regenerate the rest and label these pairs for coherence[ and novelty].) (use same model/BERT as reward model with regression head.) (ranking loss with margin) (each token is a time step) (min KL Divergence from base - exploration without mode collapse) ... label disambiguation data and freeze all and finetune disambiguated_tokens_embeddings (disambiguated embedding: word ± 0.1*(sense1 - sense2).normalize()) .... generate data on broken compound words and finetune their token_embeddings ... generate sentences of supported words and translate to other languages.
# 0.9.2: sign to text with custom seq2seq transformer
# 0.9.3: pose vector generation from text with custom seq2seq transformer
# 0.9.4: sign to text with fine-tuned whisper
# 0.9.5: pose vector generation with fine-tuned mBERT
# 0.9.6: custom 3DLandmark model (training data = mediapipe's output on activity recognition or any dataset)
# 1.0.0: all models trained on custom landmark model
# 🎉
# 1.0.1: video to text model (connect custom landmark model with sign2text model and finetune)
# 1.1.0: motion transfer
# 1.1.1: custom pose2video: stable diffusion or GAN?
# 1.2.0: speech to sign
# 1.2.1: sign to speech
MISCELLANEOUS

Issues

# bugfix:      inaccurate num_frames in video file metadata
# bugfix:      Expression of type "Literal[False]" cannot be assigned to member "SHOW_DOWNLOAD_PROGRESS" of class "Settings"
# feature:     video transformations (e.g. stabilization with image pose landmarks, watermark text/logo)
# improvement: SignFilename.parse("videos/pk-hfad-1_airplane.mp4").gloss  # airplane

Miscellaneous

# parallel text corpus
# clean demonstration notebooks
# * host video dataset online, descriptive filenames
# dataset info table
# sequence diagram for creating a translator
# GUI with gradio or something

Research Papers

# datasets: clips, text, sentences, disambiguation
# rule based translation: describe entire repo
# deep sign-to-text: pipeline + experiments
# deep text-to-sign: pipeline + experiments

Servers / Product

# ML inference server
# Django backend server
# React Native mobile app


Citation, License & Research Papers

@software{mdsr2023slt,
  author       = {Mudassar Iqbal},
  title        = {Sign Language Translator: Python Library and AI Framework},
  year         = {2023},
  publisher    = {GitHub},
  howpublished = {\url{https://github.com/sign-language-translator/sign-language-translator}},
}

This project is licensed under the Apache 2.0 License. You are permitted to use the library, create modified versions, or incorporate pieces of the code into your own work. Your product or research, whether commercial or non-commercial, must provide appropriate credit to the original author(s) by citing this repository.

Stay tuned for research papers!

Credits and Gratitude

This project started in October 2021 as a BS Computer Science final year project with 3 students and 1 supervisor. After 9 months at university, it became a hobby project for Mudassar, who has continued it ever since (as of 2024-09-23).

Bonus

Count total number of lines of code (Package: 14,034 + Tests: 2,928):

git ls-files | grep '\.py' | xargs wc -l

Just for fun 🙃

Q: What was the deaf student's favorite course?
A: Communication skills
Q: Why was the ML engineer sad?
A: Triplet loss

