
Build Translators and Translate between text languages & sign language videos with AI.


Sign Language Translator ⠎⠇⠞


  1. Overview
    1. Solution
    2. Major Components and Goals
  2. How to install the package
  3. Usage
  4. Languages
  5. Models
  6. How to Build a Translator for your Sign Language
  7. Directory Tree
  8. How to Contribute
  9. Research Papers & Citation
  10. Upcoming/Roadmap
  11. Credits and Gratitude
  12. Bonus
    1. Number of lines of code
    2. :)

Overview

Sign language consists of gestures and expressions used mainly by the hearing-impaired to communicate. This project is an effort to bridge the communication gap between the hearing and the hearing-impaired community using Artificial Intelligence.

The goal is to provide a user-friendly API for novel Sign Language Translation solutions that can easily adapt to any regional sign language. Unlike most other projects, this Python library can translate full sentences and not just the alphabet.

A bigger hurdle is the lack of datasets and frameworks that deep learning engineers and software developers can use to build useful products for the target community. This project aims to empower sign language translation by providing robust components, tools and models for both sign language to text and text to sign language conversion. It seeks to advance the development of sign language translators for various regions while providing a way towards sign language standardization.

Solution

I have built an extensible rule-based text-to-sign translation system that can be used to generate training data for Deep Learning models for both sign-to-text & text-to-sign translation.

To create a rule-based translation system for your regional language, you can inherit the TextLanguage and SignLanguage classes and pass them as arguments to the ConcatenativeSynthesis class. To write sample texts of supported words, you can use our language models. Then, you can use that system to fine-tune our AI models. See the documentation for more.

Major Components and Goals

  1. Sign language to Text
    • Extract pose vectors (2D or 3D) from videos and map them to corresponding text representations of the performed signs.

    • Fine-tune a neural network, such as a state-of-the-art speech-to-text model, with gradual unfreezing starting from the input layers, to convert pose vectors to text.

  2. Text to Sign Language
    • This is a relatively easy task: parse the input text and play the appropriate video clip for each word.
    1. Motion Transfer
      • Concatenate pose vectors in the time dimension and transfer the movements onto any given image of a person. This ensures smooth transitions between video clips.
    2. Sign Feature Synthesis
      • Condition a pose sequence generation model on a pre-trained text encoder (e.g., fine-tune decoder of a multilingual T5) to output pose vectors instead of text tokens. This solves challenges related to unknown synonyms or hard-to-tokenize/process words or phrases.
  3. Language Processing Utilities
    1. Sign Processing
      • 3D world landmarks extraction with Mediapipe.
      • Pose Visualization with matplotlib and moviepy.
      • Pose transformations (data augmentation) with scipy.
    2. Text Processing
      • Normalize text input by substituting unknown characters/spellings with supported words.
      • Disambiguate context-dependent words to ensure accurate translation, e.g. "spring" -> ["spring(water-spring)", "spring(metal-coil)"]
      • Tokenize text (word & sentence level).
      • Classify tokens and mark them with Tags.
  4. Data Collection and Creation
    • Capture variations in signs in a scalable, diversity-accommodating way and help advance sign language standardization efforts.

      1. Clip extraction from long videos using timestamps
      2. Multithreaded Web scraping
      3. Language Models to generate sentences composed of supported words
  5. Datasets

    The sign videos are categorized by:

    1. country
    2. source organization
    3. session number
    4. camera angle
    5. person code ((d: deaf | h: hearing)(m: male | f: female)000001)
    6. equivalent text language word
    

    The files are labeled as follows (a filename-parsing sketch follows this list):

    country_organization_sessionNumber_cameraAngle_personCode_word.extension
    

    The text data includes:

    1. word/sentence mappings to videos
    2. spoken language sentences and phrases
    3. spoken language sentences & corresponding sign video label sequences
    4. preprocessing data such as word-to-numbers, misspellings, named-entities, etc.
    

    See the sign-language-datasets repo and its release files for the actual data & details.
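
For illustration, a filename following the convention above can be parsed with a short regular expression. This is only a sketch; the example filename and the organization code in it are made up for demonstration and are not taken from the actual dataset.

import re

# pattern for: country_organization_sessionNumber_cameraAngle_personCode_word.extension
FILENAME_PATTERN = re.compile(
    r"(?P<country>[a-z]+)"
    r"_(?P<organization>[a-z0-9-]+)"
    r"_(?P<session>\d+)"
    r"_(?P<camera>[a-z]+)"
    r"_(?P<person>[dh][mf]\d+)"  # (d: deaf | h: hearing)(m: male | f: female)000001
    r"_(?P<word>.+)"
    r"\.(?P<extension>\w+)"
)

# hypothetical filename, for demonstration only
match = FILENAME_PATTERN.match("pk_some-organization_01_front_dm0001_apple.mp4")
print(match.groupdict() if match else "no match")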

How to install the package

pip install sign-language-translator

Editable mode (install from a local clone):

git clone https://github.com/sign-language-translator/sign-language-translator.git
cd sign-language-translator
pip install -e .

Or install in editable mode directly from GitHub:

pip install -e git+https://github.com/sign-language-translator/sign-language-translator.git#egg=sign_language_translator
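
A quick sanity check after installation (this assumes the package exposes a __version__ attribute; alternatively run slt --version on the command line):

# assumption: __version__ is exported at the package top level
import sign_language_translator as slt
print(slt.__version__)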

Usage

Head over to sign-language-translator.readthedocs.io to see the detailed usage in Python, Command line and GUI.

Browse the test cases or the notebooks repo to see the internal code in action.

Also see How to build a custom sign language translator.

$ slt

Usage: slt [OPTIONS] COMMAND [ARGS]...
   Sign Language Translator (SLT) command line interface.
   Documentation: https://sign-language-translator.readthedocs.io
Options:
  --version  Show the version and exit.
  --help     Show this message and exit.
Commands:
  complete   Complete a sequence using Language Models.
  download   Download resource files with regex.
  embed      Embed Videos Using Selected Model.
  translate  Translate text into sign language or vice versa.
# Documentation: https://sign-language-translator.readthedocs.io
import sign_language_translator as slt
help(slt)

# The core model of the project (rule-based text-to-sign translator)
# which enables us to generate synthetic training datasets
model = slt.models.ConcatenativeSynthesis(
   text_language="urdu", sign_language="psl", sign_format="video"
)
text = "سیب اچھا ہے"
sign = model.translate(text) # tokenize, map, download & concatenate
sign.show(inline_player="html5") # jupyter notebook
sign.save(f"{text}.mp4")

# # Load any model
# # print(list(slt.ModelCodes))
# model = slt.get_model(slt.ModelCodes.Gesture) # sign-to-text (pytorch)
# sign = slt.Video("video.mp4")
# text = model.translate(sign)
# print(text)
# # sign.show()

# # DocStrings
# help(slt.languages.SignLanguage)
# help(slt.languages.text.Urdu)
# help(slt.Video)
# help(slt.models.MediaPipeLandmarksModel)
# help(slt.models.TransformerLanguageModel)

https://github.com/sign-language-translator/sign-language-translator/assets/118578823/b5da28ef-d04d-44c0-9ed8-1343ac004255

Languages

Text Languages

Available Functions:

  • Text Normalization
  • Tokenization (word, phrase & sentence)
  • Token Classification (Tagging)
  • Word Sense Disambiguation

Name | Vocabulary         | Ambiguous tokens | Signs
Urdu | 2090 words+phrases | 227              | 790

Sign Languages

Available Functions:

  • Word & phrase mapping to signs
  • Sentence restructuring according to grammar
  • Sentence simplification (drop stopwords)

Name                   | Vocabulary | Dataset | Parallel Corpus
Pakistan Sign Language | 789        | 3 hours | n transcribed sentences with translations in m text languages

Models

Translation: Text to Sign Language

Name                    | Architecture        | Description | Input | Output
Concatenative Synthesis | Rules + Hash Tables | The core rule-based translator, mainly used to synthesize translation datasets. Initialize it using TextLanguage, SignLanguage & SignFormat objects. | string | slt.Sign

Video: Embedding/Feature extraction

Name                               | Architecture                           | Description | Input format | Output format
MediaPipe Landmarks (Pose + Hands) | CNN-based pipelines (see: Pose, Hands) | Encodes videos into pose vectors (3D world or 2D image) depicting the movements of the performer. | List of numpy images (n_frames, height, width, channels) | torch.Tensor (n_frames, n_landmarks * 5)
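
For intuition, the following is a generic MediaPipe snippet that extracts 3D world pose landmarks frame by frame, which is roughly what the embedding model above does internally. It is not the package's implementation (use slt.models.MediaPipeLandmarksModel for that); the input filename is hypothetical, and the package additionally extracts hand landmarks and may use a different landmark layout.

import cv2
import mediapipe as mp
import numpy as np

frames_landmarks = []
video = cv2.VideoCapture("sign_clip.mp4")  # hypothetical input video
with mp.solutions.pose.Pose(static_image_mode=False) as pose:
    while True:
        ok, frame_bgr = video.read()
        if not ok:
            break
        result = pose.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
        if result.pose_world_landmarks:  # 33 pose landmarks in world coordinates
            frames_landmarks.append(
                [(lm.x, lm.y, lm.z, lm.visibility) for lm in result.pose_world_landmarks.landmark]
            )
video.release()

embedding = np.array(frames_landmarks)  # shape: (n_frames, 33, 4) for pose-only landmarks
print(embedding.shape)
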
Data generation: Language Models

Available Trained models

Name                       | Architecture                   | Description | Input format | Output format
N-Gram Language Model      | Hash Tables                    | Predicts the next token based on learned statistics about the previous N tokens. | List of tokens | (token, probability)
Transformer Language Model | Decoder-only Transformer (GPT) | Predicts the next token using query-key-value attention, linear transformations and soft probabilities. | torch.Tensor (batch, token_ids) or List of tokens | torch.Tensor (batch, token_ids, vocab_size) or (token, probability)
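
To make the N-Gram row above concrete, here is a tiny self-contained bigram example. It only illustrates the idea (count which token follows each context, then normalize the counts); it is not the package's NgramLanguageModel implementation or API.

from collections import Counter, defaultdict

corpus = [["we", "like", "apples"], ["we", "like", "mangoes"], ["you", "like", "apples"]]

# collect bigram statistics: which token follows each single-token context
next_token_counts = defaultdict(Counter)
for sentence in corpus:
    for previous, current in zip(sentence, sentence[1:]):
        next_token_counts[previous][current] += 1

# predict the next token after "like" as (token, probability) pairs
context = "like"
total = sum(next_token_counts[context].values())
predictions = [(token, count / total) for token, count in next_token_counts[context].most_common()]
print(predictions)  # [('apples', 0.666...), ('mangoes', 0.333...)]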

How to Build a Translator for your Sign Language

To create your own sign language translator, you'll need these essential components:

  1. Data Collection
    1. Gather a collection of videos featuring individuals performing sign language gestures.
    2. Prepare a JSON file that maps video file names to corresponding text language words, phrases, or sentences that represent the gestures.
    3. Prepare a parallel corpus containing text language sentences and sequences of sign language video filenames.
  2. Language Processing
    1. Implement a subclass of slt.languages.TextLanguage:
      • Tokenize your text language and assign appropriate tags to the tokens for streamlined processing.
    2. Create a subclass of slt.languages.SignLanguage:
      • Map text tokens to video filenames using the provided JSON data.
      • Rearrange the sequence of video filenames to align with the grammar and structure of sign language.
  3. Rule-Based Translation
    1. Pass instances of your classes from the previous step to the slt.models.ConcatenativeSynthesis class to obtain a rule-based translator object (see the skeleton after this list).
    2. Construct sentences in your text language and use the rule-based translator to generate sign language translations. (You can use our language models to generate such texts.)
  4. Model Fine-Tuning
    1. Utilize the sign language videos and corresponding text sentences from the previous step.
    2. Apply our training pipeline to fine-tune a chosen model for improved accuracy and translation quality.
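
A structural skeleton of steps 2 and 3, showing how the pieces fit together. The class bodies are intentionally left empty: the exact abstract methods to implement are defined by the library (run help(slt.languages.TextLanguage) and help(slt.languages.SignLanguage) to see them), and whether ConcatenativeSynthesis accepts language instances in place of the string codes shown in the Usage section is an assumption based on the description above, so the construction call is left commented out.

import sign_language_translator as slt

class MyTextLanguage(slt.languages.TextLanguage):
    """Step 2.1: tokenization, tagging & sentence detection for your spoken language."""
    # implement the abstract methods documented in the base class here

class MySignLanguage(slt.languages.SignLanguage):
    """Step 2.2: map text tokens to sign video filenames & reorder them per sign grammar."""
    # implement the abstract methods documented in the base class here

# Step 3: plug the classes into the rule-based translator
# (assumption: instances are accepted where the Usage section shows string codes)
# translator = slt.models.ConcatenativeSynthesis(
#     text_language=MyTextLanguage(),
#     sign_language=MySignLanguage(),
#     sign_format="video",
# )
# sign = translator.translate("a sentence in your text language")
# sign.save("translation.mp4")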

Remember to contribute back to the community:

  • Share your data, code, and models by creating a pull request (PR), allowing others to benefit from your efforts.
  • Create your own sign language translator (e.g. as your university thesis) and contribute to a more inclusive and accessible world.

See more at Build Custom Translator section in ReadTheDocs or in this notebook.

Directory Tree

sign-language-translator
├── .readthedocs.yaml
├── MANIFEST.in
├── README.md
├── poetry.lock
├── pyproject.toml
├── requirements.txt
├── docs
│   └── *
├── tests
│   └── *
│
└── sign_language_translator
    ├── cli.py
    ├── assets (auto-downloaded)
    │   └── *
    │
    ├── config
    │   ├── assets.py
    │   ├── enums.py
    │   ├── settings.py
    │   ├── urls.json
    │   └── utils.py
    │
    ├── data_collection
    │   ├── completeness.py
    │   ├── scraping.py
    │   └── synonyms.py
    │
    ├── languages
    │   ├── utils.py
    │   ├── vocab.py
    │   ├── sign
    │   │   ├── mapping_rules.py
    │   │   ├── pakistan_sign_language.py
    │   │   └── sign_language.py
    │   │
    │   └── text
    │       ├── english.py
    │       ├── text_language.py
    │       └── urdu.py
    │
    ├── models
    │   ├── _utils.py
    │   ├── utils.py
    │   ├── language_models
    │   │   ├── abstract_language_model.py
    │   │   ├── beam_sampling.py
    │   │   ├── mixer.py
    │   │   ├── ngram_language_model.py
    │   │   └── transformer_language_model
    │   │       ├── layers.py
    │   │       ├── model.py
    │   │       └── train.py
    │   │
    │   ├── sign_to_text
    │   ├── text_to_sign
    │   │   ├── concatenative_synthesis.py
    │   │   └── t2s_model.py
    │   │
    │   └── video_embedding
    │       ├── mediapipe_landmarks_model.py
    │       └── video_embedding_model.py
    │
    ├── text
    │   ├── metrics.py
    │   ├── preprocess.py
    │   ├── subtitles.py
    │   ├── tagger.py
    │   ├── tokenizer.py
    │   └── utils.py
    │
    ├── utils
    │   ├── arrays.py
    │   ├── download.py
    │   ├── tree.py
    │   └── utils.py
    │
    └── vision
        ├── _utils.py
        ├── utils.py
        ├── landmarks
        ├── sign
        │   └── sign.py
        │
        └── video
            ├── display.py
            ├── transformations.py
            ├── video_iterators.py
            └── video.py

How to Contribute

Datasets:
  • Contribute by scraping, compiling, and centralizing video datasets.
  • Help with labeling word mapping datasets.
  • Establish connections with Academies for the Deaf to collaboratively develop standardized sign language grammar and integrate it into the rule-based translators.
New Code:
  • Create dedicated sign language classes catering to various regions.
  • Develop text language processing classes for diverse languages.
  • Experiment with training models using diverse hyper-parameters.
  • Don't forget to integrate the string short codes of your classes and models into enums.py, and be sure to update functions like get_model() and get_.*_language().
  • Enhance the codebase with comprehensive docstrings, exemplary usage cases, and thorough test cases.
Existing Code:
  • Optimize the codebase by implementing techniques like parallel processing and batching.
  • Strengthen the project's documentation with clear docstrings, illustrative usage scenarios, and robust test coverage.
  • Contribute to the documentation for sign-language-translator ReadTheDocs to empower users with comprehensive insights. Currently it needs a template for auto-generated pages.
Product Development:
  • Engage in the development efforts across MLOps, backend, web, and mobile domains, depending on your expertise and interests.

Research Papers & Citation

Stay Tuned!

Upcoming/Roadmap

CLEAN_ARCHITECTURE_VISION: v0.7
# urls.json, extra-urls.json

# bugfix: inaccurate num_frames in video file metadata
# improvement: video wrapper class uses list of sources instead of linked list of videos
# video transformations

# landmarks wrapper class
# landmark augmentation

# subtitles
# trim signs before concatenation

# stabilize video batch using landmarks
LANGUAGES: v0.8
# implement NLP classes for English & Hindi
# Improve vocab class
# expand reference clip data by scraping everything
MISCELLANEOUS
# clean demonstration notebooks
# host video dataset online, descriptive filenames, zip extraction
# dataset info table
# sequence diagram for creating a translator
# make scraping dependencies optional (beautifulsoup4, deep_translator). remove overly specific scraping functions
# GUI with gradio
DEEP_TRANSLATION: v0.9-v1.x
# parallel text corpus
# sign to text with custom seq2seq transformer
# sign to text with fine-tuned whisper
# pose vector generation with fine-tuned flan-T5
# motion transfer
# pose2video: stable diffusion or GAN?
# speech to text
# text to speech
# LanguageModel: experiment by dropping space tokens & bidirectional prediction
RESEARCH PAPERS
# datasets: clips, text, sentences, disambiguation
# rule based translation: describe entire repo
# deep sign-to-text: pipeline + experiments
# deep text-to-sign: pipeline + experiments
PRODUCT DEVELOPMENT
# ML inference server
# Django backend server
# React Frontend
# React Native mobile app

Credits and Gratitude

This project started in October 2021 as a BS Computer Science final year project with 3 students and 1 supervisor. After 9 months at university, it became a hobby project for Mudassar, who has continued it till at least 2023-11-10.

Immense gratitude towards:
  • Mudassar Iqbal for coding the project so far.

  • Rabbia Arshad for help in initial R&D and web development.

  • Waqas Bin Abbas for assistance in initial video data collection process.

  • Kamran Malik for setting the initial project scope, idea of motion transfer and connecting us with Hamza Foundation.

  • Hamza Foundation (especially Ms Benish, Ms Rashda & Mr Zeeshan) for agreeing to collaborate and providing the reference clips, hearing-impaired performers for data creation, and creating the text2gloss dataset.

  • UrduHack (especially Ikram Ali) for their work on Urdu character normalization.

  • Telha Bilal for help in designing the architecture of some modules.

Bonus

Count total number of lines of code (Package: 9287 + Tests: 1419):

git ls-files | grep '\.py' | xargs wc -l

Just for Fun

Q: What was the deaf student's favorite course?
A: Communication skills
