Machine translation and understanding of classical Tibetan

Project description

CompassionAI Lotsawa - tools for translating and understanding classical Tibetan

This is a collection of end-user tools to help with translation and understanding of long texts in classical Tibetan, especially the Kangyur and Tengyur.

For example, in your terminal you can:

# Bring up an interactive tool for translating individual short Tibetan sections
lotsawa-translate

# Output a translation of the Heart Sutra into English under ./translations
lotsawa-translate mode=batch mode.input_glob=heart_sutra.bo

# After translating to English, re-translate the Heart Sutra into simplified Chinese
lotsawa-retranslate output.target_language_code=zho_Hans

# Bring up an interactive tool for splitting Tibetan sections into words and tagging those words as nouns/verbs/adjectives/etc
lotsawa-words

A sample set of Tibetan documents to experiment on is available at https://compassionai.s3.amazonaws.com/public/translation_test_docs.zip.

Lotsawa is backed by our novel models, the result of a research program into converting existing state-of-the-art translation models for short sentences, such as NLLB (No Language Left Behind), into models better able to handle the ambiguity of long classical Tibetan texts. The models used in Lotsawa use pre-trained, state-of-the-art translation models as a backbone, with neural architectures significantly modified to accommodate long texts. In particular, we are not simply serving (fine-tuned) NLLB; we are serving a model with a new neural architecture that is much better than NLLB at handling Tibetan ambiguity. Lotsawa implements a carefully tuned end-to-end translation pipeline for long texts - the result of many experiments on strategies for preserving contextual semantic information in the low-resource setting of classical Tibetan. Please see https://www.compassion-ai.org/ for an explanation of our research.

We are a tiny team of volunteers on a shoestring budget. The community of people who would benefit from these tools is likewise very small. If we don't work together, these tools will struggle to improve and be useful.

PLEASE don't immediately give up and walk away if you run into a problem. Without at least a tiny bit of your help these tools will never evolve to benefit anyone. Please, for the sake of the Tibetan language and the Dharma, contact us before giving up.

Contact us if you're using these tools, if something isn't working and you need help, if the tools are performing poorly, just to say hi, or for any other reason.

We can be reached at contact@compassion-ai.org or on GitHub issues.

Installation

We assume you're on a Mac. The installation should work on Windows and Linux mutatis mutandis.

IMPORTANT: There is currently an incompatibility between Hydra and Python 3.11. Furthermore, attempting to install with Python 3.11 will require building Hugging Face libraries from source, since prebuilt wheels are not yet available. This requires gcc, cmake and Rust.

We require Python <= 3.10 for Lotsawa until this is fixed upstream. Our provided conda environment file has the correct settings.

Basic instructions

Install with pip:

pip install lotsawa --upgrade

Lotsawa requires Python 3.6 or newer (but no newer than 3.10 - see the note above). This shouldn't be a problem on almost any modern computer. If you are having issues with this on an older Mac, see the Homebrew documentation here: https://docs.brew.sh/Homebrew-and-Python. If you can't make it work or if the Homebrew docs are too much, contact us at contact@compassion-ai.org or open a GitHub issue.

Basic instructions - NVidia GPUs

This section does not apply to Macs. Newer Macs with M1 chips or better should use the embedded GPU by default.

If you have an NVidia GPU and want to use it to massively speed everything up - we strongly recommend doing this if you can - you will need to install CUDA-enabled PyTorch. Begin by installing the NVidia drivers and the CUDA toolkit from NVidia's website.

Then install CUDA-enabled PyTorch. This is very easy. The following line in your terminal should work:

pip3 install torch --extra-index-url https://download.pytorch.org/whl/cu116

If for some reason it doesn't, follow the instructions on https://pytorch.org/get-started/locally/:

  • Set the PyTorch build to Stable.
  • Set package to pip.
  • Set language to Python.
  • Set compute platform to CUDA 11.6 or greater.

It will give you a line of code to run, paste it into your terminal and you should be good. If you're not good, start without CUDA and contact us at contact@compassion-ai.org or open a GitHub issue.
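Once PyTorch is installed, you can quickly confirm whether it can see your GPU. The helper below is our own illustration (the name cuda_status is not part of Lotsawa or PyTorch); it degrades gracefully if PyTorch isn't installed at all:

```python
import importlib.util

def cuda_status() -> str:
    """Report whether a CUDA-enabled PyTorch installation is usable."""
    if importlib.util.find_spec("torch") is None:
        return "torch not installed"
    import torch  # imported lazily so the check still runs without PyTorch
    return "cuda available" if torch.cuda.is_available() else "cuda not available"

print(cuda_status())
```

If this prints "cuda not available" even though you installed the CUDA wheel, double-check your driver installation before filing an issue.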

As the usage of Lotsawa evolves we may simplify this process as needed.

Slightly more advanced - conda

If you're up to it, we recommend using a virtual environment to simplify your installation and management of your installed software. In our experience, conda is the easiest way to do this. Conda will keep the stuff needed to run Lotsawa separate from the rest of your computer. This way, if anything breaks, you can easily uninstall Lotsawa without affecting the rest of your programs.

Before installing with pip, begin by installing miniconda from here: https://docs.conda.io/projects/conda/en/latest/user-guide/install/macos.html, then run the following:

conda create -n lotsawa
conda activate lotsawa
conda install -c conda-forge "python>=3.6,<3.11" pip
pip install lotsawa

Whenever you want to use Lotsawa, activate your virtual environment:

conda activate lotsawa
lotsawa-translate   # Or whatever you want to do

When you're done, either just close the terminal window or run:

conda deactivate

To uninstall Lotsawa and all the associated packages, including PyTorch, all you need to do is:

# Delete the virtual environment, including Lotsawa itself
conda env remove -n lotsawa

# Clear the model cache
rm -rf "$(python -c "from torch.hub import get_dir; print(get_dir())")/champion_models"

Developers - installing from source

PLEASE begin by dropping us a line at contact@compassion-ai.org so that we can work with you to make you successful. We have no sales team or anything like that, we will not hassle you, we just want to be helpful.

You will need to clone four CompassionAI repos: common, manas, garland and lotsawa:

export CAI_BASE_DIR=~/workspace/compassionai   # Or wherever you like
mkdir -p $CAI_BASE_DIR; cd $CAI_BASE_DIR
git clone git@github.com:CompassionAI/common.git
git clone git@github.com:CompassionAI/manas.git
git clone git@github.com:CompassionAI/garland.git
git clone git@github.com:CompassionAI/lotsawa.git

We strongly recommend using conda. In fact, we recommend mamba - it is much faster than conda, with no downside. We provide a minimal environment file for your convenience.

# Install conda
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

# Install mamba
conda install mamba -c conda-forge

# Create the Lotsawa virtual environment
cd $CAI_BASE_DIR/lotsawa      # Or wherever you cloned the repo
mamba env create -f env.yml
conda activate lotsawa

Research

PLEASE begin by dropping us a line at contact@compassion-ai.org so that we can work with you to make you successful. We have no sales team or anything like that, we will not hassle you, we just want to be helpful.

We very strongly recommend doing research only on Linux.

If you're planning to do research, i.e. tinker with our datasets or tweak the models, you probably want the data registry:

git clone git@github.com:CompassionAI/data-registry.git
cd data-registry
./pull.sh   # Warning: large download

Follow the installation instructions in CompassionAI/common for research.

Usage

Translation into English

Use the lotsawa-translate utility. It has two modes: interactive and batch.

  • Interactive mode will prompt you for individual short Tibetan sections and will output English translations. This is intended as a test or a demo.
  • Batch mode will process long Tibetan files (in Unicode with uchen script). This mode segments the long text into shorter sections, then translates them sequentially with context. NB: the segmented sections will not translate the same in batch as in interactive mode due to the use of context during translation.

Interactive is the default mode. An example that uses batch mode is:

lotsawa-translate mode=batch mode.input_glob=~/tibetan_texts/*.bo

This will translate all texts in the directory ~/tibetan_texts that have the extension .bo and output the results to ./translations. To control the output directory, set mode.output_dir.
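To illustrate the file layout this produces - one output file per input, written under the output directory with the .en extension - here is a sketch of the naming scheme. The helper output_path is hypothetical, for illustration only, not Lotsawa's actual code:

```python
from pathlib import Path

def output_path(input_file: str, output_dir: str = "translations") -> Path:
    """Map an input Tibetan file like 'heart_sutra.bo' to its
    English output path, e.g. 'translations/heart_sutra.en'."""
    stem = Path(input_file).stem  # 'heart_sutra'
    return Path(output_dir) / f"{stem}.en"

print(output_path("heart_sutra.bo"))  # translations/heart_sutra.en
```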

A sample set of Tibetan documents to experiment on is available at https://compassionai.s3.amazonaws.com/public/translation_test_docs.zip.

To use CUDA, pass in cuda=true. For example:

lotsawa-translate mode=batch mode.input_glob=~/tibetan_texts/*.bo cuda=true

If your GPU has less than 8GB of VRAM you may see CUDA OOM errors. We recommend reducing the number of beams during beam search. The easiest way to do this is as follows:

lotsawa-translate mode=batch mode.input_glob=~/tibetan_texts/*.bo cuda=true generation=slow     # 50 beams, default
lotsawa-translate mode=batch mode.input_glob=~/tibetan_texts/*.bo cuda=true generation=medium   # 20 beams
lotsawa-translate mode=batch mode.input_glob=~/tibetan_texts/*.bo cuda=true generation=fast     # 5 beams

We recommend trying cuda=false generation=slow on some sample text to compare against. If you're unhappy with the results and would benefit from a more complex memory management protocol during beam decoding, please contact us at contact@compassion-ai.org or open a GitHub issue.

See the full help for lotsawa-translate for a complete list of options:

lotsawa-translate --help

Advanced options:

  • Use bad-words-list to create word exclusion lists during translations.
  • You can provide configuration overrides on a per-file basis. In the same folder as the Tibetan file you can provide a YAML configuration file with overrides. For example, see the override file for the Manjusrinamasamgiti in the data registry under processed_datasets/translation-test-docs.

Re-translation into other languages

To translate into languages other than English, we find that the best results come from translating into English first and then zero-shot translating from English to the target language using NLLB. We provide the simple tool lotsawa-retranslate to facilitate this. This strategy works best for translation into other high-resource languages such as simplified Chinese.

If you are trying to use this tool but are still running into issues, please contact us at contact@compassion-ai.org or on our GitHub issues page. Issues you may face include: a lot of English in the target output, toxicity or other bad words, or excessive pronoun/context switching. While we saw quite good results with this tool, we are not professional translators; we are likely to be able to improve the model if we understand your use case, so please contact us.

The tool will translate all English files that match the input glob into the target language. The input glob defaults to translations/*.en and the output extension defaults to the language code. For a readable list of the language codes, please see table 1, 204 Languages of No Language Left Behind, on pages 13-16 in the NLLB model paper at https://arxiv.org/pdf/2207.04672. You can also use the argument list_language_codes=true to print out all language codes.

Pass in cuda=true to use an NVidia GPU. You shouldn't run out of memory with the settings used here. If you are, please contact us.

As an example, to translate a directory with Tibetan texts in it into simplified Chinese:

# First, translate into English
lotsawa-translate mode=batch mode.input_glob=~/tibetan_texts/*.bo

# Reviewing the English translation here will improve the Chinese

# Finally, re-translate the English into Chinese
lotsawa-retranslate output.target_language_code=zho_Hans

The results will be in ./translations with the extension .zho_Hans.

You do not need to use lotsawa-translate to produce the English text. The lotsawa-retranslate tool goes through the input files line by line, skips any lines containing Tibetan characters, and translates each remaining line using NLLB. The most important thing to know is that NLLB works well on short inputs, so a simple approach for your own English text is to put every sentence on its own line. PLEASE contact us at contact@compassion-ai.org so that we can help, or open an issue on our GitHub page.
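The line-by-line behavior described above can be sketched as follows. The Tibetan test uses the standard Unicode Tibetan block (U+0F00 through U+0FFF); the functions themselves are our illustration, not Lotsawa's source code:

```python
def contains_tibetan(line: str) -> bool:
    """True if the line contains any character from the Tibetan
    Unicode block (U+0F00 through U+0FFF)."""
    return any("\u0f00" <= ch <= "\u0fff" for ch in line)

def lines_to_retranslate(text: str) -> list[str]:
    """Keep only non-empty lines without Tibetan characters,
    mirroring how the retranslation pass selects its input."""
    return [ln for ln in text.splitlines()
            if ln.strip() and not contains_tibetan(ln)]

sample = "བཀྲ་ཤིས་བདེ་ལེགས\nGreetings and good fortune.\n"
print(lines_to_retranslate(sample))  # ['Greetings and good fortune.']
```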

If you're interested in using the 84,000 XML files, note that the tool will not do any preprocessing, such as unfolding the XML tags. The class TeiLoader, found in the CompassionAI/common repo under cai_common/data/tei_loader.py, uses BeautifulSoup to extract and clean the translations from the 84,000 XML files. Please contact us if you're interested in using this code.

The retranslation tool has the generation options configuration group that allows fine-grained control over the text generation algorithm.

  • You can tweak the beam search parameters, such as the number of beams or the repetition penalty.

  • You can specify a word blacklist here. This can be especially useful to avoid nonsensical named-entity translations or toxic terms, or to stop the model from inserting English terms into non-English translations.

    Note that, for languages that don't have word indicators such as spaces, the word blacklist is unlikely to work well. This is because the NLLB tokenizer is based on SentencePiece, which produces highly contextual tokenizations. The blacklist is converted to tokens. We will implement a way around this if we find one. If you're using the blacklist to exclude English from your generated text, consider using an alphabet restriction.

  • You can constrain the model to generate a specific alphabet. The supported alphabets are, in alphabetical order: Arabic, Armenian, Bengali, Chinese (corresponds to the Unicode alphabet name "CJK Unified"), Cyrillic, Devanagari, Ethiopic, Georgian, Greek, Gujarati, Gurmukhi, Hangul, Hebrew, Hiragana, Kannada, Katakana, Khmer, Lao, Latin, Malayalam, Myanmar, Ol Chiki, Oriya, Sinhala, Tamil, Telugu, Thai, Tibetan (intended for research, e.g. back-translation), Tifinagh. All alphabets include the full range of characters in the Unicode alphabet with that name, so for example Cyrillic includes all non-Russian characters from other Slavic languages such as Ukrainian.

    Make sure your alphabet makes sense for your target language! Lotsawa will not warn you if it doesn't.

NB: Due to the nature of maximum-likelihood-like neural network training combined with the fact that the NLLB dataset was scraped from the internet, alphabet constraints are very likely to generate text in the language most represented in that alphabet on the internet. In particular, Latin will generate English, Cyrillic will generate Russian, Devanagari will generate Hindi, Myanmar will generate Burmese, etc. This is an artifact of the protocol used to train NLLB; it is not a cultural or political statement of some kind by anyone involved. You should get good results when combining alphabet constraints with language target code conditioning. For example,

lotsawa-retranslate output.target_language_code=ukr_Cyrl generation.alphabet=cyrillic

should produce good Ukrainian. Implementing new alphabet constraints is usually straightforward. Feel free to contribute, or please contact us at contact@compassion-ai.org or open an issue on our GitHub page.
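One simple way to implement such a check - our sketch, not necessarily how Lotsawa does it - is to test each character's Unicode name, since names carry the script (e.g. "CYRILLIC SMALL LETTER A", "CJK UNIFIED IDEOGRAPH-4E00"):

```python
import unicodedata

def in_alphabet(text: str, alphabet: str) -> bool:
    """True if every letter in `text` belongs to the named Unicode
    script, judged by the script prefix of each character's name.
    Punctuation, digits, and whitespace are ignored."""
    for ch in text:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if not name.startswith(alphabet.upper()):
            return False
    return True

print(in_alphabet("Добрий день", "cyrillic"))  # True
print(in_alphabet("Добрий day", "cyrillic"))   # False
```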

See the full help for lotsawa-retranslate for a complete list of options:

lotsawa-retranslate --help

Word segmentation and part-of-speech tagging

We provide interactive and batch segmentation and tagging.

  • Interactive mode will prompt you for individual short Tibetan sections and will output word segmentations with part-of-speech tags. This is intended as a test or a demo.
  • Batch mode will process long Tibetan files (in Unicode with uchen script). This mode will involve segmentation of the long text into shorter sections, followed by word segmentation and part-of-speech tagging.

To run the tool in interactive mode, just activate your conda environment (if any) and use:

lotsawa-words

Interactive is the default mode. An example that uses batch mode is:

lotsawa-words mode=batch mode.input_glob=~/tibetan_texts/*.bo

This will segment and tag all texts in the directory ~/tibetan_texts that have the extension .bo and output the results to ./pos_tags. To control the output directory, set mode.output_dir.
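For intuition about what word segmentation involves: written Tibetan marks syllables with the tsheg (་) but not word boundaries, so a naive baseline splits on tsheg, and the model's real job is to group syllables into words and tag them. A toy syllable splitter (purely illustrative, nothing like Lotsawa's neural segmenter):

```python
def split_syllables(text: str) -> list[str]:
    """Split Tibetan text into syllables on the tsheg mark (U+0F0B).
    This yields syllables, not words: grouping syllables into words
    is the hard part that lotsawa-words' model actually solves."""
    return [s for s in text.split("\u0f0b") if s]

print(split_syllables("བཀྲ་ཤིས་བདེ་ལེགས"))  # ['བཀྲ', 'ཤིས', 'བདེ', 'ལེགས']
```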

To use CUDA, pass in cuda=true. For example:

lotsawa-words mode=batch mode.input_glob=~/tibetan_texts/*.bo cuda=true

We are working on updating the models underlying this tool, especially the tokenization. If you have a use case for our token classifiers that needs a different delivery of the models, or if you need us to change how the models themselves work, please contact us at contact@compassion-ai.org or open an issue on our GitHub page.

Cleaning the model cache

Lotsawa will download the trained CompassionAI language models into the PyTorch Hub cache. The models can get fairly large; for example, our current best model for Tibetan-English translation is 1.8GB.

To clear the cache, simply delete the PyTorch Hub cache. This is safe, if you run Lotsawa again it will re-create the cache and re-download the models. To find the cache directory, run this in your terminal:

python -c "from torch.hub import get_dir; print(get_dir())"

This will print the directory, which you can then explore and delete if you like. To delete only the CompassionAI cache with a single terminal command, use:

rm -rf "$(python -c "from torch.hub import get_dir; print(get_dir())")/champion_models"

For developers

Hydra

Lotsawa uses Hydra for its configuration. Hydra is a tool developed by Facebook to manage complex configuration, especially in the machine learning space. It enables reproducible results, configuration as code, grouping of configuration options, and easy defaults and overrides. See https://hydra.cc/ for details.
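The key=value and group.key=value arguments seen throughout this document (e.g. mode=batch mode.input_glob=...) are Hydra command-line overrides. A toy parser showing how dotted overrides map onto a nested configuration - purely illustrative; Hydra's real override grammar is far richer:

```python
def apply_overrides(config: dict, overrides: list[str]) -> dict:
    """Apply Hydra-style 'a.b.c=value' overrides to a nested dict,
    creating intermediate sections as needed."""
    for item in overrides:
        keys, value = item.split("=", 1)
        node = config
        *path, last = keys.split(".")
        for key in path:
            node = node.setdefault(key, {})
        node[last] = value
    return config

cfg = apply_overrides({"mode": {}}, ["mode.input_glob=heart_sutra.bo", "cuda=true"])
print(cfg)  # {'mode': {'input_glob': 'heart_sutra.bo'}, 'cuda': 'true'}
```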

Embedding Lotsawa's backend into your own applications

The utilities we provide in the Lotsawa package are simple wrappers around helper classes provided by the Garland and Manas CompassionAI repos. The helper classes encapsulate loading the models, encoding and decoding the text, and any algorithms we layered on top of model decoding to create our results. For example, the Translator class encapsulates maintaining the target-language context during batch translation.

The source code for the provided utilities can be found in:

lotsawa/lotsawa/translate.py
lotsawa/lotsawa/retranslate.py
lotsawa/lotsawa/part_of_speech.py

Fine-tuning Lotsawa's models

We provide the code for this in the Garland and Manas repositories. Documentation may be sparse currently. Please contact us at contact@compassion-ai.org or open an issue on our GitHub page. We have no sales team or anything like that, we will not hassle you, we just want to be helpful.
