Machine translation and understanding of classical Tibetan
Project description
CompassionAI Lotsawa - tools for translating and understanding classical Tibetan
This is a collection of end-user tools to help with translation and understanding of long texts in classical Tibetan, especially the Kangyur and Tengyur.
For example, in your terminal you can:
# Bring up an interactive tool for translating individual short Tibetan sections
lotsawa-translate
# Output a translation of the Heart Sutra into English under ./translations
lotsawa-translate mode=batch mode.input_glob=heart_sutra.bo
# After translating to English, re-translate the Heart Sutra into simplified Chinese
lotsawa-retranslate output.target_language_code=zho_Hans
# Bring up an interactive tool for splitting Tibetan sections into words and tagging those words as nouns/verbs/adjectives/etc
lotsawa-words
A sample set of Tibetan documents to experiment on is available at https://compassionai.s3.amazonaws.com/public/translation_test_docs.zip.
Lotsawa is backed by our novel models that are the results of a research program into how to convert existing state-of-the-art translation models for short sentences, such as NLLB (No Language Left Behind), into models that are better able to handle the ambiguity of long classical Tibetan texts. The models used in Lotsawa utilize pre-trained, state-of-the-art translation models as a backbone that have had their neural architectures significantly modified to accomodate long texts. In particular, we are not simply serving (fine-tuned) NLLB; we are serving a model with a new neural architecture that's much better than NLLB at handling Tibetan ambiguity. Lotsawa implements a carefully tuned end-to-end translation pipeline for long texts - the result of many experiments on strategies for the preservation of contextual semantic information in the low-resource setting of classical Tibetan. Please see https://www.compassion-ai.org/ for an explanation of our research.
We are a tiny team of volunteers on a shoestring budget. The community of people who would benefit from these tools is likewise very small. If we don't work together, these tools will struggle to improve and be useful.
PLEASE don't immediately give up and walk away if you run into a problem. Without at least a tiny bit of your help these tools will never evolve to benefit anyone. Please, for the sake of the Tibetan language and the Dharma, contact us before giving up.
Contact us if you're using these tools, if something isn't working and you need help, if the tools are performing poorly, just to say hi, or for any other reason.
We can be reached at contact@compassion-ai.org or on GitHub issues.
Installation
We assume you're on a Mac. The installation should work on Windows and Linux mutatis mutandis.
IMPORTANT: There is currently an incompatibility between Hydra and Python 3.11. Furthermore, attempting to install with Python 3.11 will require building Hugging Face libraries from source, since prebuilt wheels are not yet available. This requires gcc, cmake and Rust.
We require Python <= 3.10 for Lotsawa until this is fixed upstream. Our provided conda environment file has the correct settings.
Basic instructions
Install with pip:
pip install lotsawa --upgrade
Lotsawa requires Python 3.6 or greater. This shouldn't be a problem on almost any modern computer. If you are having issues with this on an older Mac, see the Homebrew documentation here: https://docs.brew.sh/Homebrew-and-Python. If you can't make it work or if the Homebrew docs are too much, contact us at contact@compassion-ai.org or open a GitHub issue.
Basic instructions - NVidia GPUs
This section does not apply to Macs. Newer Macs with M1 chips or better should use the embedded GPU by default.
If you have an NVidia GPU and want to use it to massively speed everything up - we strongly recommend doing this if you can - you will need to install CUDA-enabled PyTorch. Begin by installing the NVidia drivers and CUDA:
- Windows: https://developer.nvidia.com/cuda-downloads?target_os=Windows&target_arch=x86_64&target_version=11&target_type=exe_local.
- Linux: There's lots of tutorials for the common distros, as well as the NVidia official document at https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html.
- Cloud: start with an image with CUDA pre-installed, or follow the instructions for your specific OS.
Then install CUDA-enabled PyTorch. This is very easy. The following line in your terminal should work:
pip3 install torch --extra-index-url https://download.pytorch.org/whl/cu116
If for some reason it doesn't, follow the instructions on https://pytorch.org/get-started/locally/:
- Set the PyTorch build to Stable.
- Set package to pip.
- Set language to Python.
- Set compute platform to CUDA 11.6 or greater.
It will give you a line of code to run, paste it into your terminal and you should be good. If you're not good, start without CUDA and contact us at contact@compassion-ai.org or open a GitHub issue.
As the usage of Lotsawa evolves we may simplify this process as needed.
Slightly more advanced - conda
If you're up to it, we recommend using a virtual environment to simplify your installation and management of your installed software. In our experience, conda is the easiest way to do this. Conda will keep the stuff needed to run Lotsawa separate from the rest of your computer. This way, if anything breaks, you can easily uninstall Lotsawa without affecting the rest of your programs.
Before installing with pip
, begin by installing miniconda from here: https://docs.conda.io/projects/conda/en/latest/user-guide/install/macos.html, then run the following:
conda create -n lotsawa
conda activate lotsawa
conda install -c conda-forge python>=3.6 pip
pip install lotsawa
Whenever you want to use Lotsawa, activate your virtual environment:
conda activate lotsawa
lotsawa-translate # Or whatever you want to do
When you're done, either just close the terminal window or run:
conda deactivate
To uninstall Lotsawa and all the associated packages, including PyTorch, all you need to do is:
# Delete the virtual environment, including Lotsawa itself
conda env remove -n lotsawa
# Clear the model cache
echo rm -rf $(python -c "from torch.hub import get_dir; print(get_dir() + '/champion_models')") | bash
Developers - installing from source
PLEASE begin by dropping us a line at contact@compassion-ai.org so that we can work with you to make you successful. We have no sales team or anything like that, we will not hassle you, we just want to be helpful.
You will need to clone four CompassionAI repos: common, manas, garland and lotsawa:
export CAI_BASE_DIR=~/workspace/compassionai # Or wherever you like
mkdir -p $CAI_BASE_DIR; cd $CAI_BASE_DIR
git clone git@github.com:CompassionAI/common.git
git clone git@github.com:CompassionAI/manas.git
git clone git@github.com:CompassionAI/garland.git
git clone git@github.com:CompassionAI/lotsawa.git
We strongly recommend using conda. In fact, we recommend mamba - it is much faster than conda, with no downside. We provide a minimal environment file for your convenience.
# Install conda
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
# Install mamba
conda install mamba -c conda-forge
# Create the Lotsawa virtual environment
cd $CAI_BASE_DIR/lotsawa # Or wherever you cloned the repo
mamba env create -f env.yml
conda activate lotsawa
Research
PLEASE begin by dropping us a line at contact@compassion-ai.org so that we can work with you to make you successful. We have no sales team or anything like that, we will not hassle you, we just want to be helpful.
We very strongly recommend doing research only on Linux.
If you're planning to do research, i.e. tinker with our datasets or tweak the models, you probably want the data registry:
git@github.com:CompassionAI/data-registry.git
cd data-registry
./pull.sh # Warning: large download
Follow the installation instructions in CompassionAI/common for research.
Usage
Translation into English
Use the lotsawa-translate
utility. It has two modes: interactive and batch.
- Interactive mode will prompt you for individual short Tibetan sections and will output English translations. This is intended as a test or a demo.
- Batch mode will process long Tibetan files (in Unicode with uchen script). This mode will involve segmentation of the long text into shorter sections, followed by sequential translating with context. NB: the segmented sections will not translate the same in batch as in interactive mode due to the use of context during translation.
Interactive is the default mode. An example that uses batch mode is:
lotsawa-translate mode=batch mode.input_glob=~/tibetan_texts/*.bo
This will translate all texts in the directory ~/tibetan_texts
that have the extension .bo
and output the results to ./translations
. To control the output directory, set mode.output_dir
.
A sample set of Tibetan documents to experiment on is available at https://compassionai.s3.amazonaws.com/public/translation_test_docs.zip.
To use CUDA, pass in cuda=true
. For example:
lotsawa-translate mode=batch mode.input_glob=~/tibetan_texts/*.bo cuda=true
If your GPU has less than 8GB of VRAM you may see CUDA OOM errors. We recommend reducing the number of beams during beam search. The easiest way to do this is as follows:
lotsawa-translate mode=batch mode.input_glob=~/tibetan_texts/*.bo cuda=true generation=slow # 50 beams, default
lotsawa-translate mode=batch mode.input_glob=~/tibetan_texts/*.bo cuda=true generation=medium # 20 beams
lotsawa-translate mode=batch mode.input_glob=~/tibetan_texts/*.bo cuda=true generation=fast # 5 beams
We recommend trying cuda=false generation=slow
on some sample text to compare against. If you're unhappy with the results and would benefit from a more complex memory management protocol during beam decoding, please contact us at contact@compassion-ai.org or open a GitHub issue.
See the full help for lotsawa-translate
for a complete list of options:
lotsawa-translate --help
Advanced options:
- Use
bad-words-list
to create word exclusion lists during translations. - You can provide configuration overrides on a per-file basis. In the same folder as the Tibetan file you can provide a YAML configuration file with overrides. For example, see the override file for the Manjusrinamasamgiti in the data registry under
processed_datasets/translation-test-docs
.
Re-translation into other languages
To translate into languages other than English, we find that the best results were to translate to English first and then zero-shot translate from English to the target language using NLLB. We provide the simple tool lotsawa-retranslate
to facilitate this. This strategy works best for translation into other high resource languages such as simplified Chinese.
If you are trying to use this tool but are still running into issues please contact us at contact@compassion-ai.org or on our GitHub issues page. Some issues you may face could be: seeing a lot of English in the target output, toxicity or other bad words, or excessive pronoun/context switching. While we saw quite good results with this tool, we are not professional translators. We are likely to be able to improve the model if we understand your use case, please contact us.
The tool will translate all English files that match the input glob into the target language. The input glob defaults to translations/*.en
and the output extension defaults to the language code. For a readable list of the language codes, please see table 1, 204 Languages of No Language Left Behind, on pages 13-16 in the NLLB model paper at https://arxiv.org/pdf/2207.04672. You can also use the argument list_language_codes=true
to print out all language codes.
Pass in cuda=true
to use an NVidia GPU. You shouldn't run out of memory with the settings used here. If you are, please contact us.
As an example, to translate a directory with Tibetan texts in it into simplified Chinese:
# First, translate into English
lotsawa-translate mode=batch mode.input_glob=~/tibetan_texts/*.bo
# Reviewing the English translation here will improve the Chinese
# Finally, re-translate the English into Chinese
lotsawa-retranslate output.target_language_code=zho_Hans
The results will be in ./translations
with the extension .zho_Hans
.
You do not need to use lotsawa-translate
to produce the English text. The lotsawa-retranslate
tool will go through the input files line by line, skip any lines with Tibetan characters in them, and translate each remaining line using NLLB. The most important thing to know is: NLLB works well on short inputs. A simple approach with English would be to split every English sentence into its own line. PLEASE contact us at contact@compassion-ai.org so that we can help, or open an issue on our GitHub page.
If you're interested in using the 84,000 XML files, note that the tool will not do any preprocessing, such as unfolding the XML tags. The class TeiLoader
, found in the CompassionAI/common repo under cai_common/data/tei_loader.py
, uses BeautifulSoup to extract and clean the translations from the 84,000 XML files. Please contact us if you're interested in using this code.
The retranslation tool has the generation options configuration group that allows fine-grained control over the text generation algorithm.
-
You can tweak the beam search parameters, such as the number of beams or the repetition penalty.
-
You can specify a word blacklist here. Can be especially useful to avoid some nonsensical named entity translations or toxic terms, or to stop the model from inserting English terms into non-English translations.
Note that, for languages that don't have word indicators such as spaces, the word blacklist is unlikely to work well. This is because the NLLB tokenizer is based on SentencePiece, which produces highly contextual tokenizations. The blacklist is converted to tokens. We will implement a way around this if we find one. If you're using the blacklist to exclude English from your generated text, consider using an alphabet restriction.
-
You can constrain the model to generate a specific alphabet. The supported alphabets are, in alphabetical order: Arabic, Armenian, Bengali, Chinese (corresponds to the Unicode alphabet name "CJK Unified"), Cyrillic, Devangari, Ethiopic, Georgian, Greek, Gujarati, Gurmukhi, Hangul, Hebrew, Hiragana, Kannada, Katakana, Khmer, Lao, Latin, Malayalam, Myanmar, Ol Chiki, Oriya, Sinhala, Tamil, Telugu, Thai, Tibetan (intended for research, eg back-translation), Tifinagh. All alphabets include the full range of characters in the Unicode alphabet with that name, so for example Cyrillic includes all non-Russian characters from other Slavic languages such as Ukrainian.
Make sure your alphabet makes sense for your target language! Lotsawa will not warn you if it doesn't.
NB: Due to the nature of maximum-likelihood-like neural network training combined with the fact that the NLLB dataset was scraped from the internet, alphabet constraints are very likely to generate text in the language most represented in that alphabet on the internet. In particular, Latin will generate English, Cyrillic will generate Russian, Devangari will generate Hindi, Myanmar will generate Burmese, etc. This is an artifact of the protocol used to train NLLB, it is not a cultural or political statement of some kind by anyone involved. You should get good results when combining alphabet constraints with language target code conditioning. For example,
lotsawa-retranslate output.target_language_code=ukr_Cyrl generation.alphabet=cyrillic
should produce good Ukrainian. Implementing new alphabet constraints is usually straightforward. Feel free to contribute, or please contact us at contact@compassion-ai.org or open an issue on our GitHub page.
See the full help for lotsawa-retranslate
for a complete list of options:
lotsawa-retranslate --help
Word segmentation and part-of-speech tagging
We provide interactive and batch segmentation and tagging.
- Interactive mode will prompt you for individual short Tibetan sections and will output word segmentations with part-of-speech tags. This is intended as a test or a demo.
- Batch mode will process long Tibetan files (in Unicode with uchen script). This mode will involve segmentation of the long text into shorter sections, followed by word segmentation and part-of-speech tagging.
To run the tool in interactive mode, just activate your conda environment (if any) and use:
lotsawa-words
Interactive is the default mode. An example that uses batch mode is:
lotsawa-words mode=batch mode.input_glob=~/tibetan_texts/*.bo
This will segment and tag all texts in the directory ~/tibetan_texts
that have the extension .bo
and output the results to ./pos_tags
. To control the output directory, set mode.output_dir
.
To use CUDA, pass in cuda=true
. For example:
lotsawa-words mode=batch mode.input_glob=~/tibetan_texts/*.bo cuda=true
We are working on updating the models underlying this tool, especially the tokenization. If you have a use case for our token classifiers that needs a different delivery of the models, or if you need us to change how the models themselves work, please contact us at contact@compassion-ai.org or open an issue on our GitHub page.
Cleaning the model cache
Lotsawa will download the trained CompassionAI language models into the PyTorch Hub cache. The models can get fairly large, for example our current best model for Tibetan-English translation is 1.8GB.
To clear the cache, simply delete the PyTorch Hub cache. This is safe, if you run Lotsawa again it will re-create the cache and re-download the models. To find the cache directory, run this in your terminal:
python -c "from torch.hub import get_dir; print(get_dir())"
This will print the directory, which you can then explore and delete if you like. To delete only the CompassionAI cache with a single terminal command, use:
echo rm -rf $(python -c "from torch.hub import get_dir; print(get_dir() + '/champion_models')") | bash
For developers
Hydra
Lotsawa uses Hydra for its configuration. Hydra is a tool developed by Facebook to manage complex configuration, especially in the machine learning space. It enables reproducible results, configuration as code, grouping of configuration options, and easy defaults and overrides. See https://hydra.cc/ for details.
Embedding Lotsawa's backend into your own applications
The utilities we provide in the Lotsawa package are simple wrappers around helper classes provided by the Garland and Manas CompassionAI repos. The helper classes encapsulate the loading of the models, the encoding and decoding of the text, and implement any algorithms we layered on top of the model decoding to create our results. For example, the Translator
class encapsulates the process of maintaing the target language context during batch translation.
The source code for the provided utilities can be found in:
lotsawa/lotsawa/translate.py
lotsawa/lotsawa/retranslate.py
lotsawa/lotsawa/part_of_speech.py
Fine-tuning Lotsawa's models
We provide the code for this in the Garland and Manas repositories. Documentation may be sparse currently. Please contact us at contact@compassion-ai.org or open an issue on our GitHub page. We have no sales team or anything like that, we will not hassle you, we just want to be helpful.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file lotsawa-0.3.4.tar.gz
.
File metadata
- Download URL: lotsawa-0.3.4.tar.gz
- Upload date:
- Size: 40.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.10.0
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | d77ef9a1b4bdcf3918859d81599f04e7399d0e9bbe4318ae856c6083f36cf1ac |
|
MD5 | 1f7b295cbf119466b781f9ee72231783 |
|
BLAKE2b-256 | ffce48938c2acf9bce899a41386cc09793fa2e978cf955ff944be326e69bbe6c |
File details
Details for the file lotsawa-0.3.4-py3-none-any.whl
.
File metadata
- Download URL: lotsawa-0.3.4-py3-none-any.whl
- Upload date:
- Size: 48.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.10.0
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 7105df49ce40399ae4d75947ceb94471c78c229cb9c69e7c3cb942c585482ed7 |
|
MD5 | 1ae6d630920f92bcdf16b9c1e98f6a80 |
|
BLAKE2b-256 | 5babc24e46e72bb0ba25035fbdde633ef38acebde8cfb75653ae78739452bd4e |