CompassionAI Project Manas - a bidirectional Tibetan transformer
CompassionAI project Manas - classical Tibetan language understanding models
Manas provides monolingual models of classical literary Tibetan. The current focus is on pretrained transformer models for:
- Monolingual tasks that are useful for teaching people to read Tibetan, especially word segmentation, part-of-speech tagging, and named entity recognition.
- Use as an encoder for the machine translation model.
Installation
There are two modes for this library - inference and research. We provide instructions for Linux.
- Inference should work on macOS and Windows, mutatis mutandis.
- We very strongly recommend doing research only on Linux. We will not provide any support to people trying to perform research tasks without installing Linux.
Virtual environment
We strongly recommend using a virtual environment for all your Python package installations, including anything from CompassionAI. To facilitate this, we provide a simple Conda environment YAML file in the CompassionAI/common repo. We recommend first installing Miniconda (see https://docs.conda.io/en/main/miniconda.html) and then Mamba (see https://github.com/mamba-org/mamba).
bash Miniconda3-latest-Linux-x86_64.sh
conda install mamba -c conda-forge
cd compassionai/common
mamba env create -f env-minimal.yml -n my-env
conda activate my-env
Inference
Just install with pip:
pip install compassionai-manas
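To confirm that the install landed in the active environment, a small check along these lines may help. It is a minimal sketch that uses only the standard library; the helper name `installed_version` is hypothetical and is not part of compassionai-manas.

```python
from importlib import metadata


def installed_version(distribution="compassionai-manas"):
    """Return the installed version of a distribution, or None if it is absent.

    Hypothetical helper for illustration; it only queries pip's package metadata.
    """
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("compassionai-manas is not installed in this environment")
    else:
        print(f"compassionai-manas {version} is installed")
```

If this reports that the package is missing, check that the intended virtual environment is active before running pip again.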
Research
Begin by installing for inference. Then install the CompassionAI data registry repo and set two environment variables:
$CAI_TEMP_PATH
$CAI_DATA_BASE_PATH
We strongly recommend setting them with conda in your virtual environment:
conda activate my-env
conda env config vars set CAI_TEMP_PATH=#directory on a mountpoint with plenty of space, does not need to be fast
conda env config vars set CAI_DATA_BASE_PATH=#absolute path to the CompassionAI data registry
Our code uses these environment variables to load datasets from the registry, output processed datasets and store training results.
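Before launching a research job, it can be useful to check that both variables are visible to Python. The following is a minimal sketch of such a check, not the library's actual loading code; the helper name `resolve_cai_paths` is hypothetical, and only the two variable names come from this README.

```python
import os
from pathlib import Path

# The two variables named in this README.
REQUIRED_VARS = ("CAI_TEMP_PATH", "CAI_DATA_BASE_PATH")


def resolve_cai_paths(environ=None):
    """Map each required variable name to a Path.

    Hypothetical helper for illustration; not part of compassionai-manas.
    Raises RuntimeError if a variable is unset or empty.
    """
    environ = os.environ if environ is None else environ
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError("Unset environment variables: " + ", ".join(missing))
    return {name: Path(environ[name]).expanduser() for name in REQUIRED_VARS}


if __name__ == "__main__":
    try:
        for name, path in resolve_cai_paths().items():
            status = "exists" if path.is_dir() else "directory not found"
            print(f"{name} = {path} ({status})")
    except RuntimeError as err:
        print(err)
```

Note that variables set with `conda env config vars set` only take effect after the environment is reactivated, so run `conda activate my-env` again before this check.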
Usage
Inference
This is a supporting library for our main inference repos, such as Lotsawa. You shouldn't need to use it directly.
Research
This library implements language understanding for classical Tibetan. It provides:
- Tokenization.
- Pre-training code.
- Fine-tuning on language understanding tasks, such as word segmentation and part-of-speech tagging.