CAALM: Carbohydrate Activity Annotation with protein Language Models
⚙️ Installation
1. Clone the repository

```bash
git clone https://github.com/lczong/CAALM.git
cd CAALM
```
2. Set up a virtual environment (recommended)

```bash
conda create -n caalm python=3.10
conda activate caalm
```
3. Install PyTorch

Follow the commands below, or choose the build that matches your device (official guide | previous versions).

```bash
# CUDA 12.6
pip install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu126

# CPU only
pip install torch==2.6.0 --index-url https://download.pytorch.org/whl/cpu
```
4. Install FAISS

```bash
# CPU (via pip or conda)
pip install faiss-cpu                 # option 1
conda install faiss-cpu -c pytorch    # option 2

# GPU (conda recommended; pip may not work correctly)
conda install faiss-gpu -c pytorch
```
5. Install the package

```bash
pip install .
```
6. Download model assets

Download the full CAALM Hugging Face repository into a directory named `models` in the project root:

```bash
python -c "from huggingface_hub import snapshot_download; snapshot_download('lczong/CAALM', local_dir='models')"
```

The expected layout after download is:

```
models/
├── level0/        # Level 0 binary classifier
├── level1/        # Level 1 multi-label classifier
└── level2/
    ├── model.pt   # Level 2 projection checkpoint
    ├── faiss/     # FAISS indices (<CLASS>.faiss)
    └── refdb/     # Reference TSVs (<CLASS>_labels.tsv)
```
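To sanity-check the download, a small helper like the following (hypothetical, not part of the CAALM package) can report any assets missing from the layout above:

```python
from pathlib import Path

# Expected asset paths, taken from the layout shown above.
EXPECTED = [
    "level0",
    "level1",
    "level2/model.pt",
    "level2/faiss",
    "level2/refdb",
]

def missing_assets(models_dir):
    """Return the expected paths that do not exist under models_dir."""
    root = Path(models_dir)
    return [p for p in EXPECTED if not (root / p).exists()]

print(missing_assets("models"))  # empty list means the download looks complete
```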
📖 Usage
Prediction Flow
CAALM runs three levels in sequence:
- Level 0 predicts whether a sequence is `CAZy` or `non-CAZy`.
- If Level 0 predicts `CAZy`, Level 1 predicts one or more major CAZy classes from `GT`, `GH`, `CBM`, `CE`, `PL`, and `AA`.
- Level 2 retrieves family labels from the FAISS index and reference database for each predicted Level 1 major class.
If Level 1 predicts multiple classes such as `GH|CBM`, Level 2 searches both major-class databases and writes one family prediction per major class.
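The three-level cascade can be sketched as below. The `predict_level0/1/2` functions are hypothetical stand-ins for the real models, with hard-coded outputs purely for illustration:

```python
# Minimal sketch of the CAALM prediction cascade described above.
# These stand-in predictors return fixed values; the real models
# run a binary classifier, a multi-label classifier, and FAISS retrieval.

def predict_level0(seq):
    return "CAZy"                 # binary: "CAZy" or "Non-CAZy"

def predict_level1(seq):
    return ["GH", "CBM"]          # one or more major classes

def predict_level2(seq, major_class):
    return f"{major_class}-family"  # family label per major class

def cascade(seq):
    if predict_level0(seq) != "CAZy":
        # non-CAZy sequences get empty class/family fields
        return {"is_cazy": "Non-CAZy", "classes": [], "families": []}
    classes = predict_level1(seq)
    # one family prediction per predicted major class
    families = [predict_level2(seq, c) for c in classes]
    return {"is_cazy": "CAZy", "classes": classes, "families": families}

print(cascade("MKTAYIAK"))
```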
Example Command
A convenience script is provided to run the example with one command:

```bash
./scripts/predict_example.sh
```

Or invoke the CLI directly:

```bash
caalm input/example.fasta
```

The output name defaults to the input filename stem (here `example`, from `input/example.fasta`), and output files are written to `./outputs/`. To customise:

```bash
caalm your_sequences.fasta -o results --output-name my_run
```

Use `caalm --help` to see all options grouped by category.
Common Options
```bash
# Use a specific GPU
caalm input.fasta -d cuda:0

# Enable mixed precision for faster inference
caalm input.fasta --mixed-precision bf16

# Increase batch size for large-memory GPUs
caalm input.fasta -b 16

# Increase the level 2 projection batch size independently
caalm input.fasta -b2 1024

# Save level 1 embeddings for downstream analysis
caalm input.fasta --save-level1-embeddings

# Save level 0 embeddings
caalm input.fasta --save-level0-embeddings

# Save level 2 projected embeddings
caalm input.fasta --save-level2-embeddings
```
Models
The recommended setup is to download the full CAALM Hugging Face repository into a local models directory (see Installation step 6). If local files are not found, Level 0 and Level 1 will try to download from Hugging Face automatically.
| Level | Description | Default path | CLI override |
|---|---|---|---|
| Level 0 | Binary CAZy / non-CAZy classifier | `./models/level0` | `--level0-model` |
| Level 1 | Multi-label major-class classifier | `./models/level1` | `--level1-model` |
| Level 2 | Projection checkpoint | `./models/level2/model.pt` | `--level2-model` |
| Level 2 | FAISS indices (`<CLASS>.faiss`) | `./models/level2/faiss` | `--level2-faiss-dir` |
| Level 2 | Reference TSVs (`<CLASS>_labels.tsv`) | `./models/level2/refdb` | `--level2-label-tsv-dir` |
If `--level2-families` is omitted, Level 2 automatically uses each sequence's predicted Level 1 classes.
Outputs
Each run writes three main files under `--output-dir` with the prefix `--output-name`. When requested, embedding arrays are additionally saved as `.npy` files.
`*_predictions.tsv`

Columns: `sequence_id`, `pred_is_cazy`, `pred_cazy_class`, `pred_cazy_family`

Notes:

- `pred_is_cazy` is `CAZy` for CAZy sequences and `Non-CAZy` for non-CAZy sequences.
- `pred_cazy_class` is empty for non-CAZy sequences.
- `pred_cazy_family` is empty for non-CAZy sequences.
- For multi-label Level 1 predictions, both `pred_cazy_class` and `pred_cazy_family` use `|` as the separator.
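A `*_predictions.tsv` file with this schema can be parsed with the standard library, splitting the multi-label fields on `|`. The sample rows below are made up for illustration, not real CAALM output:

```python
import csv
import io

# Illustrative TSV content matching the column layout described above.
sample = (
    "sequence_id\tpred_is_cazy\tpred_cazy_class\tpred_cazy_family\n"
    "seq1\tCAZy\tGH|CBM\tGH5|CBM2\n"
    "seq2\tNon-CAZy\t\t\n"
)

rows = []
for row in csv.DictReader(io.StringIO(sample), delimiter="\t"):
    # Empty fields (non-CAZy sequences) become empty lists.
    row["pred_cazy_class"] = row["pred_cazy_class"].split("|") if row["pred_cazy_class"] else []
    row["pred_cazy_family"] = row["pred_cazy_family"].split("|") if row["pred_cazy_family"] else []
    rows.append(row)

print(rows[0]["pred_cazy_class"])   # -> ['GH', 'CBM']
print(rows[1]["pred_cazy_family"])  # -> []
```

With a real run, replace `io.StringIO(sample)` with an open file handle on the generated TSV.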
`*_probabilities.jsonl`

- One JSON object per sequence.
- `level0.prob_is_cazy`: probability from the binary classifier.
- `level1.class_probabilities`: probabilities for `GT`, `GH`, `CBM`, `CE`, `PL`, and `AA`.
- `level2.predicted_families`: family predictions for each predicted major class, including score, matched reference sequence, and vote count.
- Saved probabilities and Level 2 scores are rounded to 5 decimal places.
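A record with these fields might be consumed like this. The sample line and its probability values are invented for illustration, and the 0.5 cutoff is an arbitrary choice, not a CAALM default:

```python
import io
import json

# One invented JSONL record following the field names listed above.
sample = (
    '{"sequence_id": "seq1", '
    '"level0": {"prob_is_cazy": 0.99123}, '
    '"level1": {"class_probabilities": {"GT": 0.01, "GH": 0.97, "CBM": 0.6, '
    '"CE": 0.02, "PL": 0.01, "AA": 0.03}}}\n'
)

for line in io.StringIO(sample):
    rec = json.loads(line)
    probs = rec["level1"]["class_probabilities"]
    # Keep major classes whose probability clears an arbitrary threshold.
    hits = [cls for cls, p in probs.items() if p >= 0.5]
    print(rec["sequence_id"], hits)  # seq1 ['GH', 'CBM']
```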
`*_statistics.tsv`
- Summary counts and percentages for Level 0, Level 1, and Level 2 outputs.
Optional embedding outputs
- `*_level0_embeddings.npy` when `--save-level0-embeddings` is used.
- `*_level1_embeddings.npy` when `--save-level1-embeddings` is used.
- `*_level2_embeddings.npy` when `--save-level2-embeddings` is used.
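The saved arrays are plain NumPy files and load with `numpy.load`. The sketch below round-trips a dummy array in place of real output; the shape, including the 1280-dimensional embedding width, is an assumption for illustration:

```python
import numpy as np

# Dummy stand-in for a saved embedding matrix: one row per sequence.
# Real files come from caalm runs with --save-level1-embeddings etc.
dummy = np.random.rand(4, 1280).astype(np.float32)
np.save("example_level1_embeddings.npy", dummy)

emb = np.load("example_level1_embeddings.npy")
print(emb.shape)  # (4, 1280)
```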