CAALM: Carbohydrate Activity Annotation with protein Language Models

⚙️ Installation

  1. Clone the Repository

    git clone https://github.com/lczong/CAALM.git
    cd CAALM
    
  2. Set Up a Virtual Environment (Recommended)

    conda create -n caalm python=3.10
    conda activate caalm
    
  3. Install PyTorch

Install one of the builds below, or pick the build that matches your hardware from PyTorch's official guide or previous-versions page.

    # CUDA 12.6
    pip install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu126
    
    # CPU only
    pip install torch==2.6.0 --index-url https://download.pytorch.org/whl/cpu
    
  4. Install FAISS

    # CPU (via pip or conda)
    pip install faiss-cpu        # option 1
    conda install faiss-cpu -c pytorch  # option 2
    
    # GPU (conda recommended — pip may not work correctly)
    conda install faiss-gpu -c pytorch
    
  5. Install the Package

    pip install .
    
  6. Download Model Assets

    Download the full CAALM Hugging Face repository into a directory named models in the project root:

    python -c "from huggingface_hub import snapshot_download; snapshot_download('lczong/CAALM', local_dir='models')"
    

    The expected layout after download is:

    models/
    ├── level0/          # Level 0 binary classifier
    ├── level1/          # Level 1 multi-label classifier
    └── level2/
        ├── model.pt     # Level 2 projection checkpoint
        ├── faiss/       # FAISS indices (<CLASS>.faiss)
        └── refdb/       # Reference TSVs (<CLASS>_labels.tsv)
    

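After downloading, a short script can confirm the layout above is in place before running predictions (a minimal sketch, not part of the CAALM package; it only checks the paths shown in the tree):

```python
from pathlib import Path

# Relative paths expected under models/ after snapshot_download
# (see the tree above).
REQUIRED = [
    "level0",
    "level1",
    "level2/model.pt",
    "level2/faiss",
    "level2/refdb",
]

def missing_assets(models_root: str) -> list[str]:
    """Return the expected relative paths that are absent under models_root."""
    root = Path(models_root)
    return [rel for rel in REQUIRED if not (root / rel).exists()]

if __name__ == "__main__":
    gaps = missing_assets("models")
    if gaps:
        print("Missing:", ", ".join(gaps))
    else:
        print("models/ layout looks complete")
```

Running it from the project root reports any directory or checkpoint the download did not produce.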
📖 Usage

Prediction Flow

CAALM runs three levels in sequence:

  1. Level 0 predicts whether a sequence is CAZy or non-CAZy.
  2. If Level 0 predicts CAZy, Level 1 predicts one or more major CAZy classes from GT, GH, CBM, CE, PL, and AA.
  3. Level 2 retrieves family labels from the FAISS index and reference database for each predicted Level 1 major class.

If Level 1 predicts multiple classes such as GH|CBM, Level 2 searches both major-class databases and writes one family prediction per major class.
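
The three-level flow above can be sketched as plain control logic (a hypothetical illustration with stub predictors; the real CAALM model internals are not shown):

```python
def predict(sequence, level0, level1, level2_search):
    """Run the three-level cascade described above.

    level0: seq -> bool (is the sequence CAZy?)
    level1: seq -> list of predicted major classes, e.g. ["GH", "CBM"]
    level2_search: (seq, major_class) -> family label retrieved from that
                   class's FAISS index and reference database
    """
    if not level0(sequence):
        return {"is_cazy": False, "classes": [], "families": []}
    classes = level1(sequence)  # may be multi-label, e.g. GH|CBM
    # One family prediction per predicted major class.
    families = [level2_search(sequence, c) for c in classes]
    return {"is_cazy": True, "classes": classes, "families": families}

# Toy stubs: every input is CAZy with a multi-label GH|CBM prediction.
result = predict(
    "MKT...",
    level0=lambda s: True,
    level1=lambda s: ["GH", "CBM"],
    level2_search=lambda s, c: f"{c}5",
)
print("|".join(result["classes"]), "|".join(result["families"]))
```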

Example Command

A convenience script is provided to run the example with one command:

./scripts/predict_example.sh

Or invoke the CLI directly:

caalm input/example.fasta

The output name defaults to the input filename stem (here example, from input/example.fasta), and output files are written to ./outputs/. To customise:

caalm your_sequences.fasta -o results --output-name my_run

Use caalm --help to see all options grouped by category.

Common Options

# Use a specific GPU
caalm input.fasta -d cuda:0

# Enable mixed precision for faster inference
caalm input.fasta --mixed-precision bf16

# Increase batch size for large-memory GPUs
caalm input.fasta -b 16

# Increase the level 2 projection batch size independently
caalm input.fasta -b2 1024

# Save level 1 embeddings for downstream analysis
caalm input.fasta --save-level1-embeddings

# Save level 0 embeddings
caalm input.fasta --save-level0-embeddings

# Save level 2 projected embeddings
caalm input.fasta --save-level2-embeddings

Models

The recommended setup is to download the full CAALM Hugging Face repository into a local models directory (see Installation step 6). If local files are not found, Level 0 and Level 1 will try to download from Hugging Face automatically.

Level    Description                           Default path              CLI override
Level 0  Binary CAZy / non-CAZy classifier     ./models/level0           --level0-model
Level 1  Multi-label major-class classifier    ./models/level1           --level1-model
Level 2  Projection checkpoint                 ./models/level2/model.pt  --level2-model
Level 2  FAISS indices (<CLASS>.faiss)         ./models/level2/faiss     --level2-faiss-dir
Level 2  Reference TSVs (<CLASS>_labels.tsv)   ./models/level2/refdb     --level2-label-tsv-dir

If --level2-families is omitted, Level 2 automatically uses each sequence's predicted Level 1 classes.
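
Given the defaults above, the Level 2 assets for each major class follow the <CLASS>.faiss / <CLASS>_labels.tsv naming convention. A small helper (hypothetical, not part of the CAALM API) makes the mapping concrete:

```python
from pathlib import Path

def level2_assets(major_class: str,
                  faiss_dir: str = "./models/level2/faiss",
                  tsv_dir: str = "./models/level2/refdb") -> tuple[Path, Path]:
    """Return the FAISS index and reference-TSV paths for one major class."""
    index = Path(faiss_dir) / f"{major_class}.faiss"
    labels = Path(tsv_dir) / f"{major_class}_labels.tsv"
    return index, labels

# For a multi-label GH|CBM prediction, both class databases are consulted:
for cls in "GH|CBM".split("|"):
    index, labels = level2_assets(cls)
    print(index, labels)
```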

Outputs

Each run writes three main files under --output-dir, each prefixed with --output-name. When requested, embedding arrays are also saved as .npy files.

*_predictions.tsv

  • sequence_id
  • pred_is_cazy
  • pred_cazy_class
  • pred_cazy_family

Notes:

  • pred_is_cazy is CAZy for CAZy sequences and Non-CAZy for non-CAZy sequences.
  • pred_cazy_class is empty for non-CAZy sequences.
  • pred_cazy_family is empty for non-CAZy sequences.
  • For multi-label Level 1 predictions, both pred_cazy_class and pred_cazy_family use | as the separator.
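
Downstream scripts can split the pipe-separated columns directly. A minimal parser for *_predictions.tsv under the column layout above (the example rows are synthetic, not real CAALM output):

```python
import csv
import io

# Synthetic example matching the documented columns; a real file would
# come from a CAALM run.
tsv = (
    "sequence_id\tpred_is_cazy\tpred_cazy_class\tpred_cazy_family\n"
    "seq1\tCAZy\tGH|CBM\tGH5|CBM2\n"
    "seq2\tNon-CAZy\t\t\n"
)

for row in csv.DictReader(io.StringIO(tsv), delimiter="\t"):
    # Empty class/family columns mean the sequence was called non-CAZy.
    classes = row["pred_cazy_class"].split("|") if row["pred_cazy_class"] else []
    families = row["pred_cazy_family"].split("|") if row["pred_cazy_family"] else []
    print(row["sequence_id"], row["pred_is_cazy"], classes, families)
```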

*_probabilities.jsonl

  • One JSON object per sequence.
  • level0.prob_is_cazy: probability from the binary classifier.
  • level1.class_probabilities: probabilities for GT, GH, CBM, CE, PL, and AA.
  • level2.predicted_families: family predictions for each predicted major class, including score, matched reference sequence, and vote count.
  • Saved probabilities and Level 2 scores are rounded to 5 decimal places.
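
Because the file is JSONL, it can be streamed one record at a time rather than loaded whole. A sketch with a synthetic record (the field names follow the bullets above, but the exact nesting is illustrative):

```python
import io
import json

# One synthetic record per line, mirroring the documented fields.
jsonl = io.StringIO(
    '{"sequence_id": "seq1", '
    '"level0": {"prob_is_cazy": 0.99812}, '
    '"level1": {"class_probabilities": {"GH": 0.97341, "CBM": 0.81205}}, '
    '"level2": {"predicted_families": '
    '{"GH": {"label": "GH5", "score": 0.91234}}}}\n'
)

for line in jsonl:
    rec = json.loads(line)
    # Values are already rounded to 5 decimal places in the file.
    print(rec["sequence_id"], rec["level0"]["prob_is_cazy"])
```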

*_statistics.tsv

  • Summary counts and percentages for Level 0, Level 1, and Level 2 outputs.

Optional embedding outputs

  • *_level0_embeddings.npy when --save-level0-embeddings is used.
  • *_level1_embeddings.npy when --save-level1-embeddings is used.
  • *_level2_embeddings.npy when --save-level2-embeddings is used.

Download files

Download the file for your platform.

Source Distribution

caalm-1.0.0.tar.gz (28.7 kB, Source)

Built Distribution

caalm-1.0.0-py3-none-any.whl (27.9 kB, Python 3)

File details

Details for the file caalm-1.0.0.tar.gz.

File metadata

  • Download URL: caalm-1.0.0.tar.gz
  • Size: 28.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.8

File hashes

Algorithm    Hash digest
SHA256       f843b1339cae239c6fd971d8bbccea87631c33068c919e47f674f621890d1725
MD5          e8aa0dfc9b7698f010c3d34761e8751f
BLAKE2b-256  1ef3b549336657437e90e74b959133ede5c7fa7710dd317455a4c40dc051eca8

File details

Details for the file caalm-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: caalm-1.0.0-py3-none-any.whl
  • Size: 27.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.8

File hashes

Algorithm    Hash digest
SHA256       c47322050d016a07b31d3fa8e9df54e2f63f8957c87762ad453a74317f1ee117
MD5          20e8bc97205d413d6b55add3dc1343ad
BLAKE2b-256  32064d306e1809aee1c5bf35b34601db259695992c75a519b187121113aa1c62
