CAALM: Carbohydrate Activity Annotation with protein Language Models

⚙️ Installation

  1. Clone the Repository

    git clone https://github.com/lczong/CAALM.git
    cd CAALM
    
  2. Set Up a Virtual Environment (Recommended)

    conda create -n caalm python=3.10
    conda activate caalm
    
  3. Install PyTorch

Install one of the builds below, or pick the build that matches your hardware from PyTorch's official guide or previous-versions page.

    # CUDA 12.6
    pip install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu126
    
    # CPU only
    pip install torch==2.6.0 --index-url https://download.pytorch.org/whl/cpu
    
  4. Install FAISS

    # CPU (via pip or conda)
    pip install faiss-cpu        # option 1
    conda install faiss-cpu -c pytorch  # option 2
    
    # GPU (conda recommended — pip may not work correctly)
    conda install faiss-gpu -c pytorch
    
  5. Install the Package

    pip install .
    
  6. Download Model Assets

    Download the full CAALM Hugging Face repository into a directory named models in the project root:

    python -c "from huggingface_hub import snapshot_download; snapshot_download('lczong/CAALM', local_dir='models')"
    

    The expected layout after download is:

    models/
    ├── level0/          # Level 0 binary classifier
    ├── level1/          # Level 1 multi-label classifier
    └── level2/
        ├── model.pt     # Level 2 projection checkpoint
        ├── faiss/       # FAISS indices (<CLASS>.faiss)
        └── refdb/       # Reference TSVs (<CLASS>_labels.tsv)
    

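After downloading, a short script can confirm the layout above is in place before running predictions (a minimal sketch, not part of the CAALM package; it only checks the paths shown in the tree):

```python
from pathlib import Path

# Relative paths expected under models/ after snapshot_download
# (see the tree above).
REQUIRED = [
    "level0",
    "level1",
    "level2/model.pt",
    "level2/faiss",
    "level2/refdb",
]

def missing_assets(models_root: str) -> list[str]:
    """Return the expected relative paths that are absent under models_root."""
    root = Path(models_root)
    return [rel for rel in REQUIRED if not (root / rel).exists()]

if __name__ == "__main__":
    gaps = missing_assets("models")
    if gaps:
        print("Missing:", ", ".join(gaps))
    else:
        print("models/ layout looks complete")
```

Running it from the project root reports any directory or checkpoint the download did not produce.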
📖 Usage

Prediction Flow

CAALM runs three levels in sequence:

  1. Level 0 predicts whether a sequence is CAZy or non-CAZy.
  2. If Level 0 predicts CAZy, Level 1 predicts one or more major CAZy classes from GT, GH, CBM, CE, PL, and AA.
  3. Level 2 retrieves family labels from the FAISS index and reference database for each predicted Level 1 major class.

If Level 1 predicts multiple classes such as GH|CBM, Level 2 searches both major-class databases and writes one family prediction per major class.
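
The three-level flow above can be sketched as plain control logic (a hypothetical illustration with stub predictors; the real CAALM model internals are not shown):

```python
def predict(sequence, level0, level1, level2_search):
    """Run the three-level cascade described above.

    level0: seq -> bool (is the sequence CAZy?)
    level1: seq -> list of predicted major classes, e.g. ["GH", "CBM"]
    level2_search: (seq, major_class) -> family label retrieved from that
                   class's FAISS index and reference database
    """
    if not level0(sequence):
        return {"is_cazy": False, "classes": [], "families": []}
    classes = level1(sequence)  # may be multi-label, e.g. GH|CBM
    # One family prediction per predicted major class.
    families = [level2_search(sequence, c) for c in classes]
    return {"is_cazy": True, "classes": classes, "families": families}

# Toy stubs: every input is CAZy with a multi-label GH|CBM prediction.
result = predict(
    "MKT...",
    level0=lambda s: True,
    level1=lambda s: ["GH", "CBM"],
    level2_search=lambda s, c: f"{c}5",
)
print("|".join(result["classes"]), "|".join(result["families"]))
```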

Example Command

A convenience script is provided to run the example with one command:

./scripts/predict_example.sh

Or invoke the CLI directly:

caalm input/example.fasta

The output name defaults to the input filename stem (here example, from input/example.fasta), and output files are written to ./outputs/. To customise:

caalm your_sequences.fasta -o results --output-name my_run

Use caalm --help to see all options grouped by category.

Common Options

# Use a specific GPU
caalm input.fasta -d cuda:0

# Enable mixed precision for faster inference
caalm input.fasta --mixed-precision bf16

# Increase batch size for large-memory GPUs
caalm input.fasta -b 16

# Increase the level 2 projection batch size independently
caalm input.fasta -b2 1024

# Save level 1 embeddings for downstream analysis
caalm input.fasta --save-level1-embeddings

# Save level 0 embeddings
caalm input.fasta --save-level0-embeddings

# Save level 2 projected embeddings
caalm input.fasta --save-level2-embeddings

Models

The recommended setup is to download the full CAALM Hugging Face repository into a local models directory (see Installation step 6). If local files are not found, Level 0 and Level 1 will try to download from Hugging Face automatically.

Level    Description                           Default path              CLI override
Level 0  Binary CAZy / non-CAZy classifier     ./models/level0           --level0-model
Level 1  Multi-label major-class classifier    ./models/level1           --level1-model
Level 2  Projection checkpoint                 ./models/level2/model.pt  --level2-model
Level 2  FAISS indices (<CLASS>.faiss)         ./models/level2/faiss     --level2-faiss-dir
Level 2  Reference TSVs (<CLASS>_labels.tsv)   ./models/level2/refdb     --level2-label-tsv-dir

If --level2-families is omitted, Level 2 automatically uses each sequence's predicted Level 1 classes.
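
Given the defaults above, the Level 2 assets for each major class follow the <CLASS>.faiss / <CLASS>_labels.tsv naming convention. A small helper (hypothetical, not part of the CAALM API) makes the mapping concrete:

```python
from pathlib import Path

def level2_assets(major_class: str,
                  faiss_dir: str = "./models/level2/faiss",
                  tsv_dir: str = "./models/level2/refdb") -> tuple[Path, Path]:
    """Return the FAISS index and reference-TSV paths for one major class."""
    index = Path(faiss_dir) / f"{major_class}.faiss"
    labels = Path(tsv_dir) / f"{major_class}_labels.tsv"
    return index, labels

# For a multi-label GH|CBM prediction, both class databases are consulted:
for cls in "GH|CBM".split("|"):
    index, labels = level2_assets(cls)
    print(index, labels)
```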

Outputs

Each run writes three main files under --output-dir, each prefixed with --output-name. When requested, embedding arrays are also saved as .npy files.

*_predictions.tsv

  • sequence_id
  • pred_is_cazy
  • pred_cazy_class
  • pred_cazy_family

Notes:

  • pred_is_cazy is CAZy for CAZy sequences and Non-CAZy for non-CAZy sequences.
  • pred_cazy_class is empty for non-CAZy sequences.
  • pred_cazy_family is empty for non-CAZy sequences.
  • For multi-label Level 1 predictions, both pred_cazy_class and pred_cazy_family use | as the separator.
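
Downstream scripts can split the pipe-separated columns directly. A minimal parser for *_predictions.tsv under the column layout above (the example rows are synthetic, not real CAALM output):

```python
import csv
import io

# Synthetic example matching the documented columns; a real file would
# come from a CAALM run.
tsv = (
    "sequence_id\tpred_is_cazy\tpred_cazy_class\tpred_cazy_family\n"
    "seq1\tCAZy\tGH|CBM\tGH5|CBM2\n"
    "seq2\tNon-CAZy\t\t\n"
)

for row in csv.DictReader(io.StringIO(tsv), delimiter="\t"):
    # Empty class/family columns mean the sequence was called non-CAZy.
    classes = row["pred_cazy_class"].split("|") if row["pred_cazy_class"] else []
    families = row["pred_cazy_family"].split("|") if row["pred_cazy_family"] else []
    print(row["sequence_id"], row["pred_is_cazy"], classes, families)
```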

*_probabilities.jsonl

  • One JSON object per sequence.
  • level0.prob_is_cazy: probability from the binary classifier.
  • level1.class_probabilities: probabilities for GT, GH, CBM, CE, PL, and AA.
  • level2.predicted_families: family predictions for each predicted major class, including score, matched reference sequence, and vote count.
  • Saved probabilities and Level 2 scores are rounded to 5 decimal places.
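
Because the file is JSONL, it can be streamed one record at a time rather than loaded whole. A sketch with a synthetic record (the field names follow the bullets above, but the exact nesting is illustrative):

```python
import io
import json

# One synthetic record per line, mirroring the documented fields.
jsonl = io.StringIO(
    '{"sequence_id": "seq1", '
    '"level0": {"prob_is_cazy": 0.99812}, '
    '"level1": {"class_probabilities": {"GH": 0.97341, "CBM": 0.81205}}, '
    '"level2": {"predicted_families": '
    '{"GH": {"label": "GH5", "score": 0.91234}}}}\n'
)

for line in jsonl:
    rec = json.loads(line)
    # Values are already rounded to 5 decimal places in the file.
    print(rec["sequence_id"], rec["level0"]["prob_is_cazy"])
```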

*_statistics.tsv

  • Summary counts and percentages for Level 0, Level 1, and Level 2 outputs.

Optional embedding outputs

  • *_level0_embeddings.npy when --save-level0-embeddings is used.
  • *_level1_embeddings.npy when --save-level1-embeddings is used.
  • *_level2_embeddings.npy when --save-level2-embeddings is used.

Download files

Download the file for your platform.

Source Distribution

caalm-1.0.0.tar.gz (28.7 kB, Source)

Built Distribution

caalm-1.0.0-py3-none-any.whl (27.9 kB, Python 3)

File details

Details for the file caalm-1.0.0.tar.gz.

File metadata

  • Download URL: caalm-1.0.0.tar.gz
  • Size: 28.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.8

File hashes

Algorithm    Hash digest
SHA256       f843b1339cae239c6fd971d8bbccea87631c33068c919e47f674f621890d1725
MD5          e8aa0dfc9b7698f010c3d34761e8751f
BLAKE2b-256  1ef3b549336657437e90e74b959133ede5c7fa7710dd317455a4c40dc051eca8

File details

Details for the file caalm-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: caalm-1.0.0-py3-none-any.whl
  • Size: 27.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.8

File hashes

Algorithm    Hash digest
SHA256       c47322050d016a07b31d3fa8e9df54e2f63f8957c87762ad453a74317f1ee117
MD5          20e8bc97205d413d6b55add3dc1343ad
BLAKE2b-256  32064d306e1809aee1c5bf35b34601db259695992c75a519b187121113aa1c62
