
LucaGPLM

LucaGPLM - The LUCA general purpose language model.

Installation

You can install the package from source using pip (run from the repository root):

pip install .

Usage

Basic Model Usage

from lucagplm import LucaGPLMModel, LucaGPLMTokenizer

# Load model
model = LucaGPLMModel.from_pretrained("Yuanfei/lucavirus-large-step3.8M")
tokenizer = LucaGPLMTokenizer.from_pretrained("Yuanfei/lucavirus-large-step3.8M")

# Encode a nucleotide (gene) sequence
seq = "ATCG"
inputs = tokenizer(seq, seq_type="gene", return_tensors="pt")
outputs = model(**inputs)

# Encode a protein sequence
seq = "NSQTA"
inputs = tokenizer(seq, seq_type="prot", return_tensors="pt")
outputs = model(**inputs)

print(outputs.last_hidden_state.shape)
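Downstream applications often need one fixed-size embedding per sequence; a common approach (not specific to LucaGPLM) is to mean-pool `last_hidden_state` over the token dimension. A minimal sketch, with plain Python lists standing in for a hypothetical `(seq_len, hidden_dim)` hidden-state matrix:

```python
# Toy stand-in for outputs.last_hidden_state[0]: 3 tokens, hidden size 4.
hidden_states = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [9.0, 10.0, 11.0, 12.0],
]

def mean_pool(states):
    """Average the token vectors into a single sequence embedding."""
    n = len(states)
    dim = len(states[0])
    return [sum(row[d] for row in states) / n for d in range(dim)]

embedding = mean_pool(hidden_states)
print(embedding)  # [5.0, 6.0, 7.0, 8.0]
```

With real model output, the same pooling would be `outputs.last_hidden_state.mean(dim=1)` in PyTorch, optionally masked by the attention mask so padding tokens are excluded.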

Pretraining Model Usage

The package also includes a pretraining model with multiple pretraining heads for different tasks:

from lucagplm import LucaGPLMForPretraining, LucaGPLMTokenizer

# Load pretraining model
model = LucaGPLMForPretraining.from_pretrained("path/to/pretraining/model")
tokenizer = LucaGPLMTokenizer.from_pretrained("path/to/pretraining/model")

# Example usage with pretraining tasks
seq = "ATCGATCGATCG"
inputs = tokenizer(seq, seq_type="gene", return_tensors="pt")

# Forward pass with pretraining heads
outputs = model(**inputs)

# Access logits for different pretraining tasks
print("Available task logits:", list(outputs['logits'].keys()))

# Token-level tasks (e.g., masked language modeling)
if 'token_level' in outputs['logits']:
    for task_name, logits in outputs['logits']['token_level'].items():
        print(f"Token-level task '{task_name}' logits shape:", logits.shape)

# Span-level tasks
if 'span_level' in outputs['logits']:
    for task_name, logits in outputs['logits']['span_level'].items():
        print(f"Span-level task '{task_name}' logits shape:", logits.shape)

# Sequence-level tasks
if 'seq_level' in outputs['logits']:
    for task_name, logits in outputs['logits']['seq_level'].items():
        print(f"Sequence-level task '{task_name}' logits shape:", logits.shape)
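Token-level logits typically have one row of scores per position; recovering per-position predictions is an argmax over the last axis. A toy illustration in plain Python, with hypothetical shapes and values not tied to the library:

```python
# Hypothetical token-level logits for one sequence of 3 positions
# over a vocabulary of 4 token ids.
logits = [
    [0.1, 2.5, 0.3, -1.0],
    [1.7, 0.2, 0.4, 0.0],
    [-0.5, 0.1, 3.2, 0.9],
]

def argmax(row):
    """Index of the largest logit in one position's score row."""
    return max(range(len(row)), key=lambda i: row[i])

predicted_ids = [argmax(row) for row in logits]
print(predicted_ids)  # [1, 0, 2]
```

With real tensors this is a one-liner (`logits.argmax(dim=-1)` in PyTorch); the predicted ids would then be decoded back to tokens with the tokenizer.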

Converting Old Models

The package includes a utility script to convert old LucaOneVirus checkpoints to the new LucaGPLM format:

Using the command-line tool:

# Convert without pretraining heads
lucagplm-convert --old-checkpoint /path/to/old/checkpoint --output-dir /path/to/new/model

# Convert with pretraining heads
lucagplm-convert --old-checkpoint /path/to/old/checkpoint --output-dir /path/to/new/model --with-pretraining-heads

Using the Python API:

from lucagplm.convert_model import convert_old_weights

# Convert without pretraining heads
convert_old_weights(
    old_checkpoint_path="/path/to/old/checkpoint",
    output_dir="/path/to/new/model",
    with_pretraining_heads=False
)

# Convert with pretraining heads
convert_old_weights(
    old_checkpoint_path="/path/to/old/checkpoint",
    output_dir="/path/to/new/model",
    with_pretraining_heads=True
)

Pretraining Tasks

The LucaGPLMForPretraining model includes multiple pretraining tasks organized into three levels:

  1. Token-level tasks: operate on individual tokens
    • mlm: Masked Language Modeling
    • erc: Entity Recognition and Classification
    • pos: Part-of-Speech tagging
  2. Span-level tasks: operate on spans of tokens
    • ner: Named Entity Recognition
    • sbo: Span Boundary Optimization
    • spr: Span Prediction and Recovery
  3. Sequence-level tasks: operate on entire sequences
    • cls: Sequence Classification
    • sim: Sequence Similarity
    • gen: Sequence Generation

Each task has its own prediction head (classifier) that can be fine-tuned for specific downstream applications.
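The three-level organization above can be sketched as a nested registry. This is an illustrative data structure only: the task names come from the list above, but the per-head output dimensions here are placeholders, not the model's real classifier modules:

```python
# Sketch of the three-level pretraining task organization.
# The integer values are placeholder output dimensions for illustration,
# not the actual head sizes used by LucaGPLMForPretraining.
PRETRAINING_TASKS = {
    "token_level": {"mlm": 64, "erc": 8, "pos": 12},  # per-token heads
    "span_level": {"ner": 8, "sbo": 64, "spr": 64},   # per-span heads
    "seq_level": {"cls": 2, "sim": 1, "gen": 64},     # per-sequence heads
}

def heads_for_level(level):
    """Return the task names registered at one level, sorted for stable output."""
    return sorted(PRETRAINING_TASKS[level])

for level in PRETRAINING_TASKS:
    print(level, heads_for_level(level))
```

This mirrors the nested `outputs['logits'][level][task]` dictionary shown in the pretraining example above, which is why the traversal there checks each level before iterating its tasks.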
