
LucaGPLM

LucaGPLM - The LUCA general purpose language model.

Installation

You can install the package from source using pip:

pip install .

Usage

Basic Model Usage

from lucagplm import LucaGPLMModel, LucaGPLMTokenizer

# Load the pretrained model and tokenizer
model = LucaGPLMModel.from_pretrained("Yuanfei/lucavirus-large-step3.8M")
tokenizer = LucaGPLMTokenizer.from_pretrained("Yuanfei/lucavirus-large-step3.8M")

# Encode a nucleotide sequence
seq = "ATCG"
inputs = tokenizer(seq, seq_type="gene", return_tensors="pt")
gene_outputs = model(**inputs)
print(gene_outputs.last_hidden_state.shape)

# Encode a protein sequence
seq = "NSQTA"
inputs = tokenizer(seq, seq_type="prot", return_tensors="pt")
prot_outputs = model(**inputs)
print(prot_outputs.last_hidden_state.shape)
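If you need one fixed-size embedding per sequence rather than per-token hidden states, a common approach is to mean-pool the last hidden state over non-padding positions. This is not part of the package API; the sketch below assumes the usual Hugging Face-style `last_hidden_state` and `attention_mask` shapes and demonstrates the pooling on dummy tensors:

```python
import torch

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).float()     # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)  # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)        # (batch, 1), avoid divide-by-zero
    return summed / counts

# Dummy tensors shaped like model outputs: batch=2, seq_len=6, hidden=8
hidden = torch.randn(2, 6, 8)
mask = torch.tensor([[1, 1, 1, 1, 0, 0],
                     [1, 1, 1, 1, 1, 1]])
emb = mean_pool(hidden, mask)
print(emb.shape)  # torch.Size([2, 8])
```

In real use you would pass `outputs.last_hidden_state` and `inputs["attention_mask"]` instead of the dummy tensors.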

Pretraining Model Usage

The package also includes a pretraining model with multiple pretraining heads for different tasks:

from lucagplm import LucaGPLMForPretraining, LucaGPLMTokenizer

# Load pretraining model
model = LucaGPLMForPretraining.from_pretrained("path/to/pretraining/model")
tokenizer = LucaGPLMTokenizer.from_pretrained("path/to/pretraining/model")

# Example usage with pretraining tasks
seq = "ATCGATCGATCG"
inputs = tokenizer(seq, seq_type="gene", return_tensors="pt")

# Forward pass with pretraining heads
outputs = model(**inputs)

# Access logits for different pretraining tasks
print("Available task logits:", list(outputs['logits'].keys()))

# Token-level tasks (e.g., masked language modeling)
if 'token_level' in outputs['logits']:
    for task_name, logits in outputs['logits']['token_level'].items():
        print(f"Token-level task '{task_name}' logits shape:", logits.shape)

# Span-level tasks
if 'span_level' in outputs['logits']:
    for task_name, logits in outputs['logits']['span_level'].items():
        print(f"Span-level task '{task_name}' logits shape:", logits.shape)

# Sequence-level tasks
if 'seq_level' in outputs['logits']:
    for task_name, logits in outputs['logits']['seq_level'].items():
        print(f"Sequence-level task '{task_name}' logits shape:", logits.shape)
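For a token-level head such as MLM, the per-position prediction is typically the argmax over the vocabulary dimension. A minimal sketch on a dummy logits tensor (shapes are illustrative; in practice the tensor would come from `outputs['logits']['token_level']`):

```python
import torch

# Dummy token-level logits: (batch=1, seq_len=6, vocab_size=10)
logits = torch.randn(1, 6, 10)

# Predicted token id at each position
pred_ids = logits.argmax(dim=-1)
print(pred_ids.shape)  # torch.Size([1, 6])
```

The predicted ids can then be mapped back to tokens with the tokenizer's vocabulary.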

Converting Old Models

The package includes a utility script to convert old LucaOneVirus checkpoints to the new LucaGPLM format:

Using the command-line tool:

# Convert without pretraining heads
lucagplm-convert --old-checkpoint /path/to/old/checkpoint --output-dir /path/to/new/model

# Convert with pretraining heads
lucagplm-convert --old-checkpoint /path/to/old/checkpoint --output-dir /path/to/new/model --with-pretraining-heads

Using the Python API:

from lucagplm.convert_model import convert_old_weights

# Convert without pretraining heads
convert_old_weights(
    old_checkpoint_path="/path/to/old/checkpoint",
    output_dir="/path/to/new/model",
    with_pretraining_heads=False
)

# Convert with pretraining heads
convert_old_weights(
    old_checkpoint_path="/path/to/old/checkpoint",
    output_dir="/path/to/new/model",
    with_pretraining_heads=True
)

Pretraining Tasks

The LucaGPLMForPretraining model includes multiple pretraining tasks organized into three levels:

  1. Token-level tasks: Tasks that operate on individual tokens

    • mlm: Masked Language Modeling
    • erc: Entity Recognition and Classification
    • pos: Part-of-Speech tagging
  2. Span-level tasks: Tasks that operate on spans of tokens

    • ner: Named Entity Recognition
    • sbo: Span Boundary Optimization
    • spr: Span Prediction and Recovery
  3. Sequence-level tasks: Tasks that operate on entire sequences

    • cls: Sequence Classification
    • sim: Sequence Similarity
    • gen: Sequence Generation

Each task has its own prediction head (classifier) that can be fine-tuned for specific downstream applications.
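A prediction head of this kind is usually just a small classifier on top of the encoder's hidden states. The sketch below is a hypothetical illustration of a token-level head (the class name and shapes are assumptions, not the package's actual implementation):

```python
import torch
from torch import nn

class TokenLevelHead(nn.Module):
    """Illustrative per-task head: a linear classifier applied
    to every token's hidden state."""
    def __init__(self, hidden_size: int, num_labels: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # (batch, seq_len, hidden) -> (batch, seq_len, num_labels)
        return self.classifier(hidden_states)

head = TokenLevelHead(hidden_size=8, num_labels=5)
logits = head(torch.randn(2, 6, 8))
print(logits.shape)  # torch.Size([2, 6, 5])
```

Span-level and sequence-level heads follow the same pattern but classify pooled span or sequence representations instead of individual tokens.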

Download files

Download the file for your platform.

Source Distribution

lucagplm-1.1.1.tar.gz (29.0 kB)

Built Distribution

lucagplm-1.1.1-py3-none-any.whl (27.8 kB)

File details

Details for the file lucagplm-1.1.1.tar.gz.

File metadata

  • Download URL: lucagplm-1.1.1.tar.gz
  • Size: 29.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.7

File hashes

Hashes for lucagplm-1.1.1.tar.gz:

  • SHA256: b507112910932b18fd3e3f999c7f342e2e33b79181fa6faaca1742d540f02e89
  • MD5: 12ee84b93abdba331271f9406eab0706
  • BLAKE2b-256: 94af8f866f0ceae423291d2cb12d41e11dc910609576b59551a4b77f56633db1

File details

Details for the file lucagplm-1.1.1-py3-none-any.whl.

File metadata

  • Download URL: lucagplm-1.1.1-py3-none-any.whl
  • Size: 27.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.7

File hashes

Hashes for lucagplm-1.1.1-py3-none-any.whl:

  • SHA256: 982dd8838b5221278fb963604f9058169110333f69993c956da9abf57f097a35
  • MD5: fd0dcf058b9eaf402d6d8ce13f41467a
  • BLAKE2b-256: 715c3ca06187a3bd2b2b34a39552027abb1b6cd7f11a33a2a96391e99937602e
