

Project description

LucaGPLM

LucaGPLM - The LUCA general purpose language model.

Installation

Install the released package from PyPI:

pip install lucagplm

Or install from a source checkout:

pip install .

Usage

Basic Model Usage

from lucagplm import LucaGPLMModel, LucaGPLMTokenizer

# Load the pretrained model and tokenizer
model = LucaGPLMModel.from_pretrained("Yuanfei/lucavirus-large-step3.8M")
tokenizer = LucaGPLMTokenizer.from_pretrained("Yuanfei/lucavirus-large-step3.8M")

# Nucleotide sequence
seq = "ATCG"
inputs = tokenizer(seq, seq_type="gene", return_tensors="pt")
outputs = model(**inputs)

# Protein sequence
seq = "NSQTA"
inputs = tokenizer(seq, seq_type="prot", return_tensors="pt")
outputs = model(**inputs)

print(outputs.last_hidden_state.shape)
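The last_hidden_state tensor holds one vector per token; a common way to reduce it to a single per-sequence embedding is mean pooling over the non-padding positions. Below is a minimal pure-Python sketch of the idea — plain lists stand in for the model's tensors so it runs without the model; with the real outputs you would apply the same computation to outputs.last_hidden_state and inputs["attention_mask"] using tensor ops:

```python
# Illustrative only: masked mean pooling over per-token hidden states.
# `hidden` stands in for last_hidden_state (seq_len x hidden_dim),
# `mask` for the attention mask (1 = real token, 0 = padding).

def mean_pool(hidden, mask):
    """Average the hidden vectors of non-padding positions."""
    dim = len(hidden[0])
    total = [0.0] * dim
    count = 0
    for vec, m in zip(hidden, mask):
        if m:  # skip padding positions (mask == 0)
            total = [t + v for t, v in zip(total, vec)]
            count += 1
    return [t / count for t in total]

# Two real tokens followed by one padding position
hidden = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
print(mean_pool(hidden, mask))  # [2.0, 3.0]
```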

Pretraining Model Usage

The package also includes a pretraining model with multiple pretraining heads for different tasks:

from lucagplm import LucaGPLMForPretraining, LucaGPLMTokenizer

# Load pretraining model
model = LucaGPLMForPretraining.from_pretrained("path/to/pretraining/model")
tokenizer = LucaGPLMTokenizer.from_pretrained("path/to/pretraining/model")

# Example usage with pretraining tasks
seq = "ATCGATCGATCG"
inputs = tokenizer(seq, seq_type="gene", return_tensors="pt")

# Forward pass with pretraining heads
outputs = model(**inputs)

# Access logits for different pretraining tasks
print("Available task logits:", list(outputs['logits'].keys()))

# Token-level tasks (e.g., masked language modeling)
if 'token_level' in outputs['logits']:
    for task_name, logits in outputs['logits']['token_level'].items():
        print(f"Token-level task '{task_name}' logits shape:", logits.shape)

# Span-level tasks
if 'span_level' in outputs['logits']:
    for task_name, logits in outputs['logits']['span_level'].items():
        print(f"Span-level task '{task_name}' logits shape:", logits.shape)

# Sequence-level tasks
if 'seq_level' in outputs['logits']:
    for task_name, logits in outputs['logits']['seq_level'].items():
        print(f"Sequence-level task '{task_name}' logits shape:", logits.shape)
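Assuming outputs['logits'] is a nested dict keyed first by level ('token_level', 'span_level', 'seq_level') and then by task name, as iterated above, a small helper can flatten it into "level/task" keys — handy for logging. A sketch; plain strings stand in for the logit tensors:

```python
def flatten_logits(logits):
    """Flatten {level: {task: tensor}} into {"level/task": tensor}."""
    flat = {}
    for level, tasks in logits.items():
        for task, value in tasks.items():
            flat[f"{level}/{task}"] = value
    return flat

# Dummy stand-ins for logit tensors
logits = {
    "token_level": {"mlm": "t0"},
    "seq_level": {"cls": "t1", "sim": "t2"},
}
print(sorted(flatten_logits(logits)))
# ['seq_level/cls', 'seq_level/sim', 'token_level/mlm']
```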

Converting Old Models

The package includes a utility script to convert old LucaOneVirus checkpoints to the new LucaGPLM format:

Using the command-line tool:

# Convert without pretraining heads
lucagplm-convert --old-checkpoint /path/to/old/checkpoint --output-dir /path/to/new/model

# Convert with pretraining heads
lucagplm-convert --old-checkpoint /path/to/old/checkpoint --output-dir /path/to/new/model --with-pretraining-heads

Using the Python API:

from lucagplm.convert_model import convert_old_weights

# Convert without pretraining heads
convert_old_weights(
    old_checkpoint_path="/path/to/old/checkpoint",
    output_dir="/path/to/new/model",
    with_pretraining_heads=False
)

# Convert with pretraining heads
convert_old_weights(
    old_checkpoint_path="/path/to/old/checkpoint",
    output_dir="/path/to/new/model",
    with_pretraining_heads=True
)
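To convert several checkpoints in one go, convert_old_weights can be paired with a small discovery helper. This is a sketch: the assumption that an old checkpoint directory contains a pytorch_model.bin file is illustrative, not something the package documents:

```python
from pathlib import Path

def find_checkpoints(root):
    """Return subdirectories of `root` that look like checkpoints.

    Assumes (hypothetically) that each old checkpoint directory
    contains a `pytorch_model.bin` file.
    """
    return sorted(p.parent for p in Path(root).glob("*/pytorch_model.bin"))

# for ckpt in find_checkpoints("/path/to/old/checkpoints"):
#     convert_old_weights(
#         old_checkpoint_path=str(ckpt),
#         output_dir=f"/path/to/new/{ckpt.name}",
#         with_pretraining_heads=False,
#     )
```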

Pretraining Tasks

The LucaGPLMForPretraining model includes multiple pretraining tasks organized into three levels:

  1. Token-level tasks: operate on individual tokens
    • mlm: Masked Language Modeling
    • erc: Entity Recognition and Classification
    • pos: Part-of-Speech tagging
  2. Span-level tasks: operate on spans of tokens
    • ner: Named Entity Recognition
    • sbo: Span Boundary Optimization
    • spr: Span Prediction and Recovery
  3. Sequence-level tasks: operate on entire sequences
    • cls: Sequence Classification
    • sim: Sequence Similarity
    • gen: Sequence Generation

Each task has its own prediction head (classifier) that can be fine-tuned for specific downstream applications.
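When routing labels or losses to the right head, it can help to have the nine tasks above as a lookup table. A small sketch that mirrors the three-level list (the mapping itself is taken from the list; the helper name is illustrative):

```python
# Task name -> the level its prediction head operates at,
# mirroring the three-level task list above.
TASK_LEVELS = {
    "mlm": "token_level", "erc": "token_level", "pos": "token_level",
    "ner": "span_level",  "sbo": "span_level",  "spr": "span_level",
    "cls": "seq_level",   "sim": "seq_level",   "gen": "seq_level",
}

def level_of(task):
    """Return the level for a known task name; raises KeyError otherwise."""
    return TASK_LEVELS[task]

print(level_of("mlm"))  # token_level
```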



Download files


Source Distribution

lucagplm-1.1.3.tar.gz (26.5 kB)

Uploaded Source

Built Distribution


lucagplm-1.1.3-py3-none-any.whl (25.4 kB)

Uploaded Python 3

File details

Details for the file lucagplm-1.1.3.tar.gz.

File metadata

  • Download URL: lucagplm-1.1.3.tar.gz
  • Size: 26.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.7

File hashes

Hashes for lucagplm-1.1.3.tar.gz
  • SHA256: 5d2204f93ff43f6d40e9d367d1ab75d73470ddb492aa9287aa0ba87e4b2d0815
  • MD5: cf95292b3c8db3e941368b5b650d5ec2
  • BLAKE2b-256: 1dc9dbf812613637e731cb59c91c2b7904d1be0355932333c3be8bfab38cf290

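After downloading an artifact, you can check it against the published SHA256 digest with Python's standard hashlib module (a generic sketch, not specific to this package):

```python
import hashlib

def sha256_of(path):
    """Compute the SHA256 hex digest of a file, reading in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Compare against the digest published above, e.g.:
# assert sha256_of("lucagplm-1.1.3.tar.gz") == (
#     "5d2204f93ff43f6d40e9d367d1ab75d73470ddb492aa9287aa0ba87e4b2d0815"
# )
```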

File details

Details for the file lucagplm-1.1.3-py3-none-any.whl.

File metadata

  • Download URL: lucagplm-1.1.3-py3-none-any.whl
  • Size: 25.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.7

File hashes

Hashes for lucagplm-1.1.3-py3-none-any.whl
  • SHA256: a1f3019243f3f468664736c009646271e37bed812cdf20cda5013ceee40a8756
  • MD5: 8bd3c883957fa1dc1a97f66fc3f17fdd
  • BLAKE2b-256: 9a5fd6b5195503f470416b751b80cc991f7e06782ee70b40fec4feef9a75a232

