LucaGPLM - The LUCA general purpose language model.

Project description

LucaGPLM


Installation

You can install the package from source using pip:

pip install .

Usage

Basic Model Usage

from lucagplm import LucaGPLMModel, LucaGPLMTokenizer

# Load model
model = LucaGPLMModel.from_pretrained("Yuanfei/lucavirus-large-step3.8M")
tokenizer = LucaGPLMTokenizer.from_pretrained("Yuanfei/lucavirus-large-step3.8M")

# Encode a nucleotide (gene) sequence
seq = "ATCG"
inputs = tokenizer(seq, seq_type="gene", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

# Encode a protein sequence
seq = "NSQTA"
inputs = tokenizer(seq, seq_type="prot", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
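In Hugging Face-style models, `last_hidden_state` typically has shape (batch_size, sequence_length, hidden_size), i.e. one vector per token. A common way to reduce this to a single per-sequence embedding is to mean-pool over the token axis. The sketch below is not part of the package API; it illustrates the pooling arithmetic with plain Python lists standing in for the tensor (`mean_pool` is a hypothetical helper name):

```python
def mean_pool(hidden_states):
    """Average token vectors into one embedding per sequence.

    hidden_states: nested lists of shape
    (batch_size, sequence_length, hidden_size).
    """
    pooled = []
    for seq in hidden_states:
        hidden_size = len(seq[0])
        pooled.append([
            sum(token[d] for token in seq) / len(seq)
            for d in range(hidden_size)
        ])
    return pooled

# Two sequences, three tokens each, hidden size 2
batch = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[0.0, 0.0], [2.0, 2.0], [4.0, 4.0]],
]
print(mean_pool(batch))  # [[3.0, 4.0], [2.0, 2.0]]
```

With real model outputs the same reduction is usually a one-liner, e.g. `outputs.last_hidden_state.mean(dim=1)` in PyTorch.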

Pretraining Model Usage

The package also includes a pretraining model with multiple pretraining heads for different tasks:

from lucagplm import LucaGPLMForPretraining, LucaGPLMTokenizer

# Load pretraining model
model = LucaGPLMForPretraining.from_pretrained("path/to/pretraining/model")
tokenizer = LucaGPLMTokenizer.from_pretrained("path/to/pretraining/model")

# Example usage with pretraining tasks
seq = "ATCGATCGATCG"
inputs = tokenizer(seq, seq_type="gene", return_tensors="pt")

# Forward pass with pretraining heads
outputs = model(**inputs)

# Access logits for different pretraining tasks
print("Available task logits:", list(outputs['logits'].keys()))

# Token-level tasks (e.g., masked language modeling)
if 'token_level' in outputs['logits']:
    for task_name, logits in outputs['logits']['token_level'].items():
        print(f"Token-level task '{task_name}' logits shape:", logits.shape)

# Span-level tasks
if 'span_level' in outputs['logits']:
    for task_name, logits in outputs['logits']['span_level'].items():
        print(f"Span-level task '{task_name}' logits shape:", logits.shape)

# Sequence-level tasks
if 'seq_level' in outputs['logits']:
    for task_name, logits in outputs['logits']['seq_level'].items():
        print(f"Sequence-level task '{task_name}' logits shape:", logits.shape)
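The traversal above depends only on the nested level -> task layout of `outputs['logits']`. A minimal stand-in (mock task names and shape strings, no model required) shows the same pattern; `collect_tasks` is a hypothetical helper, not package API:

```python
# Mock of the nested logits structure described above:
# top-level keys are task levels, inner keys are task names.
mock_logits = {
    "token_level": {"mlm": "(1, 12, 39)"},
    "span_level": {"ner": "(1, 12, 5)"},
    "seq_level": {"cls": "(1, 2)"},
}

def collect_tasks(logits):
    """Flatten the level -> task nesting into (level, task) pairs."""
    return [
        (level, task)
        for level, tasks in logits.items()
        for task in tasks
    ]

print(collect_tasks(mock_logits))
# [('token_level', 'mlm'), ('span_level', 'ner'), ('seq_level', 'cls')]
```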

Converting Old Models

The package includes a utility script to convert old LucaOneVirus checkpoints to the new LucaGPLM format:

Using the command-line tool:

# Convert without pretraining heads
lucagplm-convert --old-checkpoint /path/to/old/checkpoint --output-dir /path/to/new/model

# Convert with pretraining heads
lucagplm-convert --old-checkpoint /path/to/old/checkpoint --output-dir /path/to/new/model --with-pretraining-heads

Using the Python API:

from lucagplm.convert_model import convert_old_weights

# Convert without pretraining heads
convert_old_weights(
    old_checkpoint_path="/path/to/old/checkpoint",
    output_dir="/path/to/new/model",
    with_pretraining_heads=False
)

# Convert with pretraining heads
convert_old_weights(
    old_checkpoint_path="/path/to/old/checkpoint",
    output_dir="/path/to/new/model",
    with_pretraining_heads=True
)

Pretraining Tasks

The LucaGPLMForPretraining model includes multiple pretraining tasks organized into three levels:

  1. Token-level tasks: Tasks that operate on individual tokens

    • mlm: Masked Language Modeling
    • erc: Entity Recognition and Classification
    • pos: Part-of-Speech tagging
  2. Span-level tasks: Tasks that operate on spans of tokens

    • ner: Named Entity Recognition
    • sbo: Span Boundary Optimization
    • spr: Span Prediction and Recovery
  3. Sequence-level tasks: Tasks that operate on entire sequences

    • cls: Sequence Classification
    • sim: Sequence Similarity
    • gen: Sequence Generation

Each task has its own prediction head (classifier) that can be fine-tuned for specific downstream applications.
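Viewed as data, the catalogue above is a level -> task-code mapping. The sketch below mirrors that organization and resolves a task code to its level; the `PRETRAINING_TASKS` dict and `task_level` function are illustrative names (only the task codes and descriptions come from the list above), not part of the package:

```python
# Three-level task catalogue, transcribed from the list above.
PRETRAINING_TASKS = {
    "token_level": {
        "mlm": "Masked Language Modeling",
        "erc": "Entity Recognition and Classification",
        "pos": "Part-of-Speech tagging",
    },
    "span_level": {
        "ner": "Named Entity Recognition",
        "sbo": "Span Boundary Optimization",
        "spr": "Span Prediction and Recovery",
    },
    "seq_level": {
        "cls": "Sequence Classification",
        "sim": "Sequence Similarity",
        "gen": "Sequence Generation",
    },
}

def task_level(code):
    """Return the level a task code belongs to, or None if unknown."""
    for level, tasks in PRETRAINING_TASKS.items():
        if code in tasks:
            return level
    return None

print(task_level("sbo"))  # span_level
```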

Download files

Source Distribution

lucagplm-1.1.2.tar.gz (29.1 kB)

Uploaded Source

Built Distribution

lucagplm-1.1.2-py3-none-any.whl (27.8 kB)

Uploaded Python 3

File details

Details for the file lucagplm-1.1.2.tar.gz.

File metadata

  • Download URL: lucagplm-1.1.2.tar.gz
  • Upload date:
  • Size: 29.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.7

File hashes

Hashes for lucagplm-1.1.2.tar.gz:

  • SHA256: 09ce627eafdd630071c3863776aa32153548f9844030b16c93962d952b82b80a
  • MD5: 3d06d88301f8e5120dd870413efafe50
  • BLAKE2b-256: 589d9f55fdc5e14b9b52fd54ed7be21fa5eb56480e2ce330a0296fe2cce806fd

File details

Details for the file lucagplm-1.1.2-py3-none-any.whl.

File metadata

  • Download URL: lucagplm-1.1.2-py3-none-any.whl
  • Upload date:
  • Size: 27.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.7

File hashes

Hashes for lucagplm-1.1.2-py3-none-any.whl:

  • SHA256: 7cb98ae87f95217e1a1ed08d81fddbcd49c4dbbf9950cf073184510ab2992fc6
  • MD5: 04e08598752a3a0e0b36c1a667375ae1
  • BLAKE2b-256: 9c5c844eba5d958872e07ffef6aebb1ee6b5248a243492cf97817a6ec9a70ee8
