Progen

Implementation of ProGen in PyTorch, from the paper "ProGen: Language Modeling for Protein Generation".

GPT for protein sequences.
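As a reminder of what "GPT for protein sequences" means in practice, here is the usual autoregressive factorization and the matching training loss. The notation is an assumption made for illustration (it does not come from this repository): $x_1, \dots, x_n$ are the amino-acid tokens of one sequence and $\theta$ are the model parameters.

```latex
p_\theta(x) = \prod_{i=1}^{n} p_\theta(x_i \mid x_{<i}),
\qquad
\mathcal{L}(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \log p_\theta(x_i \mid x_{<i})
```

Each factor is a distribution over the vocabulary, which is why the model returns one vector of `num_tokens` logits per position.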

Paper Link

Appreciation

  • Lucidrains
  • Agorians

Install

pip install progen-torch

Usage

import torch
from progen.model import ProGen

x = torch.randint(0, 100, (1, 1024))

# Initialize the model with specific parameters
model = ProGen(
    num_tokens=100,  # The size of the vocabulary
    dim=512,  # The dimension of the embeddings
    seq_len=1024,  # The length of the sequences
    depth=6,  # The number of layers in the model
    window_size=256,  # The size of the window for local attention
    global_mlp_depth=2,  # The depth of the MLP in the global attention mechanism
    heads=8,  # The number of attention heads
    dim_head=512,  # The dimension of each attention head
    ff_mult=4,  # The multiplier for the feed-forward network's hidden layer size
    ff_glu=True,  # Whether to use a GLU activation in the feed-forward network
    attn_dim=None,  # The dimension of the attention mechanism (None means it defaults to `dim`)
    clamp_gate=True,  # Whether to clamp the gate values in the GLU activation
    shift_tokens=True,  # Whether to shift the tokens for the causal attention mechanism
    dropout=0.1,  # The dropout rate
)

# Forward pass through the model
logits = model(x)

# The output is the logits for each token in the vocabulary, for each position in the input sequences
# Shape: (batch_size, sequence_length, num_tokens)
print(logits.shape)  # Should print: torch.Size([1, 1024, 100])
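The example above feeds random token ids to the model. For real proteins, each amino-acid letter has to be mapped to an integer id below `num_tokens` first. The following is a minimal sketch of such a mapping; the vocabulary layout, the special-token ids, and the `encode`/`decode` helpers are all hypothetical and are not part of `progen-torch`.

```python
# Hypothetical vocabulary for illustration; progen-torch does not ship this.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
PAD_ID, BOS_ID, EOS_ID, UNK_ID = 0, 1, 2, 3  # assumed special-token ids

TOKEN_TO_ID = {aa: i + 4 for i, aa in enumerate(AMINO_ACIDS)}
ID_TO_TOKEN = {i: aa for aa, i in TOKEN_TO_ID.items()}
VOCAB_SIZE = 4 + len(AMINO_ACIDS)  # 24; must not exceed the model's num_tokens


def encode(sequence, seq_len):
    """Map a protein string to a fixed-length list of token ids."""
    ids = [BOS_ID]
    ids += [TOKEN_TO_ID.get(ch, UNK_ID) for ch in sequence.upper()]
    ids.append(EOS_ID)
    ids = ids[:seq_len]  # truncation may drop EOS for over-long sequences
    return ids + [PAD_ID] * (seq_len - len(ids))  # right-pad to seq_len


def decode(ids):
    """Map token ids back to a protein string, skipping special tokens."""
    return "".join(ID_TO_TOKEN[i] for i in ids if i in ID_TO_TOKEN)


ids = encode("MKTAYIAKQR", seq_len=16)
print(ids)
print(decode(ids))
```

With `seq_len=1024`, `torch.tensor([encode(sequence, 1024)])` gives a `(1, 1024)` integer tensor that can stand in for the random `x` in the example above.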

Dataset Strategy

Here is a table of the datasets used in the paper with metadata and source links:

| Dataset | Description | Source |
| --- | --- | --- |
| UniParc | Contains protein sequences from various sources | https://www.uniprot.org/uniparc/ |
| UniProtKB | Contains protein sequences and annotations | https://www.uniprot.org/uniprot/ |
| Swiss-Prot | Curated protein sequence database | https://www.uniprot.org/swiss-prot/ |
| TrEMBL | Computer-annotated protein sequences | https://www.uniprot.org/trembl/ |
| Pfam | Database of protein families | https://pfam.xfam.org/ |
| NCBI taxonomy | Taxonomic classification of organisms | https://www.ncbi.nlm.nih.gov/taxonomy |

Here is a diagram showing the data preprocessing flow:

graph TD
    A[UniParc] --> B[Filter and merge]
    C[UniProtKB] --> B
    D[Swiss-Prot] --> B
    E[TrEMBL] --> B
    F[Pfam] --> B
    G[NCBI taxonomy] --> B
    B --> H[Train/test split]
    H --> I[Train set]
    H --> J[ID test set]
    H --> K[OOD test set]

The UniParc, UniProtKB, Swiss-Prot, TrEMBL, Pfam, and NCBI taxonomy datasets are filtered and merged (step B). The merged dataset is then split into a training set, an in-distribution (ID) test set, and an out-of-distribution (OOD) test set (step H).
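The two steps can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the preprocessing code of this package: the record fields `sequence` and `family`, the length filter, and the choice to define the OOD test set as whole held-out protein families are all hypothetical.

```python
import random


def merge_and_filter(sources, max_len=1024):
    """Step B (assumed): merge sources, drop duplicates and over-long sequences."""
    seen = set()
    merged = []
    for records in sources:
        for rec in records:
            seq = rec["sequence"]
            if len(seq) > max_len or seq in seen:
                continue
            seen.add(seq)
            merged.append(rec)
    return merged


def split_dataset(records, ood_families, id_test_fraction=0.1, seed=0):
    """Step H (assumed): held-out families form the OOD test set; the rest is
    shuffled and split at random into train and ID test sets."""
    ood_families = set(ood_families)
    ood_test = [r for r in records if r["family"] in ood_families]
    rest = [r for r in records if r["family"] not in ood_families]
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(rest)
    n_id = int(len(rest) * id_test_fraction)
    return rest[n_id:], rest[:n_id], ood_test


# Toy records with made-up sequences and family labels.
source_a = [
    {"sequence": "MKT", "family": "PF_A"},
    {"sequence": "MAA", "family": "PF_B"},
]
source_b = [
    {"sequence": "MKT", "family": "PF_A"},  # duplicate of a record in source_a
    {"sequence": "MGG", "family": "PF_C"},
    {"sequence": "MLL", "family": "PF_A"},
]

records = merge_and_filter([source_a, source_b])
train, id_test, ood_test = split_dataset(
    records, ood_families={"PF_C"}, id_test_fraction=0.34
)
print(len(train), len(id_test), len(ood_test))
```

Holding out entire families keeps the OOD test sequences from sharing a family with anything in the training set, whereas the ID test set is drawn from the same families as the training data.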

Architecture

Todo

License

MIT

Citations
