# ProGen

Implementation of ProGen in Pytorch, from the paper "ProGen: Language Modeling for Protein Generation".

A GPT-style autoregressive language model for protein sequences.
## Appreciation

- Lucidrains
- Agorians
## Install

```bash
pip install progen-torch
```
## Usage

```python
import torch
from progen.model import ProGen

# Random token ids standing in for an encoded protein sequence
x = torch.randint(0, 100, (1, 1024))
# Initialize the model with specific parameters
model = ProGen(
    num_tokens=100,        # size of the vocabulary
    dim=512,               # dimension of the embeddings
    seq_len=1024,          # length of the sequences
    depth=6,               # number of layers in the model
    window_size=256,       # window size for local attention
    global_mlp_depth=2,    # depth of the MLP in the global attention mechanism
    heads=8,               # number of attention heads
    dim_head=512,          # dimension of each attention head
    ff_mult=4,             # multiplier for the feed-forward hidden layer size
    ff_glu=True,           # use a GLU activation in the feed-forward network
    attn_dim=None,         # dimension of the attention mechanism (None defaults to `dim`)
    clamp_gate=True,       # clamp the gate values in the GLU activation
    shift_tokens=True,     # shift tokens for the causal attention mechanism
    dropout=0.1,           # dropout rate
)
# Forward pass through the model
logits = model(x)

# The output holds logits over the vocabulary for every position in the input
# Shape: (batch_size, sequence_length, num_tokens)
print(logits.shape)  # torch.Size([1, 1024, 100])
```
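The logits can be fed back into the model to sample new sequences autoregressively. Below is a minimal sampling sketch; `generate` is an illustrative helper (not part of this package) and assumes the model accepts prompts shorter than `seq_len`:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, prompt, steps=64, temperature=1.0):
    """Extend `prompt` (shape: (1, t)) by `steps` sampled tokens."""
    seq = prompt
    for _ in range(steps):
        logits = model(seq)                    # (1, t, num_tokens)
        logits = logits[:, -1] / temperature   # logits for the next position
        probs = F.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)  # sample one id
        seq = torch.cat((seq, next_token), dim=-1)
    return seq

sample = generate(model, x[:, :16], steps=32)
print(sample.shape)  # torch.Size([1, 48])
```

Lowering `temperature` concentrates probability mass on the most likely amino acids; raising it yields more diverse (but noisier) sequences.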
## Dataset Strategy

Here is a table of the datasets used in the paper, with metadata and source links:

| Dataset | Description | Source |
|---|---|---|
| Uniparc | Protein sequences aggregated from many source databases | https://www.uniprot.org/uniparc/ |
| UniProtKB | Protein sequences and annotations | https://www.uniprot.org/uniprot/ |
| SWISS-PROT | Manually curated protein sequence database | https://www.uniprot.org/swiss-prot/ |
| TrEMBL | Computationally annotated protein sequences | https://www.uniprot.org/trembl/ |
| Pfam | Database of protein families | https://pfam.xfam.org/ |
| NCBI Taxonomy | Taxonomic classification of organisms | https://www.ncbi.nlm.nih.gov/taxonomy |
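These databases store proteins as amino-acid strings, while the model consumes integer token ids. A minimal character-level tokenizer might look like the sketch below; the vocabulary layout and special tokens are illustrative assumptions, not the encoding used in the paper:

```python
import torch

# Character-level vocabulary over the 20 standard amino acids (illustrative)
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD, BOS, EOS = 0, 1, 2
stoi = {aa: i + 3 for i, aa in enumerate(AMINO_ACIDS)}

def encode(seq, max_len=1024):
    """Map an amino-acid string to a fixed-length list of token ids."""
    ids = [BOS] + [stoi[aa] for aa in seq] + [EOS]
    ids = ids[:max_len]
    return ids + [PAD] * (max_len - len(ids))  # right-pad to max_len

x = torch.tensor([encode("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")])
print(x.shape)  # torch.Size([1, 1024])
```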
Here is a diagram showing the data preprocessing flow:

```mermaid
graph TD
    A[Uniparc] --> B[Filter and merge]
    C[UniprotKB] --> B
    D[SWISS-PROT] --> B
    E[TrEMBL] --> B
    F[Pfam] --> B
    G[NCBI taxonomy] --> B
    B --> H[Train/test split]
    H --> I[Train set]
    H --> J[ID test set]
    H --> K[OOD test set]
```
The Uniparc, UniprotKB, SWISS-PROT, TrEMBL, Pfam, and NCBI taxonomy datasets are filtered and merged in step B. The aggregated dataset is then split into training, in-distribution test, and out-of-distribution test sets in step H.
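As a rough illustration of that flow, here is a hedged sketch of the filter/merge/split step. The record format, deduplication rule, and the family-held-out OOD split are assumptions consistent with the diagram, not the paper's exact procedure:

```python
import random

def merge_and_split(datasets, test_frac=0.05, ood_families=None, seed=0):
    """Merge records from several source datasets, deduplicate by sequence,
    then split into train / in-distribution (ID) test / out-of-distribution
    (OOD) test sets. Records are dicts like {"sequence": ..., "family": ...}.
    """
    ood_families = ood_families or set()
    seen, merged = set(), []
    for records in datasets:
        for rec in records:
            if rec["sequence"] not in seen:  # drop exact duplicate sequences
                seen.add(rec["sequence"])
                merged.append(rec)

    # OOD test set: whole protein families held out from training entirely
    ood_test = [r for r in merged if r["family"] in ood_families]
    rest = [r for r in merged if r["family"] not in ood_families]

    # ID test set: a random fraction of the remaining records
    random.Random(seed).shuffle(rest)
    n_test = int(len(rest) * test_frac)
    return rest[n_test:], rest[:n_test], ood_test  # train, id_test, ood_test
```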
## Architecture

Todo
## License

MIT
## Citations

Madani et al., "ProGen: Language Modeling for Protein Generation", 2020.