Protein secondary structure prediction (3-state & 8-state) using XGBoost and PSSM features
S8kPred — Protein Secondary Structure Prediction
S8kPred predicts protein secondary structure directly from amino acid sequence using XGBoost models trained on PSSM (Position-Specific Scoring Matrix) and tripeptide propensity features.
- 3-state: Helix (H), Beta-strand (E), Coil/Loop (L)
- 8-state: H, G, I, E, B, T, S, L (the eight classic DSSP classes, with L standing for loop/coil)
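To make the PSSM feature idea concrete, here is a minimal sketch of sliding-window encoding: each residue is described by the PSSM rows of its neighbours, flattened into one vector. The 17-residue default is inferred from the `PSSM_Features_ML_17W.csv` output file name; the zero padding at the termini and the function name `window_features` are assumptions made for illustration, not S8kPred's actual implementation.

```python
from typing import List


def window_features(pssm: List[List[float]], window: int = 17) -> List[List[float]]:
    """Flatten a window of PSSM rows centred on each residue into one vector."""
    if window % 2 == 0:
        raise ValueError("window must be odd so it can be centred on a residue")
    half = window // 2
    n_cols = len(pssm[0]) if pssm else 0
    pad = [0.0] * n_cols  # assumption: positions past the termini are zero-filled
    features = []
    for i in range(len(pssm)):
        row: List[float] = []
        for j in range(i - half, i + half + 1):
            row.extend(pssm[j] if 0 <= j < len(pssm) else pad)
        features.append(row)
    return features


# Toy 3-residue "PSSM" with 2 columns instead of the usual 20.
toy = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
feats = window_features(toy, window=3)
for vector in feats:
    print(vector)
```

With a real 20-column PSSM and a 17-residue window, each residue would get 340 values under this scheme.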
Requirements
| Dependency | Purpose |
|---|---|
| Python ≥ 3.9 | Runtime |
| numpy, pandas, xgboost, scikit-learn | Core ML pipeline |
| biopython | FASTA I/O for PSI-BLAST |
| NCBI PSI-BLAST | PSSM generation (external binary) |
| UniRef50 (or similar) BLAST database | PSI-BLAST database |
| biotite, matplotlib (optional) | Cartoon structure plots |
Installation
From PyPI (recommended)
pip install s8kpred
With cartoon plot support
pip install "s8kpred[plot]" # quotes keep shells such as zsh from expanding the brackets
From GitHub (latest development version)
pip install git+https://github.com/mayank2801/s8kpred.git
From source
git clone https://github.com/mayank2801/s8kpred.git
cd s8kpred
pip install -e . # editable install
pip install -e ".[plot]" # with plotting extras
Setting up PSI-BLAST
S8kPred requires NCBI PSI-BLAST to generate evolutionary features. You have two options:
Option A — System install
# Ubuntu / Debian
sudo apt install ncbi-blast+
# macOS (Homebrew)
brew install blast
# Conda
conda install -c bioconda -c conda-forge "blast>=2.14"
Option B — Manual download
Download the NCBI BLAST+ toolkit from: https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
Then either add the bin/ folder to your PATH or pass the full path via --psiblast.
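Before running a prediction it can help to confirm that the binary is reachable. The sketch below uses only the standard library; the helper names `find_psiblast` and `psiblast_version` are hypothetical and not part of S8kPred. It relies on the `-version` flag that BLAST+ programs accept.

```python
import shutil
import subprocess
from typing import Optional


def find_psiblast(binary: str = "psiblast") -> Optional[str]:
    """Return the resolved path of the PSI-BLAST binary, or None if not found."""
    return shutil.which(binary)


def psiblast_version(path: str) -> str:
    """Run `psiblast -version` and return the first line it prints."""
    out = subprocess.run([path, "-version"], capture_output=True, text=True)
    lines = out.stdout.splitlines()
    return lines[0] if lines else ""


path = find_psiblast()
if path is None:
    print("psiblast not found on PATH; install BLAST+ or pass --psiblast")
else:
    print(path, "->", psiblast_version(path))
```

`find_psiblast` also accepts a full path such as the one you would pass to `--psiblast`.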
Setting up a BLAST database
S8kPred works best with UniRef50. Download and format it:
# Download
wget https://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref50/uniref50.fasta.gz
gunzip uniref50.fasta.gz
# Build BLAST database
mkdir -p ~/blast_dbs/uniref50
makeblastdb -in uniref50.fasta \
-dbtype prot \
-out ~/blast_dbs/uniref50/uniref50 \
-title "UniRef50"
Then point s8kpred at it:
export S8KPRED_BLASTDB=~/blast_dbs/uniref50/uniref50
or pass --blastdb ~/blast_dbs/uniref50/uniref50 on every invocation.
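Note that the value is a path prefix, not a file. A quick way to check that the prefix points at a formatted protein database is to look for the index files `makeblastdb` writes. This is a rough sketch under that assumption: it checks for a `.pin` index (single volume) or a `.pal` alias file (multi-volume databases, which a database the size of UniRef50 typically becomes). The function name is hypothetical.

```python
from pathlib import Path
from typing import Union


def looks_like_protein_db(prefix: Union[str, Path]) -> bool:
    """True if `<prefix>.pin` or `<prefix>.pal` exists on disk."""
    # expanduser() matters when the prefix comes from Python rather than a shell,
    # because "~" is then not expanded automatically.
    base = str(Path(prefix).expanduser())
    return any(Path(base + ext).is_file() for ext in (".pin", ".pal"))


db = "~/blast_dbs/uniref50/uniref50"
print(db, "->", "found" if looks_like_protein_db(db) else "not found")
```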
Model data files
The trained XGBoost models and lookup tables are not bundled in the PyPI wheel because of their size. Download them from the Releases page and place them in the s8kpred/data/ directory inside your Python environment:
s8kpred/data/
TriPeptidePropensityThreeStateSecStructure2AND.csv
TriPeptidePropensityEightStateSecStructure.csv
TripeptideBinaryTable_60.csv
model_3state.json
model_8state.ubj
Or override paths at runtime:
s8kpred predict -i input.fasta \
--blastdb ~/blast_dbs/uniref50/uniref50 \
--model-3state /path/to/model_3state.json \
--model-8state /path/to/model_8state.ubj
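To see which of the five files are still missing, something like the following sketch may be useful. The file names are taken from the listing above; the helper `missing_data_files` is hypothetical, and the lookup assumes the `data/` directory sits next to the installed package's `__init__.py`.

```python
import importlib.util
from pathlib import Path
from typing import List

REQUIRED_FILES = (
    "TriPeptidePropensityThreeStateSecStructure2AND.csv",
    "TriPeptidePropensityEightStateSecStructure.csv",
    "TripeptideBinaryTable_60.csv",
    "model_3state.json",
    "model_8state.ubj",
)


def missing_data_files(data_dir: Path) -> List[str]:
    """Return the required file names that are not present in data_dir."""
    return [name for name in REQUIRED_FILES if not (data_dir / name).is_file()]


# Locate the installed package without importing it (None if not installed).
spec = importlib.util.find_spec("s8kpred")
if spec is not None and spec.origin is not None:
    data_dir = Path(spec.origin).parent / "data"
    print(data_dir, "missing:", missing_data_files(data_dir))
else:
    print("s8kpred does not appear to be installed in this environment")
```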
Quick start
Command line
# Single FASTA file
s8kpred predict -i protein.fasta --blastdb ~/blast_dbs/uniref50/uniref50
# Multi-sequence FASTA
s8kpred predict -i multi_seq.fasta --blastdb ~/blast_dbs/uniref50/uniref50
# Multiple separate FASTA files in one run
s8kpred predict -i seq1.fasta seq2.fasta seq3.fasta \
--blastdb ~/blast_dbs/uniref50/uniref50
# Inline sequence (no file needed)
s8kpred predict \
--sequence MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGD \
--id my_protein \
--blastdb ~/blast_dbs/uniref50/uniref50
# Custom output folder and job name
s8kpred predict -i input.fasta \
--blastdb ~/blast_dbs/uniref50/uniref50 \
--output-dir ./results \
--job experiment_01
# Skip 8-state prediction
s8kpred predict -i input.fasta --blastdb ... --no-8state
# Skip cartoon plots
s8kpred predict -i input.fasta --blastdb ... --no-plot
# Quiet mode (suppress progress output)
s8kpred predict -i input.fasta --blastdb ... --quiet
# Use more PSI-BLAST threads
s8kpred predict -i input.fasta --blastdb ... --threads 16
# Override PSI-BLAST location
s8kpred predict -i input.fasta \
--psiblast /opt/ncbi-blast/bin/psiblast \
--blastdb ~/blast_dbs/uniref50/uniref50
Python API
from s8kpred import predict, predict_file
# ── Single sequence ──────────────────────────────────────────────────
result = predict(
    sequence="MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGD",
    seq_id="my_protein",
    blastdb="/data/blast/uniref50/uniref50",
)
print(result.results_3state["my_protein"]) # e.g. "LLLHHHHHHLLEEEEL..."
print(result.results_8state["my_protein"]) # e.g. "LLLHHHHHHLLEEEELL..."
print(result.job_dir) # Path to all output files
# ── Single FASTA file ────────────────────────────────────────────────
result = predict_file(
    fasta_file="proteins.fasta",
    blastdb="/data/blast/uniref50/uniref50",
    output_dir="./results",
)
print(result.summary())
# ── Multi-sequence FASTA ─────────────────────────────────────────────
result = predict_file("multi_seq.fasta", blastdb="...")
for seq_id, ss in result.results_3state.items():
    print(f"{seq_id}: {ss}")
# ── Custom model paths ────────────────────────────────────────────────
from pathlib import Path
result = predict_file(
    "proteins.fasta",
    blastdb="...",
    model_3state=Path("/models/model_3state.json"),
    model_8state=Path("/models/model_8state.ubj"),
)
# ── Skip 8-state to save time ─────────────────────────────────────────
result = predict("MKTAYI...", blastdb="...", run_8state=False)
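Since the examples above show each prediction as a plain string with one letter per residue, post-processing needs nothing beyond the standard library. As an illustration, this sketch turns such a string into contiguous segments; the helper `segments` is hypothetical and the input string is made up rather than real S8kPred output.

```python
from itertools import groupby
from typing import List, Tuple


def segments(ss: str) -> List[Tuple[str, int, int]]:
    """Return (state, start, end) runs using 1-based inclusive positions."""
    runs = []
    pos = 1
    for state, group in groupby(ss):
        length = len(list(group))
        runs.append((state, pos, pos + length - 1))
        pos += length
    return runs


ss = "LLLHHHHHHLLEEEELL"  # illustrative string, not a real prediction
for state, start, end in segments(ss):
    print(f"{state}: {start}-{end}")
```

In practice `ss` would be one of the values in `result.results_3state` or `result.results_8state`.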
Output files
All outputs are written to a timestamped job directory under --output-dir:
s8kpred_jobs/
└── 20250210_153042_a1b2c3/
├── FASTA/
│ └── input_sequence.fasta # combined input
├── pssm_outputs/
│ ├── Seq_1.pssm # raw PSI-BLAST PSSM
│ └── ...
├── PSSM_Features_ML_17W.csv # sliding-window PSSM features
├── ResultThreeState.ss2 # PSIPRED-style vertical format
├── ResultThreeState.horiz # PSIPRED-style horizontal format
├── ResultThreeState.fas # pseudo-FASTA format
├── ResultThreeState.csv # per-residue probabilities
├── ResultEightState.ss2
├── ResultEightState.horiz
├── ResultEightState.fas
├── ResultEightState.csv
├── Seq_1_cartoon.png # helix/sheet cartoon (requires biotite)
└── log.dat # timing and status log
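The `.ss2` files are described only as PSIPRED-style, so the exact columns are not documented here. Assuming they follow the usual PSIPRED vertical layout (comment lines starting with `#`, then one line per residue with index, amino acid, predicted state, and per-state scores), a tolerant reader could look like the sketch below. The sample text is invented for illustration, and the parser deliberately makes no assumption about which score column belongs to which state.

```python
from typing import List, NamedTuple


class Ss2Row(NamedTuple):
    index: int
    residue: str
    state: str
    scores: List[float]


def parse_ss2(text: str) -> List[Ss2Row]:
    """Parse PSIPRED-style vertical output, skipping blanks and '#' comments."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 3:
            continue  # not a residue line
        scores = [float(x) for x in fields[3:]]
        rows.append(Ss2Row(int(fields[0]), fields[1], fields[2], scores))
    return rows


# Invented sample in the assumed layout.
sample = """# PSIPRED VFORMAT
   1 M L   0.90  0.05  0.05
   2 K H   0.10  0.85  0.05
"""
rows = parse_ss2(sample)
print("".join(r.state for r in rows))
```

For per-residue probabilities with named columns, the `.csv` outputs are likely the more convenient source.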
Secondary structure codes
| Code | State |
|---|---|
| 3-state | |
| H | α-Helix |
| E | β-Strand |
| L | Loop / Coil |
| 8-state | |
| H | α-Helix |
| G | 3₁₀-Helix |
| I | π-Helix |
| E | β-Strand |
| B | β-Bridge |
| T | Turn |
| S | Bend |
| L | Loop / Coil |
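The two alphabets in the table are related by a widely used reduction: helical classes collapse to H, strand and bridge to E, and everything else to L. The sketch below shows that convention. It is an assumption that this particular mapping is relevant to S8kPred; the project describes separate 3-state and 8-state models, so its 3-state output need not equal a reduced 8-state prediction.

```python
# Common 8-to-3 reduction (an assumed convention, not taken from S8kPred).
EIGHT_TO_THREE = {
    "H": "H", "G": "H", "I": "H",  # helices
    "E": "E", "B": "E",            # strand and bridge
    "T": "L", "S": "L", "L": "L",  # turn, bend, loop/coil
}


def reduce_to_three(ss8: str) -> str:
    """Map an 8-state string onto the 3-state alphabet (KeyError on unknown codes)."""
    return "".join(EIGHT_TO_THREE[code] for code in ss8)


print(reduce_to_three("LLGGGHHHTSEEBL"))
```

Comparing this reduction with the dedicated 3-state prediction can be a simple consistency check between the two models.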
Environment variables
| Variable | Default | Description |
|---|---|---|
| S8KPRED_BLASTDB | (empty) | BLAST database path prefix |
| S8KPRED_PSIBLAST | psiblast | PSI-BLAST binary path |
| S8KPRED_ITERATIONS | 3 | PSI-BLAST iterations |
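For scripts that wrap S8kPred, the same variables and defaults can be read in one place. This is a small sketch of that idea; the `EnvConfig` class and `load_env_config` function are hypothetical and only mirror the table above, they are not S8kPred's own configuration code.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class EnvConfig:
    blastdb: str
    psiblast: str
    iterations: int


def load_env_config(env: Mapping[str, str] = os.environ) -> EnvConfig:
    """Read the three variables, falling back to the documented defaults."""
    return EnvConfig(
        blastdb=env.get("S8KPRED_BLASTDB", ""),
        psiblast=env.get("S8KPRED_PSIBLAST", "psiblast"),
        iterations=int(env.get("S8KPRED_ITERATIONS", "3")),
    )


print(load_env_config())
```

Passing a plain dictionary instead of `os.environ` makes the function easy to exercise in tests.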
CLI reference
s8kpred predict --help
usage: s8kpred predict [-h] (-i FASTA [FASTA ...] | -s SEQ)
[--blastdb DB] [--psiblast BIN] [--iterations N]
[--threads N] [-o DIR] [--job ID]
[--model-3state PATH] [--model-8state PATH]
[--no-3state] [--no-8state] [--no-plot] [-q]
[--id ID]
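When driving the CLI from a pipeline, building the argument list in Python avoids shell quoting problems. The sketch below uses only flags that appear in this README; the function `build_predict_command` itself is hypothetical, and it only assembles the command rather than running it.

```python
from typing import List, Optional, Sequence


def build_predict_command(
    fasta_files: Sequence[str],
    blastdb: str,
    output_dir: Optional[str] = None,
    threads: Optional[int] = None,
    skip_8state: bool = False,
    quiet: bool = False,
) -> List[str]:
    """Assemble an `s8kpred predict` argument list for subprocess.run."""
    if not fasta_files:
        raise ValueError("at least one FASTA file is required")
    cmd = ["s8kpred", "predict", "-i", *fasta_files, "--blastdb", blastdb]
    if output_dir is not None:
        cmd += ["--output-dir", output_dir]
    if threads is not None:
        cmd += ["--threads", str(threads)]
    if skip_8state:
        cmd.append("--no-8state")
    if quiet:
        cmd.append("--quiet")
    return cmd


cmd = build_predict_command(
    ["seq1.fasta", "seq2.fasta"],
    "/data/blast/uniref50/uniref50",
    threads=8,
)
print(" ".join(cmd))
# To execute it: subprocess.run(cmd, check=True)
```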
Citation
If you use S8kPred in your research, please cite:
[Your citation here]
License
MIT — see LICENSE for details.