
Protein secondary structure prediction (3-state & 8-state) using XGBoost and PSSM features


S8kPred — Protein Secondary Structure Prediction


S8kPred predicts protein secondary structure directly from amino acid sequence using XGBoost models trained on PSSM (Position-Specific Scoring Matrix) and tripeptide propensity features.

  • 3-state: Helix (H), Beta-strand (E), Coil/Loop (L)
  • 8-state: H, G, I, E, B, T, S, L (the eight DSSP classes, with L for loop/coil)
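
The PSSM features mentioned above are typically built with a sliding window over the PSSM rows. The sketch below is an illustration, not S8kPred's actual feature code: the window width of 17 is only inferred from the output file name PSSM_Features_ML_17W.csv, and the zero-padding at the termini and the function name window_features are assumptions.

```python
import numpy as np

def window_features(pssm: np.ndarray, window: int = 17) -> np.ndarray:
    """Return one flattened window of PSSM rows per residue.

    pssm   : (L, 20) array of position-specific scores
    window : odd window width (17 is an assumption based on the file name)
    """
    if window % 2 == 0:
        raise ValueError("window must be odd")
    half = window // 2
    # Zero-pad both ends so terminal residues still get a full window (assumed scheme).
    padded = np.pad(pssm, ((half, half), (0, 0)), mode="constant")
    length = pssm.shape[0]
    # Row i of the result covers rows i .. i+window-1 of the padded matrix,
    # which is centred on residue i of the original PSSM.
    return np.stack([padded[i:i + window].ravel() for i in range(length)])

pssm = np.arange(5 * 20, dtype=float).reshape(5, 20)   # toy 5-residue PSSM
features = window_features(pssm)
print(features.shape)   # (5, 340)
```

Each residue ends up with window × 20 values, so a 17-residue window gives 340 PSSM features per position before any tripeptide propensity features are appended.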

Requirements

Dependency                             Purpose
Python ≥ 3.9                           Runtime
numpy, pandas, xgboost, scikit-learn   Core ML pipeline
biopython                              FASTA I/O for PSI-BLAST
NCBI PSI-BLAST                         PSSM generation (external binary)
UniRef50 (or similar) BLAST database   PSI-BLAST database
biotite, matplotlib (optional)         Cartoon structure plots

Installation

From PyPI (recommended)

pip install s8kpred

With cartoon plot support

pip install s8kpred[plot]

From GitHub (latest development version)

pip install git+https://github.com/mayank2801/s8kpred.git

From source

git clone https://github.com/mayank2801/s8kpred.git
cd s8kpred
pip install -e .          # editable install
pip install -e .[plot]    # with plotting extras

Setting up PSI-BLAST

S8kPred requires NCBI PSI-BLAST to generate evolutionary features. You have two options:

Option A — System install

# Ubuntu / Debian
sudo apt install ncbi-blast+

# macOS (Homebrew)
brew install blast

# Conda
conda install -c bioconda -c conda-forge "blast>=2.14"

Option B — Manual download

Download the NCBI BLAST+ toolkit from: https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/

Then either add the bin/ folder to your PATH or pass the full path via --psiblast.
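
Before running a prediction it can help to confirm that the binary is reachable. The following is a minimal sketch using only the Python standard library; the helper names locate_psiblast and psiblast_version are hypothetical and not part of S8kPred.

```python
import shutil
import subprocess

def locate_psiblast(candidate="psiblast"):
    """Return the resolved path of the binary, or None if it cannot be found.

    `candidate` may be a bare name (looked up on PATH) or a full path.
    """
    return shutil.which(candidate)

def psiblast_version(path):
    """Run `<path> -version` and return its output as a stripped string."""
    out = subprocess.run([path, "-version"], capture_output=True, text=True, check=True)
    return out.stdout.strip()

path = locate_psiblast()
print(path or "psiblast not found on PATH")
```

If locate_psiblast() returns None, either extend PATH or pass the full path via --psiblast as described above.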


Setting up a BLAST database

S8kPred works best with UniRef50. Download and format it:

# Download
wget https://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref50/uniref50.fasta.gz
gunzip uniref50.fasta.gz

# Build BLAST database
mkdir -p ~/blast_dbs/uniref50
makeblastdb -in uniref50.fasta \
            -dbtype prot \
            -out ~/blast_dbs/uniref50/uniref50 \
            -title "UniRef50"

Then point s8kpred at it:

export S8KPRED_BLASTDB=~/blast_dbs/uniref50/uniref50

or pass --blastdb ~/blast_dbs/uniref50/uniref50 on every invocation.
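
A quick way to catch a wrong database prefix is to check for the index files that makeblastdb writes next to it. This is a sketch, not S8kPred functionality: it assumes a protein database, where a single-volume build produces a .pin file and a multi-volume build produces a .pal alias file; the function name is hypothetical.

```python
from pathlib import Path

def blast_protein_db_exists(prefix: str) -> bool:
    """Return True if a protein BLAST database appears to exist at `prefix`."""
    p = Path(prefix).expanduser()
    # Single-volume DB: <prefix>.pin ; multi-volume DB: <prefix>.pal (alias file).
    return any((p.parent / (p.name + ext)).is_file() for ext in (".pin", ".pal"))

print(blast_protein_db_exists("~/blast_dbs/uniref50/uniref50"))
```

Note that the prefix is the path without any extension, exactly as it is passed to --blastdb or S8KPRED_BLASTDB.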


Model data files

The trained XGBoost models and lookup tables are not bundled in the PyPI wheel because of their size. Download them from the Releases page and place them in the s8kpred/data/ directory of the installed package (inside your Python environment's site-packages):

s8kpred/data/
  TriPeptidePropensityThreeStateSecStructure2AND.csv
  TriPeptidePropensityEightStateSecStructure.csv
  TripeptideBinaryTable_60.csv
  model_3state.json
  model_8state.ubj

Or override paths at runtime:

s8kpred predict -i input.fasta \
  --blastdb ~/blast_dbs/uniref50/uniref50 \
  --model-3state /path/to/model_3state.json \
  --model-8state /path/to/model_8state.ubj
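
To verify that a data directory is complete, the file names listed above can be checked directly. This is an illustrative sketch; missing_data_files is a hypothetical helper, and the location of the installed data directory (for example Path(s8kpred.__file__).parent / "data") is an assumption about the package layout.

```python
from pathlib import Path

# File names taken from the listing above.
EXPECTED_FILES = [
    "TriPeptidePropensityThreeStateSecStructure2AND.csv",
    "TriPeptidePropensityEightStateSecStructure.csv",
    "TripeptideBinaryTable_60.csv",
    "model_3state.json",
    "model_8state.ubj",
]

def missing_data_files(data_dir):
    """Return the expected file names that are absent from `data_dir`."""
    data_dir = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (data_dir / name).is_file()]

# Example call; the directory used here is only a placeholder.
print(missing_data_files("s8kpred/data"))
```

An empty list means every expected file is present.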

Quick start

Command line

# Single FASTA file
s8kpred predict -i protein.fasta --blastdb ~/blast_dbs/uniref50/uniref50

# Multi-sequence FASTA
s8kpred predict -i multi_seq.fasta --blastdb ~/blast_dbs/uniref50/uniref50

# Multiple separate FASTA files in one run
s8kpred predict -i seq1.fasta seq2.fasta seq3.fasta \
                --blastdb ~/blast_dbs/uniref50/uniref50

# Inline sequence (no file needed)
s8kpred predict \
  --sequence MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGD \
  --id my_protein \
  --blastdb ~/blast_dbs/uniref50/uniref50

# Custom output folder and job name
s8kpred predict -i input.fasta \
  --blastdb ~/blast_dbs/uniref50/uniref50 \
  --output-dir ./results \
  --job experiment_01

# Skip 8-state prediction
s8kpred predict -i input.fasta --blastdb ... --no-8state

# Skip cartoon plots
s8kpred predict -i input.fasta --blastdb ... --no-plot

# Quiet mode (suppress progress output)
s8kpred predict -i input.fasta --blastdb ... --quiet

# Use more PSI-BLAST threads
s8kpred predict -i input.fasta --blastdb ... --threads 16

# Override PSI-BLAST location
s8kpred predict -i input.fasta \
  --psiblast /opt/ncbi-blast/bin/psiblast \
  --blastdb ~/blast_dbs/uniref50/uniref50
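
When the CLI is driven from a script, assembling the argument list programmatically avoids shell-quoting mistakes. The sketch below uses only flags shown in the examples above; build_predict_command is a hypothetical helper, not part of S8kPred.

```python
import shlex

def build_predict_command(inputs, blastdb, output_dir=None, job=None,
                          threads=None, run_8state=True, plot=True, quiet=False):
    """Build an argv list for `s8kpred predict` from the documented flags."""
    cmd = ["s8kpred", "predict", "-i", *inputs, "--blastdb", blastdb]
    if output_dir is not None:
        cmd += ["--output-dir", str(output_dir)]
    if job is not None:
        cmd += ["--job", job]
    if threads is not None:
        cmd += ["--threads", str(threads)]
    if not run_8state:
        cmd.append("--no-8state")
    if not plot:
        cmd.append("--no-plot")
    if quiet:
        cmd.append("--quiet")
    return cmd

cmd = build_predict_command(["seq1.fasta", "seq2.fasta"], "/data/blast/uniref50/uniref50",
                            output_dir="./results", threads=16, plot=False)
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run(cmd, check=True) once s8kpred is installed.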

Python API

from s8kpred import predict, predict_file

# ── Single sequence ──────────────────────────────────────────────────
result = predict(
    sequence="MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGD",
    seq_id="my_protein",
    blastdb="/data/blast/uniref50/uniref50",
)

print(result.results_3state["my_protein"])   # e.g. "LLLHHHHHHLLEEEEL..."
print(result.results_8state["my_protein"])   # e.g. "LLLHHHHHHLLEEEELL..."
print(result.job_dir)                         # Path to all output files

# ── Single FASTA file ────────────────────────────────────────────────
result = predict_file(
    fasta_file="proteins.fasta",
    blastdb="/data/blast/uniref50/uniref50",
    output_dir="./results",
)
print(result.summary())

# ── Multi-sequence FASTA ─────────────────────────────────────────────
result = predict_file("multi_seq.fasta", blastdb="...")
for seq_id, ss in result.results_3state.items():
    print(f"{seq_id}: {ss}")

# ── Custom model paths ────────────────────────────────────────────────
from pathlib import Path
result = predict_file(
    "proteins.fasta",
    blastdb="...",
    model_3state=Path("/models/model_3state.json"),
    model_8state=Path("/models/model_8state.ubj"),
)

# ── Skip 8-state to save time ─────────────────────────────────────────
result = predict("MKTAYI...", blastdb="...", run_8state=False)
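
The prediction strings returned by the API are plain Python strings, so they can be post-processed with the standard library. As a small sketch, the hypothetical helper below collapses a prediction into contiguous segments; the input dictionary is made-up data that only mimics the shape of result.results_3state.

```python
from itertools import groupby

def segments(ss: str):
    """Collapse a per-residue string into (state, start, end) runs, 1-based inclusive."""
    runs = []
    start = 1
    for state, group in groupby(ss):
        n = len(list(group))
        runs.append((state, start, start + n - 1))
        start += n
    return runs

# Toy stand-in for result.results_3state (not real predictions).
toy_results = {"my_protein": "LLHHHHLEE"}
for seq_id, ss in toy_results.items():
    print(seq_id, segments(ss))
# my_protein [('L', 1, 2), ('H', 3, 6), ('L', 7, 7), ('E', 8, 9)]
```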

Output files

All outputs are written to a timestamped job directory under --output-dir:

s8kpred_jobs/
└── 20250210_153042_a1b2c3/
    ├── FASTA/
    │   └── input_sequence.fasta        # combined input
    ├── pssm_outputs/
    │   ├── Seq_1.pssm                  # raw PSI-BLAST PSSM
    │   └── ...
    ├── PSSM_Features_ML_17W.csv        # sliding-window PSSM features
    ├── ResultThreeState.ss2            # PSIPRED-style vertical format
    ├── ResultThreeState.horiz          # PSIPRED-style horizontal format
    ├── ResultThreeState.fas            # pseudo-FASTA format
    ├── ResultThreeState.csv            # per-residue probabilities
    ├── ResultEightState.ss2
    ├── ResultEightState.horiz
    ├── ResultEightState.fas
    ├── ResultEightState.csv
    ├── Seq_1_cartoon.png               # helix/sheet cartoon (requires biotite)
    └── log.dat                         # timing and status log
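
The .ss2 files are described as PSIPRED-style vertical format. Assuming they follow the usual PSIPRED layout (comment lines starting with #, then one line per residue with index, amino acid, predicted state and per-state scores), they could be read as sketched below. The exact header, the order of the score columns, and how multi-sequence jobs are laid out are not specified here, so treat this as a starting point only; the sample text is invented.

```python
def parse_ss2(text: str):
    """Parse PSIPRED-style vertical output into (index, residue, state, scores) tuples."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Skip blank lines, comments, and anything not starting with a residue index.
        if len(parts) < 3 or line.lstrip().startswith("#") or not parts[0].isdigit():
            continue
        scores = [float(x) for x in parts[3:]]
        rows.append((int(parts[0]), parts[1], parts[2], scores))
    return rows

# Invented sample in the assumed layout (not real S8kPred output).
SAMPLE = """\
# PSIPRED VFORMAT (PSIPRED V4.0)

   1 M L   0.900  0.050  0.050
   2 K H   0.100  0.800  0.100
   3 T H   0.050  0.900  0.050
"""

rows = parse_ss2(SAMPLE)
sequence = "".join(r[1] for r in rows)
states = "".join(r[2] for r in rows)
print(sequence, states)   # MKT LHH
```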

Secondary structure codes

Code   State
3-state
H      α-Helix
E      β-Strand
L      Loop / Coil
8-state
H      α-Helix
G      3₁₀-Helix
I      π-Helix
E      β-Strand
B      β-Bridge
T      Turn
S      Bend
L      Loop / Coil
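
The two alphabets are related by a common reduction of the eight classes to three. The sketch below uses one widely used mapping (H, G, I to H; E, B to E; T, S, L to L). Whether S8kPred's 3-state model was trained on exactly this reduction is an assumption, and because the 3-state and 8-state predictions come from separate models, a reduced 8-state string need not match the 3-state output.

```python
# Assumed reduction: helices -> H, strands/bridges -> E, everything else -> L.
EIGHT_TO_THREE = str.maketrans({
    "H": "H", "G": "H", "I": "H",
    "E": "E", "B": "E",
    "T": "L", "S": "L", "L": "L",
})

def reduce_8_to_3(ss8: str) -> str:
    """Map an 8-state string to 3 states using the assumed reduction above."""
    return ss8.translate(EIGHT_TO_THREE)

print(reduce_8_to_3("LLHHGGTTEEBSL"))   # LLHHHHLLEEELL
```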

Environment variables

Variable             Default    Description
S8KPRED_BLASTDB      (empty)    BLAST database path prefix
S8KPRED_PSIBLAST     psiblast   PSI-BLAST binary path
S8KPRED_ITERATIONS   3          PSI-BLAST iterations
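
For scripts that want to honour the same variables, the defaults in the table can be mirrored as follows. This is a sketch of the lookup logic only; the BlastSettings class and settings_from_env function are hypothetical and do not describe how S8kPred reads its configuration internally.

```python
import os
from dataclasses import dataclass

@dataclass
class BlastSettings:
    blastdb: str
    psiblast: str
    iterations: int

def settings_from_env(env=None) -> BlastSettings:
    """Read the documented variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return BlastSettings(
        blastdb=env.get("S8KPRED_BLASTDB", ""),
        psiblast=env.get("S8KPRED_PSIBLAST", "psiblast"),
        iterations=int(env.get("S8KPRED_ITERATIONS", "3")),
    )

print(settings_from_env({"S8KPRED_ITERATIONS": "5"}))
```

Passing a plain dictionary instead of os.environ makes the lookup easy to test without touching the real environment.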

CLI reference

s8kpred predict --help
usage: s8kpred predict [-h] (-i FASTA [FASTA ...] | -s SEQ)
                       [--blastdb DB] [--psiblast BIN] [--iterations N]
                       [--threads N] [-o DIR] [--job ID]
                       [--model-3state PATH] [--model-8state PATH]
                       [--no-3state] [--no-8state] [--no-plot] [-q]
                       [--id ID]

Citation

If you use S8kPred in your research, please cite:

[Your citation here]


License

MIT — see LICENSE for details.
