Protein annotation using local PSSM databases from CDD

local-cd-search

A command-line tool for local protein domain annotation using NCBI's Conserved Domain Database (CDD).

Background

NCBI CD-Search is a widely used tool for functional annotation of proteins. It uses RPS-BLAST to search protein sequences against position-specific scoring matrices (PSSMs) from the CDD database. PSSMs offer higher sensitivity for detecting distant homologs than searches against individual protein sequences, while remaining substantially faster than HMM-based annotation.

While the CD-Search web interface is convenient for small queries, it is not well suited for large-scale annotation. local-cd-search enables local protein annotation and automates the entire workflow: downloading PSSM databases from CDD, running RPS-BLAST, and post-processing the results with rpsbproc to filter hits using CDD's curated bit-score thresholds.

Installation

The easiest way to install local-cd-search is with Pixi, which will manage dependencies automatically and make local-cd-search available for execution from anywhere.

pixi global install -c conda-forge -c bioconda local-cd-search

Alternatively, you can install it from PyPI. In this case, rpsblast and rpsbproc must be installed separately. To install local-cd-search from PyPI using uv, run:

uv tool install local-cd-search
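
Because a PyPI install does not bring in rpsblast and rpsbproc, it can be useful to confirm that both are on the PATH before annotating. The following is a minimal sketch using only the Python standard library; the helper name `missing_tools` is hypothetical and not part of local-cd-search.

```python
import shutil

# External programs that a PyPI install of local-cd-search expects to find.
REQUIRED_TOOLS = ("rpsblast", "rpsbproc")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names in `tools` that cannot be located on the PATH."""
    return [name for name in tools if which(name) is None]


missing = missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("rpsblast and rpsbproc are both available")
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` is enough.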

Quick start

Download PSSM databases

Download the full CDD database, which is a collection of six individual databases (see table below):

local-cd-search download database cdd

Or download individual databases. For example:

# COG database
local-cd-search download database cog

# Multiple databases
local-cd-search download database cog pfam tigr

The databases available for download are:

| Database | Name | Description |
|---|---|---|
| cdd | CDD | Collection of PSSMs derived from multiple sources (all databases listed below except KOG) |
| cdd_ncbi | NCBI-curated domains | Domain models that leverage 3D structural data to define precise boundaries |
| cog | COG | Groups of orthologous Prokaryotic proteins |
| kog | KOG | Groups of orthologous Eukaryotic proteins |
| pfam | Pfam | Large collection of protein families and domains from diverse taxa |
| prk | PRK | NCBI collection of protein clusters containing reference sequences from prokaryotic genomes |
| smart | SMART | Models of domains from proteins involved in signaling, extracellular, and regulatory functions |
| tigr | TIGRFAM | Manually curated models for functional annotation of microbial proteins |
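
When downloads are scripted, checking the database keys against the table above before calling the tool catches typos early. This is a minimal sketch that only assembles the argument list in the documented order (`download [OPTIONS] DB_DIR DATABASE...`); the function name `build_download_command` is hypothetical, and the set of keys is copied from the table rather than queried from the tool.

```python
# Database keys as listed in the table above (an assumption of this sketch:
# the tool itself is the authority on which keys it accepts).
KNOWN_DATABASES = {"cdd", "cdd_ncbi", "cog", "kog", "pfam", "prk", "smart", "tigr"}


def build_download_command(db_dir, databases, force=False, quiet=False):
    """Return the argv list for `local-cd-search download`."""
    databases = list(databases)
    if not databases:
        raise ValueError("at least one database key is required")
    unknown = sorted(set(databases) - KNOWN_DATABASES)
    if unknown:
        raise ValueError(f"unknown database key(s): {', '.join(unknown)}")
    cmd = ["local-cd-search", "download"]
    if force:
        cmd.append("--force")
    if quiet:
        cmd.append("--quiet")
    cmd.append(str(db_dir))
    cmd.extend(databases)
    return cmd


# To actually run it: subprocess.run(cmd, check=True)
cmd = build_download_command("database", ["cog", "pfam", "tigr"])
print(" ".join(cmd))
```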

Annotate proteins

To run annotation on a FASTA file of protein sequences (in this example, proteins.faa) and save results to results.tsv, run the following command:

local-cd-search annotate proteins.faa results.tsv database

local-cd-search will automatically detect which databases are available and use them for annotation.

Output

The output of local-cd-search annotate is a tab-separated file with hits filtered by CDD's curated bit-score thresholds. The following columns are included:

| Column | Description |
|---|---|
| query | Protein identifier |
| hit_type | Specific, Non-specific, or Superfamily |
| pssm_id | CDD PSSM identifier |
| from | Start position in query |
| to | End position in query |
| evalue | E-value |
| bitscore | Bit score |
| accession | Domain accession |
| short_name | Domain short name (e.g., COG0001) |
| incomplete | Indicates if there are more than 20% missing from the N- or C- terminal ends (-, N, C, or NC) |
| superfamily_pssm_id | Superfamily PSSM identifier |
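
For downstream analysis, the results table can be loaded with Python's standard `csv` module. The sketch below assumes the file begins with a header line carrying the column names listed above; that layout is an assumption here, not something this document states. The function name `read_hits` and the sample row are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def read_hits(handle):
    """Group rows of a tab-separated results table by the `query` column."""
    hits = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        hits[row["query"]].append(row)
    return dict(hits)


# Invented example data; real values come from `local-cd-search annotate`.
example = io.StringIO(
    "query\thit_type\tfrom\tto\tincomplete\n"
    "protA\tSpecific\t5\t120\t-\n"
)
for query, rows in read_hits(example).items():
    print(query, len(rows))
```

With a real file, pass an open handle instead, for example `open("results.tsv", newline="")`.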

hit_type

  • Specific: The top-ranking RPS-BLAST hit (compared to other hits in overlapping intervals) that meets or exceeds a domain-specific E-value threshold. It represents a very high confidence that the query sequence belongs to the same protein family as the sequences used to create the domain model.
  • Non-specific: Hits that meet or exceed the RPS-BLAST threshold for statistical significance (default E-value cutoff of 0.01).
  • Superfamily: The domain cluster to which the specific and/or non-specific hits belong. This is a set of conserved domain models that generate overlapping annotation on the same protein sequences and are assumed to represent evolutionarily related domains.

incomplete

Indicates whether more than 20% of the original domain is missing at the N- or C-terminal end of the hit. Possible values are:

  • -: No more than 20% shorter at either terminus.
  • N: The N-terminus has 20% or more missing.
  • C: The C-terminus has 20% or more missing.
  • NC: Both termini have 20% or more missing.
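
The `hit_type` and `incomplete` columns together make it easy to keep only full-length specific hits. The following is a minimal sketch operating on rows represented as dictionaries keyed by column name; both helper names are hypothetical, and the case-insensitive comparison of `hit_type` is a precaution rather than documented behaviour.

```python
VALID_FLAGS = {"-", "N", "C", "NC"}


def truncated_termini(flag):
    """Translate an `incomplete` flag into (n_truncated, c_truncated)."""
    flag = flag.strip()
    if flag not in VALID_FLAGS:
        raise ValueError(f"unexpected incomplete flag: {flag!r}")
    return ("N" in flag, "C" in flag)


def is_complete_specific(row):
    """True for a specific hit with neither terminus flagged as truncated."""
    is_specific = row["hit_type"].strip().lower() == "specific"
    return is_specific and truncated_termini(row["incomplete"]) == (False, False)


rows = [
    {"query": "protA", "hit_type": "Specific", "incomplete": "-"},
    {"query": "protA", "hit_type": "Specific", "incomplete": "N"},
    {"query": "protB", "hit_type": "Superfamily", "incomplete": "-"},
]
kept = [row for row in rows if is_complete_specific(row)]
print([row["query"] for row in kept])
```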

Functional sites output

If --sites-output is specified, an additional tab-separated file is created with functional site annotations. The following columns are included:

| Column | Description |
|---|---|
| query | Protein identifier |
| annot_type | Specific or Generic |
| title | Description of the functional site |
| coordinates | Residues and their positions (e.g., H94,Y96) |
| complete_size | Total number of residues in the site |
| mapped_size | Number of residues mapped to the query |
| source_domain | PSSM ID of the domain where the site is defined |
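
The `coordinates` column packs several residues into one string, so it usually needs splitting before use. The sketch below assumes the format shown in the example (`H94,Y96`): comma-separated tokens, each a single residue letter followed by its position. Anything that deviates from that pattern raises an error instead of being guessed at. The function name `parse_coordinates` is hypothetical.

```python
import re

# One residue letter followed by an integer position, e.g. "H94".
_SITE_TOKEN = re.compile(r"^([A-Za-z])(\d+)$")


def parse_coordinates(text):
    """Split a coordinates string into (residue, position) tuples."""
    residues = []
    for token in text.split(","):
        match = _SITE_TOKEN.match(token.strip())
        if match is None:
            raise ValueError(f"unrecognised site token: {token!r}")
        residues.append((match.group(1), int(match.group(2))))
    return residues


print(parse_coordinates("H94,Y96"))
```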

Usage

download subcommand

local-cd-search download [OPTIONS] DB_DIR DATABASE...

| Option | Short | Argument | Description | Default |
|---|---|---|---|---|
| --force | | flag | Force re-download even if files are already present. | |
| --quiet | | flag | Suppress non-error console output. | |
| --help | -h | flag | Show help message and exit. | |

annotate subcommand

local-cd-search annotate [OPTIONS] INPUT_FILE OUTPUT_FILE DB_DIR

| Option | Short | Argument | Description | Default |
|---|---|---|---|---|
| --evalue | -e | FLOAT (≥ 0) | Maximum allowed E-value for hits. | 0.01 |
| --ns | | flag | Include non-specific hits in the output results table. | |
| --sf | | flag | Include superfamily hits in the output results table. | |
| --gs | | flag | Include generic site hits in the output sites table. | |
| --threads | | INTEGER | Number of threads to use for rpsblast. | 0 |
| --sites-output | -s | FILE | Path to write functional site annotations. | |
| --data-mode | -m | std, rep, or full | Redundancy level of domain hit data passed to rpsbproc: rep (best model per region of the query), std (best model per source per region), full (all models meeting E-value significance). | std |
| --tmp-dir | | DIRECTORY | Directory to store intermediate files. If not specified, temporary files will be deleted after execution. | |
| --quiet | | flag | Suppress non-error console output. | |
| --help | -h | flag | Show help message and exit. | |
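
To run annotate from a Python pipeline, the options above can be assembled into an argument list and passed to `subprocess.run`. This is a minimal sketch covering only some of the documented flags; the function name `build_annotate_command` and its keyword names are hypothetical, and options left as `None` are omitted so the tool's own defaults apply.

```python
def build_annotate_command(input_file, output_file, db_dir, evalue=None,
                           include_ns=False, include_sf=False,
                           sites_output=None, threads=None, data_mode=None):
    """Return the argv list for `local-cd-search annotate`."""
    cmd = ["local-cd-search", "annotate"]
    if evalue is not None:
        if evalue < 0:
            raise ValueError("evalue must be >= 0")
        cmd += ["--evalue", str(evalue)]
    if include_ns:
        cmd.append("--ns")
    if include_sf:
        cmd.append("--sf")
    if sites_output is not None:
        cmd += ["--sites-output", str(sites_output)]
    if threads is not None:
        cmd += ["--threads", str(threads)]
    if data_mode is not None:
        if data_mode not in ("std", "rep", "full"):
            raise ValueError("data_mode must be 'std', 'rep', or 'full'")
        cmd += ["--data-mode", data_mode]
    # Positional arguments follow the options, as in the usage line.
    cmd += [str(input_file), str(output_file), str(db_dir)]
    return cmd


# To actually run it: subprocess.run(cmd, check=True)
cmd = build_annotate_command("proteins.faa", "results.tsv", "database",
                             include_sf=True, sites_output="sites.tsv")
print(" ".join(cmd))
```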
