CD-HIT cluster parser

cdhit-parser

CD-HIT file reader.

Read CD-HIT .clstr file

Basic usage

from cdhit_reader import read_cdhit

# Avoid naming the variable "input", which shadows the Python builtin
clstr_path = "cluster.fa.clstr"
for cluster in read_cdhit(clstr_path):
    print(f"{cluster.name} refSequence={cluster.refname} size={len(cluster)}")

    for member in cluster.sequences:
        ref_note = "(Reference sequence)" if member.is_ref else ""
        print(f" {member.name} ({member.length}) identity={member.identity}% {ref_note}")

Load all clusters into a list:

clusters = read_cdhit(clstr_path).read_items()
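
For orientation, a .clstr file consists of header lines such as ">Cluster 0", each followed by one line per member; the representative sequence is marked with "*" and the other members carry an "at NN.NN%" identity. The standalone sketch below parses that layout with the standard library only. It is not the cdhit_reader implementation: the names parse_clstr, Cluster, Member and the sample data are hypothetical and serve only to illustrate the kind of information the reader exposes.

```python
import re
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List, Optional

# Member line, e.g. "1<TAB>2214aa, >seqB... at 80.00%" or "0<TAB>150nt, >seqC... at +/98.50%"
MEMBER_RE = re.compile(
    r"^\d+\s+(?P<length>\d+)(?:aa|nt),\s+>(?P<name>.+?)\.\.\.\s+"
    r"(?:(?P<ref>\*)|at\s+(?:[+-]/)?(?P<identity>[\d.]+)%)\s*$"
)


@dataclass
class Member:  # hypothetical type, for illustration only
    name: str
    length: int
    identity: Optional[float]  # None for the representative sequence
    is_ref: bool


@dataclass
class Cluster:  # hypothetical type, for illustration only
    name: str
    members: List[Member] = field(default_factory=list)


def parse_clstr(lines: Iterable[str]) -> Iterator[Cluster]:
    """Yield one Cluster per '>Cluster N' block."""
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue
        if line.startswith(">"):
            # A new header closes the previous cluster
            if current is not None:
                yield current
            current = Cluster(name=line[1:].strip())
            continue
        match = MEMBER_RE.match(line)
        if match is None or current is None:
            raise ValueError(f"Unrecognised .clstr line: {line!r}")
        is_ref = match.group("ref") is not None
        identity = None if is_ref else float(match.group("identity"))
        current.members.append(
            Member(match.group("name"), int(match.group("length")), identity, is_ref)
        )
    if current is not None:
        yield current


# Made-up sample data in the .clstr layout
SAMPLE = """>Cluster 0
0\t2799aa, >seqA... *
1\t2214aa, >seqB... at 80.00%
>Cluster 1
0\t150nt, >seqC... at +/98.50%
1\t160nt, >seqD... *
"""

clusters = list(parse_clstr(SAMPLE.splitlines()))
for cluster in clusters:
    ref = next(m.name for m in cluster.members if m.is_ref)
    print(f"{cluster.name} ref={ref} size={len(cluster.members)}")
```

For real work, read_cdhit is the better choice; the sketch only shows why each member has a name, a length, an identity and a reference flag.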

Read FASTA file

import os
import cdhit_reader

fileName = "input.fa"  # placeholder path; replace with your own FASTA file

if os.path.exists(fileName):
    for seq in cdhit_reader.read_fasta(fileName, line_len=60):
        print(seq)  # wrapped at 60 characters per line; use line_len=0 to disable wrapping

        # To access individual attributes:
        # print(">" + seq.name + " " + seq.comment + "\n" + seq.sequence)
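
To make the wrapping behaviour concrete, here is a minimal standalone sketch of how a record could be rendered with a fixed line length, where 0 disables wrapping. The function format_fasta is a hypothetical helper written for this illustration, not part of cdhit_reader, and the library's own output may differ in detail.

```python
def format_fasta(name, sequence, comment="", line_len=60):
    """Return a FASTA record as text, wrapping the sequence at line_len characters.

    A line_len of 0 (or less) keeps the sequence on a single line.
    """
    header = f">{name} {comment}".rstrip()
    if line_len <= 0:
        body = [sequence]
    else:
        # Slice the sequence into consecutive chunks of line_len characters
        body = [sequence[i:i + line_len] for i in range(0, len(sequence), line_len)]
    return "\n".join([header, *body])


wrapped = format_fasta("s1", "ACGT" * 4, line_len=10)
unwrapped = format_fasta("s1", "ACGT" * 4, comment="demo", line_len=0)
print(wrapped)
print(unwrapped)
```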

Install

pip install cdhit-reader

or via Miniconda, which will also install cd-hit:

conda install -c bioconda -c conda-forge cdhit-reader

Demo applications

Cluster stats

The module ships a demo program called cdhit-reader.py.

cdhit-parser -h

Compare two fasta files

Warning: this requires cd-hit to be installed and available in the system PATH.

cdhit-compare compares two FASTA files and prints the sequences that are common to both, those that are present in only one of the files, and those that are redundant.

cdhit-compare --help

Example:

cdhit-compare data/input1.faa data/input2.faa  --id 0.99

will produce:

input1  BJJOHBJ_00007
input2  BJJOHBJ_00007
input2  BJJOHBJ_00002
both    BJJOHBJ_00003:BJJOHBJ_00003
both    BJJOHBJ_00005:BJJOHBJ_00005
both    BJJOHBJ_00004:BJJOHBJ_00004
multi   input1#IBJJOHBJ_00006,input1#BBJJOHBJ_000B6,input1#CBJJOHBJ_000C6,input2#IBJJOHBJ_00006,input2#BBJJOHBJ_000B6,input2#CBJJOHBJ_000C6
dupl_input1     BJJOHBJ_00001:BJJOHBJ_000F

where records starting with input1 or input2 are present in only one of the files, records starting with both are present in both files (once per file), records starting with dupl are duplicates (two copies in one of the files), and records starting with multi are present multiple times in at least one of the datasets.
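
If the output needs further processing, it can be grouped by its first column. The sketch below assumes each line holds a label followed by whitespace and then the record field, as in the example output above; the function name group_compare_output is hypothetical and not part of the package.

```python
from collections import defaultdict


def group_compare_output(text):
    """Group cdhit-compare output lines by their leading label."""
    groups = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the first run of whitespace: label, then the record field
        label, record = line.split(None, 1)
        groups[label].append(record.strip())
    return dict(groups)


# A few lines taken from the example output above
SAMPLE_OUTPUT = """\
input1  BJJOHBJ_00007
input2  BJJOHBJ_00007
input2  BJJOHBJ_00002
both    BJJOHBJ_00003:BJJOHBJ_00003
both    BJJOHBJ_00005:BJJOHBJ_00005
both    BJJOHBJ_00004:BJJOHBJ_00004
dupl_input1     BJJOHBJ_00001:BJJOHBJ_000F
"""

groups = group_compare_output(SAMPLE_OUTPUT)
for label, records in groups.items():
    print(label, len(records))
```

In practice the text would come from the captured standard output of cdhit-compare rather than from a literal string.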

Author

License

This project is licensed under the MIT License.

Acknowledgments

This module is based on fasta_reader by Danilo Horta.
