CD-HIT cluster parser

cdhit-parser

CD-HIT file reader.

Read CD-HIT .clstr file

Basic usage

from cdhit_reader import read_cdhit

input = "cluster.fa.clstr"  # path to the CD-HIT cluster file
for cluster in read_cdhit(input):
    # One summary line per cluster
    print(f"{cluster.name} refSequence={cluster.refname} size={len(cluster)}")

    # One line per member sequence; the reference sequence is flagged
    for member in cluster.sequences:
        ref_note = "(Reference sequence)" if member.is_ref else ""
        print(f" {member.name} ({member.length}) identity={member.identity}% {ref_note}")

Load all clusters into a list:

clusters = read_cdhit(input).read_items()
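
To make clear what the reader extracts, here is a minimal standalone sketch of parsing the .clstr layout that CD-HIT writes (a ">Cluster N" header followed by one line per member, with the representative marked by "*"). It is not the implementation used by cdhit_reader: the names parse_clstr, MEMBER_RE and SAMPLE are hypothetical, and the sequence names in the sample are made up.

```python
import re

# A member line looks like "1<TAB>2214aa, >seqB... at 80.65%".
# The representative sequence ends with "*" instead of "at NN%".
MEMBER_RE = re.compile(r"^\d+\s+(\d+)(?:aa|nt),\s+>(.+?)\.\.\.\s+(\*|at\s+.*)$")

# Made-up sample in the .clstr layout (illustration only).
SAMPLE = """\
>Cluster 0
0\t2799aa, >seqA... *
1\t2214aa, >seqB... at 80.65%
>Cluster 1
0\t1500aa, >seqC... *
"""


def parse_clstr(lines):
    """Yield (cluster_name, members) pairs; each member is a dict."""
    name = None
    members = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous cluster, if any.
            if name is not None:
                yield name, members
            name, members = line[1:], []
            continue
        match = MEMBER_RE.match(line)
        if match is None:
            raise ValueError(f"Unrecognised .clstr line: {line!r}")
        length, seq_name, tail = match.groups()
        # Take the trailing percentage as the identity; None for the representative.
        pct = re.search(r"([\d.]+)%\s*$", tail)
        members.append({
            "name": seq_name,
            "length": int(length),
            "identity": float(pct.group(1)) if pct else None,
            "is_ref": tail == "*",
        })
    if name is not None:
        yield name, members


for cluster_name, members in parse_clstr(SAMPLE.splitlines()):
    ref = next(m["name"] for m in members if m["is_ref"])
    print(f"{cluster_name} refSequence={ref} size={len(members)}")
```

For real data, read_cdhit is the better choice. The sketch only shows which fields (name, length, identity, reference flag) can be recovered from each line.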

Sequence to cluster lookup

Clustering loads a whole .clstr file and provides a reverse index (seqcluster) mapping each sequence name to the name of its cluster:

from cdhit_reader import Clustering

clustering = Clustering.from_file(input)

print(len(clustering))                      # number of clusters
for cluster in clustering:                  # iterate over the clusters
    print(cluster.name)

# Which cluster does a sequence belong to?
print(clustering.seqcluster["seq1.A"])      # e.g. "Cluster 0"
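
The reverse index idea can be sketched with a plain dictionary. The snippet below is an illustration only, not the code behind Clustering: build_seqcluster and the example data are hypothetical, and raising on a sequence seen in two clusters is an assumption of this sketch.

```python
def build_seqcluster(clusters):
    """Map each sequence name to the name of the cluster that contains it.

    `clusters` is an iterable of (cluster_name, [sequence_name, ...]) pairs.
    """
    index = {}
    for cluster_name, seq_names in clusters:
        for seq_name in seq_names:
            # Assumption of this sketch: a sequence belongs to exactly one cluster.
            if seq_name in index:
                raise ValueError(f"{seq_name!r} appears in more than one cluster")
            index[seq_name] = cluster_name
    return index


# Made-up example data
example = [
    ("Cluster 0", ["seq1.A", "seq2.B"]),
    ("Cluster 1", ["seq3.C"]),
]
seqcluster = build_seqcluster(example)
print(seqcluster["seq1.A"])  # Cluster 0
```

Building the index once makes each later lookup a constant-time dictionary access instead of a scan over all clusters.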

Read FASTA file

import os
import cdhit_reader

fileName = "input.fa"  # placeholder path; replace with your FASTA file

if os.path.exists(fileName):
    for seq in cdhit_reader.read_fasta(fileName, line_len=60):
        print(seq)  # wrapped at 60 characters per line; use line_len=0 to disable wrapping

        # To access individual attributes:
        # print(">" + seq.name + " " + seq.comment + "\n" + seq.sequence)
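
As an illustration of what wrapping at a fixed line length means, here is a small standalone sketch. It does not reproduce the library's own formatting code: wrap_sequence and format_fasta are hypothetical helpers, and treating 0 as "no wrapping" simply mirrors the comment above.

```python
def wrap_sequence(sequence, line_len=60):
    """Break a sequence into lines of at most `line_len` characters."""
    if line_len <= 0:
        # 0 (or a negative value) disables wrapping in this sketch.
        return sequence
    chunks = (sequence[i:i + line_len] for i in range(0, len(sequence), line_len))
    return "\n".join(chunks)


def format_fasta(name, comment, sequence, line_len=60):
    """Format one FASTA record: a '>' header line plus the wrapped sequence."""
    header = f">{name} {comment}".rstrip()
    return f"{header}\n{wrap_sequence(sequence, line_len)}"


print(format_fasta("seq1", "demo record", "ACGTACGTAC", line_len=4))
```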

Install

pip install cdhit-reader

Or via conda (for example Miniconda), which also installs cd-hit:

conda install -c bioconda -c conda-forge cdhit-reader

Demo applications

Cluster stats

The module ships a demo program called cdhit-reader.py.

cdhit-parser -h

Compare two fasta files

Warning: this requires cd-hit to be installed and available in the system path.

cdhit-compare compares two FASTA files and prints the sequences they have in common, those present in only one of the files, and those that are redundant.

cdhit-compare --help

Example:

cdhit-compare data/input1.faa data/input2.faa --id 0.99

will produce:

input1  BJJOHBJ_00007
input2  BJJOHBJ_00007
input2  BJJOHBJ_00002
both    BJJOHBJ_00003:BJJOHBJ_00003
both    BJJOHBJ_00005:BJJOHBJ_00005
both    BJJOHBJ_00004:BJJOHBJ_00004
multi   input1#IBJJOHBJ_00006,input1#BBJJOHBJ_000B6,input1#CBJJOHBJ_000C6,input2#IBJJOHBJ_00006,input2#BBJJOHBJ_000B6,input2#CBJJOHBJ_000C6
dupl_input1     BJJOHBJ_00001:BJJOHBJ_000F

where records starting with a file name (here input1 or input2) are present only in that file, records starting with both are present in both files (one per file), records starting with dupl_ followed by a file name are duplicates (two copies in that file), and records starting with multi are present multiple times in at least one of the datasets.
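
For downstream scripting, the report can be grouped by its first column. The sketch below assumes each line is a label followed by whitespace and a single record field with no spaces, as in the example output above. group_records and SAMPLE_OUTPUT are hypothetical names, and the sample is an excerpt of that output.

```python
from collections import defaultdict

# Excerpt of the example output shown above.
SAMPLE_OUTPUT = """\
input1  BJJOHBJ_00007
input2  BJJOHBJ_00002
both    BJJOHBJ_00003:BJJOHBJ_00003
both    BJJOHBJ_00005:BJJOHBJ_00005
dupl_input1     BJJOHBJ_00001:BJJOHBJ_000F
"""


def group_records(text):
    """Group report lines by their label (the first whitespace-separated field)."""
    groups = defaultdict(list)
    for line in text.splitlines():
        fields = line.split(None, 1)
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        label, record = fields
        groups[label].append(record.strip())
    return dict(groups)


groups = group_records(SAMPLE_OUTPUT)
for label, records in groups.items():
    print(f"{label}: {len(records)} record(s)")
```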

Author

Andrea Telatin

License

This project is licensed under the MIT License.

Acknowledgments

This module is based on fasta_reader by Danilo Horta.

Release history

0.5.0 (this version): Jul 21, 2026
0.3.0: Apr 14, 2026
0.2.0: Jul 20, 2023
0.1.1: Nov 25, 2022
0.1.0: Aug 23, 2022
0.0.6: Aug 18, 2022
0.0.5: Apr 5, 2022
0.0.4: Mar 7, 2022

Download files

Source Distribution

cdhit_reader-0.5.0.tar.gz (20.7 kB)

Uploaded Jul 21, 2026 Source

Built Distribution

cdhit_reader-0.5.0-py3-none-any.whl (24.3 kB)

Uploaded Jul 21, 2026 Python 3

File details

Details for the file cdhit_reader-0.5.0.tar.gz.

File metadata

Download URL: cdhit_reader-0.5.0.tar.gz
Upload date: Jul 21, 2026
Size: 20.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.25

File hashes

Hashes for cdhit_reader-0.5.0.tar.gz
Algorithm	Hash digest
SHA256	`3f55e21ab86b9bc4569ba7e5967be978448edb8569e38618917fc83d68a978a3`
MD5	`9c0d5a8d9718bae3e8f29d53dbf0ac47`
BLAKE2b-256	`1b1444cb4ace3dcfacb7ea17603f0a4810fcd18b5fffbc1a6c67c8b7bc8c1e22`

File details

Details for the file cdhit_reader-0.5.0-py3-none-any.whl.

File metadata

Download URL: cdhit_reader-0.5.0-py3-none-any.whl
Upload date: Jul 21, 2026
Size: 24.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.25

File hashes

Hashes for cdhit_reader-0.5.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a65b91f2d7aad7938a9c7ba4d4980b7bd8e4fc72048e6a4e37fa98efce64e9b2`
MD5	`f8443bc734b9c9beb74594c8f9c8f15d`
BLAKE2b-256	`3feb5fa8df453d90c763c46797f65cec0de8b3c76ffa04aa945eef1f176b35bf`
