Fast DBSCAN clustering on peptide strings

# fast_dbscan
A lightweight, fast DBSCAN implementation for clustering peptide strings. The
distance calculations and clustering are written in pure C and wrapped in
Python.

*Note*: as implemented, the software assumes all sequences have the same length.

### Installation

#### pip
```
pip3 install fast_dbscan
```

#### Development version
```
git clone https://github.com/harmslab/fast_dbscan
cd fast_dbscan
sudo python3 setup.py install
```

### Usage

#### Stand-alone
Installation places a convenience program called `fast_dbscan` on the path. It
can be invoked from the command line:

```
fast_dbscan filename epsilon [dl]
```

where `filename` is a file containing sequences of identical length, one per
line; `epsilon` is the neighborhood distance cutoff (see below); and the
optional argument `dl` selects the Damerau-Levenshtein distance function
instead of the simple distance function.
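
Because the input must hold equal-length sequences, one per line, it can help to check a file before clustering it. The following is a minimal sketch of such a check; `read_equal_length_sequences` is a hypothetical helper written for illustration and is not part of `fast_dbscan`. It also assumes blank lines should be ignored, which the package itself may or may not do.

```python
from pathlib import Path


def read_equal_length_sequences(path):
    """Read one sequence per line and check that all lengths match.

    Hypothetical helper, not part of fast_dbscan.
    """
    sequences = [
        line.strip()
        for line in Path(path).read_text().splitlines()
        if line.strip()  # assumption: blank lines are skipped
    ]
    if not sequences:
        raise ValueError(f"{path} contains no sequences")

    lengths = {len(seq) for seq in sequences}
    if len(lengths) != 1:
        raise ValueError(f"sequences have differing lengths: {sorted(lengths)}")
    return sequences
```

Running this on a file first turns a length mismatch into an explicit error rather than leaving it to the clustering step.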

#### As library
```
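import fast_dbscan

# Path to a file with one sequence per line (placeholder name)
file_with_sequences = "sequences.txt"

# Cluster using the Damerau-Levenshtein distance
d = fast_dbscan.DBScanWrapper(distance_function='dl')
d.read_file(file_with_sequences)
d.run(epsilon=1, min_neighbors=12)

# Dictionary keying cluster id to sequences
clusters = d.results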
```
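
Once the results dictionary is available, a common next step is to see how large each cluster is. The sketch below uses a made-up dictionary shaped as described above (cluster id mapped to sequences); it assumes each value is a list of sequences, which is not confirmed here. `cluster_sizes` is a hypothetical helper, not part of the package.

```python
def cluster_sizes(results):
    """Return (cluster_id, size) pairs, largest cluster first."""
    sizes = [(cluster_id, len(members)) for cluster_id, members in results.items()]
    return sorted(sizes, key=lambda pair: pair[1], reverse=True)


# Made-up stand-in for d.results, for illustration only.
example_results = {
    0: ["WWYYH", "WWYYK"],
    1: ["ACDEF", "ACDEG", "ACDFG"],
}

for cluster_id, size in cluster_sizes(example_results):
    print(f"cluster {cluster_id}: {size} sequences")
```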

### Distance functions

+ `simple`: add up entries in a distance matrix based on the identities of the
letters at each column in the alignment. Currently, the software uses Hamming
distance. This could easily be modified to use other matrices, provided
distances can be calculated as integers. The matrix is populated in
`DBScanWrapper.__init__`.
+ `dl`: Damerau-Levenshtein distance, allowing deletion, insertion, substitution,
and transposition.
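
To make the two distance functions concrete, here is a pure-Python sketch of each. It is an illustration only and not the package's C code. The transposition-aware function implements the optimal string alignment variant (often called restricted Damerau-Levenshtein); which variant the package uses is not stated here, so treat that choice as an assumption. The function names are hypothetical.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def osa_distance(a, b):
    """Optimal string alignment (restricted Damerau-Levenshtein) distance.

    Allows deletion, insertion, substitution, and transposition of two
    adjacent characters, each at a cost of 1.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]
```

For two peptides that differ by a swap of adjacent residues, such as `ACDEF` and `ACDFE`, the Hamming distance counts two mismatches while the transposition-aware distance counts a single edit.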

### Other parameters
+ `epsilon`: the maximum distance between two samples for them to be considered
within the same neighborhood.
+ `min_neighbors`: the minimum number of sequence neighbors required to define
a cluster.
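
The way `epsilon` and `min_neighbors` interact is easiest to see in a small pure-Python DBSCAN. This is a sketch of the general algorithm under stated assumptions, not the package's C implementation: it assumes a neighborhood includes the sequence itself, that a pair counts as neighbors when its distance is at most `epsilon`, and that noise is labeled `-1`. The package may differ on any of these. The function names are hypothetical.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def dbscan(sequences, distance, epsilon, min_neighbors):
    """Return one cluster label per sequence; -1 marks noise."""
    n = len(sequences)
    # Assumption: each neighborhood includes the sequence itself.
    neighbors = [
        [j for j in range(n) if distance(sequences[i], sequences[j]) <= epsilon]
        for i in range(n)
    ]

    labels = [None] * n
    cluster_id = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_neighbors:
            labels[i] = -1  # provisional noise; may become a border point
            continue

        # i is a core point: start a new cluster and grow it outward.
        cluster_id += 1
        labels[i] = cluster_id
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id  # noise near a core point joins as border
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            if len(neighbors[j]) >= min_neighbors:
                queue.extend(neighbors[j])  # j is also core, keep expanding
    return labels


peptides = ["AAAA", "AAAB", "AABA", "CCCC", "CCCD", "CCDC", "WYWY"]
labels = dbscan(peptides, hamming_distance, epsilon=1, min_neighbors=3)
print(labels)
```

Raising `epsilon` merges more sequences into each neighborhood, while raising `min_neighbors` demands denser neighborhoods before a cluster can form.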

Files for fast_dbscan, version 0.0.1: `fast_dbscan-0.0.1.tar.gz` (6.2 kB, source distribution).
