Calculates seguid, lseguid & cseguid checksums for biological sequences

# Seguid_calculator

Seguid calculator is a small GUI for calculating the SEGUID, lSEGUID and cSEGUID checksums for a biological sequence (DNA, RNA or protein).

## Installation

Executables are available from the releases page:

• Windows 64 bit
• Mac OSX app
• Linux DEB and RPM packages are planned (see end of this page)

These packages are built automatically; see the end of this page for details.

## Source installation

The package can be installed with pip:

pip install seguid_calculator


This should work well on Windows and Mac OSX. On Linux, wxPython has to be installed separately.

Alternatively, there is a conda package that should install on all platforms with Python 3.5, 3.6 or 3.7:

conda install -c bjornfjohansson seguid_calculator


## What does it do?

The SEGUID checksum is defined as the SHA-1 cryptographic hash of a primary biological sequence in uppercase. SEGUID was suggested by Babnigg and Giometti as a way to provide stable identifiers of protein sequences in databases for cross referencing.
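The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the code used by Seguid_calculator: the function name `seguid` is chosen here for convenience, and the output encoding (base64 of the SHA-1 digest with the trailing `=` padding removed) is an assumption based on how Biopython formats its SEGUID strings.

```python
import base64
import hashlib


def seguid(seq: str) -> str:
    """Return a SEGUID-style checksum for a sequence.

    Assumption: the SHA-1 digest of the uppercase sequence is
    base64-encoded and the trailing '=' padding is stripped.
    """
    digest = hashlib.sha1(seq.upper().encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii").rstrip("=")


if __name__ == "__main__":
    # Case does not matter, because the sequence is uppercased first.
    print(seguid("acgt") == seguid("ACGT"))
```

Because the sequence is uppercased before hashing, `acgt` and `ACGT` give the same checksum.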

There are several implementations of SEGUID calculation available, such as the one in Biopython (Bio.SeqUtils.CheckSum). See slides and the Biopython wiki. See also this blog post on the subject.

The lSEGUID is the SEGUID of the lexicographically smaller of the sense and anti-sense strands of a blunt double stranded DNA sequence. This means that a sequence and its reverse complement have the same lSEGUID. This can be useful to identify double stranded DNA sequences, regardless of which strand is presented.
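A rough sketch of the lSEGUID idea follows. The helper names (`seguid`, `reverse_complement`, `lseguid`) are hypothetical, the base64 output format is assumed as in Biopython, and only the unambiguous bases A, C, G and T are handled; the actual program may treat ambiguity codes differently.

```python
import base64
import hashlib

# Assumption: only unambiguous DNA bases are complemented.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def seguid(seq: str) -> str:
    """SHA-1 of the uppercase sequence, base64-encoded without '=' padding."""
    digest = hashlib.sha1(seq.upper().encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii").rstrip("=")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase or lowercase DNA string."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def lseguid(seq: str) -> str:
    """SEGUID of whichever strand sorts first lexicographically."""
    watson = seq.upper()
    crick = reverse_complement(watson)
    return seguid(min(watson, crick))


if __name__ == "__main__":
    # "AAGT" and its reverse complement "ACTT" share one lSEGUID.
    print(lseguid("AAGT") == lseguid("ACTT"))
```

Taking the smaller of the two strands gives a canonical form, so either strand can be supplied as input.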

Circular SEGUID or cSEGUID is the SEGUID checksum for circular (DNA) sequences. As there are many permutations of a circular sequence, the use of the SEGUID checksum directly is impractical as there would be many checksums for the same sequence. The cSEGUID is the SEGUID of the lexicographically minimal string rotation of a sequence or its reverse complement (whichever is smaller). The cSEGUID provides a unique and stable identifier for circular sequences, such as plasmids.
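The cSEGUID description can be illustrated with the sketch below. It is not the program's own implementation: the function names are hypothetical, the base64 output format is assumed as in Biopython, only A, C, G and T are complemented, and the minimal rotation is found by a naive comparison of all rotations, which is quadratic in sequence length. A linear-time method such as Booth's algorithm would be a better fit for long sequences.

```python
import base64
import hashlib

# Assumption: only unambiguous DNA bases are complemented.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def seguid(seq: str) -> str:
    """SHA-1 of the uppercase sequence, base64-encoded without '=' padding."""
    digest = hashlib.sha1(seq.upper().encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii").rstrip("=")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def min_rotation(seq: str) -> str:
    """Lexicographically smallest rotation (naive O(n^2) search)."""
    if not seq:
        return seq
    return min(seq[i:] + seq[:i] for i in range(len(seq)))


def cseguid(seq: str) -> str:
    """SEGUID of the smallest rotation over both strands of a circular sequence."""
    watson = min_rotation(seq.upper())
    crick = min_rotation(reverse_complement(seq))
    return seguid(min(watson, crick))


if __name__ == "__main__":
    # The same circle written from a different origin gives the same cSEGUID.
    print(cseguid("ACGTT") == cseguid("GTTAC"))
```

Two records of the same plasmid that differ only in where the origin was placed, or in which strand was written out, would therefore be expected to share a cSEGUID.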

## Example

The cSEGUID checksum can be useful to quickly determine if two sequences refer to the same vector. The sequence of the plasmid pFA6a-GFPS65T-kanMX6 is available from Genbank and from other sources such as the Forsburg lab, sequence here or here.

Both sequences are the same size and claim to describe the same vector, although the origins seem to have been set differently. Analysis of both sequences in seguid_calculator shows that they are in fact representations of the same sequence, since their cSEGUIDs are identical.

## Implementation

Seguid_calculator is written in Python 3 with wxPython 4. Development happens on GitHub.

## Automatic build status

I will try to set up packager.io to build DEB packages for Linux (work in progress).
