Calculates seguid, lseguid & cseguid checksums for biological sequences

Seguid_calculator

Seguid calculator is a small GUI for calculating the SEGUID, lSEGUID and cSEGUID checksums for a biological sequence (DNA, RNA or protein).

Installation

Executables are available from the releases page:

  • Windows 64 bit
  • Mac OSX app
  • Linux DEB and RPM packages are planned (see end of this page)

These packages are built automatically; see the end of this page for details.

Source installation

The setuptools package can be installed with pip like this:

pip install seguid_calculator

This should work well on Windows and Mac OSX. On Linux, wxPython has to be installed separately.

Alternatively, there is a conda package that should install on all platforms with Python 3.5, 3.6 or 3.7:

conda install -c bjornfjohansson seguid_calculator

Visit the website of Bjorn Johansson's group at CBMA for more information.

What does it do?

The SEGUID checksum is defined as the SHA-1 cryptographic hash of a primary biological sequence in uppercase. SEGUID was suggested by Babnigg and Giometti as a way to provide stable identifiers of protein sequences in databases for cross referencing.
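The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the code used by seguid_calculator. It assumes the usual convention (the one followed by Biopython) that the SHA-1 digest is base64-encoded and the trailing "=" padding is removed; the function name seguid is chosen here for illustration.

```python
import base64
import hashlib


def seguid(seq: str) -> str:
    """Return a SEGUID-style checksum for a sequence.

    Assumption: the SHA-1 digest of the uppercased sequence is
    base64-encoded and the trailing '=' padding is stripped.
    """
    digest = hashlib.sha1(seq.upper().encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii").rstrip("=")


# Case does not matter, because the sequence is uppercased first.
print(seguid("acgt") == seguid("ACGT"))
```

Because a SHA-1 digest is 20 bytes long, the resulting checksum is always 27 characters once the padding is removed.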

There are several implementations of SEGUID calculation available, such as the one in Biopython (Bio.SeqUtils.CheckSum). See slides and the Biopython wiki. See also this blog post on the subject.

The lSEGUID is the SEGUID of the lexicographically smaller of the sense and anti-sense strands of a blunt double stranded DNA sequence. This means that a sequence and its reverse complement have the same lSEGUID. This can be useful to identify double stranded DNA sequences, regardless of the strand in which they are presented.
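As a rough sketch of the lSEGUID idea, the snippet below picks the smaller of the two strands before hashing. The helper names (seguid, reverse_complement, lseguid) are illustrative and not taken from seguid_calculator. The sketch assumes unambiguous DNA (A, C, G, T only) and a base64-encoded SHA-1 digest with the "=" padding removed.

```python
import base64
import hashlib

# Complement table for unambiguous DNA only (an assumption of this sketch).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def seguid(seq: str) -> str:
    """SHA-1 of the uppercased sequence, base64-encoded, '=' padding removed."""
    digest = hashlib.sha1(seq.upper().encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii").rstrip("=")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence in uppercase."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def lseguid(seq: str) -> str:
    """SEGUID of the lexicographically smaller of the two strands."""
    watson = seq.upper()
    crick = reverse_complement(watson)
    return seguid(min(watson, crick))


# A sequence and its reverse complement give the same lSEGUID.
print(lseguid("AAAT") == lseguid(reverse_complement("AAAT")))
```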

Circular SEGUID or cSEGUID is the SEGUID checksum for circular (DNA) sequences. As there are many rotations of a circular sequence, using the SEGUID checksum directly is impractical, as there would be many checksums for the same sequence. The cSEGUID is the SEGUID of the lexicographically minimal string rotation of a sequence or its reverse complement (whichever is smaller). The cSEGUID provides a unique and stable identifier for circular sequences, such as plasmids.
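The cSEGUID can be sketched in the same way. The version below finds the minimal rotation by brute force, comparing every rotation of both strands. This is quadratic in the sequence length, which is slow for long sequences; a linear-time method such as Booth's algorithm could be swapped in. All function names are illustrative, only unambiguous DNA is handled, and the base64 encoding without "=" padding is an assumption. This is not necessarily how seguid_calculator does it.

```python
import base64
import hashlib

# Complement table for unambiguous DNA only (an assumption of this sketch).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def seguid(seq: str) -> str:
    """SHA-1 of the uppercased sequence, base64-encoded, '=' padding removed."""
    digest = hashlib.sha1(seq.upper().encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii").rstrip("=")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence in uppercase."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def min_rotation(seq: str) -> str:
    """Lexicographically smallest rotation, found by brute force."""
    if not seq:
        return seq
    return min(seq[i:] + seq[:i] for i in range(len(seq)))


def cseguid(seq: str) -> str:
    """SEGUID of the smallest rotation of either strand."""
    watson = seq.upper()
    crick = reverse_complement(watson)
    return seguid(min(min_rotation(watson), min_rotation(crick)))


# The same circular sequence with a different origin, or given as the
# other strand, yields the same cSEGUID.
print(cseguid("ACGTTT") == cseguid("GTTTAC"))
print(cseguid("ACGTTT") == cseguid(reverse_complement("ACGTTT")))
```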

Example

The cSEGUID checksum can be useful to quickly determine if two sequences refer to the same vector. The sequence of the plasmid pFA6a-GFPS65T-kanMX6 is available from Genbank and from other sources, such as the Forsburg lab.

Both sequences are the same size and claim to describe the same vector, although the origins seem to have been set differently. Analysis of both sequences in seguid_calculator shows that they are in fact representations of the same sequence, since their cSEGUIDs are identical:

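Genbank

[Screenshot: seguid_calculator output for the Genbank sequence]

Forsburg

[Screenshot: seguid_calculator output for the Forsburg sequence]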

Implementation

Seguid_calculator is written in Python 3 with wxPython 4. Development happens on GitHub.

Automatic build status

I will try to set up packager.io to build DEB packages for Linux (work in progress).

Files for seguid-calculator, version 1.2.2
  • seguid_calculator-1.2.2-py3-none-any.whl (27.1 kB), file type: Wheel, Python version: py3
