Dynamic multi-loci/multi-repeat tract microsatellite reference sequence generator

RefGeneratr: Dynamic multi-loci/multi-repeat tract microsatellite reference sequence generator

RefGeneratr (generatr) is a Python script/package which generates a reference genetic sequence (*.fasta) for use in sequence alignment. Microsatellite repeat regions can vary in scope and loci count, so this software can dynamically handle an undetermined number of repeat regions within each locus, with intervening sequences if desired. End users can specify as many regions/loci as desired through a simple XML document. This is parsed, and the output is written in the standard *.fasta format.

Generatr requires lxml, which setuptools should install for you during setup.

What's New

Everything

Installation Prerequisites

Assuming that lxml is installed, or you wish setuptools to handle installation for you, the following should suffice. For now, download the source and run:

$ python setup.py install

You may or may not require sudo; it depends on your system. This will install the package for you, so it can be launched with 'generatr' from the command line. Eventually, the package will be uploaded to PyPI so that you can install it directly with pip from a terminal.

Hardware Requirements

Nothing spectacular; any computer should run it fine. However, if you wish to generate a reference with a large number of repeat units and/or loci, available system memory will be the bottleneck.

Usage

Here's how to use generatr:

$ generatr [-v/--verbose] [-i/--input <Path to input.xml>] [-o/--output <Desired *.fasta file output>]

-v enables terminal user feedback.

-i is a path to an XML file containing your desired information, which adheres to the requirements outlined below.

-o is a path to your desired output *.fasta/*.fa/*.fas file.
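The flags above can be mirrored with Python's standard argparse module. This is only a sketch of how such a command line might be declared, not generatr's actual source: the function name build_parser is hypothetical, and treating -i and -o as required is an assumption.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags (hypothetical reconstruction)."""
    parser = argparse.ArgumentParser(prog="generatr")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="enable terminal user feedback")
    # Assumption: both paths are mandatory; the README does not say so explicitly.
    parser.add_argument("-i", "--input", required=True,
                        help="path to the input XML file")
    parser.add_argument("-o", "--output", required=True,
                        help="path to the desired *.fasta output file")
    return parser


# Parse an explicit argument list rather than sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["-v", "-i", "input.xml", "-o", "reference.fasta"])
print(args.verbose, args.input, args.output)
```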

XML Requirements

An example XML file is as follows:

<?xml version="1.0"?>
<data>
    <loci label="example_loci_one">
        <input type="fiveprime" flank="GCGACCCTGGAAAAGCTGATGAAGGCCTTCGAGTCCCTCAAGTCCTTC"/>
        <input type="repeat_region" order="1" unit="CAG" start="1" end="100"/>
        <input type="intervening" sequence="CAACAGCCGCCA" prior="1"/>
        <input type="repeat_region" order="2" unit="CCG" start="1" end="20"/>
        <input type="threeprime" flank="CCTCCTCAGCTTCCTCAGCCGCCGCCGCAGGCACAGCCGCTGCT"/>
    </loci>
</data>

The input regions have been made as straightforward as possible. If you desire multiple loci within one reference file, additional <loci> tags should be added, each with its respective sequence parameters nested within. There is technically no limit on how many loci you can specify, although testing has not gone beyond reasonable figures.
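To see how the nesting reads programmatically, the example file can be walked with a few lines of Python. This sketch uses the standard library's xml.etree.ElementTree so that it runs without extra dependencies (generatr itself relies on lxml), and the helper name read_loci is hypothetical.

```python
import xml.etree.ElementTree as ET

EXAMPLE_XML = """<?xml version="1.0"?>
<data>
    <loci label="example_loci_one">
        <input type="fiveprime" flank="GCGACCCTGGAAAAGCTGATGAAGGCCTTCGAGTCCCTCAAGTCCTTC"/>
        <input type="repeat_region" order="1" unit="CAG" start="1" end="100"/>
        <input type="intervening" sequence="CAACAGCCGCCA" prior="1"/>
        <input type="repeat_region" order="2" unit="CCG" start="1" end="20"/>
        <input type="threeprime" flank="CCTCCTCAGCTTCCTCAGCCGCCGCCGCAGGCACAGCCGCTGCT"/>
    </loci>
</data>
"""


def read_loci(xml_text):
    """Map each locus label to the attribute dicts of its <input> elements."""
    root = ET.fromstring(xml_text)
    loci = {}
    for locus in root.findall("loci"):
        loci[locus.get("label")] = [dict(inp.attrib) for inp in locus.findall("input")]
    return loci


loci = read_loci(EXAMPLE_XML)
for label, inputs in loci.items():
    print(label, [inp["type"] for inp in inputs])
```

A second <loci> element placed alongside the first inside <data> would simply appear as another key in the returned dictionary.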

The possible sequence parameters are as follows:

<input type="fiveprime" flank="<string>"/>

This is the input for a five prime flank sequence. The 'type' must be 'fiveprime', and the 'flank' attribute may hold any valid sequence. A valid sequence is a string consisting only of the characters A, G, C, T, U and N; no other characters are considered valid.
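The validity rule can be expressed as a small check. This is a minimal sketch rather than generatr's own validation code: the name is_valid_sequence is hypothetical, and rejecting lowercase letters and empty strings are assumptions the text above does not settle.

```python
VALID_BASES = set("AGCTUN")


def is_valid_sequence(sequence):
    """Return True if the sequence is non-empty and uses only A, G, C, T, U or N."""
    # Assumption: matching is case-sensitive and an empty flank is rejected.
    return bool(sequence) and set(sequence) <= VALID_BASES


print(is_valid_sequence("GCGACCCTGG"))
print(is_valid_sequence("GCGXCC"))
```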

<input type="repeat_region" order="<integer>" unit="<string>" start="<integer>" end="<integer>"/>

This is the input for a repeat region. The 'order' attribute indicates where in the sequence the region resides. 'unit' is the repeated unit of sequence, and 'start'/'end' are integers giving the range of repeat counts you wish this unit to cover. Generatr can handle an unspecified number of repeat regions for each locus.
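As an illustration of what a start/end range means for a single region, the sketch below expands one repeat unit into its tracts. The function name expand_repeat_region is hypothetical, and treating 'end' as inclusive is an assumption.

```python
def expand_repeat_region(unit, start, end):
    """Return one tract per repeat count from start to end (end assumed inclusive)."""
    return [unit * count for count in range(start, end + 1)]


tracts = expand_repeat_region("CAG", 1, 3)
print(tracts)  # ['CAG', 'CAGCAG', 'CAGCAGCAG']
```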

<input type="intervening" sequence="<string>" prior="<integer>"/>

The intervening flag is for interrupted repeat sequences. Your intervening sequence is specified under 'sequence', and the repeat_region which it follows is indicated in 'prior'. For example, if an intervening sequence follows a repeat_region with order="1", the intervening 'prior' value would also be "1". Generatr can handle zero, one or multiple intervening sequences; the only stipulation for the sequence to appear correctly is that the user accurately enters the preceding repeat_region's 'order' value as the respective intervening region's 'prior' value.
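The way 'order' and 'prior' fit together can be sketched by assembling full reference sequences from the parts. This is an illustration under assumptions, not generatr's implementation: the function build_references, the record naming scheme, the inclusive 'end', and the choice to emit every combination of repeat counts are all hypothetical.

```python
from itertools import product


def build_references(label, fiveprime, threeprime, repeat_regions, intervening):
    """Assemble (name, sequence) pairs for every combination of repeat counts.

    repeat_regions: list of dicts with keys order, unit, start, end.
    intervening: dict mapping a repeat region's order to the sequence following it.
    """
    ordered = sorted(repeat_regions, key=lambda region: region["order"])
    # Assumption: 'end' is inclusive.
    ranges = [range(region["start"], region["end"] + 1) for region in ordered]
    records = []
    for counts in product(*ranges):
        parts = [fiveprime]
        for region, count in zip(ordered, counts):
            parts.append(region["unit"] * count)
            # An intervening sequence is placed after the region named in 'prior'.
            parts.append(intervening.get(region["order"], ""))
        parts.append(threeprime)
        # Hypothetical naming scheme: label plus unit/count pairs.
        suffix = "_".join(region["unit"] + str(count)
                          for region, count in zip(ordered, counts))
        records.append((label + "_" + suffix, "".join(parts)))
    return records


records = build_references(
    label="example",
    fiveprime="AAA",
    threeprime="TTT",
    repeat_regions=[
        {"order": 1, "unit": "CAG", "start": 1, "end": 2},
        {"order": 2, "unit": "CCG", "start": 1, "end": 2},
    ],
    intervening={1: "CAACAGCCGCCA"},
)
for name, sequence in records:
    print(">" + name)
    print(sequence)
```

Under these assumptions, the example XML above (CAG 1 to 100, CCG 1 to 20) would yield 100 x 20 = 2000 sequences for a single locus, which is why memory becomes the limiting factor as ranges and loci grow.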

<input type="threeprime" flank="<string>"/>

The input for a three prime flank follows the same logic as described for five prime.

Thanks for reading. If you have any questions or trouble with installation, please feel free to e-mail me at alastair[dot]maxwell[at]glasgow[dot]ac[dot]uk.
