Generating sets of random DNA sequences optimized for use in high-throughput sequencing.

🔴🟢🔵⚫️ monte barcode

Installation

The easy way

Install the pre-compiled version from PyPI:

pip install monte-barcode

From source

Clone the repository, cd into it, and run:

pip install -e .

Usage

monte barcode provides command line utilities to generate completely random or peptide-encoding barcodes conforming to custom constraints, such as minimum edit distance among the set, GC content, and color balance for Illumina chemistry.

Barcode sets and individual barcodes are deterministically given an adjective-noun mnemonic (generated by nemony) for easy reference.
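
As a rough illustration of what "deterministically" means here, the sketch below hashes a string and uses the digest to pick an adjective and a noun, so the same input always yields the same label. This is not nemony's actual algorithm or vocabulary: the word lists and the mnemonic function are hypothetical stand-ins.

```python
import hashlib

# Hypothetical word lists for illustration only; nemony has its own vocabulary.
ADJECTIVES = ["mighty", "basic", "scenic", "bright"]
NOUNS = ["orchid", "hamlet", "blast", "cliff"]


def mnemonic(text: str) -> str:
    """Map a string to a stable adjective_noun label via its SHA-256 digest."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    adjective = ADJECTIVES[digest[0] % len(ADJECTIVES)]
    noun = NOUNS[digest[1] % len(NOUNS)]
    return f"{adjective}_{noun}"


# The same barcode always maps to the same label.
print(mnemonic("TGAGGT") == mnemonic("TGAGGT"))
```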

Each utility gives a lot of commentary to stderr, but the barcodes go to stdout by default so they can be piped.

Command line

Generate random barcodes of a particular length.

$ monte barcode --length 6 -n 5
Generating barcodes with the following parameters:
       ...
Requested barcodes with length 6, and 4096 possible combinations.
> Tried 16 barcodes, rejected 11, accepted 5; rejection rate is 0.69

Rejection reasons:
        gc_content: 0.62
        homopolymer: 0.25
        restriction_sites: 0.06
mighty_orchid:l6-n5-d3:x0:fresh_prague  TGAGGT
mighty_orchid:l6-n5-d3:x1:flexible_forest       AGTTCG
mighty_orchid:l6-n5-d3:x2:fun_baby      GACATC
mighty_orchid:l6-n5-d3:x3:woolly_podium TGTCCT
mighty_orchid:l6-n5-d3:x4:strong_factor GAACCA
Wrote barcode set called mighty_orchid, with minimum Hamming distance 3 and maximum Hamming distance 6.
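
The numbers in this report can be reproduced in spirit with a small rejection sampler. The sketch below is an assumption about the general approach, not the package's implementation: it draws random sequences, keeps those whose GC content is within bounds, tallies rejection reasons, and gives up if the rejection rate stays above a threshold. The names sample_barcodes and min_tries are hypothetical. Note that 4 ** 6 = 4096 matches the reported number of possible combinations, and that with length 6 only sequences with exactly three G/C bases fall inside the default 0.4 to 0.6 window.

```python
import random
from collections import Counter


def gc_fraction(seq):
    """Fraction of bases that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq)


def sample_barcodes(n, length, gc_min=0.4, gc_max=0.6,
                    max_rejection_rate=0.85, min_tries=100, seed=0):
    """Draw random barcodes until n pass, or the rejection rate is too high."""
    rng = random.Random(seed)
    accepted, reasons, tries = [], Counter(), 0
    while len(accepted) < n:
        tries += 1
        candidate = "".join(rng.choice("ACGT") for _ in range(length))
        if candidate in accepted:
            reasons["duplicate"] += 1
        elif not gc_min <= gc_fraction(candidate) <= gc_max:
            reasons["gc_content"] += 1
        else:
            accepted.append(candidate)
        rejected = tries - len(accepted)
        # Only judge the rate once enough candidates have been tried.
        if tries >= min_tries and rejected / tries > max_rejection_rate:
            break
    return accepted, reasons, tries


barcodes, reasons, tries = sample_barcodes(n=5, length=6)
print(4 ** 6, "possible sequences of length 6")
print(f"Tried {tries}, accepted {len(barcodes)}")
```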

Or generate barcodes encoding a peptide.

$ monte barcode --amino-acid HELP -n 5
Generating barcodes with the following parameters:
        ...
Using amino acid sequence HELP with length 12 and 96 possible combinations.
> Tried 7 barcodes, rejected 2, accepted 5; rejection rate is 0.29

Rejection reasons:
        gc_content: 0.14
        homopolymer: 0.14
basic_hamlet:l12-n5-d2:x0:volatile_lesson       CATGAGCTGCCT
basic_hamlet:l12-n5-d2:x1:pricy_scuba   CACGAACTGCCT
basic_hamlet:l12-n5-d2:x2:good_race     CACGAATTGCCA
basic_hamlet:l12-n5-d2:x3:demanding_bruno       CATGAATTACCG
basic_hamlet:l12-n5-d2:x4:pawky_plaster CATGAGTTACCT
Wrote barcode set called basic_hamlet, with minimum Hamming distance 2 and maximum Hamming distance 4.
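
The "96 possible combinations" figure follows from the standard genetic code: H and E each have 2 codons, L has 6, and P has 4, so 2 × 2 × 6 × 4 = 96. The sketch below checks that arithmetic with a self-built codon table; the helper names are illustrative and are not part of monte-barcode.

```python
from itertools import product
from math import prod

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): residue
    for codon, residue in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def synonymous_codons(amino_acid):
    """All codons that translate to the given one-letter amino acid."""
    return [codon for codon, residue in CODON_TABLE.items() if residue == amino_acid]


def count_encodings(peptide):
    """Number of distinct DNA sequences encoding the peptide."""
    return prod(len(synonymous_codons(aa)) for aa in peptide)


def encodings(peptide):
    """Yield every DNA sequence encoding the peptide."""
    for codons in product(*(synonymous_codons(aa) for aa in peptide)):
        yield "".join(codons)


print(count_encodings("HELP"))
```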

Insist on a minimum edit distance.

$ monte barcode --length 6 -n 10 -d 3
Generating barcodes with the following parameters:
       ...
Requested barcodes with length 6, and 4096 possible combinations.
> Tried 39 barcodes, rejected 29, accepted 10; rejection rate is 0.74

Rejection reasons:
        gc_content: 0.67
        distance: 0.13
        homopolymer: 0.05
scenic_blast:l6-n10-d3:x0:acidic_turtle TGTGTG
scenic_blast:l6-n10-d3:x1:rowdy_grace   ACCATC
scenic_blast:l6-n10-d3:x2:rich_export   CGTTAG
scenic_blast:l6-n10-d3:x3:unique_break  GGAATC
scenic_blast:l6-n10-d3:x4:careful_fuji  GCAAGT
scenic_blast:l6-n10-d3:x5:whimsical_derby       CGGAAT
scenic_blast:l6-n10-d3:x6:pricy_aloha   TTCTCC
scenic_blast:l6-n10-d3:x7:zestful_ricardo       AGAGCT
scenic_blast:l6-n10-d3:x8:terse_cobra   AAGTCC
scenic_blast:l6-n10-d3:x9:zany_chamber  TTACGG
Wrote barcode set called scenic_blast, with minimum Hamming distance 3 and maximum Hamming distance 6.
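
One simple way to enforce a minimum distance is a greedy filter: accept a candidate only if it is far enough from every barcode accepted so far. The sketch below assumes that approach and uses Hamming distance; it is not taken from the package's source, and filter_by_distance is a hypothetical name.

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def filter_by_distance(candidates, min_distance):
    """Keep candidates at least min_distance from all previously kept ones."""
    accepted = []
    for candidate in candidates:
        if all(hamming(candidate, kept) >= min_distance for kept in accepted):
            accepted.append(candidate)
    return accepted


print(filter_by_distance(["AAAA", "AAAT", "TTTT", "AATT", "CCCC"], 3))
```

Because the filter is greedy, the result depends on candidate order, which is one reason the rejection rate climbs as the set fills up.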

Or insist on ideal color balance for Illumina chemistry.

$ monte barcode --length 6 -n 10 -d 3 --color
Generating barcodes with the following parameters:
        ...
Requested barcodes with length 6, and 4096 possible combinations.
> Tried 151 barcodes, rejected 141, accepted 10; rejection rate is 0.93

Rejection reasons:
        gc_content: 0.65
        homopolymer: 0.21
        color_balance: 0.72
        distance: 0.17
        palindrome: 0.02
bright_cliff:l6-n10-d3:x0:ultimate_spray        AGCGAT
bright_cliff:l6-n10-d3:x1:bulky_drama   AGTTGC
bright_cliff:l6-n10-d3:x2:tropical_pinball      TTCACG
bright_cliff:l6-n10-d3:x3:unique_info   GTACGT
bright_cliff:l6-n10-d3:x4:chilly_sahara CCTCTT
bright_cliff:l6-n10-d3:x5:novel_wisdom  GACCTA
bright_cliff:l6-n10-d3:x6:oceanic_plume AGACTG
bright_cliff:l6-n10-d3:x7:wanted_jessica        TCTCGA
bright_cliff:l6-n10-d3:x8:incise_radical        TCTGTC
bright_cliff:l6-n10-d3:x9:rebel_option  TAGGAC
Wrote barcode set called bright_cliff, with minimum Hamming distance 3 and maximum Hamming distance 6.
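
For context, on four-channel Illumina instruments A and C are read in the red channel and G and T in the green channel, so a pool is well balanced when both channels carry signal at every cycle. The sketch below checks that per position. The 0.25 threshold and the function names are assumptions chosen for illustration, the criterion used by --color may differ, and two-channel instruments assign colors differently.

```python
# Four-channel chemistry: A and C are read in red, G and T in green.
RED_BASES = frozenset("AC")


def channel_fractions(barcodes):
    """Per-position (red, green) signal fractions across a barcode set."""
    n = len(barcodes)
    fractions = []
    for column in zip(*barcodes):
        red = sum(base in RED_BASES for base in column) / n
        fractions.append((red, 1.0 - red))
    return fractions


def is_color_balanced(barcodes, min_fraction=0.25):
    """True if each channel has at least min_fraction of signal everywhere."""
    return all(
        min(red, green) >= min_fraction
        for red, green in channel_fractions(barcodes)
    )


print(is_color_balanced(["ACGT", "GTAC"]))
print(is_color_balanced(["AAAA", "CCCC"]))
```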

You can also check and filter previously generated sets.

$ monte barcode --length 6 -n 10 -d 3 2> /dev/null | monte check --color --field 2
Checking barcodes with the following parameters:
        ...
> Tried 10 barcodes, rejected 6, accepted 4; rejection rate is 0.60
Rejection reasons:
        color_balance: 0.60
Could only generate 4 barcodes, but 10 were requested. You might need to try different settings.
thorough_adam:l6-n4-d4:x0:savvy_ruby    TCCTGA
thorough_adam:l6-n4-d4:x1:elfin_rufus   AGCTTC
thorough_adam:l6-n4-d4:x2:damaged_atlas AAGGCA
thorough_adam:l6-n4-d4:x3:faded_elite   GCACTA
Wrote barcode set called thorough_adam, with minimum Hamming distance 4 and maximum Hamming distance 5.
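
Since the sequence sits in the second whitespace-separated column (hence --field 2), the output is easy to consume from your own scripts. The parser below is a sketch that infers the identifier layout from the examples shown here (set name, then length, count and distance, then index, then barcode name); that layout is an assumption rather than a documented format, and BarcodeRecord and parse_line are hypothetical names.

```python
from typing import NamedTuple


class BarcodeRecord(NamedTuple):
    set_name: str
    length: int
    count: int
    distance: int
    index: int
    name: str
    sequence: str


def parse_line(line, field=2):
    """Split one output line into its identifier parts and the sequence."""
    columns = line.split()
    sequence = columns[field - 1]  # field is 1-based, like --field
    set_name, params, index, name = columns[0].split(":")
    # params looks like "l6-n4-d4": strip the leading letter from each part.
    length, count, distance = (int(part[1:]) for part in params.split("-"))
    return BarcodeRecord(set_name, length, count, distance,
                         int(index[1:]), name, sequence)


record = parse_line("thorough_adam:l6-n4-d4:x0:savvy_ruby\tTCCTGA")
print(record.name, record.sequence)
```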

And try to sort by ideal color balance for Illumina chemistries (if you want to use subsets).

$ monte barcode --length 6 -n 15 -d 1 2> /dev/null | monte sort --field 2
Sorting barcodes with the following parameters:
        ...
round_mono:l6-n15-d2:x0:shady_soda      AGTCCT
round_mono:l6-n15-d2:x1:vogue_cosmos    TGAGTC
round_mono:l6-n15-d2:x2:upbeat_baboon   AACGGA
round_mono:l6-n15-d2:x3:sweet_octavia   CATCCT
round_mono:l6-n15-d2:x4:clean_copper    CCTTAG
round_mono:l6-n15-d2:x5:fabulous_partner        TCCTAG
round_mono:l6-n15-d2:x6:defiant_charlie GAACGA
round_mono:l6-n15-d2:x7:misty_miguel    GCATGA
round_mono:l6-n15-d2:x8:urgent_rodeo    ACTGTG
round_mono:l6-n15-d2:x9:injured_news    GAAGGT
round_mono:l6-n15-d2:x10:clear_public   TGAGAG
round_mono:l6-n15-d2:x11:seemly_satire  GATTGG
round_mono:l6-n15-d2:x12:exemplary_robert       TTCAGC
round_mono:l6-n15-d2:x13:nuclear_choice CATCAC
round_mono:l6-n15-d2:x14:discreet_shake GCATTG
Wrote barcode set called round_mono, with minimum Hamming distance 2 and maximum Hamming distance 6.
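
One plausible reading of sorting for subsets is that the first k barcodes of the sorted list should be as balanced as possible for any k. The sketch below implements that idea greedily under the four-channel assumption (A and C red, G and T green). It is a guess at the intent, not the algorithm monte sort uses, and imbalance and sort_for_balance are hypothetical names.

```python
# Four-channel chemistry assumption: A and C are red, G and T are green.
RED_BASES = frozenset("AC")


def imbalance(barcodes):
    """Sum over positions of how far the red fraction is from one half."""
    n = len(barcodes)
    return sum(
        abs(sum(base in RED_BASES for base in column) / n - 0.5)
        for column in zip(*barcodes)
    )


def sort_for_balance(barcodes):
    """Greedily order barcodes so each prefix is as balanced as possible."""
    remaining = list(barcodes)
    ordered = []
    while remaining:
        # Pick the barcode that leaves the growing prefix least imbalanced.
        best = min(remaining, key=lambda bc: imbalance(ordered + [bc]))
        ordered.append(best)
        remaining.remove(best)
    return ordered


print(sort_for_balance(["AAAA", "CCCC", "GGGG"]))
```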

Details

usage: monte barcode [-h] --number NUMBER [--length LENGTH] [--rejection-rate REJECTION_RATE]
                     [--amino-acid AMINO_ACID] [--distance DISTANCE] [--homopolymer HOMOPOLYMER]
                     [--levenshtein] [--color] [--gc_min GC_MIN] [--gc_max GC_MAX]
                     [--output OUTPUT]

options:
  -h, --help            show this help message and exit
  --number NUMBER, -n NUMBER
                        Number of barcodes to generate. Required.
  --length LENGTH, -l LENGTH
                        Barcode length. Default: 12
  --rejection-rate REJECTION_RATE, -r REJECTION_RATE
                        Rate of rejection before aborting. Default: 0.85
  --amino-acid AMINO_ACID, -a AMINO_ACID
                        Generate barcodes encoding this amino acid sequence. Default: do not use.
  --distance DISTANCE, -d DISTANCE
                        Minimum distance between barcodes. Default: 1
  --homopolymer HOMOPOLYMER, -p HOMOPOLYMER
                        Maximum homopolymer length. Default: 3
  --levenshtein, -e     Use Levenshtein distance. Otherwise using Hamming diatnce. Default: False
  --color, -c           Check optimal Illumina color balance. Default: False
  --gc_min GC_MIN, -g GC_MIN
                        Minimum GC content. Default: 0.4
  --gc_max GC_MAX, -j GC_MAX
                        Maximum GC content. Default: 0.6
  --output OUTPUT, -o OUTPUT
                        Output file. Default: STDOUT
usage: monte check [-h] [--distance DISTANCE] [--homopolymer HOMOPOLYMER] [--levenshtein]
                   [--color] [--gc_min GC_MIN] [--gc_max GC_MAX] [--field FIELD] [--output OUTPUT]
                   [input]

positional arguments:
  input                 Input file. Default: STDIN.

options:
  -h, --help            show this help message and exit
  --distance DISTANCE, -d DISTANCE
                        Minimum distance between barcodes. Default: 1
  --homopolymer HOMOPOLYMER, -p HOMOPOLYMER
                        Maximum homopolymer length. Default: 3
  --levenshtein, -e     Use Levenshtein distance. Otherwise using Hamming diatnce. Default: False
  --color, -c           Check optimal Illumina color balance. Default: False
  --gc_min GC_MIN, -g GC_MIN
                        Minimum GC content. Default: 0.4
  --gc_max GC_MAX, -j GC_MAX
                        Maximum GC content. Default: 0.6
  --field FIELD, -f FIELD
                        Column number for barcode sequences. Default: 1
  --output OUTPUT, -o OUTPUT
                        Output file. Default: STDOUT
usage: monte sort [-h] [--field FIELD] [--output OUTPUT] [input]

positional arguments:
  input                 Input file. Default: STDIN.

options:
  -h, --help            show this help message and exit
  --field FIELD, -f FIELD
                        Column number for barcode sequences. Default: 1
  --output OUTPUT, -o OUTPUT
                        Output file. Default: STDOUT

Python API

monte-barcode can be imported into Python to generate and check barcodes in your own programs.

import montebarcode as mb

Generate random DNA sequences.

>>> for bc in mb.infinite_barcodes(length=20, check_used=False): 
...     print(bc)
...     break
... 
ATCAGTCGTCACACTAGTTA

Or peptide-encoding sequences.

>>> list(mb.codon_barcodes("L", ordered=True)) 
['CTT', 'CTC', 'CTA', 'CTG', 'TTA', 'TTG']

You can check the minimum and maximum distances among a set.

>>> mb.minmax_distance(['AAA', 'AAA'])
(0, 0)
>>> mb.minmax_distance(['AAA', 'TCG', 'AAT'])
(1, 3)
>>> mb.minmax_distance(['AAA', 'TCG', 'AAAT'], use_levenshtein=False)
(0, 3)
>>> mb.minmax_distance(['AAA', 'TCG', 'AAAT'])
(1, 4)
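
The four results above are consistent with a pairwise scan that uses Levenshtein distance by default and a Hamming distance that compares only the overlapping prefix of unequal-length sequences. The pure-Python sketch below reproduces them under those assumptions; it is a re-implementation for illustration, not the package's code.

```python
from itertools import combinations


def hamming(a, b):
    """Mismatches over the overlapping prefix (zip stops at the shorter one)."""
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Edit distance allowing substitutions, insertions and deletions."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # deletion
                current[j - 1] + 1,          # insertion
                previous[j - 1] + (x != y),  # substitution or match
            ))
        previous = current
    return previous[-1]


def minmax_distance(barcodes, use_levenshtein=True):
    """Smallest and largest pairwise distance within a set."""
    metric = levenshtein if use_levenshtein else hamming
    distances = [metric(a, b) for a, b in combinations(barcodes, 2)]
    return min(distances), max(distances)


print(minmax_distance(["AAA", "TCG", "AAAT"], use_levenshtein=False))
print(minmax_distance(["AAA", "TCG", "AAAT"]))
```

The difference between the two results comes from the pair 'AAA' and 'AAAT': truncated Hamming sees no mismatch, while Levenshtein counts the one insertion.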

And get usage of each base at each position.

>>> mb.base_usage(['AAA', 'TTT', 'GCT', 'CCA'])[0]['A']
0.25
>>> mb.base_usage(['AAA', 'TTT', 'GCT', 'CCA'])[1]['G']
0
>>> mb.base_usage(['AAA', 'TTT', 'GCT', 'CCA'])[2]['A']
0.5
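
Base usage is a per-position frequency table: for each column of the aligned barcodes, the fraction of sequences carrying each base. A minimal equivalent, written here as a sketch rather than copied from the package, looks like this:

```python
from collections import Counter


def base_usage(barcodes):
    """Per-position fraction of each base across a set of barcodes."""
    usage = []
    for column in zip(*barcodes):
        counts = Counter(column)
        usage.append({base: counts[base] / len(column) for base in "ACGT"})
    return usage


usage = base_usage(["AAA", "TTT", "GCT", "CCA"])
print(usage[0]["A"], usage[1]["G"], usage[2]["A"])
```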

You can see whether adding a barcode to a set would throw off the Illumina color balance.

>>> mb.IlluminaColorBalance()('AAAT', ['TCGC', 'ACAG', 'TGGC', 'ATCG'])
True
>>> mb.IlluminaColorBalance()('AAAT', ['TCGC', 'CCAG', 'TGGC', 'ATCG'])
False

And run a suite of checks against a set of barcodes (or infinite stream), retrieving failure reasons, number of tries, and conforming barcode set.

>>> checks = [mb.Homopolymer(), mb.Palindrome()]
>>> mb.make_checks(['AAAAT', 'CCCGGG', 'ATCGCG', 'GCCGAT'], n=4, checks=checks, quiet=True)
(Counter({'homopolymer': 1, 'palindrome': 1}), 4, ['ATCGCG', 'GCCGAT'])
>>> mb.make_checks(['AAAAT', 'CCCGGG', 'ATCGCG', 'GCCGAT'], n=1, checks=checks, quiet=True)
(Counter({'homopolymer': 1, 'palindrome': 1}), 3, ['ATCGCG'])
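
The behavior shown above can be mimicked with a short loop: pull candidates until n have passed, count each try, and tally which checks failed. The sketch below assumes that a homopolymer failure means a run longer than 3 bases and that a palindrome is a sequence equal to its own reverse complement. Those definitions reproduce the two outputs above but are inferred, not taken from the package, and run_checks is a hypothetical stand-in for make_checks.

```python
from collections import Counter
from itertools import groupby

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def homopolymer_ok(seq, max_run=3):
    """Pass if no base repeats more than max_run times in a row."""
    return max(len(list(run)) for _, run in groupby(seq)) <= max_run


def palindrome_ok(seq):
    """Pass if the sequence differs from its reverse complement."""
    return seq != seq.translate(COMPLEMENT)[::-1]


def run_checks(candidates, n, checks):
    """Return (failure counts, number of tries, accepted barcodes)."""
    failures, accepted, tries = Counter(), [], 0
    for candidate in candidates:
        if len(accepted) >= n:
            break
        tries += 1
        failed = [name for name, check in checks.items() if not check(candidate)]
        failures.update(failed)
        if not failed:
            accepted.append(candidate)
    return failures, tries, accepted


checks = {"homopolymer": homopolymer_ok, "palindrome": palindrome_ok}
print(run_checks(["AAAAT", "CCCGGG", "ATCGCG", "GCCGAT"], n=1, checks=checks))
```

Because candidates are consumed lazily, the same loop works on an infinite generator as long as n is eventually reached.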

Documentation

Full API documentation is here.
