A package to automatically generate oligonucleotide library pool sequences
oligocompose
Jamie Heather | CCR @ MGH | 2024
The aim of this repo is to facilitate the production of large oligo pool library sequences, particularly potential MHC peptide epitope libraries, from shorter substrings.
- Information on the sequences to be generated should be supplied in a tab-separated file.
- The first column must contain a name or identifier (which needn't be unique).
- All further columns contain the sequences to combine in the output oligos. Each sequence may be followed by optional comma-delimited arguments in square brackets, giving extra information about that sequence.
- Different lines may have different numbers of columns; the input file needn't be a 'proper' whole-grid tab-separated file.
- All lines must be left aligned, i.e. with no empty columns separating sequences.
- Comment lines that will not contribute to the output can also be included by beginning the line with a `#` character.
- This file is referred to via the `-in` or `--in_file` flags, e.g.:
oligocompose -in my_input_file.tsv
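
To make the input format above concrete, here is a minimal sketch of how such a file could be read. It is not oligocompose's own parser: the names `parse_field` and `parse_lines` are hypothetical, and skipping blank lines is an assumption that the description above does not state.

```python
import re

# Hypothetical helpers for illustration only; not oligocompose's own code.
# A field is a sequence optionally followed by '[opt1,opt2,...]'.
FIELD_RE = re.compile(r"^(?P<seq>[^\[\]]+)(?:\[(?P<opts>[^\]]*)\])?$")


def parse_field(field):
    """Split 'SEQ[opt1,opt2]' into the sequence and a list of options."""
    match = FIELD_RE.match(field.strip())
    if match is None:
        raise ValueError(f"Cannot parse field: {field!r}")
    opts = match.group("opts")
    options = [o.strip() for o in opts.split(",") if o.strip()] if opts else []
    return match.group("seq"), options


def parse_lines(lines):
    """Yield (name, [(sequence, options), ...]) for each non-comment line."""
    for line in lines:
        line = line.rstrip("\n")
        # Assumption: blank lines are skipped alongside '#' comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        # Rows may have different numbers of columns, so split per line.
        name, *fields = line.split("\t")
        if any(not f.strip() for f in fields):
            raise ValueError(f"Empty column in line for {name!r}")
        yield name, [parse_field(f) for f in fields]


example = [
    "# a comment line\n",
    "oligo1\tTTT\taaa[aa]\tCCC\n",
    "oligo2\tACDEFGHIKLMNPQRSTVWY[10,aa]\n",
]
parsed = list(parse_lines(example))
for name, fields in parsed:
    print(name, fields)
```

Splitting each line independently is what allows the ragged, left-aligned rows described above.
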
- Sequences can be provided in several ways, in any combination within a line:
  - Explicitly as a simple DNA sequence, or as an amino acid sequence, which will be reverse-translated using the most commonly used codon for each residue in humans.
    - While the script attempts to automatically infer the sequence type (defaulting to nucleotides), it's preferable to explicitly label each sequence type in the square bracket parameter fields with `[nt]` or `[aa]`.
    - E.g. `TTT aaa[nt] CCC` produces 'TTTAAACCC', while `TTT aaa[aa] CCC` produces 'TTTGCCGCCGCCCCC', with the codons for three alanines encoded between the TTT and CCC sequences.
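
The reverse-translation step can be sketched as a simple lookup. The codon table below is an assumption: it lists codons commonly reported as the most frequent for each residue in humans, and only alanine to GCC is confirmed by the example above, so the table the package actually uses may differ. The function names are hypothetical.

```python
# Assumed high-frequency human codons. Only A -> GCC is confirmed by the
# README example; the table oligocompose actually uses may differ.
HUMAN_CODONS = {
    "A": "GCC", "C": "TGC", "D": "GAC", "E": "GAG", "F": "TTC",
    "G": "GGC", "H": "CAC", "I": "ATC", "K": "AAG", "L": "CTG",
    "M": "ATG", "N": "AAC", "P": "CCC", "Q": "CAG", "R": "AGA",
    "S": "AGC", "T": "ACC", "V": "GTG", "W": "TGG", "Y": "TAC",
}


def reverse_translate(peptide):
    """Encode each residue with its single assumed most-common codon."""
    return "".join(HUMAN_CODONS[aa] for aa in peptide.upper())


def compose(fields):
    """Join fields given as (sequence, type) pairs, type being 'nt' or 'aa'."""
    parts = []
    for seq, seq_type in fields:
        if seq_type == "aa":
            parts.append(reverse_translate(seq))
        else:
            parts.append(seq.upper())
    return "".join(parts)


# The same three fields, with the middle one read as nucleotides or residues.
print(compose([("TTT", "nt"), ("aaa", "nt"), ("CCC", "nt")]))
print(compose([("TTT", "nt"), ("aaa", "aa"), ("CCC", "nt")]))
```

Labelling every field explicitly, as this sketch requires, sidesteps the ambiguity of strings such as `aaa` that are valid as both DNA and protein.
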
- Individual sequences that are to be used frequently (e.g. conserved adapter sequences) can be included in a reference 'fixed sequence' file, which can be specified using the `-f` or `--fixed_file` flag.
  - This file should follow the same rules as the input oligo tsv file, except that it must have only two columns: name and reference sequence.
  - Those sequences can then be referred to in the input oligo file by using those names, bracketed with ampersands (e.g. `&name_of_sequence&`).
oligocompose -in my_input_file.tsv -f fixed_sequences.tsv
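
As a rough illustration of the ampersand lookup, the sketch below loads a two-column table and swaps `&name&` fields for their sequences. The helper names, the `adapter_5p` entry and its sequence are all hypothetical, and the sketch handles bare fields only (no square-bracket options).

```python
import re

# Matches a whole field of the form '&name&'.
REF_RE = re.compile(r"^&(?P<name>[^&]+)&$")


def load_fixed(lines):
    """Read 'name<TAB>sequence' lines into a dict, skipping '#' comments."""
    fixed = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue
        name, seq = line.split("\t")
        fixed[name] = seq
    return fixed


def resolve(field, fixed):
    """Return the fixed sequence for '&name&', or the field unchanged."""
    match = REF_RE.match(field)
    if match is None:
        return field
    name = match.group("name")
    if name not in fixed:
        raise KeyError(f"No fixed sequence named {name!r}")
    return fixed[name]


# Hypothetical fixed-sequence file contents.
fixed = load_fixed(["# adapters\n", "adapter_5p\tGGATCC\n"])
print("".join(resolve(f, fixed) for f in ["&adapter_5p&", "TTT"]))
```
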
- For variable sequences (e.g. different sequences from a pool), instead of explicitly providing each individual DNA string, users can provide a path to a file containing a list of sequences, bracketed with dollar signs (e.g. `$path_to/some_file.tsv$`).
  - This file must be formatted the same as the fixed sequence file, i.e. a two-column tsv of `name\tsequence`.
  - In this situation an output sequence will be produced for every sequence in the additional file. E.g. when `some_file` contains the sequences AAA, CCC, GGG, the line `TT $some_file$ TTT` would produce TTAAATTT, TTCCCTTT, and TTGGGTTT.
  - Note that this file cannot contain links to other files, only explicit sequences or references to them.
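
The one-output-per-pool-sequence behaviour can be sketched with `itertools.product`. To keep the example self-contained, the pool file is replaced by an in-memory dict mapping a path to `(name, sequence)` pairs. The name `expand_line` is hypothetical, and taking every combination when a line holds more than one pool field is an assumption, not documented behaviour.

```python
import re
from itertools import product

# Matches a whole field of the form '$path$'.
POOL_RE = re.compile(r"^\$(?P<path>[^$]+)\$$")


def expand_line(fields, pools):
    """Return one joined sequence per entry of each '$path$' pool field.

    `pools` stands in for reading the two-column tsv files from disk.
    """
    choices = []
    for field in fields:
        match = POOL_RE.match(field)
        if match:
            choices.append([seq for _, seq in pools[match.group("path")]])
        else:
            choices.append([field])  # a plain field has a single option
    # Assumption: several pool fields in one line give every combination.
    return ["".join(combo) for combo in product(*choices)]


pools = {"some_file": [("a", "AAA"), ("c", "CCC"), ("g", "GGG")]}
for oligo in expand_line(["TT", "$some_file$", "TTT"], pools):
    print(oligo)
```
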
- Degenerate DNA bases (using IUPAC codes) can also be used.
  - E.g. `TT NNN AA` would produce 64 oligos, with each combination of all four nucleotides at each of the three degenerate positions.
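
Degenerate-base expansion is a Cartesian product over the bases each IUPAC code allows. A minimal sketch, with the hypothetical function name `expand_degenerate` (the order in which oligocompose emits the combinations is not specified here):

```python
from itertools import product

# Standard IUPAC nucleotide codes and the bases each one represents.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_degenerate(seq):
    """Return every concrete sequence matching a degenerate DNA string."""
    options = [IUPAC[base] for base in seq.upper()]
    return ["".join(combo) for combo in product(*options)]


oligos = expand_degenerate("TT" + "NNN" + "AA")
print(len(oligos))  # three N positions, four bases each: 4 ** 3 = 64
```

The number of outputs is the product of the option counts at each position, so long runs of `N` grow the library quickly.
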
- Instead of just inserting whole sequences (provided either directly or via a file), all sub-sequences of a specified length can be individually combined.
  - This is particularly useful for generating oligos tiled along the length of a gene or protein sequence.
  - After giving a sequence, identifier, or path, the integer length of the subsequences to take can be specified in square brackets.
  - Note that this behaviour defaults to a sliding window step of 1 for amino acids, and 3 for nucleotides, assuming that it's been provided with an in-frame ORF and that codons are to be maintained. This behaviour can be altered by providing any integer to the script via the `-sl / --step_length` flag.
  - E.g. `ACDEFGHIKLMNPQRSTVWY[10,aa]` would produce nucleotides encoding every possible 10-mer across this theoretical protein sequence (`ACDEFGHIKL`, `CDEFGHIKLM`, `DEFGHIKLMN`, ... `MNPQRSTVWY`).
    - Running the same code with `-sl 5` included in the command would instead produce 10-mers starting every 5 residues (just `ACDEFGHIKL`, `GHIKLMNPQR`, and `MNPQRSTVWY`).
  - Alternatively, if running on nucleotide sequences, coding sequences can be maintained by providing sequence lengths divisible by three.
    - E.g. `atgaccatgattacgccaagcttgcatgcctgcagg[30,nt]` (encoding the 12-mer `MTMITPSLHACR` peptide) would produce oligos encoding every tiled 10-mer peptide.
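
The tiling described above is a sliding window with a configurable step. The sketch below uses hypothetical names (`tile`, `default_step`) and simply mirrors the documented defaults of 1 for amino acids and 3 for nucleotides; it is not the package's implementation.

```python
def default_step(seq_type):
    """Documented defaults: 1 residue for 'aa', 3 bases (one codon) for 'nt'."""
    return 1 if seq_type == "aa" else 3


def tile(seq, length, step):
    """Return every window of `length` characters, starting every `step`."""
    return [seq[i:i + length] for i in range(0, len(seq) - length + 1, step)]


protein = "ACDEFGHIKLMNPQRSTVWY"
print(tile(protein, 10, default_step("aa")))  # every 10-mer
print(tile(protein, 10, 5))                   # 10-mers starting every 5 residues

orf = "atgaccatgattacgccaagcttgcatgcctgcagg"  # 36 nt, i.e. 12 codons
print(tile(orf, 30, default_step("nt")))      # 30-nt windows kept in frame
```

Stepping nucleotides by 3 keeps each window in frame, which is why a length divisible by three yields whole codons.
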
- Ordinarily the script produces upper case output (e.g. `ttt a[aa] ccc` = `TTTGCCCCC`), but some options have been provided to more easily show the joins between the different input sequence regions.
  - The `-cf / --case_flip` flag makes sequential sequence fields alternate their case (e.g. `TTTgccCCC`).
  - The `--first_case` flag can be used in conjunction with this to specify the starting case (e.g. `-cf -fc lower` = `tttGCCccc`).
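
Case flipping amounts to alternating upper and lower case across the already-converted fields. A minimal sketch, assuming the fields have been reverse-translated beforehand; the function name and its `first_case` parameter are hypothetical stand-ins for the flags described above:

```python
def join_with_case_flip(parts, first_case="upper"):
    """Join nucleotide fields, alternating case from one field to the next."""
    out = []
    upper = first_case == "upper"
    for part in parts:
        out.append(part.upper() if upper else part.lower())
        upper = not upper  # flip for the next field
    return "".join(out)


# 'ttt a[aa] ccc' after reverse translation: TTT, GCC, CCC
parts = ["TTT", "GCC", "CCC"]
print(join_with_case_flip(parts))                      # TTTgccCCC
print(join_with_case_flip(parts, first_case="lower"))  # tttGCCccc
```
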
- There are also options to 3' pad oligos with nucleotides to ensure a minimum length (due to the requirements of certain oligo synthesis platforms).
  - This can be specified with the `-l / --oligo_len` flag, which will add random nucleotides to the 3' end of any oligos shorter than this.
  - Alternatively, if further post-processing is to take place, additionally providing the `-n / --n_pad` flag will pad the 3' ends with 'N' characters instead of random nucleotides.
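
The padding options can be sketched as follows. The function name and the optional `rng` argument (used here only to make the random padding reproducible) are hypothetical, and drawing the four bases uniformly is an assumption about how the random padding is chosen.

```python
import random


def pad_oligo(oligo, min_len, n_pad=False, rng=None):
    """Pad the 3' end of `oligo` up to `min_len` bases.

    Uses 'N' characters when `n_pad` is true, otherwise random bases.
    """
    shortfall = min_len - len(oligo)
    if shortfall <= 0:
        return oligo  # already long enough, leave untouched
    if n_pad:
        return oligo + "N" * shortfall
    rng = rng or random
    # Assumption: each padding base is drawn uniformly from A, C, G, T.
    return oligo + "".join(rng.choice("ACGT") for _ in range(shortfall))


print(pad_oligo("ACGT", 10, n_pad=True))
print(pad_oligo("ACGT", 10, rng=random.Random(0)))
```
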