demultiplexs combinatorially tagged reads
Project description
Mr. Demuxy:
Combinotorial Demultixper
INTRODUCTION:
Author: Daniel E. Lefever
Author-email: lefeverde@gmail.com
license: Too be determined Mr. Demuxy is a fairly simple, self-contained program that makes demultiplexing variable length, combinatorially barcoded NGS reads as painless as possible. As such, the goals for Mr_Demuxy were that the final program would require little to no software installation and to be as user-friendly as possible. This program was written with sequencing data generated by following Travis Glenn’s Tagimatrix barcoding strategy [Tagimatrix Ref] in mind, however, it would work on any type of data multiplexed with combinatorial barcodes. In short, this approach uses two pairs of barcodes denoted as the internal and external barcodes (also referred to as tags and indexes). The idea is to pool many different projects onto one MiSeq/HiSeq run cheaply by having external Illumina barcodes to differentiate the pooled projects and internal ‘homebrewed’ variable length barcodes to distinguish samples within each project. The internal barcoding strategy is based on the layout of a 96-well plate: the forward tag corresponds to row letter (A-H) and the reverse tag corresponds to column number (1-12). This program uses both tags to come up with the sample ID. There are two main scripts, called merged_demuxer and pe_demuxer which can handle either demultiplexing a merged fastq file or paired-end fastq files, respectively. DEPENDENCIES: python 2.7.X unix/linux OS INSTALLATION: Step 1: Download Mr. Demuxy and Extract it
After downloading Mr. Demuxy extract the folder contents. To extract file contents, try clicking on it. If that doesn't work, try this:
$ tar -xvzf Mr_Demuxy-1.0.0-dist.tar.gz Step 2: Install package From the command line change dir to wherever the unarchived folder lives. Then install to your local $ python dir like so: $ python setup.py install --user --force Step 3: Create bash aliases Next, run the alias_maker script so you can run this program from anywhere without needing to write out the programs full path. Here's how to run it:
$ python alias_maker.py Step 4: Restart terminal After restarting terminal, you should now be able to call the script from the console like this: $ merged_demuxer -i example_merged.fastq -f forward_bcs.txt -r reverse_bcs.txt -o example_demult.fasta $ pe_demuxer -r1 r1_seqs.fastq -r2 r2_seqs.fastq -r1_bc r1_bc_file.txt -r2_bc -r2_bc_file.txt
USAGE EXAMPLES: MERGED SINGLE FASTQ
merged_demuxer arguements:
required arguments:
-f forward bcs file
-r reverse bcs file
-i input merged fastq file
optional arguments:
-o name of output file or folder (default:
merged_demuxer_out)
--forward_primer_len base number of forward primer/adapter (default: 0)
--reverse_primer_len
base number of reverse primer/adapter (default: 0)
--no_rev_comp_reverse_bcs
prevents rev bcs from reverse complementing
-h, --help show this help message and exit
--keep_original_headers
pass this arg to keep original headers in
demultiplexed fasta/fastq file
--write_qual write a qual file containing quality scores
--individual_fastq demultiplex by writing a fastq file for each sample ID
--min_seq_length minimum length of seq to be retained (default: 20)
--file_extension format of the output file (default: fasta) ---Demultiplex merged fastq reads w/ output being in fasta format: $ merged_demuxer -i example_merged.fastq -f forward_bcs.txt -r reverse_bcs.txt -o example_demult.fasta
---Do the same as above and get a seperate qual file:
$ merged_demuxer -i example_merged.fastq -f forward_bcs.txt -r reverse_bcs.txt -o example_demult.fasta --write_qual
---Demultiplex and keep the original fastq headers:
$ merged_demuxer -i example_merged.fastq -f forward_bcs.txt -r reverse_bcs.txt -o example_demult.fasta --keep_original_headers
---Demultiplex merged reads and remove the 17 bp forward primer and 22 bp reverse primer: $ merged_demuxer -i example_merged.fastq -f forward_bcs.txt -r reverse_bcs.txt -o example_demult.fasta --forward_primer_len 17 --reverse_primer_len 22 ---Do the same as above except have the output be in fastq format: $ merged_demuxer -i example_merged.fastq -f forward_bcs.txt -r reverse_bcs.txt -o example_demult.fastq --forward_primer_len 17 --reverse_primer_len 22 --file_extension fastq ---Demultiplex reads and split into one fastq file per sample ID: $ merged_demuxer -i example_merged.fastq -f forward_bcs.txt -r reverse_bcs.txt -o example_demult_ind_fastq --individual_fastq
---Get individual fastq files, chop off the 17 bp forward primer and 22 bp reverse primer, and keep reads that are only
longer than 100 bp:
$ merged_demuxer -i example_merged.fastq -f forward_bcs.txt -r reverse_bcs.txt -o example_demult_ind_fastq --individual_fastq --forward_primer_len 17 --reverse_primer_len 22 --individual_fastq --min_seq_length 100
USAGE EXAMPLES: TWO PAIRED-END FASTQ FILES
pe_demuxer arguements:
required arguments:
-r1_bc BC file for R1 reads
-r2_bc BC file for R2 reads
-r1 R1 fastq file
-r2 R2 fastq file
optional arguments:
-h, --help show this help message and exit
-o name of output file or dir
--forward_primer_len base number of primer/adapter on R1 seq (default: 0)
--reverse_primer_len
base number of primer/adapter on R2 seq (default: 0)
--keep_original_headers
pass this arg to keep original headers in
demultiplexed fasta/fastq file
--min_seq_length minimum length of seq to be retained (default: 20)
--check_headers ensures the R1 and R2 read match before demultiplexing
WARNING: this is super slow ---Demultiplex paired-end fastq files and write individual fastqs for each sample in their respective directories: $ pe_demuxer -r1 r1_seqs.fastq -r2 r2_seqs.fastq -r1_bc r1_bc_file.txt -r2_bc -r2_bc_file.txt
---Do the same as above and remove a 10 bp primer on R1 reads and a 16 bp primer on R2 reads:
$ pe_demuxer -r1 r1_seqs.fastq -r2 r2_seqs.fastq -r1_bc r1_bc_file.txt -r2_bc -r2_bc_file.txt --forward_primer_len 10 --reverse_primer_len 16
---Demultiplex paired-end reads in which the read order may be different and/or the files contain un-paired reads:
Note: using this option is not recommended and should not be neccessary since Illumina sequencers should produce files with reads in the same order and discard un-paired reads. I didn't realize that until after added this feature so I decided to include it anyways.
$ pe_demuxer -r1 r1_seqs.fastq -r2 r2_seqs.fastq -r1_bc r1_bc_file.txt -r2_bc -r2_bc_file.txt --check_headers
BARCODE FILES:
The forward barcode file, as you might expect, is the file containing the forward barcodes and letter (or whatever the barcode corresponds to). You will probably need to make your own forward barcode file, and the formatting is important. The easieast way to make one, is to open a new sheet in excel and copy/paste the barcode and letter pairs with the first column being the letter and the second containing the barcode. It should look like this: A GGTAC B CAACAC C ATCGGTT D TCGGTCAA NOTE: Both barcode files do not have any header information. I’m not sure if it will screw things up, but it’s safest to not include it. Once you’re done inputing the forward barcodes, save the excel sheet as a ‘tab delineated text’ file. Be sure to name it something so you know what it is. The reverse barcode file is just like the forward barcode file, except for the reverse barcodes. It should look like this: 1 AGGAA 2 GAGTGG 3 CCACGTC 4 TTCTCAGC Once you’ve followed the same process as with the forward barcode file, save it as a ‘tab delineated text’ file. NOTE: This program reverse complements (RC) the reverse barcodes instead of RCing the whole sequence. This makes it go faster, but if your reverse barcodes do not need to be RC'd, pass the --no_rev_comp_reverse_bcs arguement. OUTPUT: The output file/files contain reads that have been demultiplexed, and optionally trimmed of primers. This program automatically removes the barcode from each read and will remove the primers if length is specified by the --forward_primer_len and --reverse_primer_len arguments. Since indels are rare with illumina sequencing, the primers are not actually matched and are instead removed by trimming a set number of bases after the barcode. For example, if the forward primer was 17 bp long, and the reverse 15, pass --forward_primer_len 17 --reverse_primer_len 15 to remove it from each read. If these options are not passed, it defaults to 0. This program can also create individual fastq files split by each sample. To do this, simply pass the --individual_fastq argument. The -o arguement will become the name of the output directory with one fastq per sample. The dir structure is as followd:
-output_dir/
<sample_id_1>.fastq
<sample_id_2>.fastq
.
.
.
<sample_id_n>.fastq
pe_demuxer will only output individual fastq files. The R1 and R2 files are seperated into two folders each containing individual fastq files split by sample. This is the structure:
-output_dir/
R1/
<sample_id_1>_R1.fastq
<sample_id_2>_R1.fastq
.
.
.
<sample_id_n>_R1.fastq
R2/
<sample_id_1>_R2.fastq
<sample_id_2>_R2.fastq
.
.
.
<sample_id_n>_R1.fastq
There are number of different options for the output file. These include: HEADER OPTIONS: -Demultiplexed ID only- --FASTA Format-- >A6_1 AATATT....CACGCAGA --FASTQ Format-- @A6_1 AATATT....CACGCAGA + GGGGGG...@GGFGGFC< -Original header with demultiplexed ID- --FASTA Format-- >M02849:...:1886 A6_1 AATATT....CACGCAGA --FASTQ Format-- @M02849:...:1886 A6_1 AATATT....CACGCAGA + GGGGGG...@GGFGGFC< PROGRAM INFO: This program was written and predominantly tested using a macbook pro 2011 with 8 GB memory and a 2.4 GHz Intel Core i7 processor running OS X El Captitan using python 2.7.10. With this setup, both merged (written to single or individual files) and paired-end demultiplexing of ~7,000,000 reads took around 3-4 minutes. References: Cock, P. J., Antao, T., Chang, J. T., Chapman, B. A., Cox, C. J., Dalke, A., . . . de Hoon, M. J. (2009). Bio$ python: freely available $ python tools for computational molecular biology and bioinformatics. Bioinformatics, 25(11), 1422-1423. doi:10.1093/bioinformatics/btp163
Author-email: lefeverde@gmail.com
license: Too be determined Mr. Demuxy is a fairly simple, self-contained program that makes demultiplexing variable length, combinatorially barcoded NGS reads as painless as possible. As such, the goals for Mr_Demuxy were that the final program would require little to no software installation and to be as user-friendly as possible. This program was written with sequencing data generated by following Travis Glenn’s Tagimatrix barcoding strategy [Tagimatrix Ref] in mind, however, it would work on any type of data multiplexed with combinatorial barcodes. In short, this approach uses two pairs of barcodes denoted as the internal and external barcodes (also referred to as tags and indexes). The idea is to pool many different projects onto one MiSeq/HiSeq run cheaply by having external Illumina barcodes to differentiate the pooled projects and internal ‘homebrewed’ variable length barcodes to distinguish samples within each project. The internal barcoding strategy is based on the layout of a 96-well plate: the forward tag corresponds to row letter (A-H) and the reverse tag corresponds to column number (1-12). This program uses both tags to come up with the sample ID. There are two main scripts, called merged_demuxer and pe_demuxer which can handle either demultiplexing a merged fastq file or paired-end fastq files, respectively. DEPENDENCIES: python 2.7.X unix/linux OS INSTALLATION: Step 1: Download Mr. Demuxy and Extract it
After downloading Mr. Demuxy extract the folder contents. To extract file contents, try clicking on it. If that doesn't work, try this:
$ tar -xvzf Mr_Demuxy-1.0.0-dist.tar.gz Step 2: Install package From the command line change dir to wherever the unarchived folder lives. Then install to your local $ python dir like so: $ python setup.py install --user --force Step 3: Create bash aliases Next, run the alias_maker script so you can run this program from anywhere without needing to write out the programs full path. Here's how to run it:
$ python alias_maker.py Step 4: Restart terminal After restarting terminal, you should now be able to call the script from the console like this: $ merged_demuxer -i example_merged.fastq -f forward_bcs.txt -r reverse_bcs.txt -o example_demult.fasta $ pe_demuxer -r1 r1_seqs.fastq -r2 r2_seqs.fastq -r1_bc r1_bc_file.txt -r2_bc -r2_bc_file.txt
USAGE EXAMPLES: MERGED SINGLE FASTQ
merged_demuxer arguements:
required arguments:
-f forward bcs file
-r reverse bcs file
-i input merged fastq file
optional arguments:
-o name of output file or folder (default:
merged_demuxer_out)
--forward_primer_len base number of forward primer/adapter (default: 0)
--reverse_primer_len
base number of reverse primer/adapter (default: 0)
--no_rev_comp_reverse_bcs
prevents rev bcs from reverse complementing
-h, --help show this help message and exit
--keep_original_headers
pass this arg to keep original headers in
demultiplexed fasta/fastq file
--write_qual write a qual file containing quality scores
--individual_fastq demultiplex by writing a fastq file for each sample ID
--min_seq_length minimum length of seq to be retained (default: 20)
--file_extension format of the output file (default: fasta) ---Demultiplex merged fastq reads w/ output being in fasta format: $ merged_demuxer -i example_merged.fastq -f forward_bcs.txt -r reverse_bcs.txt -o example_demult.fasta
---Do the same as above and get a seperate qual file:
$ merged_demuxer -i example_merged.fastq -f forward_bcs.txt -r reverse_bcs.txt -o example_demult.fasta --write_qual
---Demultiplex and keep the original fastq headers:
$ merged_demuxer -i example_merged.fastq -f forward_bcs.txt -r reverse_bcs.txt -o example_demult.fasta --keep_original_headers
---Demultiplex merged reads and remove the 17 bp forward primer and 22 bp reverse primer: $ merged_demuxer -i example_merged.fastq -f forward_bcs.txt -r reverse_bcs.txt -o example_demult.fasta --forward_primer_len 17 --reverse_primer_len 22 ---Do the same as above except have the output be in fastq format: $ merged_demuxer -i example_merged.fastq -f forward_bcs.txt -r reverse_bcs.txt -o example_demult.fastq --forward_primer_len 17 --reverse_primer_len 22 --file_extension fastq ---Demultiplex reads and split into one fastq file per sample ID: $ merged_demuxer -i example_merged.fastq -f forward_bcs.txt -r reverse_bcs.txt -o example_demult_ind_fastq --individual_fastq
---Get individual fastq files, chop off the 17 bp forward primer and 22 bp reverse primer, and keep reads that are only
longer than 100 bp:
$ merged_demuxer -i example_merged.fastq -f forward_bcs.txt -r reverse_bcs.txt -o example_demult_ind_fastq --individual_fastq --forward_primer_len 17 --reverse_primer_len 22 --individual_fastq --min_seq_length 100
USAGE EXAMPLES: TWO PAIRED-END FASTQ FILES
pe_demuxer arguements:
required arguments:
-r1_bc BC file for R1 reads
-r2_bc BC file for R2 reads
-r1 R1 fastq file
-r2 R2 fastq file
optional arguments:
-h, --help show this help message and exit
-o name of output file or dir
--forward_primer_len base number of primer/adapter on R1 seq (default: 0)
--reverse_primer_len
base number of primer/adapter on R2 seq (default: 0)
--keep_original_headers
pass this arg to keep original headers in
demultiplexed fasta/fastq file
--min_seq_length minimum length of seq to be retained (default: 20)
--check_headers ensures the R1 and R2 read match before demultiplexing
WARNING: this is super slow ---Demultiplex paired-end fastq files and write individual fastqs for each sample in their respective directories: $ pe_demuxer -r1 r1_seqs.fastq -r2 r2_seqs.fastq -r1_bc r1_bc_file.txt -r2_bc -r2_bc_file.txt
---Do the same as above and remove a 10 bp primer on R1 reads and a 16 bp primer on R2 reads:
$ pe_demuxer -r1 r1_seqs.fastq -r2 r2_seqs.fastq -r1_bc r1_bc_file.txt -r2_bc -r2_bc_file.txt --forward_primer_len 10 --reverse_primer_len 16
---Demultiplex paired-end reads in which the read order may be different and/or the files contain un-paired reads:
Note: using this option is not recommended and should not be neccessary since Illumina sequencers should produce files with reads in the same order and discard un-paired reads. I didn't realize that until after added this feature so I decided to include it anyways.
$ pe_demuxer -r1 r1_seqs.fastq -r2 r2_seqs.fastq -r1_bc r1_bc_file.txt -r2_bc -r2_bc_file.txt --check_headers
BARCODE FILES:
The forward barcode file, as you might expect, is the file containing the forward barcodes and letter (or whatever the barcode corresponds to). You will probably need to make your own forward barcode file, and the formatting is important. The easieast way to make one, is to open a new sheet in excel and copy/paste the barcode and letter pairs with the first column being the letter and the second containing the barcode. It should look like this: A GGTAC B CAACAC C ATCGGTT D TCGGTCAA NOTE: Both barcode files do not have any header information. I’m not sure if it will screw things up, but it’s safest to not include it. Once you’re done inputing the forward barcodes, save the excel sheet as a ‘tab delineated text’ file. Be sure to name it something so you know what it is. The reverse barcode file is just like the forward barcode file, except for the reverse barcodes. It should look like this: 1 AGGAA 2 GAGTGG 3 CCACGTC 4 TTCTCAGC Once you’ve followed the same process as with the forward barcode file, save it as a ‘tab delineated text’ file. NOTE: This program reverse complements (RC) the reverse barcodes instead of RCing the whole sequence. This makes it go faster, but if your reverse barcodes do not need to be RC'd, pass the --no_rev_comp_reverse_bcs arguement. OUTPUT: The output file/files contain reads that have been demultiplexed, and optionally trimmed of primers. This program automatically removes the barcode from each read and will remove the primers if length is specified by the --forward_primer_len and --reverse_primer_len arguments. Since indels are rare with illumina sequencing, the primers are not actually matched and are instead removed by trimming a set number of bases after the barcode. For example, if the forward primer was 17 bp long, and the reverse 15, pass --forward_primer_len 17 --reverse_primer_len 15 to remove it from each read. If these options are not passed, it defaults to 0. This program can also create individual fastq files split by each sample. To do this, simply pass the --individual_fastq argument. The -o arguement will become the name of the output directory with one fastq per sample. The dir structure is as followd:
-output_dir/
<sample_id_1>.fastq
<sample_id_2>.fastq
.
.
.
<sample_id_n>.fastq
pe_demuxer will only output individual fastq files. The R1 and R2 files are seperated into two folders each containing individual fastq files split by sample. This is the structure:
-output_dir/
R1/
<sample_id_1>_R1.fastq
<sample_id_2>_R1.fastq
.
.
.
<sample_id_n>_R1.fastq
R2/
<sample_id_1>_R2.fastq
<sample_id_2>_R2.fastq
.
.
.
<sample_id_n>_R1.fastq
There are number of different options for the output file. These include: HEADER OPTIONS: -Demultiplexed ID only- --FASTA Format-- >A6_1 AATATT....CACGCAGA --FASTQ Format-- @A6_1 AATATT....CACGCAGA + GGGGGG...@GGFGGFC< -Original header with demultiplexed ID- --FASTA Format-- >M02849:...:1886 A6_1 AATATT....CACGCAGA --FASTQ Format-- @M02849:...:1886 A6_1 AATATT....CACGCAGA + GGGGGG...@GGFGGFC< PROGRAM INFO: This program was written and predominantly tested using a macbook pro 2011 with 8 GB memory and a 2.4 GHz Intel Core i7 processor running OS X El Captitan using python 2.7.10. With this setup, both merged (written to single or individual files) and paired-end demultiplexing of ~7,000,000 reads took around 3-4 minutes. References: Cock, P. J., Antao, T., Chang, J. T., Chapman, B. A., Cox, C. J., Dalke, A., . . . de Hoon, M. J. (2009). Bio$ python: freely available $ python tools for computational molecular biology and bioinformatics. Bioinformatics, 25(11), 1422-1423. doi:10.1093/bioinformatics/btp163
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Mr_Demuxy-1.0.8.beta.tar.gz
(48.5 kB
view details)
File details
Details for the file Mr_Demuxy-1.0.8.beta.tar.gz
.
File metadata
- Download URL: Mr_Demuxy-1.0.8.beta.tar.gz
- Upload date:
- Size: 48.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | ddec24371c7bc182a534c01575be35a1ffe4f60ef692cbfd8b2cf542a38c8761 |
|
MD5 | 8ac93647966b3fd1d6ac42d7e202691c |
|
BLAKE2b-256 | bfd23af4dfab76486b7d390dd2ef5834e436e443012f485d5903da95c3835ade |