TileSeqMut

Analysis scripts for TileSeq sequencing data

Development Status
- 3 - Alpha
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Programming Language
- Python :: 3.7
- Python :: 3.8
Topic
- Software Development :: Build Tools

Project description

TileSeq mutation count package

This package parses input sequencing files (fastq) according to a user-provided parameter sheet (CSV file). The output of the pipeline is a set of mutation counts for each pair of fastq files.

Dependencies

Python 3.7 or 3.8 (tested mainly under Python 3.7)

R 3.4.4+

Bowtie2 (bowtie2 and bowtie2-build)

Installation

Please use conda to set up the environment before installing the package:

conda install -n <env_name>

You will also need the script csv2json.R, which is installed together with tileseqMave. Make sure csv2json.R can be found in $PATH.
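A quick way to confirm that the external tools are visible before starting a run is to look them up on $PATH. The sketch below is not part of TileSeqMut: `find_missing_tools` is a hypothetical helper, and it assumes the executables are named bowtie2, bowtie2-build and csv2json.R exactly as written above.

```python
import shutil

# Tool names taken from the Dependencies and Installation sections.
REQUIRED_TOOLS = ("bowtie2", "bowtie2-build", "csv2json.R")


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be located on $PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

Passing `which` as a parameter keeps the helper easy to exercise without the real tools installed.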

The alpha version is available by running:

python -m pip install TileSeqMut

To install a specific release, pin its version (0.4.201 in this example):

python -m pip install TileSeqMut==0.4.201

Execution

After installation, you can run the package:

tileseq_mut -p ~/path/to/paramSheet.csv -o ~/path/to/output_folder -f ~/path/to/fastq_file_folder/ --name name_of_the_run

Examples:

# on DC
tileseq_mut -p $HOME/dev/tilseq_mutcount/190506_param_MTHFR.csv -o $HOME/dev/tilseq_mutcount/output/ -f $HOME/tileseq_data/WT/ --name MTHFR_test

# on BC2
tileseq_mut -p $HOME/dev/tilseq_mutcount/190506_param_MTHFR.csv -o $HOME/dev/tilseq_mutcount/output/ -f $HOME/tileseq_data/WT/ --name MTHFR_test -env BC2

This command analyzes the fastq files in the folder ~/tileseq_data/WT/ and creates a time-stamped output folder with the prefix MTHFR_test in $HOME/dev/tilseq_mutcount/output/ (using all default parameters, see below).

Parameters

Run tileseq_mut --help to list all parameters:

(py37) [rli@dc06 DC_jobs]$ tileseq_mut -h
usage: tileseq_mut [-h] [-f FASTQ] -o OUTPUT -p PARAM -n NAME
                   [--skip_alignment] [-r1 R1] [-r2 R2] [-log LOG_LEVEL]
                   [-env ENVIRONMENT] [-at AT] [-mt MT] [-c C] [-b BASE]
                   [-test] [-rc] [-override]

TileSeq mutation counts

optional arguments:
  -h, --help            show this help message and exit
  -f FASTQ, --fastq FASTQ
                        Path to all fastq files you want to analyze
  -o OUTPUT, --output OUTPUT
                        Output folder
  -p PARAM, --param PARAM
                        csv paramter file
  -n NAME, --name NAME  Name for this run
  --skip_alignment      skip alignment for this analysis, ONLY submit jobs for
                        counting mutations in existing output folder
  -r1 R1                r1 SAM file
  -r2 R2                r2 SAM file
  -log LOG_LEVEL, --log_level LOG_LEVEL
                        set log level: debug, info, warning, error, critical.
                        (default = info)
  -env ENVIRONMENT, --environment ENVIRONMENT
                        The cluster used to run this script (default = DC)
  -at AT                Alignment time (default = 8h)
  -mt MT                Mutation call time (default = 48h)
  -c C                  Number of cores to use for mutation counting
  -b BASE, --base BASE  ASCII code base
  -test                 Turn on test mode
  -rc                   Turn on rc mode, both direction of the reads will be
                        aligned to the reference. Variant calling will be
                        performed on all the reads that are aligned, regardless of their direction (BE
                        CAREFUL!)
  -override, --sr_Override
                        Provide this argument when there is only one replicate

Start the run

Once the run starts, the pipeline first submits the alignment jobs to the cluster and keeps track of all submitted alignment jobs. Once all of them have finished, it submits another batch of jobs for mutation calling.
If you want to skip alignment and only call mutations for existing sam files, pass --skip_alignment and point -o at the existing output folder, for example:

tileseq_mut -p $HOME/dev/tilseq_mutcount/190506_param_MTHFR.csv -o /home/rothlab1/rli/dev/tilseq_mutcount/output/190506_MTHFR_WT_2020-01-29-17-07-04/ --skip_alignment
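The two-stage behaviour described in this section (submit alignment jobs, wait for all of them, then submit mutation-calling jobs) can be sketched as a small polling loop. This is only an illustration of the control flow, not the package's code: `submit_alignment`, `submit_mutation_call` and `is_finished` are hypothetical callables standing in for whatever the cluster scheduler provides, and the 60-second polling interval is an arbitrary assumption.

```python
import time


def wait_for_jobs(job_ids, is_finished, poll_seconds=60, sleep=time.sleep):
    """Block until is_finished(job_id) is true for every submitted job."""
    pending = set(job_ids)
    while pending:
        pending = {job for job in pending if not is_finished(job)}
        if pending:
            sleep(poll_seconds)


def run_pipeline(samples, submit_alignment, submit_mutation_call, is_finished,
                 skip_alignment=False, poll_seconds=60, sleep=time.sleep):
    """Submit alignment jobs, wait for them, then submit mutation-call jobs.

    With skip_alignment=True only the mutation-call jobs are submitted,
    mirroring the --skip_alignment flag.
    """
    if not skip_alignment:
        alignment_jobs = [submit_alignment(sample) for sample in samples]
        wait_for_jobs(alignment_jobs, is_finished, poll_seconds, sleep)
    return [submit_mutation_call(sample) for sample in samples]
```

Injecting the scheduler functions keeps the sketch independent of any particular cluster.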

Input files

/path/to/fastq/ - Full path to input fastq files

parameters.csv - CSV file containing the information for this run (please see the example provided with the project). This file must be comma-separated and saved in csv format.

Output files

One output folder is created for each run. Output folders are named name_time-stamp.
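As an illustration of the naming scheme, the sketch below builds a folder name from a run name and a timestamp. The timestamp layout (YYYY-MM-DD-HH-MM-SS) is inferred from the example folder 190506_MTHFR_WT_2020-01-29-17-07-04 shown earlier, and `stamped_output_dir` is a hypothetical helper, not a function exported by the package.

```python
from datetime import datetime
from pathlib import Path


def stamped_output_dir(output_root, run_name, now=None):
    """Build <output_root>/<run_name>_<YYYY-MM-DD-HH-MM-SS>."""
    now = now or datetime.now()
    stamp = now.strftime("%Y-%m-%d-%H-%M-%S")
    return Path(output_root) / f"{run_name}_{stamp}"


example = stamped_output_dir("output", "190506_MTHFR_WT",
                             datetime(2020, 1, 29, 17, 7, 4))
print(example.name)  # 190506_MTHFR_WT_2020-01-29-17-07-04
```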

Within each output folder, the following files and folders will be generated:

./main.log - main logging file for alignment

./args.log - arguments for this run

./ref/ - Reference fasta file and bowtie2 index

./env_aln_sh/ - Bash scripts for submitting the alignment jobs

./sam_files/ - Alignment output and log files for the raw fastq files

./name_time-stamped_mut_count/ - Mutation counts for each sample, saved as csv files. This folder contains:

- `./main.log` - Main log file for mutation calling

- `./args.log` - Command line arguments

- `./info.csv` - Meta information for each sample: sequencing depth, tile starts/ends and number of reads mapped outside of the targeted tile

- `./count_sample_*.csv` - Raw mutation counts for each sample, with metadata in the header. Variants are represented in HGVS format

- `./env_mut/` - Bash scripts for submitting the mutation count jobs, as well as log files for each sample

The count_sample_*.csv files are passed to tileseqMave for further analysis.
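For readers who want to inspect a count file directly, the sketch below separates a metadata header from the count table. The exact file layout is not described here, so the code assumes (and this is only an assumption) that metadata lines start with `#` in the form `#key:value` and are followed by an ordinary CSV table with a header row. The sample text and column names are hypothetical and exist only to exercise the function.

```python
import csv
import io


def read_count_file(handle):
    """Split a count file into (metadata dict, list of row dicts).

    Assumed layout (not confirmed by this README): leading lines of the
    form '#key:value', followed by a regular CSV table with a header row.
    """
    metadata, table_lines = {}, []
    for line in handle:
        if line.startswith("#"):
            key, _, value = line[1:].strip().partition(":")
            metadata[key.strip()] = value.strip()
        elif line.strip():
            table_lines.append(line)
    rows = list(csv.DictReader(table_lines))
    return metadata, rows


# Hypothetical file content, for illustration only.
sample_text = "#Sample:1\n#Tile:2\nHGVS,count\nc.5A>G,3\n"
meta, rows = read_count_file(io.StringIO(sample_text))
print(meta)
print(rows)
```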

Alignment

The pipeline uses the sequence specified by the user in the parameter file as the reference and aligns the fastq files to the whole reference sequence.

For each pair of fastq files (R1 and R2), the pipeline submits one alignment job to the cluster. The folder env_sh contains all the scripts that were submitted to the cluster when you ran main.py.

Alignments are performed using Bowtie2 with the following parameters:

~/bowtie2 --no-head --norc --no-sq --local -x {ref} -U {r1} -S {r1_sam_file}
~/bowtie2 --no-head --nofw --no-sq --local -x {ref} -U {r2} -S {r2_sam_file}
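The two commands differ only in the strand option, so they can be generated from one template. The sketch below builds the argument list for either read; `bowtie2_command` is a hypothetical helper and the file names in the usage line are placeholders, while the flags are copied from the commands above.

```python
import shlex


def bowtie2_command(ref_index, fastq, sam_out, read, bowtie2="bowtie2"):
    """Return the Bowtie2 argument list for an R1 or R2 fastq file.

    R1 is aligned only in its forward orientation (--norc) and R2 only
    as its reverse complement (--nofw), as in the commands above.
    """
    strand_flag = {"R1": "--norc", "R2": "--nofw"}[read]
    return [bowtie2, "--no-head", strand_flag, "--no-sq", "--local",
            "-x", ref_index, "-U", fastq, "-S", sam_out]


# Placeholder paths, for illustration only.
cmd = bowtie2_command("ref/reference", "sample_R1.fastq", "sample_R1.sam", "R1")
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Returning a list rather than a single string means the command can be handed to `subprocess.run` without shell quoting issues.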

Mutation Calls

From each pair of sam files we count mutations for each sample.

We first filter out reads that did not map to the reference or that fall outside of the tile, then pass the remaining reads to count_mut.py. Please read the wiki page for details on how mutations are called using the CIGAR string and MD:Z tag.
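A minimal sketch of these two steps is shown below, using only standard SAM conventions (flag bit 0x4 marks an unmapped read, POS is 1-based, and MD:Z lists matches, mismatched reference bases and `^`-prefixed deletions). It is not the logic of count_mut.py. The function names are hypothetical, "outside of the tile" is interpreted here as "no overlap with the tile" (an assumption), and only substitutions are reported.

```python
import re


def parse_sam_line(line):
    """Extract the fields needed for filtering from one SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    tags = {f[:2]: f[5:] for f in fields[11:]}  # e.g. 'MD:Z:7C12' -> {'MD': '7C12'}
    return {"flag": int(fields[1]), "pos": int(fields[3]),
            "cigar": fields[5], "md": tags.get("MD")}


def reference_span(pos, cigar):
    """Return the 1-based inclusive reference interval covered by a read."""
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    ref_len = sum(int(n) for n, op in ops if op in "MDN=X")
    return pos, pos + ref_len - 1


def keep_read(read, tile_start, tile_end):
    """Drop unmapped reads and reads that do not overlap the tile."""
    if read["flag"] & 0x4 or read["cigar"] == "*":
        return False
    start, end = reference_span(read["pos"], read["cigar"])
    return start <= tile_end and end >= tile_start


def md_mismatches(pos, md):
    """List (reference position, reference base) for substitutions in MD:Z.

    Deletions ('^' segments) advance the reference position but are not
    reported; insertions appear only in CIGAR and are ignored here.
    """
    mismatches, ref_pos = [], pos
    for match_len, deletion, ref_base in re.findall(
            r"(\d+)|\^([A-Z]+)|([A-Z])", md):
        if match_len:
            ref_pos += int(match_len)
        elif deletion:
            ref_pos += len(deletion)
        else:
            mismatches.append((ref_pos, ref_base))
            ref_pos += 1
    return mismatches
```

For example, an alignment starting at position 10 with `MD:Z:5A3^TG2C0` has substitutions at reference positions 15 and 23.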

To eliminate sequencing errors, we apply a posterior probability cut-off. The posterior probability of a mutation is calculated using the Phred scores provided in the SAM files.
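To make the idea concrete, the sketch below converts Phred quality characters to error probabilities and combines the R1 and R2 calls with Bayes' rule. This is a deliberately simplified model and not necessarily the one used by the package. It assumes an ASCII offset of 33 (the value that `-b BASE` presumably controls), a user-chosen prior mutation rate, and that a sequencing error produces each of the three wrong bases with equal probability.

```python
def phred_to_error_prob(qual_char, base=33):
    """Convert one SAM/FASTQ quality character to an error probability."""
    q = ord(qual_char) - base
    return 10 ** (-q / 10)


def posterior_mutation_prob(qual_r1, qual_r2, prior, base=33):
    """Posterior that a variant base seen in both R1 and R2 is real.

    Simplified model: either the template is mutant and both reads are
    correct, or it is wild type and both reads made the same error
    (each specific wrong base has probability e/3).
    """
    e1 = phred_to_error_prob(qual_r1, base)
    e2 = phred_to_error_prob(qual_r2, base)
    mutant = prior * (1 - e1) * (1 - e2)
    wild_type = (1 - prior) * (e1 / 3) * (e2 / 3)
    return mutant / (mutant + wild_type)


# 'I' is Q40 and '#' is Q2 under the assumed offset of 33.
high = posterior_mutation_prob("I", "I", prior=0.001)
low = posterior_mutation_prob("#", "#", prior=0.001)
print(high > low)  # True
```

A variant would then be kept only if its posterior exceeds the chosen cut-off.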

Release history

Version	Release date
0.6.945	Nov 6, 2023
0.6.944	Aug 2, 2023
0.6.943	Apr 14, 2023
0.6.942	Mar 30, 2023
0.6.941	Jan 12, 2023
0.6.940	Oct 26, 2022
0.6.930	Sep 30, 2022
0.6.920	Jul 6, 2022
0.6.910	Nov 3, 2021
0.6.900	Oct 27, 2021
0.6.890	Oct 7, 2021
0.6.880	Sep 30, 2021
0.6.870	Sep 30, 2021
0.6.860	Sep 9, 2021
0.6.850	Aug 3, 2021
0.6.840	Jul 30, 2021
0.6.830	Jul 27, 2021
0.6.820	Jul 26, 2021
0.6.810	Jul 19, 2021
0.6.800	Jul 15, 2021
0.6.730	Jul 15, 2021
0.6.710	Jul 9, 2021
0.6.700	Jul 9, 2021
0.6.600	Jun 30, 2021
0.6.510	Jun 24, 2021
0.6.500	Jun 23, 2021
0.6.408	Jun 3, 2021
0.6.407	Jun 3, 2021
0.6.406	Jun 3, 2021
0.6.405	Jun 3, 2021
0.6.404	Jun 3, 2021
0.6.403	Jun 3, 2021
0.6.402	May 18, 2021
0.6.401	May 17, 2021
0.6.4	May 14, 2021
0.6.3	May 12, 2021
0.6.2	May 4, 2021
0.5.907	Apr 21, 2021
0.5.905	Apr 13, 2021
0.5.904	Apr 13, 2021
0.5.903	Apr 13, 2021
0.5.902	Apr 13, 2021
0.5.901	Apr 12, 2021
0.5.9	Apr 9, 2021
0.5.7	Mar 26, 2021 (this version)
0.5.5	Mar 11, 2021
0.4.216	Mar 2, 2021
0.4.207	Feb 22, 2021
0.4.206	Feb 19, 2021
0.4.205	Feb 19, 2021
0.4.204	Feb 18, 2021
0.4.203	Feb 18, 2021
0.4.201	Feb 11, 2021
0.4.15	Jan 20, 2021
0.3.903	Jan 14, 2021
0.3.902	Jan 14, 2021
0.3.901	Jan 14, 2021
0.3.9	Jan 14, 2021
0.3.8	Jan 13, 2021
0.2.2	Nov 27, 2020
0.1.16	Oct 6, 2020
0.1.15	Oct 6, 2020
0.1.14	Oct 6, 2020
0.1.13	Oct 6, 2020
0.1.12	Oct 6, 2020
0.1.11	Sep 24, 2020

Download files

Source Distribution

TileSeqMut-0.5.7.tar.gz (41.6 kB), uploaded Mar 26, 2021 (Source)

Built Distribution

TileSeqMut-0.5.7-py3-none-any.whl (71.0 kB), uploaded Mar 26, 2021 (Python 3)

Hashes for TileSeqMut-0.5.7.tar.gz
Algorithm	Hash digest
SHA256	`d108b49bdca37ab3507d2c24c74115e4fea547b339a4a686b0ee154316f8d099`
MD5	`8d23d67ce36ae1b322529a03c46c16c2`
BLAKE2b-256	`a55a67b9b7a04ec2a6561bade4235135b5de8c0bb546f261dcaf8313f44cf3a4`

Hashes for TileSeqMut-0.5.7-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c3011334e6c87a662d876c0df02d0f6a990a3dfab1b6a598552f239ef9b27067`
MD5	`4a0d1c9aa0ca3bc4d8c081cf1f2be82e`
BLAKE2b-256	`acac5ec75f733a384c6198243c8bd14a60c534278f36c710b3473bd224ff626f`