Analysis for plasmid pool sequencing data
Project description
Dependencies
Build reference
- The pre-built reference files used for the analysis can be found in
- human grch37:
/home/rothlab/rli/02_dev/06_pps_pipeline/fasta/human_ensembl/grch37
- human grch38:
/home/rothlab/rli/02_dev/06_pps_pipeline/fasta/human_ensembl/grch38
- human 9.1:
/home/rothlab/rli/02_dev/06_pps_pipeline/fasta/human_91
- yeast (plate specific):
/home/rothlab/rli/02_dev/06_pps_pipeline/fasta/yeast_ref_all
- If you need to build new references, please make sure:
- The name of the reference matches the name of the sequencing file. For example, the corresponding reference for
scORFeome-HIP-05_L001.fastq.gz
is scORFeome-HIP-05
- ID for each sequence matches the ORF-id in the summary file
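The naming rule above can be sketched in Python. This is a minimal illustration, not code from the pipeline: the helper name `reference_name` is hypothetical, and it assumes the FASTQ name ends with a lane tag such as `_L001` followed by `.fastq` or `.fastq.gz`, as in the example file name.

```python
import re
from pathlib import Path

# Assumption: the part to strip starts at an Illumina-style lane tag ("_L001")
# and runs to the ".fastq" / ".fastq.gz" extension. Adjust if your files differ.
_FASTQ_SUFFIX = re.compile(r"_L\d{3}.*\.fastq(\.gz)?$")


def reference_name(fastq_path):
    """Return the reference name expected for a FASTQ file (hypothetical helper)."""
    filename = Path(fastq_path).name
    name, n_subs = _FASTQ_SUFFIX.subn("", filename)
    if n_subs == 0:
        raise ValueError(f"Unrecognised FASTQ file name: {filename}")
    return name


print(reference_name("scORFeome-HIP-05_L001.fastq.gz"))  # scORFeome-HIP-05
```

Running a check like this over the input folder before building a new reference may catch naming mismatches early.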
Make summary file
- The summary files for human and yeast are made before running the pipeline. The raw data can be found at
/home/rothlab/rli/02_dev/06_pps_pipeline/target_orfs/human_summary.csv
and /home/rothlab/rli/02_dev/06_pps_pipeline/target_orfs/yeast_summary.csv
- If you are making your own summary file, make sure it has a column named
orf_name
, which is the unique identifier for each ORF. These identifiers should also match the sequence names in the fasta file you make. You can modify analysisHuman or analysisYeast in main.py
to select the columns you want to keep
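A custom summary file can be checked against its FASTA reference before a run. The sketch below is not part of the pipeline: the function names are hypothetical, and it assumes the sequence ID is the first whitespace-delimited token on each FASTA header line.

```python
import csv


def read_orf_names(summary_csv):
    """Collect the values of the orf_name column, rejecting duplicates."""
    with open(summary_csv, newline="") as handle:
        reader = csv.DictReader(handle)
        if "orf_name" not in (reader.fieldnames or []):
            raise ValueError("summary file has no 'orf_name' column")
        names = [row["orf_name"] for row in reader]
    if len(names) != len(set(names)):
        raise ValueError("orf_name values are not unique")
    return set(names)


def read_fasta_ids(fasta_path):
    """Collect sequence IDs (assumed: first token after '>' on header lines)."""
    ids = set()
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                tokens = line[1:].split()
                if tokens:
                    ids.add(tokens[0])
    return ids


def unmatched_ids(summary_csv, fasta_path):
    """Return (ORFs missing from the FASTA, FASTA IDs missing from the summary)."""
    orfs = read_orf_names(summary_csv)
    seqs = read_fasta_ids(fasta_path)
    return orfs - seqs, seqs - orfs
```

Two empty sets from `unmatched_ids` would indicate that every ORF in the summary has a sequence with the same name and vice versa.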
Input FASTQ files
- FASTQ files:
- human (files from the same group are merged together):
/home/rothlab/rli/01_ngsdata/PPS_data/Human_pool/merged_pool9-1/
- yeast (files from the same plate are merged together):
/home/rothlab/rli/01_ngsdata/PPS_data/yeast_pps_fastq/yeast_pps_fastq/
Install and Run
- Install the package using: ``
usage: pps [-h] [--align] [-f FASTQ] [-n NAME] -o OUTPUT -r REF
           [--refName REFNAME] [--summaryFile SUMMARYFILE] [--orfseq ORFSEQ]

Plasmid pool sequencing analysis

required arguments:
  -f FASTQ, --fastq FASTQ      path to fastq files
  -o OUTPUT, --output OUTPUT   Output directory
  -r REF, --ref REF            Path to reference
  -m MODE, --mode MODE         human or yeast
  --summaryFile SUMMARYFILE    Yeast or Human summary file

optional arguments:
  -h, --help                   show this help message and exit
  --align                      provide this argument if users want to start with
                               alignment, otherwise the program assumes alignment
                               was done and will analyze the vcf files.
  -n NAME, --name NAME         Run name (default set to pps)
  --refName REFNAME            grch37, grch38, cds_seq. Required if mode == human
  -l LOG, --log LOG            logging mode, default set to info
- Example: Human (with alignment to grch37)
pps -f ~/01_ngsdata/PPS_data/Human_pool/merged_pool9-1/ -o ../../output/ -n Human91 --refName human91 --summaryFile ../../target_orfs/human_summary.csv -m human -r ../../fasta/human_91/ --align
- Example: Yeast
pps -f ~/01_ngsdata/PPS_data/yeast_pps_fastq/yeast_pps_fastq/ -o ../../output/ -n testpackYeast --summaryFile ../../target_orfs/yeast_summary.csv -m yeast -r ../../fasta/yeast_ref_all/
- The pipeline first submits alignment jobs to the cluster (Slurm). After all the jobs are done, it filters the vcf files and outputs the summary and mutations.
Output
- All the intermediate files will be saved into your output directory; a new folder will be made, named with the
-n
parameter
- For each fastq file, a folder will be made. It contains the following files:
- *.sh: job script used for alignment
- all_summary_plateORFs.csv: summary for this plate/group
- *.log: log file
- *_raw.vcf: raw vcf file generated from pileup
- *_variants.vcf: vcf file with variants only
- *_filtered.vcf: filtered vcf file
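To illustrate the step from a raw vcf to a variants-only vcf, here is a minimal sketch. It is not the pipeline's implementation: the function name is hypothetical, and it assumes the common convention that a record with no alternate allele has `.` in the ALT column. The pipeline's own criteria, including those behind `*_filtered.vcf`, are not described here and may differ.

```python
def variant_lines(vcf_lines):
    """Yield header lines plus records whose ALT column names an alternate allele.

    Assumption: non-variant sites carry "." in ALT (fifth tab-separated column).
    """
    for line in vcf_lines:
        if line.startswith("#"):
            # Keep meta-information and the column header unchanged.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        # VCF columns: CHROM POS ID REF ALT QUAL FILTER INFO ...
        if len(fields) >= 5 and fields[4] not in (".", ""):
            yield line
```

It works on any iterable of lines, so an open file handle on a `*_raw.vcf` file can be passed directly and the result written to a new file.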
- After the run is finished, the following files will be generated in the master output folder:
- alignment_log.csv: shows the alignment rate for each plate/group
- all_mutations.csv: contains all the variants that passed the filter
- all_summary.csv: contains all ORFs and whether they were found/fully covered in the sequencing
- genes_stats.csv: overall stats
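A quick way to confirm that a run finished is to look for the four files listed above. The helper below is a hypothetical sketch; it assumes `run_dir` is the master output folder for the run, and the exact location of that folder relative to the -o and -n arguments should be checked against your own output.

```python
from pathlib import Path

# File names taken from the list above.
MASTER_FILES = (
    "alignment_log.csv",
    "all_mutations.csv",
    "all_summary.csv",
    "genes_stats.csv",
)


def missing_master_files(run_dir):
    """Return the expected master output files that are absent from run_dir."""
    run_dir = Path(run_dir)
    return [name for name in MASTER_FILES if not (run_dir / name).is_file()]
```

An empty list means all four files are present; any names returned point to a step that did not complete.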