cash in on expressed barcode tags
Pycashier
Tool for extracting and processing DNA barcode tags from Illumina sequencing.
Default parameters are designed for use by the Brock Lab to process data generated from ClonMapper lineage tracing experiments, but pycashier is extensible to other similarly designed tools.
Bioconda Dependencies
- cutadapt (sequence extraction)
- starcode (sequence clustering)
- fastp (merging/quality filtering)
- pysam (sam file conversion to fastq)
Pip/conda-forge Dependencies
Installation
It's recommended to use conda/mamba to install and manage the dependencies for this package.
conda install -c conda-forge -c bioconda cutadapt fastp pysam starcode pycashier
You can also use the included env.yml to create your environment and install everything you need.
conda env create -f https://raw.githubusercontent.com/brocklab/pycashier/main/env.yml
conda activate cashierenv
Alternatively, you may install with pip, though it will be up to you to ensure all the dependencies you would install from bioconda are on your path and installed correctly.
Pycashier will check for them before running.
pip install pycashier
Docker
If you prefer not to install pycashier locally you can also use docker.
docker run --rm -it -v $PWD:/data -u $(id -u):$(id -g) daylinmorgan/pycashier
Usage
As of v0.3.0 the interface of pycashier has changed. Previously, a positional argument indicated the source directory and additional flags set the operation.
Now pycashier uses click and a series of commands.
As always, use pycashier --help and pycashier <COMMAND> --help for the full list of parameters.
See below for a brief explanation of each command.
Extract
The primary use case of pycashier is extracting 20 bp sequences from Illumina-generated fastq files.
This can be accomplished with the below command, where ./fastqs is a directory containing all of your fastq files.
pycashier extract -i ./fastqs
Pycashier will attempt to extract sample names from your .fastq file names using the first string delimited by a period.
For example:
sample1.fastq: sample1
sample2.metadata_pycashier.will.ignore.fastq: sample2
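The naming rule above can be sketched in a few lines of Python. This is only an illustration of the rule as described, not pycashier's own code, and the helper name sample_name is hypothetical.

```python
from pathlib import Path


def sample_name(path):
    """Return the text before the first period in the file name (hypothetical helper)."""
    # Path(...).name drops any leading directories before splitting on periods.
    return Path(path).name.split(".")[0]


print(sample_name("fastqs/sample1.fastq"))
print(sample_name("sample2.metadata_pycashier.will.ignore.fastq"))
```

Anything after the first period is treated as metadata and ignored, so keep the sample identifier free of periods.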
As pycashier extract runs, two directories will be generated, ./pipeline and ./outs, configurable with -p/--pipeline and -o/--output respectively.
Your pipeline directory will contain all files and data generated while performing barcode extraction and clustering, while outs will contain a single .tsv for each sample with the final barcode counts.
Expected output of pycashier extract:
fastqs
└── sample.raw.fastq
pipeline
├── qc
│ ├── sample.html
│ └── sample.json
├── sample.q30.barcode.fastq
├── sample.q30.barcodes.r3d1.tsv
├── sample.q30.barcodes.tsv
└── sample.q30.fastq
outs
└── sample.q30.barcodes.r3d1.min176_off1.tsv
NOTE: If you wish to provide pycashier with fastq files containing only your barcode you can supply the --skip-trimming flag.
Merge
In some cases your data may be from paired-end sequencing. If you have two fastq files per sample that overlap on the barcode region they can be combined with pycashier merge.
pycashier merge -i ./fastqgz
By default your output will be in mergedfastqs, which you can then pass back to pycashier with pycashier extract -i mergedfastqs.
For single-read data, files are named <sample>.fastq. For paired-end data, the two file names should contain R1 and R2 respectively, and may additionally be gzipped.
For example:
sample.raw.R1.fastq.gz, sample.raw.R2.fastq.gz: sample
sample.R1.fastq, sample.R2.fastq: sample
sample.fastq: fail, not R1 and R2
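The pairing rule in these examples can be sketched as below. This is a minimal illustration and not pycashier's implementation. It assumes R1 and R2 appear as period-delimited parts of the file name, as they do in the examples, and the function name pair_reads is hypothetical.

```python
from collections import defaultdict
from pathlib import Path


def pair_reads(filenames):
    """Group files by sample name and require exactly one R1 and one R2 per sample."""
    groups = defaultdict(dict)
    for name in filenames:
        # The sample name is the text before the first period (assumption from the examples).
        sample, *rest = Path(name).name.split(".")
        entry = groups[sample]  # records the sample even if no R1/R2 tag is found
        for read in ("R1", "R2"):
            if read in rest:
                entry[read] = name
    pairs = {}
    for sample, reads in groups.items():
        if set(reads) != {"R1", "R2"}:
            raise ValueError(f"{sample}: expected one R1 and one R2 file")
        pairs[sample] = (reads["R1"], reads["R2"])
    return pairs


print(pair_reads(["sample.raw.R1.fastq.gz", "sample.raw.R2.fastq.gz"]))
```

A lone sample.fastq raises a ValueError here, which mirrors the "fail, not R1 and R2" case.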
Scrna
If your DNA barcodes are expressed and detectable in 10X 3'-based transcriptomic sequencing, then you can extract these tags and their associated UMI/cell barcodes from the cellranger output with pycashier.
For pycashier scrna, reads are extracted from sam files.
These files can be generated from the output of cellranger count.
For each sample you would run:
samtools view -f 4 $CELLRANGER_COUNT_OUTPUT/sample1/outs/possorted_genome_bam.bam > sams/sample1.unmapped.sam
This will generate a sam file containing only the unmapped reads.
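To see what -f 4 selects: samtools keeps only records whose FLAG field (the second column of a sam line) has the 0x4 "segment unmapped" bit set. The sketch below applies the same test to individual alignment lines. The example records are made up for illustration, and header lines (starting with @) are not handled.

```python
UNMAPPED = 0x4  # SAM FLAG bit meaning "segment unmapped"


def is_unmapped(sam_line):
    """Return True if an alignment line has the unmapped bit set in its FLAG field."""
    flag = int(sam_line.split("\t")[1])  # FLAG is the second tab-separated column
    return bool(flag & UNMAPPED)


# Hypothetical records: one unmapped read and one mapped read.
unmapped = "read1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tFFFF"
mapped = "read2\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tFFFF"
print(is_unmapped(unmapped), is_unmapped(mapped))
```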
Then similar to normal barcode extraction you can pass a directory of these unmapped sam files to pycashier and extract barcodes. You can also still specify extraction parameters that will be passed to cutadapt as usual.
Note: The default parameters passed to cutadapt are unlinked adapters and minimum barcode length of 10 bp.
pycashier scrna -i sams
When finished, the outs directory will have a .tsv containing the following columns: Illumina Read Info, UMI Barcode, Cell Barcode, gRNA Barcode
Combine
This command can be used if you wish to generate a combined tsv from all files including headers and sample information.
By default it uses ./outs for input and ./combined.tsv for output.
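Conceptually, combining amounts to stacking the per-sample tables and labeling each row with its sample. The sketch below shows that idea only. It is not the code behind pycashier combine, and it assumes each input .tsv is a headerless two-column table (barcode, count). The header names and the function name combine_tsvs are hypothetical.

```python
import csv
from pathlib import Path


def combine_tsvs(input_dir, output_file):
    """Copy every row of each .tsv in input_dir to output_file, prefixed with its sample name."""
    with open(output_file, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t")
        writer.writerow(["sample", "barcode", "count"])  # hypothetical header
        for tsv in sorted(Path(input_dir).glob("*.tsv")):
            sample = tsv.name.split(".")[0]  # text before the first period
            with open(tsv, newline="") as handle:
                for row in csv.reader(handle, delimiter="\t"):
                    writer.writerow([sample, *row])


# Example call with the default locations described above:
# combine_tsvs("./outs", "./combined.tsv")
```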
Config File
As of v0.3.1 you may generate and supply pycashier with a toml config file using -c/--config.
The expected structure is each command followed by key-value pairs of flags, with hyphens replaced by underscores:
[extract]
input = "fastqs"
threads = 10
unqualified_percent = 20
[merge]
input = "rawfastqgzs"
output = "mergedfastqs"
fastp_args = "-t 1"
The order of precedence for arguments is command line > config file > defaults.
For example, if you were to use the above pycashier.toml with pycashier extract -c pycashier.toml -t 15, the value used for threads would be 15.
You can confirm the parameter values as they will be printed prior to any execution.
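The precedence rule can be pictured as layered dictionary updates. The sketch below is illustrative only and not pycashier's internals. The values in defaults are hypothetical placeholders, and config mirrors the [extract] table from the example above.

```python
def resolve_params(defaults, config, cli):
    """Layer the sources so that later ones win: defaults < config file < command line."""
    resolved = dict(defaults)
    resolved.update(config)
    # Only options actually given on the command line override the config file.
    resolved.update({key: value for key, value in cli.items() if value is not None})
    return resolved


defaults = {"input": "./fastqs", "threads": 1, "unqualified_percent": 20}  # hypothetical defaults
config = {"input": "fastqs", "threads": 10, "unqualified_percent": 20}  # [extract] table
cli = {"input": None, "threads": 15}  # from: pycashier extract -c pycashier.toml -t 15

params = resolve_params(defaults, config, cli)
print(params["threads"])  # 15
```

Here input keeps the config value because it was not given on the command line, while threads takes the command line value.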
For convenience, you can update/create your config file with pycashier COMMAND --save-config [explicit|full].
"Explicit" will only save parameters already included in the config file or specified at runtime. "Full" will include all parameters, again, maintaining preset values in config or specified at runtime.
Non-Configurable Defaults
See below for the non-configurable flags provided to external tools in each command. Refer to their documentation regarding the purpose of these flags.
Extract
fastp: --dont_eval_duplication
cutadapt: --max-n=0 -n 2
Merge
fastp: -m -c -G -Q -L
Scrna
cutadapt: --max-n=0 -n 2
Usage notes
- Pycashier will NOT overwrite intermediary files. If there is an issue in the process, please delete either the pipeline directory or the requisite intermediary files for the sample you wish to reprocess. Because existing files are kept, you can place new fastqs within the source directory or a project folder without reprocessing all samples each time.
- If there are reads from multiple lanes they should first be concatenated with
cat sample*R1*.fastq.gz > sample.R1.fastq.gz
- Naming conventions:
  - Sample names are extracted from files using the first string delimited with a period. Please take this into account when naming sam or fastq files.
  - Each processing step will append information to the input file name to indicate changes, again delimited with periods.
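The lane concatenation step from the notes above works because gzip files can be joined byte for byte and still decompress as one stream. A rough Python equivalent of the cat command is sketched below. The function name and the lane file names in the usage comment are hypothetical.

```python
import gzip
import shutil
from pathlib import Path


def concatenate_lanes(lane_files, merged_file):
    """Append each gzipped lane file to merged_file, byte for byte, in sorted order."""
    merged_file = Path(merged_file)
    # Resolve the inputs before creating the output, and never read the output itself.
    lanes = sorted(Path(lane) for lane in lane_files)
    lanes = [lane for lane in lanes if lane != merged_file]
    with open(merged_file, "wb") as out:
        for lane in lanes:
            with open(lane, "rb") as handle:
                shutil.copyfileobj(handle, out)


# Example call with hypothetical lane files:
# concatenate_lanes(["sample_L001.R1.fastq.gz", "sample_L002.R1.fastq.gz"], "sample.R1.fastq.gz")
# gzip.open("sample.R1.fastq.gz", "rt") would then read the reads from both lanes.
```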
Acknowledgments
Cashier is a tool developed by Russell Durrett for the analysis and extraction of expressed barcode tags. This version, like its predecessor, wraps around several command line bioinformatic tools to pull out expressed barcode tags.
Hashes for pycashier-22.6.2-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 870f6ebb6ff61a3218a60ccc4f57bb76232496eaa31ea4f9d68352fe8db0ef97
MD5 | 720a2ac9fa7e1b461107470f1ef3467f
BLAKE2b-256 | b44758e1293629e262d1df04d1b65e4289c0d6b39e29463f0283370c5189a56e