
cash in on expressed barcode tags


Pycashier

Tool for extracting and processing DNA barcode tags from Illumina sequencing.

Default parameters are designed for use by the Brock Lab to process data generated from ClonMapper lineage tracing experiments, but pycashier is extensible to other similarly designed tools.

Bioconda Dependencies

  • cutadapt (sequence extraction)
  • starcode (sequence clustering)
  • fastp (merging/quality filtering)
  • pysam (sam file conversion to fastq)

Pip/conda-forge Dependencies

Installation

It's recommended to use conda/mamba to install and manage the dependencies for this package.

conda install -c conda-forge -c bioconda cutadapt fastp pysam starcode pycashier

You can also use the included env.yml to create your environment and install everything you need.

conda env create -f https://raw.githubusercontent.com/brocklab/pycashier/main/env.yml
conda activate cashierenv

You may also install with pip, though it will then be up to you to ensure that all the dependencies you would otherwise install from bioconda are installed correctly and on your PATH. Pycashier will check for them before running.

pip install pycashier

Docker

If you prefer not to install pycashier locally you can also use docker.

docker run --rm -it -v $PWD:/data -u $(id -u):$(id -g) daylinmorgan/pycashier

Usage

As of v0.3.0 the interface of pycashier has changed. Previously a positional argument was used to indicate the source directory and additional flags would set the operation. Now pycashier uses click and a series of commands.

As always, use pycashier --help and pycashier <COMMAND> --help for the full list of parameters.

See below for a brief explanation of each command.

Extract

The primary use case of pycashier is extracting 20bp sequences from Illumina-generated fastq files. This can be accomplished with the command below, where ./fastqs is a directory containing all of your fastq files.

pycashier extract -i ./fastqs

Pycashier will attempt to extract sample names from your .fastq file names using the first string delimited by a period.

For example:

  • sample1.fastq: sample1
  • sample2.metadata_pycashier.will.ignore.fastq: sample2
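The naming rule above can be sketched in a few lines of Python. This is only an illustration of "first string delimited by a period"; the helper name sample_name is hypothetical and is not part of pycashier's API.

```python
from pathlib import Path


def sample_name(fastq_path: str) -> str:
    """Return the text before the first period in the file name."""
    # Path(...).name drops any leading directories before splitting.
    return Path(fastq_path).name.split(".")[0]


for name in ["sample1.fastq", "sample2.metadata_pycashier.will.ignore.fastq"]:
    print(name, "->", sample_name(name))
```

Anything after the first period is ignored, so extra metadata in a file name does not change the sample name.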

As pycashier extract runs, two directories will be generated: ./pipeline and ./outs, configurable with -p/--pipeline and -o/--output respectively.

Your pipeline directory will contain all files and data generated while performing barcode extraction and clustering, while outs will contain a single .tsv for each sample with the final barcode counts.

Expected output of pycashier extract:

fastqs
└── sample.raw.fastq
pipeline
├── qc
│   ├── sample.html
│   └── sample.json
├── sample.q30.barcode.fastq
├── sample.q30.barcodes.r3d1.tsv
├── sample.q30.barcodes.tsv
└── sample.q30.fastq
outs
└── sample.q30.barcodes.r3d1.min176_off1.tsv

NOTE: If you wish to provide pycashier with fastq files containing only your barcode you can supply the --skip-trimming flag.

Merge

In some cases your data may be from paired-end sequencing. If you have two fastq files per sample that overlap on the barcode region, they can be combined with pycashier merge.

pycashier merge -i ./fastqgz

By default your output will be in mergedfastqs, which you can then pass back to pycashier with pycashier extract -i mergedfastqs.

For single-end reads, files are named <sample>.fastq. For paired-end reads, the two files for a sample should contain R1 and R2 in their names, and they may additionally be gzipped.

For example:

  • sample.raw.R1.fastq.gz,sample.raw.R2.fastq.gz: sample
  • sample.R1.fastq,sample.R2.fastq: sample
  • sample.fastq: fail, not R1 and R2
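As a rough sketch of the pairing rule in these examples, the Python below groups file names by sample and reports which samples have both reads. It assumes R1 and R2 appear as period-delimited fields in the file name. How pycashier itself detects them is not stated here, and pair_reads is a hypothetical helper.

```python
from collections import defaultdict
from pathlib import Path


def pair_reads(filenames):
    """Group file names by sample; split samples into paired and unpaired."""
    reads = defaultdict(dict)
    for filename in filenames:
        parts = Path(filename).name.split(".")
        found = reads[parts[0]]  # the sample name is the first field
        for tag in ("R1", "R2"):
            if tag in parts[1:]:  # assumption: R1/R2 are period-delimited
                found[tag] = filename
    paired = {s: (r["R1"], r["R2"]) for s, r in reads.items() if len(r) == 2}
    unpaired = sorted(s for s, r in reads.items() if len(r) < 2)
    return paired, unpaired


paired, unpaired = pair_reads(
    ["a.raw.R1.fastq.gz", "a.raw.R2.fastq.gz", "b.R1.fastq", "b.R2.fastq", "c.fastq"]
)
print(paired)
print(unpaired)
```

A sample that appears in unpaired corresponds to the "fail, not R1 and R2" case above.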

Scrna

If your DNA barcodes are expressed and detectable in 10X 3'-based transcriptomic sequencing, then you can use pycashier to extract these tags and their associated UMI/cell barcodes from the cellranger output.

For pycashier scrna, reads are extracted from sam files. These files can be generated from the output of cellranger count. For each sample you would run:

samtools view -f 4 $CELLRANGER_COUNT_OUTPUT/sample1/outs/possorted_genome_bam.bam > sams/sample1.unmapped.sam

This will generate a sam file containing only the unmapped reads.
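For readers unfamiliar with the -f 4 filter, the sketch below mimics it in plain Python: in the SAM format the second tab-separated field is the bitwise FLAG, and bit 0x4 marks a read as unmapped. The records are toy data made up for illustration, and the helper names are hypothetical. This is not how pycashier or samtools is implemented.

```python
UNMAPPED = 0x4  # SAM FLAG bit meaning "segment unmapped"


def is_unmapped(sam_line: str) -> bool:
    """Check the FLAG field (column 2) of one SAM alignment line."""
    flag = int(sam_line.split("\t")[1])
    return bool(flag & UNMAPPED)


def unmapped_records(lines):
    """Yield alignment lines whose FLAG has the unmapped bit set."""
    for line in lines:
        if line.startswith("@"):  # skip header lines
            continue
        if is_unmapped(line):
            yield line


toy_sam = [
    "@HD\tVN:1.6",
    "read1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tFFFF",
    "read2\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tFFFF",
]
for record in unmapped_records(toy_sam):
    print(record.split("\t")[0])
```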

Then, similar to normal barcode extraction, you can pass a directory of these unmapped sam files to pycashier and extract barcodes. You can still specify extraction parameters, which will be passed to cutadapt as usual.

Note: The default parameters passed to cutadapt are unlinked adapters and minimum barcode length of 10 bp.

pycashier scrna -i sams

When finished the outs directory will have a .tsv containing the following columns: Illumina Read Info, UMI Barcode, Cell Barcode, gRNA Barcode

Combine

This command can be used if you wish to generate a combined tsv from all files including headers and sample information. By default it uses ./outs for input and ./combined.tsv for output.
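To make the idea concrete, here is a minimal sketch of what combining per-sample tables might look like. It assumes each file in the outs directory holds two tab-separated columns (barcode, count) with no header row. The header names and the combine_tsvs helper are hypothetical, and the real output of pycashier combine may differ.

```python
import csv
from pathlib import Path


def combine_tsvs(outs_dir, combined_path):
    """Concatenate per-sample tables, adding a header and a sample column."""
    with open(combined_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["sample", "barcode", "count"])  # hypothetical names
        for tsv in sorted(Path(outs_dir).glob("*.tsv")):
            sample = tsv.name.split(".")[0]  # first period-delimited field
            with open(tsv, newline="") as rows:
                for barcode, count in csv.reader(rows, delimiter="\t"):
                    writer.writerow([sample, barcode, count])
```

Calling combine_tsvs("./outs", "./combined.tsv") would mirror the default input and output locations described above.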

Config File

As of v0.3.1 you may generate and supply pycashier with a toml config file using -c/--config. The expected structure is each command followed by key-value pairs of flags, with hyphens replaced by underscores:

[extract]
input = "fastqs"
threads = 10
unqualified_percent = 20

[merge]
input = "rawfastqgzs"
output = "mergedfastqs"
fastp_args = "-t 1"

The order of precedence for arguments is command line > config file > defaults.

For example, if you were to use the above pycashier.toml with pycashier extract -c pycashier.toml -t 15, the value used for threads would be 15. You can confirm the parameter values, as they are printed prior to any execution.
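The precedence rule can be illustrated with a small Python sketch. This is not pycashier's implementation. The defaults dictionary holds placeholder values chosen for illustration, while the config dictionary mirrors the [extract] table above.

```python
from collections import ChainMap

# Placeholder defaults for illustration only (not pycashier's real defaults).
defaults = {"input": "./fastqs", "threads": 1}
# Values from the [extract] table of the example config file.
config = {"input": "fastqs", "threads": 10, "unqualified_percent": 20}
# Values given on the command line: -t 15
cli = {"threads": 15}

# ChainMap looks keys up left to right: command line > config file > defaults.
params = dict(ChainMap(cli, config, defaults))

print(params["threads"])  # 15
print(params["input"])  # fastqs
```

Keys absent from the command line fall through to the config file, and keys absent from both fall through to the defaults.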

For convenience, you can update/create your config file with pycashier COMMAND --save-config [explicit|full].

"Explicit" will only save parameters already included in the config file or specified at runtime. "Full" will include all parameters, again, maintaining preset values in config or specified at runtime.

Non-Configurable Defaults

See below for the non-configurable flags provided to external tools in each command. Refer to their documentation regarding the purpose of these flags.

Extract

  • fastp: --dont_eval_duplication
  • cutadapt: --max-n=0 -n 2

Merge

  • fastp: -m -c -G -Q -L

Scrna

  • cutadapt: --max-n=0 -n 2

Usage notes

Pycashier will NOT overwrite intermediary files. If there is an issue in the process, please delete either the pipeline directory or the requisite intermediary files for the sample you wish to reprocess. This will allow the user to place new fastqs within the source directory or a project folder without reprocessing all samples each time.

  • If there are reads from multiple lanes they should first be concatenated with cat sample*R1*.fastq.gz > sample.R1.fastq.gz
  • Naming conventions:
    • Sample names are extracted from files using the first string delimited with a period. Please take this into account when naming sam or fastq files.
    • Each processing step will append information to the input file name to indicate changes, again delimited with periods.
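The no-overwrite behavior described in these notes follows a common pattern: run a step only when its output file is missing. The sketch below illustrates that pattern in general terms; run_step and make_output are hypothetical names, not functions from pycashier.

```python
from pathlib import Path
from typing import Callable


def run_step(output: Path, make_output: Callable[[Path], None]) -> bool:
    """Run make_output only when output is missing; return True if it ran."""
    if output.exists():
        return False  # keep the existing intermediate file untouched
    make_output(output)
    return True
```

Under this pattern, deleting a sample's intermediate file is what triggers reprocessing of that step on the next run.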

Acknowledgments

Cashier is a tool developed by Russell Durrett for the analysis and extraction of expressed barcode tags. This version, like its predecessor, wraps around several command line bioinformatic tools to pull out expressed barcode tags.
