cash in on expressed barcode tags

Pycashier

Tool for extracting and processing DNA barcode tags from Illumina sequencing.

Default parameters are designed for use by the Brock Lab to process data generated from ClonMapper lineage tracing experiments, but the tool is extensible to other similarly designed tools.

Bioconda Dependencies

  • cutadapt (sequence extraction)
  • starcode (sequence clustering)
  • fastp (merging/quality filtering)
  • pysam (sam file conversion to fastq)

Pip/conda-forge Dependencies

Installation

Conda

You may use conda/mamba to install and manage the dependencies for this package.

conda install -c conda-forge -c bioconda cutadapt fastp pysam starcode pycashier

You can also use the included env.yml to create your environment and install everything you need.

conda env create -f https://raw.githubusercontent.com/brocklab/pycashier/main/conda/env.yml
conda activate cashierenv

Docker

If you prefer to use docker, you can use the command below.

docker run --rm -it -v $PWD:/data -u $(id -u):$(id -g) ghcr.io/brocklab/pycashier

Pip (Not Recommended)

You may install with pip, though it will be up to you to ensure all the dependencies you would otherwise install from bioconda are installed correctly and on your path. Pycashier will check for them before running.

pip install pycashier

Usage

As of v0.3.0 the interface of pycashier has changed. Previously a positional argument was used to indicate the source directory and additional flags would set the operation. Now pycashier uses click and a series of commands.

As always, use pycashier --help and additionally pycashier <COMMAND> --help for the full list of parameters.

See below for a brief explanation of each command.

Extract

The primary use case of pycashier is extracting 20bp sequences from Illumina-generated fastq files. This can be accomplished with the command below, where ./fastqs is a directory containing all of your fastq files.

pycashier extract -i ./fastqs

Pycashier will attempt to extract sample names from your .fastq file names using the first string delimited by a period.

For example:

  • sample1.fastq: sample1
  • sample2.metadata_pycashier.will.ignore.fastq: sample2
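The naming rule above can be sketched in a few lines of Python. This only illustrates the stated rule and is not pycashier's actual code; `sample_name` is a hypothetical helper introduced here.

```python
from pathlib import Path


def sample_name(path: str) -> str:
    """Return the text before the first period in the file name (hypothetical helper)."""
    # Path(...).name drops leading directories, so only the file name is split.
    return Path(path).name.split(".")[0]


for fastq in ["sample1.fastq", "sample2.metadata_pycashier.will.ignore.fastq"]:
    print(fastq, "->", sample_name(fastq))
```

Anything after the first period is ignored for naming purposes, so two files that share the same leading token would resolve to the same sample name.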

As pycashier extract runs, two directories will be generated: ./pipeline and ./outs, configurable with -p/--pipeline and -o/--output respectively.

Your pipeline directory will contain all files and data generated while performing barcode extraction and clustering, while outs will contain a single .tsv for each sample with the final barcode counts.

Expected output of pycashier extract:

fastqs
└── sample.raw.fastq
pipeline
├── qc
│   ├── sample.html
│   └── sample.json
├── sample.q30.barcode.fastq
├── sample.q30.barcodes.r3d1.tsv
├── sample.q30.barcodes.tsv
└── sample.q30.fastq
outs
└── sample.q30.barcodes.r3d1.min176_off1.tsv

NOTE: If you wish to provide pycashier with fastq files containing only your barcode you can supply the --skip-trimming flag.

Merge

In some cases your data may be from paired-end sequencing. If you have two fastq files per sample that overlap on the barcode region, they can be combined with pycashier merge.

pycashier merge -i ./fastqgz

By default your output will be in mergedfastqs, which you can then pass back to pycashier with pycashier extract -i mergedfastqs.

For single-read data, files are named <sample>.fastq. For paired-end data, each sample needs two files whose names contain R1 and R2 respectively; the files may additionally be gzipped.

For example:

  • sample.raw.R1.fastq.gz,sample.raw.R2.fastq.gz: sample
  • sample.R1.fastq,sample.R2.fastq: sample
  • sample.fastq: fail, not R1 and R2
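The pairing rule in these examples can be illustrated with a short Python sketch. It is not pycashier's implementation: `pair_reads` is a hypothetical helper, and it assumes that "contains R1/R2" means the marker appears anywhere in the file name after the sample name.

```python
from pathlib import Path


def pair_reads(filenames):
    """Group file names by sample; return (paired, failed) (hypothetical helper)."""
    groups = {}
    for name in filenames:
        # The sample name is the text before the first period.
        sample, _, rest = Path(name).name.partition(".")
        reads = groups.setdefault(sample, {})
        for read in ("R1", "R2"):
            # Assumption: the read marker may appear anywhere after the sample name.
            if read in rest:
                reads[read] = name
    # A sample is usable only when both R1 and R2 were found.
    paired = {s: (r["R1"], r["R2"]) for s, r in groups.items() if len(r) == 2}
    failed = sorted(s for s in groups if s not in paired)
    return paired, failed


paired, failed = pair_reads(
    ["sample.raw.R1.fastq.gz", "sample.raw.R2.fastq.gz", "lonely.fastq"]
)
print(paired)
print(failed)
```

Checking file names this way before running the merge can catch samples that are missing a mate file.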

Scrna

If your DNA barcodes are expressed and detectable in 10X 3'-based transcriptomic sequencing, then you can extract these tags with pycashier and their associated umi/cell barcodes from the cellranger output.

For pycashier scrna we extract our reads from sam files. These files can be generated from the output of cellranger count. For each sample you would run:

samtools view -f 4 $CELLRANGER_COUNT_OUTPUT/sample1/outs/possorted_genome_bam.bam > sams/sample1.unmapped.sam

This will generate a sam file containing only the unmapped reads.

Then similar to normal barcode extraction you can pass a directory of these unmapped sam files to pycashier and extract barcodes. You can also still specify extraction parameters that will be passed to cutadapt as usual.

Note: The default parameters passed to cutadapt are unlinked adapters and a minimum barcode length of 10 bp.

pycashier scrna -i sams

When finished the outs directory will have a .tsv containing the following columns: Illumina Read Info, UMI Barcode, Cell Barcode, gRNA Barcode

Combine

This command can be used if you wish to generate a combined tsv from all files including headers and sample information. By default it uses ./outs for input and ./combined.tsv for output.
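To make the idea of a combined table concrete, here is a rough Python sketch of the same kind of operation. It is not how pycashier combine is implemented. It assumes each per-sample .tsv holds two tab-separated columns (barcode, count) with no header row, and the output column names are hypothetical.

```python
import csv
from pathlib import Path


def combine_tsvs(in_dir, out_file):
    """Stack per-sample TSVs into one table with a leading sample column (sketch)."""
    rows = []
    for tsv in sorted(Path(in_dir).glob("*.tsv")):
        # Sample name: text before the first period of the file name.
        sample = tsv.name.split(".")[0]
        with tsv.open(newline="") as fh:
            # Assumption: exactly two columns per row, barcode then count.
            for barcode, count in csv.reader(fh, delimiter="\t"):
                rows.append((sample, barcode, int(count)))
    with open(out_file, "w", newline="") as fh:
        writer = csv.writer(fh, delimiter="\t", lineterminator="\n")
        writer.writerow(["sample", "barcode", "count"])  # hypothetical header names
        writer.writerows(rows)
    return rows
```

Calling `combine_tsvs("./outs", "./combined.tsv")` would mirror the default input and output paths described above.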

Config File

As of v0.3.1 you may generate and supply pycashier with a toml config file using -c/--config. The expected structure is each command followed by key value pairs of flags with hyphens replaced by underscores:

[extract]
input = "fastqs"
threads = 10
unqualified_percent = 20

[merge]
input = "rawfastqgzs"
output = "mergedfastqs"
fastp_args = "-t 1"

The order of precedence for arguments is command line > config file > defaults.

For example, if you were to use the above pycashier.toml with pycashier extract -c pycashier.toml -t 15, the value used for threads would be 15. You can confirm the parameter values, as they are printed prior to any execution.
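The precedence rule can be pictured as layered dictionary updates. The following Python sketch is illustrative only and does not reflect pycashier's internals; `resolve` is a hypothetical helper and the values in `defaults` are made up for the example.

```python
def resolve(defaults, config, cli):
    """Merge settings so that command line > config file > defaults (sketch)."""
    merged = dict(defaults)
    merged.update(config)
    # Only flags actually given on the command line (not None) override the config.
    merged.update({key: value for key, value in cli.items() if value is not None})
    return merged


defaults = {"input": "./fastqs", "threads": 1}  # made-up defaults for illustration
config = {"input": "fastqs", "threads": 10, "unqualified_percent": 20}  # [extract] table
cli = {"input": None, "threads": 15}  # as if only -t 15 was passed

params = resolve(defaults, config, cli)
print(params)
```

Here `threads` ends up as 15 from the command line, while `input` and `unqualified_percent` keep their config file values.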

For convenience, you can update/create your config file with pycashier COMMAND --save-config [explicit|full].

"Explicit" will only save parameters already included in the config file or specified at runtime. "Full" will include all parameters, again, maintaining preset values in config or specified at runtime.

Non-Configurable Defaults

See below for the non-configurable flags provided to external tools in each command. Refer to their documentation regarding the purpose of these flags.

Extract

  • fastp: --dont_eval_duplication
  • cutadapt: --max-n=0 -n 2

Merge

  • fastp: -m -c -G -Q -L

Scrna

  • cutadapt: --max-n=0 -n 2

Usage notes

Pycashier will NOT overwrite intermediary files. If there is an issue in the process, please delete either the pipeline directory or the requisite intermediary files for the sample you wish to reprocess. This will allow the user to place new fastqs within the source directory or a project folder without reprocessing all samples each time.

  • If there are reads from multiple lanes they should first be concatenated with cat sample*R1*.fastq.gz > sample.R1.fastq.gz
  • Naming conventions:
    • Sample names are extracted from files using the first string delimited with a period. Please take this into account when naming sam or fastq files.
    • Each processing step will append information to the input file name to indicate changes, again delimited with periods.
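For those who prefer to script the lane concatenation step, the `cat` command above can be mirrored in Python. This is a minimal sketch, not part of pycashier; `concat_lanes` is a hypothetical helper. It relies on the fact that gzip files appended byte-for-byte form a valid multi-member gzip file.

```python
import gzip
import shutil
from pathlib import Path


def concat_lanes(lane_files, out_file):
    """Append gzipped lane files byte-for-byte, like `cat a.gz b.gz > out.gz`."""
    with open(out_file, "wb") as out:
        # Sort so that lanes are appended in a reproducible order.
        for lane in sorted(lane_files):
            with open(lane, "rb") as src:
                shutil.copyfileobj(src, out)


def read_gz_text(path):
    """Read a (possibly multi-member) gzip file back as text, for checking."""
    with gzip.open(Path(path), "rt") as fh:
        return fh.read()
```

Make sure the output file is not also among the inputs, and name it so the first period-delimited token is the desired sample name, for example sample.R1.fastq.gz.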

Acknowledgments

Cashier is a tool developed by Russell Durrett for the analysis and extraction of expressed barcode tags. This version, like its predecessor, wraps around several command line bioinformatic tools to pull out expressed barcode tags.
