cash in on expressed barcode tags

Pycashier

Tool for extracting and processing DNA barcode tags from Illumina sequencing.

Default parameters are designed for use by the Brock Lab to process data generated from ClonMapper lineage tracing experiments, but pycashier is extensible to other similarly designed tools.

Bioconda Dependencies

cutadapt (sequence extraction)
starcode (sequence clustering)
fastp (merging/quality filtering)
pysam (sam file conversion to fastq)

Pip/conda-forge Dependencies

Installation

Conda

You may use conda/mamba to install and manage the dependencies for this package.

conda install -c conda-forge -c bioconda cutadapt fastp pysam starcode pycashier

You can also use the included env.yml to create your environment and install everything you need.

conda env create -f https://raw.githubusercontent.com/brocklab/pycashier/main/conda/env.yml
conda activate cashierenv

Docker

If you prefer to use Docker, you can use the command below.

docker run --rm -it -v $PWD:/data -u $(id -u):$(id -g) ghcr.io/brocklab/pycashier

Pip (Not Recommended)

You may install with pip, though it will be up to you to ensure that all the dependencies you would otherwise install from bioconda are installed correctly and on your path. Pycashier will check for them before running.

pip install pycashier

Usage

As of v0.3.0, the interface of pycashier has changed. Previously, a positional argument indicated the source directory and additional flags set the operation. Now pycashier uses click and a series of commands.

As always, use pycashier --help and pycashier <COMMAND> --help for the full list of parameters.

See below for a brief explanation of each command.

Extract

The primary use case of pycashier is extracting 20 bp sequences from Illumina-generated fastq files. This can be accomplished with the command below, where ./fastqs is a directory containing all of your fastq files.

pycashier extract -i ./fastqs

Pycashier will attempt to extract sample names from your .fastq file names using the first string delimited by a period.

For example:

sample1.fastq: sample1
sample2.metadata_pycashier.will.ignore.fastq: sample2
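
The naming rule above can be sketched in a few lines of Python. This is only an illustration of the "first string delimited by a period" rule, not pycashier's own code; the function name sample_name is hypothetical.

```python
from pathlib import Path


def sample_name(path: str) -> str:
    """Return the part of the file name that precedes the first period."""
    # Path(...).name drops any leading directories before splitting.
    return Path(path).name.split(".", 1)[0]


for fastq in ["sample1.fastq", "sample2.metadata_pycashier.will.ignore.fastq"]:
    print(f"{fastq}: {sample_name(fastq)}")
```

Checking your file names this way before a run can help catch two files that would collapse to the same sample name.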

As pycashier extract runs, two directories will be generated: ./pipeline and ./outs, configurable with -p/--pipeline and -o/--output respectively.

Your pipeline directory will contain all files and data generated while performing barcode extraction and clustering, while outs will contain a single .tsv for each sample with the final barcode counts.

Expected output of pycashier extract:

fastqs
└── sample.raw.fastq
pipeline
├── qc
│   ├── sample.html
│   └── sample.json
├── sample.q30.barcode.fastq
├── sample.q30.barcodes.r3d1.tsv
├── sample.q30.barcodes.tsv
└── sample.q30.fastq
outs
└── sample.q30.barcodes.r3d1.min176_off1.tsv

NOTE: If you wish to provide pycashier with fastq files containing only your barcode you can supply the --skip-trimming flag.

Merge

In some cases your data may be from paired-end sequencing. If you have two fastq files per sample that overlap on the barcode region, they can be combined with pycashier merge.

pycashier merge -i ./fastqgz

By default, your output will be in mergedfastqs, which you can then pass back to pycashier with pycashier extract -i mergedfastqs.

For single-read data, files are named <sample>.fastq. For paired-end data, the two files for each sample should contain R1 and R2 respectively, and may additionally be gzipped.

For example:

sample.raw.R1.fastq.gz,sample.raw.R2.fastq.gz: sample
sample.R1.fastq,sample.R2.fastq: sample
sample.fastq: fail, not R1 and R2
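
As a rough sketch of the pairing rule illustrated by these examples, the Python below groups files by sample name and rejects any sample without both reads. It assumes that R1 and R2 appear as period-delimited fields of the file name, as in the examples; pycashier's actual matching may differ, and pair_reads is a hypothetical helper.

```python
from collections import defaultdict
from pathlib import Path


def pair_reads(filenames):
    """Group files by sample name and return {sample: (r1_file, r2_file)}.

    Raises ValueError when a sample does not have both an R1 and an R2 file.
    """
    found = defaultdict(dict)
    for name in filenames:
        sample, *fields = Path(name).name.split(".")
        reads = found[sample]  # register the sample even if no R1/R2 field is present
        for read in ("R1", "R2"):
            if read in fields:
                reads[read] = name

    pairs = {}
    for sample, reads in found.items():
        if set(reads) != {"R1", "R2"}:
            raise ValueError(f"{sample}: expected one R1 and one R2 file")
        pairs[sample] = (reads["R1"], reads["R2"])
    return pairs


print(pair_reads(["sample.raw.R1.fastq.gz", "sample.raw.R2.fastq.gz"]))
```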

Scrna

If your DNA barcodes are expressed and detectable in 10X 3'-based transcriptomic sequencing, you can use pycashier to extract these tags and their associated UMI/cell barcodes from the cellranger output.

For pycashier scrna, reads are extracted from sam files. These files can be generated from the output of cellranger count. For each sample you would run:

samtools view -f 4 $CELLRANGER_COUNT_OUTPUT/sample1/outs/possorted_genome_bam.bam > sams/sample1.unmapped.sam

This will generate a sam file containing only the unmapped reads.
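
To make the -f 4 filter concrete: in the SAM format, bit 0x4 of the FLAG field (the second column) marks a read as unmapped, and samtools view -f 4 keeps only records with that bit set. The sketch below applies the same test to plain SAM text lines in Python. It is an illustration only; the example records are made up and the helper names are hypothetical.

```python
UNMAPPED = 0x4  # SAM FLAG bit meaning "segment unmapped"


def is_unmapped(flag: int) -> bool:
    """Return True when the unmapped bit is set in a SAM FLAG value."""
    return bool(flag & UNMAPPED)


def unmapped_lines(sam_lines):
    """Yield alignment lines whose FLAG (column 2) has the unmapped bit set."""
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        flag = int(line.split("\t")[1])
        if is_unmapped(flag):
            yield line


# Made-up records: read1 is unmapped (FLAG 4), read2 is mapped (FLAG 0).
records = [
    "read1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "read2\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
]
kept = list(unmapped_lines(records))
print(len(kept), "unmapped record(s)")
```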

Then, similar to normal barcode extraction, you can pass a directory of these unmapped sam files to pycashier and extract barcodes. You can still specify extraction parameters, which will be passed to cutadapt as usual.

Note: The default parameters passed to cutadapt are unlinked adapters and minimum barcode length of 10 bp.

pycashier scrna -i sams

When finished, the outs directory will have a .tsv containing the following columns: Illumina Read Info, UMI Barcode, Cell Barcode, gRNA Barcode.

Combine

Use this command to generate a single combined tsv from all per-sample files, including headers and sample information. By default it uses ./outs for input and ./combined.tsv for output.
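
The idea of a combined file can be sketched in Python as follows. This is not pycashier's implementation: it assumes each per-sample file is a headerless two-column TSV (barcode, count), and the header names and the combine function are hypothetical. Check your own output files before relying on that layout.

```python
import csv
import tempfile
from pathlib import Path


def combine(outs_dir, combined_path):
    """Write one TSV with a header row and a sample column from every .tsv in outs_dir."""
    with open(combined_path, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t", lineterminator="\n")
        writer.writerow(["sample", "barcode", "count"])  # assumed column names
        for tsv in sorted(Path(outs_dir).glob("*.tsv")):
            sample = tsv.name.split(".", 1)[0]  # text before the first period
            with open(tsv, newline="") as handle:
                for row in csv.reader(handle, delimiter="\t"):
                    if row:
                        writer.writerow([sample, *row])


# Small demonstration with made-up barcode counts.
with tempfile.TemporaryDirectory() as tmp:
    outs = Path(tmp, "outs")
    outs.mkdir()
    (outs / "sampleA.q30.barcodes.r3d1.min176_off1.tsv").write_text("ACGT\t10\n")
    (outs / "sampleB.q30.barcodes.r3d1.min176_off1.tsv").write_text("TTGA\t7\n")
    combined = Path(tmp, "combined.tsv")
    combine(outs, combined)
    combined_text = combined.read_text()

print(combined_text)
```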

Config File

As of v0.3.1, you may generate and supply pycashier with a toml config file using -c/--config. The expected structure is each command followed by key-value pairs of flags, with hyphens replaced by underscores:

[extract]
input = "fastqs"
threads = 10
unqualified_percent = 20

[merge]
input = "rawfastqgzs"
output = "mergedfastqs"
fastp_args = "-t 1"

The order of precedence for arguments is command line > config file > defaults.

For example, if you were to use the above pycashier.toml with pycashier extract -c pycashier.toml -t 15, the value used for threads would be 15. You can confirm the parameter values, as they will be printed prior to any execution.

For convenience, you can update/create your config file with pycashier COMMAND --save-config [explicit|full].
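
The precedence rule can be modelled with a layered lookup. The sketch below uses Python's collections.ChainMap purely as an illustration; it is not how pycashier resolves parameters internally, and the values in defaults are placeholders rather than pycashier's real defaults.

```python
from collections import ChainMap

# Placeholder defaults, for illustration only.
defaults = {"input": "./fastqs", "threads": 1, "unqualified_percent": 20}
# Values from the [extract] table of the example config file.
config = {"input": "fastqs", "threads": 10, "unqualified_percent": 20}
# Values given on the command line (-t 15).
cli = {"threads": 15}

# ChainMap searches its mappings left to right, so earlier layers win.
params = dict(ChainMap(cli, config, defaults))

print(params["threads"])  # command line value
print(params["input"])    # config file value
```

Removing threads from cli would make the lookup fall through to the config file value, and removing it from config as well would fall through to the default.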

"Explicit" will only save parameters already included in the config file or specified at runtime. "Full" will include all parameters, again, maintaining preset values in config or specified at runtime.

Non-Configurable Defaults

See below for the non-configurable flags provided to external tools in each command. Refer to their documentation regarding the purpose of these flags.

Extract

fastp: --dont_eval_duplication
cutadapt: --max-n=0 -n 2

Merge

fastp: -m -c -G -Q -L

Scrna

cutadapt: --max-n=0 -n 2

Usage notes

Pycashier will NOT overwrite intermediary files. If there is an issue in the process, please delete either the pipeline directory or the requisite intermediary files for the sample you wish to reprocess. Because existing files are left in place, you can add new fastqs to the source directory or a project folder without reprocessing all samples each time.
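
The general pattern behind this behaviour, running a step only when its output file is missing, can be sketched as below. This is a generic illustration and not pycashier's code; run_step and the file contents are hypothetical.

```python
import tempfile
from pathlib import Path


def run_step(output: Path, produce) -> bool:
    """Call produce(output) only when output is missing; return True if it ran."""
    if output.exists():
        return False  # keep the existing intermediary file untouched
    produce(output)
    return True


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "sample.q30.fastq"
    first = run_step(target, lambda p: p.write_text("first run\n"))
    second = run_step(target, lambda p: p.write_text("second run\n"))
    contents = target.read_text()

print(first, second, contents.strip())
```

The second call is skipped because the file already exists, which is why a stale or truncated intermediary file has to be deleted by hand before a sample is reprocessed.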

Reads from multiple lanes should first be concatenated: cat sample*R1*.fastq.gz > sample.R1.fastq.gz

Naming conventions:
- Sample names are extracted from files using the first string delimited with a period. Please take this into account when naming sam or fastq files.
- Each processing step will append information to the input file name to indicate changes, again delimited with periods.
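
For the multi-lane case mentioned in these notes, the cat command works because gzip files can be joined byte for byte into a valid multi-member gzip file. A Python equivalent might look like the sketch below; the lane file names and read contents are made up for the demonstration, and concat_lanes is a hypothetical helper.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def concat_lanes(lane_files, merged_path):
    """Append each lane file to merged_path byte for byte, in sorted order."""
    with open(merged_path, "wb") as out:
        for lane in sorted(lane_files):
            with open(lane, "rb") as handle:
                shutil.copyfileobj(handle, out)


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    # Two made-up lane files, one read each.
    with gzip.open(tmp / "sample_L001_R1.fastq.gz", "wt") as fh:
        fh.write("@r1\nACGT\n+\nIIII\n")
    with gzip.open(tmp / "sample_L002_R1.fastq.gz", "wt") as fh:
        fh.write("@r2\nTTTT\n+\nIIII\n")

    lanes = list(tmp.glob("sample_L00*_R1.fastq.gz"))
    merged = tmp / "sample.R1.fastq.gz"
    concat_lanes(lanes, merged)

    # The gzip module reads all members of the concatenated file in order.
    with gzip.open(merged, "rt") as fh:
        merged_text = fh.read()

print(merged_text)
```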

Acknowledgments

Cashier is a tool developed by Russell Durrett for the analysis and extraction of expressed barcode tags. This version, like its predecessor, wraps around several command line bioinformatic tools to pull out expressed barcode tags.

pycashier 23.1.2