UMI de-duplication using mclUMI
tags: UMI deduplication, PCR deduplication, scRNA-seq, bulk-RNA-seq
Overview
This repository hosts the mclUMI toolkit, which uses Markov clustering (MCL), a network-based algorithm, to precisely identify unique UMIs and thereby remove PCR duplicates. mclUMI constructs sub-graphs in which the UMI nodes are relatively strongly connected.
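To make the idea of Markov clustering concrete, here is a minimal pure-Python sketch of the generic MCL loop (expansion, inflation, renormalisation) on a small adjacency matrix. It is not mclUMI's internal implementation. The function names are illustrative, and the default values for inflation, expansion and iterations are only chosen to mirror the CLI defaults listed under Usage.

```python
def normalize_columns(m):
    """Scale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    col_sums = [sum(m[r][c] for r in range(n)) for c in range(n)]
    return [[m[r][c] / col_sums[c] if col_sums[c] else 0.0 for c in range(n)]
            for r in range(n)]


def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mcl(adjacency, inflation=2.0, expansion=2, iterations=100, tol=1e-9):
    """Generic Markov clustering; returns clusters as sorted lists of node indices."""
    n = len(adjacency)
    # Add self-loops, then turn the adjacency matrix into a transition matrix.
    m = [[float(adjacency[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        previous = m
        # Expansion: raise the matrix to the power `expansion` (random-walk spreading).
        expanded = m
        for _step in range(expansion - 1):
            expanded = matmul(expanded, m)
        # Inflation: element-wise power, then renormalise (strengthens strong flows).
        inflated = [[value ** inflation for value in row] for row in expanded]
        m = normalize_columns(inflated)
        # Stop early once the matrix no longer changes.
        if all(abs(m[i][j] - previous[i][j]) < tol
               for i in range(n) for j in range(n)):
            break
    # Each row that still carries flow defines a cluster: the columns it reaches.
    clusters = []
    for row in m:
        members = frozenset(j for j, value in enumerate(row) if value > 1e-6)
        if members and members not in clusters:
            clusters.append(members)
    return [sorted(cluster) for cluster in clusters]


# Hypothetical UMI graph: nodes 0-2 are mutually connected, nodes 3-4 form a pair.
example_graph = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
print(mcl(example_graph))
```

In a deduplication setting, each node would stand for one observed UMI and each resulting cluster would be collapsed to a single original molecule.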
Documentation
The API documentation of mclUMI is available at https://mclumi.herokuapp.com and https://mclumi.readthedocs.io/en/latest.
System requirements
Linux or Mac
Installation
We tested the software installation on a Linux system, which has the following configuration:
- Distributor ID: Ubuntu
- Description: Ubuntu 20.04.3
- Release: 20.04
- Codename: focal
Anaconda is configured as follows:
- Conda version: 4.11.0
You can use
conda update conda
and
conda update anaconda
to keep Anaconda up to date.
We recommend Python 3.9.1 as the base Python for your conda environment, because NumPy and Pandas under Python versions higher than 3.9 may require dependencies that are not included in the mclUMI installation, or may conflict with existing packages.
Step 1: create a conda environment, e.g., mclumi
conda create --name mclumi python=3.9.1
conda activate mclumi
Step 2: install mclUMI from PyPI (https://pypi.org/project/mclumix).
pip install --upgrade mclumix
After this two-step installation procedure, pip should report that mclumix was installed successfully.
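Independently of the installer's console output, a quick check from Python can confirm that the package landed in the active environment. The sketch below uses the standard importlib.metadata module. The distribution name mclumix is taken from the pip command above, and the helper name installed_version is illustrative.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# "mclumix" is the name used in the pip command above.
version = installed_version("mclumix")
if version is None:
    print("mclumix is not installed in this environment")
else:
    print(f"mclumix {version} is installed")
```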
Usage
To make mclUMI accessible to different groups of users, it can be used both through a command-line interface (CLI) and in inline mode.
1. CLI
1.1 Parameter illustration
Type mclumi -h to see the package usage shown below.
usage: mclumi [-h] [--read_structure read_structure] [--lens lens]
              [--input input] [--output output] [--method method]
              [--input_bam input_bam] [--edit_dist edit dist]
              [--inflation_value inflation_value]
              [--expansion_value expansion_value]
              [--iteration_number iteration_number]
              [--mcl_fold_thres mcl_fold_thres] [--is_sv is_sv]
              [--output_bam output_bam] [--verbose verbose]
              [--pos_tag pos_tag] [--gene_assigned_tag gene_assigned_tag]
              [--gene_is_assigned_tag gene_is_assigned_tag]
              tool

Welcome to the mclumi toolkit

positional arguments:
  tool                  trim, dedup_basic, dedup_pos, dedup_gene, dedup_sc

optional arguments:
  -h, --help            show this help message and exit
  --read_structure read_structure, -rs read_structure
                        str - the read structure with elements in conjunction
                        with +, e.g., primer_1+umi_1+seq_1+umi_2+primer_2
  --lens lens, -l lens  str - lengths of all sub-structures separated by +,
                        e.g., 20+10+40+10+20 if the read structure is
                        primer_1+umi_1+seq_1+umi_2+primer_2
  --input input, -i input
                        str - input a fastq file in gz format for trimming
                        UMIs
  --output output, -o output
                        str - output a UMI-trimmed fastq file in gz format.
  --method method, -m method
                        str - a dedup method: unique | cluster | adjacency |
                        directional | mcl | mcl_ed | mcl_val
  --input_bam input_bam, -ibam input_bam
                        str - input a bam file curated by requirements of
                        different dedup modules: dedup_basic, dedup_pos,
                        dedup_gene, dedup_sc
  --edit_dist edit dist, -ed edit dist
                        int - an edit distance used for building graphs at a
                        range of [1, l) where l is the length of a UMI
  --inflation_value inflation_value, -infv inflation_value
                        float - an inflation value for MCL, 2.0 by default
  --expansion_value expansion_value, -expv expansion_value
                        int - an expansion value for MCL at a range of (1,
                        +inf), 2 by default
  --iteration_number iteration_number, -itern iteration_number
                        int - iteration number for MCL at a range of (1,
                        +inf), 100 by default
  --mcl_fold_thres mcl_fold_thres, -fthres mcl_fold_thres
                        float - a fold threshold for MCL at a range of (1, l)
                        where l is the length of a UMI.
  --is_sv is_sv, -issv is_sv
                        bool - to make sure if the deduplicated reads writes
                        to a bam file (True by default or False)
  --output_bam output_bam, -obam output_bam
                        str - output UMI-deduplicated summary statistics to a
                        txt file.
  --verbose verbose, -vb verbose
                        bool - to enable if output logs are on console (True
                        by default or False)
  --pos_tag pos_tag, -pt pos_tag
                        str - to enable deduplication on the position tags (PO
                        recommended when your bam is tagged)
  --gene_assigned_tag gene_assigned_tag, -gt gene_assigned_tag
                        str - to enable deduplication on the gene tag (XT
                        recommended)
  --gene_is_assigned_tag gene_is_assigned_tag, -gist gene_is_assigned_tag
                        str - to check if reads are assigned the gene tag (XS
                        recommended)
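The --edit_dist option controls how the UMI graph is built: two UMIs are linked when they differ by at most the given distance. The following sketch illustrates that idea under two assumptions that the help text does not confirm: the distance is the Hamming distance between equal-length UMIs, and groups are read off as connected components. All function names are hypothetical and are not part of the mclumi API.

```python
from itertools import combinations


def hamming(a, b):
    """Count the positions at which two equal-length UMIs differ."""
    if len(a) != len(b):
        raise ValueError("UMIs must have equal length")
    return sum(x != y for x, y in zip(a, b))


def build_umi_graph(umis, edit_dist=1):
    """Link every pair of distinct UMIs whose distance is at most edit_dist."""
    graph = {umi: set() for umi in umis}
    for a, b in combinations(graph, 2):
        if hamming(a, b) <= edit_dist:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_components(graph):
    """Group nodes that are reachable from one another (depth-first search)."""
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(graph[node] - seen)
        components.append(sorted(component))
    return components


# Made-up UMIs for illustration only.
umis = ["AAAA", "AAAT", "AATT", "CCCC", "CCCG"]
graph = build_umi_graph(umis, edit_dist=1)
print(connected_components(graph))
```

A larger edit distance links more UMIs and therefore yields fewer, larger groups. The mcl-based methods would then split such groups further using graph flow rather than taking whole components.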
1.2 Example commands
- extracting and attaching UMIs to names of reads in fastq format
  mclumi trim -i ./pcr_1.fastq.gz -o ./pcr_trimmed.fastq.gz -rs primer_1+umi_1+seq_1+umi_2+primer_2 -l 20+10+40+10+20
- deduplication on only one genome position
  mclumi dedup_basic -m mcl -ed 1 -infv 1.6 -expv 2 -ibam ./example_bundle.bam -obam ./dedup.bam
- deduplication per genome position
  mclumi dedup_pos -m mcl -pt PO -ed 1 -infv 1.6 -expv 2 -ibam ./example_bundle.bam -obam ./basic/dedup.bam
- deduplication per gene (applicable to bulk RNA-seq data)
  mclumi dedup_gene -m directional -gt XT -gist XS -ed 1 -ibam ./hgmm_100_STAR_FC_sorted.bam -obam ./dedup.bam
- deduplication per cell per gene (applicable to single-cell RNA-seq data)
  mclumi dedup_sc -m directional -gt XT -gist XS -ed 1 -ibam ./hgmm_100_STAR_FC_sorted.bam -obam ./dedup.bam
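The -rs and -l values in the trim example describe how each read is laid out. The sketch below shows one way such a specification could be applied to a single read sequence. It only illustrates the read-structure convention and is not the code mclumi runs. The helper names split_read and extract_umi are hypothetical, and the way mclumi actually writes the UMI into the read name is not covered.

```python
def split_read(sequence, read_structure, lens):
    """Cut a read into named segments, following a '+'-separated structure and lengths."""
    names = read_structure.split("+")
    lengths = [int(value) for value in lens.split("+")]
    if len(names) != len(lengths):
        raise ValueError("read_structure and lens must have the same number of elements")
    if sum(lengths) > len(sequence):
        raise ValueError("read is shorter than the declared structure")
    segments = {}
    start = 0
    for name, length in zip(names, lengths):
        segments[name] = sequence[start:start + length]
        start += length
    return segments


def extract_umi(segments):
    """Concatenate all segments whose name starts with 'umi', in read order."""
    return "".join(seq for name, seq in segments.items() if name.startswith("umi"))


# Synthetic 100-base read matching primer_1+umi_1+seq_1+umi_2+primer_2 and 20+10+40+10+20.
read = "ACGT" * 5 + "A" * 10 + "G" * 40 + "C" * 10 + "TGCA" * 5
segments = split_read(read, "primer_1+umi_1+seq_1+umi_2+primer_2", "20+10+40+10+20")
print(extract_umi(segments))
print(segments["seq_1"])
```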
2. Inline
See the Jupyter notebooks in ./notebooks/
Output
See ./notebooks/results_spelt_out.ipynb for the result format. More output formats will be added.
Contact