A package for analyzing nanopore sequencing data, focusing on clustering raw reads based on methylation and k-mers.


NanoMethCluster

NanoMethCluster is a Python package for analyzing nanopore sequencing data that includes methylation information. It provides tools for analyzing BAM files, extracting methylation information, processing k-mer data, general data analysis, and plotting. All tools are available through a command-line interface (CLI), which can also create interactive plots.

Installation

Prepare the environment by installing the dependencies with pip:

pip install pandas numpy pysam joblib matplotlib scikit-learn umap-learn plotly

Or use conda:

conda create -n nanomethcluster python=3.10 pandas numpy pysam joblib matplotlib scikit-learn umap-learn plotly

conda activate nanomethcluster

You can install NanoMethCluster using pip:

pip install NanoMethCluster

Or from source:

git clone https://github.com/alekseizarubin/NanoMethCluster.git
cd NanoMethCluster
pip install .

Usage

NanoMethCluster provides several command-line tools:

BAMReadAnalyzer

Preparing input BAM files

dorado basecaller sup,5mCG_5hmCG --min-qscore 10 <pod5> > <bam>

dorado demux --kit-name <KIT> --output-dir <output_dir> <bam>

BAMReadAnalyzer

Analyze BAM files to extract read information and save it to a TSV file with additional metadata.

NanoMethCluster BAMReadAnalyzer <bam_file> <output_file> [options]

Options:

-k, --kmer_length (int): Length of k-mers to count (2 or 3). Set to 0 to disable counting. (default: 0)
--include_mlbc: Include ML_B_C_Values in the output.
--total_ml: Process ML_B_C_Values to sum paired values.
--meth_calc (str): Methylation calculation method: base or probability. (default: base)
--num_simulations (int): Number of simulations for probability calculation. (default: 1000)
--seed (int): Seed for random number generator. (default: 42)
-t, --threads (int): Number of threads for parallel processing. (default: 1)
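
To make the -k option concrete, here is a minimal sketch of overlapping k-mer counting for a single read. It is an illustration only, not the package's actual implementation: the function name count_kmers and the choice to ignore k-mers containing non-ACGT bases are assumptions.

```python
from collections import Counter
from itertools import product


def count_kmers(sequence, k):
    """Count overlapping k-mers in one read (hypothetical helper, not the package API)."""
    if k == 0:
        return {}  # mirrors "-k 0", which disables k-mer counting
    sequence = sequence.upper()
    observed = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Fixed column set so every read yields the same keys; k-mers containing
    # other characters (for example N) are ignored. This is an assumption.
    all_kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return {kmer: observed.get(kmer, 0) for kmer in all_kmers}


counts = count_kmers("ACGTACGT", 2)
print(counts["AC"], counts["CG"], counts["TA"])  # prints: 2 2 1
```

With k = 2 this yields 16 columns per read and with k = 3 it yields 64, which keeps the per-read table rectangular.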

KmerDataProcessor

Process k-mer data from a TSV file.

NanoMethCluster KmerDataProcessor <input_file> <output_file> [options]

Options:

--min_length (int): Minimum length threshold for filtering.
--min_quality (float): Minimum quality threshold for filtering.
--norm (str): Normalization method: n (none), c (complementary), r (reverse), or cr (complementary and reverse). (default: n)
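
The --norm modes can be pictured as merging the counts of k-mers that are considered equivalent. The sketch below is one plausible reading, not the package's code: it assumes cr means the reverse complement, and the helper names equivalents and normalize_counts are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def equivalents(kmer, norm):
    """Return the k-mers treated as equivalent to `kmer` under a --norm style mode."""
    forms = {kmer}
    if norm == "c":
        forms.add(kmer.translate(COMPLEMENT))        # complement
    elif norm == "r":
        forms.add(kmer[::-1])                        # reverse
    elif norm == "cr":
        forms.add(kmer.translate(COMPLEMENT)[::-1])  # reverse complement (assumed meaning)
    elif norm != "n":
        raise ValueError(f"unknown norm: {norm}")
    return forms


def normalize_counts(counts, norm):
    """Merge counts of equivalent k-mers under the lexicographically smallest name."""
    merged = {}
    for kmer, n in counts.items():
        key = min(equivalents(kmer, norm))
        merged[key] = merged.get(key, 0) + n
    return merged


print(normalize_counts({"AC": 3, "GT": 2, "CA": 1}, "cr"))  # prints: {'AC': 5, 'CA': 1}
```

Merging in this way makes the k-mer profile independent of which strand a read came from, which is usually the motivation for this kind of normalization.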

DataAnalyzer

Analyze data from TSV files and perform PCA, t-SNE, UMAP analyses, and K-means clustering.

NanoMethCluster DataAnalyzer <file_path> -o <output> [options]

Options:

-s, --sep (str): Column separator for the input file. (default: tab)
--pca (str): Perform PCA analysis with specified number of components or "auto" for default.
--tsne (int): Perform t-SNE analysis with specified number of dimensions. (default: 2)
--umap (int): Perform UMAP analysis with specified number of dimensions. (default: 2)
-k, --kmeans (int): Perform K-means clustering and determine optimal number of clusters up to specified maximum.
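
The description of --kmeans does not say how the optimal number of clusters is chosen. As an illustration only, the sketch below tries every k from 2 up to a maximum and keeps the one with the highest mean silhouette score. The silhouette criterion, the farthest-point initialization, and all function names are assumptions, and the code expects at least max_k distinct points. In practice scikit-learn's KMeans and silhouette_score would replace the hand-written parts.

```python
import math


def kmeans(points, k, iterations=100):
    """Lloyd's algorithm with deterministic farthest-point initialization."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from its nearest existing center.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = None
    for _ in range(iterations):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster ends up empty
                centers[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient; singleton clusters score 0 by convention."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        total += (b - a) / max(a, b)
    return total / len(points)


def best_k(points, max_k):
    """Try k = 2..max_k and return the k with the highest mean silhouette."""
    scores = {k: silhouette(points, kmeans(points, k)) for k in range(2, max_k + 1)}
    return max(scores, key=scores.get), scores


# Toy data: three tight groups of four points each.
points = [(x + dx, y + dy)
          for x, y in [(0, 0), (10, 0), (0, 10)]
          for dx in (0, 0.2) for dy in (0, 0.2)]
k, scores = best_k(points, 5)
print(k)
```

For these three well-separated groups the highest silhouette is reached at k = 3.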

PlotGenerator

Generate interactive Plotly graphs from analysis results and create matrices.

NanoMethCluster PlotGenerator <input_dir> [options]

Options:

--pca (int): Indices of the PCA components to visualize; give 2 or 3 values (for example, 0 1).
--tsne (int): Indices of the t-SNE components to visualize; give 2 or 3 values (for example, 0 1).
--umap (int): Indices of the UMAP components to visualize; give 2 or 3 values (for example, 0 1).
-k, --kmeans (int): Number of clusters to use for coloring the plots.
--additional_file (str): Additional TSV or TSV.GZ file with extra data.
--kraken (str): Kraken2 TSV file with tab delimiter and no header, to use for additional coloring.
--matrix: Flag to create matrices for k-means clusters and additional data or Kraken data.
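
For orientation on the --kraken input, the sketch below reads Kraken2's standard per-read output (classification status, read ID, taxonomy ID, read length, LCA mapping; tab-delimited, no header) into a mapping from read ID to taxonomy ID. Which columns PlotGenerator actually uses is not stated here, so the function read_kraken_assignments and the example read IDs are hypothetical.

```python
import csv
import io


def read_kraken_assignments(handle):
    """Map read ID -> taxonomy ID from Kraken2's standard per-read output."""
    assignments = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        status, read_id, taxid = row[0], row[1], row[2]
        # Unclassified reads ("U") are recorded with taxonomy ID "0".
        assignments[read_id] = taxid if status == "C" else "0"
    return assignments


# Two made-up lines in Kraken2's five-column layout.
example = io.StringIO("C\tread_1\t562\t1500\t562:40\nU\tread_2\t0\t900\t0:30\n")
print(read_kraken_assignments(example))  # prints: {'read_1': '562', 'read_2': '0'}
```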

Examples

Example 1: Analyze a BAM file

NanoMethCluster BAMReadAnalyzer sample.bam output.tsv.gz -k 3 --include_mlbc --total_ml --meth_calc base --num_simulations 1000 --threads 4

This command analyzes sample.bam, counts 3-mers, includes the ML_B_C values and sums their paired values, uses the base method for the methylation calculation, and runs on 4 threads. It also passes --num_simulations 1000, which is the default and, according to the option description, applies to the probability calculation.
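
The difference between the two --meth_calc methods is not spelled out above. As a rough illustration of one plausible reading, the sketch below contrasts a threshold-based fraction with a Monte Carlo estimate over per-site probabilities. Everything here is an assumption rather than the package's code: the function names, the 0.5 threshold, and the mapping of ML bytes to probabilities (midpoint of the N/256 to (N+1)/256 bin defined for the ML tag in the SAM optional-fields specification).

```python
import random


def ml_to_probability(ml):
    """Convert an ML byte (0-255) to the midpoint of its probability bin."""
    return (ml + 0.5) / 256


def methylation_base(ml_values, threshold=0.5):
    """Fraction of sites whose modification probability exceeds a threshold (assumed rule)."""
    calls = [ml_to_probability(v) > threshold for v in ml_values]
    return sum(calls) / len(calls)


def methylation_probability(ml_values, num_simulations=1000, seed=42):
    """Monte Carlo estimate: each site is drawn as methylated with its own probability."""
    rng = random.Random(seed)
    probs = [ml_to_probability(v) for v in ml_values]
    total = 0.0
    for _ in range(num_simulations):
        total += sum(rng.random() < p for p in probs) / len(probs)
    return total / num_simulations


ml_values = [250, 240, 130, 20, 5]  # made-up ML bytes for one read
print(methylation_base(ml_values))         # prints: 0.6
print(methylation_probability(ml_values))  # close to the mean site probability
```

Under this reading, the base method discards uncertainty for borderline sites such as the value 130, while the simulation lets them contribute in proportion to their probability.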

Example 2: Process k-mer data

NanoMethCluster KmerDataProcessor input.tsv.gz output.tsv.gz --min_length 1000 --min_quality 30.0 --norm cr

This command processes the k-mer data from input.tsv.gz, applies a minimum length threshold of 1000 and a minimum quality threshold of 30.0, and normalizes k-mers using the cr (complementary and reverse) method.

Example 3: Perform data analysis

NanoMethCluster DataAnalyzer example.tsv.gz -o results --pca auto --tsne 2 --umap 2 --kmeans 10

This command analyzes the example.tsv.gz file, performs PCA with an automatic number of components, performs t-SNE and UMAP analyses with 2 dimensions each, and determines the optimal number of clusters up to 10 for K-means clustering.

Example 4: Generate interactive plots

NanoMethCluster PlotGenerator results --pca 0 1 --tsne 0 1 --umap 0 1 --kmeans 3 --additional_file additional_example.tsv.gz --kraken kraken.tsv --matrix

This command generates interactive Plotly plots from the analysis results in the results directory. It visualizes the first two PCA, t-SNE, and UMAP components, colors the plots by 3 clusters, reads extra data from additional_example.tsv.gz, uses kraken.tsv for additional coloring, and creates matrices for the k-means clusters and the additional data.
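
The four examples can be chained into one workflow. The sketch below only builds and prints the command lines using flags documented above; it does not run anything. It assumes that each tool's output can be used directly as the next tool's input, which the examples suggest but do not state. The function build_pipeline and the intermediate file names are hypothetical.

```python
import shlex


def build_pipeline(bam, workdir="results", threads=4, max_clusters=10):
    """Return the four example commands as argument lists, in run order."""
    reads_tsv = "reads.tsv.gz"  # hypothetical intermediate file names
    kmers_tsv = "kmers.tsv.gz"
    return [
        ["NanoMethCluster", "BAMReadAnalyzer", bam, reads_tsv,
         "-k", "3", "--threads", str(threads)],
        ["NanoMethCluster", "KmerDataProcessor", reads_tsv, kmers_tsv,
         "--norm", "cr"],
        ["NanoMethCluster", "DataAnalyzer", kmers_tsv, "-o", workdir,
         "--pca", "auto", "--umap", "2", "--kmeans", str(max_clusters)],
        ["NanoMethCluster", "PlotGenerator", workdir,
         "--umap", "0", "1", "--kmeans", "3"],
    ]


for argv in build_pipeline("sample.bam"):
    # To execute instead of print: subprocess.run(argv, check=True)
    print(shlex.join(argv))
```

Keeping the commands as argument lists avoids shell quoting problems if file names contain spaces.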

Data Directory

The data directory is included for storing sample data or input files required for running the analyses. Place the necessary data files in this directory before running the commands.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contact

If you have any questions or suggestions, please open an issue or contact Aleksei Zarubin at a.a.zarubin@gmail.com.

Release history

0.1.1 (this version): May 23, 2024
0.1.0: May 17, 2024

Download files

Source distribution: nanomethcluster-0.1.1.tar.gz (47.0 MB), uploaded May 23, 2024
Built distribution: nanomethcluster-0.1.1-py2.py3-none-any.whl (15.9 kB), uploaded May 23, 2024, for Python 2 and Python 3

Hashes for nanomethcluster-0.1.1.tar.gz

Algorithm	Hash digest
SHA256	`e97ede2a356409911e00b70e51e7ab87d6ce7fe0a2119d1d81fa7b02a69f3ce4`
MD5	`55558b60e6d5359f9e94e017df637392`
BLAKE2b-256	`4c7745bbf3c8674e6f97b3c592aba9a7f7c1831edbd441e88c0c1183b403e9aa`

Hashes for nanomethcluster-0.1.1-py2.py3-none-any.whl

Algorithm	Hash digest
SHA256	`1bf077d573baedbc9f835bfa7ccec5492e7b3cfa9bd2db14a3c71d2898ab9f19`
MD5	`b1a88eeb4a30295f0203fb5a4e617a3b`
BLAKE2b-256	`d82e2cc71c2666c29dd1bd77af8840476e85fa40dc873a3b425debbcd72d0943`