rrequested-demultiplex: A versatile package to preprocess raw basecalled reads with quality and size filtering and ex-novo demultiplexing

These details have not been verified by PyPI

Project links

Development Status
- 2 - Pre-Alpha
Intended Audience
- Science/Research
License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

rrequested-demultiplex

-General purpose and applicability

rrequested-demultiplex is the python package distribution of the modular shellscript tool RREQUESTED (Raw REads QUality Extraction, Size Trimming and Ex-novo Demultiplexing). It is aimed at preprocessing (quality filtering, size selection and demultiplexing) of raw basecalled reads, especially produced by Oxford Nanopore, a Third Generation Sequencing technology.

Even though the program has been tested only on Nanopore sequencing results, the program could be cautiously applied also to products from NGS or other ThirdGen technologies.

-Installation

The basic requirment for this installation is to have python 3.10 installed on your machine.

To install the package, just run:

pip install rrequested-demultiplex

You can retrieve general information on it by running:

pip show rrequested-demultiplex

-Functions and usage

Module	Class	Method	Description
file_system	FileSystem	`makedir_orchange`	Method returns the provided path after attempting to create the directory, handling the case where the directory already exists.
file_system	FileSystem	`get_base_dir`	Method extracts information such as the base directory, base name, and the base directory of the base directory from the provided file path, using the os.path.splitext function and string manipulation.
quality_filter	Quality_Filter	`load_data`	Method reads data from the input file, handling both gzipped and non-gzipped fastq formats.
quality_filter	Quality_Filter	`ascii_conv_and_mean`	Method converts ASCII characters to Phred quality scores and calculates the mean.
quality_filter	Quality_Filter	`filter`	Method performs the filtering based on the quality threshold and updates the input file with filtered reads, providing a summary of the process.
size_filter	Size_Filter	`load_data`	Method reads data from the input file, handling both gzipped and non-gzipped fasta/fastq formats.
size_filter	Size_Filter	`size_filter`	Method performs the size-based filtering and writes the selected sequences to a new file, providing a summary of the process.
unref_demultiplex	Unreferenced_demultiplexing	`find_the_num`	Method finds highly divergent sequences from the input file.
unref_demultiplex	Unreferenced_demultiplexing	`find_the_num_list`	Method finds highly divergent sequences from a given list of sequences.
unref_demultiplex	Unreferenced_demultiplexing	`demultiplex`	Method performs the demultiplexing process and writes the demultiplexed sequences into separate files.

Here are some example usages of the package:

Size_Filter

# Import necessary modules
from rrequested_demultiplex.size_filter import Size_Filter

# Specify input file, minimum size, and maximum size
input_file = "your_input_file.fastq"
min_size = 20
max_size = 200

# Create an instance of Size_Filter
size_filter_instance = Size_Filter(infile=input_file, minsize=min_size, maxsize=max_size)

# Load data from the input file
data = size_filter_instance.load_data()

# Perform size filtering
filtered_data = size_filter_instance.size_filter()

Quality_Filter

# Import necessary modules
from rrequested_demultiplex.quality_filter import Quality_Filter

# Specify input file and quality threshold
input_file = "your_input_file.fastq"
quality_threshold = 20

# Create an instance of Quality_Filter
quality_filter_instance = Quality_Filter(infile=input_file, quality=quality_threshold)

# Load data from the input file
data = quality_filter_instance.load_data()

# Perform quality filtering
quality_filter_instance.filter()

FileSystem

# Import necessary modules
from rrequested_demultiplex.file_system import FileSystem

# Create an instance of FileSystem
file_system_instance = FileSystem()

# Specify the path where you want to create a directory
directory_path = "/path/to/your/directory"

# Create a new directory or change to an existing one
base_directory = file_system_instance.makedir_orchange(directory_path)

# Get information about the base directory
base_dir, base_name, base_basedir = file_system_instance.get_base_dir(base_directory)

# Display the directory information
print("Base Directory:", base_dir)
print("Base Name (without extension):", base_name)
print("Base Directory of the Base Directory:", base_basedir)

Unreferenced_Demultiplexing

# Import necessary modules
from rrequested_demultiplex.unreferenced_demultiplexing import Unreferenced_Demultiplexing

# Specify input file
input_file = "input_file.fasta"

# Create an instance of Unreferenced_Demultiplexing
demultiplex_instance = Unreferenced_Demultiplexing(infile=input_file)

# Find highly divergent sequences
highly_divergent_seqs = demultiplex_instance.find_the_num()

# Demultiplex the sequences
demultiplex_instance.demultiplex()

# The demultiplexed sequences will be written to individual files in a new folder
# The folder will be named based on the input file, in this example: "input_file-fa-demultiplexed"

-How does it work?

The quality filtering method is based on the easiest implementation one could think of: for every read, the filtering algorithms takes the mean quality and discards the reads that are under a given value (default is 7, so this step will take place nevertheless if the file is fastq/fastq.gz)
The size filtering method is also based on the easiest implementation one could think of: for every read, the filtering algorithms takes the length and, if this is below the minimum or above the maximum allowed, the read gets discarded.
The demultiplexing method is based on super-fast global alignment and it is divided into two main parts: in the first, the demultiplexer identifies unique (higly divergent) reads, that are the ones which score less than 50% in similarity with all the other sequences. After that, it globaly aligns all the raw reads against the "self-made reference", grouping the ones that share more than 70% of their code. After having demultiplexed this way, the program checks the leftovers, to see wether there are worthy-to-save data or not. This brings to a five-round cycle that identifies higly divergent sequences in the non-grouped ones and clusters the "nogroup" reads against them. If there are still ungrouped remainders after this step, they get clustered together (3). Only groups encompassing more than 1% of the total reads will be written as demultiplexed fasta files, named N.fasta (where N is a number) or nogroup.fasta if they belong to the unclustered reads: they could be found in the folder basefilename-extensionabbreviation-demultiplexed (an example could be: if you are demultiplexing a file named coleoptera.fastq, the folder will be coleoptera-fq-demultiplexed)

-Final considerations

As a practical suggestion, we strongly advise to be cautious while using rrequested-demultiplex with files containing reads from multiple individuals, especially if the quality of the data is low: not because there is the risk that you will miss something, but because it can produce more groups than needed.

Moreover, please note that rrequested-demultiplex is still experimental and may contain errors, may fail/take really long while performing large analyses with limited computational power (e.g. on a normal laptop) and may output not-100%-reliable results, so always check them and pull issues whenever you feel it to be the case, we'll be on your back as soon as possible to fix/implement/enhance whatever you suggest!

-License and rights of usage

The code is distributed under the MIT license.

The MIT License is a straightforward and permissive license that encourages collaboration and widespread use. It allows developers the freedom to build upon and share code while providing clear guidelines for responsible and respectful use. Whether you are an individual or a corporation, the MIT License promotes an open and collaborative approach to software development.

Project details

These details have not been verified by PyPI

Project links

Development Status
- 2 - Pre-Alpha
Intended Audience
- Science/Research
License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Release history Release notifications | RSS feed

This version

1.2.0

Dec 24, 2023

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

rrequested-demultiplex-1.2.0.tar.gz (12.9 kB view details)

Uploaded Dec 24, 2023 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

rrequested_demultiplex-1.2.0-py3-none-any.whl (12.5 kB view details)

Uploaded Dec 24, 2023 Python 3

File details

Details for the file rrequested-demultiplex-1.2.0.tar.gz.

File metadata

Download URL: rrequested-demultiplex-1.2.0.tar.gz
Upload date: Dec 24, 2023
Size: 12.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.10.12

File hashes

Hashes for rrequested-demultiplex-1.2.0.tar.gz
Algorithm	Hash digest
SHA256	`8f3dfafa9172f94637f9336d1ca91f82716888b5f4c03de7f1fa6267f2a5487b`
MD5	`7033894c6ad3b7798146561c922e36e4`
BLAKE2b-256	`9f0fe7060e3dfd90ada3adfcdf9fae7168d894f362d7f87f67162dbe74f96b5a`

See more details on using hashes here.

File details

Details for the file rrequested_demultiplex-1.2.0-py3-none-any.whl.

File metadata

Download URL: rrequested_demultiplex-1.2.0-py3-none-any.whl
Upload date: Dec 24, 2023
Size: 12.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.10.12

File hashes

Hashes for rrequested_demultiplex-1.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`478be7d7910c00d8a6f8d889697860304ace73c4fee0c1a1251528d231ce04a7`
MD5	`2c4bed457b33aae9540ec825446ba794`
BLAKE2b-256	`1195d3189dbb7d08d188a00c2c2926119275acbb6d481d71a2a447d6a4774d66`

See more details on using hashes here.

rrequested-demultiplex 1.2.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

rrequested-demultiplex

-General purpose and applicability

-Installation

-Functions and usage

-How does it work?

-Final considerations

-License and rights of usage

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes