Automated sequence truncation algorithm

Project description

minEER - A sensible sequence trimming algorithm

For a given quality-annotated read, the minEER algorithm identifies the subsequence that minimizes the expected error rate (EER; the mean q-score) while maximizing the subsequence length, subject to user-defined EER and minimum sequence length thresholds. This procedure mimics the manual exercise of choosing truncation positions by glancing at a quality profile distribution, but it gives consistent results and is fast (it can run on 20,000 files in under 15 s). The minEER algorithm improves on current heuristic methods (sliding windows, quality score drop-offs, etc.), which can be sensitive to noise and forgo a deterministic solution based on reasonable criteria. The algorithm itself can be seen here.
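
To make the idea concrete, here is a minimal brute-force sketch of the search described above. It is not the code shipped in mineer. It assumes that the EER of a window is the mean per-base error probability derived from Phred scores (p = 10^(-Q/10)), and the names phred_to_error, min_eer_window, and the min_len default are hypothetical. Only the 0.01 error threshold comes from the tutorial defaults.

```python
from itertools import accumulate


def phred_to_error(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def min_eer_window(quals, max_eer=0.01, min_len=3):
    """Return (start, end, eer) for the longest window with mean error <= max_eer.

    Ties in length are broken by the lower mean error. Returns None when no
    window of at least min_len bases meets the threshold.
    """
    errors = [phred_to_error(q) for q in quals]
    # prefix[i] holds the summed error of the first i bases.
    prefix = [0.0, *accumulate(errors)]
    n = len(errors)
    # Try lengths from longest to shortest so the first hit is the longest.
    for length in range(n, max(min_len, 1) - 1, -1):
        best = None
        for start in range(n - length + 1):
            eer = (prefix[start + length] - prefix[start]) / length
            if eer <= max_eer and (best is None or eer < best[2]):
                best = (start, start + length, eer)
        if best is not None:
            return best
    return None


# A read with poor quality at both ends and a clean middle.
print(min_eer_window([2, 2, 30, 30, 30, 30, 2]))
```

The prefix sums make each window's mean a constant-time lookup, so the whole search is quadratic in read length. For typical short reads that is cheap, though a production implementation may well use a different strategy.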

The minEER pipeline (see the documentation with mineer -h after installing) operates on an entire set of files from a single project. Assuming that all reads from a given direction (forward or reverse for paired reads) share a "similar" quality profile, the minEER pipeline runs the algorithm on a subsample of reads to determine global truncation positions (where to start and end each read). All reads are then truncated to these global positions, and reads that fail to meet the user-defined EER and minimum sequence length thresholds are filtered out.
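
The two pipeline stages described above can be sketched as follows. This is an illustration under stated assumptions, not the package's implementation: the text does not say how per-read positions are combined into global ones, so the median used here is a guess, and global_positions, truncate_and_filter, and the min_len default are hypothetical names and values. EER is again assumed to be the mean Phred-derived error probability.

```python
from statistics import median


def global_positions(windows):
    """Collapse per-read (start, end) windows into one global (start, end).

    Assumption: the median is used as the aggregate; mineer may differ.
    """
    starts = [s for s, _ in windows]
    ends = [e for _, e in windows]
    return int(median(starts)), int(median(ends))


def truncate_and_filter(reads, start, end, max_eer=0.01, min_len=3):
    """Cut every read to [start:end] and keep those passing both thresholds."""
    kept = []
    for seq, quals in reads:
        seq_cut, quals_cut = seq[start:end], quals[start:end]
        # Drop reads that are too short after truncation.
        if len(seq_cut) < min_len:
            continue
        # Mean error probability of the truncated read.
        eer = sum(10 ** (-q / 10) for q in quals_cut) / len(quals_cut)
        if eer <= max_eer:
            kept.append((seq_cut, quals_cut))
    return kept


# Windows found on a subsample of reads, then applied to all reads.
start, end = global_positions([(2, 6), (1, 6), (2, 7)])
reads = [
    ("ACGTACG", [2, 2, 30, 30, 30, 30, 2]),
    ("ACGTACG", [30, 30, 2, 2, 30, 30, 30]),
    ("ACGT", [30, 30, 30, 30]),
]
print(start, end, truncate_and_filter(reads, start, end))
```

Only the first read survives here: the second has low-quality bases inside the global window, and the third is too short once truncated.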

Install

Install with pip install mineer and then run mineer -h to view the input documentation and to test that installation worked properly.

Tutorial

After installing mineer, run the following:

# Download some fastq files to `sample_files/`
mineer-test-files
# Run the pipeline with default parameters (minimal acceptable error=.01)
mineer -i sample_files -f _1.fastq -r _2.fastq -o test_out -v sample_figs

Once you run the pipeline, a report of each step will appear as it executes. Files containing truncated reads will appear in the directory specified with -o. Providing the -v flag will produce visualizations of the quality profiles of untrimmed reads and of the distribution of truncation positions identified by minEER.

If you just want to compute truncation positions without writing out truncated files (which can take a while), run mineer without an output directory (no -o argument):

mineer -i sample_files -f _1.fastq -r _2.fastq

To produce visualizations without writing out truncated files, you can provide a directory with -v (again, without -o):

mineer -i sample_files -f _1.fastq -r _2.fastq -v sample_figs

Note that these examples use default parameters, which can be inspected with mineer -h.

Pipeline

Method:

Ingest files and recognize pairs based on file names (using xopen)
Run minEER on subset of reads
Determine global truncation positions
Truncate all reads to global positions and filter out read pairs that don't pass QC (currently requires longest length)
Save truncated sequences
Produce visualizations, if visualization output directory provided
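
The first step, recognizing pairs from file names, can be sketched as below. This is only an illustration: pair_fastq_files is a hypothetical helper, not a mineer function, and it matches files purely by the forward and reverse suffixes passed to -f and -r in the tutorial. It does not open the files, so it does not use xopen.

```python
from pathlib import Path


def pair_fastq_files(directory, fwd_suffix="_1.fastq", rev_suffix="_2.fastq"):
    """Return {sample: (forward_path, reverse_path)} for complete pairs."""
    forward, reverse = {}, {}
    for path in sorted(Path(directory).iterdir()):
        name = path.name
        # The sample name is whatever precedes the direction suffix.
        if name.endswith(fwd_suffix):
            forward[name[: -len(fwd_suffix)]] = path
        elif name.endswith(rev_suffix):
            reverse[name[: -len(rev_suffix)]] = path
    # Keep only samples that have both a forward and a reverse file.
    return {s: (forward[s], reverse[s]) for s in forward if s in reverse}
```

Calling pair_fastq_files("sample_files") on the tutorial directory would return one entry per sample that has both files. Samples with only one direction are skipped in this sketch, which may not match how mineer treats unpaired files.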

Contributing

Run tests with python -m unittest or pytest

Here is a longer list of SRRs to test on:

SRR9660346
SRR9660368
SRR9660375
SRR9660380
SRR9660372
SRR9660321
SRR9660322
SRR9660307
SRR9660387
SRR9660385

Project details

Release history

0.6.0 (this version): Aug 17, 2021
0.5.2: Aug 4, 2021
0.5.1: Aug 4, 2021
0.5.0: Aug 4, 2021
0.4.2: Jul 22, 2021
0.4.1: Jul 22, 2021

