A tool for identification of adapters and other sequence patterns in the next generation sequencing (NGS) data

FastContext

Description

FastContext is a tool for identifying adapters and other sequence patterns in next-generation sequencing (NGS) data. It parses FastQ files (in single-end or paired-end mode), searches each read or read pair for user-specified patterns, and generates a human-readable representation of the search results, called the "read structure". FastContext also gathers statistics on the frequency of occurrence of each read structure.
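To make the idea of a read structure concrete, here is a minimal Python sketch. It is not FastContext's actual implementation: the function names, the exact-match search, the overlap handling, and the rule that gaps no longer than the K-mer size are written as K-mers are all assumptions made for illustration.

```python
# Illustrative sketch only: not FastContext's actual implementation.
COMPLEMENT = str.maketrans("ATGC", "TACG")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence over the ATGC alphabet."""
    return seq.translate(COMPLEMENT)[::-1]


def gap_label(length, kmer_size, unrecognized):
    """Label an unmatched stretch (assumed rule: short gaps become K-mers)."""
    if 0 < length <= kmer_size:
        return "{kmer:%dbp}" % length
    return "{%s}" % unrecognized


def read_structure(read, patterns, kmer_size=0, unrecognized="unknown"):
    """Build a text read structure from exact matches of each pattern."""
    hits = []  # accepted matches as (start, end, name, strand)
    for name, seq in patterns.items():  # dict order is the search order
        for strand, probe in (("F", seq), ("R", reverse_complement(seq))):
            start = read.find(probe)
            while start != -1:
                end = start + len(probe)
                # Keep a match only if it does not overlap an earlier one.
                if all(end <= s or start >= e for s, e, _, _ in hits):
                    hits.append((start, end, name, strand))
                start = read.find(probe, start + 1)
    parts, cursor = [], 0
    for start, end, name, strand in sorted(hits):
        if start > cursor:
            parts.append(gap_label(start - cursor, kmer_size, unrecognized))
        parts.append("{%s:%s}" % (name, strand))
        cursor = end
    if cursor < len(read):
        parts.append(gap_label(len(read) - cursor, kmer_size, unrecognized))
    return "--".join(parts)


patterns = {"oligme": "CTGTCTCTTATACACATCT", "s502": "CTCTCTAT"}
read = "ACGT" + "CTGTCTCTTATACACATCT" + "TTTT" + "ATAGAGAG" + "GG"
print(read_structure(read, patterns))
```

The sketch accepts exact matches only and searches both strands of every pattern; the real tool may resolve overlapping or ambiguous matches differently.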

Installation

python3 -m pip install FastContext

Check installation:

FastContext --help

Usage

Required arguments:

-1, --r1
  Description: FastQ input R1 file. May be uncompressed, gzipped, or bzipped.
  Format: String
  Usage: -1 input.fastq.gz
-p, --patterns
  Description: Patterns to look for. Patterns are searched in the order given.
  Format: JSON object string (key-value). Names must contain 2-24 Latin letters, digits, hyphens, or underscores; sequences must contain more than one symbol from ATGC.
  Usage: -p '{"First": "CTCAGCGCTGAG", "Second": "AAAAAA", "Third": "GATC"}'
-s, --summary
  Description: Output HTML file. Contains a statistics summary in human-readable form.
  Format: String
  Usage: -s statistics.htm

Optional arguments:

-2, --r2
  Description: FastQ input R2 file. May be uncompressed, gzipped, or bzipped. Omit this option in single-end mode.
  Format: String
  Usage: -2 input_R2.fastq.gz
-j, --json
  Description: Output JSON.GZ file (gzipped JSON). Contains extended statistics on pattern sequences and on each read or read pair: read structure and Levenshtein distances (see the -l option).
  Format: String
  Usage: -j statistics.json.gz
-k, --kmer-size
  Description: Maximum length of an unrecognized sequence that is written as a K-mer together with its length.
  Format: Non-negative integer
  Default: 0
  Usage: -k 9
-u, --unrecognized
  Description: Name that replaces long unrecognized sequences in the read structure.
  Format: 2-24 Latin letters, digits, hyphens, or underscores
  Default: unknown
  Usage: -u genome
-m, --max-reads
  Description: Maximum number of reads to analyze (0 means no limit). Note that analyzing more reads than recommended may exhaust memory.
  Format: Non-negative integer
  Default: 1000000
  Usage: -m 1000
-f, --rate-floor
  Description: Minimum rate at which a read structure must occur to be written into the statistics TSV table.
  Format: Float from 0 to 1
  Default: 0.001
  Usage: -f 0.1
-@, --threads
  Description: Number of threads.
  Format: Non-negative integer less than 2 * cpu_count()
  Default: cpu_count()
  Usage: -@ 10
-d, --dont-check-read-names
  Description: Don't check read names. Use this if you have unusual (non-Illumina) paired read names. Makes sense only in paired-end mode.
  Usage: -d
-l, --levenshtein
  Description: Calculate the Levenshtein distance of each pattern at each position in the read. Results are written into the extended statistics file (JSON.GZ). Note that this greatly increases processing time.
  Usage: -l
-h, --help
  Description: Show help message and exit.
  Usage: -h
-v, --version
  Description: Show program's version number and exit.
  Usage: -v
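The pattern format rules and the options above can be combined into a small helper that validates patterns and assembles a command line. This is a sketch under stated assumptions: `check_patterns` and `build_command` are hypothetical helper names rather than part of FastContext, and the regular expressions are one reading of the documented name and sequence rules.

```python
import json
import re

# Assumed reading of the documented rules for -p.
NAME_RE = re.compile(r"[A-Za-z0-9_-]{2,24}")
SEQ_RE = re.compile(r"[ATGC]{2,}")


def check_patterns(patterns):
    """Raise ValueError if a name or sequence breaks the documented format."""
    for name, seq in patterns.items():
        if not NAME_RE.fullmatch(name):
            raise ValueError("bad pattern name: %r" % name)
        if not SEQ_RE.fullmatch(seq):
            raise ValueError("bad pattern sequence: %r" % seq)
    return patterns


def build_command(r1, patterns, summary, r2=None, json_out=None, kmer_size=None):
    """Assemble an argument list using only the options documented above."""
    cmd = ["FastContext", "-1", r1,
           "-p", json.dumps(check_patterns(patterns)),
           "-s", summary]
    if r2 is not None:
        cmd += ["-2", r2]
    if json_out is not None:
        cmd += ["-j", json_out]
    if kmer_size is not None:
        cmd += ["-k", str(kmer_size)]
    return cmd


cmd = build_command(
    "input.fastq.gz",
    {"First": "CTCAGCGCTGAG", "Third": "GATC"},
    "statistics.htm",
    json_out="statistics.json.gz",
    kmer_size=9,
)
print(cmd)
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Passing a list avoids shell quoting problems with the JSON pattern string.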

Examples

Summary statistics table

Contains counts, percentages, and read structures. The K-mer length or the pattern strand (F for forward, R for reverse) is displayed after the colon.

Example:

R1

Count	Percentage	Read Structure
5,197	48.807	{unknown}
3,297	30.963	{unknown}--{oligme:F}--{oligb:F}--{701:F}--{unknown}
114	1.070	{unknown}--{oligb:F}--{701:F}--{unknown}
71	0.666	{unknown}--{oligme:F}--{unknown}
69	0.648	{unknown}--{oligme:F}--{unknown}--{701:F}--{unknown}
60	0.563	{unknown}--{oligme:F}--{oligb:F}--{701:F}--{kmer:14bp}

R2

Count	Percentage	Read Structure
7,545	70.858	{unknown}
616	5.785	{unknown}--{oligme:F}--{oliga:R}--{502:R}--{unknown}
540	5.071	{unknown}--{oligme:F}--{unknown}
441	4.141	{unknown}--{oligme:F}--{oliga:R}--{unknown}
298	2.798	{unknown}--{oliga:R}--{unknown}
263	2.469	{unknown}--{502:R}--{unknown}
233	2.188	{unknown}--{oligme:F}--{kmer:14bp}--{502:R}--{unknown}
163	1.530	{unknown}--{oliga:R}--{502:R}--{unknown}
56	0.525	{unknown}--{502:F}--{unknown}
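Read structure strings like the ones in the tables above are easy to split back into their elements. The following is a minimal sketch; `parse_read_structure` is a hypothetical helper, and the `"kmer"` type label in its output is an assumption for illustration (the `"pattern"` and `"unrecognized"` labels follow the JSON example in this document).

```python
import re

# One element: {name} or {name:suffix}
ELEMENT_RE = re.compile(r"\{([^{}:]+)(?::([^{}]+))?\}")
KMER_RE = re.compile(r"(\d+)bp")


def parse_read_structure(text):
    """Split a text read structure into a list of element dictionaries."""
    elements = []
    for name, suffix in ELEMENT_RE.findall(text):
        kmer = KMER_RE.fullmatch(suffix)
        if suffix in ("F", "R"):
            elements.append({"type": "pattern", "name": name, "strand": suffix})
        elif name == "kmer" and kmer:
            elements.append({"type": "kmer", "length": int(kmer.group(1))})
        elif suffix == "":
            elements.append({"type": "unrecognized", "name": name})
        else:
            raise ValueError("unexpected element: {%s:%s}" % (name, suffix))
    return elements


parsed = parse_read_structure("{unknown}--{oligme:F}--{kmer:14bp}--{502:R}--{unknown}")
for element in parsed:
    print(element)
```

Matching whole `{...}` elements instead of splitting on `--` keeps the parser working when a pattern name itself contains hyphens.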

Extended statistics JSON.GZ file

Contains extended statistics: run options, performance, pattern analysis, the full summary without a rate floor, and the analysis of each read. The example below is shortened.

{
	"FastQ": {
		"R1": "tests/standard_test_R1.fastq.gz",
		"R2": "tests/standard_test_R2.fastq.gz"
	},
	"RunData": {
		"Read Type": "Paired-end",
		"Max Reads": 100,
		"Rate Floor": 0.001
	},
	"Performance": {
		"Reads Analyzed": 100,
		"Threads": 4,
		"Started": "2022-07-13T18:15:48.277660",
		"Finished": "2022-07-13T18:15:48.964721"
	},
	"PatternsData": {
		"PatternsList": {
			"oligme": {
				"F": "CTGTCTCTTATACACATCT",
				"R": "AGATGTGTATAAGAGACAG",
				"Length": 19
			},
			"s502": {
				"F": "CTCTCTAT",
				"R": "ATAGAGAG",
				"Length": 8
			}
		},
		"PatternsAnalysis": [
			{
				"Analysis": "reverse complement only",
				"FirstPattern": "oligme",
				"SecondPattern": "oligme",
				"FirstLength": 19,
				"SecondLength": 19,
				"LevenshteinAbsolute": 11,
				"LevenshteinSimilarity": 0.42105263157894735,
				"Type": "good",
				"Risk": "low"
			},
			{
				"Analysis": "full",
				"FirstPattern": "oligme",
				"SecondPattern": "s502",
				"FirstLength": 19,
				"SecondLength": 8,
				"LevenshteinAbsolute": 2,
				"LevenshteinSimilarity": 0.75,
				"Type": "nested",
				"Risk": "medium"
			}
		],
		"Other": {
			"Unrecognized Sequence": "unknown",
			"K-mer Max Size": 15
		}
	},
	"Summary": {
		"R1": {
			"{unknown}--{oligme:F}--{oligb:F}--{s701:F}--{unknown}": {
				"Count": 34,
				"Percentage": 34.0,
				"ReadStructure": [
					{ "type": "unrecognized" },
					{ "type": "pattern", "name": "oligme", "strand": "F" },
					{ "type": "pattern", "name": "oligb", "strand": "F" },
					{ "type": "pattern", "name": "s701", "strand": "F" },
					{ "type": "unrecognized" }
				]
			},
			"{unknown}--{oligme:F}--{unknown}--{s701:F}--{unknown}": [ "..." ],
			"{unknown}--{s701:F}--{unknown}": [ "..." ]
		},
		"R2": [ "..." ]
	},
	"RawDataset": [
		{
			"Name": "M02435:112:000000000-DFC9M:1:1101:14970:1484",
			"R1": {
				"Sequence": "ACCTAGAAGAGCCAAAAGACTCT...AATCTCGTATGCCGTCT",
				"PhredQual": [29,32,32,33,33,37,37,37,37,"...",38,38,38,13],
				"Levenshtein": [
					{
						"name": "oligme",
						"strand": "F",
						"length": 19,
						"values": [14,14,12,13,12,12,12,"...",NaN,NaN,NaN]
					},
					{
						"name": "oligme",
						"strand": "R",
						"length": 19,
						"values": [12,11,10,9,9,9,10,10,"...",NaN,NaN,NaN]
					}
				],
				"ReadStructure": [
					{ "type": "unrecognized" },
					{ "type": "pattern", "name": "oligme", "strand": "F" },
					{ "type": "pattern", "name": "oligb", "strand": "F" },
					{ "type": "pattern", "name": "s701", "strand": "F" },
					{ "type": "unrecognized" }
				],
				"TextReadStructure": "{unknown}--{oligme:F}--{oligb:F}--{s701:F}--{unknown}"
			},
			"R2": "..." 
		}
	]
}
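A file with this layout can be read with Python's standard library. The sketch below assumes the key names shown in the example (`Summary`, `R1`, `Count`, `Percentage`); `load_statistics` and `top_structures` are hypothetical helper names. Python's `json` module accepts the `NaN` values that appear in the Levenshtein arrays by default.

```python
import gzip
import json


def load_statistics(path):
    """Read a gzipped JSON statistics file into a dictionary."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


def top_structures(stats, mate="R1", limit=5):
    """Return (read structure, count, percentage) tuples, most frequent first."""
    summary = stats["Summary"][mate]
    ranked = sorted(summary.items(), key=lambda item: item[1]["Count"], reverse=True)
    return [(text, entry["Count"], entry["Percentage"]) for text, entry in ranked[:limit]]
```

For example, `top_structures(load_statistics("statistics.json.gz"), mate="R2")` would list the most frequent R2 read structures of a paired-end run.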

Release history

2022.8.30 (this version): Aug 30, 2022
2022.8.8.1: Aug 7, 2022
2022.8.8: Aug 7, 2022

Download files

Source Distributions

No source distribution files are available for this release.

Built Distribution

fastcontext-2022.8.30-py3-none-any.whl (26.1 kB)

Uploaded Aug 30, 2022 Python 3

File details

Details for the file fastcontext-2022.8.30-py3-none-any.whl.

File metadata

Download URL: fastcontext-2022.8.30-py3-none-any.whl
Upload date: Aug 30, 2022
Size: 26.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.1 CPython/3.10.4

File hashes

Hashes for fastcontext-2022.8.30-py3-none-any.whl
Algorithm	Hash digest
SHA256	`915f70f3427ee4493f90b3df64b922b2686b09bc69b4567bda89aa9328970168`
MD5	`1e97acde2771f23c663267638b6bcb1c`
BLAKE2b-256	`1e4bfd08ae87c1f431a5fdaf86a22ed639a6cfc8454de17574109dd3a87faf4c`
