vFind

A simple variant finder for NGS data.

  1. Introduction
  2. Installation
  3. Examples
  4. Contributing
  5. License

Introduction

vFind is not a traditional variant caller. It uses a simpler algorithm that is usually sufficient for screening experiments. Its main use case is recovering variants from a library in which constant adapter sequences flank a variable region.

This simple algorithm is summarized as:

  1. Define a pair of adapter sequences that flank the variable region.
  2. For each fastq read, search for exact matches of these adapters.
  3. If both adapters are found exactly, recover the variable region.
  4. For each adapter without an exact match, perform semi-global alignment between that adapter and the read (optional; see the alignment parameters section).
  5. If the alignment score meets a set threshold, that adapter is considered to match.
  6. If both adapters are exactly or partially matched, recover the variable region. Otherwise, continue to the next read.
  7. Finally, translate the variable region to its amino acid sequence and filter out any sequences with partial codons (optional; see the miscellaneous section).
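The exact-match path of the steps above can be sketched in plain Python. This is only an illustration and not vFind's implementation: the helper names (`extract_variable_region`, `translate`, `count_variants`) are hypothetical, reads are given as plain strings rather than parsed from a fastq file, and the alignment fallback for partial adapter matches is left out.

```python
from collections import Counter
from itertools import product

# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def extract_variable_region(read, prefix, suffix):
    """Return the bases between exact matches of both adapters, or None."""
    start = read.find(prefix)
    if start == -1:
        return None
    start += len(prefix)
    end = read.find(suffix, start)  # the suffix must come after the prefix
    if end == -1:
        return None
    return read[start:end]


def translate(dna):
    """Translate DNA to amino acids; None for empty input or partial/unknown codons."""
    if not dna or len(dna) % 3 != 0:
        return None
    try:
        return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))
    except KeyError:  # e.g. an N base call
        return None


def count_variants(reads, adapters):
    """Count translated variable regions across reads (exact adapter matches only)."""
    prefix, suffix = adapters
    counts = Counter()
    for read in reads:
        region = extract_variable_region(read, prefix, suffix)
        if region is None:
            continue  # at least one adapter is missing: go to the next read
        protein = translate(region)
        if protein is not None:
            counts[protein] += 1
    return counts


reads = [
    "TTGGGATGAAATGACCCAA",  # both adapters found, variable region ATGAAATGA
    "TTGGGATGAAATGACCCAA",  # same variant again
    "TTGGGATGAACCCAA",      # variable region ATGAA ends in a partial codon
    "TTTTTTTTTTTT",         # no adapters
]
print(count_variants(reads, ("GGG", "CCC")))
```

Only the first two reads contribute a variant here; the third is dropped by the partial-codon filter and the fourth has no adapters.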

[!WARNING] vFind does not do any preprocessing. For initial quality filtering, merging, and other common preprocessing operations, consider a tool such as fastp or ngmerge. We generally recommend using fastp for quality filtering and merging fastq files before using vFind.

Installation details and usage examples are given below. For more usage details, please see the API reference.

Installation

vFind is a Python package and can be installed via pip or nix. For a CLI version, see the vFind-cli repository.

PyPI (Recommended for most)

The package is available on PyPI and can be installed via pip (or alternatives like uv).

Below is an example using pip with Python3 in a new project.

# create and activate a new virtual env (skip if you already have one)
python3 -m venv .venv
source .venv/bin/activate

# install vfind
python3 -m pip install vfind

Nix

vFind is also available on Nixpkgs. You can declare new environments using nix flakes.

For something quick, you can use nix-shell. For example, the following will create a new shell with Python 3.11, vFind, and polars installed.

nix-shell -p python311 python3Packages.vfind python3Packages.polars

Examples

Basic Usage

from vfind import find_variants
import polars as pl # variants are returned in a polars dataframe

adapters = ("GGG", "CCC") # define the adapters
fq_path = "./path/to/your/fastq/file.fq.gz" # path to fq file

variants = find_variants(fq_path, adapters)

# print the number of unique sequences 
print(variants.n_unique())

find_variants returns a polars dataframe with sequence and count columns. The sequence column contains the amino acid sequence of each variable region, and the count column contains the number of reads in which that variant was found.

We can then use dataframe methods to further analyze the recovered variants. Some examples are shown below.

# Get the top 5 most frequent variants
variants = variants.sort("count", descending=True) # sort returns a new dataframe, so reassign it
print(variants.head(5)) # print the first 5 (most frequent) variants

# keep sequences with at least 10 reads
# and drop any with a premature stop codon (i.e., * before the last residue)
# uses the polars import (pl) from the basic usage example

filtered_variants = variants.filter(
    pl.col("count") >= 10,
    # "\*." matches a literal * followed by at least one more residue
    ~pl.col("sequence").str.contains(r"\*."),
)

# write the filtered variants to a csv file
filtered_variants.write_csv("filtered_variants.csv")

Using Custom Alignment Parameters

By default, vFind uses semi-global alignment with the following parameters:

  • match score = 3
  • mismatch score = -2
  • gap open penalty = 5
  • gap extend penalty = 2

Note that the gap penalties are represented as positive integers. This is largely due to how the underlying alignment library works.
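To make the role of these four parameters concrete, here is a minimal sketch of semi-global alignment scoring with affine gap penalties. It is not vFind's implementation or that of its underlying alignment library. The function name `semi_global_score` is hypothetical, and the gap cost convention (a gap of length k costs gap_open_penalty + (k - 1) * gap_extend_penalty) is an assumption that may differ from what vFind actually uses.

```python
def semi_global_score(adapter, read, match_score=3, mismatch_score=-2,
                      gap_open_penalty=5, gap_extend_penalty=2):
    """Best score for aligning the whole adapter against any stretch of the read.

    Unaligned read bases before and after the adapter are free; every adapter
    base must be aligned to a read base or to a gap.
    """
    neg_inf = float("-inf")
    n = len(read)
    # Row 0: the alignment may start at any read position at no cost.
    prev_h = [0] * (n + 1)
    prev_f = [neg_inf] * (n + 1)
    for i in range(1, len(adapter) + 1):
        # Column 0: the first i adapter bases sit opposite a single gap.
        gap_cost = gap_open_penalty + (i - 1) * gap_extend_penalty
        cur_h = [-gap_cost] + [neg_inf] * n
        cur_f = [-gap_cost] + [neg_inf] * n
        e = neg_inf
        for j in range(1, n + 1):
            # e: best alignment ending with a read base opposite a gap
            e = max(cur_h[j - 1] - gap_open_penalty, e - gap_extend_penalty)
            # f: best alignment ending with an adapter base opposite a gap
            cur_f[j] = max(prev_h[j] - gap_open_penalty,
                           prev_f[j] - gap_extend_penalty)
            # diagonal move: adapter base opposite read base
            sub = match_score if adapter[i - 1] == read[j - 1] else mismatch_score
            cur_h[j] = max(prev_h[j - 1] + sub, e, cur_f[j])
        prev_h, prev_f = cur_h, cur_f
    # The alignment may end at any read position at no cost.
    return max(prev_h)


print(semi_global_score("ACGT", "TTACGTTT"))  # adapter present exactly
print(semi_global_score("ACGT", "TTACCTTT"))  # adapter present with one mismatch
```

With the default parameters, an exact occurrence of a 4-base adapter scores 4 × 3 = 12, and a single mismatch lowers that to 3 × 3 - 2 = 7.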

To adjust these alignment parameters, use the match_score, mismatch_score, gap_open_penalty, and gap_extend_penalty keyword arguments:

from vfind import find_variants

# ... define adapters and fq_path

# use identity scoring with no gap penalties for alignments
variants = find_variants(
    fq_path,
    adapters,
    match_score=1,
    mismatch_score=-1,
    gap_open_penalty=0,
    gap_extend_penalty=0,
)

An alignment is accepted if its score exceeds a set threshold. The thresholds can be adjusted with the accept_prefix_alignment and accept_suffix_alignment arguments. By default, both are set to 0.75.

Each threshold is expressed as a fraction of the maximum alignment score. A value of 0.75 means that alignments scoring more than 75% of the maximum theoretical score are accepted. Valid values are therefore between 0 and 1.
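As a worked example of the threshold arithmetic, the sketch below assumes that the maximum theoretical score is the match score multiplied by the adapter length, which the text above does not state explicitly. The helper `is_accepted` is hypothetical and not part of vFind.

```python
def is_accepted(score, adapter, match_score=3, accept=0.75):
    """Check an alignment score against a fractional acceptance threshold."""
    # Assumption: the best possible score is every adapter base matching.
    max_score = match_score * len(adapter)
    # Strictly greater than the threshold, following the description above.
    return score > accept * max_score


adapter = "ACGTACGTACGT"  # 12 bases, so the assumed maximum score is 36
mismatch_score = -2

for n_mismatches in range(4):
    # Ungapped alignment with n_mismatches mismatched bases
    score = 3 * (len(adapter) - n_mismatches) + mismatch_score * n_mismatches
    print(n_mismatches, score, is_accepted(score, adapter))
```

Under these assumptions the cutoff for a 12-base adapter is 0.75 × 36 = 27, so an alignment with one mismatch (score 31) is accepted, while one with two mismatches (score 26) is not.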

Either an exact match or partial match (accepted alignment) must be made for both adapter sequences to recover a variant. In order to skip alignment and only look for exact matches, set the skip_alignment argument to True.

Miscellaneous

Q: I don't need the amino acid sequence. Can I just get the DNA sequence?

A: Yes. Just set skip_translation to True.

# ...
dna_seqs = find_variants(fq_path, adapters, skip_translation=True)

Q: I don't want to use polars. Can I use pandas instead?

A: Yes. Use the to_pandas method on the dataframe.


Q: I have a lot of data and find_variants is slow. Is there anything I can do to speed it up?

A: Maybe. Try changing the number of threads or queue length the function uses.

# ...
variants = find_variants(fq_path, adapters, n_threads=6, queue_len=4)

For more usage details, see the API reference.

Contributing

Feedback is a gift and contributions are more than welcome. Please submit an issue or pull request for any bugs, suggestions, or feedback. Please see the developing guide for more details on how to work on vFind.

License

vFind is licensed under the MIT license.
