Simple variant finding utilities for NGS data

vFind

A simple variant finder for NGS data.

  1. Introduction
  2. Installation
  3. Examples
  4. Contributing
  5. License

Introduction

vFind is not a traditional variant caller. It uses a simpler algorithm that is usually sufficient for screening experiments. The main use case is recovering variants from a library in which constant adapter sequences flank a variable region.

This simple algorithm is summarized as:

  1. Define a pair of adapter sequences that flank the variable region.
  2. For each fastq read, search for exact matches of these adapters.
  3. If both adapters are found exactly, recover the variable region.
  4. For each adapter without an exact match, perform semi-global alignment between that adapter and the read (see the alignment parameters section for more details).
  5. If the alignment score meets a set threshold, that adapter is considered to match.
  6. If both adapters are matched, either exactly or by an accepted alignment, recover the variable region. Otherwise, continue to the next read.
  7. Finally, translate the variable region to its amino acid sequence and filter out any sequences with partial codons (see the miscellaneous section for more details).
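
The exact-match path of the steps above can be sketched in plain Python. This is an illustration only, not vFind's implementation (which is written in Rust): the helper names recover_variable_region, translate, and count_variants are hypothetical, the choice of the first adapter hit in each read is an assumption, and the alignment fallback is left out.

```python
from collections import Counter
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def recover_variable_region(read, prefix, suffix):
    """Return the sequence between exact matches of both adapters, or None."""
    start = read.find(prefix)
    if start == -1:
        return None
    start += len(prefix)
    # Assumption: use the first suffix hit that follows the prefix.
    end = read.find(suffix, start)
    if end == -1:
        return None
    return read[start:end]


def translate(dna):
    """Translate DNA to amino acids; return None for partial or unknown codons."""
    if len(dna) % 3 != 0:
        return None
    codons = (dna[i:i + 3] for i in range(0, len(dna), 3))
    residues = [CODON_TABLE.get(codon) for codon in codons]
    return None if None in residues else "".join(residues)


def count_variants(reads, adapters):
    """Count translated variable regions across reads (exact matches only)."""
    prefix, suffix = adapters
    counts = Counter()
    for read in reads:
        region = recover_variable_region(read, prefix, suffix)
        if region is None:
            continue  # a full implementation would try alignment here
        protein = translate(region)
        if protein is not None:
            counts[protein] += 1
    return counts


reads = [
    "AAGGGATGAAATGACCCTT",  # ATG AAA TGA between the adapters
    "GGGATGAAATGACCC",      # same variable region, no extra flanks
    "TTGGGATGAACCCAA",      # 5 nt variable region: partial codon, dropped
    "ACGTACGT",             # no adapters: skipped
]
print(count_variants(reads, ("GGG", "CCC")))
```

Only the first two reads contribute, so the counter holds a single variant, MK*, with a count of 2.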

[!WARNING] Note that vFind does not perform any preprocessing. For initial quality filtering, read merging, and other common preprocessing operations, consider a tool such as fastp or NGmerge. We generally recommend using fastp to quality-filter and merge fastq files before running vFind.

Installation details and usage examples are given below. For more usage details, please see the API reference.

Installation

The package is available on PyPI and can be installed via pip (or alternatives like uv).

PyPI (Recommended)

Below is an example using uv to initialize a project and add vfind as a dependency.

uv init
uv add vfind

Alternatively, install with pip after creating and activating a new virtual environment:

python3 -m venv .venv
source .venv/bin/activate

python3 -m pip install vfind

Source

vFind is developed using PyO3 and Rust. You will need a Rust toolchain as well as standard C tooling to build some dependencies (e.g., the parasail-rs crate).

  1. Clone the repository
git clone https://github.com/nsbuitrago/vfind.git
cd vfind
  2. Inside the vfind directory, sync dependencies with uv and build vfind
uv sync

# this will build and install vfind in the virtual env
uv run maturin develop --uv # or `make dev`

Examples

Basic Usage

from vfind import find_variants
import polars as pl # variants are returned in a polars dataframe

adapters = ("GGG", "CCC") # define the adapters
fq_path = "./path/to/your/fastq/file.fq.gz" # path to fq file

variants = find_variants(fq_path, adapters)

# print the number of unique sequences
print(variants.n_unique())

find_variants returns a polars dataframe with sequence and count columns. sequence contains the amino acid sequence of each variable region and count contains the number of times that variant was observed.

We can then use dataframe methods to further analyze the recovered variants. Some examples are shown below.

# Get the top 5 most frequent variants
# sort returns a new dataframe, so reassign the result
variants = variants.sort("count", descending=True)
print(variants.head(5)) # print the first 5 (most frequent) variants

# keep sequences with at least 10 reads and drop any sequence that has
# a premature stop codon (i.e., * before the last residue)

filtered_variants = variants.filter(
    pl.col("count") >= 10,
    # regex: a literal * followed by at least one more character
    ~pl.col("sequence").str.contains(r"\*."),
)

# write the filtered variants to a csv file
filtered_variants.write_csv("filtered_variants.csv")

Using Custom Alignment Parameters

By default, vFind uses semi-global alignment with the following parameters:

  • match score = 3
  • mismatch score = -2
  • gap open penalty = 5
  • gap extend penalty = 2

Note that the gap penalties are represented as positive integers. This is largely due to how the underlying alignment library works.
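
To make the role of these four parameters concrete, here is a small pure-Python sketch of semi-global scoring with affine gaps. It is not how vFind computes alignments (that is delegated to the underlying alignment library). The function name semi_global_score is hypothetical, and the sketch assumes that a gap of length k costs gap_open + (k - 1) * gap_extend and that the whole adapter must be aligned while the ends of the read are free.

```python
NEG_INF = float("-inf")


def semi_global_score(adapter, read, match=3, mismatch=-2, gap_open=5, gap_extend=2):
    """Best score for aligning all of `adapter` against any stretch of `read`."""
    n = len(read)
    # Row 0: the alignment may begin anywhere in the read at no cost.
    h_prev = [0] * (n + 1)
    f_prev = [NEG_INF] * (n + 1)  # best score ending with an adapter base opposite a gap
    for i in range(1, len(adapter) + 1):
        h_cur = [0] * (n + 1)
        f_cur = [NEG_INF] * (n + 1)
        # Column 0: adapter bases aligned before the read starts are gaps.
        f_cur[0] = max(h_prev[0] - gap_open, f_prev[0] - gap_extend)
        h_cur[0] = f_cur[0]
        e = NEG_INF  # best score ending with a read base opposite a gap
        for j in range(1, n + 1):
            e = max(h_cur[j - 1] - gap_open, e - gap_extend)
            f_cur[j] = max(h_prev[j] - gap_open, f_prev[j] - gap_extend)
            pair = match if adapter[i - 1] == read[j - 1] else mismatch
            h_cur[j] = max(h_prev[j - 1] + pair, e, f_cur[j])
        h_prev, f_prev = h_cur, f_cur
    # The alignment may end anywhere in the read at no cost.
    return max(h_prev)


print(semi_global_score("ACGT", "TTACGTTT"))  # adapter present exactly
print(semi_global_score("ACGT", "TTAGGTTT"))  # one substitution in the adapter
```

With the default scores, the exact hit scores 12 (four matches at 3 each) and the read with one substitution scores 7 (three matches and one mismatch).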

To adjust these alignment parameters, use the match_score, mismatch_score, gap_open_penalty, and gap_extend_penalty keyword arguments:

from vfind import find_variants

# ... define adapters and fq_path

# use identity scoring with no gap penalties for alignments
variants = find_variants(
    fq_path,
    adapters,
    match_score=1,
    mismatch_score=-1,
    gap_open_penalty=0,
    gap_extend_penalty=0,
)

Alignments are accepted if they produce a score above a set threshold. The acceptance threshold for each adapter can be adjusted with the accept_prefix_alignment and accept_suffix_alignment arguments. By default, both thresholds are set to 0.75.

The thresholds represent a fraction of the maximum alignment score. So, a value of 0.75 means that alignments producing scores greater than 75% of the maximum theoretical score will be accepted. Valid values are therefore between 0 and 1.
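
As a worked example of what a 0.75 threshold means in practice, the sketch below counts how many substitutions a hypothetical 20 nt adapter could tolerate under the default scores. It assumes the maximum theoretical score is match_score times the adapter length and uses a strict "greater than" comparison, following the wording above. Both are assumptions about the implementation, and the helper names are hypothetical.

```python
def max_alignment_score(adapter, match_score=3):
    # Assumption: the best possible score is a match at every adapter position.
    return match_score * len(adapter)


def is_accepted(score, adapter, threshold=0.75, match_score=3):
    # Strict ">" follows the "greater than 75%" wording; the real check may use ">=".
    return score > threshold * max_alignment_score(adapter, match_score)


adapter = "ACGTACGTACGTACGTACGT"  # hypothetical 20 nt adapter, maximum score 60
for n_mismatches in range(5):
    # Each substitution turns a +3 match into a -2 mismatch: a loss of 5.
    score = max_alignment_score(adapter) - 5 * n_mismatches
    print(n_mismatches, score, is_accepted(score, adapter))
```

Under these assumptions the cutoff is 45, so alignments with up to two substitutions (scores 60, 55, and 50) are accepted, while three substitutions (score 45) fall exactly on the cutoff and are rejected by the strict comparison.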

Either an exact match or a partial match (accepted alignment) must be made for both adapter sequences to recover a variant. To skip alignment and only look for exact matches, set the alignment thresholds to 1 (e.g., accept_suffix_alignment = 1 to only allow perfect matches of the suffix adapter).

Miscellaneous

Q: I don't need the amino acid sequence. Can I just get the DNA sequence?

A: Yes. Just set skip_translation to True.

# ...
dna_seqs = find_variants(fq_path, adapters, skip_translation=True)

Q: I don't want to use polars. Can I use pandas instead?

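A: Yes. Call the to_pandas method on the returned dataframe (e.g., variants.to_pandas()). Note that this polars method requires both pandas and pyarrow to be installed.
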
Q: I have a lot of data and find_variants is slow. Is there anything I can do to speed it up?

A: Maybe. Try changing the number of threads or queue length the function uses.

# ...
variants = find_variants(fq_path, adapters, n_threads=6, queue_len=4)
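
For intuition about what these two settings usually control, here is a generic worker-pool sketch using only the Python standard library. It is not vFind's internals, which are implemented in Rust and may differ. The function count_reads_parallel, the batch_size parameter, and the "GGG" check are hypothetical stand-ins: n_threads workers pull batches of reads from a bounded queue, and queue_len caps how many batches may wait at once.

```python
import queue
import threading


def count_reads_parallel(reads, n_threads=6, queue_len=4, batch_size=2):
    """Count reads containing "GGG" using a bounded queue and worker threads."""
    work = queue.Queue(maxsize=queue_len)  # the producer blocks when the queue is full
    totals = []
    lock = threading.Lock()

    def worker():
        while True:
            batch = work.get()
            if batch is None:  # sentinel: no more work
                break
            hits = sum(1 for read in batch if "GGG" in read)
            with lock:
                totals.append(hits)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for thread in threads:
        thread.start()
    for i in range(0, len(reads), batch_size):
        work.put(reads[i:i + batch_size])
    for _ in threads:
        work.put(None)  # one sentinel per worker
    for thread in threads:
        thread.join()
    return sum(totals)


print(count_reads_parallel(["AGGGA", "TTTT", "GGG", "CCC", "AAGGG"]))
```

In a design like this, more threads help while per-read work dominates, and a longer queue mainly smooths out bursts from the reader, so the best values depend on the machine and the data.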

For more usage details, see the API reference.

Contributing

Feedback is a gift and contributions are more than welcome. Please submit an issue or pull request for any bugs, suggestions, or feedback. Please see the developing guide for more details on how to work on vFind.

License

vFind is licensed under the MIT license.
