Skip to main content

A simple variant finder for NGS data

Project description

vFind

A simple variant finder for NGS data.

  1. Introduction
  2. Installation
  3. Examples
  4. Contributing
  5. License

Introduction

vFind is unlike a traditional variant caller. It is actually using a simpler algorithm which is usually sufficient for screening experiments. The main use case is finding variants from a library that has constant adapter sequences flanking a variable region.

The workflow can be described generally by these steps:

  1. define a set of constant adapter sequences that immediately flank a variable region of interest.
  2. for each fastq read, search for exact matches of these adapters.
  3. in case of no exact match for either adapter, perform a semi-global alignment of the adapter sequence on the fastq read.
  4. if the alignment is produces a score that is above a set threshold, recover the variable region. To adjust the thresholds for accepting alignments, see the alignment parameters section

Alignments are accepted if they produce a score above a set threshold. These thresholds correspond to the fraction of the max theoretical alignment score that can be produced from a given adapter sequence and fastq read. Thus, thresholds are values between 0 and 1. By default, thresholds for both adapter sequences are set to 0.75. In some cases, you may want to change these values to be more or less stringent.

[!WARNING] Note that vFind doesn't do any kind of fastq preprocessing. For initial quality filtering, merging, and other common preprocessing operations, you might be interested in something like fastp or ngmerge. We generally recommend using fastp for quality filtering and merging fastq files before using vFind.

Installation

PyPI (Recommended for most)

vFind is available on PyPI and can be installed via pip (or alternatives like uv).

Below is an example using pip with Python3 in a new project.

# create a new virtual env
python3 -m venv .venv # create a new virtual env if haven't already
source .venv/bin/activate # activate the virtual env

python3 -m pip install vfind # install vfind

Nix

vFind is also available as a Python package on NixPkgs. You can declare new development enviroments using nix flakes.

For something quick, you can use nix-shell,

nix-shell -p python311 python3Packages.vfind

Examples

Basic Usage

from vfind import find_variants
import polars as pl # variants are returned in a polars dataframe

adapters = ("GGG", "CCC") # define the Adapters
fq_path = "./path/to/your/fastq/file.fq.gz" # path fo fq file

# now, let's find some variants
variants = find_variants(fq_path, adapters)

# print the number of unique sequences 
print(variants.n_unique())

find_variants returns a polars dataframe with sequence and count columns. sequence contains the amino acid sequence of the variable regions and count contains the frequency of those variant.

We can then use dataframe methods to further analyze the recovered variants. Some examples are shown below.

# Get the top 5 most frequent variants
variants.sort("count", descending=True) # sort by the counts in descending order
print(variants.head(5)) # print the first 5 (most frequent) variants

# filter out sequences with less than 10 read counts
# also any sequences that have a pre-mature stop codon (i.e., * before the last residue)

filtered_variants = variants.filter(
    ~variants["counts"] < 10,
    ~variants["sequence"][::-2].str.contains("*")
)

# write the filtered variants to a csv file
filtered_variants.write_csv("filtered_variants.csv")

Using Custom Alignment Parameters

By default, vFind uses semi-global alignment with the following parameters:

  • match score = 3
  • mismatch score = -2
  • gap open penalty = 5
  • gap extend penalty = 2

Note that the gap penalties are represented as positive integers. This is largely due to the underlying alignment library.

from vfind import find_variants

# ... define adapters and fq_path

# use identity scoring with no gap penalties for alignments
variants = find_variants(
    fq_path,
    adapters,
    match_score = 1,
    mismatch_score = -1,
    gap_open_penalty: 0,
    gap_extend_penalty: 0,
)

Only alignments that produce scores meeting a threshold will be considered accepted. The threshold for considering an acceptable alignment can be adjusted with the accept_prefix_alignment and accept_suffix_alignment arguments. By default, both thresholds are set to 0.75.

The thresholds represent a fraction of the maximum alignment score. So, a value of 0.75 means alignments producing scores that are greater than 75% the maximum theoretical score will be accepted.

Either an exact match or partial match (accepted alignment) must be made for both adapter sequences to recover a variant.

Miscellaneous

Q: I don't need the amino acid sequence. Can I just get the DNA sequence?

A: Yes. Just set skip_translation to True.

# ...
dna_seqs = find_variants(fq_path, adapters, skip_translation=True)

Q: I don't want to use polars. Can I use pandas instead?

A: Yes. Use the to_pandas method on the dataframe.


Q: find_variants is slow. Is there anything I can do to speed it up?

A: Maybe? Try changing the number of threads or queue length the function uses.

# ...
variants = find_variants(fq_path, adapters, n_threads=6, queue_len=4)

Contributing

Contributions are welcome. Please submit an issue or pull request for any bugs, suggestions, or feedback.

Developing

vFind is written in Rust and uses PyO3 and Maturin to generate the Python module. To get started, you will need to have the rust toolchain and Python >= 3.10.

Below are some general steps to get up and running. Note that these examples use [uv]). However, you could do this with standard pip or your preferred method.

  1. clone the repository to your machine
git clone git@github.com:nsbuitrago/vfind.git
cd vfind
  1. Create a new Python virtual environment and install dev dependencies.
uv venv
source .venv/bin/activate

# this will intsall maturin, polars, and any other required packages.
uv pip install -r dev-requirements.txt
  1. Build and install the vFind package in your virtual environment with maturin.
# in the root project directory
maturin develop
  1. From here, you can make changes to the Rust lib and run maturin develop to rebuild the package with those changes.

License

vFind is licensed under the MIT license

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distribution

vfind-0.1.0-cp311-cp311-macosx_11_0_arm64.whl (3.0 MB view hashes)

Uploaded CPython 3.11 macOS 11.0+ ARM64

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page