vFind
A simple variant finder for NGS data.
Introduction
vFind is not a traditional variant caller. It uses a simpler algorithm that is usually sufficient for screening experiments. The main use case is recovering variants from a library in which constant adapter sequences flank a variable region.
The algorithm can be summarized as follows:
- Define a pair of adapter sequences that flank the variable region.
- For each fastq read, search for exact matches of these adapters.
- If both adapters are found exactly, recover the variable region.
- For each adapter without an exact match, perform semi-global alignment between that adapter and the read (optional; see the alignment parameters section).
- If the alignment score meets a set threshold, that adapter is considered to match.
- If both adapters are matched, either exactly or partially, recover the variable region. Otherwise, continue to the next read.
- Finally, translate the variable region to its amino acid sequence and filter out any sequences with partial codons (optional; see the miscellaneous section).
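The exact-match path of these steps can be sketched in plain Python. This is only an illustration of the idea, not vFind's implementation: the helper names (`extract_variable_region`, `translate`, `count_variants`) are hypothetical, the alignment fallback is left out, and marking unrecognized codons as `X` is an assumption.

```python
from collections import Counter

# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def extract_variable_region(read, prefix, suffix):
    """Return the bases between exact matches of both adapters, else None."""
    start = read.find(prefix)
    if start == -1:
        return None
    start += len(prefix)
    end = read.find(suffix, start)
    if end == -1:
        return None
    return read[start:end]


def translate(dna):
    """Translate DNA to amino acids; return None if there is a partial codon."""
    if len(dna) % 3 != 0:
        return None
    # Assumption: codons with unexpected characters (e.g. N) become "X".
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna), 3))


def count_variants(reads, adapters):
    """Count translated variable regions across an iterable of read sequences."""
    prefix, suffix = adapters
    counts = Counter()
    for read in reads:
        region = extract_variable_region(read, prefix, suffix)
        if region is None:
            continue  # at least one adapter was not found exactly
        protein = translate(region)
        if protein is not None:
            counts[protein] += 1
    return counts


reads = [
    "AAGGGATGAAACCCTT",  # variable region ATGAAA
    "GGGATGAAACCC",      # variable region ATGAAA
    "GGGATGAACCC",       # variable region ATGAA, partial codon, dropped
    "TTTTATGAAATTTT",    # no adapters, skipped
    "GGGTGGTAACCC",      # variable region TGGTAA
]
print(count_variants(reads, ("GGG", "CCC")))
```

Reads where either adapter is missing are skipped here; in vFind they would first go through the optional alignment step described above.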
[!WARNING] vFind does not do any preprocessing. For initial quality filtering, merging, and other common preprocessing operations, consider a tool such as fastp or ngmerge. We generally recommend using fastp to quality filter and merge fastq files before running vFind.
Installation details and usage examples are given below. For more usage details, please see the API reference.
Installation
vFind is a Python package and can be installed via pip or nix. For a CLI version, see the vFind-cli repository.
PyPI (Recommended for most)
The package is available on PyPI and can be installed via pip (or alternatives like uv).
Below is an example using pip with Python3 in a new project.
python3 -m venv .venv # create a new virtual env if you haven't already
source .venv/bin/activate # activate the virtual env
python3 -m pip install vfind # install vfind
Nix
vFind is also available on NixPkgs. You can declare new environments using nix flakes.
For something quick, you can use nix-shell. For example, the following will create a new shell with Python 3.11, vFind, and polars installed.
nix-shell -p python311 python3Packages.vfind python3Packages.polars
Examples
Basic Usage
from vfind import find_variants
import polars as pl # variants are returned in a polars dataframe
adapters = ("GGG", "CCC") # define the adapters
fq_path = "./path/to/your/fastq/file.fq.gz" # path to fq file
variants = find_variants(fq_path, adapters)
# print the number of unique sequences
print(variants.n_unique())
`find_variants` returns a polars dataframe with `sequence` and `count` columns. `sequence` contains the amino acid sequence of each variable region and `count` contains how many times that variant was observed.
We can then use dataframe methods to further analyze the recovered variants. Some examples are shown below.
# Get the top 5 most frequent variants
variants = variants.sort("count", descending=True) # sort returns a new dataframe, so reassign it
print(variants.head(5)) # print the first 5 (most frequent) variants
# keep only sequences with more than 10 reads
# and drop any sequences that have a premature stop codon (i.e., * before the last residue)
filtered_variants = variants.filter(
    pl.col("count") > 10,
    ~pl.col("sequence").str.contains(r"\*."), # matches a * followed by at least one more residue
)
# write the filtered variants to a csv file
filtered_variants.write_csv("filtered_variants.csv")
Using Custom Alignment Parameters
By default, vFind uses semi-global alignment with the following parameters:
- match score = 3
- mismatch score = -2
- gap open penalty = 5
- gap extend penalty = 2
Note that the gap penalties are represented as positive integers. This is largely due to how the underlying alignment library works.
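To make the role of each parameter concrete, here is a minimal sketch of semi-global alignment scoring with affine gaps in plain Python. It is not the code vFind runs. It assumes the adapter must align end to end while unaligned read bases on either side are free, and that a gap of length k costs `gap_open + gap_extend * (k - 1)`; the underlying library may define both differently. The function name `semi_global_score` is hypothetical.

```python
def semi_global_score(adapter, read, match=3, mismatch=-2, gap_open=5, gap_extend=2):
    """Best score for aligning the whole adapter anywhere inside the read."""
    neg_inf = float("-inf")
    n, m = len(adapter), len(read)
    # Row 0: no adapter bases consumed yet, so leading read bases cost nothing.
    h_prev = [0] * (m + 1)
    f_prev = [neg_inf] * (m + 1)  # best score ending with a gap in the read
    for i in range(1, n + 1):
        h_cur = [0] * (m + 1)
        f_cur = [neg_inf] * (m + 1)
        # Column 0: the first i adapter bases are aligned against a single gap.
        h_cur[0] = -(gap_open + gap_extend * (i - 1))
        f_cur[0] = h_cur[0]
        e = neg_inf  # best score ending with a gap in the adapter
        for j in range(1, m + 1):
            e = max(h_cur[j - 1] - gap_open, e - gap_extend)
            f_cur[j] = max(h_prev[j] - gap_open, f_prev[j] - gap_extend)
            sub = match if adapter[i - 1] == read[j - 1] else mismatch
            h_cur[j] = max(h_prev[j - 1] + sub, e, f_cur[j])
        h_prev, f_prev = h_cur, f_cur
    # Trailing read bases are free, so take the best cell in the last row.
    return max(h_prev)


print(semi_global_score("GATTACA", "TTTGATTACATTT"))  # exact match: 7 * 3 = 21
print(semi_global_score("GATTACA", "TTTGATCACATTT"))  # one mismatch: 6 * 3 - 2 = 16
```

With the default scores, a single mismatch costs 5 points relative to a perfect match (3 lost, 2 subtracted), which is the same as opening a one-base gap.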
To adjust these alignment parameters, use the `match_score`, `mismatch_score`, `gap_open_penalty`, and `gap_extend_penalty` keyword arguments:
from vfind import find_variants
# ... define adapters and fq_path
# use identity scoring with no gap penalties for alignments
variants = find_variants(
    fq_path,
    adapters,
    match_score=1,
    mismatch_score=-1,
    gap_open_penalty=0,
    gap_extend_penalty=0,
)
Alignments are accepted if they produce a score above a set threshold. The threshold for each adapter can be adjusted with the `accept_prefix_alignment` and `accept_suffix_alignment` arguments. By default, both thresholds are set to 0.75.
The thresholds represent a fraction of the maximum alignment score. So, a value of 0.75 means that alignments scoring above 75% of the maximum theoretical score are accepted. Thus, valid values are between 0 and 1.
Either an exact match or partial match (accepted alignment) must be made for both adapter sequences to recover a variant.
To skip alignment and only look for exact matches, set the `skip_alignment` argument to `True`.
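As a worked example of the threshold arithmetic, the sketch below assumes the maximum theoretical score is the adapter length times the match score (every base matching, no gaps). That definition, along with the helper names `max_alignment_score` and `is_accepted`, is an assumption made for illustration and not taken from vFind's source.

```python
def max_alignment_score(adapter, match_score=3):
    """Assumed maximum: every adapter base matches and there are no gaps."""
    return len(adapter) * match_score


def is_accepted(score, adapter, match_score=3, threshold=0.75):
    """Accept an alignment whose score exceeds the given fraction of the maximum."""
    return score > threshold * max_alignment_score(adapter, match_score)


adapter = "GATTACAGATTACAGATTAC"  # a made-up 20 nt adapter
print(max_alignment_score(adapter))  # 20 * 3 = 60, so the cutoff is 0.75 * 60 = 45
print(is_accepted(55, adapter))      # one mismatch: 19 * 3 - 2 = 55, above the cutoff
print(is_accepted(45, adapter))      # exactly at the cutoff, not above it
```

Under these assumptions, a 20 nt adapter with the default scores tolerates up to two mismatches (score 50) but not three (score 45).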
Miscellaneous
Q: I don't need the amino acid sequence. Can I just get the DNA sequence?
A: Yes. Just set `skip_translation` to `True`.
# ...
dna_seqs = find_variants(fq_path, adapters, skip_translation=True)
Q: I don't want to use polars. Can I use pandas instead?
A: Yes. Use the `to_pandas` method on the dataframe.
Q: I have a lot of data and `find_variants` is slow. Is there anything I can do to speed it up?
A: Maybe. Try changing the number of threads or queue length the function uses.
# ...
variants = find_variants(fq_path, adapters, n_threads=6, queue_len=4)
For more usage details, see the API reference.
Contributing
Feedback is a gift and contributions are more than welcome. Please submit an issue or pull request for any bugs, suggestions, or feedback. Please see the developing guide for more details on how to work on vFind.
License
vFind is licensed under the MIT license.