A command-line tool to mitigate homology-based data leakage in sequence-to-expression models
Project description
Overview
Neural networks have emerged as powerful tools to understand the functional relationship between genomic sequences and various biological processes. However, current practices of training and evaluating models on genomic sequences may fail to account for the widespread homology that permeates the genome. Homology spanning train-test data splits can result in data leakage, potentially leading to overestimation of model performance and a reduction in model reliability and generalizability.
hashFrag is a scalable command-line tool that helps users address homology-based data leakage during model development. The general workflow involves identifying “candidate” pairs of highly similar sequences with BLAST, filtering these candidates based on a specified similarity threshold, and then using the resulting homology information to mitigate data leakage in existing or newly created splits.
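The workflow above can be illustrated with a minimal Python sketch. This is not hashFrag's implementation or command-line interface: the function names (`filter_candidates`, `homology_groups`, `leaked_pairs`, `create_split`) are hypothetical, the similarity scores are toy values standing in for BLAST-derived scores, and grouping sequences by connected components is just one plausible way to use homology information when building a split.

```python
from collections import defaultdict


def filter_candidates(candidates, threshold):
    """Keep only candidate pairs whose similarity score meets the threshold."""
    return [(a, b) for a, b, score in candidates if score >= threshold]


def homology_groups(seq_ids, pairs):
    """Group sequences into connected components of the homology graph (union-find)."""
    parent = {s: s for s in seq_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(list)
    for s in seq_ids:
        groups[find(s)].append(s)
    return list(groups.values())


def leaked_pairs(train, test, pairs):
    """Return homologous pairs with one member in train and the other in test."""
    train, test = set(train), set(test)
    return [(a, b) for a, b in pairs
            if (a in train and b in test) or (a in test and b in train)]


def create_split(groups, n_test):
    """Assign whole groups to the test set until n_test is reached; the rest go to train."""
    train, test = [], []
    for group in sorted(groups, key=len, reverse=True):
        if len(test) + len(group) <= n_test:
            test.extend(group)
        else:
            train.extend(group)
    return train, test


# Toy data (assumed): sequence IDs and (id_a, id_b, similarity_score) candidates.
seq_ids = ["s1", "s2", "s3", "s4", "s5", "s6"]
candidates = [("s1", "s2", 95.0), ("s2", "s3", 80.0), ("s4", "s5", 30.0)]

pairs = filter_candidates(candidates, threshold=60.0)

# Check an existing split for homology spanning train and test.
print(leaked_pairs(["s1", "s4", "s6"], ["s2", "s3", "s5"], pairs))

# Build a new split in which homologous sequences always stay together.
train, test = create_split(homology_groups(seq_ids, pairs), n_test=2)
print(train, test, leaked_pairs(train, test, pairs))
```

Because whole groups are assigned to one side of the split, no pair that passed the threshold can end up with one member in train and the other in test.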
Documentation
Full documentation is available on Read the Docs.
Paper
For more details, see our bioRxiv preprint, "Detecting and avoiding homology-based data leakage in genome-trained sequence models".
File details
Details for the file hashfrag-1.0.2.tar.gz.
File metadata
- Download URL: hashfrag-1.0.2.tar.gz
- Size: 18.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.16
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 11b85df62115e413c07d90a35fe9d590af3afa64f1e874dce14091c359ec8764 |
| MD5 | c37d1bdc34e86fd70511f65975a683b8 |
| BLAKE2b-256 | 5ba080e8b9574f745e063572c4840fd7e3faf4ce9f1605164d9d982f19f73b1e |
File details
Details for the file hashfrag-1.0.2-py3-none-any.whl.
File metadata
- Download URL: hashfrag-1.0.2-py3-none-any.whl
- Size: 24.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.16
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 601739997b2b28adf5ba5936976f8f8c5958cff5d68cb2722219377cfbbb43ad |
| MD5 | f08a86042460560671301e850334505f |
| BLAKE2b-256 | d4fd5e40bab724a3b91e70203bcc0db9b498c8f0a5c4d44c8cd0c3f364f11c1a |