A command-line tool to mitigate homology-based data leakage in sequence-to-expression models

Overview

Neural networks have emerged as powerful tools to understand the functional relationship between genomic sequences and various biological processes. However, current practices of training and evaluating models on genomic sequences may fail to account for the widespread homology that permeates the genome. Homology spanning train-test data splits can result in data leakage, potentially leading to overestimation of model performance and a reduction in model reliability and generalizability.

hashFrag is a scalable command-line tool that helps users address homology-based data leakage during model development. The general workflow has three steps: identify “candidate” pairs of highly similar sequences with BLAST, filter these candidates by a specified similarity threshold, and use the resulting homology information to mitigate data leakage in existing or newly created splits.
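The workflow above can be illustrated with a minimal, self-contained Python sketch. This is not hashFrag's API or implementation: the function names, the `(id_a, id_b, score)` tuple format, the scores, and the threshold are all hypothetical, and the clustering strategy (connected components over above-threshold pairs) is just one plausible way to use homology information when creating a split.

```python
from collections import defaultdict

# Hypothetical helpers for illustration only; none of these names come from hashFrag.

def filter_pairs(candidates, threshold):
    """Keep candidate pairs whose similarity score meets the threshold."""
    return [(a, b) for a, b, score in candidates if score >= threshold]

def homology_clusters(ids, pairs):
    """Group sequences into connected components of the homology graph (union-find)."""
    parent = {i: i for i in ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for i in ids:
        clusters[find(i)].add(i)
    return list(clusters.values())

def split_by_cluster(clusters, test_fraction):
    """Create a new split: whole clusters go to one side, so no homologous pair spans it."""
    total = sum(len(c) for c in clusters)
    train, test = set(), set()
    for cluster in sorted(clusters, key=len):  # smallest clusters first
        if len(test) + len(cluster) <= test_fraction * total:
            test |= cluster
        else:
            train |= cluster
    return train, test

def prune_existing_split(train, test, pairs):
    """Repair an existing split: drop test sequences that have a homolog in train."""
    leaky = set()
    for a, b in pairs:
        if a in train and b in test:
            leaky.add(b)
        elif b in train and a in test:
            leaky.add(a)
    return test - leaky

# Toy data: the IDs and scores are made up for this example.
ids = ["s1", "s2", "s3", "s4", "s5"]
candidates = [("s1", "s2", 90.0), ("s2", "s3", 85.0), ("s4", "s5", 40.0)]

pairs = filter_pairs(candidates, threshold=60.0)
clusters = homology_clusters(ids, pairs)
new_train, new_test = split_by_cluster(clusters, test_fraction=0.4)
pruned_test = prune_existing_split({"s1", "s4"}, {"s2", "s5"}, pairs)
```

In this toy example, `s1`, `s2`, and `s3` end up in the same cluster and therefore on the same side of the new split, while pruning the existing split removes `s2` from the test set because its homolog `s1` is in the training set.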

Documentation

Full documentation is available on Read the Docs.

Paper

For more details, see our bioRxiv preprint, "Detecting and avoiding homology-based data leakage in genome-trained sequence models".

Download files

Source Distribution

hashfrag-1.0.2.tar.gz (18.4 kB view details)

Uploaded Source

Built Distribution

hashfrag-1.0.2-py3-none-any.whl (24.7 kB view details)

Uploaded Python 3

File details

Details for the file hashfrag-1.0.2.tar.gz.

File metadata

  • Download URL: hashfrag-1.0.2.tar.gz
  • Upload date:
  • Size: 18.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.16

File hashes

Hashes for hashfrag-1.0.2.tar.gz

Algorithm    Hash digest
SHA256       11b85df62115e413c07d90a35fe9d590af3afa64f1e874dce14091c359ec8764
MD5          c37d1bdc34e86fd70511f65975a683b8
BLAKE2b-256  5ba080e8b9574f745e063572c4840fd7e3faf4ce9f1605164d9d982f19f73b1e

File details

Details for the file hashfrag-1.0.2-py3-none-any.whl.

File metadata

  • Download URL: hashfrag-1.0.2-py3-none-any.whl
  • Upload date:
  • Size: 24.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.16

File hashes

Hashes for hashfrag-1.0.2-py3-none-any.whl

Algorithm    Hash digest
SHA256       601739997b2b28adf5ba5936976f8f8c5958cff5d68cb2722219377cfbbb43ad
MD5          f08a86042460560671301e850334505f
BLAKE2b-256  d4fd5e40bab724a3b91e70203bcc0db9b498c8f0a5c4d44c8cd0c3f364f11c1a
