Bioinformatics tool for comparing large sequence files

Database comparator

License: MIT

The Database Comparator is a versatile tool designed for searching, analyzing, and comparing biological sequence databases. It supports various algorithms, including exact matching, sequence alignment, BLAST searches, and Hamming distance calculations, facilitating comprehensive analysis of DNA and protein sequences. The program is highly customizable, allowing users to adjust parameters to suit their specific needs. It also supports multiprocessing, enabling faster processing of large datasets. The Database Comparator is a valuable resource for researchers, bioinformaticians, and anyone working with biological sequence data.

Installation

Use the following command to install the program:

pip install Database-comparator

or clone the repository and install the program manually:

git clone https://github.com/preislet/Database_comparator.git
cd Database_comparator

BLAST needs to be installed manually.
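
Because BLAST is not installed by pip, it can help to confirm that the command-line tools are reachable before running a BLAST search. The following is a minimal sketch, not part of Database Comparator: it assumes the NCBI BLAST+ executables `makeblastdb` and `blastp` are the ones needed, and `find_blast_tools` is a hypothetical helper name.

```python
import shutil

# Executables shipped with NCBI BLAST+. Which of them Database Comparator
# actually invokes is an assumption made for this sketch.
REQUIRED_TOOLS = ("makeblastdb", "blastp")


def find_blast_tools(tools=REQUIRED_TOOLS):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


if __name__ == "__main__":
    for tool, path in find_blast_tools().items():
        print(f"{tool}: {path or 'NOT FOUND, install NCBI BLAST+ first'}")
```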

Docker

A Dockerfile is provided in the repository. To build and run the Docker image, follow these steps:

Step 1:

Run the following command to build the Docker image. Replace <image_name> with a name for your image, and optionally specify a tag (e.g., latest):

docker build -t <image_name>:<tag> .

Step 2:

After the image is successfully built, you can run a container from it:

docker run -e PASSWORD=rstudio --rm -p 8787:8787 <image_name>:<tag>

Step 3:

Open a web browser and go to http://localhost:8787.

  • Log in to RStudio using the default credentials:
    • Username: rstudio
    • Password: rstudio

Configuration file

The configuration file adapts the program to the data the user wants to analyze. It identifies the query database and the databases against which the query is compared. Optionally, it also sets internal parameters for the Smith-Waterman algorithm, BLAST, and the other methods; any parameter that is not specified falls back to its default value. The configuration file can be in .txt or .xlsx format. We highly recommend the .xlsx format because it is more user-friendly.

Configuration file .txt format

The table below describes all available configuration options for the Database Comparator.

| Option Name | Description | Type | Default Value | Example Values |
| --- | --- | --- | --- | --- |
| DB | Defines a database path, sequence column, and result column. | String | None | DB path/to/db.csv seq_col result_col [identifiers] |
| QUERY | Specifies the query file path and sequence column name. | String | None | QUERY path/to/query.csv seq_col |
| SWA_tolerance | Tolerance for Smith-Waterman alignment. | Float | 0.93 | 0.95, 0.9 |
| SWA_gap_score | Gap penalty for Smith-Waterman alignment. | Float | None | -2.0, -3.0 |
| SWA_mismatch_score | Mismatch penalty for Smith-Waterman alignment. | Float | None | -1.0, -2.0 |
| SWA_match_score | Match reward for Smith-Waterman alignment. | Float | None | 2.0, 3.0 |
| SWA_matrix | Substitution matrix for alignment. | String | None | BLOSUM62, PAM250 |
| SWA_mode | Alignment mode (global or local). | String | None | local, global |
| BLAST_e_value | E-value threshold for BLAST searches. | Float | 0.05 | 1e-5, 0.01 |
| BLAST_database_name | Name of the BLAST database. | String | "clip_seq_db" | "Any_name" |
| BLAST_output_name | Name of the BLAST output file. | String | "blastp_output.txt" | "output.txt", "results.tsv" |
| HD_max_distance | Maximum allowed Hamming distance. | Integer | 1 | 2, 5, 10 |
| number_of_processors | Number of CPU cores to use for multiprocessing. | Integer | 1 | 2, 4, 8 |
| separator | Separator for results in the input DataFrame. | String | "\n" | ";", ",", " " |

Notes:

  • The DB and QUERY parameters are required in the configuration file.
  • Some parameters (like SWA_tolerance, SWA_match_score, etc.) are specific to Smith-Waterman alignment.
  • The BLAST_* parameters configure BLAST sequence searches.
  • HD_max_distance is used for Hamming distance calculations.
  • separator determines how multiple results are stored in the output file.

Example of configuration file:

# Databases
QUERY HEDIMED__230620_Hedimed_1_22_basic--table_EF_predelana.xlsx part3

DB Databases/Nakayama.csv CDR3b [Clone/SequenceID, Epitope]
DB Databases/McPAS-TCR-filtred.csv CDR3.beta.aa [PubMed.ID, Pathology, Additional.study.details]
DB Databases/vdjdb.csv cdr3 [antigen.gene, antigen.species, mhc.a, gene]
DB Databases/TCRdb_all_sequnces.csv AASeq [TCRDB_project_ID, RunId, cloneFraction]

# Smith–Waterman algorithm
SWA_tolerance 0.9
SWA_gap_score -1000
SWA_mismatch_score 0
SWA_match_score 1

# Blastp Algorithm
BLAST_e_value 0.05
BLAST_database_name clip_seq_db
BLAST_output_name blastp_output.txt

# Hamming distance
HD_max_distance 1

# Multiprocessing
number_of_processors 3

Syntax of config file:

# QUERY - query database 
QUERY >Name of query database< >Name of column with sequence<

# DB - Databases with the data we want to analyze
DB >Name of data database< >Name of column with sequence< >identifiers of sequence<

# SWA_tolerance - tolerance of Smith Waterman algorithm (score/max_score)
SWA_tolerance >float<

# Smith Waterman scoring
SWA_gap_score >int<
SWA_mismatch_score >int<
SWA_match_score >int<
SWA_matrix >name of scoring matrix<
SWA_mode >local | global<

BLAST_e_value >float<
BLAST_database_name >the name of the blast database that will be created if needed<
BLAST_output_name >name of output file<

HD_max_distance >Maximum Hamming distance(int)<

number_of_processors >number of processors for multiprocessing(int)<

Notes:

If you want to use the default value for a parameter, you can omit it from the configuration file. Default values are shown in the table above.
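
To make the line syntax and the default handling concrete, here is an illustrative sketch of how such a .txt file could be read. It is not the parser used by Database Comparator: `parse_config`, `DEFAULTS`, and the returned dictionary layout are hypothetical, and the sketch assumes that file paths and column names contain no spaces.

```python
import re

# Defaults copied from the options table. Options whose default is listed
# as None are left out of this dictionary.
DEFAULTS = {
    "SWA_tolerance": 0.93,
    "BLAST_e_value": 0.05,
    "BLAST_database_name": "clip_seq_db",
    "BLAST_output_name": "blastp_output.txt",
    "HD_max_distance": 1,
    "number_of_processors": 1,
}

# DB <path> <sequence column> [<identifier>, <identifier>, ...]
DB_LINE = re.compile(r"^DB\s+(\S+)\s+(\S+)\s*(?:\[(.*)\])?\s*$")


def parse_config(text):
    """Hypothetical parser for the .txt layout described above."""
    config = {"QUERY": None, "DB": [], **DEFAULTS}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no settings
        parts = line.split(None, 1)
        key = parts[0]
        rest = parts[1].strip() if len(parts) > 1 else ""
        if key == "QUERY":
            path, column = rest.split()
            config["QUERY"] = {"path": path, "column": column}
        elif key == "DB":
            match = DB_LINE.match(line)
            if match is None:
                raise ValueError(f"Malformed DB line: {line!r}")
            path, column, ids = match.groups()
            identifiers = [i.strip() for i in (ids or "").split(",") if i.strip()]
            config["DB"].append(
                {"path": path, "column": column, "identifiers": identifiers}
            )
        else:
            default = DEFAULTS.get(key)
            # Reuse the type of the default when there is one; otherwise
            # the value is kept as text.
            config[key] = type(default)(rest) if default is not None else rest
    if config["QUERY"] is None or not config["DB"]:
        raise ValueError("QUERY and at least one DB line are required")
    return config
```

Any option missing from the file keeps the value from `DEFAULTS`, which mirrors the rule that omitted parameters fall back to their defaults.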

Configuration file .xlsx format

The .xlsx format is more user-friendly and makes the program easier to configure. A default .xlsx file is provided in the GitHub repository, and the user can modify it as needed. The file contains several sheets, each with a different purpose. All tables are predefined, and the user only needs to fill in the necessary data. Only the yellow cells may be modified. To use the default value for a parameter, leave its cell empty.

The first sheet is the Query sheet, where the user specifies the query database and the databases against which the query is compared. It also contains the Separator parameter, which determines how multiple results are stored in the output file, and the Number of processors parameter, which sets the number of CPU cores used for multiprocessing. The Aligner sheet sets parameters for the Smith-Waterman algorithm: tolerance, gap score, mismatch score, match score, scoring matrix, and alignment mode. The BLAST sheet configures BLAST searches: the E-value threshold, database name, and output file name. The Hamming_distance sheet sets the maximum allowed Hamming distance.

Notes:

  • The Query sheet is required in the configuration file.
  • The Aligner, BLAST, and Hamming_distance sheets are optional.

Passing the config file to the program:

from Database_comparator import db_compare

if __name__ == "__main__":
  cfg_file = 'path_to_config_file.txt'
  db = db_compare.DB_comparator(cfg_file)

This way the program reads the configuration file and sets all parameters accordingly. It also checks that the configuration file is valid. The program preloads the query database; the other databases are loaded only when needed, to save memory.

Usage

Exact match

The exact_match module is used to find exact matches between sequences in the query database and data databases. It allows you to perform exact match searches in single databases or across all configured databases. Users can also take advantage of multiprocessing to expedite the process.

Example of exact match search in single database (first database in the configuration file):

from Database_comparator import db_compare

if __name__ == "__main__":
  cfg_file = 'path_to_config_file.txt'
  db = db_compare.DB_comparator(cfg_file)

  db.exact_match.exact_match_search_in_single_database(database_index=0, parallel=True) # Multiprocessing enabled (parallel=True)

Example of exact match search in all databases:

from Database_comparator import db_compare

if __name__ == "__main__":
  cfg_file = 'path_to_config_file.txt'
  db = db_compare.DB_comparator(cfg_file)

  db.exact_match.exact_match_search_in_all_databases(parallel=True) # Multiprocessing enabled (parallel=True)
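
Conceptually, an exact match search is a lookup of each query sequence among the database sequences. The sketch below illustrates the idea with a dictionary index in plain Python; it is not the library's implementation, the function names are hypothetical, and the rows are made-up toy data.

```python
from collections import defaultdict


def build_sequence_index(db_rows, seq_key):
    """Group database rows by their sequence for constant-time lookup."""
    index = defaultdict(list)
    for row in db_rows:
        index[row[seq_key]].append(row)
    return index


def exact_matches(query_sequences, db_rows, seq_key, identifiers):
    """Return {query sequence: [identifier values of every matching row]}."""
    index = build_sequence_index(db_rows, seq_key)
    hits = {}
    for seq in query_sequences:
        rows = index.get(seq, [])
        if rows:
            hits[seq] = [tuple(row[name] for name in identifiers) for row in rows]
    return hits


if __name__ == "__main__":
    # Toy rows; the column names and values are invented for illustration.
    database = [
        {"cdr3": "CASSA", "label": "toy_1"},
        {"cdr3": "CASSB", "label": "toy_2"},
        {"cdr3": "CASSA", "label": "toy_3"},
    ]
    print(exact_matches(["CASSA", "CASSX"], database, "cdr3", ["label"]))
```

Building the index once costs one pass over the database, after which each query sequence is resolved without scanning the database again.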

Aligner

The aligner module is based on the Smith-Waterman (local) and Needleman-Wunsch (global) algorithms for sequence alignment. Searches can run on a single core or with multiprocessing. The algorithm's complexity is O(n*m), where n is the length of the first sequence and m is the length of the second. The tolerance parameter sets the minimum ratio of score to maximum score (score/max_score) that an alignment must reach to count as a hit. The gap score, mismatch score, and match score are used to calculate the alignment score. The scoring matrix determines the score for each pair of aligned residues. The alignment mode can be set to either global or local.

Example of Smith-Waterman algorithm search in single database (first database in the configuration file):

from Database_comparator import db_compare

if __name__ == "__main__":
  cfg_file = 'path_to_config_file.txt'
  db = db_compare.DB_comparator(cfg_file)

  db.aligner.aligner_search_in_single_database(database_index=0, parallel=True) # Multiprocessing enabled (parallel=True)

Example of Smith-Waterman algorithm search in all databases:

from Database_comparator import db_compare

if __name__ == "__main__":
  cfg_file = 'path_to_config_file.txt'
  db = db_compare.DB_comparator(cfg_file)

  db.aligner.aligner_search_in_all_databases(parallel=True) # Multiprocessing enabled (parallel=True)
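
The tolerance check can be illustrated with a small local-alignment scorer. This is a sketch under stated assumptions, not the aligner the library uses: it applies a linear gap penalty, takes its scoring defaults from the example configuration file (match 1, mismatch 0, gap -1000), and assumes that max_score is the score of the query aligned against itself. The names `smith_waterman_score` and `is_hit` are hypothetical.

```python
def smith_waterman_score(a, b, match=1.0, mismatch=0.0, gap=-1000.0):
    """Best local alignment score of a and b, O(len(a) * len(b)) time."""
    prev = [0.0] * (len(b) + 1)
    best = 0.0
    for i in range(1, len(a) + 1):
        curr = [0.0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0.0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def is_hit(query, target, tolerance=0.93, match=1.0, mismatch=0.0, gap=-1000.0):
    """True if score / max_score reaches the tolerance."""
    # Assumption: max_score is the query scored against itself.
    max_score = len(query) * match
    if max_score <= 0:
        return False
    score = smith_waterman_score(query, target, match, mismatch, gap)
    return score / max_score >= tolerance


if __name__ == "__main__":
    print(smith_waterman_score("CASSLGQ", "CASSXGQ"))
    print(is_hit("CASSLGQ", "CASSXGQ", tolerance=0.85))
```

With a gap score as low as -1000, gaps are effectively forbidden, so the score reduces to counting matches along a diagonal.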

BLAST

The blast module enables users to create BLAST databases, perform BLAST searches for matches, and analyze the results using the aligner. The E-value threshold determines the significance a match must have. The database name specifies the BLAST database that will be created if needed. The output name specifies the output file. In future versions, the user will be able to choose whether the results are analyzed with the aligner or with Hamming distance.

Example of BLAST search in database:

from Database_comparator import db_compare

if __name__ == "__main__":
  cfg_file = 'path_to_config_file.txt'
  db = db_compare.DB_comparator(cfg_file)

  db.blast.blast_database_info() # Provides information about the BLAST database
  
  db.blast.blast_make_database(name="BLAST_Database")  # Creates the BLAST database
  db.blast.blast_search_for_match_in_database()  # The query is the input database
  db.blast.analyze_matches_in_database()  # The BLAST output is analyzed with the aligner

  # Alternatively, a single call performs both the BLAST search and the aligner analysis:
  # db.blast.blast_search_and_analyze_matches_in_database()
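
For orientation, the following sketch shows how the three BLAST settings could map onto NCBI BLAST+ command lines. How Database Comparator actually calls BLAST is not documented here, so this is an assumption: the helper names, the FASTA record names, and the tabular output format (`-outfmt 6`) are illustrative choices. The flags themselves are standard `makeblastdb` and `blastp` options.

```python
def write_fasta(sequences, path):
    """Write sequences as FASTA records named seq_0, seq_1, ..."""
    with open(path, "w", encoding="ascii") as handle:
        for i, seq in enumerate(sequences):
            handle.write(f">seq_{i}\n{seq}\n")


def makeblastdb_command(fasta_path, db_name="clip_seq_db"):
    """Command that builds a protein BLAST database from a FASTA file."""
    return ["makeblastdb", "-in", fasta_path, "-dbtype", "prot", "-out", db_name]


def blastp_command(query_fasta, db_name="clip_seq_db", e_value=0.05,
                   output="blastp_output.txt"):
    """Command that searches the database and writes tabular output."""
    return [
        "blastp",
        "-query", query_fasta,
        "-db", db_name,
        "-evalue", str(e_value),
        "-out", output,
        "-outfmt", "6",  # tabular output, chosen here for illustration
    ]


if __name__ == "__main__":
    # Only prints the commands. To execute them, pass each list to
    # subprocess.run(cmd, check=True) on a machine with BLAST+ installed.
    print(" ".join(makeblastdb_command("database.fasta")))
    print(" ".join(blastp_command("query.fasta")))
```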

Hamming distances

The hamming_distances module calculates Hamming distances between sequences, either in a single database or across all databases. The maximum allowed Hamming distance sets how many mismatches two sequences may have and still count as a hit. The standard Hamming distance function returns a matrix of Hamming distances between all sequences in the database. This matrix can be analyzed with the analyze_single_hamming_matrix function to identify patterns in the data. Hamming distances can also be calculated for all databases and analyzed with the analyze_all_hamming_matrices function.

Example of Hamming distance search in single database (first database in the configuration file):

from Database_comparator import db_compare

if __name__ == "__main__":
  cfg_file = 'path_to_config_file.txt'
  db = db_compare.DB_comparator(cfg_file)
  
  # Hamming distances will be analyzed - The hits under the maximum allowed Hamming distance will be stored in the output file
  db.hamming_distances.find_hamming_distances_for_single_database(database_index=0, analyze=True) 

  # Hamming matrices are stored in >hamming_matrices_for_all_databases<
  db_matrices = db.hamming_distances.hamming_matrices_for_all_databases
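
As an illustration of the matrix-based approach, the sketch below builds a full distance matrix and then extracts the hits. It is not the library's code. It assumes that rows correspond to query sequences and columns to database sequences, and it marks pairs of unequal length as None because the Hamming distance is only defined for equal lengths; how the library treats such pairs is not stated.

```python
def hamming_distance(a, b):
    """Number of mismatching positions, or None if the lengths differ."""
    if len(a) != len(b):
        return None
    return sum(x != y for x, y in zip(a, b))


def hamming_matrix(queries, targets):
    """Full matrix: one row per query, one column per database sequence."""
    return [[hamming_distance(q, t) for t in targets] for q in queries]


def hits_from_matrix(matrix, max_distance=1):
    """(row, column) index pairs whose distance is within max_distance."""
    return [
        (i, j)
        for i, row in enumerate(matrix)
        for j, d in enumerate(row)
        if d is not None and d <= max_distance
    ]


if __name__ == "__main__":
    matrix = hamming_matrix(["CASSL", "CASSF"], ["CASSL", "CATSF", "CAS"])
    print(matrix)
    print(hits_from_matrix(matrix, max_distance=1))
```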

This approach is very memory-intensive, so Hamming distances can also be calculated without generating the matrix. This alternative is much faster and uses less memory. Sequences within the maximum allowed Hamming distance are stored in the output file. No further analysis of the matrix is possible, because it is never generated.

Example of fast Hamming distance search in single database (first database in the configuration file):

from Database_comparator import db_compare

if __name__ == "__main__":
  cfg_file = 'path_to_config_file.txt'
  db = db_compare.DB_comparator(cfg_file)

  db.fast_hamming_distances.find_hamming_distances_for_single_database(0, parallel=True) # Multiprocessing enabled (parallel=True)

Example of fast Hamming distance search in all databases:

from Database_comparator import db_compare

if __name__ == "__main__":
  cfg_file = 'path_to_config_file.txt'
  db = db_compare.DB_comparator(cfg_file)

  db.fast_hamming_distances.find_hamming_distances_for_all_databases(parallel=True) # Multiprocessing enabled (parallel=True)
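
One way a matrix-free search can save time and memory is to stop comparing a pair as soon as it exceeds the allowed distance and to yield hits one at a time. The sketch below shows that idea only; whether the fast_hamming_distances module works this way is an assumption, and the function names are hypothetical.

```python
def within_hamming_distance(a, b, max_distance=1):
    """True if a and b have equal length and at most max_distance mismatches."""
    if len(a) != len(b):
        return False
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > max_distance:
                return False  # early exit: no need to scan the rest
    return True


def iter_hits(queries, targets, max_distance=1):
    """Yield (query, target) pairs within max_distance, without a matrix."""
    # Bucket targets by length so only equal-length pairs are compared.
    by_length = {}
    for t in targets:
        by_length.setdefault(len(t), []).append(t)
    for q in queries:
        for t in by_length.get(len(q), []):
            if within_hamming_distance(q, t, max_distance):
                yield q, t


if __name__ == "__main__":
    for query, target in iter_hits(["CASSL"], ["CASSF", "CATSF", "CAS"]):
        print(query, target)
```

Because nothing is stored beyond the current pair, memory use stays flat regardless of database size, at the cost of having no matrix left to analyze afterwards.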

Exporting results

The program can also export the results to a .csv or .xlsx file. The user specifies the path to the output file; the separator placed between multiple results for the same sequence is defined in the configuration file. If the data exceed the maximum size allowed for .xlsx files, the program automatically generates a .csv backup file.

Example of exporting results:

from Database_comparator import db_compare

if __name__ == "__main__":
  cfg_file = 'path_to_config_file.txt'
  db = db_compare.DB_comparator(cfg_file)

  # Data computing...

  db.export_data_frame(output_file="MyAnalysis.xlsx", data_format="xlsx")
  db.export_data_frame(output_file="MyAnalysis.csv", data_format="csv")

The dataframe can also be accessed through the db.config.input_df attribute.
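
The separator and the .csv fallback can be pictured with the short sketch below. It is illustrative only: `join_results` and `pick_export_format` are hypothetical helpers, and the exact condition the program checks before writing the .csv backup is not documented here. The limits used are the worksheet limits of the .xlsx format (1,048,576 rows and 16,384 columns).

```python
EXCEL_MAX_ROWS = 1_048_576  # rows per worksheet in the .xlsx format
EXCEL_MAX_COLS = 16_384     # columns per worksheet in the .xlsx format


def join_results(results, separator="\n"):
    """Combine several results for one sequence into a single cell value."""
    return separator.join(str(r) for r in results)


def pick_export_format(n_rows, n_cols, requested="xlsx"):
    """Fall back to csv when the table cannot fit into one worksheet."""
    # One extra row is reserved for the header.
    too_big = n_rows + 1 > EXCEL_MAX_ROWS or n_cols > EXCEL_MAX_COLS
    if requested == "xlsx" and too_big:
        return "csv"
    return requested


if __name__ == "__main__":
    print(join_results(["toy_1", "toy_2"], separator=";"))
    print(pick_export_format(n_rows=2_000_000, n_cols=12))
```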
