Bioinformatics tool for comparing large sequence files

Project description

Database comparator

The Database Comparator is a versatile tool designed for searching, analyzing, and comparing biological sequence databases. It supports various algorithms, including exact matching, sequence alignment, BLAST searches, and Hamming distance calculations, facilitating comprehensive analysis of DNA and protein sequences. The program is highly customizable, allowing users to adjust parameters to suit their specific needs. It also supports multiprocessing, enabling faster processing of large datasets. The Database Comparator is a valuable resource for researchers, bioinformaticians, and anyone working with biological sequence data.

Installation
Docker
Configuration file
Usage

Installation

Use the following command to install the program:

pip install Database-comparator

or clone the repository and install the program manually:

git clone https://github.com/preislet/Database_comparator.git
cd Database_comparator

BLAST needs to be installed manually.
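Before using the BLAST module, it can help to confirm that BLAST is reachable from Python. The sketch below only looks for executables on the PATH. It assumes the standard NCBI BLAST+ executable names `blastp` and `makeblastdb`; which tools the package actually calls is an assumption here, and `find_blast_tools` is a hypothetical helper, not part of Database-comparator.

```python
import shutil


def find_blast_tools(tools=("blastp", "makeblastdb")):
    """Return a mapping of tool name -> resolved path, or None if not on PATH.

    The default tool names are an assumption (standard NCBI BLAST+ names).
    """
    return {tool: shutil.which(tool) for tool in tools}


if __name__ == "__main__":
    for tool, path in find_blast_tools().items():
        print(f"{tool}: {path or 'NOT FOUND'}")
```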

Docker

A Dockerfile is provided in the repository. To build and run the Docker image, follow these steps:

Step 1:

Run the following command to build the Docker image. Replace <image_name> with a name for your image, and optionally specify a tag (e.g., latest):

docker build -t <image_name>:<tag> .

Step 2:

After the image is successfully built, you can run a container from it:

docker run -e PASSWORD=rstudio --rm -p 8787:8787 <image_name>:<tag>

Step 3:

Open a web browser and go to http://localhost:8787.

Configuration file

The configuration file adapts the program to the data you want to analyze. It specifies the query database and the databases against which the query is compared. Optionally, it can also set internal parameters for the Smith-Waterman algorithm, BLAST, and so on; any parameter that is not specified takes its default value. The configuration file can be in .txt or .xlsx format. We highly recommend the .xlsx format because it is more user-friendly.

Configuration file .txt format

The table below describes all available configuration options for the Database Comparator.

Option Name	Description	Type	Default Value	Example Values
`DB`	Defines a database path, sequence column, and result column.	String	None	`DB path/to/db.csv seq_col result_col [identifiers]`
`QUERY`	Specifies the query file path and sequence column name.	String	None	`QUERY path/to/query.csv seq_col`
`SWA_tolerance`	Tolerance for Smith-Waterman alignment.	Float	`0.93`	`0.95`, `0.9`
`SWA_gap_score`	Gap penalty for Smith-Waterman alignment.	Float	None	`-2.0`, `-3.0`
`SWA_mismatch_score`	Mismatch penalty for Smith-Waterman alignment.	Float	None	`-1.0`, `-2.0`
`SWA_match_score`	Match reward for Smith-Waterman alignment.	Float	None	`2.0`, `3.0`
`SWA_matrix`	Substitution matrix for alignment.	String	None	`BLOSUM62`, `PAM250`
`SWA_mode`	Alignment mode (`global` or `local`).	String	None	`local`, `global`
`BLAST_e_value`	E-value threshold for BLAST searches.	Float	`0.05`	`1e-5`, `0.01`
`BLAST_database_name`	Name of the BLAST database.	String	`"clip_seq_db"`	`"Any_name"`
`BLAST_output_name`	Name of the BLAST output file.	String	`"blastp_output.txt"`	`"output.txt"`, `"results.tsv"`
`HD_max_distance`	Maximum allowed Hamming distance.	Integer	`1`	`2`, `5`, `10`
`number_of_processors`	Number of CPU cores to use for multiprocessing.	Integer	`1`	`2`, `4`, `8`
`separator`	Separator for results in the input DataFrame.	String	`"\n"`	`";"`, `","`, `" "`

Notes:

The DB and QUERY parameters are required in the configuration file.
Some parameters (like SWA_tolerance, SWA_match_score, etc.) are specific to Smith-Waterman alignment.
The BLAST_* parameters configure BLAST sequence searches.
HD_max_distance is used for Hamming distance calculations.
separator determines how multiple results are stored in the output file.

Example of configuration file:

# Databases
QUERY HEDIMED__230620_Hedimed_1_22_basic--table_EF_predelana.xlsx part3

DB Databases/Nakayama.csv CDR3b [Clone/SequenceID, Epitope]
DB Databases/McPAS-TCR-filtred.csv CDR3.beta.aa [PubMed.ID, Pathology, Additional.study.details]
DB Databases/vdjdb.csv cdr3 [antigen.gene, antigen.species, mhc.a, gene]
DB Databases/TCRdb_all_sequnces.csv AASeq [TCRDB_project_ID, RunId, cloneFraction]

# Smith–Waterman algorithm
SWA_tolerance 0.9
SWA_gap_score -1000
SWA_mismatch_score 0
SWA_match_score 1

# Blastp Algorithm
BLAST_e_value 0.05
BLAST_database_name clip_seq_db
BLAST_output_name blastp_output.txt

# Hamming distance
HD_max_distance 1

# Multiprocessing
number_of_processors 3

Syntax of config file:

# QUERY - query database 
QUERY >Name of query database< >Name of column with sequence<

# DB - Databases with the data we want to analyze
DB >Name of data database< >Name of column with sequence< >identifiers of sequence<

# SWA_tolerance - tolerance of Smith Waterman algorithm (score/max_score)
SWA_tolerance >float<

# Smith Waterman scoring
SWA_gap_score >int<
SWA_mismatch_score >int<
SWA_match_score >int<
SWA_matrix >name of scoring matrix<
SWA_mode >local | global<

BLAST_e_value >float<
BLAST_database_name >the name of the blast database that will be created if needed<
BLAST_output_name >name of output file<

HD_max_distance >Maximum Hamming distance(int)<

number_of_processors >number of processors for multiprocessing(int)<

Notes:

If you want to use the default value for a parameter, you can omit it from the configuration file. Default values are shown in the table above.
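To make the .txt syntax concrete, here is a minimal sketch of how such a file could be read into a dictionary. It is only an illustration of the syntax and is not the parser used by the package: `parse_config_text` and the dictionary layout are hypothetical, DB lines are assumed to follow the form used in the example file (path, sequence column, optional bracketed identifiers), and paths are assumed to contain no spaces.

```python
import re

# Arguments of a DB line: <path> <sequence column> [optional, comma-separated identifiers]
DB_ARGS = re.compile(r"^(\S+)\s+(\S+)(?:\s+\[(.*)\])?$")


def parse_config_text(text):
    """Parse the .txt configuration syntax into a plain dict (illustrative only)."""
    config = {"query": None, "databases": [], "params": {}}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)
        key = parts[0]
        value = parts[1] if len(parts) > 1 else ""
        if key == "DB":
            match = DB_ARGS.match(value)
            if match is None:
                raise ValueError(f"Malformed DB line: {line!r}")
            path, seq_col, ids = match.groups()
            identifiers = [i.strip() for i in ids.split(",")] if ids else []
            config["databases"].append(
                {"path": path, "sequence_column": seq_col, "identifiers": identifiers}
            )
        elif key == "QUERY":
            path, seq_col = value.split()
            config["query"] = {"path": path, "sequence_column": seq_col}
        else:
            # Optional parameters are kept as strings; omitted ones fall back to defaults.
            config["params"][key] = value
    if config["query"] is None or not config["databases"]:
        raise ValueError("QUERY and at least one DB line are required")
    return config
```

Keeping optional parameters as raw strings leaves type conversion, and the choice of defaults, to the caller.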

Configuration file .xlsx format

The .xlsx format is more user-friendly and makes the program easier to configure. A default .xlsx file is provided in the GitHub repository, and the user can modify it as needed. The .xlsx file contains several sheets, each with a different purpose. All tables are predefined, so the user only needs to fill in the necessary data. Only the yellow cells can be modified. To use the default value for a parameter, leave its cell empty.

The first sheet is the Query sheet, where the user specifies the query database and the databases against which the query is compared. It also contains the Separator parameter, which determines how multiple results are stored in the output file, and the Number of processors parameter, which sets the number of CPU cores used for multiprocessing. The Aligner sheet sets parameters for the Smith-Waterman algorithm, such as tolerance, gap score, mismatch score, match score, scoring matrix, and alignment mode. The BLAST sheet configures BLAST searches, including the E-value threshold, database name, and output file name. The Hamming_distance sheet sets the maximum allowed Hamming distance.

Notes:

The Query sheet is required in the configuration file.
The Aligner, BLAST, and Hamming_distance sheets are optional.

Inserting config file to program:

from Database_comparator import db_compare

if __name__ == "__main__":
  cfg_file = 'path_to_config_file.txt'
  db = db_compare.DB_comparator(cfg_file)

This way the program reads the configuration file, sets all parameters accordingly, and checks that the configuration file is valid. The program also preloads the query database. The other databases are loaded only when needed, to save memory.

Usage

Exact match

The exact_match module is used to find exact matches between sequences in the query database and data databases. It allows you to perform exact match searches in single databases or across all configured databases. Users can also take advantage of multiprocessing to expedite the process.
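Conceptually, an exact match search is a lookup of each query sequence in an index built from a database. The sketch below illustrates that idea only; it is not the package's implementation, and `build_index`, `exact_matches`, and the (sequence, identifier) row layout are hypothetical.

```python
from collections import defaultdict


def build_index(database_rows):
    """Map each database sequence to the list of identifiers that carry it."""
    index = defaultdict(list)
    for sequence, identifier in database_rows:
        index[sequence].append(identifier)
    return index


def exact_matches(query_sequences, index, separator="\n"):
    """Return {query sequence: joined identifiers} for queries found in the index."""
    return {
        seq: separator.join(index[seq])
        for seq in query_sequences
        if seq in index  # membership test does not create new defaultdict entries
    }
```

Multiple identifiers for the same sequence are joined with the separator, mirroring the role of the `separator` option.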

Example of exact match search in single database (first database in the configuration file):

from Database_comparator import db_compare

if __name__ == "__main__":
  cfg_file = 'path_to_config_file.txt'
  db = db_compare.DB_comparator(cfg_file)

  db.exact_match.exact_match_search_in_single_database(database_index=0, parallel=True) # Multiprocessing enabled (parallel=True)

Example of exact match search in all databases:

from Database_comparator import db_compare

if __name__ == "__main__":
  cfg_file = 'path_to_config_file.txt'
  db = db_compare.DB_comparator(cfg_file)

  db.exact_match.exact_match_search_in_all_databases(parallel=True) # Multiprocessing enabled (parallel=True)

Aligner

The aligner module is based on the Smith-Waterman (local) and Needleman-Wunsch (global) algorithms for sequence alignment. It can run match searches on a single core or with multiprocessing. The algorithm complexity is O(n*m), where n is the length of the first sequence and m is the length of the second sequence. The tolerance parameter sets the minimum score, expressed as the ratio score/max_score, that an alignment must reach to be considered a hit. The gap score, mismatch score, and match score are used to calculate the alignment score. The scoring matrix determines the score for each pair of aligned residues. The alignment mode can be set to either global or local.
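The following sketch shows how a local (Smith-Waterman) alignment score and a tolerance check fit together. It is a simplified illustration, not the package's aligner: it uses linear gap penalties and fixed match/mismatch scores instead of a substitution matrix, and it assumes that max_score is the score of the query aligned against itself (match score times query length). The default scores mirror the example configuration file.

```python
def local_alignment_score(a, b, match=1.0, mismatch=0.0, gap=-1000.0):
    """Best Smith-Waterman local alignment score with a linear gap penalty.

    Runs in O(len(a) * len(b)) time and keeps only two rows in memory.
    """
    prev = [0.0] * (len(b) + 1)
    best = 0.0
    for i in range(1, len(a) + 1):
        curr = [0.0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0.0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def is_hit(query, target, tolerance=0.93, match=1.0, mismatch=0.0, gap=-1000.0):
    """True if score / max_score reaches the tolerance (max_score is an assumption)."""
    max_score = match * len(query)
    if max_score <= 0:
        return False
    score = local_alignment_score(query, target, match, mismatch, gap)
    return score / max_score >= tolerance
```

With a very large gap penalty, as in the example configuration, gaps are effectively forbidden, so the score reduces to counting matches along the best ungapped diagonal.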

Example of Smith-Waterman algorithm search in single database (first database in the configuration file):

from Database_comparator import db_compare

if __name__ == "__main__":
  cfg_file = 'path_to_config_file.txt'
  db = db_compare.DB_comparator(cfg_file)

  db.aligner.aligner_search_in_single_database(database_index=0, parallel=True) # Multiprocessing enabled (parallel=True)

Example of Smith-Waterman algorithm search in all databases:

from Database_comparator import db_compare

if __name__ == "__main__":
  cfg_file = 'path_to_config_file.txt'
  db = db_compare.DB_comparator(cfg_file)

  db.aligner.aligner_search_in_all_databases(parallel=True) # Multiprocessing enabled (parallel=True)

BLAST

The blast module enables users to create BLAST databases, perform BLAST searches for matches, and analyze the results using the aligner. The E-value threshold determines the significance of a match. The database name specifies the name of the BLAST database that will be created if needed. The output name specifies the name of the output file. In future versions, the user will be able to choose whether the aligner or Hamming distance is used to analyze the results.

Example of BLAST search in database:

from Database_comparator import db_compare

if __name__ == "__main__":
  cfg_file = 'path_to_config_file.txt'
  db = db_compare.DB_comparator(cfg_file)

  db.blast.blast_database_info() # Provides information about the BLAST database
  
  db.blast.blast_make_database(name="BLAST_Database") # Creates BLAST database
  db.blast.blast_search_for_match_in_database() #Query is input database
  db.blast.analyze_matches_in_database() #BLAST output will be analyzed with aligner

  # Alternatively, a single call performs both the BLAST search and the aligner analysis:
  # db.blast.blast_search_and_analyze_matches_in_database()

Hamming distances

The hamming_distances module calculates Hamming distances between sequences. Users can compute Hamming distances for a single database or across all databases. The maximum allowed Hamming distance sets the maximum number of mismatches allowed between two sequences. The standard Hamming distance function returns a matrix of Hamming distances between all sequences in the database. This matrix can be analyzed with the analyze_single_hamming_matrix function to identify patterns in the data. The user can also calculate Hamming distances for all databases and analyze them with the analyze_all_hamming_matrices function.
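As a reminder of what is being computed, the sketch below builds a small Hamming distance matrix and extracts the hits under a maximum distance. It is an illustration rather than the package's code: the function names are hypothetical, the matrix is assumed to hold query sequences as rows and database sequences as columns, and pairs of unequal length are marked with `None` because the Hamming distance is defined only for equal-length sequences.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def hamming_matrix(queries, targets):
    """Rows are queries, columns are targets; None marks unequal lengths."""
    return [
        [hamming_distance(q, t) if len(q) == len(t) else None for t in targets]
        for q in queries
    ]


def hits_from_matrix(matrix, max_distance=1):
    """Return (row, column) pairs whose distance is at most max_distance."""
    return [
        (i, j)
        for i, row in enumerate(matrix)
        for j, d in enumerate(row)
        if d is not None and d <= max_distance
    ]
```

The matrix holds one entry per query/target pair, so its size grows with the product of the two collection sizes.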

Example of Hamming distance search in single database (first database in the configuration file):

from Database_comparator import db_compare

if __name__ == "__main__":
  cfg_file = 'path_to_config_file.txt'
  db = db_compare.DB_comparator(cfg_file)
  
  # Hamming distances will be analyzed - The hits under the maximum allowed Hamming distance will be stored in the output file
  db.hamming_distances.find_hamming_distances_for_single_database(database_index=0, analyze=True) 

  # Hamming matrices are stored in >hamming_matrices_for_all_databases<
  db_matrices = db.hamming_distances.hamming_matrices_for_all_databases

This approach is very space-consuming, so the user can also calculate Hamming distances without generating the matrix. That approach is much faster and uses less memory. Sequences within the maximum allowed Hamming distance are stored in the output file. No further analysis of the matrix is possible, because it is never generated.
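One way to avoid materializing a matrix is to test each pair against the threshold and stop counting as soon as the threshold is exceeded. The sketch below illustrates that idea under the same assumptions as before (hypothetical function names, unequal-length pairs are never hits); how the package's fast mode works internally is not documented here.

```python
def within_hamming_distance(a, b, max_distance=1):
    """True if a and b have equal length and differ in at most max_distance positions."""
    if len(a) != len(b):
        return False
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > max_distance:
                return False  # early exit: no need to scan the rest
    return True


def iter_hits(queries, targets, max_distance=1):
    """Lazily yield (query, target) pairs within the allowed distance."""
    for q in queries:
        for t in targets:
            if within_hamming_distance(q, t, max_distance):
                yield q, t
```

Because hits are yielded one at a time, memory use depends on the number of hits rather than on the number of pairs.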

Example of fast Hamming distance search in single database (first database in the configuration file):

from Database_comparator import db_compare

if __name__ == "__main__":
  cfg_file = 'path_to_config_file.txt'
  db = db_compare.DB_comparator(cfg_file)

  db.fast_hamming_distances.find_hamming_distances_for_single_database(0, parallel=True) # Multiprocessing enabled (parallel=True)

Example of fast Hamming distance search in all databases:

from Database_comparator import db_compare

if __name__ == "__main__":
  cfg_file = 'path_to_config_file.txt'
  db = db_compare.DB_comparator(cfg_file)

  db.fast_hamming_distances.find_hamming_distances_for_all_databases(parallel=True) # Multiprocessing enabled (parallel=True)

Exporting results

The program can also export the results to a .csv or .xlsx file. The user can specify the path to the output file and the separator used to separate multiple results for the same sequence. The separator is defined in the configuration file. If the file exceeds the maximum allowed size for .xlsx files, the program automatically generates a .csv backup file.

Example of exporting results:

from Database_comparator import db_compare

if __name__ == "__main__":
  cfg_file = 'path_to_config_file.txt'
  db = db_compare.DB_comparator(cfg_file)

  # Data computing...

  db.export_data_frame(output_file="MyAnalysis.xlsx", data_format="xlsx")
  db.export_data_frame(output_file="MyAnalysis.csv", data_format="csv")

The dataframe can also be accessed using the db.config.input_df attribute.
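The two export details described above, the separator and the .xlsx size limit, can be illustrated with a small sketch. It is not the package's export code: `join_hits` and `choose_export_format` are hypothetical helpers, and the size check assumes the limit in question is the Excel worksheet limit of 1,048,576 rows by 16,384 columns.

```python
XLSX_MAX_ROWS = 1_048_576  # Excel worksheet row limit
XLSX_MAX_COLS = 16_384     # Excel worksheet column limit


def join_hits(hits, separator="\n"):
    """Join multiple results for one sequence into a single cell value."""
    return separator.join(str(hit) for hit in hits)


def choose_export_format(n_rows, n_cols, requested="xlsx"):
    """Fall back to csv when the table (plus one header row) will not fit in a worksheet."""
    too_large = n_rows + 1 > XLSX_MAX_ROWS or n_cols > XLSX_MAX_COLS
    if requested == "xlsx" and too_large:
        return "csv"
    return requested
```

For example, `choose_export_format(2_000_000, 5)` returns `"csv"`, while a table of 10 rows and 5 columns stays `"xlsx"`.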
