Graph-based partitioning of biological sequence data
Project description
GraphPart
Biological sequence dataset partitioning method
Graph-Part is a Python package for generating partitions (i.e. train-test splits, or splits for cross-validation) of biological sequence datasets. It ensures minimal homology between different partitions, while balancing partitions for labels or other desired criteria.
Preprint: https://www.biorxiv.org/content/10.1101/2023.04.14.536886v1
Installation
GraphPart relies on needleall from the EMBOSS package for Needleman-Wunsch alignments of sequences. Please refer to the official EMBOSS documentation for installation methods.
Additionally, GraphPart supports MMseqs2 for alignments. To use other algorithms that compute pairwise similarity measures, please refer to the precomputed
mode.
We recommend to install GraphPart in a conda environment, and install EMBOSS from bioconda. The same goes for MMseqs2.
conda install -c bioconda emboss
# If you also want to use the MMseqs2 mode
conda install -c conda-forge -c bioconda mmseqs2
Alternatively, on Ubuntu, EMBOSS is available directly via sudo apt-get install emboss
.
To install GraphPart, run
pip install graph-part
The command graphpart
will now be available on your command line.
Alternatively, you can install GraphPart from source:
conda install -c bioconda emboss
git clone https://github.com/graph-part/graph-part.git
cd graph-part
pip install .
Instructions
Command line interface for FASTA data
As an example, this is a basic command for partitioning a dataset at a maximum pairwise cross-partition identity of 30% into 5 folds. The resulting partitions are balanced for equal frequencies of the class labels specified in label=
in the FASTA headers. --threads
can be adapted according to your system and has no effect on the partitioning itself, it only affects the resource usage and processing speed.
graphpart needle --fasta-file netgpi_dataset.fasta --threshold 0.3 --out-file graphpart_assignments.csv --labels-name label --partitions 5 --threads 12
Alternatively, a train-validation-test split of the data can be made instead of folds:
graphpart needle --fasta-file netgpi_dataset.fasta --threshold 0.3 --out-file graphpart_assignments.csv --labels-name label --test-ratio 0.1 --val-ratio 0.05 --threads 12
Python API
A tutorial notebook showcasing how to use GraphPart from within Python is included at tutorial.ipynb. The tutorial also covers partitioning of small molecule data.
Input format
GraphPart works on FASTA files with a custom header format, e.g.
>P42098|label=CLASSA|priority=0
MAPSWRFFVCFLLWGGTELCSPQPVWQDEGQRLRPSKPPTVMVECQEAQLVVIVSKDLFGTGKLIRPADL
>P0CL66|label=CLASSB|priority=1
MKKYLLGIGLILALIACKQNVSSLDEKNSVSVDLPGEMKVLVSKEKNKDGKYDLIATVDKLELKGTSDKN
Alternatively , ":" and " - " (note there are spaces on either side of the -
) can be used as separators instead of "|". It should be taken care that the sequence identifiers themselves contain no separator symbols. The keywords label
and priority
can be customized by specifying the --labels-name
and --priority-name
arguments. Both elements of the header are optional, GraphPart can also just partition based on sequences alone, without any class balancing.
You can find a script to convert .csv
datasets into the custom .fasta
format at csv_to_fasta.py
Output format
GraphPart produces a .csv
file that contains the cluster assignment for each sequence. Column cluster
contains the partition number. Removed sequences are not contained in the output file.
AC,priority,label-val,between_connectivity,cluster
P42098,False,0,0,0.0
Q6LEM5,False,0,0,0.0
Q9JI81,False,0,0,1.0
API
Minimal command:
graphpart ALIGNMENT_MODE -ff FASTAFILE.fasta -th 0.3
Supported aligners
Alignment mode | Description |
---|---|
needle |
Use EMBOSS needleall to compute exact pairwise global Needleman-Wunsch (NW) identities for all sequences. |
mmseqs2 |
Use MMseqs2 to compute fast identities from local alignments. Use with caution for nucleotides, as there it cannot be guaranteed that MMseqs2 computes all pairwise alignments. |
precomputed |
Use a list of precomputed identities or other similarity/distance metrics. |
mmseqs2needle |
Uses MMseqs2 to compute fast identities, followed by recomputation of NW identities of all identities between the separation threshold and an user-defined lower threshold. |
Arguments
Shared
Long | Short | Description |
---|---|---|
--fasta-file |
-ff |
Path to the input fasta file, formatted according to the input format. |
--out-file |
-of |
Path at which to save the partition assignments as .csv . Defaults to graphpart_result.csv . |
--threshold |
-th |
The desired partitioning threshold, should be within the bounds defined by the metric. |
--partitions |
-pa |
Number of partitions to generate. Defaults to 5. |
--transformation |
-tf |
Transformation to apply to the similarity/distance metric. GraphPart operates on distances, therefore similarity metrics need to be transformed. Can be any of one-minus , inverse , square , log , None . See the source for definitions. As an example, when operating with sequence identities ranging from 0 to 1, the transformation one-minus yields corresponding distances. Defaults to one-minus . |
--priority-name |
-pn |
The name of the retention priority in the meta file. Is either =0 or =1 . If specified, the algorithm first tries to reach the treshold by removing/moving low-priority (0 ) samples before proceeding to 1 samples. |
--labels-name |
-ln |
The name of the label in the meta file. Used for balancing partitions. |
--initialization-mode |
-im |
Use either slow or fast restricted nearest neighbor linkage or no initialization. Can be any of slow-nn , fast-nn , simple . Defaults to slow-nn . |
--no-moving |
-nm |
By default, the removing procedure tries to relocate sequences to another partition if it finds more within-threshold neighbours in any. This flag disallows moving. In high-redundancy datasets, moving can lead to imbalanced partitions and should be disabled. |
--save-checkpoint-path |
-sc |
Optional path to save the computed identities above the chosen threshold as an edge list. Can be used to quickstart runs in the precomputed mode. Defaults to None with no file saved. |
--test-ratio |
-te |
Make a train-val-test split instead of partitions for cross-validation. Overrides --partitions when specified. Defaults to 0. Needs to be a multiple of 0.05. |
--val-ratio |
-va |
Make a train-val-test split instead of partitions for cross-validation. Overrides --partitions when specified. Defaults to 0. Needs to be a multiple of 0.05. |
needle
Long | Short | Description |
---|---|---|
--denominator |
-dn |
Denominator to use for percent sequence identity computation. The number of perfect matching positions is divided by the result of this operation. Can be any of shortest , longest , mean , full , no_gaps . The first three options are computed from the original lengths of the aligned sequences. full refers to the full length of the alignment, including gaps, and is the default. no_gaps subtracts gaps from the full alignment length. |
--threads |
-nt |
The number of threads to run in parallel. If -1 , will use all available resources. Defaults to 1. |
--chunks |
-nc |
The number of chunks into which to split the fasta file for multithreaded alignment. Defaults to 10. |
--parallel-mode |
-pm |
The Python parallelization strategy to use. multithread or multiprocess . Multiprocessing is potentially faster (especially for short sequences), but increases memory usage. Defaults to multithread |
--nucleotide |
-nu |
Use this flag if the input contains nucleotide sequences. By default, assumes proteins. |
--triangular |
-tr |
Only compute triangular of the full distance matrix. Twice as fast, but can yield slightly different results if an alignment has two different solutions with the same score, but different identities. In some cases, a pairwise identity can be slightly above the threshold in one solution, and slightly below in another (e.g A:B = 29.8%, B:A = 30.4%). |
--gapopen |
-gapopen |
[10.0 for any sequence] The gap open penalty is the score taken away when a gap is created. The best value depends on the choice of comparison matrix. The default value assumes you are using the EBLOSUM62 matrix. (Floating point number from 1.0 to 100.0) |
--gapextend |
-gapextend |
[0.5 for any sequence] The gap extension penalty is added to the standard gap penalty for each base or residue in the gap. This is how long gaps are penalized. Usually you will expect a few long gaps rather than many short gaps, so the gap extension penalty should be lower than the gap penalty. An exception is where one or both sequences are single reads with possible sequencing errors in which case you would expect many single base gaps. You can get this result by setting the gap open penalty to a very low value and using the gap extension penalty to control gap scoring. (Floating point number from 0.0 to 10.0) |
--endextend |
-endextend |
[0.5 for any sequence] The end gap extension, penalty is added to the end gap penalty for each base or residue in the end gap. This is how long end gaps are penalized. (Floating point number from 0.0 to 10.0) |
--endweight |
-endweight |
Flag. Apply end gap penalties. |
--endopen |
-endopen |
[10.0 for any sequence] The end gap open penalty is the score taken away when an end gap is created. The best value depends on the choice of comparison matrix. The default value assumes you are using the EBLOSUM62 matrix for protein sequences. (Floating point number from 1.0 to 100.0) |
--matrix |
-datafile |
This is the scoring matrix file used when comparing sequences. By default it is the file 'EBLOSUM62'. These files are found in the 'data' directory of the EMBOSS installation. |
mmseqs2
Long | Short | Description |
---|---|---|
--denominator |
-dn |
Denominator to use for percent sequence identity computation. The number of perfect matching positions is divided by the result of this operation. Can be any of shortest , longest , n_aligned . n_aligned is the length of the alignment. Use this with caution, as GraphPart doesn't use coverage controls in the mmseqs2 mode. Defaults to shortest . |
--nucleotide |
-nu |
Use this flag if the input contains nucleotide sequences. By default, assumes proteins. Use with caution! Not guaranteed to compute all pairwise alignments. |
--prefilter |
-pr |
Use MMseqs2 prefiltering at the highest sensitivity instead of forcing computation of all-vs-all alignments. |
precomputed
Long | Short | Description |
---|---|---|
--edge-file |
-ef |
Path to a comma separated file containing precomputed pairwise metrics, the first two columns should contain sequence identifiers specified in the --fasta-file . This is can be used to run GraphPart with an alignment tool different from the default needleall and mmseqs . |
--metric-column |
-mc |
Specifies in which column the metric is found. Indexing starts at 0, defaults to 2 when left unspecified. |
mmseqs2needle
This mode combines needle
and mmseqs2
. After a first run of mmseqs2
, all pairwise alignments with an indentity below recompute-threshold
are computed using needle
. Only then the partitioning is performed. Note that if recompute-threshold
is very low, it can be faster to just run in needle
mode. All arguments for the two modes are reused here, with the following exceptions:
Long | Short | Description |
---|---|---|
--recompute-threshold |
-re |
The threshold for MMseqs2 above which alignments should be recomputed using needleall. Has to be a number lower than --threshold . |
--denominator-mmseqs |
-dnm |
Replaces --denominator . Applies to the mmseqs2 alignment step. |
--denominator-needle |
-dnn |
Replaces --denominator . Applies to the needle alignment step. |
Citation
GraphPart: Homology partitioning for biological sequence analysis
Felix Teufel, Magnús Halldór Gíslason, José Juan Almagro Armenteros, Alexander Rosenberg Johansen, Ole Winther, Henrik Nielsen
bioRxiv 2023.04.14.536886; doi: https://doi.org/10.1101/2023.04.14.536886
FAQ
-
How should I pick
chunks
?
chunks
should be picked so that allthreads
are utilized. Each chunk is aligned to each other chunk, sothreads
<=chunks
*chunks
results in full utilization. -
I want to test multiple thresholds and partitioning parameters - How can I do this efficiently ?
When constructing the graph, we only retain identities that are larger than the selectedthreshold
, as only those form relevant edges for partitioning the data. All other similarities are discarded as they are computed. To test multiple thresholds, the most efficient way is to first try the lowest threshold to be considered and save the edge list by specifying--save-checkpoint-path EDGELIST.csv
. In the next run, usegraphpart precomputed -ef EDGELIST.csv
to start directly from the previous alignment result. -
GraphPart starts with nicely balanced partitions, but after homology removal the sizes are very imbalanced.
By default, GraphPart tries to retain as many sequences as possible. In cases where the initialization clustering is far away from a valid solution (this happens when there are a lot of classes, with potentially small counts, and when there is high overall sequence similarity in the data), moving sequences between partitions will cause some partitions to grow large at the expense of others. You can try--no-moving
to prevent this behaviour. -
I want to use a different similarity metric than sequence identity. GraphPart can be used with any similarity or distance metric. To do so, you need to provide a list of precomputed pairwise similarities in the
precomputed
mode. The first two columns of the file should contain the sequence identifiers specified in the--fasta-file
. The third column should contain the similarity metric. The--metric-column
argument can be used to specify the column index. If you want to use a similarity metric, you need to specify a transformation using--transformation
. See the source for definitions. As an example, when operating with sequence identities ranging from 0 to 1, the transformationone-minus
yields corresponding distances. If your metric is a distance, you can use--transformation None
to skip the transformation step. -
I want to use a different alignment tool than EMBOSS needleall or MMseqs. This is supported by the precomputed mode. Please refer to the answer above.
-
How does the moving step decide to which partition to move a sequence to? After initialization of the partitions, GraphPart iteratively moves sequences between partitions and removes sequences from the data to achieve homology separation. For each sequence, we compute how many connections it has to sequences in each other partition. If there are partitions with more connections than the current partition, the sequence is moved to the partition with the maximum number of connections. If there is a tie in the number of connections, the sequence is moved to the partition that appeared first when iterating the underlying graph data structure. We do not explicitly control this order.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file graph-part-1.0.2.tar.gz
.
File metadata
- Download URL: graph-part-1.0.2.tar.gz
- Upload date:
- Size: 39.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.17
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 398ea49d514511eee6b833b5b780be95591f77915f4224fe68ef28ddf1fae8b7 |
|
MD5 | 7c84263630d6f281941a191e1548072e |
|
BLAKE2b-256 | 97a13479af737e18a06aca89434f08bfb4c9ccfe729766e952ac6d4670ac0f5e |
File details
Details for the file graph_part-1.0.2-py3-none-any.whl
.
File metadata
- Download URL: graph_part-1.0.2-py3-none-any.whl
- Upload date:
- Size: 40.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.17
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 4366ef1bd443cadaef23e99b9bd9e5344df7457bf7c68a3cd0145c7f33ccde00 |
|
MD5 | 3ab0ae042f69d08fca1a357373c9e6d8 |
|
BLAKE2b-256 | c5a7dec7022a955c0b7c83508bd63c2e9f0fbb289717bbaf1cb0fd639e7ec06a |