MASSA Algorithm is a Python package to separate data sets of molecules into training and test sets, considering the diversity of structural, physicochemical and biological characteristics of these molecules.

MASSA Algorithm

MASSA Algorithm: A tool for separating data sets of molecules into training and test sets. It was developed to prepare data sets for building prediction models in cheminformatics.

Version 1.0.0

  • Version 1.0.0 includes a code refactoring and adds the ability to split data into training, test, and validation subsets.

Version 2.0.0

  • Version 2.0.0 introduces algorithm changes to better handle large datasets. Specifically, for datasets with 10,000 molecules or more, Hierarchical Clustering Analysis (HCA) is automatically replaced by MiniBatch K-Means. This is the new default behavior, but it can be controlled using the flag -a / --large_dataset. Since dendrograms are generated from HCA, dendrogram plotting is not available for large datasets. In such cases, the flag -f / --dendrogram_plot is automatically disabled, regardless of user input.

  • This version introduces the ability to call the MASSA Algorithm from any Python script using MASSA_Algorithm.pyMASSA.py_massa(). However, instructions and documentation for doing so will be provided in future updates.

  • Version 2.0.0 also addresses the "NaN/None values in y" error, which was caused by duplicate molecule names. In this version, molecules with identical names are detected and renamed by appending their original index as a suffix to ensure uniqueness.

  • MASSA now preserves the input order of molecules in its output. However, molecules with chemical errors are skipped and excluded from the process.
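
The duplicate-name handling described above can be illustrated with a minimal sketch. This is not MASSA's internal code: the function name `make_names_unique`, the underscore separator, and the choice to rename every occurrence of a repeated name (including the first) are assumptions made for illustration only.

```python
from collections import Counter


def make_names_unique(names):
    """Append the original index to every name that occurs more than once.

    Hypothetical helper: the "_" separator and renaming all occurrences
    are assumptions, not the package's documented behavior.
    """
    counts = Counter(names)
    return [
        f"{name}_{idx}" if counts[name] > 1 else name
        for idx, name in enumerate(names)
    ]


# Input order is preserved; only the repeated name is changed.
names = ["CHEMBL25", "aspirin", "CHEMBL25", "caffeine"]
print(make_names_unique(names))
```

Because the suffix is the molecule's position in the input, each renamed molecule can still be traced back to its original record.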

Installation

MASSA Algorithm can be installed using pip:

pip install MASSA_Algorithm

To upgrade to the latest version (recommended), also use pip:

pip install --upgrade MASSA_Algorithm

Alternatively, you can build the latest development version from source:

git clone https://github.com/gcverissimo/MASSA_Algorithm.git
cd MASSA_Algorithm
python setup.py install

Requirements

  • python: >= 3.8;
  • rdkit;
  • numpy: < 2.0;
  • pandas;
  • matplotlib: >= 3.2;
  • scipy: >= 1.6;
  • scikit-learn: > 0.24;
  • kmodes:¹ >= 0.10.

Newer versions of these packages may also work, but they have not all been tested. NOTE: MASSA has also been tested with scikit-learn 1.7.0, scipy 1.16.1, numpy 2.0.2, and rdkit 2025.03.5.

Usage

Once installed, the program can be run directly from the command line:

MASSA_Algorithm -i <input_file>.sdf -o <output_file>.sdf

Required arguments:

  • Input file: -i or --input.
    • MASSA Algorithm accepts input files in the formats: .sdf, .mol, .mol2, .xlsx, .xls, and .csv. However, the .sdf format is preferred. Notes:
      • .mol2 files have limitations in storing molecular properties and follow different saving patterns. Due to this, we only support .mol2 files that are generated by Discovery Studio Visualizer.
      • For .xlsx, .xls, and .csv:
        • MASSA will look for a column with smiles in the name and use it as the source of input molecules.
        • MASSA will look for a column with one of the following names in the priority order: Molecule name, Name, Molecule ChEMBL ID, or ID. If none of these columns are found, it will use the molecule index.
  • Output file: -o or --output.
    • Enter the output file name or file path. Image files will be saved to a folder within the same directory as the output file.
    • It is highly recommended to use an .sdf file to avoid errors.
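
For tabular input (.xlsx, .xls, .csv), the column lookup described above could look roughly like the sketch below. It is an illustration, not the package's code: the helper name `pick_columns`, the case-insensitive match on "smiles", and the exact-match comparison for the name columns are all assumptions.

```python
# Priority order for the molecule-name column, as listed above.
NAME_COLUMN_PRIORITY = ["Molecule name", "Name", "Molecule ChEMBL ID", "ID"]


def pick_columns(columns):
    """Return (smiles_column, name_column) for a list of column headers.

    Hypothetical helper. name_column is None when no known name column
    exists, meaning the molecule index would be used instead.
    """
    # Assumption: the match on "smiles" ignores letter case.
    smiles_col = next((c for c in columns if "smiles" in c.lower()), None)
    # Assumption: name columns must match exactly.
    name_col = next((c for c in NAME_COLUMN_PRIORITY if c in columns), None)
    return smiles_col, name_col


print(pick_columns(["ID", "Canonical SMILES", "Name", "pIC50"]))
print(pick_columns(["smiles", "pIC50"]))
```

In the first call both "Name" and "ID" are present, so "Name" wins because it comes earlier in the priority order.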

Optional arguments include:

  • Splitting strategy: -y or --splitting_strategy.
    • Defines the splitting strategy, either into training and test sets or into training, test, and validation sets.
    • Options = 'tt', 'tt-val'.
    • 'tt' splits the data into training and test sets. 'tt-val' splits the data into training, test, and validation sets.
    • Default = 'tt'.
  • Percentage of molecules in training set: -p or --percentage_of_training.
    • Fraction of molecules assigned to the training set. Must be a number from 0 to 1 (e.g., 0.8 for 80%).
    • Default = 0.8.
  • Number of biological activities for separation: -b or --number_of_biological.
    • Number of biological activities that will be used to separate the set into training and test.
    • Default = 1.
  • Name of biological activities for separation: -s or --the_biological_activities.
    • The biological activity(ies) or other y-property(ies) used for QSAR or machine learning modeling. The algorithm can handle one or more properties. Enter a list with the names of biological activities separated by commas and no spaces. These properties must be represented as either integers or floating-point numbers. If your dataset is represented in classes (e.g., active or inactive; soluble, partially soluble, or insoluble), you should represent them as integers (e.g., 0 or 1; 1, 2, or 3).
    • Example: MASSA_Algorithm -i <input_file>.sdf -o <output_file>.sdf -s pIC50,pMIC.
    • Default = If not entered directly on the command line, it will be requested during algorithm execution.
  • Number of principal components in PCA: -n or --number_of_PCs.
    • Defines the number of principal components used to reduce the dimensionality of the variables in the biological, physicochemical, and structural domains. If the value is a decimal between 0 and 1, the number of principal components is the number needed to explain (<input number> * 100)% of the variance. If the value is greater than 1, the number of PCs will be exactly the input integer, but PAY ATTENTION:

      1. If the number of PCs is an integer and equal to or greater than the number of physicochemical properties (7), the PCA step will be bypassed for this domain.
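      2. The same applies to the biological domain: if the number of PCs is an integer equal to or greater than the number of biological activities, the PCA step will be bypassed for this domain.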
      3. If the number of biological activities is less than 3, the PCA step will be bypassed for this domain.
    • Default = 0.85.

  • SVD solver parameter for PCA: -v or --svd_solver_for_PCA.
  • HCA linkage method: -l or --linkage_method.
  • Extension of image files: -t or --image_type.
    • Extension of the image files that will be generated. Suggested = png or svg.
    • Default = png.
  • Font size for X-axis of dendrograms: -d or --dendrogram_Xfont_size.
    • Sets the font size on the x-axis of the dendrogram (molecule labels).
    • Default = 5.
  • Font size for X-axis of bar plots: -x or --barplot_Xfont_size.
    • Sets the font size on the x-axis of the bar plot (cluster labels).
    • Default = 12.
  • Enable Dendrogram plot: -f or --dendrogram_plot.
    • Defines whether or not dendrogram images will be generated.
    • Options = true (dendrogram will be generated), false (dendrogram will not be generated).
    • Default = true.
  • Ignore Errors: -e or --drop_errors.
    • Ignore chemistry errors, saving only molecules without any errors.
    • Options = true (ignore molecule errors and log them in the log file), false (treat molecule errors as fatal and fail the execution).
    • Default = true.
  • Large Dataset: -a or --large_dataset.
    • Switch the MASSA algorithm from HCA to MiniBatch-KMeans to handle large datasets.
    • Options = auto (switches to KMeans if ≥ 10,000 molecules), false (use HCA, for smaller datasets), true (use KMeans, for large datasets).
    • Default = auto.
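
Two of the options above, -n and -a, take values whose effect depends on the data. The sketch below restates those rules as plain Python so they are easier to check. The function names, the returned labels, and the treatment of a value of exactly 1 as an integer count are assumptions; this is not the package's implementation.

```python
LARGE_DATASET_THRESHOLD = 10_000  # molecule count at which "auto" switches


def pca_components(n_pcs, n_variables, is_biological=False):
    """Return the number of PCs (int) or variance fraction (float) to use,
    or None when the PCA step is bypassed for the domain.

    Hypothetical helper. n_variables is the number of variables in the
    domain (7 for the physicochemical domain, per the text above).
    """
    # Rule 3: fewer than 3 biological activities bypasses PCA.
    if is_biological and n_variables < 3:
        return None
    # Integer input: use exactly that many PCs (assumption: 1 counts as
    # an integer), unless it meets or exceeds the variable count.
    if n_pcs >= 1 and float(n_pcs).is_integer():
        n_pcs = int(n_pcs)
        return None if n_pcs >= n_variables else n_pcs
    # Decimal input: fraction of the variance to be explained.
    if 0 < n_pcs < 1:
        return float(n_pcs)
    raise ValueError("n_pcs must be in (0, 1) or a positive integer")


def clustering_method(mode, n_molecules):
    """Map the -a value to a clustering method label (labels are made up)."""
    mode = str(mode).lower()
    if mode == "auto":
        large = n_molecules >= LARGE_DATASET_THRESHOLD
        return "minibatch-kmeans" if large else "hca"
    if mode == "true":
        return "minibatch-kmeans"
    if mode == "false":
        return "hca"
    raise ValueError("mode must be 'auto', 'true', or 'false'")


print(pca_components(0.85, 7))                     # variance fraction
print(pca_components(7, 7))                        # bypassed
print(pca_components(0.85, 2, is_biological=True)) # bypassed
print(clustering_method("auto", 12_500))
```

With the defaults (-n 0.85, -a auto), a data set with a single biological activity would skip PCA for the biological domain and use HCA unless it has 10,000 molecules or more.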

Command line help

A full description of the arguments can also be viewed directly from the command line using the command:

MASSA_Algorithm -h

or

MASSA_Algorithm --help
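
The command-line interface can also be driven from a Python script by assembling the documented flags and passing them to `subprocess`. The sketch below only builds and prints the command. The helper name `build_massa_command` and the file names are hypothetical, and it assumes that `MASSA_Algorithm` is on the PATH and that -b should equal the number of names passed to -s.

```python
import shlex


def build_massa_command(input_file, output_file, activities=None,
                        strategy="tt", training_fraction=0.8):
    """Assemble a MASSA_Algorithm command line from documented flags.

    Hypothetical helper: only -i, -o, -y, -p, -b and -s are used.
    """
    cmd = [
        "MASSA_Algorithm",
        "-i", input_file,
        "-o", output_file,
        "-y", strategy,
        "-p", str(training_fraction),
    ]
    if activities:
        # Assumption: -b should match the number of activities named in -s.
        cmd += ["-b", str(len(activities))]
        # -s expects names separated by commas with no spaces.
        cmd += ["-s", ",".join(activities)]
    return cmd


cmd = build_massa_command("molecules.sdf", "split.sdf",
                          activities=["pIC50", "pMIC"])
print(shlex.join(cmd))

# To actually run it (requires MASSA_Algorithm to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids shell-quoting problems with file paths that contain spaces.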

Cite

  1. To ensure accurate citation, kindly include a reference to the MASSA article, accessible via the DOI: Veríssimo, G. C.; Pantaleão, S. Q.; Fernandes, P. O.; Gertrudes, J. C.; Kronenberger, T.; Honorio, K. M.; Maltarollo, V. G. MASSA Algorithm: an automated rational sampling of training and test subsets for QSAR modeling. J. Comput. Aided Mol. Des. 2023, 37, 735–754. https://doi.org/10.1007/s10822-023-00536-y.

  2. Furthermore, please incorporate the program citation following the provided template:

@Misc{veríssimo2021,
    author = {Gabriel Corrêa Veríssimo},
    title = {MASSA Algorithm: Molecular data set sampling for training-test separation},
    howpublished = {\url{https://github.com/gcverissimo/MASSA_Algorithm}},
    year = {2021}
  }

References

[1]: DE VOS, N. J. kmodes categorical clustering library. https://github.com/nicodv/kmodes. 2015-2021.
