Boolean Network To Vector

Project description

bn2vec

Boolean Network embedding techniques and ML-based Boolean Network classification.

0. Introduction:

bn2vec is the result of research conducted from March to September 2021 as part of a larger project named BNediction. The research focused on developing new embedding techniques built specifically for Boolean Networks, with the aim of using these techniques to classify Boolean Networks and to develop a solid set of features able to explain the performance of a given BN.

The full master's thesis report, which wraps up the work done in this package, can be found in Master's Thesis; any details regarding how the embedding or the classification works are discussed in the report.

For a walk-through example, please check test.ipynb.

1. Setting up:

Step 1: create a new virtual environment.

python -m venv env

For a manual setup, install the packages from the requirements.txt file and then install bn2vec using pip:

pip install -r requirements.txt
pip install -e .

2. Config:

When creating a ConfigParser object, you will be asked to provide the path to your configuration file. The file should be a YAML file and must conform to the validation rules in order to be used; in the absence of a config file, a default (allow-all) file is used, see Default Config File.

Under the Memory section, six options are allowed:

  • memorize_dnf_graphs (resp. memorize_bn_graphs): if set to true, remembers the graph data generated from DNFs (resp. BNs).
  • memorize_dnf_sequences (resp. memorize_bn_sequences): if set to true, remembers the sequence data generated from DNFs (resp. BNs).
  • hard_memory: if set to true, stores the data generated from an ensemble of BNs on disk.
  • hard_memory_loc: the path of the folder used for hard_memory.
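Put together, the Memory options above might look like the following in a config file. This is a hedged sketch: the key names follow the bullets above, but the exact nesting and the values shown are illustrative assumptions, not defaults.

```yaml
# Illustrative Memory section (values are examples, not defaults)
Memory:
  memorize_dnf_graphs: true
  memorize_bn_graphs: false
  memorize_dnf_sequences: true
  memorize_bn_sequences: false
  hard_memory: true
  hard_memory_loc: "path/to/hard_memory_folder"
```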

Under the Embeddings section, we can specify any of the following options:

  • rsf: stands for Relaxed Structural Features; if specified, the system generates RSF features for the given ensemble of BNs.
  • lsf: stands for Lossy Structural Features; if specified, the system generates LSF features for the given ensemble of BNs.
  • ptrns: short for Patterns; if specified, the system generates PTRNS features for the given ensemble of BNs.
  • igf: stands for Influence Graph Features; if specified, the system generates IGF features for the given ensemble of BNs.
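An Embeddings section listing the four feature families could then look like this. Again, only a sketch: the option names come from the bullets above, but what value (if any) each key takes is an assumption, see the Default Config File for the authoritative format.

```yaml
# Illustrative Embeddings section: the feature families to generate
Embeddings:
  rsf: {}
  lsf: {}
  ptrns: {}
  igf: {}
```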

For more details about the rest of the file, please have a look at the Default Config File and the Full Report.

3. Embeddings:

Let us have a look at the different ways of using the feature engineering module.
Necessary imports:

from colomoto import minibn

from bn2vec.feature_engineering import Dnf2Vec, Bn2Vec, Ens2Mat
from bn2vec.utils import ConfigParser

In the case of using Dnf2Vec (embedding a single DNF) or Bn2Vec (embedding a single BN, i.e. an ensemble of DNFs), we have to tell the system to parse the config file ourselves:

ConfigParser.parse("path/to/configfile")

We use minibn.BooleanNetwork to parse Boolean Network files:

bn = minibn.BooleanNetwork("path/to/boolean_network")
BN = list(bn.items())
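For intuition, a Boolean Network here is essentially a mapping from each component to its update rule, so `list(bn.items())` produces (component, rule) pairs. A minimal stand-in illustrates the shape (the component names and rules below are toy values, not a real model):

```python
# Toy stand-in for a parsed Boolean Network: component -> update rule.
# Names and rules are illustrative only.
bn = {
    "a": "b & !c",
    "b": "a | c",
    "c": "!a",
}

# list(bn.items()) yields (component, rule) pairs; BN[0][0] is the
# first component's name and BN[0][1] its DNF expression.
BN = list(bn.items())
print(BN[0])  # ('a', 'b & !c')
```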

Then, when using Dnf2Vec, we can embed one of the BN's DNFs this way:

gen = Dnf2Vec(dnf=BN[0][1], comp_name=BN[0][0])
graphs, seqs, features = gen.generate_features()

The generate_features method returns three objects:

  • graphs (resp. seqs): a dictionary containing the graph (resp. sequence) data of the given DNF (if asked for).
  • features: a pandas Series object containing the final features extracted from the given DNF.

Likewise, we can embed the whole BN:

gen = Bn2Vec(BN)
bn_graphs, bn_seqs, dnfs_data, bn_features = gen.generate_features()

This time we have more complicated semi-structured data to look at:

  • bn_graphs (resp. bn_seqs): a dictionary containing the graph (resp. sequence) data of the given BN (if asked for).
  • dnfs_data: contains the DNF graphs, sequences, and features generated by Dnf2Vec for all DNFs in the given BN.
  • bn_features: a pandas Series object containing the final features extracted from the given BN.

If we want to embed an ensemble of BNs, we simply use Ens2Mat (ensemble to matrix):

gen = Ens2Mat(
    config_path='path/to/config_file',
    master_model_src='path/to/master_model'
)

X, Y = gen.vectorize_BNs(
    'path/to/base_directory',
    '',  # bundle file name (under base_directory)
    size='all'  # or an integer (the number of BNs to embed)
)
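The result is an ordinary feature matrix and target vector: one feature vector per embedded BN in X, and the matching target value in Y. A toy illustration of the shapes (the numbers are made up, not real bn2vec output):

```python
# Illustrative shapes only: X holds one feature vector per embedded BN,
# Y holds the matching target value for each BN.
X = [
    [3.0, 7.0, 0.5],  # features of BN 1
    [5.0, 4.0, 0.2],  # features of BN 2
]
Y = [0.91, 0.47]

print(len(X) == len(Y))  # True: one target per embedded BN
```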

4. Features Selector:

In order to use BnFeaturesSelector, we should import one extra module:

from bn2vec.feature_selection import BnFeaturesSelector

This module has three main methods:

  • drop_zero_variance_features: removes features without any fluctuation.
  • cluster_collinear_features_leiden: uses the Leiden algorithm to cluster features based on their collinearities, then selects the best representative feature from each cluster. This method is only useful in the case of LSF and RSF (mostly LSF, where eliminating collinearities is important, but deciding which features to remove is even more important).
  • correct_collinearity: takes a set of features and returns another set of features (with high collinearity with the input features) which are better explainable than the originals.

selector = BnFeaturesSelector(X, mode='lsf')
X = selector.drop_zero_variance_features()
X, clusters = selector.cluster_collinear_features_leiden(thresh=0.8)
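Conceptually, drop_zero_variance_features keeps only columns whose values actually vary. A stdlib-only sketch of that idea (the column names are illustrative, and this is not bn2vec's internal implementation):

```python
from statistics import pvariance

# Toy feature table: column name -> values across BNs (illustrative).
X = {
    "n_clauses": [3, 5, 4],
    "n_inputs":  [2, 2, 2],  # constant column: zero variance
}

# Keep only the columns that fluctuate.
X = {name: col for name, col in X.items() if pvariance(col) > 0}
print(sorted(X))  # ['n_clauses']
```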

The argument thresh is the threshold (minimal value) used to decide that two features are correlated; the quantity compared against it is the absolute value of the correlation between the two features.
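To make the thresholding concrete, here is a stdlib-only sketch of the quantity being compared against thresh. The `pearson` helper and the toy columns are illustrative; bn2vec's internal computation may differ.

```python
from math import sqrt

def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x)
                      * sum((b - my) ** 2 for b in y))

# Two toy feature columns that move together almost linearly.
f1 = [0.1, 0.4, 0.35, 0.8]
f2 = [1.0, 2.1, 1.9, 3.2]

# |correlation| is what gets compared against thresh.
r = abs(pearson(f1, f2))
print(r >= 0.8)  # True: this pair counts as collinear at thresh = 0.8
```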

5. Rules Extractor:

Necessary imports for using the rules extraction module:

import os

from bn2vec.utils import BnDataset
from bn2vec.rules_extraction import DTC, RulesExtractor

Creating a BnDataset object is necessary:

base_dir = 'path/to/base_directory'
BN = BnDataset(
    dataset_X=os.path.join(base_dir, 'path/to/X_file'),
    dataset_Y=os.path.join(base_dir, 'path/to/Y_file'),
    score_threshold=1
)

Then we can create our DTC (Decision Tree Classifier) object:

dtc = DTC(
    dataset=BN,
    save_dir="path/to/saving_directory",
    ensemble="ens1",
    embedding="ptrns"
)

The arguments 'ensemble' and 'embedding' are there just for naming conventions. To train deep decision tree classifiers, we use the train_deep_dtcs method:

dtc.train_deep_dtcs(test_size=0.3)

This will train a balanced and an unbalanced version of the tree, save both trees and their metrics in the save_dir folder, and print the metrics for visual inspection.
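As background for the balanced vs. unbalanced distinction: balanced tree training typically reweights each class inversely to its frequency, so the rarer class counts more per sample. A toy sketch of that standard reweighting (whether bn2vec uses exactly this scheme is an assumption):

```python
from collections import Counter

# Toy labels: class 1 is three times as frequent as class 0.
y = [1, 1, 1, 0]
counts = Counter(y)

# Common "balanced" weights: n_samples / (n_classes * n_class_samples).
weights = {c: len(y) / (len(counts) * n) for c, n in counts.items()}
print(weights[0])  # 2.0: the rare class is upweighted
```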

In order to extract useful rules from these trees, we should use the RulesExtractor class:

rule_extractor = RulesExtractor(
    dataset=BN,
    dtc="path/to/dtc",
)
rules = rule_extractor.extract_rules(
    thresh=0,
    tpr_weight=0.5,  # importance of the true positive rate
    tnr_weight=0.5   # importance of the true negative rate
)
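One way to read tpr_weight and tnr_weight: a rule's quality can be scored as a weighted combination of its true positive rate and true negative rate. The `rule_score` helper and the confusion counts below are hypothetical illustrations, not bn2vec's exact formula:

```python
# Hypothetical scoring helper mirroring tpr_weight / tnr_weight above.
def rule_score(tp, fn, tn, fp, tpr_weight=0.5, tnr_weight=0.5):
    tpr = tp / (tp + fn) if (tp + fn) else 0.0  # true positive rate
    tnr = tn / (tn + fp) if (tn + fp) else 0.0  # true negative rate
    return tpr_weight * tpr + tnr_weight * tnr

# A rule catching 8/10 positives and 9/10 negatives:
print(rule_score(tp=8, fn=2, tn=9, fp=1))  # roughly 0.85
```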

For training singleton decision trees (trees with a single split), we use train_singleton_dtcs:

rules = dtc.train_singleton_dtcs(
    test_size=0.3,
    balanced=False,
    thresh=0.5,
    tpr_weight=0.5,
    tnr_weight=0.5
)
