Unsupervised Graph Analysis Framework.

NEExT

Network Embedding Exploration Tool

NEExT is a tool for exploring and building graph embeddings. This tool allows for:

  • Cleansing and standardizing a collection of graph data.
  • Creating node and structural features for nodes in the graph collection.
  • Creating embeddings for graphs.

Installation Process

NEExT uses Python 3.x (currently tested with Python 3.11). You can install it with pip:

pip install NEExT

Graph Data Format

You can load data into NEExT from a few different formats. Currently, it supports:

  • CSV files
  • NetworkX Objects (coming soon)

See below for examples of using different data formats.

Using CSV Files

Data can be categorized into the following groups:

  • Edge File (captures which nodes are connected to which nodes)
  • Node Graph Mapping (captures which node belongs to which graph)
  • Graph Label Mapping [optional] (captures labels for each graph)
  • Node Features [optional] (captures the features for each node)

Below we show examples of how each of the above files should be formatted:

Edge File:

node_a node_b
1 2
3 2
. .

Node Graph Mapping:

node_id graph_id
0 1
1 1
2 1
3 2
4 2
. .

Graph Label Mapping:

graph_id graph_label
0 0
1 0
2 1
3 0
4 1
. .

Node Features:

node_id node_feat_0 node_feat_1 ...
0 0.34 3.2 .
1 0.1 2.9 .
2 1.9 1.3 .
3 0.0 2.2 .
4 11.2 12.3 .
. . . .

Note that NEExT cannot handle non-numerical features, so some feature engineering on the node features must be done by the end user. Data standardization, however, is handled by NEExT.
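For example, a categorical node attribute could be one-hot encoded with pandas before writing the Node Features file. Below is a minimal sketch; the column names are hypothetical and not part of NEExT's API:

import pandas as pd

# Hypothetical raw node data with a non-numerical (categorical) column.
nodes = pd.DataFrame({
    "node_id": [0, 1, 2, 3],
    "node_feat_0": [0.34, 0.1, 1.9, 0.0],
    "community": ["a", "b", "a", "c"],  # non-numerical: must be encoded
})

# One-hot encode the categorical column so every feature is numeric.
nodes = pd.get_dummies(nodes, columns=["community"], dtype=float)

# Write the node features file in the format NEExT expects.
nodes.to_csv("node_features.csv", index=False)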

NEExT Tutorial [Getting Started]

In this notebook, we showcase how to use NEExT to analyze graph embeddings.

from NEExT.NEExT import NEExT

The following are links to some graph data, which we will use in this tutorial. Note that this dataset includes graph labels, which are optional in NEExT. The datasets were generated using the ABCD Framework found here (https://github.com/bkamins/ABCDGraphGenerator.jl).

Loading Data

First we define paths to the datasets. They are CSV files, formatted as described above.

edge_file = "https://raw.githubusercontent.com/elmspace/ugaf_experiments_data/main/abcd/xi_n/edge_file.csv"
graph_label_file = "https://raw.githubusercontent.com/elmspace/ugaf_experiments_data/main/abcd/xi_n/graph_label_mapping_file.csv"
node_graph_mapping_file = "https://raw.githubusercontent.com/elmspace/ugaf_experiments_data/main/abcd/xi_n/node_graph_mapping_file.csv"

Now we can instantiate a NEExT object.

nxt = NEExT(quiet_mode="on")

You can load data using the load_data_from_csv method:

nxt.load_data_from_csv(edge_file=edge_file, node_graph_mapping_file=node_graph_mapping_file, graph_label_file=graph_label_file)

Building Features

You can now compute various features on the nodes of the graphs in the collection loaded above.
This can be done using the method compute_graph_feature.
To get the list of available node features, you can use the function get_list_of_graph_features.

nxt.get_list_of_graph_features()
['lsme',
 'self_walk',
 'basic_expansion',
 'basic_node_features',
 'page_rank',
 'degree_centrality',
 'closeness_centrality',
 'load_centrality',
 'eigenvector_centrality']

These are the types of node features you can compute on every node of each graph in the collection.
For example, let's compute page_rank. We also need to define what the feature vector size should be.

nxt.compute_graph_feature(feat_name="page_rank", feat_vect_len=4)

To compute additional features, simply call the same method again with the desired vector length.
Let's add degree_centrality to the list of computed features.

nxt.compute_graph_feature(feat_name="degree_centrality", feat_vect_len=4)

Building Global Feature Object

Right now, we have 2 features computed on every node of every graph. We can use these features to construct an overall pooled feature vector, which can then be used to build graph embeddings.
To do this, we pool the features using the pool_graph_features method.

nxt.pool_graph_features(pool_method="concat")

The overall feature (which we call the global feature) is a concatenation of whatever features you have computed on the graph; in this example it is an 8-dimensional vector of page_rank and degree_centrality features.
You can access the global vector using the get_global_feature_vector method.

df = nxt.get_global_feature_vector()
df.head(3)
   node_id  graph_id  feat_degree_centrality_0  feat_degree_centrality_1  feat_degree_centrality_2  feat_degree_centrality_3  feat_page_rank_0  feat_page_rank_1  feat_page_rank_2  feat_page_rank_3
0        0         0                  4.094288                  1.632019                  1.723672                  2.023497          4.014656          1.645432          1.825315          2.003575
1        1         0                  2.682074                  2.024244                  1.689427                  2.023497          2.651835          1.999918          1.745939          2.042548
2        2         0                  2.682074                  1.915292                  1.578132                  2.120736          2.672592          1.917080          1.696518          2.058271

Dimensionality Reduction

We may wish to reduce the number of dimensions of our data, which could help downstream tasks such as embedding generation or machine learning. This can be done using the apply_dim_reduc_to_graph_feats method.

nxt.apply_dim_reduc_to_graph_feats(dim_size=4, reducer_type="pca")

If we take a look at the global feature vector, we can see that it has been updated with the new dimensionality.

df = nxt.get_global_feature_vector()
df.head()
   node_id  graph_id    feat_0    feat_1    feat_2    feat_3
0        0         0  2.471714  3.577450  0.394070  0.779143
1        1         0  2.232913  1.420164  0.969629  0.912235
2        2         0  2.202837  1.494916  0.809437  1.537148
3        3         0  2.102230  0.403983  0.199739 -0.931054
4        4         0  2.164103  0.202613  2.194223  3.052554

You still have access to the pre-dimensionality-reduction global feature vector via the get_archived_global_feature_vector method.

df = nxt.get_archived_global_feature_vector()
df.head()
   node_id  graph_id  feat_degree_centrality_0  feat_degree_centrality_1  feat_degree_centrality_2  feat_degree_centrality_3  feat_page_rank_0  feat_page_rank_1  feat_page_rank_2  feat_page_rank_3
0        0         0                  4.094288                  1.632019                  1.723672                  2.023497          4.014656          1.645432          1.825315          2.003575
1        1         0                  2.682074                  2.024244                  1.689427                  2.023497          2.651835          1.999918          1.745939          2.042548
2        2         0                  2.682074                  1.915292                  1.578132                  2.120736          2.672592          1.917080          1.696518          2.058271
3        3         0                  1.975967                  1.993115                  2.082671                  1.851304          1.968745          1.937933          2.028736          1.879435
4        4         0                  1.975967                  2.491178                  1.355541                  2.346133          1.940827          2.407239          1.384500          2.274468

Building Graph Embeddings

The getter methods above return a Pandas DataFrame with the collection's features and how they map to the graphs and nodes.
One thing to note is that the data is standardized across all graphs.

We can use the features computed on the graphs to build graph embeddings. To see what graph embedding engines are available to use, we can use the get_list_of_graph_embedding_engines function.

nxt.get_list_of_graph_embedding_engines()
['approx_wasserstein', 'wasserstein', 'sinkhornvectorizer']

Now, let's build a 3-dimensional embedding for every graph in the graph collection using the Approximate Wasserstein embedding engine. This can be done using the build_graph_embedding method.

nxt.build_graph_embedding(emb_dim_len=3, emb_engine="approx_wasserstein")

You can access the embedding results by using the method get_graph_embeddings.

df = nxt.get_graph_embeddings()
df.head()
      emb_0     emb_1     emb_2  graph_id
0  2.038486  1.463379  0.080776         0
1  0.874913  1.535265  0.475480         1
2  0.021950  0.849217 -0.418307         2
3 -0.726050  0.750470 -0.317739         3
4 -1.313531  0.656964  0.077666         4

Visualize Embeddings

You can use the built-in visualization function to gain quick insight into the quality of your embeddings. This can be done using the visualize_graph_embedding method. If you have labels for your graphs (as we do here), the embedding distributions can be colored by label; by default, embeddings are not colored.
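A minimal sketch of a call is shown below; note that the color_by argument is an assumption for illustration and may not match the actual signature:

# Plot the graph embeddings; coloring by label is assumed to be supported.
nxt.visualize_graph_embedding(color_by="graph_label")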

Using Sampled Sub-Graphs

We often have to deal with large graphs, both in the number of sub-graphs in the collection and in the size of each graph. To allow for faster computation, we can sample each sub-graph and compute metrics and features for only a fraction of the nodes. This can be done using the build_node_sample_collection method, which takes as input the fraction of nodes to sample. Once this method is called, all further computation uses the sampled node collection.

nxt.build_node_sample_collection(sample_rate=0.1)
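Putting it all together, an end-to-end run with sampling enabled could look like the sketch below. It uses only the methods covered in this tutorial; calling build_node_sample_collection before computing features is an assumption about ordering based on the note above:

from NEExT.NEExT import NEExT

# CSV locations from the "Loading Data" section above.
edge_file = "https://raw.githubusercontent.com/elmspace/ugaf_experiments_data/main/abcd/xi_n/edge_file.csv"
graph_label_file = "https://raw.githubusercontent.com/elmspace/ugaf_experiments_data/main/abcd/xi_n/graph_label_mapping_file.csv"
node_graph_mapping_file = "https://raw.githubusercontent.com/elmspace/ugaf_experiments_data/main/abcd/xi_n/node_graph_mapping_file.csv"

nxt = NEExT(quiet_mode="on")
nxt.load_data_from_csv(edge_file=edge_file, node_graph_mapping_file=node_graph_mapping_file, graph_label_file=graph_label_file)

# Optional: compute on a 10% node sample of each sub-graph for speed.
nxt.build_node_sample_collection(sample_rate=0.1)

# Compute node features, pool them, reduce dimensionality, and embed.
nxt.compute_graph_feature(feat_name="page_rank", feat_vect_len=4)
nxt.compute_graph_feature(feat_name="degree_centrality", feat_vect_len=4)
nxt.pool_graph_features(pool_method="concat")
nxt.apply_dim_reduc_to_graph_feats(dim_size=4, reducer_type="pca")
nxt.build_graph_embedding(emb_dim_len=3, emb_engine="approx_wasserstein")

embeddings = nxt.get_graph_embeddings()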
