Unsupervised Graph Analysis Framework.
NEExT
Network Embedding Exploration Tool
NEExT is a tool for exploring and building graph embeddings. This tool allows for:
- Cleansing and standardizing a collection of graph data.
- Creating node and structural features for nodes in the graph collection.
- Creating embeddings for graphs.
Installation Process
NEExT uses Python 3.x (currently tested using Python 3.11). You can install NEExT using the following:
pip install NEExT
Graph Data Format
You can use a few different data formats to upload data into NEExT. Currently, it allows for:
- CSV files
- NetworkX Objects (coming soon)
See below for examples of using different data formats.
Using CSV Files
Data can be categorized into the following groups:
- Edge File (captures which nodes are connected to which nodes)
- Node Graph Mapping (captures which node belongs to which graph)
- Graph Label Mapping [optional] (captures labels for each graph)
- Node Features [optional] (captures the features for each node)
Below we show examples of how each of the above files should be formatted:
Edge File:
node_a | node_b |
---|---|
1 | 2 |
3 | 2 |
. | . |
Node Graph Mapping:
node_id | graph_id |
---|---|
0 | 1 |
1 | 1 |
2 | 1 |
3 | 2 |
4 | 2 |
. | . |
Graph Label Mapping:
graph_id | graph_label |
---|---|
0 | 0 |
1 | 0 |
2 | 1 |
3 | 0 |
4 | 1 |
. | . |
Node Features:
node_id | node_feat_0 | node_feat_1 | ... |
---|---|---|---|
0 | 0.34 | 3.2 | . |
1 | 0.1 | 2.9 | . |
2 | 1.9 | 1.3 | . |
3 | 0.0 | 2.2 | . |
4 | 11.2 | 12.3 | . |
. | . | . | . |
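As a minimal sketch of producing files in this layout, the snippet below writes three of the CSV files with the Python standard library and checks that every node used in an edge has a graph assignment. The column names come from the tables above and the file names mirror those used later in the tutorial; the toy values, the `write_csv` helper, and the temporary output directory are assumptions made for illustration.

```python
import csv
import tempfile
from pathlib import Path

# Toy data using the column names from the tables above (values are made up).
edges = [(0, 1), (1, 2), (3, 4)]
node_graph = [(0, 1), (1, 1), (2, 1), (3, 2), (4, 2)]
graph_labels = [(1, 0), (2, 1)]


def write_csv(path, header, rows):
    """Write a header row followed by data rows to a CSV file (hypothetical helper)."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)


out_dir = Path(tempfile.mkdtemp())
write_csv(out_dir / "edge_file.csv", ["node_a", "node_b"], edges)
write_csv(out_dir / "node_graph_mapping_file.csv", ["node_id", "graph_id"], node_graph)
write_csv(out_dir / "graph_label_mapping_file.csv", ["graph_id", "graph_label"], graph_labels)

# Sanity check: every node that appears in an edge has a graph assignment.
mapped_nodes = {node for node, _ in node_graph}
unmapped = {node for edge in edges for node in edge} - mapped_nodes
```

An empty `unmapped` set means the edge file and the node graph mapping are consistent with each other.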
Note that NEExT cannot handle non-numerical features, so the end-user must convert any such node features to numbers before loading them. NEExT does, however, standardize the data itself.
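One common way to turn a non-numerical node feature into numbers is one-hot encoding. The sketch below is only an illustration of that preprocessing step, done before the data reaches NEExT; the `role` column, its values, and the `one_hot_encode` helper are hypothetical and not part of NEExT.

```python
def one_hot_encode(rows, column):
    """Replace a categorical column with one 0/1 column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every column except the categorical one.
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{column}_{category}"] = 1.0 if row[column] == category else 0.0
        encoded.append(new_row)
    return encoded


# Hypothetical node features: "role" is non-numerical and must be encoded.
node_features = [
    {"node_id": 0, "node_feat_0": 0.34, "role": "hub"},
    {"node_id": 1, "node_feat_0": 0.10, "role": "leaf"},
    {"node_id": 2, "node_feat_0": 1.90, "role": "leaf"},
]
numeric_features = one_hot_encode(node_features, "role")
```

After encoding, each row holds only numeric columns (`role_hub`, `role_leaf`) alongside `node_id` and can be written to the Node Features file.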
NEExT Tutorial [Getting Started]
In this notebook, we showcase how to use NEExT to analyze graph embeddings.
from NEExT.NEExT import NEExT
The following are links to some graph data, which we will use in this tutorial. Note that this dataset includes Graph Labels, which are optional when using NEExT. The datasets were generated using the ABCD Framework found here (https://github.com/bkamins/ABCDGraphGenerator.jl)
Loading Data
First we define the paths to the datasets. They are CSV files, formatted as described in the README file.
edge_file = "https://raw.githubusercontent.com/elmspace/ugaf_experiments_data/main/abcd/xi_n/edge_file.csv"
graph_label_file = "https://raw.githubusercontent.com/elmspace/ugaf_experiments_data/main/abcd/xi_n/graph_label_mapping_file.csv"
node_graph_mapping_file = "https://raw.githubusercontent.com/elmspace/ugaf_experiments_data/main/abcd/xi_n/node_graph_mapping_file.csv"
Now we can instantiate a NEExT object.
nxt = NEExT(quiet_mode="on")
You can load data using the load_data_from_csv method:
nxt.load_data_from_csv(edge_file=edge_file, node_graph_mapping_file=node_graph_mapping_file, graph_label_file=graph_label_file)
Building Features
You can now compute various features on the nodes of the subgraphs in the graph collection loaded above.
This can be done using the method compute_graph_feature.
To get the list of available node features, you can use the function get_list_of_graph_features.
nxt.get_list_of_graph_features()
['lsme',
'self_walk',
'basic_expansion',
'basic_node_features',
'page_rank',
'degree_centrality',
'closeness_centrality',
'load_centrality',
'eigenvector_centrality']
These are the types of node features you can compute on every node of each graph in the graph collection.
So for example, let's compute page_rank. We also need to define what the feature vector size should be.
nxt.compute_graph_feature(feat_name="page_rank", feat_vect_len=4)
To compute additional features, simply use the same function and provide the feature vector length.
Let's add degree centrality to the list of computed features.
nxt.compute_graph_feature(feat_name="degree_centrality", feat_vect_len=4)
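For intuition about what a node feature such as degree_centrality measures, here is a minimal pure-Python sketch using the standard definition (a node's degree divided by n - 1, where n is the number of nodes). It is not NEExT's implementation, and it does not reproduce the feat_vect_len expansion, whose internals are not described here; the `degree_centrality` helper and the toy edge list are hypothetical.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Return degree / (n - 1) for every node of an undirected graph."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    n = len(neighbors)
    if n <= 1:
        # Degenerate case: no other node to connect to.
        return {node: 0.0 for node in neighbors}
    return {node: len(adjacent) / (n - 1) for node, adjacent in neighbors.items()}


# Toy star graph: node 0 is connected to nodes 1, 2, and 3.
toy_edges = [(0, 1), (0, 2), (0, 3)]
centrality = degree_centrality(toy_edges)
```

In the star graph the central node is connected to every other node, so it gets the maximum score, while each leaf is connected to only one of its three possible neighbors.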
Building Global Feature Object
Right now, we have 2 features computed on every node, for every graph. We can use these features to construct an overall pooled feature vector, which can be used to construct graph embeddings.
To do this, we can pool the features using the pool_graph_features method.
nxt.pool_graph_features(pool_method="concat")
The overall feature (which we call the global feature) is a concatenated vector of whatever features you have computed on the graph. In this example it is an 8-dimensional vector made up of page_rank and degree_centrality.
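The concatenation described above can be pictured with a small pure-Python sketch: two 4-dimensional per-node vectors are joined into one 8-dimensional vector. This only illustrates the idea of "concat" pooling and is not NEExT's code; the `concat_pool` helper and the feature values are made up.

```python
# Hypothetical per-node feature vectors, keyed by node_id (values made up).
page_rank_feats = {0: [4.0, 1.6, 1.8, 2.0], 1: [2.7, 2.0, 1.7, 2.0]}
degree_feats = {0: [4.1, 1.6, 1.7, 2.0], 1: [2.7, 2.0, 1.7, 2.0]}


def concat_pool(*feature_tables):
    """Concatenate each node's vectors from several features into one vector."""
    shared_nodes = set(feature_tables[0])
    for table in feature_tables[1:]:
        shared_nodes &= set(table)
    return {
        node: [value for table in feature_tables for value in table[node]]
        for node in sorted(shared_nodes)
    }


global_features = concat_pool(degree_feats, page_rank_feats)
```

Each node ends up with 4 + 4 = 8 values, with the degree centrality entries first and the page rank entries after them.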
You can access the global vector by using the get_global_feature_vector method.
df = nxt.get_global_feature_vector()
df.head(3)
index | node_id | graph_id | feat_degree_centrality_0 | feat_degree_centrality_1 | feat_degree_centrality_2 | feat_degree_centrality_3 | feat_page_rank_0 | feat_page_rank_1 | feat_page_rank_2 | feat_page_rank_3 |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 4.094288 | 1.632019 | 1.723672 | 2.023497 | 4.014656 | 1.645432 | 1.825315 | 2.003575 |
1 | 1 | 0 | 2.682074 | 2.024244 | 1.689427 | 2.023497 | 2.651835 | 1.999918 | 1.745939 | 2.042548 |
2 | 2 | 0 | 2.682074 | 1.915292 | 1.578132 | 2.120736 | 2.672592 | 1.917080 | 1.696518 | 2.058271 |
Dimensionality Reduction
We may wish to reduce the number of dimensions of our data, which can help downstream tasks such as embedding generation or machine learning. This can be done using the apply_dim_reduc_to_graph_feats method.
nxt.apply_dim_reduc_to_graph_feats(dim_size=4, reducer_type="pca")
If we take a look at the global feature vector, we can see that it has been updated to the new number of dimensions.
df = nxt.get_global_feature_vector()
df.head()
index | node_id | graph_id | feat_0 | feat_1 | feat_2 | feat_3 |
---|---|---|---|---|---|---|
0 | 0 | 0 | 2.471714 | 3.577450 | 0.394070 | 0.779143 |
1 | 1 | 0 | 2.232913 | 1.420164 | 0.969629 | 0.912235 |
2 | 2 | 0 | 2.202837 | 1.494916 | 0.809437 | 1.537148 |
3 | 3 | 0 | 2.102230 | 0.403983 | 0.199739 | -0.931054 |
4 | 4 | 0 | 2.164103 | 0.202613 | 2.194223 | 3.052554 |
You still have access to the pre-dimensionality reduction global vector by using the method get_archived_global_feature_vector.
df = nxt.get_archived_global_feature_vector()
df.head()
index | node_id | graph_id | feat_degree_centrality_0 | feat_degree_centrality_1 | feat_degree_centrality_2 | feat_degree_centrality_3 | feat_page_rank_0 | feat_page_rank_1 | feat_page_rank_2 | feat_page_rank_3 |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 4.094288 | 1.632019 | 1.723672 | 2.023497 | 4.014656 | 1.645432 | 1.825315 | 2.003575 |
1 | 1 | 0 | 2.682074 | 2.024244 | 1.689427 | 2.023497 | 2.651835 | 1.999918 | 1.745939 | 2.042548 |
2 | 2 | 0 | 2.682074 | 1.915292 | 1.578132 | 2.120736 | 2.672592 | 1.917080 | 1.696518 | 2.058271 |
3 | 3 | 0 | 1.975967 | 1.993115 | 2.082671 | 1.851304 | 1.968745 | 1.937933 | 2.028736 | 1.879435 |
4 | 4 | 0 | 1.975967 | 2.491178 | 1.355541 | 2.346133 | 1.940827 | 2.407239 | 1.384500 | 2.274468 |
Building Graph Embeddings
The global feature methods shown above return a pandas DataFrame containing the collected features and how they map to the graphs and nodes.
Note that the data is standardized across all graphs.
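To make "standardized across all graphs" concrete, the sketch below z-scores each feature column using statistics computed over the nodes of every graph together, rather than per graph. Which scaler NEExT actually applies is not stated here, so z-scoring is an assumption, and the `standardize_columns` helper and values are hypothetical.

```python
from statistics import mean, pstdev


def standardize_columns(rows):
    """Z-score each column: subtract the column mean, divide by its std."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # A constant column has zero spread; fall back to 1.0 to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Rows are nodes pooled from all graphs; columns are features (made-up values).
features = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
scaled = standardize_columns(features)
```

After scaling, the first column is centered on zero, and the constant second column maps to zeros.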
We can use the features computed on the graphs to build graph embeddings. To see what graph embedding engines are available to use, we can use the get_list_of_graph_embedding_engines function.
nxt.get_list_of_graph_embedding_engines()
['approx_wasserstein', 'wasserstein', 'sinkhornvectorizer']
Now, let's build a 3-dimensional embedding for every graph in the graph collection using the Approximate Wasserstein embedding engine. This can be done using the method build_graph_embedding.
nxt.build_graph_embedding(emb_dim_len=3, emb_engine="approx_wasserstein")
You can access the embedding results by using the method get_graph_embeddings.
df = nxt.get_graph_embeddings()
df.head()
index | emb_0 | emb_1 | emb_2 | graph_id |
---|---|---|---|---|
0 | 2.038486 | 1.463379 | 0.080776 | 0 |
1 | 0.874913 | 1.535265 | 0.475480 | 1 |
2 | 0.021950 | 0.849217 | -0.418307 | 2 |
3 | -0.726050 | 0.750470 | -0.317739 | 3 |
4 | -1.313531 | 0.656964 | 0.077666 | 4 |
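The engine names above refer to the Wasserstein (optimal transport) distance, which compares two distributions of values. As a rough illustration of that underlying idea only, and not of how NEExT's engines are implemented, the sketch below computes the 1-Wasserstein distance between two equal-size one-dimensional samples, which reduces to the mean absolute difference of the sorted values. The `wasserstein_1d` helper and the sample values are hypothetical.

```python
def wasserstein_1d(sample_a, sample_b):
    """1-Wasserstein distance between two equal-size, non-empty 1-D samples."""
    if not sample_a or len(sample_a) != len(sample_b):
        raise ValueError("this sketch only handles non-empty, equal-size samples")
    # For equal-size samples, the optimal matching pairs values in sorted order.
    pairs = zip(sorted(sample_a), sorted(sample_b))
    return sum(abs(a - b) for a, b in pairs) / len(sample_a)


# Made-up values of one node feature in three small graphs.
graph_a = [0.1, 0.2, 0.3, 0.4]
graph_b = [0.2, 0.3, 0.4, 0.5]  # graph_a shifted by 0.1
graph_c = [1.1, 1.2, 1.3, 1.4]  # graph_a shifted by 1.0

dist_ab = wasserstein_1d(graph_a, graph_b)
dist_ac = wasserstein_1d(graph_a, graph_c)
```

Graphs whose node feature distributions are close get a small distance, so an embedding that preserves such distances places them near each other.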
Visualize Embeddings
You can use the built-in visualization function to gain quick insights into the performance of your embeddings. This can be done using the method visualize_graph_embedding. If you have labels for your graphs (as is the case here), you can color the embedding distributions using the labels. By default, embeddings are not colored.
Using Sampled Sub-Graphs
We may often have to deal with large graph collections, both in the number of sub-graphs in the collection and in the size of each sub-graph. To allow for faster computation, we can sample each sub-graph and compute metrics and features for a fraction of the nodes on each sub-graph. This can be done using the method build_node_sample_collection. It takes as input the fraction of nodes to sample. Once this method is called, all further computation will use the sampled node collection.
nxt.build_node_sample_collection(sample_rate=0.1)
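To show what sampling a fraction of nodes per sub-graph can look like, here is a minimal standard-library sketch. It assumes uniform random sampling without replacement and keeps at least one node per graph; NEExT's actual sampling strategy is not described here, and the `sample_nodes_per_graph` helper, its `seed` parameter, and the toy mapping are hypothetical.

```python
import random
from collections import defaultdict


def sample_nodes_per_graph(node_graph_pairs, sample_rate, seed=0):
    """Keep roughly sample_rate of the nodes in each graph (at least one)."""
    nodes_by_graph = defaultdict(list)
    for node_id, graph_id in node_graph_pairs:
        nodes_by_graph[graph_id].append(node_id)
    rng = random.Random(seed)  # fixed seed for reproducible sampling
    sampled = {}
    for graph_id, nodes in nodes_by_graph.items():
        k = max(1, round(sample_rate * len(nodes)))
        sampled[graph_id] = sorted(rng.sample(nodes, k))
    return sampled


# Two toy graphs with 20 and 10 nodes respectively.
mapping = [(n, 0) for n in range(20)] + [(n, 1) for n in range(20, 30)]
sampled = sample_nodes_per_graph(mapping, sample_rate=0.1)
```

With a sample rate of 0.1, the 20-node graph keeps 2 nodes and the 10-node graph keeps 1.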