XNode2Vec - An Alternative Data Clustering Procedure
Description
This repository proposes an alternative method for data classification and clustering, based on the Node2Vec algorithm applied to a suitably transformed N-dimensional dataset. The original Node2Vec implementation is replaced with a much faster version, FastNode2Vec. The algorithm is exposed through functions that work on networkx objects, which are quite user-friendly. At the moment only a few simple data transformations are available, but more complex and effective ones will be added.
Installation
To install the Xnode2vec package, simply use pip:
pip install Xnode2vec
If there are some problems with the installation, please read the "Note" below.
How to Use
The idea behind it is straightforward:
- Take a dataset, or generate one.
- Apply the proper transformation to the dataset.
- Build a networkx object that embeds the dataset with its crucial properties.
- Perform a node classification analysis with Node2Vec algorithm.
```python
import numpy as np
import pandas as pd
import networkx as nx
import Xnode2vec as xn2v

x1 = np.random.normal(4, 1, 20)
y1 = np.random.normal(5, 1, 20)
x2 = np.random.normal(17, 2, 20)
y2 = np.random.normal(13, 1, 20)
family1 = np.column_stack((x1, y1)) # REQUIRED ARRAY FORMAT
family2 = np.column_stack((x2, y2)) # REQUIRED ARRAY FORMAT
dataset = np.concatenate((family1, family2), axis=0) # Generic dataset
transf_dataset = xn2v.best_line_projection(dataset) # Points transformation
df = xn2v.complete_edgelist(transf_dataset) # Pandas edge list generation
edgelist = xn2v.generate_edgelist(df)
G = nx.Graph()
G.add_weighted_edges_from(edgelist) # Feed the graph with the edge list
nodes, similarity = xn2v.similar_nodes(G, dim=128, walk_length=20, context=5,
                                       picked=10, p=0.1, q=0.9, workers=4)
similar_points = xn2v.recover_points(dataset, G, nodes) # Final cluster
```
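Conceptually, the dataset-to-graph step turns the point cloud into a weighted graph whose edge weights reflect how close two points are. A minimal sketch of that idea using plain numpy and networkx, assuming a fully connected graph with Gaussian-kernel weights (a choice made here for illustration, not necessarily what complete_edgelist does internally):

```python
import numpy as np
import networkx as nx

rng = np.random.default_rng(0)
points = rng.normal(size=(6, 2))  # 6 points in 2D

# Build a complete weighted graph: one node per point, edge weights
# decaying with the Euclidean distance between the two points.
G = nx.Graph()
for i in range(len(points)):
    for j in range(i + 1, len(points)):
        d = np.linalg.norm(points[i] - points[j])
        G.add_edge(i, j, weight=np.exp(-d ** 2))

print(G.number_of_nodes(), G.number_of_edges())  # 6 nodes, 15 edges
```

Random walks on such a graph tend to stay inside tightly connected (i.e. spatially close) groups of nodes, which is what makes the Node2Vec embedding useful for clustering.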
Using the same setup as before, let's perform an even more complex analysis:
```python
import matplotlib.pyplot as plt

x1 = np.random.normal(16, 2, 100)
y1 = np.random.normal(9, 2, 100)
x2 = np.random.normal(25, 2, 100)
y2 = np.random.normal(25, 2, 100)
x3 = np.random.normal(2, 2, 100)
y3 = np.random.normal(1, 2, 100)
x4 = np.random.normal(30, 2, 100)
y4 = np.random.normal(70, 2, 100)
family1 = np.column_stack((x1, y1)) # REQUIRED ARRAY FORMAT
family2 = np.column_stack((x2, y2)) # REQUIRED ARRAY FORMAT
family3 = np.column_stack((x3, y3)) # REQUIRED ARRAY FORMAT
family4 = np.column_stack((x4, y4)) # REQUIRED ARRAY FORMAT
dataset = np.concatenate((family1, family2, family3, family4), axis=0) # Generic dataset
df = xn2v.complete_edgelist(dataset) # Pandas edge list generation
df = xn2v.generate_edgelist(df) # Networkx edgelist format
G = nx.Graph()
G.add_weighted_edges_from(df)
nodes_families, unlabeled_nodes = xn2v.clusters_detection(G, cluster_rigidity=0.85, spacing=15,
                                                          dim_fraction=0.8, picked=100, dim=100,
                                                          context=5, Weight=True, walk_length=20)
points_families = []
points_unlabeled = []
for i in range(0, len(nodes_families)):
    points_families.append(xn2v.recover_points(dataset, G, nodes_families[i]))
points_unlabeled = xn2v.recover_points(dataset, G, unlabeled_nodes)
plt.scatter(dataset[:, 0], dataset[:, 1])
plt.xlabel('x')
plt.ylabel('y')
plt.title('Generic Dataset', fontweight='bold')
plt.show()
```
Now the list points_families contains the four clusters, clearly up to possible statistical errors. The results are nevertheless surprisingly good in many situations.
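A quick sanity check on recovered clusters is to compare each cluster's centroid with the mean used to generate that family. A sketch with plain numpy, where the two arrays below are stand-ins for entries of points_families (with 100 samples and sigma = 2, each centroid should land well within one unit of the generating mean):

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-ins for two recovered clusters (e.g. entries of points_families).
cluster_a = rng.normal(loc=(16, 9), scale=2, size=(100, 2))
cluster_b = rng.normal(loc=(25, 25), scale=2, size=(100, 2))

for cluster, mean in [(cluster_a, (16, 9)), (cluster_b, (25, 25))]:
    centroid = cluster.mean(axis=0)
    # The standard error of the mean is sigma / sqrt(n) = 0.2 per axis,
    # so the centroid sits very close to the generating mean.
    print(np.round(centroid, 1), "vs", mean)
```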
Results
The analysis automatically prints the following to the terminal:
- Number of clusters found.
- Number of nodes analyzed.
- Number of clustered nodes.
- Number of non-clustered nodes.
- Number of nodes in each cluster.
The output looks like this:
```
--------- Clusters Information ---------
- Number of Clusters: 5
- Total nodes: 400
- Clustered nodes: 251
- Number of unlabeled nodes: 149
- Nodes in cluster 1: 16
- Nodes in cluster 2: 52
- Nodes in cluster 3: 83
- Nodes in cluster 4: 64
- Nodes in cluster 5: 36
```
The clustered objects are stored in a list of numpy arrays returned by clusters_detection(). It is important to become familiar with the parameter selection, since it determines the criteria by which nodes are labeled.
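The role of cluster_rigidity can be pictured as a similarity threshold: only nodes whose embedding similarity to a cluster's seed exceeds it get labeled, the rest stay unlabeled. A hypothetical illustration (the node ids, scores, and filtering logic below are invented for the sketch, not the package's internals):

```python
import numpy as np

# Hypothetical Node2Vec similarity scores of candidate nodes to a seed node.
nodes = np.array([3, 7, 12, 21, 34])
similarity = np.array([0.97, 0.91, 0.86, 0.71, 0.40])

rigidity = 0.85  # plays the role of cluster_rigidity
labeled = nodes[similarity >= rigidity]
unlabeled = nodes[similarity < rigidity]
print(labeled.tolist())    # [3, 7, 12]
print(unlabeled.tolist())  # [21, 34]
```

Raising the threshold produces smaller, purer clusters and more unlabeled nodes; lowering it does the opposite.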
Objects Syntax
Here is the list of structures required to use the Xnode2vec package:
- Dataset: dataset = np.array([[1,2,3,..], ..., [1,2,3,..]]); each row corresponds to a point, while the columns correspond to its coordinates.
- Edge List: edgelist = [(node_a, node_b, weight), ..., (node_c, node_d, weight)]; a list of tuples, each structured as (starting_node, arriving_node, weight).
- DataFrame: pandas.DataFrame(np.array([[1, 2, 3.7], ..., [2, 7, 12]]), columns=['node1', 'node2', 'weight'])
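The three structures fit together naturally: a DataFrame of weighted edges can be turned into an edge list of tuples and fed to a networkx graph. A minimal sketch with plain pandas and networkx (itertuples here plays the role that generate_edgelist plays in the package, which is an assumption about its behavior; the values are arbitrary):

```python
import pandas as pd
import networkx as nx

# DataFrame form: one row per weighted edge.
df = pd.DataFrame({'node1': [1, 2], 'node2': [2, 7], 'weight': [3.7, 12.0]})

# Edge-list form: (starting_node, arriving_node, weight) tuples.
edgelist = list(df.itertuples(index=False, name=None))

G = nx.Graph()
G.add_weighted_edges_from(edgelist)  # Feed the graph with the edge list
print(G[1][2]['weight'])  # 3.7
```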
Note
- 9/17/2021: I had some issues when installing the fastnode2vec package; in particular, the example given by Louis Abraham raised an error. I noticed that after installation, the content of the file "node2vec.py" did not match the latest version available on its GitHub repository (at the time). My brutal solution was to simply copy the whole up-to-date content into the local node2vec.py file; this solved the problem.
- 9/17/2021: Numba requires numpy <= 1.20 in order to work.
Examples
Generic Applications
- Most Similar Nodes
- Similar Nodes Distribution
- Community Network
- Hi-C Translocation Detection
Clustering Test
- Blops
- Moons
- Circles
- Swiss Roll