Skip to main content

Recommendation algorithms for large graphs on networkx

Project description

pygrank

Recommendation algorithms for large graphs.

Table of Contents

Installation

pip install pygrank

Usage

Ranking Algorithms

Ranking algorithms assume that there exists a networkx graph G and a non-negative personalization dictionary of node importances (missing nodes are considered to have 0 importance). For example, if we consider a list edges of edge tuples and a list seedS of nodes towards which to consider structural importance, the graph and personalization can be constructed as:

import networkx as nx
edges, seeds = ...
G = nx.Graph()
for u, v in edges:
    G.add_edge(u, v)
personalization = {v: 1 for v in seeds}

Given the above way of creating a graph and a personalization dictionary (which is the programming artifact equivalent to what the literature referred to as a graph signal or personalization vector), one can instantiate a ranking algorithm class and call its rank(G, personalization) method to capture node ranks in a given graph with the given personalization. This method exists for all ranking algorithms and potential wrappers that aim to augment ranks in some way (see Improving Ranking Outcome). Algorithmic parameters, such as convergence critertia and the type of adjacency matrix normalization, are passed to the constructor of the ranking algorithm. As an example, running a personalized PageRank algorithm with diffusion rate alpha=0.85 with absolute rank error tolerance tol=1.E-6 (its default parameters) and printing its node ranks can be done as follows:

from pygrank.algorithms.pagerank import PageRank
G, personalization = ...
ranker = PageRank(alpha=0.85, tol=1.E-6)
ranks = ranker.rank(G, personalization)
print('Convergence report:', str(ranker.convergence))
for v, rank in ranks.items():
    print('The rank of node', v, 'is', rank)

:bulb: For general-purpose usage, we recommend trying PageRank(0.85), PageRank(0.95) or HeatKernel(3), all of which capture some commonly found types of rank propagation. If these are used on large graphs (with thousands or milions of nodes), we recommend passing a stricter tolerance parameter tol1.E-9 to constructors to make sure that the personalization is propagated to most nodes.

Adjacency Matrix Normalization

Node ranking algorithms all use the same default scheme that performs symmetric (i.e. Lalplacian-like) normalization for undirected graphs and column-wise normalization that follows a true probabilistic formulation of transition probabilities for directed graphs, such as DiGraph instances. The type of normalization can be manually edited by passing a normalization argument to constructors of ranking algorithms. This parameter can assume values of "auto" for the above-described default behavior, "col" for column-wise normalization, "symmetric" for symmetric normalization and "none" for avoiding any normalization, for example because edge weights already hold the normalization (e.g. this is used to rank graphs after FairWalk is used to preprocess edge weights).

In all cases, adjacency matrix normalization involves the computationally intensive operation of converting the graph into a scipy sparse matrix each time the rank(G, personalization) method of ranking algorithms is called. The pygrank library provides a way to avoid recomputing the normalization during large-scale experiments by the same algorithm for the same graphs by passing an argument assume_immutability=True to the algorithms's constructor, which indicates that the the graph does not change between runs of the algorithm and hence computes the normalization only once for each given graph, a process known as hashing.

:warning: Hashing only uses the Python object's hash method, so a different instance of the same graph will recompute the normalization if it points at a different memory location.

:warning: Do not alter graph objects after passing them to rank(...) methods of algorithms with assume_immutability=True for the first time. If altering the graph is necessary midway through your code, create a copy instance with one of networkx's in-built methods and edit that one.

For example, hashing the outcome of graph normalization to speed up multiple calls to the same graph can be achieved as per the following code:

from pygrank.algorithms.pagerank import PageRank
G, personalization1, personalization2 = ...
algorithm = PageRank(alpha=0.85, normalization="col", assume_immutability=True)
ranks = algorithm.rank(G, personalization1)
ranks = algorithm.rank(G, personalization2) # does not re-compute the normalization

Sometimes, many different algorithms are applied on the same graph. In this case, to prevent each one from recomputing the hashing already calculated by others, they can be made to share the same normalization method. This can be done by using a shared instance of the normalization preprocessing class preprocessor, which can be passed as the to_scipy argument of ranking algorithm constructors. In this case, the normalization and assume_immutability arguments should be passed to the preprocessor and will be ignored by the constructors (what would otherwise happen is that the constructors would create a prerpocessor with these arguments).

:bulb: Basically, when the default value to_scipy=None is given, ranking algorithms create a new preprocessing instance with the normalization and assume_immutability values passed to their constructor. These two arguments are completely ignored if a preprocessor instance is passed to the ranking algorithm.

For example, using the outcome of graph normalization to speed up multiple rank calls to the same graph by different ranking algorithms can be done as:

from pygrank.algorithms.pagerank import PageRank, HeatKernel
from pygrank.algorithms.utils import preprocessor
G, personalization1, personalization2 = ...
pre = preprocessor(normalization="col", assume_immutability=True)
ranker1 = PageRank(alpha=0.85, to_scipy=pre)
ranker2 = HeatKernel(alpha=0.85, to_scipy=pre)
ranks1 = ranker1.rank(G, personalization1)
ranks2 = ranker2.rank(G, personalization2) # does not re-compute the normalization

Augmenting Node Ranks

It is often desirable to postprocess the outcome of node ranking algorithms. This can be some simple normalization or involve more complex procedures, such as making ranks protect a group of sensitive nodes (hence making them fairness-aware).

The simpler postprocessing mechanisms only aim to transform outputted ranks, for example by normalizing them or outputting their ordinality. For example, the following code wraps the base algorithm with a preprocessor that assigns rank 1 to the highest scored node, 2 to the second highest, etc:

from pygrank.algorithms.postprocess import Ordinals

G, personalization = ...
base_algorithm = ... # e.g. PageRank

algorithm = Ordinals(base_algorithm)
ordinals = algorithm.rank(G, personalization)

For ease of use, this type of postprocessing also provides a transformation method that can be directly used to transform ranks without wrapping the base algotithm. For example, the following code performs the same operation as the previous one:

from pygrank.algorithms.postprocess import Ordinals

G, personalization = ...
base_algorithm = ... # e.g. PageRank

ordinals = Ordinals().transform(base_algorithm.rank(G, personalization))

A second type of postprocessing uses computed ranks to change (e.g. edit) the personalization before re-running the ranking algorithms, sometimes multiple times. For example, the following seed oversampling scheme we first introduced in [krasanakis2019boosted] uses a base_algorithm to rank nodes and then sets seeds to one for nodes with higher ranks than any of the original seeds and then reruns that algorithm with updated seeds:

from pygrank.algorithms.oversampling import SeedOversampling

G, personalization = ...
base_algorithm = ... # e.g. PageRank

algorithm = SeedOversampling(base_algorithm)
ranks = algorithm.rank(G, personalization)

Convergence Criteria

Most base ranking algorithm constructors allow convergence argument that indicates an object to help determing their convergence criteria, such as type of error and tolerance for numerical convergence. If no such argument is passed to the constructor, a pygrank.algorithms.utils.ConvergenceManager object is automatically instantiated by borrowing whichever extra arguments it can from those passed to the constructors. Most frequently used is the tol argument to indicate the numerical tolerance level required for convergence.

Sometimes, it suffices to reach a robust node rank order instead of precise values. To cover such cases we have implemented a different convergence criterion ``pygrank.algorithms.utils.RankOrderConvergenceManager'' that stops at a robust node order [krasanakis2020stopping].

:warning: This criterion is specifically intended to be used with PageRank as the base ranking algorithm and needs to know that algorithm's diffusion parameter.

from pygrank.algorithms.pagerank import PageRank
from pygrank.algorithms.utils import RankOrderConvergenceManager
from pygrank.algorithms.postprocess import Ordinals

G, personalization = ...
alpha = 0.85
ordered_ranker = PageRank(alpha=alpha, convergence=RankOrderConvergenceManager(alpha))
ordered_ranker = Ordinals(ordered_ranker)
ordered_ranks = ordered_ranker.rank(G, personalization)

:bulb: Since the node order is more important than the specific rank values, a post-processing step has been added throught the wrapping expression ``ordered_ranker = Ordinals(ordered_ranker)'' to output rank order.

Rank Quality Evaluation

How to evaluate with an unsupervised metric
from pygrank.algorithms.postprocess import Normalize
from pygrank.metrics.unsupervised import Conductance

G, ranks = ... # calculate as per the first example
normalized_ranks = Normalize().transform(ranks)

metric = Conductance(G)
print(metric.evaluate(normalized_ranks))
How to evaluate with a supervised metric
from pygrank.metrics.supervised import AUC
import pygrank.metrics.utils

G, seeds, algorithm = ... # as per the first example
seeds, ground_truth = pygrank.metrics.utils.split_groups(seeds, fraction_of_training=0.5)

pygrank.metrics.utils.remove_group_edges_from_graph(G, ground_truth)
ranks = algorithm.rank(G, {v: 1 for v in seeds})

metric = AUC({v: 1 for v in ground_truth})
print(metric.evaluate(ranks))
How to evaluate multiple ranks
import networkx as nx
from pygrank.algorithms.pagerank import PageRank as Ranker
from pygrank.algorithms.postprocess import Normalize as Normalizer
from pygrank.algorithms.oversampling import BoostedSeedOversampling as Oversampler
from pygrank.metrics.unsupervised import Conductance
from pygrank.metrics.supervised import AUC
from pygrank.metrics.multigroup import MultiUnsupervised, MultiSupervised, LinkAUC
import pygrank.metrics.utils

# Construct data
G = nx.Graph()
groups = {}
groups["group1"] = list()
... 

# Split to training and test data
training_groups, test_groups = pygrank.metrics.utils.split_groups(groups)
pygrank.metrics.utils.remove_group_edges_from_graph(G, test_groups)

# Calculate ranks and put them in a map
algorithm = Normalizer(Oversampler(Ranker(alpha=0.99)))
ranks = {group_id: algorithm.rank(G, {v: 1 for v in group}) 
        for group_id, group in training_groups.items()}


# Evaluation with Conductance
conductance = MultiUnsupervised(Conductance, G)
print(conductance.evaluate(ranks))

# Evaluation with LinkAUC
link_AUC = LinkAUC(G, pygrank.metrics.utils.to_nodes(test_groups))
print(link_AUC.evaluate(ranks))

# Evaluation with AUC
auc = MultiSupervised(AUC, pygrank.metrics.utils.to_seeds(test_groups))
print(auc.evaluate(ranks))

References

Method References

Instantiation or Usage Method Name Citation
pygrank.algorithms.oversampling.SeedOversampling(ranker) SeedO krasanakis2019boosted
pygrank.algorithms.oversampling.BoostedSeedOversampling(ranker) SeedBO krasanakis2019boosted
pygrank.algorithms.pagerank.PageRank(converge_to_eigenvectors=True) VenueRank krasanakis2018venuerank
G = pygrank.postprocess.fairness.to_fairwalk(G, sensitive) FairWalk rahman2019fairwalk
pygrank.algorithms.postprocess.fairness.FairPostprocessor(ranker,'O') LFPRO tsioutsiouliklis2020fairness
pygrank.algorithms.postprocess.fairness.FairPersonalizer(ranker, error_type="mabs", max_residual=0) FP krasanakis2020fairconstr
pygrank.algorithms.postprocess.fairness.FairPersonalizer(ranker, 0.8, 10, error_type="mabs", max_residual=0) CFP krasanakis2020fairconstr
pygrank.algorithms.postprocess.fairness.FairPersonalizer(ranker) FairEdit krasanakis2020prioredit
pygrank.algorithms.postprocess.fairness.FairPersonalizer(ranker, 0.8, 10) FairEditC krasanakis2020prioredit
pygrank.metrics.multigroup.LinkAUC(G, hops=1) LinkAUC krasanakis2019linkauc
pygrank.metrics.multigroup.LinkAUC(G, hops=2) HopAUC krasanakis2020unsupervised
pygrank.algorithms.utils.RankOrderConvergenceManager(alpha, confidence=0.99, criterion="fraction_of_walks") krasanakis2020stopping
pygrank.algorithms.utils.RankOrderConvergenceManager(alpha) krasanakis2020stopping

Publications

The publications that have led to the development of various aspects of this library are presented in reverse chronological order.

@article{krasanakis2020unsupervised,
  title={Unsupervised evaluation of multiple node ranks by reconstructing local structures},
  author={Krasanakis, Emmanouil and Papadopoulos, Symeon and Kompatsiaris, Yiannis},
  journal={Applied Network Science},
  volume={5},
  number={1},
  pages={1--32},
  year={2020},
  publisher={Springer}
}
@article{krasanakis2019boosted,
  title={Boosted seed oversampling for local community ranking},
  author={Krasanakis, Emmanouil and Schinas, Emmanouil and Papadopoulos, Symeon and Kompatsiaris, Yiannis and Symeonidis, Andreas},
  journal={Information Processing \& Management},
  pages={102053},
  year={2019},
  publisher={Elsevier}
}
@inproceedings{krasanakis2020stopping,
  title={Stopping Personalized PageRank without an Error Tolerance Parameter},
  author={Krasanakis, Emmanouil and Papadopoulos, Symeon and Kompatsiaris, Ioannis},
  year={2020},
  booktitle={ASONAM},
}
@inproceedings{krasanakis2020fairconstr,
  title={Applying Fairness Constraints on Graph Node Ranks under Personalization Bias},
  author={Krasanakis, Emmanouil and Papadopoulos, Symeon and Kompatsiaris, Ioannis},
  year={2020},
  booktitle={Complex Networks},
}
@inproceedings{krasanakis2019linkauc,
  title={LinkAUC: Unsupervised Evaluation of Multiple Network Node Ranks Using Link Prediction},
  author={Krasanakis, Emmanouil and Papadopoulos, Symeon and Kompatsiaris, Yiannis},
  booktitle={International Conference on Complex Networks and Their Applications},
  pages={3--14},
  year={2019},
  organization={Springer}
}
@inproceedings{krasanakis2018venuerank,
  title={VenueRank: Identifying Venues that Contribute to Artist Popularity},
  author={Krasanakis, Emmanouil and Schinas, Emmanouil and Papadopoulos, Symeon and Kompatsiaris, Yiannis and Mitkas, Pericles A},
  booktitle={ISMIR},
  pages={702--708},
  year={2018}
}

Under Review

@inproceedings{krasanakis2020prioredit,
  title={Prior Signal Editing for Graph Filter Posterior Fairness Constraints},
  author={Krasanakis, Emmanouil and Papadopoulos, Symeon and Kompatsiaris, Ioannis},
  booktitle={under review},
}

Related

Here, we list additional publications whose methods are fully or partially implemented in this library.

@article{tsioutsiouliklis2020fairness,
  title={Fairness-Aware Link Analysis},
  author={Tsioutsiouliklis, Sotiris and Pitoura, Evaggelia and Tsaparas, Panayiotis and Kleftakis, Ilias and Mamoulis, Nikos},
  journal={arXiv preprint arXiv:2005.14431},
  year={2020}
}
@inproceedings{rahman2019fairwalk,
  title={Fairwalk: Towards Fair Graph Embedding.},
  author={Rahman, Tahleen A and Surma, Bartlomiej and Backes, Michael and Zhang, Yang},
  booktitle={IJCAI},
  pages={3289--3295},
  year={2019}
}

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distribution

pygrank-0.1.17-py3-none-any.whl (39.5 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page