Recommendation algorithms for large graphs on networkx

## Project description

# pygrank

Recommendation algorithms for large graphs.

## Table of Contents

## Installation

```
pip install pygrank
```

## Usage

### Ranking Algorithms

Ranking algorithms assume that there exists a networkx graph `G`

and a non-negative personalization dictionary of node importances
(missing nodes are considered to have 0 importance). For example,
if we consider a list `edges`

of edge tuples and a list `seedS`

of
nodes towards which to consider structural importance, the graph
and personalization can be constructed as:

import networkx as nx edges, seeds = ... G = nx.Graph() for u, v in edges: G.add_edge(u, v) personalization = {v: 1 for v in seeds}

Given the above way of creating a graph and a personalization
dictionary (which is the programming artifact equivalent to what
the literature referred to as a graph signal or personalization vector),
one can instantiate a ranking algorithm class
and call its `rank(G, personalization)`

method to capture
node ranks in a given graph with the given personalization. This
method exists for all ranking algorithms and potential wrappers
that aim to augment ranks in some way
(see Improving Ranking Outcome).
Algorithmic parameters, such as convergence critertia and
the type of adjacency matrix normalization, are passed to the
constructor of the ranking algorithm. As an example, running a
personalized PageRank algorithm with diffusion rate `alpha=0.85`

with absolute rank error tolerance `tol=1.E-6`

(its default parameters)
and printing its node ranks can be done as follows:

from pygrank.algorithms.pagerank import PageRank G, personalization = ... ranker = PageRank(alpha=0.85, tol=1.E-6) ranks = ranker.rank(G, personalization) print('Convergence report:', str(ranker.convergence)) for v, rank in ranks.items(): print('The rank of node', v, 'is', rank)

:bulb: For general-purpose usage, we recommend trying
`PageRank(0.85)`

, `PageRank(0.95)`

or `HeatKernel(3)`

,
all of which capture some commonly found types of
rank propagation. If these are used on large graphs (with
thousands or milions of nodes), we recommend passing a
stricter tolerance parameter `tol1.E-9`

to constructors
to make sure that the personalization is propagated to most nodes.

### Adjacency Matrix Normalization

Node ranking algorithms all use the same default scheme
that performs symmetric (i.e. Lalplacian-like) normalization
for undirected graphs and column-wise normalization that
follows a true probabilistic formulation of transition probabilities
for directed graphs, such as `DiGraph`

instances. The type of
normalization can be manually edited by passing a `normalization`

argument to constructors of ranking algorithms. This parameter can
assume values of "auto" for the above-described default behavior,
"col" for column-wise normalization, "symmetric" for symmetric
normalization and "none" for avoiding any normalization,
for example because edge weights already hold the normalization
(e.g. this is used to rank graphs after FairWalk is used to
preprocess edge weights).

In all cases, adjacency matrix normalization involves the
computationally intensive operation of converting the graph
into a scipy sparse matrix each time the `rank(G, personalization)`

method of ranking algorithms is called. The *pygrank* library
provides a way to avoid recomputing the normalization
during large-scale experiments by the same algorithm for
the same graphs by passing an argument `assume_immutability=True`

to the algorithms's constructor, which indicates that
the the graph does not change between runs of the algorithm
and hence computes the normalization only once for each given
graph, a process known as hashing.

:warning: Hashing only uses the Python object's hash method, so a different instance of the same graph will recompute the normalization if it points at a different memory location.

:warning: Do not alter graph objects after passing them to
`rank(...)`

methods of algorithms with
`assume_immutability=True`

for the first time. If altering the
graph is necessary midway through your code, create a copy
instance with one of *networkx*'s in-built methods and
edit that one.

For example, hashing the outcome of graph normalization to speed up multiple calls to the same graph can be achieved as per the following code:

from pygrank.algorithms.pagerank import PageRank G, personalization1, personalization2 = ... algorithm = PageRank(alpha=0.85, normalization="col", assume_immutability=True) ranks = algorithm.rank(G, personalization1) ranks = algorithm.rank(G, personalization2) # does not re-compute the normalization

Sometimes, many different algorithms are applied on the
same graph. In this case, to prevent each one
from recomputing the hashing already calculated by others,
they can be made to share the same normalization method. This
can be done by using a shared instance of the
normalization preprocessing class `preprocessor`

,
which can be passed as the `to_scipy`

argument of ranking algorithm
constructors. In this case, the `normalization`

and `assume_immutability`

arguments should be passed to the preprocessor and will be ignored by the
constructors (what would otherwise happen is that the constructors
would create a prerpocessor with these arguments).

:bulb: Basically, when the default value `to_scipy=None`

is given, ranking algorithms create a new preprocessing instance
with the `normalization`

and `assume_immutability`

values passed
to their constructor. These two arguments are completely ignored
if a preprocessor instance is passed to the ranking algorithm.

For example, using the outcome of graph normalization to speed up multiple rank calls to the same graph by different ranking algorithms can be done as:

from pygrank.algorithms.pagerank import PageRank, HeatKernel from pygrank.algorithms.utils import preprocessor G, personalization1, personalization2 = ... pre = preprocessor(normalization="col", assume_immutability=True) ranker1 = PageRank(alpha=0.85, to_scipy=pre) ranker2 = HeatKernel(alpha=0.85, to_scipy=pre) ranks1 = ranker1.rank(G, personalization1) ranks2 = ranker2.rank(G, personalization2) # does not re-compute the normalization

### Augmenting Node Ranks

It is often desirable to postprocess the outcome of node ranking algorithms. This can be some simple normalization or involve more complex procedures, such as making ranks protect a group of sensitive nodes (hence making them fairness-aware).

The simpler postprocessing mechanisms only aim to transform outputted ranks, for example by normalizing them or outputting their ordinality. For example, the following code wraps the base algorithm with a preprocessor that assigns rank 1 to the highest scored node, 2 to the second highest, etc:

from pygrank.algorithms.postprocess import Ordinals G, personalization = ... base_algorithm = ... # e.g. PageRank algorithm = Ordinals(base_algorithm) ordinals = algorithm.rank(G, personalization)

For ease of use, this type of postprocessing also provides a transformation method that can be directly used to transform ranks without wrapping the base algotithm. For example, the following code performs the same operation as the previous one:

from pygrank.algorithms.postprocess import Ordinals G, personalization = ... base_algorithm = ... # e.g. PageRank ordinals = Ordinals().transform(base_algorithm.rank(G, personalization))

A second type of postprocessing uses computed ranks to change (e.g. edit)
the personalization before re-running the ranking algorithms, sometimes
multiple times. For example, the following seed oversampling scheme we first
introduced in [krasanakis2019boosted] uses a `base_algorithm`

to rank nodes
and then sets seeds to one for nodes with higher ranks than any of the original seeds
and then reruns that algorithm with updated seeds:

from pygrank.algorithms.oversampling import SeedOversampling G, personalization = ... base_algorithm = ... # e.g. PageRank algorithm = SeedOversampling(base_algorithm) ranks = algorithm.rank(G, personalization)

### Convergence Criteria

Most base ranking algorithm constructors allow `convergence`

argument that
indicates an object to help determing their convergence criteria, such as type of
error and tolerance for numerical convergence. If no such argument is passed
to the constructor, a `pygrank.algorithms.utils.ConvergenceManager`

object
is automatically instantiated by borrowing whichever extra arguments it can
from those passed to the constructors. Most frequently used is the `tol`

argument to indicate the numerical tolerance level required for convergence.

Sometimes, it suffices to reach a robust node rank order instead of precise values. To cover such cases we have implemented a different convergence criterion ``pygrank.algorithms.utils.RankOrderConvergenceManager'' that stops at a robust node order [krasanakis2020stopping].

:warning: This criterion is specifically intended to be used with PageRank as the base ranking algorithm and needs to know that algorithm's diffusion parameter.

from pygrank.algorithms.pagerank import PageRank from pygrank.algorithms.utils import RankOrderConvergenceManager from pygrank.algorithms.postprocess import Ordinals G, personalization = ... alpha = 0.85 ordered_ranker = PageRank(alpha=alpha, convergence=RankOrderConvergenceManager(alpha)) ordered_ranker = Ordinals(ordered_ranker) ordered_ranks = ordered_ranker.rank(G, personalization)

:bulb: Since the node order is more important than the specific rank values, a post-processing step has been added throught the wrapping expression ``ordered_ranker = Ordinals(ordered_ranker)'' to output rank order.

### Rank Quality Evaluation

###### How to evaluate with an unsupervised metric

from pygrank.algorithms.postprocess import Normalize from pygrank.metrics.unsupervised import Conductance G, ranks = ... # calculate as per the first example normalized_ranks = Normalize().transform(ranks) metric = Conductance(G) print(metric.evaluate(normalized_ranks))

###### How to evaluate with a supervised metric

from pygrank.metrics.supervised import AUC import pygrank.metrics.utils G, seeds, algorithm = ... # as per the first example seeds, ground_truth = pygrank.metrics.utils.split_groups(seeds, fraction_of_training=0.5) pygrank.metrics.utils.remove_group_edges_from_graph(G, ground_truth) ranks = algorithm.rank(G, {v: 1 for v in seeds}) metric = AUC({v: 1 for v in ground_truth}) print(metric.evaluate(ranks))

###### How to evaluate multiple ranks

import networkx as nx from pygrank.algorithms.pagerank import PageRank as Ranker from pygrank.algorithms.postprocess import Normalize as Normalizer from pygrank.algorithms.oversampling import BoostedSeedOversampling as Oversampler from pygrank.metrics.unsupervised import Conductance from pygrank.metrics.supervised import AUC from pygrank.metrics.multigroup import MultiUnsupervised, MultiSupervised, LinkAUC import pygrank.metrics.utils # Construct data G = nx.Graph() groups = {} groups["group1"] = list() ... # Split to training and test data training_groups, test_groups = pygrank.metrics.utils.split_groups(groups) pygrank.metrics.utils.remove_group_edges_from_graph(G, test_groups) # Calculate ranks and put them in a map algorithm = Normalizer(Oversampler(Ranker(alpha=0.99))) ranks = {group_id: algorithm.rank(G, {v: 1 for v in group}) for group_id, group in training_groups.items()} # Evaluation with Conductance conductance = MultiUnsupervised(Conductance, G) print(conductance.evaluate(ranks)) # Evaluation with LinkAUC link_AUC = LinkAUC(G, pygrank.metrics.utils.to_nodes(test_groups)) print(link_AUC.evaluate(ranks)) # Evaluation with AUC auc = MultiSupervised(AUC, pygrank.metrics.utils.to_seeds(test_groups)) print(auc.evaluate(ranks))

## References

### Method References

Instantiation or Usage | Method Name | Citation |
---|---|---|

`pygrank.algorithms.oversampling.SeedOversampling(ranker)` |
SeedO | krasanakis2019boosted |

`pygrank.algorithms.oversampling.BoostedSeedOversampling(ranker)` |
SeedBO | krasanakis2019boosted |

`pygrank.algorithms.pagerank.PageRank(converge_to_eigenvectors=True)` |
VenueRank | krasanakis2018venuerank |

`G = pygrank.postprocess.fairness.to_fairwalk(G, sensitive)` |
FairWalk | rahman2019fairwalk |

`pygrank.algorithms.postprocess.fairness.FairPostprocessor(ranker,'O')` |
LFPRO | tsioutsiouliklis2020fairness |

`pygrank.algorithms.postprocess.fairness.FairPersonalizer(ranker, error_type="mabs", max_residual=0)` |
FP | krasanakis2020fairconstr |

`pygrank.algorithms.postprocess.fairness.FairPersonalizer(ranker, 0.8, 10, error_type="mabs", max_residual=0)` |
CFP | krasanakis2020fairconstr |

`pygrank.algorithms.postprocess.fairness.FairPersonalizer(ranker)` |
FairEdit | krasanakis2020prioredit |

`pygrank.algorithms.postprocess.fairness.FairPersonalizer(ranker, 0.8, 10)` |
FairEditC | krasanakis2020prioredit |

`pygrank.metrics.multigroup.LinkAUC(G, hops=1)` |
LinkAUC | krasanakis2019linkauc |

`pygrank.metrics.multigroup.LinkAUC(G, hops=2)` |
HopAUC | krasanakis2020unsupervised |

`pygrank.algorithms.utils.RankOrderConvergenceManager(alpha, confidence=0.99, criterion="fraction_of_walks")` |
krasanakis2020stopping | |

`pygrank.algorithms.utils.RankOrderConvergenceManager(alpha)` |
krasanakis2020stopping |

### Publications

The publications that have led to the development of various aspects of this library are presented in reverse chronological order.

```
@article{krasanakis2020unsupervised,
title={Unsupervised evaluation of multiple node ranks by reconstructing local structures},
author={Krasanakis, Emmanouil and Papadopoulos, Symeon and Kompatsiaris, Yiannis},
journal={Applied Network Science},
volume={5},
number={1},
pages={1--32},
year={2020},
publisher={Springer}
}
```

```
@article{krasanakis2019boosted,
title={Boosted seed oversampling for local community ranking},
author={Krasanakis, Emmanouil and Schinas, Emmanouil and Papadopoulos, Symeon and Kompatsiaris, Yiannis and Symeonidis, Andreas},
journal={Information Processing \& Management},
pages={102053},
year={2019},
publisher={Elsevier}
}
```

```
@inproceedings{krasanakis2020stopping,
title={Stopping Personalized PageRank without an Error Tolerance Parameter},
author={Krasanakis, Emmanouil and Papadopoulos, Symeon and Kompatsiaris, Ioannis},
year={2020},
booktitle={ASONAM},
}
```

```
@inproceedings{krasanakis2020fairconstr,
title={Applying Fairness Constraints on Graph Node Ranks under Personalization Bias},
author={Krasanakis, Emmanouil and Papadopoulos, Symeon and Kompatsiaris, Ioannis},
year={2020},
booktitle={Complex Networks},
}
```

```
@inproceedings{krasanakis2019linkauc,
title={LinkAUC: Unsupervised Evaluation of Multiple Network Node Ranks Using Link Prediction},
author={Krasanakis, Emmanouil and Papadopoulos, Symeon and Kompatsiaris, Yiannis},
booktitle={International Conference on Complex Networks and Their Applications},
pages={3--14},
year={2019},
organization={Springer}
}
```

```
@inproceedings{krasanakis2018venuerank,
title={VenueRank: Identifying Venues that Contribute to Artist Popularity},
author={Krasanakis, Emmanouil and Schinas, Emmanouil and Papadopoulos, Symeon and Kompatsiaris, Yiannis and Mitkas, Pericles A},
booktitle={ISMIR},
pages={702--708},
year={2018}
}
```

### Under Review

```
@inproceedings{krasanakis2020prioredit,
title={Prior Signal Editing for Graph Filter Posterior Fairness Constraints},
author={Krasanakis, Emmanouil and Papadopoulos, Symeon and Kompatsiaris, Ioannis},
booktitle={under review},
}
```

### Related

Here, we list additional publications whose methods are fully or partially implemented in this library.

```
@article{tsioutsiouliklis2020fairness,
title={Fairness-Aware Link Analysis},
author={Tsioutsiouliklis, Sotiris and Pitoura, Evaggelia and Tsaparas, Panayiotis and Kleftakis, Ilias and Mamoulis, Nikos},
journal={arXiv preprint arXiv:2005.14431},
year={2020}
}
```

```
@inproceedings{rahman2019fairwalk,
title={Fairwalk: Towards Fair Graph Embedding.},
author={Rahman, Tahleen A and Surma, Bartlomiej and Backes, Michael and Zhang, Yang},
booktitle={IJCAI},
pages={3289--3295},
year={2019}
}
```

## Project details

## Release history Release notifications | RSS feed

## Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Filename, size | File type | Python version | Upload date | Hashes |
---|---|---|---|---|

Filename, size pygrank-0.1.17-py3-none-any.whl (39.5 kB) | File type Wheel | Python version py3 | Upload date | Hashes View |