PecanPy: A parallelized, efficient, and accelerated node2vec(+) in Python

Learning low-dimensional representations (embeddings) of nodes in large graphs is key to applying machine learning on massive biological networks. Node2vec is the most widely used method for node embedding. PecanPy is a fast, parallelized, memory-efficient, and cache-optimized Python implementation of node2vec. It uses cache-optimized compact graph data structures together with precomputation and parallelization to produce fast, high-quality node embeddings for biological networks of all sizes and densities. Detailed source code documentation can be found here.

The details of implementation and the optimizations, along with benchmarks, are described in the application note PecanPy: a fast, efficient and parallelized Python implementation of node2vec, which is published in Bioinformatics. The benchmarking results presented in the preprint can be reproduced using the test scripts provided in the companion benchmarks repo.

v2 update: PecanPy is now equipped with node2vec+, which is a natural extension of node2vec and handles weighted graphs more effectively. For more information, see Accurately Modeling Biased Random Walks on Weighted Graphs Using Node2vec+. The datasets and test scripts for reproducing the presented results are available in the node2vec+ benchmarks repo.

Installation

Install from the latest release with:

$ pip install pecanpy

Install the latest (unreleased) version in development mode with:

$ git clone https://github.com/krishnanlab/pecanpy.git
$ cd pecanpy
$ pip install -e .

where -e means "editable" mode so you don't have to reinstall every time you make changes.

PecanPy installs a command line utility pecanpy that can be used directly.

Usage

PecanPy operates in three different modes – PreComp, SparseOTF, and DenseOTF – that are optimized for networks of different sizes and densities; PreComp for networks that are small (≤10k nodes; any density), SparseOTF for networks that are large and sparse (>10k nodes; ≤10% of edges), and DenseOTF for networks that are large and dense (>10k nodes; >10% of edges). These modes appropriately take advantage of compact/dense graph data structures, precomputing transition probabilities, and computing 2nd-order transition probabilities during walk generation to achieve significant improvements in performance.
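
As a rough illustration of what "computing 2nd-order transition probabilities during walk generation" involves, the sketch below applies the standard node2vec bias from Grover and Leskovec (2016): an edge weight is scaled by 1/p when the walk returns to the previous node, by 1 when the candidate is also a neighbor of the previous node, and by 1/q otherwise. This is not PecanPy's implementation; the dict-of-dicts graph and the function name `second_order_probs` are assumptions made for the example.

```python
from typing import Dict, Hashable

# Hypothetical toy representation: node -> {neighbor: edge weight}.
Graph = Dict[Hashable, Dict[Hashable, float]]


def second_order_probs(
    graph: Graph, prev: Hashable, cur: Hashable, p: float = 1.0, q: float = 1.0
) -> Dict[Hashable, float]:
    """Normalized node2vec transition probabilities out of `cur`,
    given that the walk arrived at `cur` from `prev`."""
    biased = {}
    for nxt, weight in graph[cur].items():
        if nxt == prev:
            bias = 1.0 / p  # return to the previous node
        elif nxt in graph[prev]:
            bias = 1.0  # candidate is also adjacent to the previous node
        else:
            bias = 1.0 / q  # candidate moves away from the previous node
        biased[nxt] = weight * bias
    total = sum(biased.values())
    return {nxt: value / total for nxt, value in biased.items()}


# Small undirected, unweighted toy graph (illustrative only).
toy = {
    "a": {"b": 1.0, "c": 1.0},
    "b": {"a": 1.0, "c": 1.0, "d": 1.0},
    "c": {"a": 1.0, "b": 1.0},
    "d": {"b": 1.0},
}
probs = second_order_probs(toy, prev="a", cur="b", p=2.0, q=0.5)
print(probs)
```

A precomputing strategy would evaluate this for every (previous, current) pair up front and store the results, whereas an on-the-fly strategy evaluates it only for the pairs a walk actually visits, which trades repeated computation for lower memory use.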

Example

To run node2vec on Zachary's karate club network using SparseOTF mode, execute the following command from the project home directory:

pecanpy --input demo/karate.edg --output demo/karate.emb --mode SparseOTF

Node2vec+

To enable node2vec+, specify the --extend option.

pecanpy --input demo/karate.edg --output demo/karate_n2vplus.emb --mode SparseOTF --extend

Note: node2vec+ is only beneficial for embedding weighted graphs. For unweighted graphs, node2vec+ is equivalent to node2vec. The above example only serves as a demonstration of enabling node2vec+.

Demo

Execute the following command for a full demonstration:

sh demo/run_pecanpy

Mode

As mentioned above, PecanPy contains three main modes for generating node2vec random walks, each of which is better optimized for different network sizes/densities:

| Mode | Network size/density | Optimization |
| --- | --- | --- |
| PreComp | <10k nodes, <0.1% edges | Precompute second-order transition probabilities, using CSR graph |
| SparseOTF (default) | (≥10k nodes, ≥0.1% and <20% of edges) or (<10k nodes, ≥0.1% edges) | Transition probabilities computed on-the-fly, using CSR graph |
| DenseOTF | >20% of edges | Transition probabilities computed on-the-fly, using dense matrix |
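
The thresholds in the table above can be turned into a small helper for picking a --mode value. This is only a sketch: `suggest_mode` is a hypothetical function, not part of PecanPy, and it assumes that "% of edges" means the fraction of possible undirected edges, n(n-1)/2, that are present. The table does not say which mode applies at exactly 20% density, so the sketch assigns that boundary to DenseOTF; any case not covered by the other two rows falls back to the default, SparseOTF.

```python
def suggest_mode(num_nodes: int, num_edges: int) -> str:
    """Suggest a PecanPy mode name from the node and edge counts.

    Assumption: density is the fraction of possible undirected edges present.
    """
    if num_nodes < 2:
        raise ValueError("need at least two nodes to define edge density")
    possible_edges = num_nodes * (num_nodes - 1) / 2
    density = num_edges / possible_edges
    if density >= 0.20:  # boundary case at exactly 20% is an assumption
        return "DenseOTF"
    if num_nodes < 10_000 and density < 0.001:
        return "PreComp"
    return "SparseOTF"  # the default mode


# Zachary's karate club has 34 nodes and 78 edges.
print(suggest_mode(34, 78))
```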

Compatibility and recommendations

| Mode | Weighted graph | p,q!=1 | Node2vec+ | Use this if |
| --- | --- | --- | --- | --- |
| PreComp | ✅ | ✅ | ✅ | The graph is small and sparse |
| SparseOTF | ✅ | ✅ | ✅ | The graph is sparse but not necessarily small |
| DenseOTF | ✅ | ✅ | ✅ | The graph is extremely dense |
| PreCompFirstOrder | ✅ | ❌ | ❌ | Run with p = q = 1 on weighted graph |
| FirstOrderUnweighted | ❌ | ❌ | ❌ | Run with p = q = 1 on unweighted graph |

Options

Check out the full list of options available using:

pecanpy --help

Input

The supported input is a network file in edgelist format (.edg), with one edge per line; node IDs can be integers or strings:

node1_id node2_id <weight_float, optional>
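
To make the edgelist layout concrete, here is a minimal stand-alone parser for lines of this shape. It is a sketch rather than PecanPy's own reader: the function name `parse_edgelist` is hypothetical, fields are assumed to be separated by whitespace, node IDs are kept as strings, and a missing weight is assumed to default to 1.0.

```python
from typing import Iterable, List, Tuple


def parse_edgelist(
    lines: Iterable[str], default_weight: float = 1.0
) -> List[Tuple[str, str, float]]:
    """Parse 'node1_id node2_id [weight]' lines into (src, dst, weight) tuples."""
    edges = []
    for line_no, line in enumerate(lines, start=1):
        fields = line.split()  # assumption: whitespace-delimited fields
        if not fields:
            continue  # skip blank lines
        if len(fields) == 2:
            src, dst = fields
            weight = default_weight  # assumption: unweighted edge counts as 1.0
        elif len(fields) == 3:
            src, dst = fields[0], fields[1]
            weight = float(fields[2])
        else:
            raise ValueError(
                f"line {line_no}: expected 2 or 3 fields, got {len(fields)}"
            )
        edges.append((src, dst, weight))
    return edges


edges = parse_edgelist(["1 2", "geneA geneB 0.5", ""])
print(edges)
```

Because any iterable of strings is accepted, an open file handle can be passed directly, for example `parse_edgelist(open("demo/karate.edg"))`.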

Another supported input format (only for DenseOTF) is a NumPy .npz file. Run the following command to prepare a .npz file from a .edg file:

pecanpy --input $input_edgelist --output $output_npz --task todense

Output

The output file has n+1 lines for a graph with n vertices. The first line is a header of the following format:

num_of_nodes dim_of_representation

Each of the next n lines contains a node ID followed by its d-dimensional representation:

node_id dim_1 dim_2 ... dim_d
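
The output layout described above can be read back with a few lines of plain Python. The sketch below is illustrative and not part of PecanPy: `parse_embeddings` is a hypothetical helper that assumes whitespace-separated fields and checks the header counts against the body.

```python
from typing import Dict, Iterable, List


def parse_embeddings(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Parse a header line 'num_of_nodes dim_of_representation' followed by
    'node_id dim_1 ... dim_d' lines into a {node_id: vector} dict."""
    iterator = iter(lines)
    header = next(iterator).split()
    num_nodes, dim = int(header[0]), int(header[1])

    embeddings = {}
    for line in iterator:
        fields = line.split()
        if not fields:
            continue  # tolerate a trailing blank line
        node_id = fields[0]
        vector = [float(x) for x in fields[1:]]
        if len(vector) != dim:
            raise ValueError(f"node {node_id}: expected {dim} values, got {len(vector)}")
        embeddings[node_id] = vector

    if len(embeddings) != num_nodes:
        raise ValueError(f"header declares {num_nodes} nodes, found {len(embeddings)}")
    return embeddings


sample = ["2 3", "a 0.1 0.2 0.3", "b 1 2 3"]
emb = parse_embeddings(sample)
print(emb["b"])
```

An open file handle works as input as well, for example `parse_embeddings(open("demo/karate.emb"))` after running the karate club example.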

Development Note

Run black src/pecanpy/ to automatically follow black code formatting.
Run tox -e flake8 and resolve suggestions before committing to ensure consistent code style.

Additional Information

Documentation

Detailed documentation for PecanPy is available here.

Support

For support please contact Remy Liu at liurenmi@msu.edu.

License

This repository and all its contents are released under the BSD 3-Clause License; see LICENSE.md.

Citation

If you use PecanPy, please cite:
Liu R, Krishnan A (2021) PecanPy: a fast, efficient, and parallelized Python implementation of node2vec. Bioinformatics https://doi.org/10.1093/bioinformatics/btab202

If you find node2vec+ useful, please cite:
Liu R, Hirn M, Krishnan A (2021) Accurately Modeling Biased Random Walks on Weighted Graphs Using Node2vec+. arXiv https://arxiv.org/abs/2109.08031

Authors

Renming Liu, Arjun Krishnan*

*General correspondence should be addressed to AK at arjun@msu.edu.

Funding

This work was primarily supported by US National Institutes of Health (NIH) grant R35 GM128765 to AK and in part by MSU start-up funds to AK.

Acknowledgements

We thank Christopher A. Mancuso, Anna Yannakopoulos, and the rest of the Krishnan Lab for valuable discussions and feedback on the software and manuscript. Thanks to Charles T. Hoyt for making the software pip installable and for an extensive code review.

References

Original node2vec

  • Grover, A. and Leskovec, J. (2016) node2vec: Scalable Feature Learning for Networks. arXiv:1607.00653 [cs, stat].

Original node2vec software and networks

  • https://snap.stanford.edu/node2vec/ contains the original software and the networks (PPI, BlogCatalog, and Wikipedia) used in the original study (Grover and Leskovec, 2016).

Other networks

  • Stark, C. et al. (2006) BioGRID: a general repository for interaction datasets. Nucleic Acids Res., 34, D535–D539.

    • BioGRID human protein-protein interactions.
  • Szklarczyk, D. et al. (2015) STRING v10: protein–protein interaction networks, integrated over the tree of life. Nucleic Acids Res., 43, D447–D452.

    • STRING predicted human gene interactions.
  • Greene, C.S. et al. (2015) Understanding multicellular function and disease with human tissue-specific networks. Nat. Genet., 47, 569–576.

    • GIANT-TN is a generic genome-scale human gene network. GIANT-TN-c01 is a sub-network of GIANT-TN where edges with edge weight below 0.01 are discarded.

BioGRID (Stark et al., 2006), STRING (Szklarczyk et al., 2015), and GIANT-TN (Greene et al., 2015) are available from https://doi.org/10.5281/zenodo.3352323.
