Skip to main content

GraphScope: A One-Stop Large-Scale Graph Computing System from Alibaba

Project description

graphscope-logo

A One-Stop Large-Scale Graph Computing System from Alibaba

GraphScope CI Coverage Playground Open in Colab Artifact HUB Docs-en FAQ-en Docs-zh FAQ-zh README-zh

GraphScope is a unified distributed graph computing platform that provides a one-stop environment for performing diverse graph operations on a cluster of computers through a user-friendly Python interface. GraphScope makes multi-staged processing of large-scale graph data on compute clusters simple by combining several important pieces of Alibaba technology: including GRAPE, MaxGraph, and Graph-Learn (GL) for analytics, interactive, and graph neural networks (GNN) computation, respectively, and the Vineyard store that offers efficient in-memory data transfers.

Visit our website at graphscope.io to learn more.

Table of Contents

Getting Started

We provide a Playground with a managed JupyterLab. Try GraphScope straight away in your browser!

GraphScope supports run in standalone mode or on clusters managed by Kubernetes within containers. For quickly getting started, let's begin with the standalone mode.

Installation for Standalone Mode

GraphScope pre-compiled package is distributed as a python package and can be easily installed with pip.

pip3 install graphscope

Note that graphscope requires Python >= 3.6 and pip >= 19.0. The package is built for and tested on the most popular Linux (Ubuntu 18.04+ / CentOS 7+) and macOS (version 10.15+) distributions. For Windows users, you may want to install Ubuntu on WSL2 to use this package.

Next, we will walk you through a concrete example to illustrate how GraphScope can be used by data scientists to effectively analyze large graphs.

Demo: Node Classification on Citation Network

ogbn-mag is a heterogeneous network composed of a subset of the Microsoft Academic Graph. It contains 4 types of entities(i.e., papers, authors, institutions, and fields of study), as well as four types of directed relations connecting two entities.

Given the heterogeneous ogbn-mag data, the task is to predict the class of each paper. Node classification can identify papers in multiple venues, which represent different groups of scientific work on different topics. We apply both the attribute and structural information to classify papers. In the graph, each paper node contains a 128-dimensional word2vec vector representing its content, which is obtained by averaging the embeddings of words in its title and abstract. The embeddings of individual words are pre-trained. The structural information is computed on-the-fly.

Loading a graph

GraphScope models graph data as property graph, in which the edges/vertices are labeled and have many properties. Taking ogbn-mag as example, the figure below shows the model of the property graph.

sample-of-property-graph

This graph has four kinds of vertices, labeled as paper, author, institution and field_of_study. There are four kinds of edges connecting them, each kind of edges has a label and specifies the vertex labels for its two ends. For example, cites edges connect two vertices labeled paper. Another example is writes, it requires the source vertex is labeled author and the destination is a paper vertex. All the vertices and edges may have properties. e.g., paper vertices have properties like features, publish year, subject label, etc.

To load this graph to GraphScope with our retrieval module, please use these code:

import graphscope
from graphscope.dataset import load_ogbn_mag

g = load_ogbn_mag()

We provide a set of functions to load graph datasets from ogb and snap for convenience. Please find all the available graphs here. If you want to use your own graph data, please refer this doc to load vertices and edges by labels.

Interactive query

Interactive queries allow users to directly explore, examine, and present graph data in an exploratory manner in order to locate specific or in-depth information in time. GraphScope adopts a high-level language called Gremlin for graph traversal, and provides efficient execution at scale.

In this example, we use graph traversal to count the number of papers two given authors have co-authored. To simplify the query, we assume the authors can be uniquely identified by ID 2 and 4307, respectively.

# get the endpoint for submitting Gremlin queries on graph g.
interactive = graphscope.gremlin(g)

# count the number of papers two authors (with id 2 and 4307) have co-authored
papers = interactive.execute("g.V().has('author', 'id', 2).out('writes').where(__.in('writes').has('id', 4307)).count()").one()

Graph analytics

Graph analytics is widely used in real world. Many algorithms, like community detection, paths and connectivity, centrality are proven to be very useful in various businesses. GraphScope ships with a set of built-in algorithms, enables users easily analysis their graph data.

Continuing our example, below we first derive a subgraph by extracting publications in specific time out of the entire graph (using Gremlin!), and then run k-core decomposition and triangle counting to generate the structural features of each paper node.

Please note that many algorithms may only work on homogeneous graphs, and therefore, to evaluate these algorithms over a property graph, we need to project it into a simple graph at first.

# extract a subgraph of publication within a time range
sub_graph = interactive.subgraph("g.V().has('year', gte(2014).and(lte(2020))).outE('cites')")

# project the projected graph to simple graph.
simple_g = sub_graph.project(vertices={"paper": []}, edges={"cites": []})

ret1 = graphscope.k_core(simple_g, k=5)
ret2 = graphscope.triangles(simple_g)

# add the results as new columns to the citation graph
sub_graph = sub_graph.add_column(ret1, {"kcore": "r"})
sub_graph = sub_graph.add_column(ret2, {"tc": "r"})

In addition, users can write their own algorithms in GraphScope. Currently, GraphScope supports users to write their own algorithms in Pregel model and PIE model.

Graph neural networks (GNNs)

Graph neural networks (GNNs) combines superiority of both graph analytics and machine learning. GNN algorithms can compress both structural and attribute information in a graph into low-dimensional embedding vectors on each node. These embeddings can be further fed into downstream machine learning tasks.

In our example, we train a GCN model to classify the nodes (papers) into 349 categories, each of which represents a venue (e.g. pre-print and conference). To achieve this, first we launch a learning engine and build a graph with features following the last step.

# define the features for learning
paper_features = []
for i in range(128):
    paper_features.append("feat_" + str(i))

paper_features.append("kcore")
paper_features.append("tc")

# launch a learning engine.
lg = graphscope.graphlearn(sub_graph, nodes=[("paper", paper_features)],
                  edges=[("paper", "cites", "paper")],
                  gen_labels=[
                      ("train", "paper", 100, (0, 75)),
                      ("val", "paper", 100, (75, 85)),
                      ("test", "paper", 100, (85, 100))
                  ])

Then we define the training process, and run it.

# Note: Here we use tensorflow as NN backend to train GNN model. so please
# install tensorflow.
try:
    # https://www.tensorflow.org/guide/migrate
    import tensorflow.compat.v1 as tf
    tf.disable_v2_behavior()
except ImportError:
    import tensorflow as tf

import graphscope.learning
from graphscope.learning.examples import EgoGraphSAGE
from graphscope.learning.examples import EgoSAGESupervisedDataLoader
from graphscope.learning.examples.tf.trainer import LocalTrainer

# supervised GCN.
def train_gcn(graph, node_type, edge_type, class_num, features_num,
              hops_num=2, nbrs_num=[25, 10], epochs=2,
              hidden_dim=256, in_drop_rate=0.5, learning_rate=0.01,
):
    graphscope.learning.reset_default_tf_graph()

    dimensions = [features_num] + [hidden_dim] * (hops_num - 1) + [class_num]
    model = EgoGraphSAGE(dimensions, act_func=tf.nn.relu, dropout=in_drop_rate)

    # prepare train dataset
    train_data = EgoSAGESupervisedDataLoader(
        graph, graphscope.learning.Mask.TRAIN,
        node_type=node_type, edge_type=edge_type, nbrs_num=nbrs_num, hops_num=hops_num,
    )
    train_embedding = model.forward(train_data.src_ego)
    train_labels = train_data.src_ego.src.labels
    loss = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=train_labels, logits=train_embedding,
        )
    )
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)

    # prepare test dataset
    test_data = EgoSAGESupervisedDataLoader(
        graph, graphscope.learning.Mask.TEST,
        node_type=node_type, edge_type=edge_type, nbrs_num=nbrs_num, hops_num=hops_num,
    )
    test_embedding = model.forward(test_data.src_ego)
    test_labels = test_data.src_ego.src.labels
    test_indices = tf.math.argmax(test_embedding, 1, output_type=tf.int32)
    test_acc = tf.div(
        tf.reduce_sum(tf.cast(tf.math.equal(test_indices, test_labels), tf.float32)),
        tf.cast(tf.shape(test_labels)[0], tf.float32),
    )

    # train and test
    trainer = LocalTrainer()
    trainer.train(train_data.iterator, loss, optimizer, epochs=epochs)
    trainer.test(test_data.iterator, test_acc)

train_gcn(lg, node_type="paper", edge_type="cites",
          class_num=349,  # output dimension
          features_num=130,  # input dimension, 128 + kcore + triangle count
)

A python script with the entire process is availabe here, you may try it out by yourself.

Processing Large Graph on Kubernetes Cluster

GraphScope is designed for processing large graphs, which are usually hard to fit in the memory of a single machine. With Vineyard as the distributed in-memory data manager, GraphScope supports run on a cluster managed by Kubernetes(k8s).

To continue this tutorial, please ensure that you have a k8s managed cluster and know the credentials for the cluster. (e.g., address of k8s API server, usually stored a ~/.kube/config file.)

Alternatively, you can set up a local k8s cluster for testing with Kind. We provide a script for setup this environment.

# for usage, type -h
./scripts/install_deps.sh --k8s

If you did not install the graphscope package in the above step, you can install a subset of the whole package with client functions only.

pip3 install graphscope-client

Next, let's revisit the example by running on a cluster instead.

how-it-works

The figure shows the flow of execution in the cluster mode. When users run code in the python client, it will:

  • Step 1. Create a session or workspace in GraphScope.
  • Step 2 - Step 5. Load a graph, query, analysis and run learning task on this graph via Python interface. These steps are the same to local mode, thus users process huge graphs in a distributed setting just like analysis a small graph on a single machine.(Note that graphscope.gremlin and graphscope.graphlearn need to be changed to sess.gremlin and sess.graphlearn, respectively. sess is the name of the Session instance user created.)
  • Step 6. Close the session.

Creating a session

To use GraphScope in a distributed setting, we need to establish a session in a python interpreter.

For convenience, we provide several demo datasets, and an option with_dataset to mount the dataset in the graphscope cluster. The datasets will be mounted to /dataset in the pods. If you want to use your own data on k8s cluster, please refer to this.

import graphscope

sess = graphscope.session(with_dataset=True)

For macOS, the session needs to establish with the LoadBalancer service type (which is NodePort by default).

import graphscope

sess = graphscope.session(with_dataset=True, k8s_service_type="LoadBalancer")

A session tries to launch a coordinator, which is the entry for the back-end engines. The coordinator manages a cluster of resources (k8s pods), and the interactive/analytical/learning engines ran on them. For each pod in the cluster, there is a vineyard instance at service for distributed data in memory.

Loading a graph and processing computation tasks

Similar to the standalone mode, we can still use the functions to load a graph easily.

from graphscope.dataset import load_ogbn_mag

# Note we have mounted the demo datasets to /dataset,
# There are several datasets including ogbn_mag_small,
# User can attach to the engine container and explore the directory.
g = load_ogbn_mag(sess, "/dataset/ogbn_mag_small/")

Here, the g is loaded in parallel via vineyard and stored in vineyard instances in the cluster managed by the session.

Next, we can conduct graph queries with Gremlin, invoke various graph algorithms, or run graph-based neural network tasks like we did in the standalone mode. We do not repeat code here, but a .ipynb processing the classification task on k8s is available on the Playground.

Closing the session

Another additional step in the distribution is session close. We close the session after processing all graph tasks.

sess.close()

This operation will notify the backend engines and vineyard to safely unload graphs and their applications, Then, the coordinator will dealloc all the applied resources in the k8s cluster.

Please note that we have not hardened this release for production use and it lacks important security features such as authentication and encryption, and therefore it is NOT recommended for production use (yet)!

Development

Building on local

To build graphscope Python package and the engine binaries, you need to install some dependencies and build tools.

./scripts/install_deps.sh --dev

# With argument --cn to speed up the download if you are in China.
./scripts/install_deps.sh --dev --cn

Then you can build GraphScope with pre-configured make commands.

# to make graphscope whole package, including python package + engine binaries.
sudo make install

# or make the engine components
# make interactive
# make analytical
# make learning

Building Docker images

GraphScope ships with a Dockerfile that can build docker images for releasing. The images are built on a builder image with all dependencies installed and copied to a runtime-base image. To build images with latest version of GraphScope, go to the k8s/internal directory under root directory and run this command.

# by default, the built image is tagged as graphscope/graphscope:SHORTSHA
# cd k8s
make graphscope

Building client library

GraphScope python interface is separate with the engines image. If you are developing python client and not modifying the protobuf files, the engines image doesn't require to be rebuilt.

You may want to re-install the python client on local.

make client

Note that the learning engine client has C/C++ extensions modules and setting up the build environment is a bit tedious. By default the locally-built client library doesn't include the support for learning engine. If you want to build client library with learning engine enabled, please refer Build Python Wheels.

Testing

To verify the correctness of your developed features, your code changes should pass our tests.

You may run the whole test suite with commands:

make test

Documentation

Documentation can be generated using Sphinx. Users can build the documentation using:

# build the docs
make graphscope-docs

# to open preview on local
open docs/_build/latest/html/index.html

The latest version of online documentation can be found at https://graphscope.io/docs

License

GraphScope is released under Apache License 2.0. Please note that third-party libraries may not have the same license as GraphScope.

Publications

  • Wenfei Fan, Tao He, Longbin Lai, Xue Li, Yong Li, Zhao Li, Zhengping Qian, Chao Tian, Lei Wang, Jingbo Xu, Youyang Yao, Qiang Yin, Wenyuan Yu, Jingren Zhou, Diwen Zhu, Rong Zhu. GraphScope: A Unified Engine For Big Graph Processing. The 47th International Conference on Very Large Data Bases (VLDB), industry, 2021.
  • Jingbo Xu, Zhanning Bai, Wenfei Fan, Longbin Lai, Xue Li, Zhao Li, Zhengping Qian, Lei Wang, Yanyan Wang, Wenyuan Yu, Jingren Zhou. GraphScope: A One-Stop Large Graph Processing System. The 47th International Conference on Very Large Data Bases (VLDB), demo, 2021

Contributing

Any contributions you make are greatly appreciated!

  • Join in the Slack channel for discussion.
  • Please report bugs by submitting a GitHub issue.
  • Please submit contributions using pull requests.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

graphscope_client-0.19.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (46.9 MB view details)

Uploaded CPython 3.11manylinux: glibc 2.17+ x86-64

graphscope_client-0.19.0-cp311-cp311-macosx_10_9_x86_64.whl (22.3 MB view details)

Uploaded CPython 3.11macOS 10.9+ x86-64

graphscope_client-0.19.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (46.8 MB view details)

Uploaded CPython 3.10manylinux: glibc 2.17+ x86-64

graphscope_client-0.19.0-cp310-cp310-macosx_10_9_x86_64.whl (22.3 MB view details)

Uploaded CPython 3.10macOS 10.9+ x86-64

graphscope_client-0.19.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (46.8 MB view details)

Uploaded CPython 3.9manylinux: glibc 2.17+ x86-64

graphscope_client-0.19.0-cp39-cp39-macosx_10_9_x86_64.whl (22.3 MB view details)

Uploaded CPython 3.9macOS 10.9+ x86-64

graphscope_client-0.19.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (46.9 MB view details)

Uploaded CPython 3.8manylinux: glibc 2.17+ x86-64

graphscope_client-0.19.0-cp38-cp38-macosx_10_9_x86_64.whl (22.3 MB view details)

Uploaded CPython 3.8macOS 10.9+ x86-64

graphscope_client-0.19.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (47.1 MB view details)

Uploaded CPython 3.7mmanylinux: glibc 2.17+ x86-64

graphscope_client-0.19.0-cp37-cp37m-macosx_10_9_x86_64.whl (22.3 MB view details)

Uploaded CPython 3.7mmacOS 10.9+ x86-64

File details

Details for the file graphscope_client-0.19.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for graphscope_client-0.19.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 1a6f9c6610171da51a5827ed5457d610bc8bf40240d468e46d7cda79b7991d2c
MD5 f1575bdbc163a3383af49c49dac6d5c6
BLAKE2b-256 c6db7393a756aa2fc8da5aedc79631e700db9710015c9986d91004960fae53c2

See more details on using hashes here.

File details

Details for the file graphscope_client-0.19.0-cp311-cp311-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for graphscope_client-0.19.0-cp311-cp311-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 227d82d7bdcf3b5de77ec8875620e3b7b3d40bd6918298124a4583a8e0317c77
MD5 9086117c7d61ba7332044ee2e9a6bef7
BLAKE2b-256 28a24e82bcdc973af0026a191dab4775c4ce80a5d5b546794ec5fb8b75ce93a1

See more details on using hashes here.

File details

Details for the file graphscope_client-0.19.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for graphscope_client-0.19.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 b3f17955f7b21816ccc8f9e6105e441fdc840b3b400ecc8ba35164ba5082f1e1
MD5 0c25f25999d141c54282ada7c113600a
BLAKE2b-256 73d337fb77f5842197e08b63d74c9bd566066ad7063dc8759ed0a1cbcf4353a9

See more details on using hashes here.

File details

Details for the file graphscope_client-0.19.0-cp310-cp310-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for graphscope_client-0.19.0-cp310-cp310-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 ceb11c6df9cd5f09e8511dcc9101abb0c2a30278708b6fe97706d3999c629cf4
MD5 2764cf296f49d5c73698b8d4a6ff2b2c
BLAKE2b-256 829583f52b5f299fe99e1264e2faef2368e3eac1b14de8b2c2ec4153402e6518

See more details on using hashes here.

File details

Details for the file graphscope_client-0.19.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for graphscope_client-0.19.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 b2c1775fcbdbf7a3916e2f1826834aca5ed128eddf747cd12425811f572af596
MD5 7e0ca601e5a82135ccdbf3778330b5ad
BLAKE2b-256 c9acf30a6d176560e6fbf05b96b46b7b6a3f9b2e417b96d702d64dbd4f79d241

See more details on using hashes here.

File details

Details for the file graphscope_client-0.19.0-cp39-cp39-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for graphscope_client-0.19.0-cp39-cp39-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 ecb0290f3663f9650975a32ed552d52ce57e500517015baa1a0347b74cb9db1f
MD5 bbd2dd6f0ff43965da2a66b6594aefe9
BLAKE2b-256 460ad59af8d5055a3b94cb43810887a13c5e61d0f588d1ddb20fc44fd8599c25

See more details on using hashes here.

File details

Details for the file graphscope_client-0.19.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for graphscope_client-0.19.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 6c5c070e8cbaaba3061ef1322544bce63bafdf56a2a91e913e546522c23b6134
MD5 0fdd6f97b8ac29592aa118c616773b79
BLAKE2b-256 ef69ee816e3652dd2ceada5d411fe27fd1faacfa0f2853a6429025e3f4fb0f91

See more details on using hashes here.

File details

Details for the file graphscope_client-0.19.0-cp38-cp38-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for graphscope_client-0.19.0-cp38-cp38-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 5ce7bed84bd1ca267ff20a83f9163547e049e89731996d44ed46c22788414a2f
MD5 0ffbc7d181fd663e457373ad68276284
BLAKE2b-256 043a4e7dd4769626112b87624bb06f9640d8bfeb4fe26ff3beaf9d6b29baab4e

See more details on using hashes here.

File details

Details for the file graphscope_client-0.19.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for graphscope_client-0.19.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 f8ff50e1adbd936d93ccc5a3cc5cda33408dfedca4349181acf96228aeb9ae78
MD5 9afffd0d7d8e0c28112fe29bf57bd320
BLAKE2b-256 74c610f51fe94181fa34667ab541fec6e55f589133de3ac978f1d9f8c73d9936

See more details on using hashes here.

File details

Details for the file graphscope_client-0.19.0-cp37-cp37m-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for graphscope_client-0.19.0-cp37-cp37m-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 e55f8e9052ee9e4a81637e2649424698f5f0a102833e1ae7213d09cedec67c15
MD5 ef6c68534554292f54213f0bb6775362
BLAKE2b-256 e8d1d0d09b9d980ce6451f1ece24ce3323c5bb9cec2157d97f5934a78dca5a44

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page