Set up a Kubernetes cluster for distributed AI research

Cloudwise


cloudwise is Surreal’s cloud infrastructure provisioner, built on Terraform. See Surreal’s website and GitHub.

It prepares a Kubernetes cluster using Terraform. It generates .tf.json files, which are also recognized by Symphony.
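
To give a feel for what a .tf.json file is, here is a minimal sketch of a cluster definition in Terraform's JSON configuration syntax, built from Python. The helper names (build_cluster_config, write_tf_json) and the chosen fields are assumptions for illustration; the file that cloudwise actually generates may contain different or additional settings.

```python
import json


def build_cluster_config(cluster_name, project, zone, node_count=1):
    """Return a dict in Terraform's JSON configuration syntax.

    Hypothetical helper: this is NOT cloudwise's real output, only an
    example of the general shape of a .tf.json file for a GKE cluster.
    """
    return {
        # Provider settings for the Google Cloud provider.
        "provider": {
            "google": {"project": project, "zone": zone},
        },
        # One google_container_cluster resource, keyed by its local name.
        "resource": {
            "google_container_cluster": {
                cluster_name: {
                    "name": cluster_name,
                    "location": zone,
                    "initial_node_count": node_count,
                }
            }
        },
    }


def write_tf_json(config, path):
    """Write the configuration to a file such as <cluster_name>.tf.json."""
    with open(path, "w") as f:
        json.dump(config, f, indent=2)


if __name__ == "__main__":
    # Placeholder values, not taken from any real project.
    config = build_cluster_config("demo-cluster", "my-project", "us-west1-b")
    print(json.dumps(config, indent=2))
```

Because the file is plain JSON, other tools can load it with any JSON parser, which is presumably what makes it convenient to share between Terraform and Symphony.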

Installation

  • cloudwise runs on Python 3.

  • Clone the repository: git clone git@github.com:SurrealAI/cloudwise.git && cd cloudwise

  • Run pip install -e . in this directory.

  • Install Terraform by following its official installation instructions.

  • Install kubectl by following its official installation instructions.
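
Before moving on, it can help to confirm that the prerequisites listed above are in place. The following is a small sketch using only the Python standard library; the function names are hypothetical and not part of cloudwise.

```python
import shutil
import sys


def missing_tools(tools=("terraform", "kubectl"), which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def python_is_supported(version_info=sys.version_info):
    """cloudwise runs on Python 3, so reject Python 2 interpreters."""
    return version_info[0] >= 3


if __name__ == "__main__":
    if not python_is_supported():
        print("Python 3 is required")
    for tool in missing_tools():
        print("not found on PATH:", tool)
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default, shutil.which, is enough.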

Usage

  • (Optional, recommended) Create and work in a clean directory, because running Terraform generates state and configuration files in the working directory.

> mkdir surreal
> cd surreal

Google Cloud

  • First, set up credentials so that Terraform can access Google Cloud (see the guide). Choose one of the two methods:

    • Run the following command

    gcloud auth application-default login

    or

    • Go to the API key management page https://console.cloud.google.com/apis/credentials/serviceaccountkey and select Create new service account. The service account needs sufficient permissions to create the cluster. The Project Editor role is sufficient, though broader than strictly necessary. Then generate and download the key (JSON format is fine). Provide the path to the .json file when the command-line tool prompts for it.

  • Run the command-line tool and follow its instructions.

> cloudwise-gke

It will provide instructions and generate a <cluster_name>.tf.json file, which Terraform recognizes. If you have generated a .json credential file, provide it when prompted.

  • terraform init && terraform plan describes the changes to be made.

  • terraform apply makes the changes to your cloud project.

  • After cluster creation, obtain credentials for kubectl.

> gcloud container clusters get-credentials <cluster_name>
  • If you have GPUs in your cluster, create the daemon set that installs the drivers (see documentation).

> kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/stable/nvidia-driver-installer/cos/daemonset-preloaded.yaml
  • The generated <cluster_name>.tf.json is also recognized by Symphony’s scheduling mechanism and by Surreal, so you may want to link to it.

  • To remove everything, run terraform destroy.
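
The Google Cloud steps above can be collected into one small script. This is a sketch, not part of cloudwise: the function names are hypothetical, and it assumes that cloudwise-gke has already written <cluster_name>.tf.json into the current directory and that terraform, gcloud and kubectl are on PATH. The commands and the driver URL are the ones given in the steps above.

```python
import subprocess

# Driver installer manifest referenced in the GPU step above.
DRIVER_DAEMONSET_URL = (
    "https://raw.githubusercontent.com/GoogleCloudPlatform/"
    "container-engine-accelerators/stable/nvidia-driver-installer/"
    "cos/daemonset-preloaded.yaml"
)


def provisioning_commands(cluster_name, gpu=False):
    """Return the provisioning commands, in order, as argv lists."""
    commands = [
        ["terraform", "init"],
        ["terraform", "plan"],
        # terraform apply asks for confirmation on the terminal.
        ["terraform", "apply"],
        ["gcloud", "container", "clusters", "get-credentials", cluster_name],
    ]
    if gpu:
        # Only needed when the cluster has GPU nodes.
        commands.append(["kubectl", "apply", "-f", DRIVER_DAEMONSET_URL])
    return commands


def run_all(commands, runner=subprocess.run):
    """Run each command in turn and stop at the first failure."""
    for argv in commands:
        runner(argv, check=True)


if __name__ == "__main__":
    # Print the plan instead of executing it; call run_all(...) to execute.
    for argv in provisioning_commands("demo-cluster", gpu=True):
        print(" ".join(argv))
```

Keeping the command list separate from the runner makes it easy to review exactly what will be executed before anything touches the cloud project.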

AWS

Stay tuned

Azure

Stay tuned

FAQs:

  • Terraform install fails.

    • If you see error: ... API has not been used in project... during terraform apply, open the Kubernetes Engine tab and/or the Compute Engine tab in the Google Cloud console and enable their APIs.

  • GPU nodes are not scaling up.

    • Check if the driver installation daemon set is running (see documentation).
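
One way to check the driver installation daemon set is to read its status from kubectl. The sketch below is an illustration only: the function names are hypothetical, and the DaemonSet name nvidia-driver-installer and the kube-system namespace are assumptions about what the manifest creates, so adjust them if your cluster differs.

```python
import json
import subprocess


def daemonset_is_ready(daemonset):
    """True when every node that should run the DaemonSet has a ready pod.

    `daemonset` is the parsed JSON of a Kubernetes DaemonSet object.
    Returns False when no node is scheduled to run it at all.
    """
    status = daemonset.get("status", {})
    desired = status.get("desiredNumberScheduled", 0)
    ready = status.get("numberReady", 0)
    return desired > 0 and ready == desired


def fetch_daemonset(name="nvidia-driver-installer", namespace="kube-system"):
    """Fetch a DaemonSet as a dict via kubectl (name/namespace are assumed)."""
    result = subprocess.run(
        ["kubectl", "get", "daemonset", name, "-n", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


if __name__ == "__main__":
    print("driver installer ready:", daemonset_is_ready(fetch_daemonset()))
```

A result of False with zero desired pods means that no node currently matches the daemon set, which is different from the installer failing on an existing GPU node.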

Download files

Source Distribution

cloudwise-0.1.1.tar.gz (9.1 kB view details)

Uploaded Source

Built Distribution

cloudwise-0.1.1-py3-none-any.whl (9.0 kB view details)

Uploaded Python 3

File details

Details for the file cloudwise-0.1.1.tar.gz.

File metadata

  • Download URL: cloudwise-0.1.1.tar.gz
  • Upload date:
  • Size: 9.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.12.1 pkginfo/1.4.2 requests/2.20.0 setuptools/40.4.3 requests-toolbelt/0.8.0 tqdm/4.28.1 CPython/3.5.5

File hashes

Hashes for cloudwise-0.1.1.tar.gz
Algorithm    Hash digest
SHA256       0cf1758321d4191caed367369c01976e15e98259d890b69fbed0f95052529f3d
MD5          1abd7d7fb45a4297b4c6779d2130d488
BLAKE2b-256  30d3734b2449ae6b66c7f91bb5ffd2af344b8ef016a36e58b6cf211ea3dbdcb6

File details

Details for the file cloudwise-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: cloudwise-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 9.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.12.1 pkginfo/1.4.2 requests/2.20.0 setuptools/40.4.3 requests-toolbelt/0.8.0 tqdm/4.28.1 CPython/3.5.5

File hashes

Hashes for cloudwise-0.1.1-py3-none-any.whl
Algorithm    Hash digest
SHA256       b19b5d29f42f8d91a45bc2fbba3185e6e7b699b919a165a1d9f7ebd35aeaa8de
MD5          759551f9802653959238f1e10113f047
BLAKE2b-256  ed85ee86a4dd11894820b9be5983625e09e10d5fc79cb4b873b098b0725ce6ed
