Backend.AI Agent
The Backend.AI Agent is a small daemon that:
- Reports the status and available resource slots of a worker to the manager
- Routes code execution requests to the designated kernel container
- Manages the lifecycle of kernel containers (create/monitor/destroy them)
Package Structure
- ai.backend
  - agent: The agent package
    - server: The agent daemon which communicates with the manager and the Docker daemon
    - watcher: A side-by-side daemon which provides a separate HTTP endpoint for accessing the status information of the agent daemon and manipulating the agent's systemd service
Installation
Please visit the installation guides.
Kernel/system configuration
Recommended kernel parameters in the bootloader (e.g., Grub):
cgroup_enable=memory swapaccount=1
Recommended resource limits:
/etc/security/limits.conf
root hard nofile 512000
root soft nofile 512000
root hard nproc 65536
root soft nproc 65536
user hard nofile 512000
user soft nofile 512000
user hard nproc 65536
user soft nproc 65536
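As a rough way to confirm that the limits above are in effect for the account that runs the agent, the following sketch compares the current process limits against the recommended values. It uses only Python's standard `resource` module; the helper names `limit_ok` and `check_limits` are illustrative and not part of Backend.AI.

```python
import resource

# Recommended minimums, taken from the limits.conf values above.
RECOMMENDED = {
    "nofile": (resource.RLIMIT_NOFILE, 512000),
    "nproc": (resource.RLIMIT_NPROC, 65536),
}


def limit_ok(current, recommended):
    """Return True if a limit value meets the recommended minimum."""
    # RLIM_INFINITY means "unlimited", which always satisfies the minimum.
    return current == resource.RLIM_INFINITY or current >= recommended


def check_limits():
    """Map each limit name to (soft_ok, hard_ok) for the current process."""
    report = {}
    for name, (res, minimum) in RECOMMENDED.items():
        soft, hard = resource.getrlimit(res)
        report[name] = (limit_ok(soft, minimum), limit_ok(hard, minimum))
    return report
```

Run it from a fresh login shell of the agent-running user, since changes to limits.conf typically apply only to new sessions.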
sysctl
fs.file-max=2048000
net.core.somaxconn=1024
net.ipv4.tcp_max_syn_backlog=1024
net.ipv4.tcp_slow_start_after_idle=0
net.ipv4.tcp_fin_timeout=10
net.ipv4.tcp_window_scaling=1
net.ipv4.tcp_tw_reuse=1
net.ipv4.tcp_early_retrans=1
net.ipv4.ip_local_port_range="40000 65000"
net.core.rmem_max=16777216
net.core.wmem_max=16777216
net.ipv4.tcp_rmem=4096 12582912 16777216
net.ipv4.tcp_wmem=4096 12582912 16777216
net.netfilter.nf_conntrack_max=10485760
net.netfilter.nf_conntrack_tcp_timeout_established=432000
net.netfilter.nf_conntrack_tcp_timeout_close_wait=10
net.netfilter.nf_conntrack_tcp_timeout_fin_wait=10
net.netfilter.nf_conntrack_tcp_timeout_time_wait=10
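To see which of these settings differ on a given host, one option is to compare them with the live values under /proc/sys. The sketch below is only an illustration: the helper names are hypothetical, it assumes the usual mapping of a sysctl key to a /proc/sys path (dots become slashes), and it lists just a few of the keys above.

```python
from pathlib import Path

# A few of the recommended values listed above; extend as needed.
RECOMMENDED = """
fs.file-max=2048000
net.core.somaxconn=1024
net.ipv4.tcp_fin_timeout=10
net.ipv4.tcp_rmem=4096 12582912 16777216
"""


def parse_sysctl(text):
    """Parse "key=value" lines into a dict with whitespace-normalized values."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        # Drop optional quotes and collapse whitespace so values compare cleanly.
        settings[key.strip()] = " ".join(value.strip().strip('"').split())
    return settings


def proc_path(key, root="/proc/sys"):
    """Map a sysctl key such as "fs.file-max" to its /proc/sys path."""
    return Path(root) / key.replace(".", "/")


def read_current(key, root="/proc/sys"):
    """Return the live value of a key, or None if it cannot be read."""
    try:
        return " ".join(proc_path(key, root).read_text().split())
    except OSError:
        return None


def mismatches(recommended, root="/proc/sys"):
    """Return {key: (wanted, current)} for keys whose live value differs.

    This is an exact comparison, so a value larger than the recommended
    one is also reported.
    """
    result = {}
    for key, want in recommended.items():
        have = read_current(key, root)
        if have != want:
            result[key] = (want, have)
    return result
```

Calling `mismatches(parse_sysctl(RECOMMENDED))` on the agent host gives a quick overview; keys under net.netfilter may be reported as unreadable when the conntrack module is not loaded.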
The ip_local_port_range should not overlap with the container port range pool (default: 30000 to 31000).
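The overlap rule can be checked mechanically. The following is a minimal sketch with hypothetical helper names; it assumes both ranges are inclusive and reads the live local port range from the standard Linux /proc location.

```python
def parse_port_range(text):
    """Parse "40000 65000" or "30000-31000" into an inclusive (low, high) tuple."""
    low, high = (int(part) for part in text.replace("-", " ").split())
    if low > high:
        raise ValueError(f"invalid port range: {text!r}")
    return low, high


def ranges_overlap(a, b):
    """Return True if two inclusive (low, high) ranges share at least one port."""
    return a[0] <= b[1] and b[0] <= a[1]


def read_local_port_range(path="/proc/sys/net/ipv4/ip_local_port_range"):
    """Read the kernel's current ephemeral port range (Linux only)."""
    with open(path) as f:
        return parse_port_range(f.read())
```

With the values recommended here, `ranges_overlap(parse_port_range("40000 65000"), (30000, 31000))` is false, which is the desired state.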
For development
Prerequisites
- libsnappy-dev or snappy-devel system package, depending on your distro
- Python 3.6 or higher with pyenv and pyenv-virtualenv (optional but recommended)
- Docker 18.03 or later with docker-compose (18.09 or later is recommended)
First, you need a working manager installation. For detailed instructions on installing the manager, please refer to the manager's README and then come back here.
Common steps
Next, prepare the source clone of the agent and install from it as follows.
$ git clone https://github.com/lablup/backend.ai-agent agent
$ cd agent
$ pyenv virtualenv venv-agent
$ pyenv local venv-agent
$ pip install -U pip setuptools
$ pip install -U -r requirements-dev.txt
From now on, let's assume all shell commands are executed inside the virtualenv.
Before running, you first need to prepare "the kernel runner environment", which is composed of a dedicated Docker image that is mounted into kernel containers at runtime. Since our kernel images have two different base Linux distros, Alpine and Ubuntu, you need to build/download the krunner-env images twice as follows.
For development:
$ python -m ai.backend.agent.kernel build-krunner-env alpine3.8
$ python -m ai.backend.agent.kernel build-krunner-env ubuntu16.04
Or pull the matching version from Docker Hub (only supported for already released versions):
$ docker pull lablup/backendai-krunner-env:19.03-alpine3.8
$ docker pull lablup/backendai-krunner-env:19.03-ubuntu16.04
Halfstack (single-node development & testing)
With the halfstack, you can run the agent simply. Note that you need a working manager running with the halfstack already!
Recommended directory structure
- backend.ai-dev
  - manager (git clone from the manager repo)
  - agent (git clone from here)
  - common (git clone from the common repo)
Install backend.ai-common as an editable package in the agent (and the manager) virtualenvs to keep the codebase up-to-date.
$ cd agent
$ pip install -U -e ../common
Steps
$ mkdir -p "./scratches"
$ cp config/halfstack.toml ./agent.toml
Then, run it (for debugging, append a --debug flag):
$ python -m ai.backend.agent.server
To run the agent-watcher:
$ python -m ai.backend.agent.watcher
The watcher shares the same configuration TOML file with the agent.
Note that the watcher is only meaningful if the agent is installed as a systemd service named backendai-agent.service.
To run tests:
$ python -m flake8 src tests
$ python -m pytest -m 'not integration' tests
Deployment
Configuration
Put a TOML-formatted agent configuration (see the sample in config/sample.toml) in one of the following locations:
- agent.toml (current working directory)
- ~/.config/backend.ai/agent.toml (user-config directory)
- /etc/backend.ai/agent.toml (system-config directory)
Only the first one found is used by the daemon.
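The first-match lookup can be pictured with the sketch below. It is not the agent's actual loader; the function names are hypothetical, and it simply assumes the three locations are tried in the order listed.

```python
from pathlib import Path


def candidate_paths(cwd=None, home=None):
    """Return the config locations in lookup order."""
    cwd = Path(cwd) if cwd is not None else Path.cwd()
    home = Path(home) if home is not None else Path.home()
    return [
        cwd / "agent.toml",                              # current working directory
        home / ".config" / "backend.ai" / "agent.toml",  # user-config directory
        Path("/etc/backend.ai/agent.toml"),              # system-config directory
    ]


def find_config(paths):
    """Return the first existing file among paths, or None if there is none."""
    for path in paths:
        if path.is_file():
            return path
    return None
```

A practical consequence is that a stray agent.toml in the working directory shadows the system-wide file, so `find_config(candidate_paths())` is a quick way to see which file would win.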
The agent reads most other configurations from the etcd v3 server where the cluster administrator or the Backend.AI manager stores all the necessary settings.
The etcd address and namespace must match those of the manager for the agent to be paired and activated. By specifying distinct namespaces, you may share a single etcd cluster among multiple separate Backend.AI clusters.
By default, the agent uses the /var/cache/scratches directory to create the temporary home directories used by kernel containers (the /home/work volume mounted in containers). Note that the directory must exist beforehand and the agent-running user must own it. You can change the location with the scratch-root option in agent.toml.
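The two requirements on the scratch directory (it exists, and the agent-running user owns it) can be verified before starting the daemon. This is a small sketch with a hypothetical helper name; it assumes a POSIX system and checks against the user running the script unless another UID is given.

```python
import os
from pathlib import Path


def check_scratch_root(path, uid=None):
    """Return a list of problems found with the scratch directory (empty if none)."""
    uid = os.getuid() if uid is None else uid
    scratch = Path(path)
    if not scratch.is_dir():
        return [f"{scratch} does not exist or is not a directory"]
    problems = []
    owner = scratch.stat().st_uid
    if owner != uid:
        problems.append(f"{scratch} is owned by uid {owner}, expected uid {uid}")
    return problems
```

For the default location, `check_scratch_root("/var/cache/scratches")` run as the agent user should return an empty list.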
Running from a command line
The minimal command to execute:
python -m ai.backend.agent.server
python -m ai.backend.agent.watcher
For more arguments and options, run the command with the --help option.
Example config for agent server/instances
/etc/supervisor/conf.d/agent.conf:
[program:backend.ai-agent]
user = user
stopsignal = TERM
stopasgroup = true
command = /home/user/run-agent.sh
/home/user/run-agent.sh:
#!/bin/sh
. /home/user/venv-agent/bin/activate
exec python -m ai.backend.agent.server
Networking
The manager and agent should run in the same local network or different networks reachable via VPNs, whereas the manager's API service must be exposed to the public network or another private network that users have access to.
The manager must be able to access TCP ports 6001, 6009, and 30000 to 31000 of the agents in default configurations. You can of course change those port numbers and ranges in the configuration.
Manager-to-Agent TCP Ports | Usage |
---|---|
6001 | ZeroMQ-based RPC calls from managers to agents |
6009 | HTTP watcher API |
30000-31000 | Port pool for in-container services |
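A quick connectivity check from the manager host can catch firewall problems early. The sketch below is illustrative only: the names are hypothetical, and it probes just the two fixed service ports, because ports in the 30000-31000 pool only accept connections while a container service is listening on them.

```python
import socket

# Default manager-to-agent service ports from the table above.
AGENT_SERVICE_PORTS = {6001: "ZeroMQ RPC", 6009: "HTTP watcher API"}


def can_connect(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_agent(host, ports=AGENT_SERVICE_PORTS, timeout=1.0):
    """Map each service port to whether it is reachable on the given agent host."""
    return {port: can_connect(host, port, timeout) for port in ports}
```

Calling `check_agent` with the address of an agent returns a dictionary keyed by port number; a false entry for 6009 is expected when the watcher is not running.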
The agent itself requires neither incoming nor outgoing access to the public Internet, but if users' computation programs need the Internet, the Docker containers should be able to reach it (possibly through corporate firewalls).
Agent-to-X TCP Ports | Usage |
---|---|
manager:5002 | ZeroMQ-based event push from agents to the manager |
etcd:2379 | etcd API access |
redis:6379 | Redis API access |
docker-registry:{80,443} | Docker registry access (HTTP/HTTPS) |
(Other hosts) | Depending on user program requirements |