
BARO: Robust Root Cause Analysis for Microservices via Multivariate Bayesian Online Change Point Detection

🕵️ BARO: Robust Root Cause Analysis for Microservice Systems

BARO is an end-to-end anomaly detection and root cause analysis approach for microservice failures. This repository includes artifacts for reusing BARO and reproducing the experimental results presented in our FSE'24 paper, "BARO: Robust Root Cause Analysis for Microservices via Multivariate Bayesian Online Change Point Detection".

Installation

Clone BARO from GitHub

git clone https://github.com/phamquiluan/baro.git && cd baro

Install BARO from PyPI

# install BARO from PyPI
pip install fse-baro

OR, build BARO from source

# build BARO from source
pip install -e .

BARO has been tested on Linux and Windows, with different Python versions. More details are in INSTALL.md.

How-to-use

Data format

The data must be a pandas.DataFrame containing multivariate time series metrics. It must have a column named time that stores the timestep. Every other column stores the time series of one metric and is named in the format <service>_<metric>. For example, the column cart_cpu stores the CPU utilization of the service cart. A sample of valid data can be downloaded with the download_data() function, demonstrated below.
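The naming rule above can be checked before handing data to BARO. The following is a minimal sketch, not part of the BARO package: the helper names split_metric_column and check_columns are hypothetical, and splitting at the last underscore is an assumption about how <service>_<metric> should be parsed.

```python
from typing import Iterable, List, Tuple


def split_metric_column(name: str) -> Tuple[str, str]:
    """Split '<service>_<metric>' into (service, metric).

    Assumption: the metric is the part after the LAST underscore.
    """
    service, sep, metric = name.rpartition("_")
    if not sep or not service or not metric:
        raise ValueError(f"column {name!r} is not of the form <service>_<metric>")
    return service, metric


def check_columns(columns: Iterable[str]) -> List[Tuple[str, str]]:
    """Require a 'time' column and parse every other column name."""
    columns = list(columns)
    if "time" not in columns:
        raise ValueError("missing required 'time' column")
    return [split_metric_column(c) for c in columns if c != "time"]


parsed = check_columns(["time", "cart_cpu", "cart_mem", "checkout_latency"])
print(parsed)
# [('cart', 'cpu'), ('cart', 'mem'), ('checkout', 'latency')]
```

With a DataFrame named df, the same check would be check_columns(df.columns).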

Basic usage example

BARO consists of two modules: MultivariateBOCPD (implemented in baro.anomaly_detection.bocpd) and RobustScorer (implemented in baro.root_cause_analysis.robust_scorer). We expose both modules so that users and researchers can reuse them conveniently. The basic commands to run BARO are as follows:

# You can put the code here to a file named test.py
from baro.anomaly_detection import bocpd
from baro.root_cause_analysis import robust_scorer
from baro.utility import download_data, read_data

# download a sample data to data.csv
download_data()

# read data from data.csv
data = read_data("data.csv")

# perform anomaly detection 
anomalies = bocpd(data) 
print("Anomalies are detected at timestep:", anomalies[0])

# perform root cause analysis
root_causes = robust_scorer(data, anomalies=anomalies)["ranks"]

# print the top 5 root causes
print("Top 5 root causes:", root_causes[:5])
Expected output after running the above code (it takes around 1 minute)
python test.py
Downloading data.csv..: 100%|████████████████████████████████| 570k/570k [00:00<00:00, 17.1MiB/s]
Anomalies are detected at timestep: 243
Top 5 root causes: ['checkoutservice_latency', 'cartservice_mem', 'cartservice_latency', 'cartservice_cpu', 'main_mem']
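The ranked list above contains metric names, so several entries can point at the same service. As a small sketch (the helper rank_services is hypothetical and not part of BARO, and it assumes the service is the part before the last underscore), the list can be collapsed into a service-level ranking:

```python
from typing import List


def rank_services(ranked_metrics: List[str]) -> List[str]:
    """Return services in order of their first appearance in the metric ranking."""
    services: List[str] = []
    for name in ranked_metrics:
        # Assumption: '<service>_<metric>' is split at the last underscore.
        service = name.rpartition("_")[0]
        if service and service not in services:
            services.append(service)
    return services


top5 = [
    "checkoutservice_latency",
    "cartservice_mem",
    "cartservice_latency",
    "cartservice_cpu",
    "main_mem",
]
print(rank_services(top5))
# ['checkoutservice', 'cartservice', 'main']
```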

👉 For more detailed tutorials, see tutorials/how-to-use-baro.ipynb.

Reproducibility

We have provided a file named main.py to assist in reproducing the results of our paper, which can be run using Python with the following syntax:

python main.py [-h] [--anomaly-detection] [--saved] [--dataset DATASET] [--fault-type FAULT_TYPE] [--rq4] [--eval-metric EVAL_METRIC]

The arguments and options of main.py are described as follows:

options:
  -h, --help            show this help message and exit
  --anomaly-detection   Reproduce anomaly detection results.
  --saved               Use saved anomaly detection results to reproduce the
                        presented results without rerunning anomaly detection.
  --dataset DATASET     Choose a dataset to analyze. Options:
                        ['OnlineBoutique', 'SockShop', and 'TrainTicket'].
  --fault-type FAULT_TYPE
                        Specify the fault type for root cause analysis.
                        Options: ['cpu', 'mem', 'delay', 'loss', and 'all'].
                        If 'all' is selected, the program will run the root
                        cause analysis for all fault types.
  --rq4                 Reproduce RQ4 results.
  --eval-metric EVAL_METRIC
                        Evaluation metric for RQ4. Options: ['top1', 'top3',
                        'avg5']. Default: 'avg5'.
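For readers who want to script around these options, the sketch below declares an equivalent command-line interface with argparse. It is an illustration built only from the help text above, not the actual contents of main.py; in particular, restricting values with choices is an assumption, and only the documented default ('avg5') is set.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare options mirroring the help text (illustrative, not main.py itself)."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("--anomaly-detection", action="store_true",
                        help="Reproduce anomaly detection results.")
    parser.add_argument("--saved", action="store_true",
                        help="Use saved anomaly detection results.")
    # Assumption: values are validated with `choices`.
    parser.add_argument("--dataset",
                        choices=["OnlineBoutique", "SockShop", "TrainTicket"],
                        help="Choose a dataset to analyze.")
    parser.add_argument("--fault-type",
                        choices=["cpu", "mem", "delay", "loss", "all"],
                        help="Fault type for root cause analysis.")
    parser.add_argument("--rq4", action="store_true",
                        help="Reproduce RQ4 results.")
    parser.add_argument("--eval-metric", choices=["top1", "top3", "avg5"],
                        default="avg5", help="Evaluation metric for RQ4.")
    return parser


args = build_parser().parse_args(["--dataset", "OnlineBoutique", "--fault-type", "cpu"])
print(args.dataset, args.fault_type, args.eval_metric)
# OnlineBoutique cpu avg5
```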

Reproduce RQ1 - Anomaly Detection Effectiveness

To reproduce the anomaly detection performance of BARO presented in Table 2, run the following command (the corresponding dataset will be automatically downloaded and extracted to the folder ./data):

python main.py --dataset OnlineBoutique --anomaly-detection
Expected output after running the above code (it takes around two hours)

The results are a bit better than the numbers presented in the paper (Table 2).

Downloading fse-ob.zip..: 100% 151M/151M [04:12<00:00, 597kiB/s]
Running:   2% 2/100 [02:51<2:20:31, 86.03s/it]
====== Reproduce BOCPD =====
Dataset: fse-ob
Precision: 0.76
Recall   : 1.00
F1       : 0.87

👉 You can also check tutorials/reproduce_multivariate_bocpd.ipynb to reproduce the saved anomalies in our datasets.

Reproduce RQ2 - Root Cause Analysis Effectiveness

As presented in Table 3, BARO achieves Avg@5 of 0.91, 0.96, 0.95, 0.62, and 0.86 for the CPU, MEM, DELAY, LOSS, and ALL fault types on the Online Boutique dataset. To reproduce the RCA performance of BARO presented in Table 3, run the following commands:

# For Linux users
python main.py --dataset OnlineBoutique --fault-type cpu \
  && python main.py --dataset OnlineBoutique --fault-type mem \
  && python main.py --dataset OnlineBoutique --fault-type delay \
  && python main.py --dataset OnlineBoutique --fault-type loss \
  && python main.py --dataset OnlineBoutique --fault-type all
# For Windows users
python main.py --dataset OnlineBoutique --fault-type cpu && python main.py --dataset OnlineBoutique --fault-type mem && python main.py --dataset OnlineBoutique --fault-type delay && python main.py --dataset OnlineBoutique --fault-type loss && python main.py --dataset OnlineBoutique --fault-type all
Expected output after running the above code (it takes a few seconds)
Running: 100%|███████████████████████████████████████████████████| 25/25 [00:02<00:00, 11.94it/s]
====== Reproduce BARO =====
Dataset   : fse-ob
Fault type: cpu
Avg@5 Acc : 0.91

Running: 100%|███████████████████████████████████████████████████| 25/25 [00:02<00:00, 12.10it/s]
====== Reproduce BARO =====
Dataset   : fse-ob
Fault type: mem
Avg@5 Acc : 0.96

Running: 100%|███████████████████████████████████████████████████| 25/25 [00:01<00:00, 12.73it/s]
====== Reproduce BARO =====
Dataset   : fse-ob
Fault type: delay
Avg@5 Acc : 0.95

Running: 100%|███████████████████████████████████████████████████| 25/25 [00:02<00:00, 12.35it/s]
====== Reproduce BARO =====
Dataset   : fse-ob
Fault type: loss
Avg@5 Acc : 0.62

Running: 100%|█████████████████████████████████████████████████| 100/100 [00:06<00:00, 15.82it/s]
====== Reproduce BARO =====
Dataset   : fse-ob
Fault type: all
Avg@5 Acc : 0.86
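To make the reported Avg@5 numbers easier to interpret, here is a minimal sketch of the metric. It assumes the definition commonly used in the root cause analysis literature: AC@k is the fraction of failure cases whose true root cause appears in the top k ranked candidates, and Avg@k is the mean of AC@1 through AC@k. The function names and the toy labels are hypothetical, and this is not the evaluation code in main.py.

```python
from typing import List, Sequence


def ac_at_k(rankings: Sequence[Sequence[str]], truths: Sequence[str], k: int) -> float:
    """Fraction of cases whose true root cause is within the top-k ranked candidates."""
    hits = sum(1 for ranks, truth in zip(rankings, truths) if truth in ranks[:k])
    return hits / len(truths)


def avg_at_k(rankings: Sequence[Sequence[str]], truths: Sequence[str], k: int = 5) -> float:
    """Mean of AC@1 .. AC@k."""
    return sum(ac_at_k(rankings, truths, j) for j in range(1, k + 1)) / k


# Two toy cases: the truth is ranked 1st in the first case and 3rd in the second.
rankings: List[List[str]] = [
    ["cart_cpu", "cart_mem", "main_mem", "pay_cpu", "pay_mem"],
    ["main_mem", "pay_cpu", "cart_mem", "cart_cpu", "pay_mem"],
]
truths = ["cart_cpu", "cart_mem"]

print(ac_at_k(rankings, truths, 1))   # 0.5
print(ac_at_k(rankings, truths, 3))   # 1.0
print(avg_at_k(rankings, truths, 5))  # 0.8
```

In the toy example, AC@1 and AC@2 are 0.5 and AC@3 through AC@5 are 1.0, so Avg@5 is (0.5 + 0.5 + 1 + 1 + 1) / 5 = 0.8.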

Reproduce RQ3 - Components of BARO

Our RQ3 relies on the experimental results of RQ2, which we reproduced above.

Reproduce RQ4 - Sensitivity Analysis

As presented in Figure 5, BARO maintains stable accuracy on the Online Boutique dataset when we vary t_bias from -40 to 40. To reproduce these results, for example, you can run the following command to obtain the Avg@5 scores on the Online Boutique dataset:

python main.py --dataset OnlineBoutique --rq4 --eval-metric avg5 
Expected output after running the above code (it takes a few minutes)

The output list presents the Avg@5 scores as t_bias varies. The scores show that BARO maintains stable performance.

Running: 100%|███████████████████████████████████████████████████| 40/40 [04:00<00:00,  6.02s/it]
[0.84, 0.84, 0.84, 0.84, 0.84, 0.84, 0.84, 0.84, 0.84, 0.84, 0.84, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.86, 0.86, 0.87, 0.87, 0.86, 0.87, 0.87, 0.86, 0.86, 0.86, 0.86, 0.85, 0.85, 0.85, 0.85, 0.84, 0.85, 0.85, 0.85, 0.85]
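The stability claim can be quantified directly from the printed list. The short sketch below copies the 40 scores from the output above and reports their minimum, maximum, and spread; the variable names are illustrative.

```python
# Avg@5 scores copied from the expected output above (one per t_bias setting).
scores = [
    0.84, 0.84, 0.84, 0.84, 0.84, 0.84, 0.84, 0.84, 0.84, 0.84,
    0.84, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85,
    0.86, 0.86, 0.87, 0.87, 0.86, 0.87, 0.87, 0.86, 0.86, 0.86,
    0.86, 0.85, 0.85, 0.85, 0.85, 0.84, 0.85, 0.85, 0.85, 0.85,
]

lowest, highest = min(scores), max(scores)
spread = round(highest - lowest, 2)

print(len(scores), lowest, highest, spread)
# 40 0.84 0.87 0.03
```

Across all 40 settings the score stays between 0.84 and 0.87, a spread of 0.03.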

Download Paper

Our paper can be downloaded at docs/paper.pdf or at doi/10.1145/3660805.

Download Datasets

Our datasets and their descriptions are publicly available in a Zenodo repository.

We also provide utility functions to download our datasets using Python. The downloaded datasets will be stored in the data directory.

from baro.utility import (
    download_online_boutique_dataset,
    download_sock_shop_dataset,
    download_train_ticket_dataset,
)
download_online_boutique_dataset()
download_sock_shop_dataset()
download_train_ticket_dataset()
Expected output after running the above code (it takes a few minutes to download our datasets)
$ python test.py
Downloading fse-ob.zip..: 100%|██████████| 151M/151M [01:03<00:00, 2.38MiB/s]
Downloading fse-ss.zip..: 100%|██████████| 127M/127M [00:23<00:00, 5.49MiB/s]
Downloading fse-tt.zip..: 100%|██████████| 286M/286M [00:56<00:00, 5.10MiB/s]

Running Time and Instrumentation Cost

Please refer to our docs/running_time_and_instrumentation_cost.md document.

Supplementary materials

You can download our supplementary materials from docs/fse_baro_supplementary_material.pdf

Citation

@article{pham2024baro,
author = {Pham, Luan and Ha, Huong and Zhang, Hongyu},
title = {BARO: Robust Root Cause Analysis for Microservices via Multivariate Bayesian Online Change Point Detection},
year = {2024},
issue_date = {July 2024},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {1},
number = {FSE},
url = {https://doi.org/10.1145/3660805},
doi = {10.1145/3660805},
journal = {Proc. ACM Softw. Eng.},
month = {jul},
articleno = {98},
numpages = {24},
keywords = {Anomaly Detection, Microservice Systems, Root Cause Analysis}
}

Contact

phamquiluan@gmail.com


Download files

Source Distribution

fse_baro-0.2.2.tar.gz (2.4 MB view details)

Uploaded Source

Built Distribution

fse_baro-0.2.2-py3-none-any.whl (17.7 kB view details)

Uploaded Python 3

File details

Details for the file fse_baro-0.2.2.tar.gz.

File metadata

  • Download URL: fse_baro-0.2.2.tar.gz
  • Upload date:
  • Size: 2.4 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for fse_baro-0.2.2.tar.gz
Algorithm Hash digest
SHA256 0103a565ed0a5b4334bdb2f69416f01455820b653d45d2562d95e32a0cdd3494
MD5 dc40082803ac16b39593a14b1b9ec60c
BLAKE2b-256 95bede815145010a6af250f2ec1d7286e4d349fc06fe686501b104a38d7c1580

File details

Details for the file fse_baro-0.2.2-py3-none-any.whl.

File metadata

  • Download URL: fse_baro-0.2.2-py3-none-any.whl
  • Upload date:
  • Size: 17.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for fse_baro-0.2.2-py3-none-any.whl
Algorithm Hash digest
SHA256 c38541a35ccc513531c4e03ea33171041c82ef7502f9ee2453699dd392385f85
MD5 740ed8d883ed7c47cc21d8b2920570dc
BLAKE2b-256 74240024e47724388151331eea0772a2b5fee45f1fe3177d5f92968758530af7
