BARO: Robust Root Cause Analysis for Microservices via Multivariate Bayesian Online Change Point Detection
🕵️ BARO: Root Cause Analysis for Microservices
BARO is an end-to-end anomaly detection and root cause analysis approach for microservice failures. This repository contains the artifacts needed to reuse BARO and to reproduce the experimental results presented in our FSE'24 paper, "BARO: Robust Root Cause Analysis for Microservices via Multivariate Bayesian Online Change Point Detection".
Installation
Install from PyPI
pip install fse-baro
Or, build from source
git clone https://github.com/phamquiluan/baro.git && cd baro
pip install -e .
BARO has been tested on Linux and Windows, with different Python versions. More details are in INSTALL.md.
How-to-use
Data format
The data must be a pandas.DataFrame that contains multivariate time series metrics. It must have a column named time that stores the timestep. Every other column stores the time series of one metric and is named in the format <service>_<metric>. For example, the column cart_cpu stores the CPU utilization of the service cart. A sample of valid data can be downloaded using the download_data() function, which is demonstrated below.
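The naming rule above can be checked before passing data to BARO. The following is a minimal sketch that uses only the standard library. The helper name check_columns is hypothetical and is not part of the baro package, and splitting on the last underscore is an assumption about how <service>_<metric> should be read.

```python
def check_columns(columns):
    """Split metric column names into (service, metric) pairs.

    Hypothetical helper for illustration; not part of the baro package.
    Assumes the metric name is the part after the last underscore.
    """
    columns = list(columns)
    if "time" not in columns:
        raise ValueError("missing required column: time")

    pairs = {}
    for name in columns:
        if name == "time":
            continue
        service, sep, metric = name.rpartition("_")
        if not sep or not service or not metric:
            raise ValueError(
                f"column {name!r} is not of the form <service>_<metric>"
            )
        pairs[name] = (service, metric)
    return pairs


print(check_columns(["time", "cart_cpu", "cart_mem"]))
# {'cart_cpu': ('cart', 'cpu'), 'cart_mem': ('cart', 'mem')}
```

With a DataFrame loaded as data, the same check would be check_columns(data.columns).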
Sample Python commands to use BARO
BARO consists of two modules: MultivariateBOCPD (implemented in baro.anomaly_detection.bocpd) and RobustScorer (implemented in baro.root_cause_analysis.robust_scorer). We expose both as functions so that users and researchers can reuse them conveniently. Sample commands to run BARO are as follows:
# You can put the code here to a file named test.py
from baro.anomaly_detection import bocpd
from baro.root_cause_analysis import robust_scorer
from baro.utility import download_data, read_data
# download a sample data to data.csv
download_data()
# read data from data.csv
data = read_data("data.csv")
# perform anomaly detection
anomalies = bocpd(data)
print("Anomalies are detected at timestep:", anomalies[0])
# perform root cause analysis
root_causes = robust_scorer(data, anomalies=anomalies)["ranks"]
# print the top 5 root causes
print("Top 5 root causes:", root_causes[:5])
Expected output after running the above code (it takes around 1 minute)
$ python test.py
Downloading data.csv..: 100%|████████████████████████████████| 570k/570k [00:00<00:00, 17.1MiB/s]
Anomalies are detected at timestep: 243
Top 5 root causes: ['checkoutservice_latency', 'cartservice_mem', 'cartservice_latency', 'cartservice_cpu', 'main_mem']
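Because each ranked entry follows the <service>_<metric> naming format, the metric-level ranking can be collapsed into a service-level one. The following is a rough sketch, not part of BARO itself: the helper rank_services is hypothetical, and it assumes a service should be ranked by the position of its first metric in the list.

```python
def rank_services(ranked_metrics):
    """Return services ordered by their first appearance in the ranking.

    Hypothetical helper for illustration; not part of the baro package.
    """
    services = []
    for name in ranked_metrics:
        # The service is everything before the last underscore.
        service, _, _metric = name.rpartition("_")
        if service and service not in services:
            services.append(service)
    return services


# The top 5 root causes from the expected output above.
ranks = [
    "checkoutservice_latency",
    "cartservice_mem",
    "cartservice_latency",
    "cartservice_cpu",
    "main_mem",
]
print(rank_services(ranks))
# ['checkoutservice', 'cartservice', 'main']
```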
Reproducibility
As presented in Table 3 of our paper, BARO achieves Avg@5 of 0.91, 0.96, 0.95, 0.62, and 0.86 for the CPU, MEM, DELAY, LOSS, and ALL fault types on the Online Boutique dataset. To reproduce the RCA performance presented in Table 3, run the following commands:
Reproduce RCA performance on the Online Boutique dataset, fault type CPU
$ python main.py --dataset OnlineBoutique --fault-type cpu
Expected output
====== Reproduce BARO =====
Dataset : fse-ob
Fault type: cpu
Avg@5 Acc : 0.91
Reproduce RCA performance on the Online Boutique dataset, fault type MEM
$ python main.py --dataset OnlineBoutique --fault-type mem
Expected output
====== Reproduce BARO =====
Dataset : fse-ob
Fault type: mem
Avg@5 Acc : 0.96
Reproduce RCA performance on the Online Boutique dataset, fault type DELAY
$ python main.py --dataset OnlineBoutique --fault-type delay
Expected output
====== Reproduce BARO =====
Dataset : fse-ob
Fault type: delay
Avg@5 Acc : 0.95
Reproduce RCA performance on the Online Boutique dataset, fault type LOSS
$ python main.py --dataset OnlineBoutique --fault-type loss
Expected output
====== Reproduce BARO =====
Dataset : fse-ob
Fault type: loss
Avg@5 Acc : 0.62
Reproduce RCA performance on the Online Boutique dataset, fault type ALL
$ python main.py --dataset OnlineBoutique --fault-type all
Expected output
====== Reproduce BARO =====
Dataset : fse-ob
Fault type: all
Avg@5 Acc : 0.86
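The five runs above differ only in the value of --fault-type, so they can be scripted. The following is a minimal sketch that assumes it is executed from the repository root, where main.py lives. The function names build_commands and run_all are hypothetical; the flags are the ones shown in the commands above.

```python
import subprocess
import sys

# Fault types taken from the commands listed above.
FAULT_TYPES = ["cpu", "mem", "delay", "loss", "all"]


def build_commands(dataset="OnlineBoutique"):
    """Build one main.py invocation per fault type."""
    return [
        [sys.executable, "main.py", "--dataset", dataset, "--fault-type", fault]
        for fault in FAULT_TYPES
    ]


def run_all():
    """Run every command in order and stop at the first failure."""
    for command in build_commands():
        subprocess.run(command, check=True)


# Print the commands for inspection; call run_all() to execute them.
for command in build_commands():
    print(" ".join(command[1:]))
```

Using sys.executable instead of a bare python keeps the runs inside the same interpreter and virtual environment that launched the script.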
We have prepared two Google Colab notebooks:
- A notebook that reproduces the RCA performance of BARO (also at tutorials/reproducibility.ipynb).
- A notebook that reproduces the output of the Multivariate BOCPD module.
Download Paper
TBD
Download Datasets
Our datasets are publicly available in a Zenodo repository:
- Dataset DOI:
- Dataset URL: https://zenodo.org/records/11046533
Running Time & Instrumentation Cost
Please refer to our docs/running_time_and_instrumentation_cost.md document.
Citation
@inproceedings{pham2024baro,
title={BARO: Robust Root Cause Analysis for Microservices via Multivariate Bayesian Online Change Point Detection},
author={Luan Pham and Huong Ha and Hongyu Zhang},
booktitle={Proceedings of the ACM on Software Engineering, Vol 1},
year={2024},
organization={ACM}
}
Contact