

MATU: Multi-Agent Tensor Uncertainty


MATU quantifies uncertainty for LLM-based multi-agent systems from repeated conversation trajectories. This repository accompanies the ACL 2026 main-conference paper "Every Response Counts: Quantifying Uncertainty of LLM-based Multi-Agent Systems through Tensor Decomposition."

MATU is log-first: the core input is a conversation log JSON following data/LOG_FORMAT.md. The included generation scripts are examples, not required infrastructure.

Paper: Every Response Counts: Quantifying Uncertainty of LLM-based Multi-Agent Systems through Tensor Decomposition (ACL 2026 main)
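
Since the core input is just that log JSON, a quick way to get oriented is to load the quick-start sample and look at its top-level structure. The sketch below deliberately assumes nothing about field names; data/LOG_FORMAT.md remains the authoritative schema.

import json

# Inspect the quick-start MATH sample without assuming the schema.
with open("quick_start/data/conversation_logs_Math_qwen2.5.json") as f:
    logs = json.load(f)

print(type(logs))
if isinstance(logs, dict):
    print("top-level keys:", list(logs)[:5])
elif isinstance(logs, list) and logs:
    print("first entry:", logs[0])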

Dependencies

Core dependencies are specified in pyproject.toml, requirements.txt, and environment.yml.

numpy>=1.24
tqdm>=4.66
tensorly>=0.8
sentence-transformers>=2.6
transformers>=4.40
torch>=2.1
scikit-learn>=1.3
datasets>=2.18
PyYAML>=6.0

Optional log-generation examples:

openai>=1.0
camel-ai>=0.2.0

Installation

MATU is packaged from source as matu-uq. After installation, both the matu and matu-uq command names are available:

pip install -e .
matu --help
matu-uq --help

For development and tests:

pip install -e ".[dev]"

For the optional CAMEL/OpenAI example:

pip install -e ".[examples]"

Conda users can create the same core environment:

conda env create -f environment.yml
conda activate matu-uq

Local secrets and machine-specific paths should live in .env files, never in source:

cp .env.example .env
cp quick_start/.env.example quick_start/.env

OpenAI credentials are only needed for the optional CAMEL/GPT log-generation example. The included quick-start evaluation does not require an API key.
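
If you do run that example, the key is read from the environment in the usual way for openai>=1.0. A minimal sketch, assuming quick_start/.env has been loaded into the process environment before this point:

import os

# OPENAI_API_KEY is the standard variable name read by the openai client;
# how the example scripts load .env is not shown here and may differ.
api_key = os.environ.get("OPENAI_API_KEY")
if not api_key:
    raise SystemExit("Set OPENAI_API_KEY in .env before running the log-generation example.")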

Architecture

Conversation logs
      |
      v
Role-wise embeddings
      |
      v
Run / role / step tensor
      |
      v
CP-2 / PARAFAC2 decomposition
      |
      v
MATU uncertainty score
      |
      v
AUROC / AUARC evaluation

Optional baselines, such as EigV, start from the same conversation logs and are evaluated with the same labels. More details are in docs/architecture.md.
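
For intuition about the decomposition stage, here is a self-contained sketch using tensorly's parafac2 on a list of per-run trajectory matrices. This is illustrative only, not the project's implementation (which lives behind matu cp2 / matu.cp2_matu); the shapes and the fit-from-error conversion are assumptions.

import numpy as np
from tensorly.decomposition import parafac2

# One (steps x embedding_dim) trajectory matrix per repeated run; step counts
# may differ across runs, which is the irregularity PARAFAC2 tolerates.
rng = np.random.default_rng(0)
slices = [rng.standard_normal((int(rng.integers(4, 9)), 16)) for _ in range(5)]

decomposition, errors = parafac2(
    slices, rank=3, n_iter_max=25, init="random", random_state=0, return_errors=True
)
fit = 1.0 - errors[-1]  # illustrative: treat fit as one minus the final relative error
print(f"rank-3 fit: {fit:.3f}")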

Quick Start

The quick_start/ folder provides pre-computed sample artifacts so users can evaluate MATU immediately, inspect the intermediate files, or re-run selected stages. The MATH sample uses Qwen2.5 conversation logs with Qwen3 embeddings. The MMLU sample uses the original AutoGen + Qwen2.5 artifact that matches Table 2 in the paper.

Raw embedding matrices are stored inside zip archives rather than committed as standalone .pkl files. Use the python -m zipfile -e commands shown below; they work on Linux, macOS, and Windows without requiring a separate unzip executable.

Quick Start Files

quick_start/data/conversation_logs_Math_qwen2.5.json
    MATH repeated conversation logs for the public quick-start sample.
quick_start/data/conversation_logs_MMLU_Autogen_qwen2.5.json
    MMLU AutoGen conversation log sample for the paper-matching result.
quick_start/data/embeddings_Math_qwen2.5_qwen3.zip
    Zipped Qwen3 user and assistant embedding matrices for the MATH sample. Extract before inspecting or reusing the raw matrices.
quick_start/data/embeddings_MMLU_Autogen_qwen2.5.zip
    Zipped AutoGen analyst, verifier, and star embedding matrices for the MMLU sample. Extract before inspecting or reusing the raw matrices.
quick_start/results/fit_dict_Math_Assistonly_qwen2.5_qwen3embedding.pkl
    Included MATU fit curves for MATH.
quick_start/results/uncertainty_Math_Assistonly_qwen2.5.pkl
    Included scalar MATU uncertainty for MATH.
quick_start/results/accuracy_dict_Math_qwen2.5.pkl
    MATH repeated-run correctness labels.
quick_start/results/fit_dict_MMLU_Autogen_qwen2.5.pkl
    Included MATU fit curves for the MMLU AutoGen paper artifact.
quick_start/results/accuracy_dict_MMLU_Autogen_qwen2.5.pkl
    MMLU AutoGen repeated-run correctness labels.
quick_start/results/saup_scores_Math_qwen2.5.pkl
    Included SAUP-Multiple baseline scores for comparison.

Option A: Standalone Quick-Start Scripts

Set up the quick-start environment file:

cp quick_start/.env.example quick_start/.env

Directly evaluate the included MATU results without re-running embedding or CP-2:

python quick_start/code/04_evaluate_reference_results.py --sample all

Evaluate the included SAUP-Multiple baseline:

python quick_start/code/05_evaluate_baselines.py

To inspect the packaged embedding matrices:

mkdir -p quick_start/generated/reference_embeddings/math
python -m zipfile -e quick_start/data/embeddings_Math_qwen2.5_qwen3.zip quick_start/generated/reference_embeddings/math

mkdir -p quick_start/generated/reference_embeddings/mmlu_autogen_qwen
python -m zipfile -e quick_start/data/embeddings_MMLU_Autogen_qwen2.5.zip quick_start/generated/reference_embeddings/mmlu_autogen_qwen
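
Equivalently, the extraction can be done from Python with the same standard-library module:

import zipfile
from pathlib import Path

# Extract the MATH reference embeddings (mirrors the shell commands above).
dest = Path("quick_start/generated/reference_embeddings/math")
dest.mkdir(parents=True, exist_ok=True)
with zipfile.ZipFile("quick_start/data/embeddings_Math_qwen2.5_qwen3.zip") as zf:
    zf.extractall(dest)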

The MATH archive creates:

quick_start/generated/reference_embeddings/math/user_embedding_matrices_Math_qwen2.5_qwen3.pkl
quick_start/generated/reference_embeddings/math/assistant_embedding_matrices_Math_qwen2.5_qwen3.pkl

The MMLU AutoGen archive creates:

quick_start/generated/reference_embeddings/mmlu_autogen_qwen/autogen_analyst_embedding_matrices_MMLU_HF_qwen2.5.pkl
quick_start/generated/reference_embeddings/mmlu_autogen_qwen/autogen_verifier_embedding_matrices_MMLU_HF_qwen2.5.pkl
quick_start/generated/reference_embeddings/mmlu_autogen_qwen/autogen_star_embedding_matrices_MMLU_HF_qwen2.5.pkl
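
To sanity-check an extracted file, here is a generic inspection sketch; the comment about the dict layout is an assumption, not a documented guarantee:

import pickle

path = ("quick_start/generated/reference_embeddings/math/"
        "assistant_embedding_matrices_Math_qwen2.5_qwen3.pkl")
with open(path, "rb") as f:
    matrices = pickle.load(f)

print(type(matrices))
# If this is a mapping from task id to per-run matrices (an assumption),
# peek at one entry to see the trajectory shapes.
if isinstance(matrices, dict):
    key = next(iter(matrices))
    print(key, type(matrices[key]))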

To run CP-2 from the extracted MATH reference embeddings without downloading an embedding model:

mkdir -p quick_start/generated/results
python -m matu.cp2_matu \
  --embeddings \
  quick_start/generated/reference_embeddings/math/user_embedding_matrices_Math_qwen2.5_qwen3.pkl \
  quick_start/generated/reference_embeddings/math/assistant_embedding_matrices_Math_qwen2.5_qwen3.pkl \
  --out quick_start/generated/results/matu_scores.pkl \
  --legacy_fit_out quick_start/generated/results/fit_dict_generated.pkl \
  --max_rank 50

python quick_start/code/03_fit_to_uncertainty_generated.py
python quick_start/code/04_evaluate_generated_results.py

To re-embed the included MATH conversation logs from scratch, run:

python quick_start/code/01_embed_reference_logs.py
python quick_start/code/02_run_cp2_from_generated_embeddings.py
python quick_start/code/03_fit_to_uncertainty_generated.py
python quick_start/code/04_evaluate_generated_results.py

The re-embedding step downloads or loads Qwen/Qwen3-Embedding-0.6B; a GPU is recommended but not required for small tests.
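
The model download is the heavy part of that step. A minimal standalone sketch of loading the model with sentence-transformers; matu embed presumably wraps this with its own batching and role handling:

from sentence_transformers import SentenceTransformer

# Downloads Qwen/Qwen3-Embedding-0.6B on first use; falls back to CPU
# when no GPU is available.
model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")
embeddings = model.encode(["first assistant turn", "second assistant turn"])
print(embeddings.shape)  # (2, embedding_dim)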

Option B: Using the MATU CLI

Install from source:

pip install -e .

Run the same core stages through the CLI:

matu embed \
  --logs quick_start/data/conversation_logs_Math_qwen2.5.json \
  --out_dir quick_start/generated/embeddings \
  --roles user assistant

matu cp2 \
  --embeddings \
  quick_start/generated/embeddings/user_embedding_matrices.pkl \
  quick_start/generated/embeddings/assistant_embedding_matrices.pkl \
  --out quick_start/generated/results/matu_scores.pkl \
  --legacy_fit_out quick_start/generated/results/fit_dict_generated.pkl \
  --max_rank 50

matu fit \
  --fit_dict quick_start/generated/results/matu_scores.pkl \
  --out quick_start/generated/results/uncertainty_generated.pkl

matu eval \
  --uncertainty quick_start/generated/results/uncertainty_generated.pkl \
  --labels quick_start/results/accuracy_dict_Math_qwen2.5.pkl \
  --score_mode raw
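
matu eval prints the metrics directly. If you want to recompute AUROC yourself, here is a sketch with scikit-learn; the pickle layouts assumed below (dicts keyed by task id, with per-run 0/1 correctness lists on the label side, matching the any_incorrect rule) are assumptions, not a documented API:

import pickle
from sklearn.metrics import roc_auc_score

with open("quick_start/generated/results/uncertainty_generated.pkl", "rb") as f:
    uncertainty = pickle.load(f)
with open("quick_start/results/accuracy_dict_Math_qwen2.5.pkl", "rb") as f:
    accuracy = pickle.load(f)

tasks = [t for t in uncertainty if t in accuracy]
scores = [uncertainty[t] for t in tasks]
# any_incorrect: a task is an error event if any repeated run was wrong.
errors = [int(any(run == 0 for run in accuracy[t])) for t in tasks]
print("AUROC:", roc_auc_score(errors, scores))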

Expected Results

The included .pkl files reproduce the following quick-start metrics:

MATH + Qwen2.5 + MATU
    Paper reference: Table 1 (AUROC 0.7089, AUARC 0.9064)
    Included output: AUROC 0.7205, AUARC 0.9017
    Command: python quick_start/code/04_evaluate_reference_results.py --sample math-qwen

MMLU + AutoGen + Qwen2.5 + MATU
    Paper reference: Table 2 (AUROC 0.7315, AUARC 0.8833)
    Included output: AUROC 0.7315, AUARC 0.8834
    Command: python quick_start/code/04_evaluate_reference_results.py --sample mmlu-autogen-qwen

MATH + Qwen2.5 + SAUP-Multiple
    Paper reference: baseline comparison
    Included output: AUROC 0.6097, AUARC 0.8722
    Command: python quick_start/code/05_evaluate_baselines.py

The MATH artifact was re-run for public release packaging, so it is not bit-for-bit identical to the paper table; the difference is within expected CP-2 run-to-run tolerance. The MMLU AutoGen sample uses the original paper artifact, so it matches Table 2 up to display rounding.

Expected output for both MATU samples:

MATH + Qwen2.5-7B
Tasks: 400
Mean accuracy: 0.8383
AUROC: 0.7205
AUARC: 0.9017

Paper Table 2: MMLU + AutoGen + Qwen2.5-7B
Tasks: 400
Mean accuracy: 0.7785
AUROC: 0.7315
AUARC: 0.8834

Full Pipeline

To run MATU on your own conversation logs:

# 1. Embed role-specific conversation trajectories.
matu embed --config configs/default.yaml \
  --logs path/to/conversation_logs.json \
  --out_dir outputs/embeddings \
  --roles user assistant

# 2. Run CP-2 / PARAFAC2 tensor scoring.
matu cp2 --config configs/default.yaml \
  --embeddings outputs/embeddings/user_embedding_matrices.pkl outputs/embeddings/assistant_embedding_matrices.pkl \
  --out outputs/matu_scores.pkl \
  --legacy_fit_out outputs/fit_dict.pkl

# 3. Convert fit curves to scalar uncertainty.
matu fit --config configs/default.yaml \
  --fit_dict outputs/matu_scores.pkl \
  --out outputs/uncertainty.pkl

# 4. Evaluate uncertainty against repeated-run labels.
matu eval --config configs/default.yaml \
  --uncertainty outputs/uncertainty.pkl \
  --labels path/to/accuracy_dict.pkl

Optional baseline:

matu eigv \
  --logs path/to/conversation_logs.json \
  --mode final \
  --out outputs/eigv_final.pkl

Pipeline Stages

1. Log collection
    Command: examples/generate_logs_hf_qwen.py or your own agent framework
    Collects repeated multi-agent conversation trajectories.
    Output: conversation log JSON

2. Embedding
    Command: matu embed
    Converts each role's turns into trajectory matrices.
    Output: <role>_embedding_matrices.pkl

3. CP-2 / MATU
    Command: matu cp2
    Runs rank-wise tensor decomposition over repeated trajectories.
    Output: structured MATU scores and an optional legacy fit_dict

4. Uncertainty conversion
    Command: matu fit
    Converts rank-wise fit curves into scalar uncertainty.
    Output: uncertainty.pkl

5. Evaluation
    Command: matu eval
    Computes AUROC and AUARC from repeated-run labels.
    Output: console metrics

Baseline
    Command: matu eigv
    Computes the EigV agreement baseline from logs.
    Output: baseline score pickle

Key uncertainty definition for legacy fit_dict files, where fit_R denotes the CP-2 reconstruction fit at rank R:

uncertainty = sum_R (1 - fit_R)
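
A minimal sketch of that conversion, assuming a legacy fit_dict that maps each task id to its per-rank fit values (matu fit is the supported path):

import pickle

with open("quick_start/generated/results/fit_dict_generated.pkl", "rb") as f:
    fit_dict = pickle.load(f)

# Sum (1 - fit_R) over the evaluated ranks, per task (assumed dict layout).
uncertainty = {task: sum(1.0 - fit for fit in fits) for task, fits in fit_dict.items()}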

Output Directory Structure

outputs/
|-- conversation_logs.json
|-- embeddings/
|   |-- user_embedding_matrices.pkl
|   `-- assistant_embedding_matrices.pkl
|-- matu_scores.pkl
|-- fit_dict.pkl
|-- uncertainty.pkl
`-- eigv_final.pkl

quick_start/generated/
|-- embeddings/
|-- reference_embeddings/
`-- results/

Generated outputs, extracted embeddings, model caches, and build artifacts are ignored by git.

Configuration

All default paths and hyperparameters are documented in configs/default.yaml. CLI flags override config file values.

embedding.model (default: Qwen/Qwen3-Embedding-0.6B)
    Sentence-transformer embedding model.
embedding.roles (default: [user, assistant])
    Roles to extract from each conversation turn.
cp2.min_rank (default: 1)
    Minimum CP-2 rank.
cp2.max_rank (default: 50)
    Maximum CP-2 rank. Reduce for smoke tests.
cp2.max_iter (default: 25)
    ALS iterations per rank.
cp2.seed (default: 0)
    Factor initialization seed.
cp2.combine_mode (default: interleave)
    How role/run matrices are assembled.
evaluation.error_rule (default: any_incorrect)
    Repeated-run error event for AUROC.
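
Since PyYAML is a core dependency, the config can also be adjusted programmatically. The nesting below is inferred from the dotted parameter names above; configs/default.yaml is authoritative:

import yaml

with open("configs/default.yaml") as f:
    config = yaml.safe_load(f)

# Lower the maximum CP-2 rank for a smoke test.
config["cp2"]["max_rank"] = 5
print(config["embedding"]["model"], config["cp2"]["max_rank"])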

Data Sources

The quick start is self-contained and does not require benchmark downloads. Full paper-scale experiments use public datasets including MATH, MMLU, MoreHopQA, and HumanEval/EvalPlus. Source links and download snippets are in data/README.md.
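
For orientation only, a generic sketch with the datasets library; the Hub id below is a commonly used one for MMLU, not taken from this repository, and data/README.md has the project's own snippets:

from datasets import load_dataset

# "cais/mmlu" is a widely used Hugging Face Hub id for MMLU (an assumption
# about which mirror to use; defer to data/README.md).
mmlu = load_dataset("cais/mmlu", "all", split="test")
print(len(mmlu), list(mmlu[0].keys()))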

Hardware Requirements

Direct quick-start evaluation: CPU only; seconds.
Unzip reference embeddings: CPU only; seconds.
Re-embedding sample logs: GPU recommended, CPU possible; minutes, depending on hardware.
CP-2 on quick-start embeddings: CPU is sufficient; minutes, longer at the default rank 50.
Optional CAMEL/GPT log generation: CPU plus an OpenAI API key; API-dependent.
Full paper-scale runs: GPU recommended; dataset- and agent-framework-dependent.

For a smoke test, reduce MATU_MAX_RANK in quick_start/.env or cp2.max_rank in configs/default.yaml.

Tests And Makefile

Common checks are collected in the Makefile:

make install-dev
make test
make quick-eval
make paper-eval
make check
make clean

make test runs the unit tests in tests/. make check runs tests, compileall, CLI help, MATH quick evaluation, and the MMLU paper-matching evaluation. make clean removes Python caches, build outputs, and :Zone.Identifier sidecar files.

Docker

Build a CPU-ready image:

docker build -t matu-uq .

Run the quick-start and paper-matching evaluations in the container:

docker run --rm matu-uq make quick-eval
docker run --rm matu-uq make paper-eval

For local outputs:

docker run --rm -v "$(pwd)/outputs:/app/outputs" matu-uq matu --help

PyPI Release

The package metadata is ready under the name matu-uq. The GitHub Actions workflow in .github/workflows/publish.yml uses PyPI Trusted Publishing. Before the first release, configure PyPI with:

project name: matu-uq
repository: tiejin98/MATU
workflow: publish.yml
environment: pypi

Then create a GitHub release. After the first successful PyPI upload, add the PyPI badge:

[![PyPI](https://img.shields.io/pypi/v/matu-uq.svg)](https://pypi.org/project/matu-uq/)

Project Structure

MATU/
|-- matu/                  # Main package and CLI
|-- baselines/             # EigV baseline
|-- configs/               # YAML configuration
|-- data/                  # Log and artifact format docs
|-- docs/                  # Architecture notes
|-- examples/              # Optional log collectors
|-- quick_start/           # Reproducible sample artifacts and scripts
|-- tests/                 # Lightweight unit tests
|-- requirements.txt
|-- pyproject.toml
|-- Makefile
|-- Dockerfile
`-- .env.example

Citation

If you find this work useful, please cite:

@inproceedings{chen2026every,
  title = {Every Response Counts: Quantifying Uncertainty of LLM-based Multi-Agent Systems through Tensor Decomposition},
  author = {Chen, Tiejin and Yao, Huaiyuan and Chen, Jia and Papalexakis, Evangelos E. and Wei, Hua},
  booktitle = {Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics},
  year = {2026}
}

License

MIT License. See LICENSE for details.
