

MATU: Multi-Agent Tensor Uncertainty


MATU quantifies uncertainty for LLM-based multi-agent systems from repeated conversation trajectories. This repository accompanies the ACL 2026 main-conference paper "Every Response Counts: Quantifying Uncertainty of LLM-based Multi-Agent Systems through Tensor Decomposition."

MATU is log-first: the core input is a conversation log JSON following data/LOG_FORMAT.md. The included generation scripts are examples, not required infrastructure.

Paper: Every Response Counts: Quantifying Uncertainty of LLM-based Multi-Agent Systems through Tensor Decomposition (ACL 2026 main)
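
Since the core input is just that log JSON, a quick way to get oriented is to load the quick-start sample and look at its top-level structure. The sketch below deliberately assumes nothing about field names; data/LOG_FORMAT.md remains the authoritative schema.

import json

# Inspect the quick-start MATH sample without assuming the schema.
with open("quick_start/data/conversation_logs_Math_qwen2.5.json") as f:
    logs = json.load(f)

print(type(logs))
if isinstance(logs, dict):
    print("top-level keys:", list(logs)[:5])
elif isinstance(logs, list) and logs:
    print("first entry:", logs[0])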

Dependencies

Core dependencies are specified in pyproject.toml, requirements.txt, and environment.yml.

numpy>=1.24
tqdm>=4.66
tensorly>=0.8
sentence-transformers>=2.6
transformers>=4.40
torch>=2.1
scikit-learn>=1.3
datasets>=2.18
PyYAML>=6.0

Optional log-generation examples:

openai>=1.0
camel-ai>=0.2.0

Installation

MATU is packaged from source as matu-uq. After installation, both the matu and matu-uq command names are available:

pip install -e .
matu --help
matu-uq --help

For development and tests:

pip install -e ".[dev]"

For the optional CAMEL/OpenAI example:

pip install -e ".[examples]"

Conda users can create the same core environment:

conda env create -f environment.yml
conda activate matu-uq

Local secrets and machine-specific paths should live in .env files, never in source:

cp .env.example .env
cp quick_start/.env.example quick_start/.env

OpenAI credentials are only needed for the optional CAMEL/GPT log-generation example. The included quick-start evaluation does not require an API key.
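
If you do run that example, the key is read from the environment in the usual way for openai>=1.0. A minimal sketch, assuming quick_start/.env has been loaded into the process environment before this point:

import os

# OPENAI_API_KEY is the standard variable name read by the openai client;
# how the example scripts load .env is not shown here and may differ.
api_key = os.environ.get("OPENAI_API_KEY")
if not api_key:
    raise SystemExit("Set OPENAI_API_KEY in .env before running the log-generation example.")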

Architecture

Conversation logs
      |
      v
Role-wise embeddings
      |
      v
Run / role / step tensor
      |
      v
CP-2 / PARAFAC2 decomposition
      |
      v
MATU uncertainty score
      |
      v
AUROC / AUARC evaluation

Optional baselines, such as EigV, start from the same conversation logs and are evaluated with the same labels. More details are in docs/architecture.md.
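
For intuition about the decomposition stage, here is a self-contained sketch using tensorly's parafac2 on a list of per-run trajectory matrices. This is illustrative only, not the project's implementation (which lives behind matu cp2 / matu.cp2_matu); the shapes and the fit-from-error conversion are assumptions.

import numpy as np
from tensorly.decomposition import parafac2

# One (steps x embedding_dim) trajectory matrix per repeated run; step counts
# may differ across runs, which is the irregularity PARAFAC2 tolerates.
rng = np.random.default_rng(0)
slices = [rng.standard_normal((int(rng.integers(4, 9)), 16)) for _ in range(5)]

decomposition, errors = parafac2(
    slices, rank=3, n_iter_max=25, init="random", random_state=0, return_errors=True
)
fit = 1.0 - errors[-1]  # illustrative: treat fit as one minus the final relative error
print(f"rank-3 fit: {fit:.3f}")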

Quick Start

The quick_start/ folder provides pre-computed sample artifacts so users can evaluate MATU immediately, inspect the intermediate files, or re-run selected stages. The MATH sample uses Qwen2.5 conversation logs with Qwen3 embeddings. The MMLU sample uses the original AutoGen + Qwen2.5 artifact that matches Table 2 in the paper.

Raw embedding matrices are stored inside zip archives rather than committed as standalone .pkl files. Use the python -m zipfile -e commands shown below; they work on Linux, macOS, and Windows without requiring a separate unzip executable.

Quick Start Files

quick_start/data/conversation_logs_Math_qwen2.5.json
    MATH repeated conversation logs for the public quick-start sample.
quick_start/data/conversation_logs_MMLU_Autogen_qwen2.5.json
    MMLU AutoGen conversation log sample for the paper-matching result.
quick_start/data/embeddings_Math_qwen2.5_qwen3.zip
    Zipped Qwen3 user and assistant embedding matrices for the MATH sample. Extract before inspecting or reusing the raw matrices.
quick_start/data/embeddings_MMLU_Autogen_qwen2.5.zip
    Zipped AutoGen analyst, verifier, and star embedding matrices for the MMLU sample. Extract before inspecting or reusing the raw matrices.
quick_start/results/fit_dict_Math_Assistonly_qwen2.5_qwen3embedding.pkl
    Included MATU fit curves for MATH.
quick_start/results/uncertainty_Math_Assistonly_qwen2.5.pkl
    Included scalar MATU uncertainty for MATH.
quick_start/results/accuracy_dict_Math_qwen2.5.pkl
    MATH repeated-run correctness labels.
quick_start/results/fit_dict_MMLU_Autogen_qwen2.5.pkl
    Included MATU fit curves for the MMLU AutoGen paper artifact.
quick_start/results/accuracy_dict_MMLU_Autogen_qwen2.5.pkl
    MMLU AutoGen repeated-run correctness labels.
quick_start/results/saup_scores_Math_qwen2.5.pkl
    Included SAUP-Multiple baseline scores for comparison.

Option A: Standalone Quick-Start Scripts

Set up the quick-start environment file:

cp quick_start/.env.example quick_start/.env

Directly evaluate the included MATU results without re-running embedding or CP-2:

python quick_start/code/04_evaluate_reference_results.py --sample all

Evaluate the included SAUP-Multiple baseline:

python quick_start/code/05_evaluate_baselines.py

To inspect the packaged embedding matrices:

mkdir -p quick_start/generated/reference_embeddings/math
python -m zipfile -e quick_start/data/embeddings_Math_qwen2.5_qwen3.zip quick_start/generated/reference_embeddings/math

mkdir -p quick_start/generated/reference_embeddings/mmlu_autogen_qwen
python -m zipfile -e quick_start/data/embeddings_MMLU_Autogen_qwen2.5.zip quick_start/generated/reference_embeddings/mmlu_autogen_qwen
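
Equivalently, the extraction can be done from Python with the same standard-library module:

import zipfile
from pathlib import Path

# Extract the MATH reference embeddings (mirrors the shell commands above).
dest = Path("quick_start/generated/reference_embeddings/math")
dest.mkdir(parents=True, exist_ok=True)
with zipfile.ZipFile("quick_start/data/embeddings_Math_qwen2.5_qwen3.zip") as zf:
    zf.extractall(dest)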

The MATH archive creates:

quick_start/generated/reference_embeddings/math/user_embedding_matrices_Math_qwen2.5_qwen3.pkl
quick_start/generated/reference_embeddings/math/assistant_embedding_matrices_Math_qwen2.5_qwen3.pkl

The MMLU AutoGen archive creates:

quick_start/generated/reference_embeddings/mmlu_autogen_qwen/autogen_analyst_embedding_matrices_MMLU_HF_qwen2.5.pkl
quick_start/generated/reference_embeddings/mmlu_autogen_qwen/autogen_verifier_embedding_matrices_MMLU_HF_qwen2.5.pkl
quick_start/generated/reference_embeddings/mmlu_autogen_qwen/autogen_star_embedding_matrices_MMLU_HF_qwen2.5.pkl
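
To sanity-check an extracted file, here is a generic inspection sketch; the comment about the dict layout is an assumption, not a documented guarantee:

import pickle

path = ("quick_start/generated/reference_embeddings/math/"
        "assistant_embedding_matrices_Math_qwen2.5_qwen3.pkl")
with open(path, "rb") as f:
    matrices = pickle.load(f)

print(type(matrices))
# If this is a mapping from task id to per-run matrices (an assumption),
# peek at one entry to see the trajectory shapes.
if isinstance(matrices, dict):
    key = next(iter(matrices))
    print(key, type(matrices[key]))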

To run CP-2 from the extracted MATH reference embeddings without downloading an embedding model:

mkdir -p quick_start/generated/results
python -m matu.cp2_matu \
  --embeddings \
  quick_start/generated/reference_embeddings/math/user_embedding_matrices_Math_qwen2.5_qwen3.pkl \
  quick_start/generated/reference_embeddings/math/assistant_embedding_matrices_Math_qwen2.5_qwen3.pkl \
  --out quick_start/generated/results/matu_scores.pkl \
  --legacy_fit_out quick_start/generated/results/fit_dict_generated.pkl \
  --max_rank 50

python quick_start/code/03_fit_to_uncertainty_generated.py
python quick_start/code/04_evaluate_generated_results.py

To re-embed the included MATH conversation logs from scratch, run:

python quick_start/code/01_embed_reference_logs.py
python quick_start/code/02_run_cp2_from_generated_embeddings.py
python quick_start/code/03_fit_to_uncertainty_generated.py
python quick_start/code/04_evaluate_generated_results.py

The re-embedding step downloads or loads Qwen/Qwen3-Embedding-0.6B; a GPU is recommended but not required for small tests.
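
The model download is the heavy part of that step. A minimal standalone sketch of loading the model with sentence-transformers; matu embed presumably wraps this with its own batching and role handling:

from sentence_transformers import SentenceTransformer

# Downloads Qwen/Qwen3-Embedding-0.6B on first use; falls back to CPU
# when no GPU is available.
model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")
embeddings = model.encode(["first assistant turn", "second assistant turn"])
print(embeddings.shape)  # (2, embedding_dim)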

Option B: Using the MATU CLI

Install from source:

pip install -e .

Run the same core stages through the CLI:

matu embed \
  --logs quick_start/data/conversation_logs_Math_qwen2.5.json \
  --out_dir quick_start/generated/embeddings \
  --roles user assistant

matu cp2 \
  --embeddings \
  quick_start/generated/embeddings/user_embedding_matrices.pkl \
  quick_start/generated/embeddings/assistant_embedding_matrices.pkl \
  --out quick_start/generated/results/matu_scores.pkl \
  --legacy_fit_out quick_start/generated/results/fit_dict_generated.pkl \
  --max_rank 50

matu fit \
  --fit_dict quick_start/generated/results/matu_scores.pkl \
  --out quick_start/generated/results/uncertainty_generated.pkl

matu eval \
  --uncertainty quick_start/generated/results/uncertainty_generated.pkl \
  --labels quick_start/results/accuracy_dict_Math_qwen2.5.pkl \
  --score_mode raw
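
matu eval prints the metrics directly. If you want to recompute AUROC yourself, here is a sketch with scikit-learn; the pickle layouts assumed below (dicts keyed by task id, with per-run 0/1 correctness lists on the label side, matching the any_incorrect rule) are assumptions, not a documented API:

import pickle
from sklearn.metrics import roc_auc_score

with open("quick_start/generated/results/uncertainty_generated.pkl", "rb") as f:
    uncertainty = pickle.load(f)
with open("quick_start/results/accuracy_dict_Math_qwen2.5.pkl", "rb") as f:
    accuracy = pickle.load(f)

tasks = [t for t in uncertainty if t in accuracy]
scores = [uncertainty[t] for t in tasks]
# any_incorrect: a task is an error event if any repeated run was wrong.
errors = [int(any(run == 0 for run in accuracy[t])) for t in tasks]
print("AUROC:", roc_auc_score(errors, scores))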

Expected Results

The included .pkl files reproduce the following quick-start metrics:

MATH + Qwen2.5 + MATU
    Paper reference: Table 1 (AUROC 0.7089, AUARC 0.9064)
    Included output: AUROC 0.7205, AUARC 0.9017
    Command: python quick_start/code/04_evaluate_reference_results.py --sample math-qwen

MMLU + AutoGen + Qwen2.5 + MATU
    Paper reference: Table 2 (AUROC 0.7315, AUARC 0.8833)
    Included output: AUROC 0.7315, AUARC 0.8834
    Command: python quick_start/code/04_evaluate_reference_results.py --sample mmlu-autogen-qwen

MATH + Qwen2.5 + SAUP-Multiple
    Paper reference: baseline comparison
    Included output: AUROC 0.6097, AUARC 0.8722
    Command: python quick_start/code/05_evaluate_baselines.py

The MATH artifact was re-run for public release packaging, so it is not bit-for-bit identical to the paper table; the difference is within expected CP-2 run-to-run tolerance. The MMLU AutoGen sample uses the original paper artifact, so it matches Table 2 up to display rounding.

Expected output for both MATU samples:

MATH + Qwen2.5-7B
Tasks: 400
Mean accuracy: 0.8383
AUROC: 0.7205
AUARC: 0.9017

Paper Table 2: MMLU + AutoGen + Qwen2.5-7B
Tasks: 400
Mean accuracy: 0.7785
AUROC: 0.7315
AUARC: 0.8834

Full Pipeline

To run MATU on your own conversation logs:

# 1. Embed role-specific conversation trajectories.
matu embed --config configs/default.yaml \
  --logs path/to/conversation_logs.json \
  --out_dir outputs/embeddings \
  --roles user assistant

# 2. Run CP-2 / PARAFAC2 tensor scoring.
matu cp2 --config configs/default.yaml \
  --embeddings outputs/embeddings/user_embedding_matrices.pkl outputs/embeddings/assistant_embedding_matrices.pkl \
  --out outputs/matu_scores.pkl \
  --legacy_fit_out outputs/fit_dict.pkl

# 3. Convert fit curves to scalar uncertainty.
matu fit --config configs/default.yaml \
  --fit_dict outputs/matu_scores.pkl \
  --out outputs/uncertainty.pkl

# 4. Evaluate uncertainty against repeated-run labels.
matu eval --config configs/default.yaml \
  --uncertainty outputs/uncertainty.pkl \
  --labels path/to/accuracy_dict.pkl

Optional baseline:

matu eigv \
  --logs path/to/conversation_logs.json \
  --mode final \
  --out outputs/eigv_final.pkl

Pipeline Stages

1. Log collection
    Command: examples/generate_logs_hf_qwen.py or your own agent framework
    Collects repeated multi-agent conversation trajectories.
    Output: conversation log JSON

2. Embedding
    Command: matu embed
    Converts each role's turns into trajectory matrices.
    Output: <role>_embedding_matrices.pkl

3. CP-2 / MATU
    Command: matu cp2
    Runs rank-wise tensor decomposition over repeated trajectories.
    Output: structured MATU scores and an optional legacy fit_dict

4. Uncertainty conversion
    Command: matu fit
    Converts rank-wise fit curves into scalar uncertainty.
    Output: uncertainty.pkl

5. Evaluation
    Command: matu eval
    Computes AUROC and AUARC from repeated-run labels.
    Output: console metrics

Baseline
    Command: matu eigv
    Computes the EigV agreement baseline from logs.
    Output: baseline score pickle

Key uncertainty definition for legacy fit_dict files, where fit_R denotes the CP-2 reconstruction fit at rank R:

uncertainty = sum_R (1 - fit_R)
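
A minimal sketch of that conversion, assuming a legacy fit_dict that maps each task id to its per-rank fit values (matu fit is the supported path):

import pickle

with open("quick_start/generated/results/fit_dict_generated.pkl", "rb") as f:
    fit_dict = pickle.load(f)

# Sum (1 - fit_R) over the evaluated ranks, per task (assumed dict layout).
uncertainty = {task: sum(1.0 - fit for fit in fits) for task, fits in fit_dict.items()}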

Output Directory Structure

outputs/
|-- conversation_logs.json
|-- embeddings/
|   |-- user_embedding_matrices.pkl
|   `-- assistant_embedding_matrices.pkl
|-- matu_scores.pkl
|-- fit_dict.pkl
|-- uncertainty.pkl
`-- eigv_final.pkl

quick_start/generated/
|-- embeddings/
|-- reference_embeddings/
`-- results/

Generated outputs, extracted embeddings, model caches, and build artifacts are ignored by git.

Configuration

All default paths and hyperparameters are documented in configs/default.yaml. CLI flags override config file values.

embedding.model (default: Qwen/Qwen3-Embedding-0.6B)
    Sentence-transformer embedding model.
embedding.roles (default: [user, assistant])
    Roles to extract from each conversation turn.
cp2.min_rank (default: 1)
    Minimum CP-2 rank.
cp2.max_rank (default: 50)
    Maximum CP-2 rank. Reduce for smoke tests.
cp2.max_iter (default: 25)
    ALS iterations per rank.
cp2.seed (default: 0)
    Factor initialization seed.
cp2.combine_mode (default: interleave)
    How role/run matrices are assembled.
evaluation.error_rule (default: any_incorrect)
    Repeated-run error event for AUROC.
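
Since PyYAML is a core dependency, the config can also be adjusted programmatically. The nesting below is inferred from the dotted parameter names above; configs/default.yaml is authoritative:

import yaml

with open("configs/default.yaml") as f:
    config = yaml.safe_load(f)

# Lower the maximum CP-2 rank for a smoke test.
config["cp2"]["max_rank"] = 5
print(config["embedding"]["model"], config["cp2"]["max_rank"])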

Data Sources

The quick start is self-contained and does not require benchmark downloads. Full paper-scale experiments use public datasets including MATH, MMLU, MoreHopQA, and HumanEval/EvalPlus. Source links and download snippets are in data/README.md.
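
For orientation only, a generic sketch with the datasets library; the Hub id below is a commonly used one for MMLU, not taken from this repository, and data/README.md has the project's own snippets:

from datasets import load_dataset

# "cais/mmlu" is a widely used Hugging Face Hub id for MMLU (an assumption
# about which mirror to use; defer to data/README.md).
mmlu = load_dataset("cais/mmlu", "all", split="test")
print(len(mmlu), list(mmlu[0].keys()))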

Hardware Requirements

Direct quick-start evaluation: CPU only; seconds.
Unzip reference embeddings: CPU only; seconds.
Re-embedding sample logs: GPU recommended, CPU possible; minutes, depending on hardware.
CP-2 on quick-start embeddings: CPU is sufficient; minutes, longer at the default rank 50.
Optional CAMEL/GPT log generation: CPU plus an OpenAI API key; API-dependent.
Full paper-scale runs: GPU recommended; dataset- and agent-framework-dependent.

For a smoke test, reduce MATU_MAX_RANK in quick_start/.env or cp2.max_rank in configs/default.yaml.

Tests And Makefile

Common checks are collected in the Makefile:

make install-dev
make test
make quick-eval
make paper-eval
make check
make clean

make test runs the unit tests in tests/. make check runs tests, compileall, CLI help, MATH quick evaluation, and the MMLU paper-matching evaluation. make clean removes Python caches, build outputs, and :Zone.Identifier sidecar files.

Docker

Build a CPU-ready image:

docker build -t matu-uq .

Run the quick-start and paper-matching evaluations in the container:

docker run --rm matu-uq make quick-eval
docker run --rm matu-uq make paper-eval

For local outputs:

docker run --rm -v "$(pwd)/outputs:/app/outputs" matu-uq matu --help

PyPI Release

The package metadata is ready under the name matu-uq. The GitHub Actions workflow in .github/workflows/publish.yml uses PyPI Trusted Publishing. Before the first release, configure PyPI with:

project name: matu-uq
repository: tiejin98/MATU
workflow: publish.yml
environment: pypi

Then create a GitHub release. After the first successful PyPI upload, add the PyPI badge:

[![PyPI](https://img.shields.io/pypi/v/matu-uq.svg)](https://pypi.org/project/matu-uq/)

Project Structure

MATU/
|-- matu/                  # Main package and CLI
|-- baselines/             # EigV baseline
|-- configs/               # YAML configuration
|-- data/                  # Log and artifact format docs
|-- docs/                  # Architecture notes
|-- examples/              # Optional log collectors
|-- quick_start/           # Reproducible sample artifacts and scripts
|-- tests/                 # Lightweight unit tests
|-- requirements.txt
|-- pyproject.toml
|-- Makefile
|-- Dockerfile
`-- .env.example

Citation

If you find this work useful, please cite:

@inproceedings{chen2026every,
  title = {Every Response Counts: Quantifying Uncertainty of LLM-based Multi-Agent Systems through Tensor Decomposition},
  author = {Chen, Tiejin and Yao, Huaiyuan and Chen, Jia and Papalexakis, Evangelos E. and Wei, Hua},
  booktitle = {Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics},
  year = {2026}
}

License

MIT License. See LICENSE for details.
