MAC: Multi-Agent Constitution Learning — a general-purpose rule learning engine
The explainable, auditable prompt optimizer
Rules you can read. Gains you can trust. No fine-tuning.
Healthcare PII +123% · Legal PII +17% · HoVer +63% · GSM8K → 100%
Overview
Constitutional AI steers LLM behavior through natural-language rules, but writing those rules by hand is hard, and getting them right is harder. Existing prompt optimizers try to automate this but fall short: they need many labeled examples, produce opaque prompt edits, and hit diminishing returns as prompts grow.
MAC fixes this. It optimizes over structured sets of rules using a network of specialized agents that propose, edit, and validate rule updates against a held-out set. The result is a human-readable constitution with no fine-tuning and no gradient updates.
Here are real rules MAC learned for PII tagging across three regulated domains:
Legal: "Mark as private specific dates when they appear in the context of personal events or actions, such as births, deaths, or significant life events. Do not mark general references or narrative text." Examples: mark "1975", "22 August 2003"; do not mark "on a day in June".
Healthcare: "Mark terms such as heart failure subtypes (e.g., diastolic heart failure, systolic heart failure) when explicitly mentioned in a patient's medical history as private. Do not mark generic medical conditions without an explicit subtype."
Finance: "Mark as private any phrase indicating a specific financial timeframe (e.g., FY2022, YTD FY2021) when it appears in direct association with identifiable information. Do not mark standalone labels without specific identifiers."
Why MAC over other prompt optimizers?
- Explainable: every rule is natural language you can read, audit, and hand-edit. No black-box token shuffling.
- Structured: optimizes over a set of rules, not a monolithic prompt blob. Rules are added, edited, or removed independently, so the constitution stays clean as it grows.
- Auditable: each proposed rule is validated against a held-out batch before acceptance. You see exactly what changed and why.
- Transferable: a constitution learned on one model works on another without retraining.
- Sample-efficient: converges with far fewer labeled examples than GEPA or MIPRO.
Full documentation, prompting modes, and model configuration are at mac-prompt.com.
How It Works
Four agents coordinate in a closed loop each epoch:
- Annotator runs the current constitution on a training batch and scores each example
- Decision Agent looks at the errors and decides whether to add a new rule or edit an existing one
- Rule Proposer drafts a candidate rule targeting the observed error pattern
- Rule Editor refines existing rules to remove contradictions or sharpen specificity
Every proposed change is tested on a held-out validation batch. If the score improves, the rule is accepted and the constitution advances to the next version; if not, the change is discarded. The held-out score therefore never regresses as the constitution grows.
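The accept-if-improves loop above can be sketched in a few lines. This is a toy simulation with stub functions standing in for the LLM-backed agents; every name here is hypothetical, not MAC's internal API:

```python
# Minimal sketch of MAC's accept-if-improves loop. The real Annotator,
# Rule Proposer, and scorer are LLM-backed; here they are stubbed with
# plain functions so the control flow is visible.

def evaluate(constitution, batch):
    """Stub scorer: fraction of examples covered by at least one rule."""
    hits = sum(any(ex in rule for rule in constitution) for ex in batch)
    return hits / len(batch)

def propose_rule(errors):
    """Stub Rule Proposer: draft a rule targeting the first uncovered error."""
    return f"Mark mentions of {errors[0]} as private."

holdout = ["dates", "diagnoses"]
constitution = []
for epoch in range(4):
    errors = [ex for ex in holdout if not any(ex in r for r in constitution)]
    if not errors:
        break
    candidate = constitution + [propose_rule(errors)]
    # Accept only if the held-out score improves; otherwise discard.
    if evaluate(candidate, holdout) > evaluate(constitution, holdout):
        constitution = candidate

print(constitution)  # two accepted rules, one per uncovered error pattern
```

Because a candidate is kept only when it beats the current constitution on the held-out batch, the held-out score is monotonically non-decreasing by construction.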
The meta-model and task adaptation
MAC works on any task because of a fifth component: a meta-model that runs once before training starts. It reads your task description, inspects real data samples, and rewrites the four agent prompts to be domain-specific, replacing generic placeholders with actual instructions about your task format, output schema, and evaluation criteria. This is a structural rewrite, not a word swap. After adaptation, all four agents speak the language of your task.
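The adaptation step can be pictured as template filling. This is a loose sketch only: the real meta-model performs a full LLM rewrite of each agent prompt, and the placeholder names and fields below are hypothetical illustrations:

```python
# Loose sketch of task adaptation: generic placeholders in an agent prompt
# are replaced with task-specific instructions. In MAC this is a structural
# LLM rewrite, not literal string substitution as shown here.

GENERIC_ANNOTATOR_PROMPT = (
    "You are an annotator for {TASK}. "
    "Apply the constitution and return output as {OUTPUT_SCHEMA}. "
    "You will be scored by {METRIC_DESCRIPTION}."
)

# Hypothetical fields the meta-model might infer from the task
# description and real data samples.
task_spec = {
    "TASK": "AIME competition math problems",
    "OUTPUT_SCHEMA": "a single integer answer",
    "METRIC_DESCRIPTION": "exact match against the gold integer",
}

adapted = GENERIC_ANNOTATOR_PROMPT
for key, value in task_spec.items():
    adapted = adapted.replace("{" + key + "}", value)

print(adapted)  # no generic placeholders remain
```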
Any Task, Any Model
MAC is not tied to a single domain or model family. Give it a task_description, a few labeled examples, and a scoring function, and it learns a constitution for any task you can evaluate: classification, extraction, math, QA, tool calling, and more.
Under the hood, MAC uses a three-tier model setup:
- Tier 1 (Worker): annotates examples using the current constitution. Can be a cheap local model (Qwen3-8B on vLLM) or a cloud API model (gpt-4o-mini). Runs on every example in every batch, so cost matters here.
- Tier 2 (MAC agents): the four-agent network that proposes, edits, and validates rules. Should be a strong model (gpt-4o or equivalent). Runs far less frequently than the worker.
- Tier 3 (Adapt model): rewrites agent prompts before training starts. Defaults to the same model as Tier 2 if not set separately.
compiler = MAC(
model="Qwen/Qwen3-8B", # Tier 1: worker (cheap / local)
base_url="http://localhost:8000/v1", # vLLM server at any port
mac_model="gpt-4o", # Tier 2: MAC agents (strong)
# adapt_model="gpt-4o", # Tier 3: defaults to mac_model
task_description="Solve AIME competition math problems.",
rule_type="math reasoning rules",
)
See Model Configuration for the full fallback cascade, vLLM port configuration, and provider examples.
Try it now
Step 1. Install and set your API key
pip install mac-prompt
export OPENAI_API_KEY=sk-...
Step 2. Import MAC
Example wraps a single labeled input/output pair. MAC is the optimizer. CompiledMAC is what you get back — a callable model you can save and reload.
from mac import Example, MAC, CompiledMAC
Step 3. Define your task
Tell MAC what you're solving and what kind of rules to learn.
optimizer = MAC(
model="gpt-4o-mini",
task_description="Solve AIME math problems. Return only the integer answer.",
rule_type="math reasoning rules",
)
Step 4. Provide your data
A small labeled set is enough. MAC uses trainset to learn rules and holdout to validate each candidate rule before accepting it.
train = [
Example(input="Find all integer bases b>9 where 17_b divides 97_b.", output="70"),
Example(input="How many ordered pairs (x,y) in [-100,100] satisfy 12x²-xy-6y²=0?", output="117"),
]
holdout = [
Example(input="Sum of positive integers n where n+2 divides 3(n+3)(n²+9)?", output="49"),
]
Step 5. Define a metric
Any function that returns a score between 0 and 1 works.
def metric(pred, gold):
    try:
        return 1.0 if float(str(pred).strip()) == float(str(gold).strip()) else 0.0
    except ValueError:
        return 0.0
Step 6. Run and inspect
compile() runs the rule-learning loop and returns a ready-to-call model.
optimized = optimizer.compile(trainset=train, holdout=holdout, metric=metric)
answer = optimized("Find all integer bases b>9 where 17_b divides 97_b.")
optimized.overview() # shows the learned rules and baseline vs final score
optimized.save("rules.json") # save and reload with CompiledMAC.load("rules.json")
MAC vs Prompt Optimizers
Domain-specific PII tagging across Legal, Finance, and Healthcare documents using Qwen2.5-Instruct models at 3B, 7B, and 14B scales.
| Dataset | Method | 3B | 7B | 14B |
|---|---|---|---|---|
| Legal | GEPA | 12.7 | 52.1 | 50.1 |
| Legal | MIPRO | 13.2 | 38.6 | 44.3 |
| Legal | MAC | 36.0 | 55.1 | 67.3 |
| Finance | GEPA | 11.9 | 22.5 | 28.8 |
| Finance | MIPRO | 9.8 | 22.3 | 26.8 |
| Finance | MAC | 30.1 | 37.5 | 45.5 |
| Healthcare | GEPA | 16.5 | 12.9 | 16.8 |
| Healthcare | MIPRO | 12.5 | 16.8 | 20.6 |
| Healthcare | MAC | 9.7 | 20.1 | 26.7 |
MAC wins 8 of 9 configurations. Largest gain: Legal 3B, +174% over the next best method.
MAC vs Pretrained Taggers (14B)
| Domain | MAC | Best Baseline | Gain |
|---|---|---|---|
| Legal | 67.3 | 57.3 (Presidio) | +17% |
| Finance | 45.5 | 44.7 (Presidio) | +2% |
| Healthcare | 26.7 | 12.0 (GLiNER) | +123% |
Training Dynamics
Validation F1 over training batches on 3B models (ECHR dataset). MAC steadily improves while baselines plateau or fluctuate.
Results: General-Purpose Benchmarks
MAC is task-agnostic. Below are results across three standard benchmarks ordered by largest gain: fact verification (HoVer), multi-hop QA (HotpotQA), and grade-school math (GSM8K).
Reading the tables
Each row is one (worker, MAC agents, style) configuration:
- Worker: the model being optimized; annotates every example in every batch
- MAC Agents: the decision, rule-proposer, and rule-editor agents that learn the constitution
- Meta-Model: adapts all four agent prompts to your task before training starts; identical to MAC Agents in all runs below
- Style: adapt means MAC builds the initial prompt from scratch via the meta-model; custom means the user supplies their own prompt containing a {{CONSTITUTION_BLOCK}} placeholder for the learned rules
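A Custom Prompt mode template might look like the following. This is a sketch: the exact substitution mechanics are an assumption based on the placeholder's name, and the prompt text is invented for illustration:

```python
# Sketch of Custom Prompt mode: the user-supplied prompt carries a
# {{CONSTITUTION_BLOCK}} placeholder, which we assume is filled with
# the currently learned rules each time the worker is called.

CUSTOM_PROMPT = """You verify multi-hop factual claims against evidence.
Follow these learned rules:
{{CONSTITUTION_BLOCK}}
Answer strictly "yes" or "no"."""

rules = [
    "Check every hop of the claim against the evidence before answering.",
    "Answer no if any single hop is unsupported.",
]
rendered = CUSTOM_PROMPT.replace(
    "{{CONSTITUTION_BLOCK}}", "\n".join(f"- {r}" for r in rules)
)

print(rendered)
```

Keeping the rules in one delimited block is what lets the optimizer swap in each new constitution version without touching the rest of the user's prompt.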
HoVer: Fact Verification
HoVer asks the model to verify multi-hop factual claims against Wikipedia evidence (a binary yes/no task). The biggest gain (+63%) comes from the gpt-4o-mini worker paired with gpt-5.2 MAC agents in Auto-Adapt mode. Even with a fully local Qwen3-8B worker paired with gpt-5.2 MAC agents, MAC adds +26% in Custom Prompt mode.
| Worker | MAC Agents | Meta-Model | Style | Baseline | Best | Delta |
|---|---|---|---|---|---|---|
| gpt-4o-mini | gpt-5.2 | gpt-5.2 | adapt | 25% | 88% | +63% |
| gpt-4o-mini | gpt-5.2 | gpt-5.2 | custom | 88% | 88% | 0% |
| Qwen3-8B | gpt-5.2 | gpt-5.2 | adapt | 69% | 81% | +12% |
| Qwen3-8B | gpt-5.2 | gpt-5.2 | custom | 62% | 88% | +26% |
| Qwen3-8B | Qwen3-8B | Qwen3-8B | adapt | 75% | 75% | 0% |
| Qwen3-8B | Qwen3-8B | Qwen3-8B | custom | 75% | 81% | +6% |
HotpotQA: Multi-Hop QA
HotpotQA requires chaining facts across two Wikipedia documents to answer a question. Baselines sit between 22% and 29%. MAC improves across every configuration except one, with the best gain of +14% when Qwen3-8B handles all three roles. The fully-local setup achieves the largest gain here because the local model has a lower baseline and more structured error patterns for the rule-proposer to target.
| Worker | MAC Agents | Meta-Model | Style | Baseline | Best | Delta |
|---|---|---|---|---|---|---|
| Qwen3-8B | Qwen3-8B | Qwen3-8B | adapt | 27% | 27% | 0% |
| Qwen3-8B | Qwen3-8B | Qwen3-8B | custom | 22% | 36% | +14% |
| Qwen3-8B | gpt-5.2 | gpt-5.2 | adapt | 29% | 38% | +9% |
| Qwen3-8B | gpt-5.2 | gpt-5.2 | custom | 29% | 36% | +7% |
| gpt-4o-mini | gpt-5.2 | gpt-5.2 | adapt | 25% | 34% | +9% |
| gpt-4o-mini | gpt-5.2 | gpt-5.2 | custom | 26% | 32% | +6% |
GSM8K: Math Reasoning
GSM8K tests multi-step arithmetic word problems. The Qwen3-8B baseline already scores 94% zero-shot, a high ceiling. Despite this, MAC pushes it to 100% in both modes when paired with gpt-5.2 MAC agents. The fully-local Qwen3-8B setup also reaches 100% in Custom Prompt mode.
| Worker | MAC Agents | Meta-Model | Style | Baseline | Best | Delta |
|---|---|---|---|---|---|---|
| Qwen3-8B | gpt-5.2 | gpt-5.2 | adapt | 94% | 100% | +6% |
| Qwen3-8B | gpt-5.2 | gpt-5.2 | custom | 94% | 100% | +6% |
| Qwen3-8B | Qwen3-8B | Qwen3-8B | adapt | 100% | 100% | 0% |
| Qwen3-8B | Qwen3-8B | Qwen3-8B | custom | 94% | 100% | +6% |
| gpt-4o-mini | gpt-5.2 | gpt-5.2 | adapt | 100% | 100% | 0% |
| gpt-4o-mini | gpt-5.2 | gpt-5.2 | custom | 94% | 94% | 0% |
Documentation
| Guide | Description |
|---|---|
| Usage Modes | Custom prompt (Mode 1) vs auto-adapt (Mode 2) |
| Model Configuration | Three-tier setup, fallback cascade, provider support |
| API Reference | Constructor kwargs, compile(), CompiledMAC |
| MAC Variants | reMAC, MAC+, Tool-MAC: advanced paper results |
| Examples | Self-contained benchmark scripts (GSM8K, HotpotQA, HoVer) |
Citation
@article{thareja2025mac,
title={MAC: Multi-Agent Constitution Learning for Generalizable Text Annotation},
author={Thareja, Rushil},
year={2025}
}
License
MAC is released under a dual license:
- Non-commercial use (academic research, education, personal projects): free
- Commercial use: requires a separate license. Contact
rushil.thareja@mbzuai.ac.ae
See LICENSE for full terms.