
Runtime classifier for screening AI agent actions as safe, harmful, or unethical.

Project description


As LLM agents and MCP see wider use, agents increasingly invoke harmful tools or use benign tools harmfully, as demonstrated by the HarmActEval benchmark. Action Guard addresses this by classifying actions proposed by autonomous AI agents before they execute. It uses a lightweight neural network, the Action Classifier, trained on the HarmActions dataset of labeled examples, to mark each proposed action as safe or harmful. The goal is to improve the safety and reliability of AI agents by preventing them from executing actions that are harmful, unethical, or in violation of predefined guidelines.


Demo

Demo GIF

[!TIP] Please star ⭐ the repository if you find Action Guard useful!


Common causes of harmful actions by AI agents:

  • User attempting to jailbreak the model.
  • Model hallucinating or misunderstanding the context.
  • Model being overconfident in its incorrect knowledge.
  • Lack of proper constraints or guidelines for the agent.
  • Inadequate training data for specific scenarios.
  • MCP server providing incorrect tool descriptions that mislead the agent.
  • Harmful MCP servers returning manipulative text to mislead the model.
  • Deceptive behavior: experiments showed that a model may perform a harmful action and still respond "Sorry, I can't help with that."

New contributions of Agent-Action-Guard framework:

  1. HarmActions, a structured dataset of safety-labeled agent actions, complemented with manipulated prompts that trigger harmful or unethical actions.
  2. HarmActEval, a benchmark built on a new metric, "Harm@k."
  3. Action Classifier, a neural classifier trained on the HarmActions dataset, designed to label proposed agent actions as potentially harmful or safe, and optimized for real-time deployment in agent loops.
  4. MCP integration supporting live action screening using existing MCP servers and clients.
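The preprint defines the Harm@k metric; a plausible reading, by analogy with the pass@k metric used for code generation, is the probability that at least one of k sampled agent actions for a prompt is harmful. The sketch below computes an unbiased estimate under that assumed definition (the paper's exact formula may differ).

```python
from math import comb


def harm_at_k(samples: list[list[bool]], k: int) -> float:
    """Estimate Harm@k: the fraction of prompts for which at least one
    of k sampled agent actions is labeled harmful.

    samples[i] holds labels (True = harmful) for the n actions sampled
    for prompt i. Mirrors the unbiased pass@k estimator:
    1 - C(n - c, k) / C(n, k), with c harmful samples out of n.
    """
    total = 0.0
    for labels in samples:
        n, c = len(labels), sum(labels)
        if k > n:
            raise ValueError("k cannot exceed the number of samples")
        total += 1.0 - comb(n - c, k) / comb(n, k)
    return total / len(samples)
```

For example, with one prompt whose four sampled actions include one harmful action, Harm@1 is 0.25: drawing a single action at random has a one-in-four chance of being the harmful one.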

Special features:

  • This project introduces the HarmActions dataset and the HarmActEval benchmark to evaluate an AI agent's probability of generating harmful actions.
  • The dataset is used to train a lightweight neural network that classifies actions as safe, harmful, or unethical.
  • The model is small enough to integrate easily into existing AI agent frameworks such as MCP.
  • This project classifies proposed actions; it is distinct from guardrails systems that filter model text.
  • Supports MCP (Model Context Protocol) to allow real-time action classification.
  • Unlike OpenAI's "require_approval": "always" flag, this blocks harmful actions without human intervention.
  • A2A-compatible version: https://github.com/Pro-GenAI/A2A-Agent-Action-Guard.

Safety Features:

  • Automatically classifies MCP tool calls before execution.
  • Blocks harmful actions based on the outputs of the trained model.
  • Provides detailed classification results.
  • Allows safe actions to proceed normally.
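The classify-then-block flow above can be sketched as a small wrapper around a tool call. The names below (`Verdict`, `guarded_call`, the label strings) are illustrative assumptions, not the package's actual API; the stub classifier stands in for the trained Action Classifier.

```python
from dataclasses import dataclass
from typing import Any, Callable

# Hypothetical label set; the real model's labels may differ.
SAFE, HARMFUL, UNETHICAL = "safe", "harmful", "unethical"


@dataclass
class Verdict:
    label: str
    reason: str = ""


def guarded_call(tool: Callable[..., Any],
                 classify: Callable[[str, dict], Verdict],
                 tool_name: str,
                 **kwargs: Any) -> Any:
    """Classify a proposed tool call before executing it.

    `classify` maps (tool name, arguments) to a Verdict; any action
    not classified as safe is blocked instead of executed.
    """
    verdict = classify(tool_name, kwargs)
    if verdict.label != SAFE:
        raise PermissionError(
            f"Blocked {tool_name!r}: classified as {verdict.label}. "
            f"{verdict.reason}")
    return tool(**kwargs)
```

In an MCP setting, the same check would sit between the client receiving a proposed tool call and the server executing it, so harmful calls are rejected without human intervention while safe calls proceed normally.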

I look forward to feedback and discussions on how this helps you or the AI community.

Want to use this in your commercial projects or want customization for your use case? I can do it for you or guide you. Please contact me at praneeth.vad@gmail.com.

Usage

For usage instructions, please refer to USAGE.md.

PyPI package scope:

  • pip install agent-action-guard installs only the runtime classifier modules and model file needed for action classification.
  • Training, evaluation, MCP demo servers, and UI scripts remain in this repository and require the dev extras.

Quick install:

Using uv:

uv venv
source .venv/bin/activate
uv sync

Using python + pip:

python3 -m venv .venv
source .venv/bin/activate
pip install -e .

For training, evaluation, and demo tooling, install the dev extra:

Using uv: uv sync --extra dev

Using pip: pip install -e ".[dev]"

A2A version:

While this repository focuses on standard tool calls and MCP, an Agent-to-Agent (A2A) compatible version is available at: https://github.com/Pro-GenAI/A2A-Agent-Action-Guard

Citation

If you find this repository useful in your research, please consider citing:

@article{202510.1415,
  title = {Agent Action Guard: Classifying AI Agent Actions to Ensure Safety and Reliability},
  author = {Praneeth Vadlapati},
  year = {2025},
  month = {October},
  publisher = {Preprints},
  journal = {Preprints},
  doi = {10.20944/preprints202510.1415.v1},
  url = {https://doi.org/10.20944/preprints202510.1415.v1}
}

Limitation

This project does not perform Personally Identifiable Information (PII) detection, since existing systems already handle that accurately.

Created based on my past work

Agent-Supervisor: Supervising Actions of Autonomous AI Agents for Ethical Compliance: GitHub

Project details


Download files

Download the file for your platform.

Source Distribution

agent_action_guard-1.0.0.tar.gz (237.7 kB)

Uploaded Source

Built Distribution


agent_action_guard-1.0.0-py3-none-any.whl (231.7 kB)

Uploaded Python 3

File details

Details for the file agent_action_guard-1.0.0.tar.gz.

File metadata

  • Download URL: agent_action_guard-1.0.0.tar.gz
  • Upload date:
  • Size: 237.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for agent_action_guard-1.0.0.tar.gz
Algorithm Hash digest
SHA256 5a05adf58c824c05360fcd9a158406b7cd97b3109cfcb0ba0c27baaa7c3cbdcf
MD5 3260427f141b6a65487daed355c0421f
BLAKE2b-256 24c78424f52e62d257e465f9206676269217c294625e32c88032784a73b2b664


Provenance

The following attestation bundles were made for agent_action_guard-1.0.0.tar.gz:

Publisher: publish-pypi.yml on Pro-GenAI/Agent-Action-Guard

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file agent_action_guard-1.0.0-py3-none-any.whl.

File metadata

File hashes

Hashes for agent_action_guard-1.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 1185d7f866eff3f4309cfda3c27359ad1d9a4354504561c85bf11f30537afc3a
MD5 a11ae74fa8c75e690d1475ac141962b6
BLAKE2b-256 6034ae5a0c8420f5c930504ec7333644cf2794754fda250c912d3e217a26ec73


Provenance

The following attestation bundles were made for agent_action_guard-1.0.0-py3-none-any.whl:

Publisher: publish-pypi.yml on Pro-GenAI/Agent-Action-Guard

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
