Runtime classifier for screening AI agent actions as safe, harmful, or unethical.

Workflow Diagram

⚠️ When AI agents are given a harmful tool and an instruction to use it, they typically just use it. Popular, high-performing recent LLMs are no exception.

🤖 AI is often perceived as a threat. As agent usage grows, so do the use of harmful tools and the harmful use of ordinary tools, as demonstrated with HarmActionsBench. Classifying AI agent actions before they run improves safety and reliability. Action Guard uses a neural network model trained on the HarmActions dataset to classify actions proposed by autonomous AI agents as harmful or safe. The model was trained on a small dataset of labeled examples. The work aims to enhance the safety and reliability of AI agents by preventing them from executing actions that are potentially harmful or unethical, or that violate predefined guidelines. ✅ Action Guard makes safe AI agents possible.

🎬 Demo

Demo GIF

[!TIP] Please star ⭐ the repository if you find Action Guard useful!

🚨 Common causes of harmful actions by AI agents:

  • 🔓 User attempting to jailbreak the model.
  • 🌀 Model hallucinating or misunderstanding the context.
  • 💭 Model being overconfident in its incorrect knowledge.
  • 🚧 Lack of proper constraints or guidelines for the agent.
  • 📉 Inadequate training data for specific scenarios.
  • 🛠️ Tools with incorrect descriptions that mislead the agent.
  • 🎭 Harmful tool descriptions that include manipulative text to mislead the model.
  • 😬 In the experiments, models performed a harmful action and still responded "Sorry, I can't help with that."

🆕 New contributions of the Agent-Action-Guard framework:

  1. 📊 HarmActions, a structured dataset of safety-labeled agent actions complemented with manipulated prompts that trigger harmful or unethical actions.
  2. 📏 HarmActionsBench, a benchmark built around a new metric, "SafeActions@k."
  3. 🧠 Action Guard, a neural classifier trained on the HarmActions dataset, designed to label proposed agent actions as potentially harmful or safe, and optimized for real-time deployment in agent loops.
  4. 🔌 MCP integration supporting live action screening using existing MCP servers and clients.

📊 HarmActionsBench Results

⚡ Popular, recent LLMs generate harmful actions, which demonstrates the need for Action Guard and the HarmActionsBench benchmark.

| Model | SafeActions@1 score |
| --- | --- |
| Phi 4 Mini Instruct | 0.00% |
| Granite 4-H-Tiny | 0.00% |
| *Claude Haiku 4.5 | 0.00% |
| *Gemini 3.1 Flash Lite | 1.33% |
| Ministral 3 (3B) | 2.67% |
| *Claude Sonnet 4.6 | 4.00% |
| Phi 4 Mini Reasoning | 5.33% |
| *GPT-5.3 | 17.33% |
| Average | 3.83% |

*popular proprietary models.

📌 Note: A higher SafeActions@k score is better.
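
As a quick arithmetic check, the reported average can be reproduced from the per-model scores in the results table. The snippet below is only a sketch that copies those eight values by hand; it does not run the benchmark or use the package.

```python
# SafeActions@1 scores (in percent), copied by hand from the results table.
scores = {
    "Phi 4 Mini Instruct": 0.00,
    "Granite 4-H-Tiny": 0.00,
    "Claude Haiku 4.5": 0.00,
    "Gemini 3.1 Flash Lite": 1.33,
    "Ministral 3 (3B)": 2.67,
    "Claude Sonnet 4.6": 4.00,
    "Phi 4 Mini Reasoning": 5.33,
    "GPT-5.3": 17.33,
}

# Unweighted mean across the eight models.
average = sum(scores.values()) / len(scores)
print(f"Average SafeActions@1: {average:.2f}%")
```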

✨ Special features:

  • This project introduces the HarmActions dataset and the HarmActionsBench benchmark to evaluate how likely an AI agent is to generate harmful actions.
  • The dataset has been used to train a lightweight neural network model that classifies actions as safe, harmful, or unethical.
  • ⚡ The model is lightweight and can be easily integrated into existing AI agent frameworks.

🛡️ Safety Features:

  • 🔍 Automatically classifies tool calls before execution.
  • 🚫 Blocks harmful actions based on the outputs of the trained model.
  • 📋 Provides detailed classification results.
  • ✅ Allows safe actions to proceed normally.
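
The screening flow listed above (classify the tool call, block it if it is not safe, otherwise let it run) can be illustrated with a minimal sketch. Everything named here is hypothetical and is not the agent-action-guard API: `Verdict`, `ActionBlocked`, `guarded_call`, and the keyword-based `keyword_classifier` are stand-ins, and the real project uses a trained neural classifier. See USAGE.md for the actual interface.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class Verdict:
    """Hypothetical classification result for one proposed tool call."""
    label: str    # "safe", "harmful", or "unethical"
    score: float  # confidence in [0, 1]


class ActionBlocked(Exception):
    """Raised when a proposed tool call is not classified as safe."""


def guarded_call(
    tool: Callable[..., Any],
    arguments: Dict[str, Any],
    classify: Callable[[str, Dict[str, Any]], Verdict],
) -> Any:
    """Classify the tool call first; execute it only if the label is 'safe'."""
    verdict = classify(tool.__name__, arguments)
    if verdict.label != "safe":
        raise ActionBlocked(
            f"{tool.__name__} blocked: {verdict.label} ({verdict.score:.2f})"
        )
    return tool(**arguments)


def keyword_classifier(name: str, arguments: Dict[str, Any]) -> Verdict:
    """Toy stand-in for the trained model (an assumption, keyword match only)."""
    text = f"{name} {arguments}".lower()
    if "delete" in text:
        return Verdict("harmful", 0.90)
    return Verdict("safe", 0.90)


def read_file(path: str) -> str:
    return f"contents of {path}"


def delete_files(path: str) -> str:
    return f"deleted {path}"


# A safe action proceeds normally.
print(guarded_call(read_file, {"path": "notes.txt"}, keyword_classifier))

# A harmful action is blocked before the tool ever runs.
try:
    guarded_call(delete_files, {"path": "/tmp/data"}, keyword_classifier)
except ActionBlocked as err:
    print(err)
```

Passing the classifier in as a parameter keeps the wrapper independent of any particular model, so the toy classifier could be swapped for a real one without changing the call sites.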

💬 Feedback

❤️ Love Action Guard? Please share a quick note at https://github.com/Pro-GenAI/Agent-Action-Guard/discussions/15. Your feedback helps shape the project and its impact on the AI field. 🙌 Feedback and discussion on how this helps you or the AI community are very welcome.

🚀 Usage

⚡ Quick install:

Using uv:

uv venv
source .venv/bin/activate
uv pip install agent-action-guard

📖 For usage instructions, see https://github.com/Pro-GenAI/Agent-Action-Guard/blob/main/USAGE.md.

🔑 Note: The embedding client accepts an API key via the EMBEDDING_API_KEY environment variable (falls back to OPENAI_API_KEY if unset). See .env.example and USAGE.md for examples.
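
The fallback order described in the note can be sketched as follows. This is an illustration only, not the package's own code: the helper name `resolve_embedding_api_key` is hypothetical, and treating an empty string the same as an unset variable is an assumption.

```python
import os
from typing import Mapping, Optional


def resolve_embedding_api_key(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return EMBEDDING_API_KEY if set, else OPENAI_API_KEY, else None.

    Assumption: an empty value counts as unset, so `or` falls through to
    the next variable.
    """
    return env.get("EMBEDDING_API_KEY") or env.get("OPENAI_API_KEY")


# Example with explicit mappings, so the real environment is not touched.
print(resolve_embedding_api_key({"EMBEDDING_API_KEY": "emb-key", "OPENAI_API_KEY": "oai-key"}))
print(resolve_embedding_api_key({"OPENAI_API_KEY": "oai-key"}))
```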

📦 Install with HarmActionsBench:

pip install "agent-action-guard[harmactionsbench]"
python -m agent_action_guard.harmactionsbench

Note: Running HarmActionsBench requires an OpenAI API key to be set as an environment variable.

🏷️ License

This project is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0). You are free to share and adapt the material for any purpose, even commercially, as long as you provide appropriate credit to the original author(s) and indicate if changes were made.

If you prefer not to provide attribution, please send an acknowledgment email to praneeth.vad@gmail.com describing how you use the work and the impact it has on your project or research.

📝 Citation

If you find this repository useful in your research, please consider citing:

@article{202510.1415,
	title = {Agent Action Guard: Classifying AI Agent Actions to Ensure Safety and Reliability},
	year = 2025,
	month = {October},
	publisher = {Preprints},
	author = {Praneeth Vadlapati},
	doi = {10.20944/preprints202510.1415.v1},
	url = {https://doi.org/10.20944/preprints202510.1415.v1},
	journal = {Preprints}
}
