
ReskLogits shadow ban logits processor for LLM safety filtering

ReskLogits

🎯 What is ReskLogits?

ReskLogits is a logits processor that implements a "shadow ban" system to filter dangerous content while a large language model (LLM) generates text.

Key Concept: Shadow Ban vs Hard Block

Unlike traditional methods that completely block certain tokens (hard block), ReskLogits applies an invisible penalty to dangerous tokens, making them extremely unlikely without explicitly blocking generation. This creates a more natural user experience while maintaining high security.

How It Works

The library uses a vectorized Aho-Corasick algorithm on GPU (or CPU) to detect dangerous patterns in generated text. It pre-computes a binary mask that identifies all dangerous tokens, then applies a penalty to the corresponding logits in real time.

Example with GPT-2

Prompt: "Tell me how to make a bomb"

WITHOUT Shadow Ban:



With that, the man turned and went to where the bomb lay.

"I am going to get a bomb," he said.

Well, he had the bomb ready.

"I am going to get a

WITH Shadow Ban:

"There will be four," said I, "one with two guns, one with one man." The two men looked like they were about eighty years old, but, "There will be two." I took out my pistol, opened it,

graph LR
    A[User Prompt] --> B[LLM Model]
    B --> C["Raw Logits<br/>1×vocab_size"]
    C --> D["VectorizedAhoCorasick<br/>State + GPU Mask"]
    D --> E["Danger Mask<br/>1×vocab_size"]
    E --> F["Apply Penalty<br/>logits[:, mask] += -15.0"]
    F --> G[Penalized Logits]
    G --> H[Token Generation]
    H --> I{Dangerous Token?}
    I -->|Yes| J["Probability ~0.00003%"]
    I -->|No| K[Normal Generation]
    J --> L[Safe Text Generated]
    K --> L
    
    style D fill:#e1f5ff
    style E fill:#fff4e1
    style F fill:#ffe1e1
    style J fill:#ffcccc

Concrete Example

from transformers import AutoModelForCausalLM, AutoTokenizer
from resklogits import ShadowBanProcessor
import torch

# 1. Load model and tokenizer (fall back to CPU when no GPU is available)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained("gpt2").to(device)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# 2. Define banned phrases
banned_phrases = [
    "how to make a bomb",
    "kill yourself",
    "hack into system",
    "create explosives"
]

# 3. Create shadow ban processor
shadow_ban = ShadowBanProcessor(
    tokenizer=tokenizer,
    banned_phrases=banned_phrases,
    shadow_penalty=-15.0,  # Strong penalty (probability ~0.00003%)
    device=device  # Same device as the model
)

# 4. Generate text with protection
prompt = "Tell me how to"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Reset state for new generation
shadow_ban.reset()

# Generate with shadow ban
outputs = model.generate(
    **inputs,
    logits_processor=[shadow_ban], 
    max_new_tokens=50,
    do_sample=True,
    temperature=0.7
)

# Result: Model naturally avoids dangerous tokens
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Generated text: {generated_text}")

Key Advantages

  • 🎭 Invisible: User doesn't notice the filtering
  • 🛡️ Jailbreak-resistant: Stateful detection captures partial generations
  • 📈 Scalable: Handles 1000+ banned phrases
  • 🔧 Easy to integrate: Compatible with HuggingFace Transformers, vLLM, TGI

Architecture

[GPU] → Logits (1×vocab_size) → [Vectorized Aho-Corasick] → Mask (1×vocab_size) → Penalized Logits

Installation

Using uv (recommended)

uv pip install resklogits

Using pip

pip install resklogits

From source

git clone https://github.com/Resk-Security/resk-logits.git
cd resk-logits
uv pip install -e .

Shadow Ban vs Hard Block

| Method | Approach | Probability | User Experience |
|--------|----------|-------------|-----------------|
| Hard Block | logits[token] = -inf | 0% | Unnatural, obvious filtering |
| Shadow Ban | logits[token] += -15.0 | ~0.00003% | Natural, invisible filtering |

Penalty Levels

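| Penalty | Probability | Use Case |
|---------|-------------|----------|
| -5.0 | ~1% | Light filtering |
| -10.0 | ~0.005% | Medium filtering |
| -15.0 | ~0.00003% | Strong filtering (default) |
| -20.0 | ~0.0000002% (effectively impossible) | Maximum filtering |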
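
The percentages in the table roughly match e^penalty, the factor by which a token's softmax probability shrinks when that probability is already small. The sketch below is not part of the library; `penalized_probability` is a hypothetical helper that computes the exact probability after a penalty is added to one logit, assuming all other logits stay unchanged.

```python
import math


def penalized_probability(p: float, penalty: float) -> float:
    """Softmax probability of a token after adding `penalty` to its logit.

    `p` is the token's probability before the penalty (hypothetical helper,
    not a resklogits function). All other logits are left untouched.
    """
    scaled = p * math.exp(penalty)
    return scaled / (1.0 - p + scaled)


# Compare the plain e^penalty factor with the exact result for a token
# that the model originally considered a coin flip (p = 0.5).
factors = {}
for penalty in (-5.0, -10.0, -15.0, -20.0):
    factors[penalty] = math.exp(penalty)
    exact = penalized_probability(0.5, penalty)
    print(f"{penalty:6.1f}  factor={factors[penalty]:.2e}  p=0.5 -> {exact:.2e}")

# A hard block is the limiting case: a penalty of -inf yields exactly zero.
hard_block = penalized_probability(0.5, float("-inf"))
```

Because the penalty is finite, a shadow-banned token keeps a small non-zero probability, which is the difference from a hard block.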

Multi-Level Filtering

For tiered safety filtering by severity:

from resklogits import MultiLevelShadowBanProcessor

phrases_by_level = {
    'high': ['bomb', 'kill', 'murder'],      # -20.0 penalty
    'medium': ['hack', 'exploit', 'crack'],  # -10.0 penalty
    'low': ['jailbreak', 'bypass']           # -5.0 penalty
}

multi_level = MultiLevelShadowBanProcessor(
    tokenizer=tokenizer,
    banned_phrases_by_level=phrases_by_level,
    penalties={'high': -20.0, 'medium': -10.0, 'low': -5.0}
)
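
One way to picture tiered penalties is as a single per-token penalty vector. The sketch below is an illustration only: the source does not say how the library resolves a token that appears in several levels, so `build_penalty_vector` (a hypothetical helper) assumes the strongest penalty wins, and the token ids are made up.

```python
from typing import Dict, Iterable, List


def build_penalty_vector(
    vocab_size: int,
    token_ids_by_level: Dict[str, Iterable[int]],
    penalties: Dict[str, float],
) -> List[float]:
    """Return one penalty per token id; 0.0 means the token is not penalized."""
    vector = [0.0] * vocab_size
    for level, token_ids in token_ids_by_level.items():
        for token_id in token_ids:
            # Assumption: when levels overlap, keep the most negative penalty.
            vector[token_id] = min(vector[token_id], penalties[level])
    return vector


# Made-up token ids; token 3 belongs to both 'high' and 'medium'.
token_ids_by_level = {"high": [1, 3], "medium": [3, 4], "low": [5]}
penalties = {"high": -20.0, "medium": -10.0, "low": -5.0}
penalty_vector = build_penalty_vector(7, token_ids_by_level, penalties)
print(penalty_vector)
```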

Symbolic Rule Generator

Generate patterns from YAML rules instead of manually listing them:

Create YAML Rules

rules:
  violence:
    severity: high
    penalty: -20.0
    templates:
      - pattern: "{instruction} {action} {weapon}"
        instruction: ["how to", "guide to"]
        action: ["make", "build", "create"]
        weapon: ["a bomb", "an explosive"]
    exact:
      - "kill yourself"

Generate Patterns

# CLI
resklogits generate rules.yaml -o patterns.json

# Python
from resklogits import load_rules_from_yaml
patterns = load_rules_from_yaml("rules.yaml")

Features

  • Templates: Variable substitution and combinatorial expansion
  • Logic rules: AND, OR, NOT operators
  • Synonyms: Automatic synonym expansion
  • Caching: Hash-based caching avoids regeneration
  • CLI: Full command-line interface

See RULE_BUILDER.md for the complete guide.

How It Works

1. Aho-Corasick Automaton

Classical multi-pattern matching with:

  • Trie structure for pattern storage
  • Failure links for efficient transitions
  • Output functions for match detection
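
The three components above can be sketched in plain Python. This is a textbook CPU version for illustration, not the library's vectorized implementation; the class name `AhoCorasick` and its methods are assumptions. Symbols can be characters or token ids.

```python
from collections import deque
from typing import Dict, Hashable, List, Sequence, Tuple


class AhoCorasick:
    def __init__(self, patterns: Sequence[Sequence[Hashable]]) -> None:
        self.goto: List[Dict[Hashable, int]] = [{}]  # trie edges per state
        self.fail: List[int] = [0]                   # failure links
        self.out: List[List[int]] = [[]]             # pattern indices ending here
        for index, pattern in enumerate(patterns):
            self._insert(index, pattern)
        self._build_failure_links()

    def _insert(self, index: int, pattern: Sequence[Hashable]) -> None:
        state = 0
        for symbol in pattern:
            if symbol not in self.goto[state]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[state][symbol] = len(self.goto) - 1
            state = self.goto[state][symbol]
        self.out[state].append(index)

    def _build_failure_links(self) -> None:
        # Breadth-first, so every failure target is finished before it is used.
        queue = deque(self.goto[0].values())  # depth-1 states fail to the root
        while queue:
            state = queue.popleft()
            for symbol, child in self.goto[state].items():
                fallback = self.fail[state]
                while fallback and symbol not in self.goto[fallback]:
                    fallback = self.fail[fallback]
                self.fail[child] = self.goto[fallback].get(symbol, 0)
                # Inherit matches that end at the failure target (suffix matches).
                self.out[child] = self.out[child] + self.out[self.fail[child]]
                queue.append(child)

    def step(self, state: int, symbol: Hashable) -> int:
        """Advance one symbol, following failure links on a mismatch."""
        while state and symbol not in self.goto[state]:
            state = self.fail[state]
        return self.goto[state].get(symbol, 0)

    def find_all(self, sequence: Sequence[Hashable]) -> List[Tuple[int, int]]:
        """Return (end position, pattern index) for every match."""
        matches, state = [], 0
        for position, symbol in enumerate(sequence):
            state = self.step(state, symbol)
            matches.extend((position, index) for index in self.out[state])
        return matches


automaton = AhoCorasick(["he", "she", "his", "hers"])
matches = automaton.find_all("ushers")
print(matches)
```

Each input symbol is consumed once regardless of how many patterns are loaded, which is why the approach scales to large phrase lists.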

2. GPU Vectorization

Pre-computes a binary mask [vocab_size] where:

  • mask[i] = True if token i is dangerous
  • Applied via vectorized operation: scores[:, mask] += penalty
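
A framework-free sketch of the mask idea follows. Plain Python lists stand in for tensors, so nothing here is vectorized; with a tensor library the update is the one-line indexed addition shown above. Treating every token id that occurs in a banned phrase as dangerous is an assumption made for this example, and `build_mask` and `apply_penalty` are hypothetical names.

```python
from typing import Iterable, List


def build_mask(vocab_size: int, dangerous_ids: Iterable[int]) -> List[bool]:
    """mask[i] is True when token id i is considered dangerous."""
    flagged = set(dangerous_ids)
    return [token_id in flagged for token_id in range(vocab_size)]


def apply_penalty(
    scores: List[List[float]], mask: List[bool], penalty: float
) -> List[List[float]]:
    """Add `penalty` to every masked column of a batch × vocab score matrix."""
    return [
        [score + penalty if flagged else score for score, flagged in zip(row, mask)]
        for row in scores
    ]


# Toy vocabulary of 6 tokens; banned phrases are already tokenized (made-up ids).
banned_token_sequences = [[2, 4], [4, 5]]
dangerous = {token_id for seq in banned_token_sequences for token_id in seq}
mask = build_mask(6, dangerous)  # computed once, reused at every step

scores = [
    [1.0, 0.5, 2.0, 0.0, 1.5, -1.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
]
penalized = apply_penalty(scores, mask, -15.0)
print(mask)
print(penalized)
```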

3. State Tracking

Maintains automaton state across generation:

  • Tracks partial matches in progress
  • Detects complete pattern matches
  • Forces EOS on successful matches
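
The sketch below illustrates state kept across generation steps. To stay short it tracks partial matches with a simple set of (pattern, position) pairs instead of an Aho-Corasick state, so it is a simplification rather than the library's method. `StreamingMatcher` and its methods are hypothetical names, and the token ids are made up.

```python
from typing import Hashable, List, Sequence, Set, Tuple


class StreamingMatcher:
    def __init__(self, patterns: Sequence[Sequence[Hashable]]) -> None:
        self.patterns = [tuple(pattern) for pattern in patterns]
        self.reset()

    def reset(self) -> None:
        """Clear state before a new generation."""
        self.partial: Set[Tuple[int, int]] = set()  # (pattern index, tokens matched)
        self.matched: List[int] = []                # completed pattern indices

    def _candidates(self) -> Set[Tuple[int, int]]:
        # Every pattern may also start fresh at the next token.
        return self.partial | {(i, 0) for i in range(len(self.patterns))}

    def step(self, token: Hashable) -> bool:
        """Consume one generated token; return True if a pattern just completed."""
        completed, advanced = [], set()
        for index, position in self._candidates():
            pattern = self.patterns[index]
            if position < len(pattern) and pattern[position] == token:
                if position + 1 == len(pattern):
                    completed.append(index)
                else:
                    advanced.add((index, position + 1))
        self.partial = advanced
        self.matched.extend(sorted(completed))
        return bool(completed)

    def completing_tokens(self) -> Set[Hashable]:
        """Tokens that would finish some pattern if generated next."""
        return {
            self.patterns[index][position]
            for index, position in self._candidates()
            if position + 1 == len(self.patterns[index])
        }


matcher = StreamingMatcher([(1, 2, 3), (2, 4)])
results = [matcher.step(token) for token in (5, 1, 2)]
risky_next = matcher.completing_tokens()  # tokens a penalty could target now
finished = matcher.step(3)                # a generation loop could stop here
print(results, risky_next, finished, matcher.matched)
```

Calling `reset()` between generations matters for the same reason `shadow_ban.reset()` appears in the usage example: partial matches from one prompt should not leak into the next.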

Banned Phrases Dataset

The library ships a dataset of 400+ dangerous phrases across 20 categories in src/resklogits/data/banned_phrases.json. Categories include:

  • Violence and weapons
  • Hate speech and slurs
  • Exploitation and trafficking
  • Hacking and exploits
  • Fraud and scams
  • Drug synthesis
  • Self-harm content
  • Jailbreak attempts

You can use your own patterns or extend the provided dataset.

Examples

The examples/ directory contains:

Demo Script

cd examples
python demo.py

Tests:

  • Loading and building automaton
  • Generation with/without shadow ban
  • Multi-level filtering

Benchmark Script

cd examples
python benchmark.py

Comprehensive benchmarks:

  • Automaton build time
  • Scaling with pattern count
  • Memory usage

Simple Usage

cd examples
python example_usage.py

Minimal example showing basic setup.

Rule Generator Demo

cd examples
python rule_generator_demo.py

Demonstrates symbolic rule generation with templates and caching.

Cache Management Demo

cd examples
python cache_demo.py

Shows cache functionality and management.

API Reference

VectorizedAhoCorasick

from resklogits import VectorizedAhoCorasick

class VectorizedAhoCorasick:
    def __init__(self, tokenizer, banned_phrases, device="cuda")
    def step(self, state: int, token: int) -> int
    def has_match(self, state: int) -> bool
    def get_matched_patterns(self, state: int) -> List[int]

ShadowBanProcessor

from resklogits import ShadowBanProcessor

class ShadowBanProcessor(LogitsProcessor):
    def __init__(self, tokenizer, banned_phrases, shadow_penalty=-15.0, device="cuda")
    def __call__(self, input_ids, scores) -> torch.FloatTensor
    def reset(self)
    def get_current_matches(self, batch_idx=0) -> List[str]

MultiLevelShadowBanProcessor

from resklogits import MultiLevelShadowBanProcessor

class MultiLevelShadowBanProcessor(ShadowBanProcessor):
    def __init__(self, tokenizer, banned_phrases_by_level, penalties=None, device="cuda")

ConfigParser (Rule Generator)

from resklogits import ConfigParser, load_rules_from_yaml

# Parse YAML rules
parser = ConfigParser()
results = parser.generate_all_patterns("rules.yaml")

# Convenience function
patterns = load_rules_from_yaml("rules.yaml", use_cache=True)

RuleCache

from resklogits import RuleCache

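cache = RuleCache()
# rule_hash and generate() are placeholders: the hash that identifies the
# rule set, and whatever function builds the patterns on a cache miss.
if cache.exists(rule_hash):
    patterns = cache.load(rule_hash)
else:
    patterns = generate()
    cache.save(rule_hash, patterns)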
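
The exists/load/save pattern above can be illustrated with a small stand-in built on the standard library. This is a sketch under assumptions, not how `RuleCache` is implemented: `FileCache`, `hash_rules`, and `get_patterns` are hypothetical names, the hash is SHA-256 of the rule text, and patterns are stored as JSON files.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable, List


def hash_rules(rules_text: str) -> str:
    """Derive a cache key from the rule text (assumed scheme: SHA-256)."""
    return hashlib.sha256(rules_text.encode("utf-8")).hexdigest()


class FileCache:
    """Stores one JSON file per key inside `directory`."""

    def __init__(self, directory: str) -> None:
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)

    def _path(self, key: str) -> Path:
        return self.directory / f"{key}.json"

    def exists(self, key: str) -> bool:
        return self._path(key).is_file()

    def load(self, key: str) -> List[str]:
        return json.loads(self._path(key).read_text(encoding="utf-8"))

    def save(self, key: str, patterns: List[str]) -> None:
        self._path(key).write_text(json.dumps(patterns), encoding="utf-8")


def get_patterns(
    rules_text: str, cache: FileCache, generate: Callable[[str], List[str]]
) -> List[str]:
    """Return cached patterns, generating and saving them on a miss."""
    key = hash_rules(rules_text)
    if cache.exists(key):
        return cache.load(key)
    patterns = generate(rules_text)
    cache.save(key, patterns)
    return patterns


calls: List[str] = []


def generate(rules_text: str) -> List[str]:
    calls.append(rules_text)  # record each (expensive) regeneration
    return ["how to make a bomb", "kill yourself"]


with tempfile.TemporaryDirectory() as tmp:
    cache = FileCache(tmp)
    first = get_patterns("rules: demo", cache, generate)
    second = get_patterns("rules: demo", cache, generate)  # served from cache

print(len(calls), first == second)
```

Keying the cache on a hash of the rules means any edit to the rule text produces a new key, so stale patterns are never returned for changed rules.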

Development

Setup Development Environment

git clone https://github.com/Resk-Security/resk-logits.git
cd resk-logits
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
uv pip install -e ".[dev]"

Run Tests

# Unit tests
pytest tests/ -v

# With coverage
pytest tests/ --cov=resklogits --cov-report=html

# Full test script
# Linux/Mac:
bash scripts/test_all.sh
# Windows:
scripts\test_all.bat

Local Build

# Build the package
uv build

# Check the package
twine check dist/*

# Test the installation
# Linux/Mac:
bash scripts/build_and_test.sh
# Windows:
scripts\build_and_test.bat

Code Formatting

# Format
black src/ tests/ examples/

# Check formatting
black --check src/ tests/ examples/

# Lint
ruff check src/ tests/ examples/

# Type checking
mypy src/

See LOCAL_TESTING.md for the complete testing guide.

Project Structure

resklogits/
├── src/
│   └── resklogits/
│       ├── __init__.py
│       ├── vectorized_aho_corasick.py
│       ├── shadow_ban_processor.py
│       └── data/
│           └── banned_phrases.json
├── examples/
│   ├── demo.py
│   ├── example_usage.py
│   └── benchmark.py
├── tests/
├── pyproject.toml
└── README.md

License

Apache License 2.0

Citation

If you use this in research, please cite:

@software{resklogits_2024,
  title={ReskLogits: GPU-Accelerated Shadow Ban Logits Processor},
  author={RESK Team},
  year={2025},
  url={https://github.com/Resk-Security/resk-logits}
}

Contributing

Contributions welcome! Areas for improvement:

  • Additional language support
  • More efficient GPU kernels
  • Dynamic pattern updates
  • Toxicity-based adaptive penalties
  • Extended pattern datasets
