Lightweight payload classifier for HTTP log gatekeeping.

Gatekeeper ML

Gatekeeper ML is a lightweight binary classifier for HTTP log components. It uses character-level TF-IDF plus handcrafted text statistics to classify raw strings as either:

  • 0 for normal
  • 1 for suspicious
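
The handcrafted statistics are not enumerated in this description. As a rough illustration only, features of this kind often include string length, character-class ratios, and Shannon entropy. The function name `text_stats` and the feature set below are assumptions, not the contents of features.py.

```python
import math
from collections import Counter


def text_stats(value: str) -> dict[str, float]:
    """Compute a few illustrative per-string statistics (hypothetical feature set)."""
    n = len(value)
    if n == 0:
        return {"length": 0.0, "digit_ratio": 0.0, "special_ratio": 0.0, "entropy": 0.0}
    counts = Counter(value)
    # Shannon entropy in bits per character.
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    digits = sum(ch.isdigit() for ch in value)
    # "Special" here means any character that is not alphanumeric.
    special = sum(not ch.isalnum() for ch in value)
    return {
        "length": float(n),
        "digit_ratio": digits / n,
        "special_ratio": special / n,
        "entropy": entropy,
    }
```

Statistics like these would sit alongside the character-level TF-IDF vector as extra numeric columns.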

The design goal is low-latency screening. Safe-looking values can bypass the ML model entirely through a conservative regex fast path, and only suspicious candidates need deeper downstream analysis.
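
The fast path can be pictured as an allow-list check that runs before the model. The sketch below is a minimal version under stated assumptions: the pattern `SAFE_PATTERN` and the helpers `is_obviously_safe` and `screen` are hypothetical, and the project's actual regex is not shown here.

```python
import re
from typing import Callable, Sequence

# Hypothetical allow-list: short values made only of letters, digits, and a
# few path-friendly characters. The project's real pattern may differ.
SAFE_PATTERN = re.compile(r"[A-Za-z0-9_./-]{1,64}")


def is_obviously_safe(value: str) -> bool:
    """Return True only when the whole string matches the conservative allow-list."""
    return SAFE_PATTERN.fullmatch(value) is not None and ".." not in value


def screen(values: Sequence[str], model_predict: Callable[[list[str]], list[int]]) -> list[int]:
    """Label fast-path values 0 and send everything else to the model in one batch."""
    labels = [0] * len(values)
    pending = [i for i, v in enumerate(values) if not is_obviously_safe(v)]
    if pending:
        predicted = model_predict([values[i] for i in pending])
        for i, label in zip(pending, predicted):
            labels[i] = label
    return labels
```

Keeping the allow-list narrow means a borderline value falls through to the model rather than being waved through, which favors recall over speed.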

Project Structure

gatekeeper-ml/
├── cli.py
├── data/
│   ├── raw/
│   └── processed/
├── models/
├── pyproject.toml
├── src/
│   └── gatekeeper_ml/
│       ├── __init__.py
│       ├── config.py
│       ├── data_loader.py
│       ├── features.py
│       ├── predict.py
│       ├── train.py
│       └── models/
│           └── payload_classifier.pkl
├── README.md
├── MANIFEST.in
└── requirements.txt

1. Create a Virtual Environment

From the gatekeeper-ml directory:

Windows PowerShell

python -m venv .venv
.venv\Scripts\Activate.ps1

macOS / Linux

python3 -m venv .venv
source .venv/bin/activate
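
To confirm that the activation worked, one quick check from Python is to compare the interpreter prefixes. The helper name `in_virtualenv` is illustrative.

```python
import sys


def in_virtualenv() -> bool:
    """True when the interpreter runs inside a venv (prefix differs from the base install)."""
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
```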

2. Install Dependencies

pip install --upgrade pip
pip install -r requirements.txt

For local development as an installable package:

pip install -e .

To build and publish distribution artifacts, also install the packaging tools:

pip install build twine

3. Fetch and Prepare the Dataset

This command downloads suspicious payloads from PayloadsAllTheThings, generates normal HTTP-like samples, and writes the combined processed dataset to data/processed/.

python cli.py fetch

To force a fresh download from the upstream payload sources:

python cli.py fetch --force-refresh
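
As a rough picture of the combine step, the sketch below labels two lists of strings and writes them out as CSV. The column names `text` and `label`, both function names, and the way normal samples are generated are assumptions, not the behavior of data_loader.py.

```python
import csv
import random
from typing import Iterable, TextIO


def generate_normal_samples(count: int, seed: int = 0) -> list[str]:
    """Produce simple HTTP-like paths as stand-ins for normal traffic (illustrative only)."""
    rng = random.Random(seed)
    resources = ["users", "orders", "products"]
    return [f"/api/v1/{rng.choice(resources)}/{rng.randint(1, 9999)}" for _ in range(count)]


def write_dataset(normal: Iterable[str], suspicious: Iterable[str], out: TextIO) -> int:
    """Write labeled rows (0 = normal, 1 = suspicious) and return the row count."""
    writer = csv.writer(out)
    writer.writerow(["text", "label"])
    rows = 0
    for label, samples in ((0, normal), (1, suspicious)):
        for text in samples:
            writer.writerow([text, label])
            rows += 1
    return rows
```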

4. Train the Model

This command:

  • loads the processed dataset
  • trains the RandomForest-based pipeline
  • saves the serialized .pkl model artifact to models/
  • prints Precision, Recall, F1-Score, and the recall-oriented threshold

python cli.py train

To rebuild the dataset before training:

python cli.py train --force-refresh

Expected artifacts:

  • models/gatekeeper_payload_classifier.pkl
  • models/training_metrics.json
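
The term "recall-oriented threshold" is not defined in this description. One common reading is the highest probability cutoff that still meets a target recall on held-out data, and the sketch below follows that reading. The function `pick_threshold` and the `target_recall` parameter are hypothetical, not taken from train.py.

```python
from typing import Sequence


def pick_threshold(y_true: Sequence[int], scores: Sequence[float], target_recall: float = 0.95) -> float:
    """Return the highest score cutoff whose recall on the positives meets the target."""
    positives = sum(y_true)
    if positives == 0:
        raise ValueError("need at least one positive example")
    # Try candidate cutoffs from highest to lowest; recall only grows as the cutoff drops.
    for cutoff in sorted(set(scores), reverse=True):
        true_pos = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= cutoff)
        if true_pos / positives >= target_recall:
            return cutoff
    # Only reached if target_recall is above 1.0.
    return min(scores)
```

An input is then labeled 1 when its predicted probability is at or above the returned cutoff, which trades some precision for fewer missed suspicious strings.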

5. Run Predictions

Pass one or more strings directly on the command line:

python cli.py predict "/api/v1/users/42" "<script>alert(1)</script>" "' OR 1=1 --"

Example output:

Prediction results
0    /api/v1/users/42
1    <script>alert(1)</script>
1    ' OR 1=1 --
Total inference time: 4.812 ms
Average per input:    1.604 ms
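
The two timing lines are related by a simple division: 4.812 ms across 3 inputs is 1.604 ms per input. A minimal sketch of how such numbers could be measured is shown below. The helper `timed_batch` and the callable it receives are hypothetical and do not reflect the CLI's actual code.

```python
import time
from typing import Callable, Sequence


def timed_batch(
    predict: Callable[[Sequence[str]], list[int]],
    inputs: Sequence[str],
) -> tuple[list[int], float, float]:
    """Run one batch call and return (labels, total_ms, average_ms_per_input)."""
    start = time.perf_counter()
    labels = predict(inputs)
    total_ms = (time.perf_counter() - start) * 1000.0
    average_ms = total_ms / len(inputs) if inputs else 0.0
    return labels, total_ms, average_ms
```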

To point at a custom trained model:

python cli.py predict --model-path models/gatekeeper_payload_classifier.pkl "12345" "../../etc/passwd"

6. Use as a Python Library

After pip install -e ., you can import the classifier directly:

from gatekeeper_ml import PayloadClassifier

# Option 1: Default load
clf = PayloadClassifier()

# Option 2: Custom path load
clf = PayloadClassifier(model_path="path/to/your_model.pkl")

predictions = clf.predict_batch([
    "/api/v1/users/42",
    "<script>alert(1)</script>",
])
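
For gatekeeping, the predictions usually need to be paired back with their inputs. The sketch below assumes that predict_batch returns one 0/1 label per input, in input order; the CLI output above suggests this, but it is not stated explicitly. The helper `split_by_label` is hypothetical.

```python
from typing import Sequence


def split_by_label(inputs: Sequence[str], predictions: Sequence[int]) -> tuple[list[str], list[str]]:
    """Separate inputs into (normal, suspicious) using parallel 0/1 labels."""
    if len(inputs) != len(predictions):
        raise ValueError("inputs and predictions must have the same length")
    normal = [text for text, label in zip(inputs, predictions) if label == 0]
    suspicious = [text for text, label in zip(inputs, predictions) if label == 1]
    return normal, suspicious
```

With a loaded classifier, `normal, suspicious = split_by_label(values, clf.predict_batch(values))` would leave only `suspicious` to forward for deeper analysis.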

7. Build the Package

Before building a release, update the version in pyproject.toml.

Build both the source distribution and wheel:

python -m build

This creates the release artifacts in dist/:

  • dist/gatekeeper_ml-<version>.tar.gz
  • dist/gatekeeper_ml-<version>-py3-none-any.whl

Validate the generated metadata before upload:

python -m twine check dist/*
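
Before uploading, it can help to confirm that dist/ holds exactly the two files named above for the version being released. The helpers below are an illustrative sketch and not part of the project.

```python
from pathlib import Path


def expected_artifacts(version: str) -> list[str]:
    """File names that the build step is documented to produce for a version."""
    return [
        f"gatekeeper_ml-{version}.tar.gz",
        f"gatekeeper_ml-{version}-py3-none-any.whl",
    ]


def missing_artifacts(dist_dir: Path, version: str) -> list[str]:
    """Return the expected file names that are not present in dist_dir."""
    return [name for name in expected_artifacts(version) if not (dist_dir / name).is_file()]
```

An empty list from `missing_artifacts(Path("dist"), "<version>")` means both the source distribution and the wheel are in place.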

8. Publish to TestPyPI

TestPyPI is the safest way to verify packaging before a real release.

Create a TestPyPI API token, then export it:

Windows PowerShell

$env:TWINE_USERNAME="__token__"
$env:TWINE_PASSWORD="pypi-..."

macOS / Linux

export TWINE_USERNAME="__token__"
export TWINE_PASSWORD="pypi-..."

Upload to TestPyPI:

python -m twine upload --repository testpypi dist/*

You can then verify installation from TestPyPI:

pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple gatekeeper-ml
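
After installing from TestPyPI, a simple smoke test is to ask Python which version it actually sees. The sketch below uses the standard importlib.metadata module; the helper name `installed_version` is illustrative.

```python
from importlib import metadata
from typing import Optional


def installed_version(distribution: str = "gatekeeper-ml") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is not installed."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("gatekeeper-ml version:", installed_version())
```

Comparing the printed value against the version in pyproject.toml catches the case where an older release was installed instead.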

9. Publish to PyPI

After the TestPyPI release looks good, upload the same dist/ artifacts to PyPI:

python -m twine upload dist/*

Recommended release flow:

  1. Update the version in pyproject.toml.
  2. Remove old build artifacts if needed: rm -rf dist build *.egg-info or the Windows equivalent.
  3. Run python -m build.
  4. Run python -m twine check dist/*.
  5. Publish to TestPyPI.
  6. Install and smoke-test from TestPyPI.
  7. Publish to PyPI.
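
For step 2, a cross-platform alternative to rm -rf is a few lines of Python. This is a sketch that mirrors the shell command above; the function name `clean_build_artifacts` is hypothetical.

```python
import shutil
from pathlib import Path


def clean_build_artifacts(project_root: Path) -> list[Path]:
    """Delete dist/, build/, and any *.egg-info directories directly under project_root."""
    targets = [project_root / "dist", project_root / "build", *project_root.glob("*.egg-info")]
    removed = []
    for path in targets:
        if path.is_dir():
            shutil.rmtree(path)
            removed.append(path)
    return removed
```

Like the shell command, this only looks at the project root, so an egg-info directory created elsewhere (for example under src/) would need to be removed separately.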

10. Trusted Publishing

For CI/CD releases, PyPI now supports Trusted Publishing via OIDC, which avoids storing long-lived API tokens in your repository secrets. If you publish from GitHub Actions or another supported CI provider, this is the recommended long-term setup.

Notes

  • The data loader prefers live GitHub payload sources but includes a bootstrap suspicious set so the pipeline remains usable when remote fetches fail.
  • The predict command loads the model once and evaluates inputs in batch form for low overhead.
  • Fast-path safe heuristics are intentionally conservative to preserve recall.
  • A dedicated module usage guide is available at docs/USAGE.md.
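
The first note describes a prefer-remote, fall-back-to-bootstrap pattern. A minimal sketch of that pattern is shown below. The names `BOOTSTRAP_SUSPICIOUS` and `load_suspicious`, the sample strings, and the injected `fetch_remote` callable are all assumptions; the project's actual bootstrap set and loader are not shown here.

```python
from typing import Callable, Iterable

# Hypothetical built-in samples; the project's actual bootstrap set is not shown here.
BOOTSTRAP_SUSPICIOUS = ["<script>alert(1)</script>", "' OR 1=1 --", "../../etc/passwd"]


def load_suspicious(fetch_remote: Callable[[], Iterable[str]]) -> list[str]:
    """Prefer remotely fetched payloads; fall back to the bootstrap set on failure."""
    try:
        payloads = [line.strip() for line in fetch_remote() if line.strip()]
    except OSError:
        # urllib.error.URLError and socket timeouts are OSError subclasses.
        payloads = []
    # An empty remote result is treated the same as a failed fetch.
    return payloads or list(BOOTSTRAP_SUSPICIOUS)
```

Passing the fetch step in as a callable keeps the fallback logic testable without network access.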

Download files

Source Distribution

gatekeeper_ml-0.1.4.tar.gz (2.5 MB)

Uploaded Source

Built Distribution

gatekeeper_ml-0.1.4-py3-none-any.whl (2.5 MB)

Uploaded Python 3

File details

Details for the file gatekeeper_ml-0.1.4.tar.gz.

File metadata

  • Download URL: gatekeeper_ml-0.1.4.tar.gz
  • Upload date:
  • Size: 2.5 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for gatekeeper_ml-0.1.4.tar.gz
Algorithm    Hash digest
SHA256       7c28a47ecea451d8a3cf4bfbc068d0d5caabbb93ca5d9e52312ea4fdc6817455
MD5          345a2426aac536088c77908cc22e29fb
BLAKE2b-256  7d91cf75a00c14429cdde79ab58ee8eff6241e651c19c58040ef52efc7f3bb21

File details

Details for the file gatekeeper_ml-0.1.4-py3-none-any.whl.

File metadata

  • Download URL: gatekeeper_ml-0.1.4-py3-none-any.whl
  • Upload date:
  • Size: 2.5 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for gatekeeper_ml-0.1.4-py3-none-any.whl
Algorithm    Hash digest
SHA256       24b3f161f5ec3fb3a5fc7cfe38074f414e0ea02c6759d1f957fd0ce97b86c9b7
MD5          4ea6f006852a53c9d2754fc98894ea88
BLAKE2b-256  f737cff78669e122da6c9cce9c7a3bc2f397880e41c75b563c6fd5c7ce143ade
