
Lightweight payload classifier for HTTP log gatekeeping.

Project description

Gatekeeper ML

Gatekeeper ML is a lightweight binary classifier for HTTP log components. It uses character-level TF-IDF plus handcrafted text statistics to classify raw strings as either:

  • 0 for normal
  • 1 for suspicious
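As a rough illustration of the "handcrafted text statistics" half of the feature set (this is a sketch, not the project's actual feature extractor — the statistic names and the special-character set are assumptions):

```python
# Illustrative sketch only -- not the project's actual feature code.
# Computes a few handcrafted character-level statistics of the kind a
# payload classifier might combine with char-level TF-IDF features.

def text_stats(value: str) -> dict:
    """Return simple character-level statistics for a raw string."""
    n = len(value) or 1  # avoid division by zero for empty input
    specials = sum(1 for c in value if c in "<>'\";(){}|&%")
    digits = sum(1 for c in value if c.isdigit())
    return {
        "length": len(value),
        "special_ratio": specials / n,
        "digit_ratio": digits / n,
    }

print(text_stats("<script>alert(1)</script>"))
```

Suspicious payloads tend to score high on ratios like these, which gives the model signal beyond raw n-grams.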

The design goal is low-latency screening. Safe-looking values can bypass the ML model entirely through a conservative regex fast path, and only suspicious candidates need deeper downstream analysis.
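The fast path described above could be sketched as follows; the pattern, length cap, and function name are illustrative assumptions, not the project's actual implementation:

```python
# Illustrative sketch of a conservative regex fast path -- the pattern
# and names are assumptions, not the project's actual code.
import re

# Only very plain values bypass the model: alphanumerics plus a small
# set of URL-safe punctuation, capped at a modest length.
_SAFE_VALUE = re.compile(r"^[A-Za-z0-9/_\-.]{1,128}$")

def fast_path_is_safe(value: str) -> bool:
    """Return True only for values the conservative pattern accepts."""
    return bool(_SAFE_VALUE.fullmatch(value))

print(fast_path_is_safe("/api/v1/users/42"))           # plain path
print(fast_path_is_safe("<script>alert(1)</script>"))  # needs the model
```

Keeping the pattern conservative means the fast path can only produce "safe" verdicts; anything ambiguous still reaches the classifier, which preserves recall.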

Project Structure

gatekeeper-ml/
├── cli.py
├── data/
│   ├── raw/
│   └── processed/
├── models/
├── pyproject.toml
├── src/
│   └── gatekeeper_ml/
│       ├── __init__.py
│       ├── config.py
│       ├── data_loader.py
│       ├── features.py
│       ├── predict.py
│       ├── train.py
│       └── models/
│           └── payload_classifier.pkl
├── README.md
├── MANIFEST.in
└── requirements.txt

1. Create a Virtual Environment

From the gatekeeper-ml directory:

Windows PowerShell

python -m venv .venv
.venv\Scripts\Activate.ps1

macOS / Linux

python3 -m venv .venv
source .venv/bin/activate

2. Install Dependencies

pip install --upgrade pip
pip install -r requirements.txt

For local development as an installable package:

pip install -e .

To build and publish distribution artifacts, also install the packaging tools:

pip install build twine

3. Fetch and Prepare the Dataset

This command downloads suspicious payloads from PayloadsAllTheThings, generates normal HTTP-like samples, and writes the combined processed dataset to data/processed/.

python cli.py fetch

To force a fresh download from the upstream payload sources:

python cli.py fetch --force-refresh
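The processed dataset combines both sources into one labeled table. The column names and layout below are assumptions for illustration, not the actual schema written to data/processed/:

```python
# Sketch of the kind of labeled dataset the fetch step produces -- the
# column names and layout are assumptions, not the actual schema.
import csv
import io

suspicious = ["<script>alert(1)</script>", "' OR 1=1 --"]
normal = ["/api/v1/users/42", "GET /index.html"]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["text", "label"])  # 0 = normal, 1 = suspicious
writer.writerows((s, 1) for s in suspicious)
writer.writerows((s, 0) for s in normal)

print(buf.getvalue())
```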

4. Train the Model

This command:

  • loads the processed dataset
  • trains the RandomForest-based pipeline
  • saves the serialized .pkl model artifact to models/
  • prints Precision, Recall, F1-Score, and the recall-oriented threshold

python cli.py train

To rebuild the dataset before training:

python cli.py train --force-refresh

Expected artifacts:

  • models/gatekeeper_payload_classifier.pkl
  • models/training_metrics.json
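For reference, the metrics the train command reports are the standard definitions, computed here by hand on a toy label set (illustrative only; the project computes them inside its own pipeline):

```python
# Precision / Recall / F1 computed by hand on toy labels, to show what
# the train command's reported metrics mean. Illustrative data only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

precision = tp / (tp + fp)              # of flagged items, how many were truly suspicious
recall = tp / (tp + fn)                 # of suspicious items, how many were caught
f1 = 2 * precision * recall / (precision + recall)
print(f"Precision: {precision:.2f}  Recall: {recall:.2f}  F1: {f1:.2f}")
```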

5. Run Predictions

Pass one or more strings directly on the command line:

python cli.py predict "/api/v1/users/42" "<script>alert(1)</script>" "' OR 1=1 --"

Example output:

Prediction results
0    /api/v1/users/42
1    <script>alert(1)</script>
1    ' OR 1=1 --
Total inference time: 4.812 ms
Average per input:    1.604 ms

To point at a custom trained model:

python cli.py predict --model-path models/gatekeeper_payload_classifier.pkl "12345" "../../etc/passwd"
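The timing lines in the example output can be reproduced in spirit with a stand-in predictor (the real command times the trained model; this stub is an assumption for illustration):

```python
# Sketch of the batch-timing report, using a trivial stand-in predictor.
# The real predict command loads the trained model once and times it.
import time

def predict_stub(value: str) -> int:
    return 1 if "<" in value or "'" in value else 0

inputs = ["/api/v1/users/42", "<script>alert(1)</script>", "' OR 1=1 --"]

start = time.perf_counter()
results = [predict_stub(v) for v in inputs]
elapsed_ms = (time.perf_counter() - start) * 1000

print(results)
print(f"Total inference time: {elapsed_ms:.3f} ms")
print(f"Average per input:    {elapsed_ms / len(inputs):.3f} ms")
```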

6. Use as a Python Library

After pip install -e ., you can import the classifier directly:

from gatekeeper_ml import PayloadClassifier

# Option 1: Default load
clf = PayloadClassifier()

# Option 2: Custom path load
clf = PayloadClassifier(model_path="path/to/your_model.pkl")

predictions = clf.predict_batch([
    "/api/v1/users/42",
    "<script>alert(1)</script>",
])
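Putting the pieces together, the gating flow the library enables might look like the sketch below. PlaceholderClassifier and the regex stand in for the real PayloadClassifier and fast path; both are assumptions for illustration:

```python
# Sketch of the gating flow: a conservative fast path short-circuits
# obviously safe values, and everything else goes to the classifier in
# one batch. PlaceholderClassifier stands in for PayloadClassifier.
import re

_SAFE = re.compile(r"^[A-Za-z0-9/_\-.]{1,128}$")

class PlaceholderClassifier:
    def predict_batch(self, values):
        return [1 if any(c in v for c in "<>'\"") else 0 for v in values]

def screen(values, clf):
    labels = {}
    needs_model = []
    for v in values:
        if _SAFE.fullmatch(v):
            labels[v] = 0          # fast path: safe, skip the model
        else:
            needs_model.append(v)  # defer to the classifier
    for v, label in zip(needs_model, clf.predict_batch(needs_model)):
        labels[v] = label
    return labels

print(screen(["/api/v1/users/42", "<script>alert(1)</script>"],
             PlaceholderClassifier()))
```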

7. Build the Package

Before building a release, update the version in pyproject.toml.

Build both the source distribution and wheel:

python -m build

This creates the release artifacts in dist/:

  • dist/gatekeeper_ml-<version>.tar.gz
  • dist/gatekeeper_ml-<version>-py3-none-any.whl

Validate the generated metadata before upload:

python -m twine check dist/*

8. Publish to TestPyPI

TestPyPI is the safest way to verify packaging before a real release.

Create a TestPyPI API token, then export it:

Windows PowerShell

$env:TWINE_USERNAME="__token__"
$env:TWINE_PASSWORD="pypi-..."

macOS / Linux

export TWINE_USERNAME="__token__"
export TWINE_PASSWORD="pypi-..."

Upload to TestPyPI:

python -m twine upload --repository testpypi dist/*

You can then verify installation from TestPyPI:

pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple gatekeeper-ml

9. Publish to PyPI

After the TestPyPI release looks good, upload the same dist/ artifacts to PyPI:

python -m twine upload dist/*

Recommended release flow:

  1. Update the version in pyproject.toml.
  2. Remove old build artifacts if needed: rm -rf dist build *.egg-info or the Windows equivalent.
  3. Run python -m build.
  4. Run python -m twine check dist/*.
  5. Publish to TestPyPI.
  6. Install and smoke-test from TestPyPI.
  7. Publish to PyPI.

10. Trusted Publishing

For CI/CD releases, PyPI now supports Trusted Publishing via OIDC, which avoids storing long-lived API tokens in your repository secrets. If you publish from GitHub Actions or another supported CI provider, this is the recommended long-term setup.
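A minimal GitHub Actions workflow for Trusted Publishing might look like the sketch below. The triggers, Python version, and job layout are assumptions; the project must also be registered as a trusted publisher in the PyPI project settings for the OIDC exchange to work:

```yaml
# Sketch of a GitHub Actions release job using PyPI Trusted Publishing.
name: release
on:
  release:
    types: [published]
jobs:
  publish:
    runs-on: ubuntu-latest
    permissions:
      id-token: write   # required for OIDC Trusted Publishing
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: python -m pip install build && python -m build
      - uses: pypa/gh-action-pypi-publish@release/v1
```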

Notes

  • The data loader prefers live GitHub payload sources but includes a bootstrap suspicious set so the pipeline remains usable when remote fetches fail.
  • The predict command loads the model once and evaluates inputs in batch form for low overhead.
  • Fast-path safe heuristics are intentionally conservative to preserve recall.
  • A dedicated module usage guide is available at docs/USAGE.md.


Download files

Download the file for your platform.

Source Distribution

gatekeeper_ml-0.1.2.tar.gz (2.4 MB)


Built Distribution


gatekeeper_ml-0.1.2-py3-none-any.whl (2.4 MB)


File details

Details for the file gatekeeper_ml-0.1.2.tar.gz.

File metadata

  • Download URL: gatekeeper_ml-0.1.2.tar.gz
  • Upload date:
  • Size: 2.4 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for gatekeeper_ml-0.1.2.tar.gz:

  • SHA256: ddab819017357deaa64cf1110e9fac02c3a519ffa6796c52c0a41da0525d17c6
  • MD5: 972ecccb5214c283ce1d927bdf8e145d
  • BLAKE2b-256: a8d88715a4787a2605bdee64983fd7f5fd6d00d884a14a22a9222ed4a4aa86a0


File details

Details for the file gatekeeper_ml-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: gatekeeper_ml-0.1.2-py3-none-any.whl
  • Upload date:
  • Size: 2.4 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for gatekeeper_ml-0.1.2-py3-none-any.whl:

  • SHA256: cc49b23f6e7602c3ec48e64034963294fae4d3b4d4bca6476147b0fa231abc1c
  • MD5: 9b20066cda534943076fbf82b081ceb3
  • BLAKE2b-256: fce2a91f457ce4c3953f80333aba2f69e4c00605de5ad17a4ca2a000f2f65386

