Lightweight payload classifier for HTTP log gatekeeping.

Gatekeeper ML

Gatekeeper ML is a lightweight binary classifier for HTTP log components. It uses character-level TF-IDF plus handcrafted text statistics to classify raw strings as either:

  • 0 for normal
  • 1 for suspicious
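
To make the feature side concrete, here is a minimal sketch of the two ingredients named above: overlapping character n-grams (the raw input to a character-level TF-IDF) and a few handcrafted statistics. The function names and the specific statistics (length, special-character ratio, digit ratio, Shannon entropy) are illustrative assumptions, not the contents of features.py.

```python
import math
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams of a string."""
    return Counter(text[i:i + n] for i in range(max(len(text) - n + 1, 0)))


def text_stats(text: str) -> dict:
    """Compute a few hypothetical handcrafted statistics for one string."""
    length = len(text)
    if length == 0:
        return {"length": 0, "special_ratio": 0.0, "digit_ratio": 0.0, "entropy": 0.0}
    counts = Counter(text)
    # Shannon entropy in bits per character.
    entropy = -sum((c / length) * math.log2(c / length) for c in counts.values())
    # "Special" here means neither alphanumeric nor whitespace (an assumption).
    special = sum(1 for ch in text if not ch.isalnum() and not ch.isspace())
    digits = sum(1 for ch in text if ch.isdigit())
    return {
        "length": length,
        "special_ratio": special / length,
        "digit_ratio": digits / length,
        "entropy": entropy,
    }
```

Payloads such as `<script>alert(1)</script>` tend to score higher on the special-character ratio than plain paths like `/api/v1/users/42`, which is the kind of signal such statistics are meant to capture.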

The design goal is low-latency screening. Safe-looking values can bypass the ML model entirely through a conservative regex fast path, and only suspicious candidates need deeper downstream analysis.
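
The gating idea can be sketched as follows. This is an illustration under stated assumptions: the allow-list pattern `SAFE_PATTERN` and the `classify` helper are hypothetical, and the real fast-path rules in the package may differ.

```python
import re
from typing import Callable, List

# Hypothetical allow-list: short values made only of letters, digits, and a
# few URL-path characters. Anything else falls through to the model.
SAFE_PATTERN = re.compile(r"[A-Za-z0-9/_\-]{1,64}")


def classify(values: List[str],
             model_predict: Callable[[List[str]], List[int]]) -> List[int]:
    """Label fast-path matches as 0 and send the rest to the model in one batch."""
    results = [0] * len(values)
    pending = [i for i, v in enumerate(values) if not SAFE_PATTERN.fullmatch(v)]
    if pending:
        preds = model_predict([values[i] for i in pending])
        for i, p in zip(pending, preds):
            results[i] = int(p)
    return results
```

Using `fullmatch` rather than `search` keeps the fast path conservative: a value is skipped only when every character is on the allow-list, so a payload cannot hide behind a safe-looking prefix.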

Project Structure

gatekeeper-ml/
├── cli.py
├── data/
│   ├── raw/
│   └── processed/
├── models/
├── pyproject.toml
├── src/
│   └── gatekeeper_ml/
│       ├── __init__.py
│       ├── config.py
│       ├── data_loader.py
│       ├── features.py
│       ├── predict.py
│       ├── train.py
│       └── models/
│           └── payload_classifier.pkl
├── README.md
├── MANIFEST.in
└── requirements.txt

1. Create a Virtual Environment

From the gatekeeper-ml directory:

Windows PowerShell

python -m venv .venv
.venv\Scripts\Activate.ps1

macOS / Linux

python3 -m venv .venv
source .venv/bin/activate

2. Install Dependencies

pip install --upgrade pip
pip install -r requirements.txt

For local development as an installable package:

pip install -e .

To build and publish distribution artifacts, also install the packaging tools:

pip install build twine

3. Fetch and Prepare the Dataset

This command downloads suspicious payloads from PayloadsAllTheThings, generates normal HTTP-like samples, and writes the combined processed dataset to data/processed/.

python cli.py fetch
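
As a rough picture of what combining the two sources involves, the sketch below generates normal path-like samples and merges them with suspicious payloads into labeled rows. The helper names, the sample template, and the resource list are assumptions made for illustration; data_loader.py may generate samples differently.

```python
import random
from typing import List, Tuple


def make_normal_samples(count: int, seed: int = 0) -> List[str]:
    """Generate hypothetical normal-looking API paths, reproducibly."""
    rng = random.Random(seed)
    resources = ["users", "orders", "products", "sessions"]
    samples = []
    for _ in range(count):
        resource = rng.choice(resources)
        samples.append(f"/api/v1/{resource}/{rng.randint(1, 9999)}")
    return samples


def build_dataset(suspicious: List[str], normal: List[str]) -> List[Tuple[str, int]]:
    """Return (text, label) rows: 1 for suspicious, 0 for normal."""
    # dict.fromkeys drops duplicates while keeping first-seen order.
    rows = [(text, 1) for text in dict.fromkeys(suspicious)]
    rows += [(text, 0) for text in dict.fromkeys(normal)]
    return rows
```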

To force a fresh download from the upstream payload sources:

python cli.py fetch --force-refresh

4. Train the Model

This command:

  • loads the processed dataset
  • trains the RandomForest-based pipeline
  • saves the serialized .pkl model artifact to models/
  • prints Precision, Recall, F1-Score, and the recall-oriented threshold

python cli.py train

To rebuild the dataset before training:

python cli.py train --force-refresh

Expected artifacts:

  • models/gatekeeper_payload_classifier.pkl
  • models/training_metrics.json
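
One common way to obtain a recall-oriented threshold is to pick the highest score cutoff that still reaches a target recall on held-out data. The sketch below shows that idea together with the three reported metrics. It is an assumption about the approach, not the code in train.py, and the `target_recall` default is hypothetical.

```python
from typing import List, Tuple


def precision_recall_f1(labels: List[int], preds: List[int]) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def recall_oriented_threshold(labels: List[int], scores: List[float],
                              target_recall: float = 0.99) -> float:
    """Return the highest threshold whose recall still meets target_recall."""
    for threshold in sorted(set(scores), reverse=True):
        preds = [1 if s >= threshold else 0 for s in scores]
        if precision_recall_f1(labels, preds)[1] >= target_recall:
            return threshold
    return 0.0  # No cutoff reaches the target; flag everything.
```

Lowering the threshold trades precision for recall, which suits a screening stage where a missed attack costs more than an extra downstream check.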

5. Run Predictions

Pass one or more strings directly on the command line:

python cli.py predict "/api/v1/users/42" "<script>alert(1)</script>" "' OR 1=1 --"

Example output:

Prediction results
0    /api/v1/users/42
1    <script>alert(1)</script>
1    ' OR 1=1 --
Total inference time: 4.812 ms
Average per input:    1.604 ms
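
The two timing lines can be reproduced with a wall-clock measurement around a single batch call. The sketch below assumes a `predict` callable that takes a list of strings and returns one label per string; `timed_batch` and `format_report` are hypothetical helpers, not functions from cli.py.

```python
import time
from typing import Callable, List, Tuple


def timed_batch(predict: Callable[[List[str]], List[int]],
                inputs: List[str]) -> Tuple[List[int], float, float]:
    """Run one batch prediction and return (labels, total ms, average ms)."""
    start = time.perf_counter()
    predictions = predict(inputs)
    total_ms = (time.perf_counter() - start) * 1000.0
    average_ms = total_ms / len(inputs) if inputs else 0.0
    return list(predictions), total_ms, average_ms


def format_report(inputs: List[str], predictions: List[int],
                  total_ms: float, average_ms: float) -> str:
    """Format results in the same layout as the example output above."""
    lines = ["Prediction results"]
    lines += [f"{label}    {text}" for label, text in zip(predictions, inputs)]
    lines.append(f"Total inference time: {total_ms:.3f} ms")
    lines.append(f"Average per input:    {average_ms:.3f} ms")
    return "\n".join(lines)
```

The average is simply the total divided by the number of inputs, which matches the example: 4.812 ms over three inputs gives 1.604 ms.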

To point at a custom trained model:

python cli.py predict --model-path models/gatekeeper_payload_classifier.pkl "12345" "../../etc/passwd"

6. Use as a Python Library

After pip install -e ., you can import the classifier directly:

from gatekeeper_ml import PayloadClassifier

# Option 1: Default load
clf = PayloadClassifier()

# Option 2: Custom path load
clf = PayloadClassifier(model_path="path/to/your_model.pkl")

predictions = clf.predict_batch([
    "/api/v1/users/42",
    "<script>alert(1)</script>",
])
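
A typical next step is to route inputs by label. The helper below is a sketch that assumes predict_batch returns one 0/1 label per input, in input order; that return shape is an assumption, so check docs/USAGE.md for the actual contract. The `partition` name is hypothetical.

```python
from typing import Iterable, List, Tuple


def partition(inputs: List[str],
              predictions: Iterable[int]) -> Tuple[List[str], List[str]]:
    """Split inputs into (normal, suspicious) lists based on 0/1 labels."""
    normal: List[str] = []
    suspicious: List[str] = []
    for text, label in zip(inputs, predictions):
        (suspicious if int(label) == 1 else normal).append(text)
    return normal, suspicious


# Hypothetical usage with the classifier from the example above:
#   values = ["/api/v1/users/42", "<script>alert(1)</script>"]
#   normal, suspicious = partition(values, clf.predict_batch(values))
```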

7. Build the Package

Before building a release, update the version in pyproject.toml.

Build both the source distribution and wheel:

python -m build

This creates the release artifacts in dist/:

  • dist/gatekeeper_ml-<version>.tar.gz
  • dist/gatekeeper_ml-<version>-py3-none-any.whl

Validate the generated metadata before upload:

python -m twine check dist/*

8. Publish to TestPyPI

TestPyPI is a separate package index, so uploading there lets you verify packaging and installation without touching the real PyPI project.

Create a TestPyPI API token, then export it:

Windows PowerShell

$env:TWINE_USERNAME="__token__"
$env:TWINE_PASSWORD="pypi-..."

macOS / Linux

export TWINE_USERNAME="__token__"
export TWINE_PASSWORD="pypi-..."

Upload to TestPyPI:

python -m twine upload --repository testpypi dist/*

You can then verify installation from TestPyPI:

pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple gatekeeper-ml

9. Publish to PyPI

After the TestPyPI release looks good, set TWINE_PASSWORD to a PyPI API token (TestPyPI tokens are not valid on PyPI) and upload the same dist/ artifacts to PyPI:

python -m twine upload dist/*

Recommended release flow:

  1. Update the version in pyproject.toml.
  2. Remove old build artifacts if needed: rm -rf dist build *.egg-info or the Windows equivalent.
  3. Run python -m build.
  4. Run python -m twine check dist/*.
  5. Publish to TestPyPI.
  6. Install and smoke-test from TestPyPI.
  7. Publish to PyPI.
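
For step 2, a small Python helper avoids needing separate shell commands on Windows and on macOS / Linux. This is a sketch: `clean_build_artifacts` is a hypothetical helper, and it mirrors the rm -rf command by looking only at the project root.

```python
import shutil
from pathlib import Path
from typing import List


def clean_build_artifacts(project_root: str = ".") -> List[str]:
    """Delete dist/, build/, and *.egg-info under project_root; return what was removed."""
    root = Path(project_root)
    targets = [root / "dist", root / "build", *root.glob("*.egg-info")]
    removed = []
    for target in targets:
        if target.is_dir():
            shutil.rmtree(target)
            removed.append(target.name)
        elif target.exists():
            target.unlink()
            removed.append(target.name)
    return sorted(removed)
```

Call it from the gatekeeper-ml directory before running python -m build so stale wheels from an earlier version are not uploaded by the dist/* glob.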

10. Trusted Publishing

For CI/CD releases, PyPI now supports Trusted Publishing via OIDC, which avoids storing long-lived API tokens in your repository secrets. If you publish from GitHub Actions or another supported CI provider, this is the recommended long-term setup.

Notes

  • The data loader prefers live GitHub payload sources but includes a bootstrap suspicious set so the pipeline remains usable when remote fetches fail.
  • The predict command loads the model once and evaluates inputs in batch form for low overhead.
  • Fast-path safe heuristics are intentionally conservative to preserve recall.
  • A dedicated module usage guide is available at docs/USAGE.md.
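
The fallback behaviour described in the first note can be sketched as below. The names `BOOTSTRAP_SUSPICIOUS`, `fetch_lines`, and `load_suspicious` are hypothetical, and the bootstrap entries are just the example payloads from this README, not the package's actual bootstrap set.

```python
import urllib.request
from typing import Callable, Iterable, List

# Illustrative stand-in for the built-in bootstrap set (an assumption).
BOOTSTRAP_SUSPICIOUS = ["<script>alert(1)</script>", "' OR 1=1 --", "../../etc/passwd"]


def fetch_lines(url: str, timeout: float = 10.0) -> List[str]:
    """Download a text file and return its non-empty lines."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        text = response.read().decode("utf-8", errors="replace")
    return [line.strip() for line in text.splitlines() if line.strip()]


def load_suspicious(urls: Iterable[str],
                    fetch: Callable[[str], List[str]] = fetch_lines) -> List[str]:
    """Prefer live sources; fall back to the bootstrap set if nothing was fetched."""
    payloads: List[str] = []
    for url in urls:
        try:
            payloads.extend(fetch(url))
        except OSError:  # urllib's URLError and socket timeouts derive from OSError
            continue
    return payloads or list(BOOTSTRAP_SUSPICIOUS)
```

Passing the fetch function as a parameter keeps the fallback path testable without network access.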

Download files

Source Distribution

gatekeeper_ml-0.1.6.tar.gz (2.7 MB)

Uploaded Source

Built Distribution

gatekeeper_ml-0.1.6-py3-none-any.whl (2.8 MB)

Uploaded Python 3

File details

Details for the file gatekeeper_ml-0.1.6.tar.gz.

File metadata

  • Download URL: gatekeeper_ml-0.1.6.tar.gz
  • Upload date:
  • Size: 2.7 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for gatekeeper_ml-0.1.6.tar.gz

Algorithm    Hash digest
SHA256       e30890166cc8324fd17939825c4657ccac9b5ccf4ba59f63e9871a95ad9f60bc
MD5          9b84d54548f40393e2343e4b7198ea92
BLAKE2b-256  1db51cb0cb00531467837ae89132e838e61478e664f7737d8b2dd5bd7311f2f3
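
To check a downloaded file against a published digest, you can hash it locally and compare. The sketch below uses the standard hashlib module; `sha256_of` and `matches_published` are hypothetical helper names.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex string."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches_published("gatekeeper_ml-0.1.6.tar.gz", "<SHA256 value listed above>")` should return True for an intact download of the source distribution.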

File details

Details for the file gatekeeper_ml-0.1.6-py3-none-any.whl.

File metadata

  • Download URL: gatekeeper_ml-0.1.6-py3-none-any.whl
  • Upload date:
  • Size: 2.8 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for gatekeeper_ml-0.1.6-py3-none-any.whl

Algorithm    Hash digest
SHA256       cdd4e7defd1e789a52279f3368c3072149e4b6246f77383885afd23432c7df71
MD5          b677544cd05d31d8f4f393b1ecb77412
BLAKE2b-256  6c3e71ee9cedcfd966b41d1350d210fdba1a32e09b17d65dc19feaf61ecd6f1e
