Lightweight payload classifier for HTTP log gatekeeping.

Gatekeeper ML

Gatekeeper ML is a lightweight binary classifier for HTTP log components. It uses character-level TF-IDF plus handcrafted text statistics to classify raw strings as either:

0 for normal
1 for suspicious
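
To make "handcrafted text statistics" and "character-level" concrete, here is a minimal sketch using only the standard library. The function names and the particular statistics (length, entropy, special-character share, digit share) are assumptions chosen for illustration; the project's real feature set lives in features.py and may differ.

```python
import math
from collections import Counter


def text_stats(value: str) -> dict[str, float]:
    """Hypothetical handcrafted statistics for one raw string."""
    n = len(value)
    if n == 0:
        return {"length": 0.0, "entropy": 0.0, "special_ratio": 0.0, "digit_ratio": 0.0}
    counts = Counter(value)
    # Shannon entropy in bits per character; encoded payloads tend to score higher.
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    special = sum(1 for ch in value if not ch.isalnum())
    digits = sum(1 for ch in value if ch.isdigit())
    return {
        "length": float(n),
        "entropy": entropy,
        "special_ratio": special / n,
        "digit_ratio": digits / n,
    }


def char_ngrams(value: str, n: int = 3) -> list[str]:
    """Character n-grams, the units a character-level TF-IDF would count."""
    return [value[i:i + n] for i in range(len(value) - n + 1)]
```

In a pipeline like the one described, statistics of this kind would typically be concatenated with the TF-IDF vector before the classifier sees them.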

The design goal is low-latency screening. Values that match a conservative regex fast path are treated as safe and bypass the ML model entirely; the model scores the rest, and only values it classifies as suspicious need deeper downstream analysis.
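
The fast-path idea can be sketched as follows. The pattern, the length limit, and the function names are assumptions made for this example, not the project's actual rules; the point is only that the gate accepts a narrow, clearly benign shape and sends everything else to the model.

```python
import re
from typing import Callable

# Hypothetical allow-list: short values made only of characters that are
# common in benign paths, identifiers, and numbers.
SAFE_PATTERN = re.compile(r"[A-Za-z0-9_\-./]{1,64}")


def looks_safe(value: str) -> bool:
    """Conservative check: True only for clearly benign-looking values."""
    # Path traversal and doubled slashes use "safe" characters, so exclude
    # them explicitly and let the model decide.
    if ".." in value or "//" in value:
        return False
    return SAFE_PATTERN.fullmatch(value) is not None


def classify(value: str, model_predict: Callable[[str], int]) -> int:
    """Return 0 (normal) or 1 (suspicious), skipping the model when possible."""
    if looks_safe(value):
        return 0
    return model_predict(value)
```

Because the gate can only ever return 0, every false accept is a missed detection. That is why a sketch like this keeps the pattern narrow and falls through to the model whenever it is unsure.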

Project Structure

gatekeeper-ml/
├── cli.py
├── data/
│   ├── raw/
│   └── processed/
├── models/
├── pyproject.toml
├── src/
│   └── gatekeeper_ml/
│       ├── __init__.py
│       ├── config.py
│       ├── data_loader.py
│       ├── features.py
│       ├── predict.py
│       ├── train.py
│       └── models/
│           └── payload_classifier.pkl
├── README.md
├── MANIFEST.in
└── requirements.txt

1. Create a Virtual Environment

From the gatekeeper-ml directory:

Windows PowerShell

python -m venv .venv
.venv\Scripts\Activate.ps1

macOS / Linux

python3 -m venv .venv
source .venv/bin/activate

2. Install Dependencies

pip install --upgrade pip
pip install -r requirements.txt

For local development as an installable package:

pip install -e .

To build and publish distribution artifacts, also install the packaging tools:

pip install build twine

3. Fetch and Prepare the Dataset

This command downloads suspicious payloads from PayloadsAllTheThings, generates normal HTTP-like samples, and writes the combined processed dataset to data/processed/.

python cli.py fetch

To force a fresh download from the upstream payload sources:

python cli.py fetch --force-refresh
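
As a rough picture of what "generate normal samples and combine" involves, the sketch below labels and merges two lists of strings. The templates, the CSV layout with text and label columns, and every function name here are hypothetical; the README does not document the processed file format.

```python
import csv
import io
import random

# Hypothetical templates for benign HTTP-like values.
NORMAL_TEMPLATES = ["/api/v1/users/{n}", "/static/img/{n}.png", "page={n}", "{n}"]


def generate_normal_samples(count: int, seed: int = 0) -> list[str]:
    """Produce reproducible benign-looking strings from the templates."""
    rng = random.Random(seed)
    return [rng.choice(NORMAL_TEMPLATES).format(n=rng.randint(1, 9999)) for _ in range(count)]


def build_dataset(suspicious: list[str], normal: list[str]) -> list[tuple[str, int]]:
    """Label, de-duplicate, and combine. Label 1 = suspicious, 0 = normal."""
    rows: dict[str, int] = {}
    for text in normal:
        rows[text] = 0
    for text in suspicious:
        rows[text] = 1  # assumption: a string seen in both sets keeps the suspicious label
    return sorted(rows.items())


def to_csv(rows: list[tuple[str, int]]) -> str:
    """Serialize rows with a header; csv handles quoting of payload characters."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["text", "label"])
    writer.writerows(rows)
    return buf.getvalue()
```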

4. Train the Model

This command:

loads the processed dataset
trains the RandomForest-based pipeline
saves the serialized .pkl model artifact to models/
prints Precision, Recall, F1-Score, and the recall-oriented threshold

python cli.py train

To rebuild the dataset before training:

python cli.py train --force-refresh

Expected artifacts:

models/gatekeeper_payload_classifier.pkl
models/training_metrics.json
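
The README does not say how the recall-oriented threshold is chosen. One common approach, sketched here in plain Python, is to take the highest score cutoff that still reaches a target recall on held-out data. The function names and the 0.95 default are assumptions; the scores would be the model's predicted probability for class 1.

```python
def prf(y_true, y_pred):
    """Precision, recall, and F1 for the positive (suspicious) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def recall_oriented_threshold(y_true, scores, target_recall=0.95):
    """Highest threshold whose recall still meets the target.

    Recall can only grow as the threshold drops, so scanning candidate
    thresholds from high to low and stopping at the first hit is enough.
    """
    for threshold in sorted(set(scores), reverse=True):
        y_pred = [1 if s >= threshold else 0 for s in scores]
        if prf(y_true, y_pred)[1] >= target_recall:
            return threshold
    return 0.0  # no positives in y_true, so no threshold can reach the target
```

Lowering the threshold this way trades precision for recall, which matches the stated role of the classifier as a screen in front of deeper analysis.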

5. Run Predictions

Pass one or more strings directly on the command line:

python cli.py predict "/api/v1/users/42" "<script>alert(1)</script>" "' OR 1=1 --"

Example output:

Prediction results
0    /api/v1/users/42
1    <script>alert(1)</script>
1    ' OR 1=1 --
Total inference time: 4.812 ms
Average per input:    1.604 ms
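
The two timing lines are related by simple division: 4.812 ms over 3 inputs is 1.604 ms per input. A minimal sketch of measuring a batch call this way is shown below; timed_batch is a hypothetical helper and accepts any callable that maps a list of strings to a list of labels.

```python
import time
from typing import Callable, Sequence


def timed_batch(predict_batch: Callable[[Sequence[str]], Sequence[int]], inputs: Sequence[str]):
    """Run one batch call and return (predictions, total_ms, average_ms)."""
    start = time.perf_counter()  # monotonic, high-resolution clock
    predictions = predict_batch(inputs)
    total_ms = (time.perf_counter() - start) * 1000.0
    average_ms = total_ms / len(inputs) if inputs else 0.0
    return predictions, total_ms, average_ms
```

Timing the whole batch once, instead of each input separately, keeps the measurement overhead out of the per-input figure.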

To load a model from a custom path:

python cli.py predict --model-path models/gatekeeper_payload_classifier.pkl "12345" "../../etc/passwd"

6. Use as a Python Library

After pip install -e ., you can import the classifier directly:

from gatekeeper_ml import PayloadClassifier

# Option 1: Default load
clf = PayloadClassifier()

# Option 2: Custom path load
clf = PayloadClassifier(model_path="path/to/your_model.pkl")

predictions = clf.predict_batch([
    "/api/v1/users/42",
    "<script>alert(1)</script>",
])
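
A typical next step is to separate the inputs by predicted label. The helper below is a sketch under one assumption: predict_batch returns one 0/1 label per input, in input order, as the CLI output suggests. split_by_label and the BatchClassifier protocol are hypothetical names, not part of the package.

```python
from typing import Iterable, Protocol, Sequence


class BatchClassifier(Protocol):
    """Anything with a predict_batch method, e.g. PayloadClassifier."""

    def predict_batch(self, values: Sequence[str]) -> Sequence[int]: ...


def split_by_label(clf: BatchClassifier, values: Iterable[str]) -> tuple[list[str], list[str]]:
    """Return (normal, suspicious) lists using a single batch call."""
    items = list(values)
    labels = clf.predict_batch(items)
    normal = [v for v, label in zip(items, labels) if int(label) == 0]
    suspicious = [v for v, label in zip(items, labels) if int(label) == 1]
    return normal, suspicious
```

With the package installed, split_by_label(PayloadClassifier(), log_values) would hand back the values that still need downstream analysis in the second list.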

7. Build the Package

Before building a release, update the version in pyproject.toml.

Build both the source distribution and wheel:

python -m build

This creates the release artifacts in dist/:

dist/gatekeeper_ml-<version>.tar.gz
dist/gatekeeper_ml-<version>-py3-none-any.whl

Validate the generated metadata before upload:

python -m twine check dist/*
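
Before uploading, it can help to confirm that dist/ really contains both artifacts for the version you intend to release. The helper below is a hypothetical sketch that only checks the two file names listed above.

```python
from pathlib import Path


def find_release_artifacts(dist_dir: str, version: str) -> dict[str, Path]:
    """Return the sdist and wheel paths, or raise if either is missing."""
    dist = Path(dist_dir)
    expected = {
        "sdist": dist / f"gatekeeper_ml-{version}.tar.gz",
        "wheel": dist / f"gatekeeper_ml-{version}-py3-none-any.whl",
    }
    missing = [name for name, path in expected.items() if not path.is_file()]
    if missing:
        raise FileNotFoundError(f"missing in {dist}: {', '.join(missing)}")
    return expected
```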

8. Publish to TestPyPI

TestPyPI is the safest way to verify packaging before a real release.

Create a TestPyPI API token, then export it:

Windows PowerShell

$env:TWINE_USERNAME="__token__"
$env:TWINE_PASSWORD="pypi-..."

macOS / Linux

export TWINE_USERNAME="__token__"
export TWINE_PASSWORD="pypi-..."

Upload to TestPyPI:

python -m twine upload --repository testpypi dist/*

You can then verify installation from TestPyPI:

pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple gatekeeper-ml

9. Publish to PyPI

After the TestPyPI release looks good, upload the same dist/ artifacts to PyPI:

python -m twine upload dist/*

Recommended release flow:

Update the version in pyproject.toml.
Remove old build artifacts if needed: rm -rf dist build *.egg-info (in PowerShell: Remove-Item -Recurse -Force dist, build, *.egg-info -ErrorAction SilentlyContinue).
Run python -m build.
Run python -m twine check dist/*.
Publish to TestPyPI.
Install and smoke-test from TestPyPI.
Publish to PyPI.
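
The build, check, and upload steps of this flow can be scripted. The sketch below uses only the commands shown in this README; release_commands and build_and_publish are hypothetical helper names, and credentials are assumed to come from the TWINE_USERNAME and TWINE_PASSWORD environment variables set earlier.

```python
import glob
import subprocess
import sys
from typing import List, Optional


def release_commands(files: List[str], repository: Optional[str] = None) -> List[List[str]]:
    """Build the twine argv lists for checking and uploading the given files."""
    upload = [sys.executable, "-m", "twine", "upload"]
    if repository:
        upload += ["--repository", repository]
    return [
        [sys.executable, "-m", "twine", "check", *files],
        upload + files,
    ]


def build_and_publish(repository: Optional[str] = "testpypi") -> None:
    """Run build, check, and upload in order, stopping at the first failure."""
    subprocess.run([sys.executable, "-m", "build"], check=True)
    # No shell is involved, so the dist/* wildcard has to be expanded here.
    files = sorted(glob.glob("dist/*"))
    for argv in release_commands(files, repository):
        subprocess.run(argv, check=True)
```

Calling build_and_publish() targets TestPyPI; passing repository=None omits the --repository flag, so twine uploads to PyPI as in step 9. Cleaning dist/ first matters here, because every file matched by dist/* is uploaded.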

10. Trusted Publishing

For CI/CD releases, PyPI now supports Trusted Publishing via OIDC, which avoids storing long-lived API tokens in your repository secrets. If you publish from GitHub Actions or another supported CI provider, this is the recommended long-term setup.

References:

PyPI Trusted Publishing docs: https://docs.pypi.org/trusted-publishers/using-a-publisher/
Twine project page: https://pypi.org/project/twine/

Notes

The data loader prefers live GitHub payload sources but ships with a built-in bootstrap set of suspicious payloads, so the pipeline remains usable when remote fetches fail.
The predict command loads the model once and evaluates all inputs in a single batch to keep overhead low.
Fast-path safe heuristics are intentionally conservative to preserve recall.
A dedicated module usage guide is available at docs/USAGE.md.
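
The fallback behaviour described in the first note can be sketched as below. The bootstrap list, the function name, and the injected fetch callable are assumptions for illustration; the real logic lives in data_loader.py.

```python
from typing import Callable, Iterable, List

# Hypothetical built-in samples used when nothing can be downloaded.
BOOTSTRAP_SUSPICIOUS = ["<script>alert(1)</script>", "' OR 1=1 --", "../../etc/passwd"]


def load_suspicious(urls: Iterable[str], fetch: Callable[[str], str]) -> List[str]:
    """Collect payload lines from each URL; fall back to the bootstrap set."""
    payloads: List[str] = []
    for url in urls:
        try:
            text = fetch(url)
        except OSError:
            # urllib.error.URLError and socket timeouts are OSError subclasses,
            # so one failed source does not abort the whole fetch.
            continue
        payloads.extend(line.strip() for line in text.splitlines() if line.strip())
    return payloads or list(BOOTSTRAP_SUSPICIOUS)
```

Passing fetch as a parameter keeps the sketch testable offline; in practice it could wrap urllib.request.urlopen with a timeout and decode the response body.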

Release history

0.1.7 (Apr 19, 2026)
0.1.6 (Apr 19, 2026)
0.1.5 (Apr 19, 2026)
0.1.4 (Apr 19, 2026), this version
0.1.3 (Apr 19, 2026)
0.1.2 (Apr 19, 2026)
0.1.1 (Apr 18, 2026)
0.1.0 (Apr 18, 2026)

Download files

Source Distribution

gatekeeper_ml-0.1.4.tar.gz (2.5 MB)

Uploaded Apr 19, 2026 Source

Built Distribution

gatekeeper_ml-0.1.4-py3-none-any.whl (2.5 MB)

Uploaded Apr 19, 2026 Python 3

File details

Details for the file gatekeeper_ml-0.1.4.tar.gz.

File metadata

Download URL: gatekeeper_ml-0.1.4.tar.gz
Upload date: Apr 19, 2026
Size: 2.5 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for gatekeeper_ml-0.1.4.tar.gz
Algorithm	Hash digest
SHA256	`7c28a47ecea451d8a3cf4bfbc068d0d5caabbb93ca5d9e52312ea4fdc6817455`
MD5	`345a2426aac536088c77908cc22e29fb`
BLAKE2b-256	`7d91cf75a00c14429cdde79ab58ee8eff6241e651c19c58040ef52efc7f3bb21`

File details

Details for the file gatekeeper_ml-0.1.4-py3-none-any.whl.

File metadata

Download URL: gatekeeper_ml-0.1.4-py3-none-any.whl
Upload date: Apr 19, 2026
Size: 2.5 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for gatekeeper_ml-0.1.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`24b3f161f5ec3fb3a5fc7cfe38074f414e0ea02c6759d1f957fd0ce97b86c9b7`
MD5	`4ea6f006852a53c9d2754fc98894ea88`
BLAKE2b-256	`f737cff78669e122da6c9cce9c7a3bc2f397880e41c75b563c6fd5c7ce143ade`
