Lightweight payload classifier for HTTP log gatekeeping.
Gatekeeper ML
Gatekeeper ML is a lightweight binary classifier for HTTP log components. It uses character-level TF-IDF plus handcrafted text statistics to classify raw strings as either:
- 0 for normal
- 1 for suspicious
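As a rough illustration of what "handcrafted text statistics" can look like, the sketch below computes a few cheap per-string features. The feature names, the `SPECIAL_CHARS` set, and both function names are assumptions made for this example; the project's actual features live in features.py and may differ. The TF-IDF part of the pipeline is not shown.

```python
import math
from collections import Counter

# Hypothetical character set for this sketch; the real feature code may differ.
SPECIAL_CHARS = set("<>'\"();=&|%$`{}[]\\")


def shannon_entropy(text: str) -> float:
    """Return the Shannon entropy of the character distribution, in bits."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def text_statistics(text: str) -> dict[str, float]:
    """Compute a few length-normalised statistics for one raw string."""
    length = len(text)
    denom = max(length, 1)  # avoid division by zero on empty input
    return {
        "length": float(length),
        "digit_ratio": sum(c.isdigit() for c in text) / denom,
        "special_ratio": sum(c in SPECIAL_CHARS for c in text) / denom,
        "entropy": shannon_entropy(text),
    }
```

Statistics like these are inexpensive to compute and can be concatenated with the TF-IDF vector before classification.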
The design goal is low-latency screening. Safe-looking values can bypass the ML model entirely through a conservative regex fast path, and only suspicious candidates need deeper downstream analysis.
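A conservative fast path of this kind might look like the following sketch. The `SAFE_VALUE` pattern, the 64-character limit, and the function names are assumptions for illustration, not the project's actual rules.

```python
import re

# Hypothetical allow-list: short values made only of letters, digits, and a
# few path/identifier characters. The project's real pattern may differ.
SAFE_VALUE = re.compile(r"[A-Za-z0-9/_.\-]{1,64}")


def is_fast_path_safe(value: str) -> bool:
    """Return True only when the value is trivially benign under the allow-list."""
    # ".." is rejected explicitly because the allow-list admits dots and
    # slashes, which would otherwise let path traversal through.
    return ".." not in value and SAFE_VALUE.fullmatch(value) is not None


def screen(value: str, classify) -> int:
    """Return 0 via the fast path, otherwise defer to the supplied model callable."""
    if is_fast_path_safe(value):
        return 0
    return classify(value)
```

An allow-list errs toward sending values to the model: anything it does not recognise is classified normally, so the fast path can only skip work, never hide a suspicious string that falls outside the pattern.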
Project Structure
gatekeeper-ml/
├── cli.py
├── data/
│   ├── raw/
│   └── processed/
├── models/
├── pyproject.toml
├── src/
│   └── gatekeeper_ml/
│       ├── __init__.py
│       ├── config.py
│       ├── data_loader.py
│       ├── features.py
│       ├── predict.py
│       ├── train.py
│       └── models/
│           └── payload_classifier.pkl
├── README.md
├── MANIFEST.in
└── requirements.txt
1. Create a Virtual Environment
From the gatekeeper-ml directory:
Windows PowerShell
python -m venv .venv
.venv\Scripts\Activate.ps1
macOS / Linux
python3 -m venv .venv
source .venv/bin/activate
2. Install Dependencies
pip install --upgrade pip
pip install -r requirements.txt
For local development as an installable package:
pip install -e .
To build and publish distribution artifacts, also install the packaging tools:
pip install build twine
3. Fetch and Prepare the Dataset
This command downloads suspicious payloads from PayloadsAllTheThings, generates normal HTTP-like samples, and writes the combined processed dataset to data/processed/.
python cli.py fetch
To force a fresh download from the upstream payload sources:
python cli.py fetch --force-refresh
4. Train the Model
This command:
- loads the processed dataset
- trains the RandomForest-based pipeline
- saves the serialized .pkl model artifact to models/
- prints Precision, Recall, F1-Score, and the recall-oriented threshold
python cli.py train
To rebuild the dataset before training:
python cli.py train --force-refresh
Expected artifacts:
- models/gatekeeper_payload_classifier.pkl
- models/training_metrics.json
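One common way to pick a recall-oriented threshold is to take the highest score cutoff that still catches a target fraction of the suspicious samples. The sketch below shows that idea on plain Python lists. It is an assumption about the general technique, not a description of what train.py does, and the function names and the `min_recall` parameter are hypothetical.

```python
import math


def recall_oriented_threshold(scores, labels, min_recall=0.99):
    """Return the highest threshold whose recall is at least min_recall.

    scores: predicted probability of the suspicious class (label 1).
    labels: ground-truth 0/1 labels, same length as scores.
    A sample is predicted suspicious when score >= threshold.
    """
    positives = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    if not positives:
        raise ValueError("need at least one positive sample")
    # Number of positives that must stay at or above the threshold.
    needed = max(math.ceil(min_recall * len(positives)), 1)
    return positives[needed - 1]


def precision_recall_f1(scores, labels, threshold):
    """Compute precision, recall, and F1 for the suspicious class."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Lowering the threshold trades precision for recall, which suits a gatekeeper whose false positives are only forwarded to deeper analysis.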
5. Run Predictions
Pass one or more strings directly on the command line:
python cli.py predict "/api/v1/users/42" "<script>alert(1)</script>" "' OR 1=1 --"
Example output:
Prediction results
0 /api/v1/users/42
1 <script>alert(1)</script>
1 ' OR 1=1 --
Total inference time: 4.812 ms
Average per input: 1.604 ms
To point at a custom trained model:
python cli.py predict --model-path models/gatekeeper_payload_classifier.pkl "12345" "../../etc/passwd"
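The two timing lines in the example output can be produced by timing a single batch call and dividing by the number of inputs. The sketch below assumes this is roughly how the report is computed. `timed_batch` and `format_report` are hypothetical helpers, and `predict_batch` is any callable that returns one label per input.

```python
import time


def timed_batch(predict_batch, inputs):
    """Run one batch prediction and return (labels, total_ms, avg_ms)."""
    start = time.perf_counter()
    labels = predict_batch(inputs)
    total_ms = (time.perf_counter() - start) * 1000.0
    avg_ms = total_ms / len(inputs) if inputs else 0.0
    return labels, total_ms, avg_ms


def format_report(inputs, labels, total_ms, avg_ms) -> str:
    """Render labels and timings in the same shape as the CLI example output."""
    lines = ["Prediction results"]
    lines += [f"{label} {text}" for label, text in zip(labels, inputs)]
    lines.append(f"Total inference time: {total_ms:.3f} ms")
    lines.append(f"Average per input: {avg_ms:.3f} ms")
    return "\n".join(lines)
```

Timing the whole batch, rather than each input separately, keeps the measurement overhead out of the per-input average.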
6. Use as a Python Library
After pip install -e ., you can import the classifier directly:
from gatekeeper_ml import PayloadClassifier
# Option 1: Default load
clf = PayloadClassifier()
# Option 2: Custom path load
clf = PayloadClassifier(model_path="path/to/your_model.pkl")
predictions = clf.predict_batch([
    "/api/v1/users/42",
    "<script>alert(1)</script>",
])
7. Build the Package
Before building a release, update the version in pyproject.toml.
Build both the source distribution and wheel:
python -m build
This creates the release artifacts in dist/:
- dist/gatekeeper_ml-<version>.tar.gz
- dist/gatekeeper_ml-<version>-py3-none-any.whl
Validate the generated metadata before upload:
python -m twine check dist/*
8. Publish to TestPyPI
TestPyPI is the safest way to verify packaging before a real release.
Create a TestPyPI API token, then export it:
Windows PowerShell
$env:TWINE_USERNAME="__token__"
$env:TWINE_PASSWORD="pypi-..."
macOS / Linux
export TWINE_USERNAME="__token__"
export TWINE_PASSWORD="pypi-..."
Upload to TestPyPI:
python -m twine upload --repository testpypi dist/*
You can then verify installation from TestPyPI:
pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple gatekeeper-ml
9. Publish to PyPI
After the TestPyPI release looks good, upload the same dist/ artifacts to PyPI:
python -m twine upload dist/*
Recommended release flow:
- Update the version in pyproject.toml.
- Remove old build artifacts if needed: rm -rf dist build *.egg-info or the Windows equivalent.
- Run python -m build.
- Run python -m twine check dist/*.
- Publish to TestPyPI.
- Install and smoke-test from TestPyPI.
- Publish to PyPI.
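The artifact cleanup step can be done the same way on every platform with a few lines of Python in place of rm -rf. This is a minimal sketch; `clean_build_artifacts` is a hypothetical helper, not part of the package.

```python
import shutil
from pathlib import Path


def clean_build_artifacts(root: str = ".") -> list[str]:
    """Delete dist, build, and any *.egg-info entries directly under root.

    Returns the sorted names of the entries that were removed.
    """
    base = Path(root)
    targets = [base / "dist", base / "build", *base.glob("*.egg-info")]
    removed = []
    for path in targets:
        if path.is_dir():
            shutil.rmtree(path)
        elif path.exists():
            path.unlink()
        else:
            continue  # nothing to remove for this target
        removed.append(path.name)
    return sorted(removed)
```

Running it from the project root before python -m build avoids uploading stale files left in dist/ by an earlier version.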
10. Trusted Publishing
For CI/CD releases, PyPI now supports Trusted Publishing via OIDC, which avoids storing long-lived API tokens in your repository secrets. If you publish from GitHub Actions or another supported CI provider, this is the recommended long-term setup.
References:
- PyPI Trusted Publishing docs: https://docs.pypi.org/trusted-publishers/using-a-publisher/
- Twine project page: https://pypi.org/project/twine/
Notes
- The data loader prefers live GitHub payload sources but includes a bootstrap suspicious set so the pipeline remains usable when remote fetches fail.
- The predict command loads the model once and evaluates inputs in batch form for low overhead.
- Fast-path safe heuristics are intentionally conservative to preserve recall.
- A dedicated module usage guide is available at docs/USAGE.md.
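The bootstrap fallback mentioned in the first note could be structured roughly as follows. This is a sketch under stated assumptions: `BOOTSTRAP_SUSPICIOUS`, `load_suspicious`, and the injected `fetch_remote` callable are hypothetical, and the project's real built-in set and loader are not shown here.

```python
from typing import Callable, Iterable

# Hypothetical bootstrap samples, taken from the examples in this README.
BOOTSTRAP_SUSPICIOUS = [
    "<script>alert(1)</script>",
    "' OR 1=1 --",
    "../../etc/passwd",
]


def load_suspicious(fetch_remote: Callable[[], Iterable[str]]) -> list[str]:
    """Prefer remotely fetched payloads; fall back to the bootstrap set."""
    try:
        payloads = [p.strip() for p in fetch_remote() if p.strip()]
    except OSError:
        # urllib.error.URLError and socket timeouts are OSError subclasses.
        payloads = []
    # An empty remote result is treated the same as a failed fetch.
    return payloads or list(BOOTSTRAP_SUSPICIOUS)
```

Passing the fetcher in as a callable keeps the fallback logic testable without network access.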
File details
Details for the file gatekeeper_ml-0.1.3.tar.gz.
File metadata
- Download URL: gatekeeper_ml-0.1.3.tar.gz
- Upload date:
- Size: 2.5 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b5e467e35b4d197dc86c038c4aba0b4d0a13ca485d052c138a05847afbfe2629 |
| MD5 | b9cd08cff32f4ec3c818b14c94b26e38 |
| BLAKE2b-256 | d1272bdd1112a97696a157e8a37bd55ba8e5f881dd25d4c658953cfbf8bdfb75 |
File details
Details for the file gatekeeper_ml-0.1.3-py3-none-any.whl.
File metadata
- Download URL: gatekeeper_ml-0.1.3-py3-none-any.whl
- Upload date:
- Size: 2.5 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7af49d9f575e35d1a0a739873b3ca3e29f2254dbcfef57a7868300b729c71f36 |
| MD5 | 581ce10dd754f8556129e52ebe5c325f |
| BLAKE2b-256 | cfb69bc22e7531651edb931399a704eda1359cbb463dade23d679acd08e8ae03 |