defenx-nlp
Semantic NLP Intelligence Toolkit
A domain-agnostic library for semantic sentence encoding, embedding generation, GPU/CPU-aware device handling, and reusable inference interfaces.
Overview
defenx-nlp is a standalone, pip-installable semantic NLP library. It is designed to be domain-agnostic so
the same encoder that understands human chat intent can be repurposed for:
| Use case | What you embed |
|---|---|
| NLP classification | User sentences → intent labels |
| Anomaly detection | System log lines → outlier scores |
| Log intelligence | Server events → semantic clusters |
| Behavioural analytics | User actions → behavioural patterns |
| Semantic search | Documents → retrieval ranking |
Installation
Standard (CPU)
pip install defenx-nlp
With CUDA 12 (RTX 30/40 series, recommended)
pip install defenx-nlp
pip install torch --index-url https://download.pytorch.org/whl/cu128
Development install (editable + test tools)
git clone https://github.com/defenx-sec/defenx-nlp.git
cd defenx-nlp
pip install -e ".[dev]"
Quick Start
from defenx_nlp import SemanticEncoder
# Auto-detects CUDA — falls back to CPU silently
enc = SemanticEncoder()
# Encode a single sentence → (384,) float32 numpy array
embedding = enc.encode("Neural networks are universal approximators.")
print(embedding.shape) # (384,)
print(embedding.dtype) # float32
# Batch encode — much faster than looping
embeddings = enc.encode_batch(["Hello", "Goodbye", "Help me please"])
print(embeddings.shape) # (3, 384)
Semantic similarity
from defenx_nlp import SemanticEncoder, cosine_similarity
enc = SemanticEncoder()
e1 = enc.encode("I love machine learning")
e2 = enc.encode("I enjoy deep learning")
sim = cosine_similarity(e1, e2)
print(f"Similarity: {sim:.3f}") # ~0.87
Top-k retrieval
from defenx_nlp import SemanticEncoder, top_k_similar
enc = SemanticEncoder()
corpus = ["Help me", "Goodbye", "Great job!", "What is AI?"]
query = "Can you assist me?"
c_embs = [enc.encode(t) for t in corpus]
q_emb = enc.encode(query)
results = top_k_similar(q_emb, c_embs, k=1)
print(corpus[results[0][0]]) # "Help me"
Text preprocessing
from defenx_nlp import clean_text, batch_clean
text = clean_text(" HELLO WORLD! ", lowercase=True)
# → "hello world!"
texts = batch_clean([" A ", " B "], lowercase=True)
# → ["a", "b"]
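For reference, a plain-Python sketch of the kind of normalisation a cleaner like clean_text performs (illustrative only; the library's actual option names and behaviour are documented in docs/api_reference.md):

```python
import re

def clean_text_sketch(text: str, lowercase: bool = False) -> str:
    """Illustrative cleaner: trim edges, collapse internal whitespace,
    optionally lowercase. Not the library's implementation."""
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower() if lowercase else text

print(clean_text_sketch("  HELLO   WORLD! ", lowercase=True))  # hello world!
```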
CUDA warmup (for production services)
enc = SemanticEncoder(lazy=False)
enc.warmup() # initialise CuDNN kernels at startup, not first request
API Summary
| Symbol | Description |
|---|---|
| SemanticEncoder | Main encoder class (lazy, thread-safe, CUDA-aware) |
| BaseEncoder | Abstract base for custom encoder backends |
| BaseInferenceEngine | Abstract base for downstream classifiers |
| get_device(preferred) | Resolve "auto"/"cuda"/"cpu"/"mps" to a torch.device |
| device_info() | Hardware diagnostic dictionary |
| clean_text(text, **opts) | Configurable single-text cleaner |
| batch_clean(texts, **opts) | Apply clean_text to a list |
| truncate(text, max_chars) | Hard-truncate with optional ellipsis |
| cosine_similarity(a, b) | Scalar cosine similarity in [-1, 1] |
| batch_cosine_similarity(q, M) | Vectorised query-vs-matrix similarity, returns (N,) |
| top_k_similar(q, corpus, k) | Top-k retrieval, returns [(idx, score)] |
| normalize_embedding(v) | L2-normalise a single embedding |
| normalize_batch(M) | Row-wise L2-normalise an (N, D) matrix |
Full API docs: docs/api_reference.md
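The similarity helpers above are standard vector operations. A minimal NumPy sketch of the same maths (illustrative, not the packaged implementations):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Scalar cosine similarity in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k_similar(q: np.ndarray, corpus, k: int = 3):
    """Return [(index, score)] for the k most similar corpus vectors."""
    scores = [cosine_similarity(q, c) for c in corpus]
    order = np.argsort(scores)[::-1][:k]  # highest score first
    return [(int(i), scores[i]) for i in order]

q = np.array([1.0, 0.0])
corpus = [np.array([1.0, 0.1]), np.array([0.0, 1.0])]
print(top_k_similar(q, corpus, k=1))  # nearest vector is index 0
```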
Hardware Requirements
Minimum
| Component | Requirement |
|---|---|
| CPU | Dual-core, 64-bit |
| RAM | 4 GB |
| Disk | 500 MB (model cache) |
| GPU | None (CPU mode) |
| Python | 3.9+ |
Recommended
| Component | Requirement |
|---|---|
| CPU | 6+ cores (AMD Ryzen 7 / Intel Core i7+) |
| RAM | 16 GB |
| GPU | NVIDIA RTX 20-series or newer |
| VRAM | 4+ GB |
| CUDA | 11.8 or 12.x |
| Python | 3.11+ |
Tested on: AMD Ryzen 7 4800H + NVIDIA RTX 3050 6 GB (CUDA 12.8) on Kali Linux (WSL2). Average inference latency: ~15 ms/sentence on CUDA, ~80 ms on CPU.
Supported Operating Systems
| OS | CPU mode | CUDA mode | Notes |
|---|---|---|---|
| Linux (Ubuntu 20.04+, Debian 11+, Kali) | ✅ | ✅ | Fully tested |
| Windows 10 / 11 | ✅ | ✅ | Use WSL2 for CUDA |
| macOS 12+ (Intel) | ✅ | — | No NVIDIA CUDA support |
| macOS 12+ (Apple Silicon M1/M2/M3) | ✅ | MPS | Use device="mps" |
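Putting the table together, resolution under "auto" presumably prefers CUDA, then MPS, then CPU. A simplified, torch-free sketch of that policy (the real get_device returns a torch.device and calls torch.cuda.is_available() / torch.backends.mps.is_available(); the availability flags here are injected so the snippet runs anywhere):

```python
def resolve_device(preferred: str = "auto",
                   cuda_ok: bool = False,
                   mps_ok: bool = False) -> str:
    """Pick a device string following a CUDA > MPS > CPU preference.
    Flags are injected for portability; the library would query torch."""
    if preferred != "auto":
        return preferred  # an explicit "cuda"/"cpu"/"mps" always wins
    if cuda_ok:
        return "cuda"
    if mps_ok:
        return "mps"
    return "cpu"

print(resolve_device("auto", cuda_ok=True))  # cuda
print(resolve_device("auto", mps_ok=True))   # mps
print(resolve_device("auto"))                # cpu
```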
Extending the Library
Custom encoder backend
import numpy as np
import torch
from defenx_nlp import BaseEncoder
class OpenAIEncoder(BaseEncoder):
    """Drop-in encoder using the OpenAI embeddings API."""

    def __init__(self, api_key: str):
        import openai
        # Pass the key explicitly: openai>=1.0 clients read the
        # constructor argument or OPENAI_API_KEY, not openai.api_key.
        self._client = openai.OpenAI(api_key=api_key)

    def encode(self, text: str) -> np.ndarray:
        resp = self._client.embeddings.create(
            model="text-embedding-3-small", input=text
        )
        return np.array(resp.data[0].embedding, dtype=np.float32)

    def encode_batch(self, texts) -> np.ndarray:
        resp = self._client.embeddings.create(
            model="text-embedding-3-small", input=texts
        )
        return np.array([d.embedding for d in resp.data], dtype=np.float32)

    @property
    def embedding_dim(self) -> int:
        return 1536  # text-embedding-3-small output dimension

    @property
    def device(self) -> torch.device:
        return torch.device("cpu")  # remote API, no local device
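To exercise the same contract without network access, a hypothetical offline stub (not part of the library) can produce deterministic pseudo-embeddings; in a real project it would subclass BaseEncoder like the example above:

```python
import hashlib
import numpy as np

class HashStubEncoder:
    """Hypothetical offline stand-in: deterministic pseudo-embeddings
    derived from a SHA-256 digest of the text. Handy in unit tests
    where a real model (or an API key) is unavailable."""

    embedding_dim = 32  # SHA-256 digest length in bytes

    def encode(self, text: str) -> np.ndarray:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        vec = np.frombuffer(digest, dtype=np.uint8).astype(np.float32)
        return vec / np.linalg.norm(vec)  # L2-normalised, shape (32,)

    def encode_batch(self, texts) -> np.ndarray:
        return np.stack([self.encode(t) for t in texts])

enc = HashStubEncoder()
print(enc.encode("hello").shape)           # (32,)
print(enc.encode_batch(["a", "b"]).shape)  # (2, 32)
```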
Running Tests
# Install dev extras first
pip install -e ".[dev]"
# Run all tests
pytest tests/ -v
# With coverage
pytest tests/ -v --cov=defenx_nlp --cov-report=term-missing
Expected output:
tests/test_encoder.py::TestSemanticEncoder::test_encode_shape PASSED
tests/test_encoder.py::TestSemanticEncoder::test_embedding_dim_property PASSED
...
13 passed in 42.3s
Running Examples
# Basic single-sentence usage + similarity + retrieval
python examples/basic_usage.py
# Batch throughput benchmark + similarity matrix
python examples/batch_encoding.py
Publishing to PyPI
1. Build the distribution
pip install build twine
python -m build
# Creates dist/defenx_nlp-<version>.tar.gz and dist/defenx_nlp-<version>-py3-none-any.whl
2. Test on TestPyPI first (always)
twine upload --repository testpypi dist/*
pip install --index-url https://test.pypi.org/simple/ defenx-nlp
3. Publish to real PyPI
twine upload dist/*
4. Verify the install
pip install defenx-nlp
python -c "from defenx_nlp import SemanticEncoder; print(SemanticEncoder())"
Versioning
Update version in pyproject.toml before each release.
Follow Semantic Versioning: MAJOR.MINOR.PATCH.
Project Structure
defenx-nlp/
├── defenx_nlp/
│ ├── __init__.py Public API surface — all exports live here
│ ├── encoder.py SemanticEncoder — lazy, thread-safe, CUDA-aware
│ ├── device.py get_device() and device_info() helpers
│ ├── preprocessing.py clean_text, batch_clean, truncate, deduplicate
│ ├── interfaces.py BaseEncoder and BaseInferenceEngine ABCs
│ └── utils.py cosine_similarity, top_k_similar, normalize_*
│
├── tests/
│ └── test_encoder.py pytest suite — encoder, device, preprocessing, utils
│
├── examples/
│ ├── basic_usage.py Single-sentence encode, similarity, retrieval
│ └── batch_encoding.py Throughput benchmark, similarity matrix
│
├── docs/
│ └── api_reference.md Full API documentation
│
├── README.md This file
├── pyproject.toml PEP 621 package metadata + build config
└── LICENSE MIT
License
MIT — see LICENSE.
Acknowledgements
Built on top of:
- sentence-transformers by UKPLab
- PyTorch by Meta AI
- all-MiniLM-L6-v2, a sentence-transformers model built on Microsoft's MiniLM
File details
Details for the file defenx_nlp-0.2.0.tar.gz.
File metadata
- Download URL: defenx_nlp-0.2.0.tar.gz
- Size: 20.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2b6bc6c0b9cd12a7246aa2b8f3322be5a0f71de72babe8020124bd9598a34d5c |
| MD5 | df03bc88634697066406c688cfe895be |
| BLAKE2b-256 | e6ce81cb46c3d0efd7f60176da547301359e94d29242a298d9fd87086db183a0 |
Provenance
The following attestation bundles were made for defenx_nlp-0.2.0.tar.gz:
Publisher: publish.yml on defenx-sec/defenx-nlp
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: defenx_nlp-0.2.0.tar.gz
- Subject digest: 2b6bc6c0b9cd12a7246aa2b8f3322be5a0f71de72babe8020124bd9598a34d5c
- Sigstore transparency entry: 975691177
- Permalink: defenx-sec/defenx-nlp@caef398973c7bf0f7f149556b979d623eacb5ac2
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/defenx-sec
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@caef398973c7bf0f7f149556b979d623eacb5ac2
- Trigger Event: push
File details
Details for the file defenx_nlp-0.2.0-py3-none-any.whl.
File metadata
- Download URL: defenx_nlp-0.2.0-py3-none-any.whl
- Size: 17.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | da2afd74afaa3686d25b1e86078f5daca6e1e60546ab3cb27ddf248804bbcaf0 |
| MD5 | 02f35079e22604cbd8526c7be30a5189 |
| BLAKE2b-256 | 26766f8f14016906fa647608ac3f44332662ec8cb9ea89d80fd132df4e15b84c |
Provenance
The following attestation bundles were made for defenx_nlp-0.2.0-py3-none-any.whl:
Publisher: publish.yml on defenx-sec/defenx-nlp
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: defenx_nlp-0.2.0-py3-none-any.whl
- Subject digest: da2afd74afaa3686d25b1e86078f5daca6e1e60546ab3cb27ddf248804bbcaf0
- Sigstore transparency entry: 975691178
- Permalink: defenx-sec/defenx-nlp@caef398973c7bf0f7f149556b979d623eacb5ac2
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/defenx-sec
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@caef398973c7bf0f7f149556b979d623eacb5ac2
- Trigger Event: push