
defenx-nlp

Semantic NLP Intelligence Toolkit

Extracted from DEFENX NeuroSight — Artificial Brain Cognitive Simulator. A domain-agnostic library for semantic sentence encoding, embedding generation, GPU/CPU-aware device handling, and reusable inference interfaces.

Overview

defenx-nlp provides the intelligence layer that powers DEFENX NeuroSight, packaged as a standalone, pip-installable library. It is designed to be domain-agnostic so the same encoder that understands human chat intent can be repurposed for:

| Use case | What you embed |
|---|---|
| NLP classification | User sentences → intent labels |
| Anomaly detection | System log lines → outlier scores |
| Log intelligence | Server events → semantic clusters |
| Behavioural analytics | User actions → behavioural patterns |
| Semantic search | Documents → retrieval ranking |

Installation

Standard (CPU)

pip install defenx-nlp

With CUDA 12 (RTX 30/40 series, recommended)

pip install defenx-nlp
pip install torch --index-url https://download.pytorch.org/whl/cu128

Development install (editable + test tools)

git clone https://github.com/defenx-sec/defenx-nlp.git
cd defenx-nlp
pip install -e ".[dev]"

Quick Start

from defenx_nlp import SemanticEncoder

# Auto-detects CUDA — falls back to CPU silently
enc = SemanticEncoder()

# Encode a single sentence → (384,) float32 numpy array
embedding = enc.encode("Neural networks are universal approximators.")
print(embedding.shape)   # (384,)
print(embedding.dtype)   # float32

# Batch encode — much faster than looping
embeddings = enc.encode_batch(["Hello", "Goodbye", "Help me please"])
print(embeddings.shape)  # (3, 384)

Semantic similarity

from defenx_nlp import SemanticEncoder, cosine_similarity

enc = SemanticEncoder()
e1 = enc.encode("I love machine learning")
e2 = enc.encode("I enjoy deep learning")

sim = cosine_similarity(e1, e2)
print(f"Similarity: {sim:.3f}")   # ~0.87
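`cosine_similarity` computes the standard normalised dot product. For reference, an equivalent plain-NumPy sketch of the math (not the library's internals):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: dot product divided by the vectors' L2 norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0, 1.0])
b = np.array([1.0, 1.0, 0.0])
print(round(cosine(a, b), 3))  # 0.5
```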

Top-k retrieval

from defenx_nlp import SemanticEncoder, top_k_similar

enc = SemanticEncoder()
corpus = ["Help me", "Goodbye", "Great job!", "What is AI?"]
query  = "Can you assist me?"

c_embs = [enc.encode(t) for t in corpus]
q_emb  = enc.encode(query)

results = top_k_similar(q_emb, c_embs, k=1)
print(corpus[results[0][0]])   # "Help me"

Text preprocessing

from defenx_nlp import clean_text, batch_clean

text = clean_text("  HELLO  WORLD!  ", lowercase=True)
# → "hello world!"

texts = batch_clean(["  A  ", " B  "], lowercase=True)
# → ["a", "b"]
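The API summary below also lists a `truncate(text, max_chars)` helper ("hard-truncate with optional ellipsis"). A dependency-free sketch of that behaviour — the `ellipsis` keyword name is an assumption, not the library's confirmed signature:

```python
def truncate(text: str, max_chars: int, ellipsis: bool = True) -> str:
    """Hard-truncate to max_chars; optionally end with a single '…'."""
    if len(text) <= max_chars:
        return text            # already short enough — return unchanged
    if ellipsis:
        return text[: max_chars - 1] + "…"   # reserve one char for '…'
    return text[:max_chars]

print(truncate("hello world", 8))                  # "hello w…"
print(truncate("hello world", 8, ellipsis=False))  # "hello wo"
```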

CUDA warmup (for production services)

enc = SemanticEncoder(lazy=False)
enc.warmup()   # initialise cuDNN kernels at startup, not on the first request

API Summary

| Symbol | Description |
|---|---|
| `SemanticEncoder` | Main encoder class — lazy, thread-safe, CUDA-aware |
| `BaseEncoder` | Abstract base for custom encoder backends |
| `BaseInferenceEngine` | Abstract base for downstream classifiers |
| `get_device(preferred)` | Resolve `"auto"`/`"cuda"`/`"cpu"`/`"mps"` → `torch.device` |
| `device_info()` | Hardware diagnostic dictionary |
| `clean_text(text, **opts)` | Configurable single-text cleaner |
| `batch_clean(texts, **opts)` | Apply `clean_text` to a list |
| `truncate(text, max_chars)` | Hard-truncate with optional ellipsis |
| `cosine_similarity(a, b)` | Scalar cosine similarity in [-1, 1] |
| `batch_cosine_similarity(q, M)` | Vectorised query-vs-matrix similarity → (N,) |
| `top_k_similar(q, corpus, k)` | Top-k retrieval → [(idx, score)] |
| `normalize_embedding(v)` | L2-normalise a single embedding |
| `normalize_batch(M)` | Row-wise L2-normalise an (N, D) matrix |

Full API docs: docs/api_reference.md
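The batch utilities reduce to a few lines of NumPy. A sketch of the underlying math for `normalize_batch` and `batch_cosine_similarity` (illustrative only, not the library's exact implementation):

```python
import numpy as np

def normalize_batch(M: np.ndarray) -> np.ndarray:
    """Row-wise L2-normalise an (N, D) matrix."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    return M / np.clip(norms, 1e-12, None)   # clip guards against zero rows

def batch_cosine_similarity(q: np.ndarray, M: np.ndarray) -> np.ndarray:
    """Cosine similarity of one query vector against every row of M → (N,)."""
    qn = q / max(np.linalg.norm(q), 1e-12)
    return normalize_batch(M) @ qn

M = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
q = np.array([1.0, 0.0])
print(batch_cosine_similarity(q, M))   # ≈ [1, 0, 0.707]
```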


Hardware Requirements

Minimum

| Component | Requirement |
|---|---|
| CPU | Dual-core, 64-bit |
| RAM | 4 GB |
| Disk | 500 MB (model cache) |
| GPU | None (CPU mode) |
| Python | 3.9+ |

Recommended

| Component | Requirement |
|---|---|
| CPU | 6+ cores (AMD Ryzen 7 / Intel Core i7+) |
| RAM | 16 GB |
| GPU | NVIDIA RTX 20-series or newer |
| VRAM | 4+ GB |
| CUDA | 11.8 or 12.x |
| Python | 3.11+ |

Tested on: AMD Ryzen 7 4800H + NVIDIA RTX 3050 6 GB (CUDA 12.8) on Kali Linux (WSL2). Average inference latency: ~15 ms/sentence on CUDA, ~80 ms on CPU.


Supported Operating Systems

| OS | CPU mode | CUDA mode | Notes |
|---|---|---|---|
| Linux (Ubuntu 20.04+, Debian 11+, Kali) | ✅ | ✅ | Fully tested |
| Windows 10 / 11 | ✅ | via WSL2 | Use WSL2 for CUDA |
| macOS 12+ (Intel) | ✅ | ❌ | No NVIDIA CUDA support |
| macOS 12+ (Apple Silicon M1/M2/M3) | ✅ | MPS | Use device="mps" |
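`get_device("auto")` resolves to the best available backend per the table above. The selection logic is roughly the following — a dependency-free sketch of the priority order, not the actual implementation (which queries `torch` for availability):

```python
def resolve_device(preferred: str, cuda_ok: bool, mps_ok: bool) -> str:
    """Pick a backend: 'auto' prefers CUDA, then Apple MPS, then CPU."""
    if preferred != "auto":
        return preferred          # an explicit "cuda"/"mps"/"cpu" wins
    if cuda_ok:
        return "cuda"
    if mps_ok:
        return "mps"
    return "cpu"

print(resolve_device("auto", cuda_ok=False, mps_ok=True))  # mps
print(resolve_device("mps", cuda_ok=True, mps_ok=True))    # mps
```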

Extending the Library

Custom encoder backend

import numpy as np
import torch
from defenx_nlp import BaseEncoder

class OpenAIEncoder(BaseEncoder):
    """Drop-in encoder using OpenAI embeddings API."""

    def __init__(self, api_key: str):
        import openai
        self._client = openai.OpenAI(api_key=api_key)

    def encode(self, text: str) -> np.ndarray:
        resp = self._client.embeddings.create(
            model="text-embedding-3-small", input=text
        )
        return np.array(resp.data[0].embedding, dtype=np.float32)

    def encode_batch(self, texts):
        resp = self._client.embeddings.create(
            model="text-embedding-3-small", input=texts
        )
        return np.array([d.embedding for d in resp.data], dtype=np.float32)

    @property
    def embedding_dim(self) -> int: return 1536

    @property
    def device(self) -> torch.device: return torch.device("cpu")
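Any backend that honours the `encode`/`encode_batch`/`embedding_dim` contract plugs into the same similarity and retrieval utilities. For experimenting without network access, a deterministic, dependency-free toy backend (character-bucket counts stand in for a real embedding — purely illustrative):

```python
import numpy as np

class ToyHashEncoder:
    """Illustrative backend: bucket character counts into a fixed-width vector."""
    embedding_dim = 16

    def encode(self, text: str) -> np.ndarray:
        v = np.zeros(self.embedding_dim, dtype=np.float32)
        for ch in text.lower():
            v[ord(ch) % self.embedding_dim] += 1.0   # one count per character
        return v

    def encode_batch(self, texts) -> np.ndarray:
        return np.stack([self.encode(t) for t in texts])

enc = ToyHashEncoder()
embs = enc.encode_batch(["abc", "abd"])
print(embs.shape)  # (2, 16)
```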

Running Tests

# Install dev extras first
pip install -e ".[dev]"

# Run all tests
pytest tests/ -v

# With coverage
pytest tests/ -v --cov=defenx_nlp --cov-report=term-missing

Expected output:

tests/test_encoder.py::TestSemanticEncoder::test_encode_shape          PASSED
tests/test_encoder.py::TestSemanticEncoder::test_embedding_dim_property PASSED
...
13 passed in 42.3s

Running Examples

# Basic single-sentence usage + similarity + retrieval
python examples/basic_usage.py

# Batch throughput benchmark + similarity matrix
python examples/batch_encoding.py

Publishing to PyPI

1. Build the distribution

pip install build twine
python -m build
# Creates dist/defenx_nlp-<version>.tar.gz and the matching py3-none-any wheel

2. Test on TestPyPI first (always)

twine upload --repository testpypi dist/*
# --extra-index-url lets dependencies (torch, numpy) resolve from real PyPI
pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ defenx-nlp

3. Publish to real PyPI

twine upload dist/*

4. Verify the install

pip install defenx-nlp
python -c "from defenx_nlp import SemanticEncoder; print(SemanticEncoder())"

Versioning

Update version in pyproject.toml before each release. Follow Semantic Versioning: MAJOR.MINOR.PATCH.


Project Structure

defenx-nlp/
├── defenx_nlp/
│   ├── __init__.py        Public API surface — all exports live here
│   ├── encoder.py         SemanticEncoder — lazy, thread-safe, CUDA-aware
│   ├── device.py          get_device() and device_info() helpers
│   ├── preprocessing.py   clean_text, batch_clean, truncate, deduplicate
│   ├── interfaces.py      BaseEncoder and BaseInferenceEngine ABCs
│   └── utils.py           cosine_similarity, top_k_similar, normalize_*
│
├── tests/
│   └── test_encoder.py    pytest suite — encoder, device, preprocessing, utils
│
├── examples/
│   ├── basic_usage.py     Single-sentence encode, similarity, retrieval
│   └── batch_encoding.py  Throughput benchmark, similarity matrix
│
├── docs/
│   └── api_reference.md   Full API documentation
│
├── README.md              This file
├── pyproject.toml         PEP 621 package metadata + build config
└── LICENSE                MIT

License

MIT — see LICENSE.


Acknowledgements

Built on top of:
