Lightweight NLP components for semantic processing of domain-specific content.
Description
Structured data in technical domains (e.g. engineering, meteorology) often contains specialized terminology, measurement units, parameter specifications, and symbolic values. These elements pose a challenge for similarity methods based solely on embeddings, because embeddings alone offer limited semantic resolution for such content.
This package follows a hybrid approach, in which rule-based processing, NLP-based filtering, and embeddings can be combined so that domain-specific entities are identified and organized across multiple levels of abstraction, enabling interpretable and reproducible retrieval workflows.
The package integrates lightweight components into existing NLP pipelines. These components are designed to work without relying on large language models (LLMs) and to structure relevant data using deterministic and auditable mechanisms.
Additional modules are planned to support structured query generation, including:
- Semantic Logic Composer: Parses natural-language input and produces a logical structure enriched with extracted entities. This structure can be used as a basis for formats such as SQL, JSON or YAML.
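To make the planned composer idea concrete, the following is a minimal, hypothetical sketch (not part of the package; the grammar, field names, and output schema are assumptions for illustration) that parses a simple comparison query into a logical structure and serializes it as JSON:

```python
import json
import re

# Hypothetical sketch of the Semantic Logic Composer concept: turn a
# simple natural-language comparison into a logical structure that
# could later be rendered as SQL, JSON, or YAML. Illustration only.
QUERY = re.compile(
    r"(?P<field>\w+)\s+(?P<op>above|below|equal to)\s+(?P<value>\d+)\s*(?P<unit>\w+)?"
)
OPS = {"above": ">", "below": "<", "equal to": "="}

def compose(text):
    """Parse a simple comparison phrase into a logical structure, or None."""
    m = QUERY.search(text)
    if not m:
        return None
    return {
        "field": m.group("field"),
        "operator": OPS[m.group("op")],
        "value": int(m.group("value")),
        "unit": m.group("unit"),
    }

structure = compose("pressure above 1000 hPa")
print(json.dumps(structure))
```

The point of such a structure is that it is deterministic and auditable: the same input always yields the same logical form, which downstream code can translate into a query format of choice.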
Structured NLP Workflow
The following figures illustrate the core motivation and design focus of this package. They outline the typical stages of a structured NLP pipeline and highlight the specific components where this package provides support.
This conceptual overview serves as a foundation for understanding the individual components, which are detailed in the next section.
License Agreement
Seanox Software Solutions is an open-source project, hereinafter referred to as Seanox.
This software is licensed under the Apache License, Version 2.0.
Copyright (C) 2025 Seanox Software Solutions
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
https://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
System Requirements
- Python 3.10 or higher
Installation & Setup
pip install seanox-ai-nlp
Packages & Modules
units
The units module applies rule-based, deterministic pattern recognition to identify numerical expressions and measurement units in text. It is designed for integration into lightweight NLP pipelines and does not rely on large language models (LLMs). Its language-agnostic architecture and flexible formatting support a broad range of use cases, including general, semi-technical and semi-academic content.
The module can be integrated with tools such as spaCy’s EntityRuler, enabling annotation, filtering, and token alignment workflows. It produces structured output suitable for downstream analysis, without performing semantic interpretation.
Features
- Pattern-based extraction: Identifies constructs like 5 km, -20 °C, or 1000 hPa using regular expressions and token patterns -- no training required.
- Language-independent architecture: Operates at token and character level; applicable across multilingual content.
- Support for compound expressions: Recognizes unit combinations (km/h, kWh/m², g/cm³) and numerical constructs involving signs and operators: ±, ×, ·, :, /, ^, – and more.
- Integration-ready output: Returns structured entities compatible with tools like spaCy’s EntityRuler.
Quickstart
from seanox_ai_nlp.units import units

text = "The cruising speed of the Boeing 747 is approximately 900 km/h (559 mph)."
for entity in units(text):
    print(entity)
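The rule-based, training-free approach the module takes can be approximated with a single regular expression. The sketch below is not the package's actual implementation -- the pattern and the tuple format are simplifying assumptions -- but it shows the deterministic character of pattern-based unit extraction, including compound units such as km/h:

```python
import re

# Hypothetical simplification of rule-based unit extraction: a signed
# number followed by a (possibly compound) unit. The real package uses
# richer token patterns; this regex is for illustration only.
UNIT_PATTERN = re.compile(
    r"(?P<value>[+-]?\d+(?:[.,]\d+)?)\s*"
    r"(?P<unit>[A-Za-zµ°Ω%]+(?:[/·^]\S+)*)"
)

def extract_units(text):
    """Return (value, unit, start, end) tuples for every match in text."""
    return [
        (m.group("value"), m.group("unit"), m.start(), m.end())
        for m in UNIT_PATTERN.finditer(text)
    ]

matches = extract_units("The cruising speed is approximately 900 km/h (559 mph).")
```

A simplification like this also shows why the package layers token patterns on top: a bare regex happily pairs any number with a following word (e.g. "Boeing 747 is"), so additional filtering is needed for precision.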
synthetics
The synthetics module generates annotated natural language from structured input data -- such as records from databases or knowledge graphs. It uses template-based, rule-driven methods to produce controlled and annotated sentences. Designed for deterministic NLP pipelines, it avoids large language models (LLMs) and supports reproducible generation.
Features
- Template-Based Text Generation: Produces natural-language output from structured input using YAML-defined Jinja2 templates. Template selection is context-sensitive.
- Stochastic Variation: Filters such as random_set, random_range, and random_range_join_phrase introduce lexical and syntactic diversity from identical data structures.
- Domain-Specific Annotation: Annotates entities with structured markers for precise extraction and control.
- Rule-Based Span Detection: Identifies semantic spans using regular expressions, independent of tokenization or parsing.
- Interpretation-Free Generation: Output is deterministic and reproducible; no semantic analysis is performed.
- NLP Pipeline Compatibility: The Synthetic object includes raw and annotated text, entity spans, and regex-based semantic spans. Compatible with spaCy-style frameworks for fine-tuning, evaluation, and augmentation.
Quickstart
from seanox_ai_nlp.synthetics import synthetics
import json

with open("synthetics-planets_en.json", encoding="utf-8") as file:
    records = json.load(file)

for data in records:
    synthetic = synthetics(".", "synthetics_en_annotate.yaml", data)
    print(synthetic)
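The combination of template-based generation, reproducible stochastic variation, annotation markers, and regex span detection can be sketched with the standard library alone. This is not the package's implementation -- the bracket marker syntax, template strings, and record fields are stand-ins for illustration:

```python
import random
import re

# Hypothetical sketch of template-based generation with annotation
# markers. A seeded RNG keeps the stochastic variation reproducible,
# mirroring the package's deterministic design goal.
TEMPLATES = [
    "The planet [NAME:{name}] orbits at [DIST:{distance} {unit}].",
    "[NAME:{name}] has an orbital distance of [DIST:{distance} {unit}].",
]
MARKER = re.compile(r"\[(?P<label>\w+):(?P<text>[^\]]+)\]")

def generate(record, seed=0):
    """Return (raw_text, labeled_spans) generated from a structured record."""
    rng = random.Random(seed)                      # fixed seed => reproducible
    annotated = rng.choice(TEMPLATES).format(**record)
    raw = MARKER.sub(lambda m: m.group("text"), annotated)
    spans = [(m.group("label"), m.group("text")) for m in MARKER.finditer(annotated)]
    return raw, spans

raw, spans = generate({"name": "Mars", "distance": 227.9, "unit": "million km"})
```

Because the only nondeterminism is the seeded template choice, identical input and seed always yield the identical sentence and spans -- the property that makes such output usable for reproducible fine-tuning and evaluation sets.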
Changes
1.3.0.1 20251009
BF: Release: Unwanted content in distribution (seanox_ai_nlp.whl / seanox_ai_nlp.gz)
1.3.0 20251001
BF: Python: Corrections/optimizations of dependencies
BF: synthetics: Correction for empty templates / missing segments
BF: synthetics: Consistent use of the parameter pattern for RegEx in spans
CR: Python: Increased the requirement to Python 3.10 or higher
CR: synthetics: Added schema and validation for template YAML
CR: synthetics: Added custom filters for template rendering
CR: synthetics: Template section span - regex added support for labels
Contact
Download files
Source Distribution
Built Distribution
File details
Details for the file seanox_ai_nlp-1.3.0.1.tar.gz.
File metadata
- Download URL: seanox_ai_nlp-1.3.0.1.tar.gz
- Upload date:
- Size: 197.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6985dbb7a4f76220def681eb3b8f5d1a22f5d2a47a3b7a13cde85df8e45e49eb |
| MD5 | d963170c2326b261e23d87bb63bb3efc |
| BLAKE2b-256 | b0e937c039d9827ae0002ac969176772a03af8b74d197e1562c7b3d89667a327 |
File details
Details for the file seanox_ai_nlp-1.3.0.1-py3-none-any.whl.
File metadata
- Download URL: seanox_ai_nlp-1.3.0.1-py3-none-any.whl
- Upload date:
- Size: 69.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c2c2ced55f86ad3f7e66c0c0c32038bd8a41f076a21eed114688a8e232b7103a |
| MD5 | 4d69c7d62d9c9b65e27efe827ea7dbf9 |
| BLAKE2b-256 | 8d6c0c6cf1df416ae2b52b70068fea9e722a284120a0a7c2baa4f535fd720bdc |