
Lightweight NLP components for semantic processing of domain-specific content.


Description

Structured data in technical domains (e.g. engineering, meteorology) often contains specialized terminology, measurement units, parameter specifications, and symbolic values. These elements pose challenges for embedding-based similarity methods, which offer only limited semantic resolution for such content.

This package follows a hybrid approach that combines rule-based processing with NLP-based filtering. It does not rely on embedding-based retrieval methods. Instead, it explicitly identifies domain-specific entities and organizes them across multiple abstraction levels to support interpretable and reproducible retrieval workflows.

The package integrates lightweight components into existing NLP pipelines. These components operate independently of large language models (LLMs) and are designed to structure relevant data using deterministic and auditable mechanisms.

Additional modules are planned to support structured query generation, including:

  • Logic Query Composer: Parses natural-language input and produces a logical structure enriched with extracted entities. This structure can be used as a basis for formats such as SQL, JSON or YAML.
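
As this module is still planned, the following sketch only illustrates the intended path from natural-language input to a logical structure to SQL. Every name in it (compose_logic, to_sql, the dictionary shape) is hypothetical and not part of the package's API:

```python
import re

def compose_logic(query: str) -> dict:
    """Sketch: map a simple comparison phrase to a logical structure."""
    match = re.search(r"(\w+)\s+(above|below)\s+(\d+(?:\.\d+)?)\s*(\S+)?", query)
    if not match:
        return {"filter": None}
    field, relation, value, unit = match.groups()
    operator = ">" if relation == "above" else "<"
    return {"filter": {"field": field, "op": operator,
                       "value": float(value), "unit": unit}}

def to_sql(logic: dict, table: str = "records") -> str:
    """Sketch: render the logical structure as a SQL statement."""
    f = logic["filter"]
    return f"SELECT * FROM {table} WHERE {f['field']} {f['op']} {f['value']}"

logic = compose_logic("speed above 900 km/h")
print(to_sql(logic))  # SELECT * FROM records WHERE speed > 900.0
```

The same logical structure could just as well be serialized to JSON or YAML instead of SQL, which is the point of keeping the intermediate representation format-neutral.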

Structured NLP Workflow

The following figures illustrate the core motivation and design focus of this package. They outline the typical stages of a structured NLP pipeline and highlight the specific components where this package provides support.

Training Pipeline

---
config:
  theme: neutral
---
flowchart TD
    subgraph subGraph1[" "]
        A["Structured Data"]
        subgraph subGraph1-1["synthetics + units"]
            B["Synthetic Annotated Training Sentences"]
        end
        C["NLP Component Update"]
    end
    A --> B
    B --> C
    style subGraph1-1 fill:#BBDEFB    
    style A fill:#FFFFFF
    style B fill:#FFFFFF

Retrieval Process

---
config:
  theme: neutral
---
flowchart TD
    subgraph subGraph2["Retrieval Process"]
        D["Natural-language Query"]
        E["Entity Extraction"]
        subgraph subGraph2-1["logic query composer"]
            F["Semantic and Logical Analysis"]
        end
        G["Logical Structure"]
        H["Manual SQL Composition"]
        I["SQL"]
        J["Database Execution"]
        K["Retrieval"]
    end
    D --> E
    E --> F
    F --> G
    G --> H
    H --> I
    I --> J
    J --> K
    style subGraph2-1 fill:#BBDEFB
    style D fill:#FFFFFF
    style F fill:#FFFFFF
    style G fill:#FFFFFF
    style I fill:#FFFFFF
    style K fill:#FFFFFF

Feedback Loop (optional)

---
config:
  theme: neutral
---
flowchart TD
    subgraph subGraph3[" "]
        L["New Structured Data + Natural-language Query"]
        subgraph subGraph3-1["synthetics + units"]
            M["Update of Synthetic Annotated Training Sentences"]
        end
        N["NLP Component Update"]
    end
    L --> M
    M --> N
    style subGraph3-1 fill:#BBDEFB
    style L fill:#FFFFFF
    style M fill:#FFFFFF

This conceptual overview serves as a foundation for understanding the individual components, which are detailed in the next section.

Licence Agreement

Seanox Software Solutions is an open-source project, hereinafter referred to as Seanox.

This software is licensed under the Apache License, Version 2.0.

Copyright (C) 2025 Seanox Software Solutions

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

System Requirements

  • Python 3.9 or higher

Installation & Setup

pip install seanox-ai-nlp

Packages & Modules

units

The units module applies rule-based, deterministic pattern recognition to identify numerical expressions and measurement units in text. It is designed for integration into lightweight NLP pipelines and does not rely on large language models (LLMs). Its language-agnostic architecture and flexible formatting support a broad range of use cases, including general, semi-technical and semi-academic content.

The module can be integrated with tools such as spaCy’s EntityRuler, enabling annotation, filtering, and token alignment workflows. It produces structured output suitable for downstream analysis, without performing semantic interpretation.

Features

  • Pattern-based extraction
    Identifies constructs like 5 km, -20 °C, or 1000 hPa using regular expressions and token patterns -- no training required.
  • Language-independent architecture
    Operates at token and character level; applicable across multilingual content.
  • Support for compound expressions
    Recognizes unit combinations (km/h, kWh/m², g/cm³) and numerical constructs involving signs and operators: ±, ×, ·, :, /, ^, – and more.
  • Integration-ready output
    Returns structured entities compatible with tools like spaCy’s EntityRuler.
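
As a rough illustration of the pattern-based idea, a single regular expression can already capture simple measurement expressions. This is a minimal sketch, not the package's implementation; the units component covers far more unit variants, compound forms, and operators:

```python
import re

# Sketch of rule-based measurement extraction with a small, hand-picked
# unit list; the real units() component uses richer patterns and tables.
MEASURE = re.compile(r"[+-]?\d+(?:\.\d+)?\s*(?:km/h|km|hPa|°C|kWh/m²|g/cm³)")

def extract_measures(text: str) -> list[str]:
    """Return all measurement expressions found in the text."""
    return MEASURE.findall(text)

text = "Cruising speed is about 900 km/h at -20 °C and 250 hPa."
print(extract_measures(text))  # ['900 km/h', '-20 °C', '250 hPa']
```

Because the matching is purely lexical, the result is deterministic and language-independent, which mirrors the design goal stated above.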

Quickstart

from seanox_ai_nlp.units import units
text = "The cruising speed of the Boeing 747 is approximately 900 km/h (559 mph)."
for entity in units(text):
    print(entity)

synthetics

The synthetics module generates annotated natural language from structured input data -- such as records from databases or knowledge graphs. It uses template-based, rule-driven methods to produce controlled and annotated sentences. Designed for deterministic NLP pipelines, it avoids large language models (LLMs) and supports reproducible generation.

Features

  • Template-Based Text Generation
    Produces natural-language output from structured input using YAML-defined Jinja2 templates. Template selection is context-sensitive.
  • Stochastic Variation
    Filters such as random_set, random_range, and random_range_join_phrase introduce lexical and syntactic diversity from identical data structures.
  • Domain-Specific Annotation
    Annotates entities with structured markers for precise extraction and control.
  • Rule-Based Span Detection
    Identifies semantic spans using regular expressions, independent of tokenization or parsing.
  • Interpretation-Free Generation
    Output is deterministic and reproducible; no semantic analysis is performed.
  • NLP pipeline compatibility
    The Synthetic object includes raw and annotated text, entity spans, and regex-based semantic spans. Compatible with spaCy-style frameworks for fine-tuning, evaluation, and augmentation.
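
The interplay of template-based generation, annotation markers, and span detection can be sketched with the standard library alone. The marker syntax and the generate function below are invented for illustration; the actual module uses YAML-defined Jinja2 templates and its own annotation format:

```python
import re
from string import Template

# Sketch: a template with inline entity markers (marker syntax invented here)
TEMPLATE = Template("$name has a diameter of [VALUE]$diameter km[/VALUE].")

def generate(record: dict):
    """Render the template and resolve markers into (start, end, label) spans."""
    annotated = TEMPLATE.substitute(record)
    raw, spans, pos = "", [], 0
    for m in re.finditer(r"\[VALUE\](.*?)\[/VALUE\]", annotated):
        raw += annotated[pos:m.start()]
        start = len(raw)
        raw += m.group(1)
        spans.append((start, len(raw), "VALUE"))
        pos = m.end()
    raw += annotated[pos:]
    return raw, spans

raw, spans = generate({"name": "Mercury", "diameter": "4879"})
print(raw)    # Mercury has a diameter of 4879 km.
print(spans)  # [(26, 33, 'VALUE')]
```

Because the spans are computed character-exactly during generation, they can be fed directly into span-based training or evaluation without a separate alignment step.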

Quickstart

from seanox_ai_nlp.synthetics import synthetics
import json

with open("synthetics-planets_en.json", encoding="utf-8") as file:
    records = json.load(file)

for record in records:
    synthetic = synthetics(".", "synthetics_en_annotate.yaml", record)
    print(synthetic)

Changes

1.1.0 20250823

BF: units: Corrections/optimizations of categorization
BF: Documentation: Corrections/optimizations
BF: Build: Corrections/optimizations in pyproject.toml/setup.py
CR: units: Renamed the UNIT-VALUE label to MEASURE
CR: units: Added unit hl / hL for hectoliters
CR: units: Separated SI prefixes into multiples and submultiples
CR: synthetics: Added generation of semantic sentences


Contact

  • Issues
  • Requests
  • Mail
