
Lightweight NLP components for semantic processing of domain-specific content.


Description

Structured data in technical domains (e.g. engineering, meteorology) often contains specialized terminology, measurement units, parameter specifications, and symbolic values. These elements pose challenges for embedding-based similarity methods, which offer only limited semantic resolution for such content.

This package follows a hybrid approach that combines rule-based processing with NLP-based filtering. It does not rely on embedding-based retrieval methods. Instead, it explicitly identifies domain-specific entities and organizes them across multiple abstraction levels to support interpretable and reproducible retrieval workflows.

The package integrates lightweight components into existing NLP pipelines. These components operate independently of large language models (LLMs) and are designed to structure relevant data using deterministic and auditable mechanisms.

Additional modules are planned to support structured query generation, including:

  • Logic Query Composer: Parses natural-language input and produces a logical structure enriched with extracted entities. This structure can be used as a basis for formats such as SQL, JSON or YAML.
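
As this module is still planned, the following sketch only illustrates the intended path from natural-language input to a logical structure to SQL. Every name in it (compose_logic, to_sql, the dictionary shape) is hypothetical and not part of the package's API:

```python
import re

def compose_logic(query: str) -> dict:
    """Sketch: map a simple comparison phrase to a logical structure."""
    match = re.search(r"(\w+)\s+(above|below)\s+(\d+(?:\.\d+)?)\s*(\S+)?", query)
    if not match:
        return {"filter": None}
    field, relation, value, unit = match.groups()
    operator = ">" if relation == "above" else "<"
    return {"filter": {"field": field, "op": operator,
                       "value": float(value), "unit": unit}}

def to_sql(logic: dict, table: str = "records") -> str:
    """Sketch: render the logical structure as a SQL statement."""
    f = logic["filter"]
    return f"SELECT * FROM {table} WHERE {f['field']} {f['op']} {f['value']}"

logic = compose_logic("speed above 900 km/h")
print(to_sql(logic))  # SELECT * FROM records WHERE speed > 900.0
```

The same logical structure could just as well be serialized to JSON or YAML instead of SQL, which is the point of keeping the intermediate representation format-neutral.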

Structured NLP Workflow

The following figures illustrate the core motivation and design focus of this package. They outline the typical stages of a structured NLP pipeline and highlight the specific components where this package provides support.

Training Pipeline

---
config:
  theme: neutral
---
flowchart TD
    subgraph subGraph1[" "]
        A["Structured Data"]
        subgraph subGraph1-1["synthetics + units"]
            B["Synthetic Annotated Training Sentences"]
        end
        C["NLP Component Update"]
    end
    A --> B
    B --> C
    style subGraph1-1 fill:#BBDEFB    
    style A fill:#FFFFFF
    style B fill:#FFFFFF

Retrieval Process

---
config:
  theme: neutral
---
flowchart TD
    subgraph subGraph2["Retrieval Process"]
        D["Natural-language Query"]
        E["Entity Extraction"]
        subgraph subGraph2-1["logic query composer"]
            F["Semantic and Logical Analysis"]
        end
        G["Logical Structure"]
        H["Manual SQL Composition"]
        I["SQL"]
        J["Database Execution"]
        K["Retrieval"]
    end
    D --> E
    E --> F
    F --> G
    G --> H
    H --> I
    I --> J
    J --> K
    style subGraph2-1 fill:#BBDEFB
    style D fill:#FFFFFF
    style F fill:#FFFFFF
    style G fill:#FFFFFF
    style I fill:#FFFFFF
    style K fill:#FFFFFF

Feedback Loop (optional)

---
config:
  theme: neutral
---
flowchart TD
    subgraph subGraph3[" "]
        L["New Structured Data + Natural-language Query"]
        subgraph subGraph3-1["synthetics + units"]
            M["Update of Synthetic Annotated Training Sentences"]
        end
        N["NLP Component Update"]
    end
    L --> M
    M --> N
    style subGraph3-1 fill:#BBDEFB
    style L fill:#FFFFFF
    style M fill:#FFFFFF

This conceptual overview serves as a foundation for understanding the individual components, which are detailed in the next section.

Licence Agreement

Seanox Software Solutions is an open-source project, hereinafter referred to as Seanox.

This software is licensed under the Apache License, Version 2.0.

Copyright (C) 2025 Seanox Software Solutions

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

System Requirements

  • Python 3.9 or higher

Installation & Setup

pip install seanox-ai-nlp

Packages & Modules

units

The units module applies rule-based, deterministic pattern recognition to identify numerical expressions and measurement units in text. It is designed for integration into lightweight NLP pipelines and does not rely on large language models (LLMs). Its language-agnostic architecture and flexible formatting support a broad range of use cases, including general, semi-technical and semi-academic content.

The module can be integrated with tools such as spaCy’s EntityRuler, enabling annotation, filtering, and token alignment workflows. It produces structured output suitable for downstream analysis, without performing semantic interpretation.

Features

  • Pattern-based extraction
    Identifies constructs like 5 km, -20 °C, or 1000 hPa using regular expressions and token patterns -- no training required.
  • Language-independent architecture
    Operates at token and character level; applicable across multilingual content.
  • Support for compound expressions
    Recognizes unit combinations (km/h, kWh/m², g/cm³) and numerical constructs involving signs and operators: ±, ×, ·, :, /, ^, – and more.
  • Integration-ready output
    Returns structured entities compatible with tools like spaCy’s EntityRuler.
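
As a rough illustration of the pattern-based idea, a single regular expression can already capture simple measurement expressions. This is a minimal sketch, not the package's implementation; the units component covers far more unit variants, compound forms, and operators:

```python
import re

# Sketch of rule-based measurement extraction with a small, hand-picked
# unit list; the real units() component uses richer patterns and tables.
MEASURE = re.compile(r"[+-]?\d+(?:\.\d+)?\s*(?:km/h|km|hPa|°C|kWh/m²|g/cm³)")

def extract_measures(text: str) -> list[str]:
    """Return all measurement expressions found in the text."""
    return MEASURE.findall(text)

text = "Cruising speed is about 900 km/h at -20 °C and 250 hPa."
print(extract_measures(text))  # ['900 km/h', '-20 °C', '250 hPa']
```

Because the matching is purely lexical, the result is deterministic and language-independent, which mirrors the design goal stated above.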

Quickstart

from seanox_ai_nlp.units import units
text = "The cruising speed of the Boeing 747 is approximately 900 km/h (559 mph)."
for entity in units(text):
    print(entity)

synthetics

The synthetics module generates annotated natural language from structured input data -- such as records from databases or knowledge graphs. It uses template-based, rule-driven methods to produce controlled and annotated sentences. Designed for deterministic NLP pipelines, it avoids large language models (LLMs) and supports reproducible generation.

Features

  • Template-Based Text Generation
    Produces natural-language output from structured input using YAML-defined Jinja2 templates. Template selection is context-sensitive.
  • Stochastic Variation
    Filters such as random_set, random_range, and random_range_join_phrase introduce lexical and syntactic diversity from identical data structures.
  • Domain-Specific Annotation
    Annotates entities with structured markers for precise extraction and control.
  • Rule-Based Span Detection
    Identifies semantic spans using regular expressions, independent of tokenization or parsing.
  • Interpretation-Free Generation
    Output is deterministic and reproducible; no semantic analysis is performed.
  • NLP pipeline compatibility
    The Synthetic object includes raw and annotated text, entity spans, and regex-based semantic spans. Compatible with spaCy-style frameworks for fine-tuning, evaluation, and augmentation.
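
The interplay of template-based generation, annotation markers, and span detection can be sketched with the standard library alone. The marker syntax and the generate function below are invented for illustration; the actual module uses YAML-defined Jinja2 templates and its own annotation format:

```python
import re
from string import Template

# Sketch: a template with inline entity markers (marker syntax invented here)
TEMPLATE = Template("$name has a diameter of [VALUE]$diameter km[/VALUE].")

def generate(record: dict):
    """Render the template and resolve markers into (start, end, label) spans."""
    annotated = TEMPLATE.substitute(record)
    raw, spans, pos = "", [], 0
    for m in re.finditer(r"\[VALUE\](.*?)\[/VALUE\]", annotated):
        raw += annotated[pos:m.start()]
        start = len(raw)
        raw += m.group(1)
        spans.append((start, len(raw), "VALUE"))
        pos = m.end()
    raw += annotated[pos:]
    return raw, spans

raw, spans = generate({"name": "Mercury", "diameter": "4879"})
print(raw)    # Mercury has a diameter of 4879 km.
print(spans)  # [(26, 33, 'VALUE')]
```

Because the spans are computed character-exactly during generation, they can be fed directly into span-based training or evaluation without a separate alignment step.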

Quickstart

from seanox_ai_nlp.synthetics import synthetics
import json

with open("synthetics-planets_en.json", encoding="utf-8") as file:
    records = json.load(file)

for record in records:
    synthetic = synthetics(".", "synthetics_en_annotate.yaml", record)
    print(synthetic)

Changes

1.1.0 20250823

BF: units: Corrections/optimizations of categorization
BF: Documentation: Corrections/optimizations
BF: Build: Corrections/optimizations in pyproject.toml/setup.py
CR: units: Renamed the UNIT-VALUE label to MEASURE
CR: units: Added unit hl / hL for hectoliters
CR: units: Separated SI prefixes into multiples and submultiples
CR: synthetics: Added generation of semantic sentences


Contact

  • Issues
  • Requests
  • Mail
