Lightweight NLP components for semantic processing of domain-specific content.

Description

Structured data in technical domains (e.g. engineering, meteorology) often contains specialized terminology, measurement units, parameter specifications, and symbolic values. These elements pose a challenge for purely embedding-based similarity methods, which offer limited semantic resolution for such fine-grained, symbolic content.

This package therefore follows a hybrid approach: rule-based processing, NLP-based filtering, and embeddings can be combined so that domain-specific entities are identified and organized across multiple levels of abstraction. This enables interpretable and reproducible retrieval workflows.

The package provides lightweight components that integrate into existing NLP pipelines. These components are designed to work without relying on large language models (LLMs) and to structure relevant data using deterministic, auditable mechanisms.

Additional modules are planned to support structured query generation, including:

  • Semantic Logic Composer: Parses natural-language input and produces a logical structure enriched with extracted entities. This structure can be used as a basis for formats such as SQL, JSON or YAML.
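Since this module is still planned, the following is purely a hypothetical sketch of what such an entity-enriched logical structure could look like once serialized; all field names are invented for illustration and are not the module's actual schema:

```python
import json

# Hypothetical logical structure for a query like
# "planets with a radius above 6000 km".
# Field names ("select", "where", "field", ...) are invented, not the
# Semantic Logic Composer's actual output format.
logical_structure = {
    "select": "planets",
    "where": [
        {"field": "radius", "operator": ">", "value": 6000, "unit": "km"},
    ],
}

# A structure like this can be serialized to exchange formats such as JSON,
# or translated into SQL or YAML by downstream tooling.
print(json.dumps(logical_structure, indent=2))
```

The point is that the logical structure, not the natural-language input, becomes the stable interface for downstream query generation.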

Structured NLP Workflow

The following figures illustrate the core motivation and design focus of this package. They outline the typical stages of a structured NLP pipeline and highlight the specific components where this package provides support.

[Figure: Retrieval Process]

This conceptual overview serves as a foundation for understanding the individual components, which are detailed in the next section.

License Agreement

Seanox Software Solutions is an open-source project, hereinafter referred to as Seanox.

This software is licensed under the Apache License, Version 2.0.

Copyright (C) 2025 Seanox Software Solutions

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

System Requirement

  • Python 3.10 or higher

Installation & Setup

pip install seanox-ai-nlp

Packages & Modules

units

The units module applies rule-based, deterministic pattern recognition to identify numerical expressions and measurement units in text. It is designed for integration into lightweight NLP pipelines and does not rely on large language models (LLMs). Its language-agnostic architecture and flexible formatting support a broad range of use cases, including general, semi-technical and semi-academic content.

The module can be integrated with tools such as spaCy’s EntityRuler, enabling annotation, filtering, and token alignment workflows. It produces structured output suitable for downstream analysis, without performing semantic interpretation.

Features

  • Pattern-based extraction
    Identifies constructs like 5 km, -20 °C, or 1000 hPa using regular expressions and token patterns -- no training required.
  • Language-independent architecture
    Operates at token and character level; applicable across multilingual content.
  • Support for compound expressions
    Recognizes unit combinations (km/h, kWh/m², g/cm³) and numerical constructs involving signs and operators: ±, ×, ·, :, /, ^, – and more.
  • Integration-ready output
    Returns structured entities compatible with tools like spaCy’s EntityRuler.
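The module's actual pattern set is considerably more elaborate; as a rough, stand-alone sketch of the rule-based idea -- with an invented pattern and output format, not the package's own -- extraction might look like:

```python
import re

# Minimal illustrative pattern: optional sign, number with optional decimal
# part, then a unit token that may contain slashes, middle dots, or exponents
# (km/h, g/cm³). This is NOT the package's actual pattern set, just a sketch.
UNIT_PATTERN = re.compile(
    r"(?P<value>[+-]?\d+(?:[.,]\d+)?)\s*"
    r"(?P<unit>[A-Za-zµ°Ω%]+(?:[/·^][A-Za-z0-9²³]+)*)"
)

def extract_units(text):
    """Return (value, unit, span) tuples found by the sketch pattern."""
    return [
        (m.group("value"), m.group("unit"), m.span())
        for m in UNIT_PATTERN.finditer(text)
    ]

print(extract_units("Cruising speed is approximately 900 km/h at -20 °C."))
```

Because the matching is purely rule-based, the same input always yields the same entities and character spans, which is what makes this approach auditable.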

Quickstart

from seanox_ai_nlp.units import units

# Detect numeric values and measurement units in the text
text = "The cruising speed of the Boeing 747 is approximately 900 km/h (559 mph)."
for entity in units(text):
    print(entity)

synthetics

The synthetics module generates annotated natural language from structured input data -- such as records from databases or knowledge graphs. It uses template-based, rule-driven methods to produce controlled and annotated sentences. Designed for deterministic NLP pipelines, it avoids large language models (LLMs) and supports reproducible generation.

Features

  • Template-Based Text Generation
    Produces natural-language output from structured input using YAML-defined Jinja2 templates. Template selection is context-sensitive.
  • Stochastic Variation
    Filters such as random_set, random_range, and random_range_join_phrase introduce lexical and syntactic diversity from identical data structures.
  • Domain-Specific Annotation
    Annotates entities with structured markers for precise extraction and control.
  • Rule-Based Span Detection
    Identifies semantic spans using regular expressions, independent of tokenization or parsing.
  • Interpretation-Free Generation
    Output is deterministic and reproducible; no semantic analysis is performed.
  • NLP Pipeline compatibility
    The Synthetic object includes raw and annotated text, entity spans and regex-based semantic spans. Compatible with spaCy-style frameworks for fine-tuning, evaluation, and augmentation.
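The package's YAML/Jinja2 template format is not reproduced here; as a minimal stdlib-only sketch of the underlying idea -- context-sensitive template selection plus seeded, reproducible variation -- with invented templates and field names:

```python
import random
from string import Template

# Invented templates and record fields -- not the package's actual
# YAML/Jinja2 template format.
TEMPLATES = [
    Template("$name orbits the Sun at a mean distance of $distance_au AU."),
    Template("With a mean distance of $distance_au AU, $name circles the Sun."),
]

def render(record, seed):
    """Pick a template deterministically from the seed, then fill it in."""
    rng = random.Random(seed)      # seeded RNG -> reproducible variation
    template = rng.choice(TEMPLATES)
    return template.substitute(record)

record = {"name": "Mars", "distance_au": "1.52"}
print(render(record, seed=42))
print(render(record, seed=42) == render(record, seed=42))  # prints True
```

Varying the seed varies the surface form while the underlying data stays fixed, which is the property that makes generated training data both diverse and reproducible.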

Quickstart

from seanox_ai_nlp.synthetics import synthetics
import json

# Load structured records (e.g. planet data) from a JSON file
with open("synthetics-planets_en.json", encoding="utf-8") as file:
    records = json.load(file)

# Generate an annotated sentence for each record using the YAML template
for record in records:
    synthetic = synthetics(".", "synthetics_en_annotate.yaml", record)
    print(synthetic)

Changes

1.3.0 20251001

BF: Python: Corrections/optimizations of dependencies
BF: synthetics: Correction for empty templates / missing segments
BF: synthetics: Consistent use of the parameter pattern for RegEx in spans
CR: Python: Increased the requirement to Python 3.10 or higher
CR: synthetics: Added schema and validation for template YAML
CR: synthetics: Added custom filters for template rendering
CR: synthetics: Template section span-regex: added support for labels

Contact

  • Issues
  • Requests
  • Mail

Download files

Download the file for your platform.

Source Distribution

seanox_ai_nlp-1.3.0.tar.gz (203.8 kB)

Uploaded Source

Built Distribution

seanox_ai_nlp-1.3.0-py3-none-any.whl (72.5 kB)

Uploaded Python 3

File details

Details for the file seanox_ai_nlp-1.3.0.tar.gz.

File metadata

  • Download URL: seanox_ai_nlp-1.3.0.tar.gz
  • Upload date:
  • Size: 203.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for seanox_ai_nlp-1.3.0.tar.gz

  • SHA256: 525427a99390e4ba66cafc8513200873634d471dacbc746d7f52bdfc6f56b3d4
  • MD5: 0870dee82ce79bce68b190b2aed553b2
  • BLAKE2b-256: 0fc2bb6c223b3e83e1cc5f53f719a77e160c37cebc61e8ad033a14ab91cd1ee8

File details

Details for the file seanox_ai_nlp-1.3.0-py3-none-any.whl.

File metadata

  • Download URL: seanox_ai_nlp-1.3.0-py3-none-any.whl
  • Upload date:
  • Size: 72.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for seanox_ai_nlp-1.3.0-py3-none-any.whl

  • SHA256: 0e8109f38405bceda34b8e0aaa3f48c9688f9aed06bb6e80807ae6e95ba34a65
  • MD5: 7898c5dcd8fa89218de8cc9899a8817c
  • BLAKE2b-256: 86152a22a81053ba831d49a486165708de76516683534783e8adb52c3cbc4281
