Offline semantic numeric lexicon - deterministic word-to-code mapping

Oyemi

Offline Semantic Numeric Lexicon

Oyemi is a deterministic, high-performance semantic encoding library. It maps words to numeric codes that capture semantic meaning, enabling fast similarity calculations and synonym discovery without neural networks.

Key Features

  • Zero Runtime Dependencies - No WordNet, NLTK, or ML models needed at runtime
  • Deterministic Codes - Same word always produces same codes
  • Fast Lookups - SQLite with memory mapping (~0.01ms per lookup)
  • TRUE Synonym Finder - Find synonyms using WordNet synset matching
  • Antonym Detection - Antonym pairs (happy/unhappy, good/bad) are identified and given low similarity scores
  • Semantic Distance - Calculate word similarity using codes
  • Sentiment/Valence - Built-in positive/negative classification (SentiWordNet + antonym inference)
  • Lemma Fallback - Automatically handles word variations (running -> run)
  • Polysemy Support - Multiple codes for words with multiple meanings

Installation

pip install oyemi

Quick Start

from Oyemi import encode, semantic_similarity, find_synonyms, Encoder

# Simple encoding
codes = encode("happy")
# ['0122-00042-3-2-1']

# Check similarity
sim = semantic_similarity("happy", "joyful")
# 0.85

# Find TRUE synonyms
synonyms = find_synonyms("fear")
# ['awe', 'dread', 'fright', 'concern']

# Full encoder with parsed codes
enc = Encoder()
parsed = enc.encode_parsed("worried")
for code in parsed:
    print(f"{code.raw}: {code.pos_name}, {code.valence_name}")
    # 3999-04518-3-1-2: adjective, negative

Code Format

Codes follow the format: HHHH-LLLLL-P-A-V

Component  Meaning              Values
HHHH       Semantic superclass  0001-9999 (100+ categories)
LLLLL      Synset ID            00001-99999
P          Part of speech       1=noun, 2=verb, 3=adj, 4=adv
A          Abstractness         0=concrete, 1=mixed, 2=abstract
V          Valence              0=neutral, 1=positive, 2=negative
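
To make the layout concrete, here is a minimal sketch that splits a code string into its five fields using only the standard library. It is not Oyemi's own parser: `ParsedCode` and `parse_code` are hypothetical names introduced for illustration, and the value tables are copied from the Code Format table above.

```python
from dataclasses import dataclass

# Value tables copied from the Code Format table above.
POS_NAMES = {1: "noun", 2: "verb", 3: "adjective", 4: "adverb"}
ABSTRACTNESS_NAMES = {0: "concrete", 1: "mixed", 2: "abstract"}
VALENCE_NAMES = {0: "neutral", 1: "positive", 2: "negative"}


@dataclass(frozen=True)
class ParsedCode:  # hypothetical stand-in, not Oyemi's SemanticCode
    superclass: str
    synset_id: str
    pos: int
    abstractness: int
    valence: int


def parse_code(raw: str) -> ParsedCode:
    """Split an HHHH-LLLLL-P-A-V string into its five fields."""
    parts = raw.split("-")
    widths = (4, 5, 1, 1, 1)
    # Every field must be ASCII digits of the expected width.
    if len(parts) != 5 or any(
        len(p) != w or not (p.isascii() and p.isdigit())
        for p, w in zip(parts, widths)
    ):
        raise ValueError(f"malformed code: {raw!r}")
    hhhh, lllll, pos, abstractness, valence = parts
    parsed = ParsedCode(hhhh, lllll, int(pos), int(abstractness), int(valence))
    if (parsed.pos not in POS_NAMES
            or parsed.abstractness not in ABSTRACTNESS_NAMES
            or parsed.valence not in VALENCE_NAMES):
        raise ValueError(f"field value out of range: {raw!r}")
    return parsed


code = parse_code("0122-00042-3-2-1")
print(code.superclass, POS_NAMES[code.pos], VALENCE_NAMES[code.valence])
# 0122 adjective positive
```

In real use, `Encoder.encode_parsed` already returns parsed objects; the sketch only shows what the five dash-separated fields carry.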

Superclass Categories

Oyemi includes 100+ semantic categories for precise classification:

Range      Domain             Examples
0100-0199  Emotions           fear, joy, anger, sadness
0200-0299  Work/Business      job, salary, manager, career
0300-0399  Communication      speak, write, message
0400-0499  Cognition          think, know, believe
0500-0599  Social             family, friend, group
1000-1999  Physical/Concrete  object, place, body
2000-2999  Actions            move, create, change
3000-3999  Properties         size, color, quality
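
As an illustration of how these ranges can be used, the sketch below maps a code string to its domain name. The range list is copied from the table above and covers only the ranges shown there; `SUPERCLASS_RANGES` and `domain_of` are hypothetical helpers, not part of the Oyemi API.

```python
from typing import Optional

# (low, high, domain) triples copied from the table above.
SUPERCLASS_RANGES = [
    (100, 199, "Emotions"),
    (200, 299, "Work/Business"),
    (300, 399, "Communication"),
    (400, 499, "Cognition"),
    (500, 599, "Social"),
    (1000, 1999, "Physical/Concrete"),
    (2000, 2999, "Actions"),
    (3000, 3999, "Properties"),
]


def domain_of(code: str) -> Optional[str]:
    """Return the domain for a code's HHHH field, or None if unlisted."""
    superclass = int(code.split("-")[0])
    for low, high, name in SUPERCLASS_RANGES:
        if low <= superclass <= high:
            return name
    return None


print(domain_of("0122-00042-3-2-1"))  # Emotions
print(domain_of("2001-00042-2-1-0"))  # Actions
```

Superclasses outside the listed ranges return `None`, since the table shows only a selection of the 100+ categories.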

API Reference

Encoding

from Oyemi import encode, Encoder

# Simple function
codes = encode("word")  # Returns List[str]

# Full encoder
enc = Encoder()
codes = enc.encode("word")              # List[str]
parsed = enc.encode_parsed("word")      # List[SemanticCode]
primary = enc.get_primary_code("word")  # str
exists = enc.contains("word")           # bool
batch = enc.encode_batch(["a", "b"])    # Dict[str, List[str]]

Synonym Discovery

from Oyemi import find_synonyms, Encoder

# Simple usage
synonyms = find_synonyms("fear")
# ['awe', 'dread', 'fright', 'concern']

# With filters (both locks are enabled by default)
enc = Encoder()
synonyms = enc.find_synonyms(
    "fear",
    limit=10,
    pos_lock=True,           # Only same part-of-speech
    abstractness_lock=True,  # Don't mix abstract/concrete
    return_weighted=False    # Return list of words
)

# Get weighted synonyms (for ranking)
weighted = enc.find_synonyms("fear", return_weighted=True)
# [('dread', 1.0), ('fright', 1.0), ('awe', 0.5)]
# Weight 1.0 = same superclass, 0.5 = different superclass

How it works: Words with the same HHHH-LLLLL (superclass + synset ID) are TRUE synonyms - they come from the same WordNet synset.
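
The prefix rule can be sketched in a few lines. This is not the library's implementation: the lexicon below is toy data with made-up codes, and `synset_key` and `synonyms_of` are hypothetical names used only to show the matching idea.

```python
from typing import Dict, List

# Toy data: the codes are invented for illustration, not taken from Oyemi.
TOY_LEXICON: Dict[str, List[str]] = {
    "fear": ["0101-00007-1-2-2", "0101-00008-2-2-2"],
    "dread": ["0101-00007-1-2-2"],
    "fright": ["0101-00007-1-2-2"],
    "joy": ["0102-00031-1-2-1"],
}


def synset_key(code: str) -> str:
    """Return the HHHH-LLLLL prefix (4 digits, a dash, 5 digits)."""
    return code[:10]


def synonyms_of(word: str, lexicon: Dict[str, List[str]]) -> List[str]:
    """Words sharing at least one HHHH-LLLLL prefix with `word`."""
    keys = {synset_key(code) for code in lexicon.get(word, [])}
    return sorted(
        other
        for other, codes in lexicon.items()
        if other != word and any(synset_key(c) in keys for c in codes)
    )


print(synonyms_of("fear", TOY_LEXICON))  # ['dread', 'fright']
```

Because a polysemous word carries several codes, a match on any one of its prefixes is enough to count as a synonym in this sketch.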

Similarity

from Oyemi import semantic_similarity, word_distance, find_similar

# Similarity (0-1, higher = more similar)
sim = semantic_similarity("cat", "dog")

# Distance with details
dist, result = word_distance("cat", "dog")
print(result.shared_superclass)  # True
print(result.same_pos)           # True

# Find similar words from candidates
similar = find_similar("happy", ["sad", "joyful", "angry", "content"])
# [("joyful", 0.85), ("content", 0.72), ...]

Clustering

from Oyemi import cluster_by_superclass

words = ["dog", "cat", "run", "walk", "happy", "sad"]
clusters = cluster_by_superclass(words)
# {'0011': ['dog', 'cat'], '2002': ['run', 'walk'], ...}

Sentiment/Valence

from Oyemi import Encoder

enc = Encoder()

# Check word valence
for word in ["happy", "sad", "worried", "excellent", "terrible"]:
    parsed = enc.encode_parsed(word)
    if parsed:
        valence = parsed[0].valence_name
        print(f"{word}: {valence}")

# Output:
# happy: positive
# sad: negative
# worried: negative
# excellent: positive
# terrible: negative
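
Since valence is the last field of every code, a rough tally over a list of codes needs no further lookup. The sketch below assumes you already hold one code string per word (for example from `get_primary_code`); `valence_counts` is a hypothetical helper, not part of the Oyemi API.

```python
from collections import Counter
from typing import Iterable

# Last field of HHHH-LLLLL-P-A-V, per the Code Format section.
VALENCE_NAMES = {"0": "neutral", "1": "positive", "2": "negative"}


def valence_counts(codes: Iterable[str]) -> Counter:
    """Count valence labels read from the final field of each code."""
    return Counter(VALENCE_NAMES[code.rsplit("-", 1)[1]] for code in codes)


# Example codes taken from elsewhere in this README.
codes = ["0122-00042-3-2-1", "3999-04518-3-1-2", "2001-00042-2-1-0"]
counts = valence_counts(codes)
print(counts["positive"], counts["negative"], counts["neutral"])  # 1 1 1
```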

SemanticCode Object

from Oyemi import Encoder

enc = Encoder()
codes = enc.encode_parsed("run")

for code in codes:
    print(code.raw)              # "2001-00042-2-1-0"
    print(code.superclass)       # "2001"
    print(code.synset_id)        # "00042"
    print(code.pos)              # 2
    print(code.pos_name)         # "verb"
    print(code.abstractness)     # 1
    print(code.abstractness_name)# "mixed"
    print(code.valence)          # 0
    print(code.valence_name)     # "neutral"

Exceptions

from Oyemi import (
    encode,               # Needed for the call below
    OyemiError,           # Base exception
    UnknownWordError,     # Word not in lexicon
    LexiconNotFoundError, # Database file missing
    InvalidCodeError,     # Malformed code string
)

try:
    codes = encode("xyznotaword")
except UnknownWordError as e:
    print(f"Unknown: {e.word}")
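
When processing many words, it is often handier to collect unknown words than to stop at the first one. The sketch below shows one way to do that. To stay runnable without the package installed, it takes the encode function and the exception class as parameters; `split_known_unknown`, `fake_encode`, and `FakeUnknownWordError` are hypothetical stand-ins, and in practice you would pass `encode` and `UnknownWordError` from Oyemi.

```python
from typing import Callable, Dict, Iterable, List, Tuple, Type


def split_known_unknown(
    words: Iterable[str],
    encode_fn: Callable[[str], List[str]],
    unknown_exc: Type[Exception],
) -> Tuple[Dict[str, List[str]], List[str]]:
    """Encode each word; collect the ones that raise `unknown_exc`."""
    known: Dict[str, List[str]] = {}
    unknown: List[str] = []
    for word in words:
        try:
            known[word] = encode_fn(word)
        except unknown_exc:
            unknown.append(word)
    return known, unknown


# Stand-ins so the sketch runs without Oyemi installed.
class FakeUnknownWordError(Exception):
    pass


_TOY = {"happy": ["0122-00042-3-2-1"]}


def fake_encode(word: str) -> List[str]:
    if word not in _TOY:
        raise FakeUnknownWordError(word)
    return _TOY[word]


known, unknown = split_known_unknown(
    ["happy", "xyznotaword"], fake_encode, FakeUnknownWordError
)
print(known)    # {'happy': ['0122-00042-3-2-1']}
print(unknown)  # ['xyznotaword']
```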

Building the Lexicon

The lexicon database is pre-built and included. To rebuild from WordNet:

# Install build dependencies
pip install oyemi[build]

# Download WordNet and SentiWordNet
python -c "import nltk; nltk.download('wordnet'); nltk.download('omw-1.4'); nltk.download('sentiwordnet')"

# Build lexicon
python tools/build_lexicon.py

# Validate
python tools/validate_lexicon.py

Use Cases

  • Taxonomy Expansion - Expand keyword lists with true synonyms
  • Fast Text Similarity - Compare documents without embeddings
  • Sentiment Analysis - Quick valence detection
  • Semantic Clustering - Group words by meaning
  • Feature Engineering - Convert words to numeric features
  • Offline NLP - No API calls or model downloads
  • Deterministic Pipelines - Reproducible results

Example: Expand Sentiment Keywords

from Oyemi import Encoder

enc = Encoder()

# Original keywords
negative_words = ["fear", "worried", "anxious", "stressed"]

# Expand with synonyms
expanded = set(negative_words)
for word in negative_words:
    synonyms = enc.find_synonyms(word, limit=5)
    expanded.update(synonyms)

print(f"Expanded: {len(negative_words)} -> {len(expanded)} words")
# Example output: Expanded: 4 -> 20 words (the exact count depends on the lexicon)

Performance

Operation           Time
Single lookup       ~0.01ms
Batch (1000 words)  ~5ms
Similarity          ~0.02ms
Find synonyms       ~0.1ms
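
Timings like these vary with hardware and disk cache, so it can be worth measuring on your own machine. The following is a minimal sketch of a per-call latency measurement using `time.perf_counter`. `mean_latency_ms` is a hypothetical helper, and the dictionary lookup is only a placeholder; swap in `encode` or `semantic_similarity` to time Oyemi itself.

```python
import time
from typing import Callable, Iterable


def mean_latency_ms(
    fn: Callable[[str], object], words: Iterable[str], repeats: int = 100
) -> float:
    """Average wall-clock time per call of fn(word), in milliseconds."""
    word_list = list(words)
    if not word_list or repeats < 1:
        raise ValueError("need at least one word and one repeat")
    start = time.perf_counter()
    for _ in range(repeats):
        for word in word_list:
            fn(word)
    elapsed = time.perf_counter() - start
    return elapsed * 1000 / (repeats * len(word_list))


# Placeholder lookup; replace with the function you want to measure.
toy_lookup = {"happy": ["0122-00042-3-2-1"]}.get
print(f"{mean_latency_ms(toy_lookup, ['happy', 'sad']):.5f} ms per call")
```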

Versioning

  • Codes never change once released (semantic stability)
  • New words get new codes in minor versions
  • Schema changes require major version bump
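
If you depend on the stability guarantee, one way to check it across an upgrade is to snapshot the codes for your vocabulary before and after. The sketch below compares two such snapshots. `dropped_codes` is a hypothetical helper, and the snapshot format (word to list of codes, as returned by `encode_batch`) is an assumption about how you store them.

```python
from typing import Dict, List


def dropped_codes(
    old: Dict[str, List[str]], new: Dict[str, List[str]]
) -> Dict[str, List[str]]:
    """Codes present in the old snapshot but missing from the new one."""
    dropped: Dict[str, List[str]] = {}
    for word, codes in old.items():
        missing = sorted(set(codes) - set(new.get(word, [])))
        if missing:
            dropped[word] = missing
    return dropped


# Made-up snapshots for illustration.
before = {"run": ["2001-00042-2-1-0"]}
after = {"run": ["2001-00042-2-1-0"], "jog": ["2001-00043-2-1-0"]}
print(dropped_codes(before, after))  # {}
```

An empty result means every previously recorded code is still present; newly added words or codes are ignored, which matches the policy above.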

Author

Kaossara Osseni

License

MIT License

Download files

Version 3.1.0. Both files were uploaded via twine/6.2.0 CPython/3.11.9, without Trusted Publishing.

oyemi-3.1.0.tar.gz (source distribution, 11.1 MB)
  SHA256       8ee654c11254f97b05cfec1e6f519e5cc71d3c206b2968cf11bf49597f08c5c1
  MD5          b4a0503917438f0d8c99f45e6dbc9a2c
  BLAKE2b-256  f6a233c612c5c250c64e98a8b605bbed2de8de15ec0a237a651bae9ea9c9b423

oyemi-3.1.0-py3-none-any.whl (built distribution, Python 3, 11.1 MB)
  SHA256       afb8165fc980d8941cb43360cce8e4082db61465456c0e1588f0842f7565c630
  MD5          343c55145fe52c67018a708b48086ce5
  BLAKE2b-256  ed4aaa8fe76188422b1eb7fd0d8510719b07e4db0c01b5ae4218bfcd8e1953d9