
Offline semantic numeric lexicon - deterministic word-to-code mapping

Oyemi

Offline Semantic Numeric Lexicon

Oyemi is a deterministic, high-performance semantic encoding library. It maps words to numeric codes that capture semantic meaning, enabling fast similarity calculations and synonym discovery without neural networks.

Key Features

  • Zero Runtime Dependencies - No WordNet, NLTK, or ML models needed at runtime
  • Deterministic Codes - Same word always produces same codes
  • Fast Lookups - SQLite with memory mapping (~0.01ms per lookup)
  • TRUE Synonym Finder - Find synonyms using WordNet synset matching
  • Antonym Detection - Identify antonym pairs (happy/unhappy, good/bad) and score them as low similarity
  • Semantic Distance - Calculate word similarity using codes
  • Sentiment/Valence - Built-in positive/negative classification (SentiWordNet + antonym inference)
  • Lemma Fallback - Automatically handles word variations (running -> run)
  • Polysemy Support - Multiple codes for words with multiple meanings

Installation

pip install oyemi
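
To confirm that the install succeeded, the standard library can report the installed distribution version. This is a minimal sketch, assuming the distribution name is `oyemi` as in the pip command; `installed_version` is a hypothetical helper and not part of the library.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Prints a version string if the package is installed, otherwise None.
print(installed_version("oyemi"))
```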

Quick Start

from Oyemi import encode, semantic_similarity, find_synonyms, Encoder

# Simple encoding
codes = encode("happy")
# ['0122-00042-3-2-1']

# Check similarity
sim = semantic_similarity("happy", "joyful")
# 0.85

# Find TRUE synonyms
synonyms = find_synonyms("fear")
# ['awe', 'dread', 'fright', 'concern']

# Full encoder with parsed codes
enc = Encoder()
parsed = enc.encode_parsed("worried")
for code in parsed:
    print(f"{code.raw}: {code.pos_name}, {code.valence_name}")
    # 3999-04518-3-1-2: adjective, negative

Code Format

Codes follow the format: HHHH-LLLLL-P-A-V

Component  Meaning              Values
HHHH       Semantic superclass  0001-9999 (100+ categories)
LLLLL      Synset ID            00001-99999
P          Part of speech       1=noun, 2=verb, 3=adj, 4=adv
A          Abstractness         0=concrete, 1=mixed, 2=abstract
V          Valence              0=neutral, 1=positive, 2=negative
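
Because the format is fixed-width, a code string can be split without the library. The sketch below is a standalone illustration of the HHHH-LLLLL-P-A-V layout; `ParsedCode` and `parse_code` are hypothetical names, not the library's `SemanticCode` class, and the validation checks only the digit widths and value ranges listed above.

```python
import re
from dataclasses import dataclass

# Value tables copied from the format description.
POS_NAMES = {1: "noun", 2: "verb", 3: "adj", 4: "adv"}
ABSTRACTNESS_NAMES = {0: "concrete", 1: "mixed", 2: "abstract"}
VALENCE_NAMES = {0: "neutral", 1: "positive", 2: "negative"}

# HHHH-LLLLL-P-A-V with the per-field digit ranges from the table.
CODE_PATTERN = re.compile(r"([0-9]{4})-([0-9]{5})-([1-4])-([0-2])-([0-2])")


@dataclass(frozen=True)
class ParsedCode:
    """Hypothetical container; not the library's SemanticCode class."""
    superclass: str
    synset_id: str
    pos: int
    abstractness: int
    valence: int


def parse_code(raw: str) -> ParsedCode:
    """Split a raw code string into its five components."""
    match = CODE_PATTERN.fullmatch(raw)
    if match is None:
        raise ValueError(f"not a valid HHHH-LLLLL-P-A-V code: {raw!r}")
    hhhh, lllll, p, a, v = match.groups()
    return ParsedCode(hhhh, lllll, int(p), int(a), int(v))


code = parse_code("0122-00042-3-2-1")  # the "happy" code from Quick Start
print(code.superclass, POS_NAMES[code.pos],
      ABSTRACTNESS_NAMES[code.abstractness], VALENCE_NAMES[code.valence])
# 0122 adj abstract positive
```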

Superclass Categories

Oyemi includes 100+ semantic categories for precise classification:

Range      Domain             Examples
0100-0199  Emotions           fear, joy, anger, sadness
0200-0299  Work/Business      job, salary, manager, career
0300-0399  Communication      speak, write, message
0400-0499  Cognition          think, know, believe
0500-0599  Social             family, friend, group
1000-1999  Physical/Concrete  object, place, body
2000-2999  Actions            move, create, change
3000-3999  Properties         size, color, quality
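
The ranges in this table can be turned into a small lookup that labels a code with its domain. This is a rough sketch that covers only the rows listed here; `domain_for` is a hypothetical helper, and any superclass outside the listed ranges comes back as `None` because the full category list is not shown in this document.

```python
from typing import Optional

# (low, high, domain) rows copied from the table; other ranges are unknown
# to this sketch.
DOMAIN_RANGES = [
    (100, 199, "Emotions"),
    (200, 299, "Work/Business"),
    (300, 399, "Communication"),
    (400, 499, "Cognition"),
    (500, 599, "Social"),
    (1000, 1999, "Physical/Concrete"),
    (2000, 2999, "Actions"),
    (3000, 3999, "Properties"),
]


def domain_for(code_or_superclass: str) -> Optional[str]:
    """Map a full code or a bare HHHH superclass to its domain name."""
    superclass = int(code_or_superclass.split("-")[0])
    for low, high, domain in DOMAIN_RANGES:
        if low <= superclass <= high:
            return domain
    return None


print(domain_for("0122-00042-3-2-1"))  # Emotions
print(domain_for("3999-04518-3-1-2"))  # Properties
print(domain_for("0011"))              # None
```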

API Reference

Encoding

from Oyemi import encode, Encoder

# Simple function
codes = encode("word")  # Returns List[str]

# Full encoder
enc = Encoder()
codes = enc.encode("word")              # List[str]
parsed = enc.encode_parsed("word")      # List[SemanticCode]
primary = enc.get_primary_code("word")  # str
exists = enc.contains("word")           # bool
batch = enc.encode_batch(["a", "b"])    # Dict[str, List[str]]

Synonym Discovery

from Oyemi import find_synonyms, Encoder

# Simple usage
synonyms = find_synonyms("fear")
# ['awe', 'dread', 'fright', 'concern']

# With filters (both locks are enabled by default)
enc = Encoder()
synonyms = enc.find_synonyms(
    "fear",
    limit=10,
    pos_lock=True,           # Only same part-of-speech
    abstractness_lock=True,  # Don't mix abstract/concrete
    return_weighted=False    # Return list of words
)

# Get weighted synonyms (for ranking)
weighted = enc.find_synonyms("fear", return_weighted=True)
# [('dread', 1.0), ('fright', 1.0), ('awe', 0.5)]
# Weight 1.0 = same superclass, 0.5 = different superclass

How it works: Words with the same HHHH-LLLLL (superclass + synset ID) are TRUE synonyms - they come from the same WordNet synset.
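
The matching rule can be illustrated without the library: two words count as synonyms when any of their codes share the same HHHH-LLLLL prefix. The sketch below uses a toy lexicon whose codes are invented for illustration (they are not real Oyemi codes), and `toy_synonyms` is a hypothetical stand-in that ignores the `pos_lock`, `abstractness_lock`, and weighting options.

```python
from typing import Dict, List

# Invented codes for illustration only; a real lexicon would supply these.
TOY_LEXICON: Dict[str, List[str]] = {
    "fear":   ["0101-00310-1-2-2", "0101-00922-2-2-2"],
    "dread":  ["0101-00310-1-2-2", "0101-00922-2-2-2"],
    "fright": ["0101-00310-1-2-2"],
    "joy":    ["0102-00777-1-2-1"],
}


def synset_key(code: str) -> str:
    """Return the HHHH-LLLLL prefix that identifies a synset."""
    return "-".join(code.split("-")[:2])


def toy_synonyms(word: str, lexicon: Dict[str, List[str]]) -> List[str]:
    """Words that share at least one synset key with `word`."""
    keys = {synset_key(code) for code in lexicon.get(word, [])}
    return sorted(
        other
        for other, codes in lexicon.items()
        if other != word and any(synset_key(code) in keys for code in codes)
    )


print(toy_synonyms("fear", TOY_LEXICON))  # ['dread', 'fright']
print(toy_synonyms("joy", TOY_LEXICON))   # []
```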

Similarity

from Oyemi import semantic_similarity, word_distance, find_similar

# Similarity (0-1, higher = more similar)
sim = semantic_similarity("cat", "dog")

# Distance with details
dist, result = word_distance("cat", "dog")
print(result.shared_superclass)  # True
print(result.same_pos)           # True

# Find similar words from candidates
similar = find_similar("happy", ["sad", "joyful", "angry", "content"])
# [("joyful", 0.85), ("content", 0.72), ...]

Clustering

from Oyemi import cluster_by_superclass

words = ["dog", "cat", "run", "walk", "happy", "sad"]
clusters = cluster_by_superclass(words)
# {'0011': ['dog', 'cat'], '2002': ['run', 'walk'], ...}

Sentiment/Valence

from Oyemi import Encoder

enc = Encoder()

# Check word valence
for word in ["happy", "sad", "worried", "excellent", "terrible"]:
    parsed = enc.encode_parsed(word)
    if parsed:
        valence = parsed[0].valence_name
        print(f"{word}: {valence}")

# Output:
# happy: positive
# sad: negative
# worried: negative
# excellent: positive
# terrible: negative
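
Per-word valence can be rolled up into a rough score for a whole sentence. The sketch below is one simple way to do that, not a method the library provides: `valence_score` is a hypothetical helper, and the lookup table is a stand-in built from the output above. With Oyemi installed, the lookup could instead wrap `enc.encode_parsed(word)[0].valence_name`.

```python
from typing import Callable, Iterable, Optional

# Stand-in lookup taken from the example output; unknown words return None.
TOY_VALENCE = {
    "happy": "positive",
    "sad": "negative",
    "worried": "negative",
    "excellent": "positive",
    "terrible": "negative",
}


def valence_score(
    tokens: Iterable[str],
    valence_of: Callable[[str], Optional[str]],
) -> int:
    """Count positive words minus negative words; other words are ignored."""
    score = 0
    for token in tokens:
        valence = valence_of(token.lower())
        if valence == "positive":
            score += 1
        elif valence == "negative":
            score -= 1
    return score


text = "The food was excellent but the service was terrible and I left worried"
print(valence_score(text.split(), TOY_VALENCE.get))  # -1
```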

SemanticCode Object

from Oyemi import Encoder

enc = Encoder()
codes = enc.encode_parsed("run")

for code in codes:
    print(code.raw)              # "2001-00042-2-1-0"
    print(code.superclass)       # "2001"
    print(code.synset_id)        # "00042"
    print(code.pos)              # 2
    print(code.pos_name)         # "verb"
    print(code.abstractness)     # 1
    print(code.abstractness_name)# "mixed"
    print(code.valence)          # 0
    print(code.valence_name)     # "neutral"

Exceptions

from Oyemi import (
    encode,               # Encoding function used in the try block
    OyemiError,           # Base exception
    UnknownWordError,     # Word not in lexicon
    LexiconNotFoundError, # Database file missing
    InvalidCodeError,     # Malformed code string
)

try:
    codes = encode("xyznotaword")
except UnknownWordError as e:
    print(f"Unknown: {e.word}")

Building the Lexicon

The lexicon database is pre-built and included. To rebuild from WordNet:

# Install build dependencies
pip install oyemi[build]

# Download WordNet and SentiWordNet
python -c "import nltk; nltk.download('wordnet'); nltk.download('omw-1.4'); nltk.download('sentiwordnet')"

# Build lexicon
python tools/build_lexicon.py

# Validate
python tools/validate_lexicon.py

Use Cases

  • Taxonomy Expansion - Expand keyword lists with true synonyms
  • Fast Text Similarity - Compare documents without embeddings
  • Sentiment Analysis - Quick valence detection
  • Semantic Clustering - Group words by meaning
  • Feature Engineering - Convert words to numeric features
  • Offline NLP - No API calls or model downloads
  • Deterministic Pipelines - Reproducible results
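
For the feature engineering use case, a code string can be unpacked into a flat numeric vector. The layout below is only one possible encoding and an assumption of this sketch (`code_to_features` is a hypothetical helper): the superclass is kept as an integer ID, abstractness as an ordinal, and part of speech and valence are one-hot encoded.

```python
from typing import List


def code_to_features(raw: str) -> List[int]:
    """Turn an HHHH-LLLLL-P-A-V code into a 9-element feature vector.

    Layout: [superclass, abstractness, pos one-hot (4), valence one-hot (3)].
    The synset ID is dropped because it is an identifier, not a magnitude.
    """
    hhhh, _lllll, p, a, v = raw.split("-")
    pos_onehot = [0, 0, 0, 0]
    pos_onehot[int(p) - 1] = 1      # P runs from 1 to 4
    valence_onehot = [0, 0, 0]
    valence_onehot[int(v)] = 1      # V runs from 0 to 2
    return [int(hhhh), int(a)] + pos_onehot + valence_onehot


print(code_to_features("0122-00042-3-2-1"))
# [122, 2, 0, 0, 1, 0, 0, 1, 0]
```

The superclass value is categorical, so a model that treats inputs as continuous may need it one-hot encoded or embedded rather than used as a raw number.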

Example: Expand Sentiment Keywords

from Oyemi import Encoder

enc = Encoder()

# Original keywords
negative_words = ["fear", "worried", "anxious", "stressed"]

# Expand with synonyms
expanded = set(negative_words)
for word in negative_words:
    synonyms = enc.find_synonyms(word, limit=5)
    expanded.update(synonyms)

print(f"Expanded: {len(negative_words)} -> {len(expanded)} words")
# e.g. Expanded: 4 -> 20 words (the exact count depends on the lexicon)

Performance

Operation           Time
Single lookup       ~0.01 ms
Batch (1000 words)  ~5 ms
Similarity          ~0.02 ms
Find synonyms       ~0.1 ms
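
Timings like these vary by machine, so it can be useful to measure them locally. The sketch below uses the standard `timeit` module; `mean_call_ms` is a hypothetical helper, and a plain dictionary lookup stands in for the library so the snippet runs on its own. With Oyemi installed, `encode` could be passed in place of the stand-in.

```python
import timeit
from typing import Any, Callable


def mean_call_ms(fn: Callable[[Any], Any], arg: Any, repeats: int = 10_000) -> float:
    """Average wall-clock time of fn(arg) in milliseconds over `repeats` calls."""
    total_seconds = timeit.timeit(lambda: fn(arg), number=repeats)
    return total_seconds / repeats * 1000.0


# Stand-in for a lexicon lookup; swap in `encode` from Oyemi to time the library.
toy_lexicon = {"happy": ["0122-00042-3-2-1"]}
print(f"{mean_call_ms(toy_lexicon.get, 'happy'):.5f} ms per lookup")
```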

Versioning

  • Codes never change once released (semantic stability)
  • New words get new codes in minor versions
  • Schema changes require major version bump
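
The stability guarantee can be checked in a pipeline by comparing code snapshots taken under two releases, for example the dictionaries returned by `encode_batch`. The sketch below is a hypothetical check (`stability_violations` is not part of the library), and it treats any difference in an existing word's code list as a violation, which may be stricter than the policy intends. The "worried" and changed "run" entries in the toy data are for illustration only.

```python
from typing import Dict, List

Snapshot = Dict[str, List[str]]  # word -> codes, as returned by encode_batch


def stability_violations(old: Snapshot, new: Snapshot) -> Dict[str, str]:
    """Report words from `old` whose codes are missing or differ in `new`."""
    problems: Dict[str, str] = {}
    for word, old_codes in old.items():
        if word not in new:
            problems[word] = "missing in new version"
        elif new[word] != old_codes:
            problems[word] = f"codes changed: {old_codes} -> {new[word]}"
    return problems


old = {"happy": ["0122-00042-3-2-1"], "run": ["2001-00042-2-1-0"]}
# A new word is allowed; existing entries are unchanged.
new_ok = {**old, "worried": ["3999-04518-3-1-2"]}
# An existing word whose code changed is flagged.
new_bad = {**old, "run": ["2002-00042-2-1-0"]}

print(stability_violations(old, new_ok))   # {}
print(sorted(stability_violations(old, new_bad)))  # ['run']
```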

Author

Kaossara Osseni

License

MIT License
