Oyemi

Offline Semantic Numeric Lexicon


Oyemi is a deterministic, high-performance semantic encoding library. It maps words to numeric codes that capture semantic meaning, enabling fast similarity calculations and synonym discovery without neural networks.

Key Features

  • Zero Runtime Dependencies - No WordNet, NLTK, or ML models needed at runtime
  • Deterministic Codes - Same word always produces same codes
  • Fast Lookups - SQLite with memory mapping (~0.01ms per lookup)
  • TRUE Synonym Finder - Find synonyms using WordNet synset matching
  • Semantic Distance - Calculate word similarity using codes
  • Sentiment/Valence - Built-in positive/negative classification (SentiWordNet)
  • Lemma Fallback - Automatically handles word variations (running -> run)
  • Polysemy Support - Multiple codes for words with multiple meanings

Installation

pip install oyemi

Quick Start

from Oyemi import encode, semantic_similarity, find_synonyms, Encoder

# Simple encoding
codes = encode("happy")
# ['0122-00042-3-2-1']

# Check similarity
sim = semantic_similarity("happy", "joyful")
# 0.85

# Find TRUE synonyms
synonyms = find_synonyms("fear")
# ['awe', 'dread', 'fright', 'concern']

# Full encoder with parsed codes
enc = Encoder()
parsed = enc.encode_parsed("worried")
for code in parsed:
    print(f"{code.raw}: {code.pos_name}, {code.valence_name}")
    # 3999-04518-3-1-2: adjective, negative

Code Format

Codes follow the format: HHHH-LLLLL-P-A-V

Component  Meaning              Values
---------  -------------------  ---------------------------------
HHHH       Semantic superclass  0001-9999 (100+ categories)
LLLLL      Synset ID            00001-99999
P          Part of speech       1=noun, 2=verb, 3=adj, 4=adv
A          Abstractness         0=concrete, 1=mixed, 2=abstract
V          Valence              0=neutral, 1=positive, 2=negative
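As a worked example of the format, the five components can be split out of a raw code string with plain string handling. This is an illustrative parser only (the library's own SemanticCode object exposes these fields directly); the name parse_code and the dicts below are hypothetical, though the value tables they encode come from the format description above.

```python
# Illustrative parser for the documented HHHH-LLLLL-P-A-V format.
POS_NAMES = {1: "noun", 2: "verb", 3: "adjective", 4: "adverb"}
ABSTRACTNESS_NAMES = {0: "concrete", 1: "mixed", 2: "abstract"}
VALENCE_NAMES = {0: "neutral", 1: "positive", 2: "negative"}

def parse_code(code: str) -> dict:
    """Split a code string into its five documented components."""
    superclass, synset_id, p, a, v = code.split("-")
    return {
        "superclass": superclass,
        "synset_id": synset_id,
        "pos": POS_NAMES[int(p)],
        "abstractness": ABSTRACTNESS_NAMES[int(a)],
        "valence": VALENCE_NAMES[int(v)],
    }

print(parse_code("3999-04518-3-1-2"))
# {'superclass': '3999', 'synset_id': '04518', 'pos': 'adjective',
#  'abstractness': 'mixed', 'valence': 'negative'}
```

The example code string is the one shown for "worried" in the Quick Start, and the parsed fields match its printed output there.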

Superclass Categories

Oyemi includes 100+ semantic categories for precise classification:

Range      Domain             Examples
---------  -----------------  ----------------------------
0100-0199  Emotions           fear, joy, anger, sadness
0200-0299  Work/Business      job, salary, manager, career
0300-0399  Communication      speak, write, message
0400-0499  Cognition          think, know, believe
0500-0599  Social             family, friend, group
1000-1999  Physical/Concrete  object, place, body
2000-2999  Actions            move, create, change
3000-3999  Properties         size, color, quality
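Because domains are numeric ranges over the HHHH component, a code can be mapped to a domain with a simple range lookup. The sketch below covers only the ranges listed above (the full taxonomy has many more); domain_of is a hypothetical helper, not part of the Oyemi API.

```python
# Range lookup over the documented superclass categories (partial list).
DOMAIN_RANGES = [
    (100, 199, "Emotions"),
    (200, 299, "Work/Business"),
    (300, 399, "Communication"),
    (400, 499, "Cognition"),
    (500, 599, "Social"),
    (1000, 1999, "Physical/Concrete"),
    (2000, 2999, "Actions"),
    (3000, 3999, "Properties"),
]

def domain_of(code: str) -> str:
    """Return the domain whose range contains the code's HHHH superclass."""
    superclass = int(code.split("-")[0])
    for lo, hi, name in DOMAIN_RANGES:
        if lo <= superclass <= hi:
            return name
    return "unknown"

print(domain_of("0122-00042-3-2-1"))  # Emotions (the "happy" code from Quick Start)
```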

API Reference

Encoding

from Oyemi import encode, Encoder

# Simple function
codes = encode("word")  # Returns List[str]

# Full encoder
enc = Encoder()
codes = enc.encode("word")              # List[str]
parsed = enc.encode_parsed("word")      # List[SemanticCode]
primary = enc.get_primary_code("word")  # str
exists = enc.contains("word")           # bool
batch = enc.encode_batch(["a", "b"])    # Dict[str, List[str]]

Synonym Discovery

from Oyemi import find_synonyms, Encoder

# Simple usage
synonyms = find_synonyms("fear")
# ['awe', 'dread', 'fright', 'concern']

# With filters (default: all enabled)
enc = Encoder()
synonyms = enc.find_synonyms(
    "fear",
    limit=10,
    pos_lock=True,           # Only same part-of-speech
    abstractness_lock=True,  # Don't mix abstract/concrete
    return_weighted=False    # Return list of words
)

# Get weighted synonyms (for ranking)
weighted = enc.find_synonyms("fear", return_weighted=True)
# [('dread', 1.0), ('fright', 1.0), ('awe', 0.5)]
# Weight 1.0 = same superclass, 0.5 = different superclass

How it works: Words with the same HHHH-LLLLL (superclass + synset ID) are TRUE synonyms - they come from the same WordNet synset.
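The matching rule above can be sketched directly on raw code strings: two words share a synset when any pair of their codes agrees on the HHHH-LLLLL prefix. This is a standalone illustration of the rule, not the Encoder's internal implementation, and the second code string in the usage line is made up for contrast.

```python
# TRUE-synonym check on raw codes: same HHHH-LLLLL prefix = same WordNet synset.
def share_synset(codes_a: list[str], codes_b: list[str]) -> bool:
    """True if any code pair agrees on the superclass + synset ID prefix."""
    prefixes = {tuple(c.split("-")[:2]) for c in codes_a}
    return any(tuple(c.split("-")[:2]) in prefixes for c in codes_b)

# Identical prefix -> same synset; a differing prefix (invented here) -> no match.
print(share_synset(["0122-00042-3-2-1"], ["0122-00042-3-1-1"]))  # True
print(share_synset(["0122-00042-3-2-1"], ["0150-00099-3-2-1"]))  # False
```

Words can carry several codes (polysemy), which is why the check runs over every pair rather than just the primary code.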

Similarity

from Oyemi import semantic_similarity, word_distance, find_similar

# Similarity (0-1, higher = more similar)
sim = semantic_similarity("cat", "dog")

# Distance with details
dist, result = word_distance("cat", "dog")
print(result.shared_superclass)  # True
print(result.same_pos)           # True

# Find similar words from candidates
similar = find_similar("happy", ["sad", "joyful", "angry", "content"])
# [("joyful", 0.85), ("content", 0.72), ...]

Clustering

from Oyemi import cluster_by_superclass

words = ["dog", "cat", "run", "walk", "happy", "sad"]
clusters = cluster_by_superclass(words)
# {'0011': ['dog', 'cat'], '2002': ['run', 'walk'], ...}

Sentiment/Valence

from Oyemi import Encoder

enc = Encoder()

# Check word valence
for word in ["happy", "sad", "worried", "excellent", "terrible"]:
    parsed = enc.encode_parsed(word)
    if parsed:
        valence = parsed[0].valence_name
        print(f"{word}: {valence}")

# Output:
# happy: positive
# sad: negative
# worried: negative
# excellent: positive
# terrible: negative

SemanticCode Object

from Oyemi import Encoder

enc = Encoder()
codes = enc.encode_parsed("run")

for code in codes:
    print(code.raw)              # "2001-00042-2-1-0"
    print(code.superclass)       # "2001"
    print(code.synset_id)        # "00042"
    print(code.pos)              # 2
    print(code.pos_name)         # "verb"
    print(code.abstractness)     # 1
    print(code.abstractness_name)# "mixed"
    print(code.valence)          # 0
    print(code.valence_name)     # "neutral"

Exceptions

from Oyemi import (
    OyemiError,           # Base exception
    UnknownWordError,     # Word not in lexicon
    LexiconNotFoundError, # Database file missing
    InvalidCodeError,     # Malformed code string
)

from Oyemi import encode

try:
    codes = encode("xyznotaword")
except UnknownWordError as e:
    print(f"Unknown: {e.word}")

Building the Lexicon

The lexicon database is pre-built and included. To rebuild from WordNet:

# Install build dependencies
pip install oyemi[build]

# Download WordNet and SentiWordNet
python -c "import nltk; nltk.download('wordnet'); nltk.download('omw-1.4'); nltk.download('sentiwordnet')"

# Build lexicon
python tools/build_lexicon.py

# Validate
python tools/validate_lexicon.py

Use Cases

  • Taxonomy Expansion - Expand keyword lists with true synonyms
  • Fast Text Similarity - Compare documents without embeddings
  • Sentiment Analysis - Quick valence detection
  • Semantic Clustering - Group words by meaning
  • Feature Engineering - Convert words to numeric features
  • Offline NLP - No API calls or model downloads
  • Deterministic Pipelines - Reproducible results
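The Feature Engineering use case follows directly from the code format: each of the five components is already numeric, so a code converts to a fixed-length feature vector with no model in the loop. The helper below is a hypothetical sketch of that idea, operating on a raw code string.

```python
# Hypothetical feature extraction: one numeric feature per code component.
def code_to_features(code: str) -> list[int]:
    """Turn an HHHH-LLLLL-P-A-V code into [superclass, synset, pos, abstractness, valence]."""
    superclass, synset_id, p, a, v = code.split("-")
    return [int(superclass), int(synset_id), int(p), int(a), int(v)]

print(code_to_features("0122-00042-3-2-1"))  # [122, 42, 3, 2, 1]
```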

Example: Expand Sentiment Keywords

from Oyemi import Encoder

enc = Encoder()

# Original keywords
negative_words = ["fear", "worried", "anxious", "stressed"]

# Expand with synonyms
expanded = set(negative_words)
for word in negative_words:
    synonyms = enc.find_synonyms(word, limit=5)
    expanded.update(synonyms)

print(f"Expanded: {len(negative_words)} -> {len(expanded)} words")
# Expanded: 4 -> 20 words

Performance

Operation           Time
------------------  -------
Single lookup       ~0.01ms
Batch (1000 words)  ~5ms
Similarity          ~0.02ms
Find synonyms       ~0.1ms

Versioning

  • Codes never change once released (semantic stability)
  • New words get new codes in minor versions
  • Schema changes require major version bump

Author

Kaossara Osseni

License

MIT License
