Offline semantic numeric lexicon - deterministic word-to-code mapping
Project description
Oyemi
Offline Semantic Numeric Lexicon
Oyemi is a deterministic, high-performance semantic encoding library. It maps words to numeric codes that capture semantic meaning, enabling fast similarity calculations and synonym discovery without neural networks.
Key Features
- Zero Runtime Dependencies - No WordNet, NLTK, or ML models needed at runtime
- Deterministic Codes - Same word always produces same codes
- Fast Lookups - SQLite with memory mapping (~0.01ms per lookup)
- TRUE Synonym Finder - Find synonyms using WordNet synset matching
- Semantic Distance - Calculate word similarity using codes
- Sentiment/Valence - Built-in positive/negative classification (SentiWordNet)
- Lemma Fallback - Automatically handles word variations (running -> run)
- Polysemy Support - Multiple codes for words with multiple meanings
Installation
pip install oyemi
Quick Start
from Oyemi import encode, semantic_similarity, find_synonyms, Encoder
# Simple encoding
codes = encode("happy")
# ['0122-00042-3-2-1']
# Check similarity
sim = semantic_similarity("happy", "joyful")
# 0.85
# Find TRUE synonyms
synonyms = find_synonyms("fear")
# ['awe', 'dread', 'fright', 'concern']
# Full encoder with parsed codes
enc = Encoder()
parsed = enc.encode_parsed("worried")
for code in parsed:
print(f"{code.raw}: {code.pos_name}, {code.valence_name}")
# 3999-04518-3-1-2: adjective, negative
Code Format
Codes follow the format: HHHH-LLLLL-P-A-V
| Component | Meaning | Values |
|---|---|---|
| HHHH | Semantic superclass | 0001-9999 (100+ categories) |
| LLLLL | Synset ID | 00001-99999 |
| P | Part of speech | 1=noun, 2=verb, 3=adj, 4=adv |
| A | Abstractness | 0=concrete, 1=mixed, 2=abstract |
| V | Valence | 0=neutral, 1=positive, 2=negative |
Superclass Categories
Oyemi includes 100+ semantic categories for precise classification:
| Range | Domain | Examples |
|---|---|---|
| 0100-0199 | Emotions | fear, joy, anger, sadness |
| 0200-0299 | Work/Business | job, salary, manager, career |
| 0300-0399 | Communication | speak, write, message |
| 0400-0499 | Cognition | think, know, believe |
| 0500-0599 | Social | family, friend, group |
| 1000-1999 | Physical/Concrete | object, place, body |
| 2000-2999 | Actions | move, create, change |
| 3000-3999 | Properties | size, color, quality |
API Reference
Encoding
from Oyemi import encode, Encoder
# Simple function
codes = encode("word") # Returns List[str]
# Full encoder
enc = Encoder()
codes = enc.encode("word") # List[str]
parsed = enc.encode_parsed("word") # List[SemanticCode]
primary = enc.get_primary_code("word") # str
exists = enc.contains("word") # bool
batch = enc.encode_batch(["a", "b"]) # Dict[str, List[str]]
Synonym Discovery
from Oyemi import find_synonyms, Encoder
# Simple usage
synonyms = find_synonyms("fear")
# ['awe', 'dread', 'fright', 'concern']
# With filters (default: all enabled)
enc = Encoder()
synonyms = enc.find_synonyms(
"fear",
limit=10,
pos_lock=True, # Only same part-of-speech
abstractness_lock=True, # Don't mix abstract/concrete
return_weighted=False # Return list of words
)
# Get weighted synonyms (for ranking)
weighted = enc.find_synonyms("fear", return_weighted=True)
# [('dread', 1.0), ('fright', 1.0), ('awe', 0.5)]
# Weight 1.0 = same superclass, 0.5 = different superclass
How it works: Words with the same HHHH-LLLLL (superclass + synset ID) are TRUE synonyms - they come from the same WordNet synset.
Similarity
from Oyemi import semantic_similarity, word_distance, find_similar
# Similarity (0-1, higher = more similar)
sim = semantic_similarity("cat", "dog")
# Distance with details
dist, result = word_distance("cat", "dog")
print(result.shared_superclass) # True
print(result.same_pos) # True
# Find similar words from candidates
similar = find_similar("happy", ["sad", "joyful", "angry", "content"])
# [("joyful", 0.85), ("content", 0.72), ...]
Clustering
from Oyemi import cluster_by_superclass
words = ["dog", "cat", "run", "walk", "happy", "sad"]
clusters = cluster_by_superclass(words)
# {'0011': ['dog', 'cat'], '2002': ['run', 'walk'], ...}
Sentiment/Valence
from Oyemi import Encoder
enc = Encoder()
# Check word valence
for word in ["happy", "sad", "worried", "excellent", "terrible"]:
parsed = enc.encode_parsed(word)
if parsed:
valence = parsed[0].valence_name
print(f"{word}: {valence}")
# Output:
# happy: positive
# sad: negative
# worried: negative
# excellent: positive
# terrible: negative
SemanticCode Object
from Oyemi import Encoder
enc = Encoder()
codes = enc.encode_parsed("run")
for code in codes:
print(code.raw) # "2001-00042-2-1-0"
print(code.superclass) # "2001"
print(code.synset_id) # "00042"
print(code.pos) # 2
print(code.pos_name) # "verb"
print(code.abstractness) # 1
print(code.abstractness_name)# "mixed"
print(code.valence) # 0
print(code.valence_name) # "neutral"
Exceptions
from Oyemi import (
OyemiError, # Base exception
UnknownWordError, # Word not in lexicon
LexiconNotFoundError, # Database file missing
InvalidCodeError, # Malformed code string
)
try:
codes = encode("xyznotaword")
except UnknownWordError as e:
print(f"Unknown: {e.word}")
Building the Lexicon
The lexicon database is pre-built and included. To rebuild from WordNet:
# Install build dependencies
pip install oyemi[build]
# Download WordNet and SentiWordNet
python -c "import nltk; nltk.download('wordnet'); nltk.download('omw-1.4'); nltk.download('sentiwordnet')"
# Build lexicon
python tools/build_lexicon.py
# Validate
python tools/validate_lexicon.py
Use Cases
- Taxonomy Expansion - Expand keyword lists with true synonyms
- Fast Text Similarity - Compare documents without embeddings
- Sentiment Analysis - Quick valence detection
- Semantic Clustering - Group words by meaning
- Feature Engineering - Convert words to numeric features
- Offline NLP - No API calls or model downloads
- Deterministic Pipelines - Reproducible results
Example: Expand Sentiment Keywords
from Oyemi import Encoder
enc = Encoder()
# Original keywords
negative_words = ["fear", "worried", "anxious", "stressed"]
# Expand with synonyms
expanded = set(negative_words)
for word in negative_words:
synonyms = enc.find_synonyms(word, limit=5)
expanded.update(synonyms)
print(f"Expanded: {len(negative_words)} -> {len(expanded)} words")
# Expanded: 4 -> 20 words
Performance
| Operation | Time |
|---|---|
| Single lookup | ~0.01ms |
| Batch (1000 words) | ~5ms |
| Similarity | ~0.02ms |
| Find synonyms | ~0.1ms |
Versioning
- Codes never change once released (semantic stability)
- New words get new codes in minor versions
- Schema changes require major version bump
Author
Kaossara Osseni
License
MIT License
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file oyemi-2.0.0.tar.gz.
File metadata
- Download URL: oyemi-2.0.0.tar.gz
- Upload date:
- Size: 8.4 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.9
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
2f27c604b41e4dd8763c06f61c238ee1398bb7b06068744d5b2898f53cb7a583
|
|
| MD5 |
92e44a32a5b597032d3a9f680fe02783
|
|
| BLAKE2b-256 |
73cdc79c75750d79a17c438d0a03347c0137f0b89f3e2d28e0b703745d7f6e36
|
File details
Details for the file oyemi-2.0.0-py3-none-any.whl.
File metadata
- Download URL: oyemi-2.0.0-py3-none-any.whl
- Upload date:
- Size: 8.4 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.9
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
8b39b438919f803023ce923162766e609f9aed0189162541e3cc33179e2ea58c
|
|
| MD5 |
84be261fc05d8ecfc4b7c5ec85d596d6
|
|
| BLAKE2b-256 |
8ba730881e75236324da6fc536b116053ca5f62bfbb84b7efc099b8c1ee3ca3e
|