Fast multi-language keyword extraction with tenant-aware stopwords and fuzzy dictionary snapping.
Nyansasua
Fast multi-language keyword extraction for Python, powered by the C++17 Cire core.
Nyansasua installs as the cire Python module and provides TF-IDF, YAKE, TextRank,
RAKE, and ensemble keyword extraction with UTF-8 tokenization, stopword filtering,
tenant-aware configuration, and fuzzy dictionary snapping.
Features
- 18 language profiles: English, Spanish, French, German, Italian, Portuguese, Dutch, Russian, Chinese, Japanese, Korean, Arabic, Indonesian, Twi/Akan, Ga, Ewe, Hausa, and Fante.
- 4 extraction algorithms: TF-IDF, YAKE, TextRank, and RAKE, plus an ensemble mode.
- Tenant-aware stopwords: isolate domain- or agent-specific stopwords for tenants such as Banking, Health, Legal, and Education.
- BK-tree fuzzy snapping: fast tenant-scoped correction to canonical terms like NHIS, GHS, or domain vocabulary.
- Unicode-native: handles UTF-8 text, Ghanaian characters, CJK, Cyrillic, Arabic, Hangul, Hiragana, Katakana, Thai, and Devanagari scripts.
- No Python runtime dependencies after installation.
Installation
pip install nyansasua
Quick Start
import cire
cfg = cire.ExtractConfig()
cfg.language = cire.Language.English
cfg.algorithm = cire.Algorithm.YAKE
cfg.top_k = 5
for kw in cire.extract_keywords("Machine learning is a branch of AI.", cfg):
    print(kw.text, kw.score)
High-Level Extractor
import cire
ext = cire.Extractor(language="auto", algorithm="ensemble", top_k=10)
keywords = ext.extract(
    "Natural language processing has seen rapid growth in education tools."
)
for kw in keywords:
    print(kw.text, kw.score)
Ghanaian Language Detection
import cire
print(cire.detect_language("ame ƒe nu")) # Language.Ewe
print(cire.detect_language("ɗan makaranta")) # Language.Hausa
print(cire.detect_language("ŋɔɔ kɛ sane")) # Language.Ga
print(cire.detect_language("me dɛ hom nyina")) # Language.Fante
Detection is heuristic. It is much more reliable on text that contains diagnostic
Unicode characters such as ƒ, ʋ, ɗ, ɓ, ƙ, ŋ, ɛ, and ɔ than on plain ASCII text.
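One way to act on this caveat is to check whether a text contains any of the diagnostic characters before trusting a detection result. The sketch below is plain Python and does not call cire; `DIAGNOSTIC_CHARS`, `diagnostic_chars_in`, and `has_strong_signal` are hypothetical names, and the character set is only the list given above, not the library's internal table.

```python
# Hypothetical helper, independent of cire: reports which of the diagnostic
# characters listed above occur in a text.
DIAGNOSTIC_CHARS = frozenset("ƒʋɗɓƙŋɛɔ")


def diagnostic_chars_in(text: str) -> set:
    """Return the diagnostic characters present in `text` (case-insensitive)."""
    return {ch for ch in text.lower() if ch in DIAGNOSTIC_CHARS}


def has_strong_signal(text: str) -> bool:
    """True when at least one diagnostic character is present."""
    return bool(diagnostic_chars_in(text))
```

A caller could treat results for texts where `has_strong_signal` is false as low-confidence and fall back to an explicitly configured language.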
Tenant-Aware Stopwords
Use tenant IDs to keep domain-specific stopwords isolated across agents.
import cire
cire.load_tenant_stopwords(
    "banking",
    cire.Language.English,
    ["can", "get", "account", "fees"],
)
cfg = cire.ExtractConfig()
cfg.language = cire.Language.English
cfg.algorithm = cire.Algorithm.RAKE
cfg.tenant_id = "banking"
cfg.top_k = 5
keywords = cire.extract_keywords(
    "Can I get account fees for a mobile money loan?",
    cfg,
)
Tenant stopwords are additive: built-in language stopwords still apply, and each tenant gets its own isolated overlay.
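The additive overlay model can be pictured with a small pure-Python sketch. This is not cire's implementation: the `StopwordOverlays` class, the tiny built-in list, and the choice to let repeated loads accumulate are all assumptions made for illustration.

```python
# Illustrative model of additive, per-tenant stopword overlays.
# Not cire's implementation; every name here is hypothetical.
BUILTIN_STOPWORDS = {"english": {"i", "a", "for", "the"}}  # tiny stand-in list


class StopwordOverlays:
    def __init__(self, builtin):
        self._builtin = builtin
        self._tenants = {}  # (tenant_id, language) -> set of extra stopwords

    def load(self, tenant_id, language, words):
        """Add words to one tenant's overlay (assumed to accumulate)."""
        key = (tenant_id, language)
        self._tenants.setdefault(key, set()).update(w.lower() for w in words)

    def is_stopword(self, word, language, tenant_id=None):
        """Built-in stopwords always apply; overlays apply per tenant only."""
        w = word.lower()
        if w in self._builtin.get(language, set()):
            return True
        return w in self._tenants.get((tenant_id, language), set())


overlays = StopwordOverlays(BUILTIN_STOPWORDS)
overlays.load("banking", "english", ["can", "get", "account", "fees"])
print(overlays.is_stopword("account", "english", "banking"))
print(overlays.is_stopword("account", "english", "health"))
```

Keying the overlay on the tenant and language pair is what keeps one tenant's vocabulary from affecting another's results.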
Tenant Fuzzy Dictionary Snapping
Nyansasua can keep separate canonical dictionaries in memory for different tenants or domains.
import cire
cire.load_tenant_dictionary("health", ["NHIS", "GHS", "malaria treatment"])
print(cire.snap_term("health", "nhsi")) # NHIS
print(cire.snap_term("legal", "nhsi")) # nhsi, no cross-tenant leakage
The snapper uses a BK-tree per tenant, so large dictionaries avoid a full linear scan for every query.
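To show why a BK-tree avoids scanning every term, here is a minimal pure-Python sketch. It is not the Cire core: plain Levenshtein distance, case-insensitive matching, the `max_dist` parameter, and the names `BKTree`, `load_dictionary`, and `snap` are all assumptions for illustration.

```python
# Minimal per-tenant BK-tree sketch. Not the Cire implementation; the metric,
# the max_dist threshold, and all names are assumptions.
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insert, delete, substitute each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


class BKTree:
    def __init__(self):
        self.root = None  # node = (term, {distance_to_term: child_node})

    def add(self, term: str) -> None:
        if self.root is None:
            self.root = (term, {})
            return
        node = self.root
        while True:
            d = levenshtein(term.lower(), node[0].lower())
            if d == 0:
                return  # already present (ignoring case)
            child = node[1].get(d)
            if child is None:
                node[1][d] = (term, {})
                return
            node = child

    def nearest(self, query: str, max_dist: int):
        """Return the closest term within max_dist, or None."""
        best = None
        stack = [self.root] if self.root else []
        q = query.lower()
        while stack:
            term, children = stack.pop()
            d = levenshtein(q, term.lower())
            if d <= max_dist and (best is None or d < best[0]):
                best = (d, term)
            # Triangle inequality: only subtrees whose edge label lies in
            # [d - max_dist, d + max_dist] can contain a match.
            for edge, child in children.items():
                if d - max_dist <= edge <= d + max_dist:
                    stack.append(child)
        return best[1] if best else None


_trees = {}  # tenant_id -> BKTree


def load_dictionary(tenant_id, terms):
    tree = _trees.setdefault(tenant_id, BKTree())
    for term in terms:
        tree.add(term)


def snap(tenant_id, term, max_dist=1):
    """Snap to the tenant's nearest canonical term, else return term as-is."""
    tree = _trees.get(tenant_id)
    hit = tree.nearest(term, max_dist) if tree else None
    return hit if hit is not None else term


load_dictionary("health", ["NHIS", "GHS", "malaria treatment"])
print(snap("health", "nhiss"))
print(snap("legal", "nhiss"))
```

Under plain Levenshtein a transposed pair of letters costs two edits, so a real snapper may use a different metric or threshold than this sketch.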
Batch Processing And Corpus TF-IDF
import cire
ext = cire.Extractor(language="english", algorithm="ensemble", top_k=5)
batch = ext.extract_many([
    "Python is widely used in data science.",
    "Climate change is a significant global challenge.",
])
corpus = [
    "Python is used in data science.",
    "Java is used in enterprise environments.",
    "Python is popular for AI.",
]
kws = ext.extract_corpus_tfidf(
    texts=corpus,
    target_text="Python is heavily used in AI and ML.",
    top_k=3,
)
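For intuition about what corpus TF-IDF computes, the following pure-Python sketch scores the terms of a target text against document frequencies from a reference corpus. It does not call cire, and the library's exact weighting is not documented here: the smoothed IDF formula, the stand-in stopword list, and the names `tokenize` and `corpus_tfidf` are assumptions.

```python
import math
import re
from collections import Counter

# Illustrative corpus TF-IDF; not cire's formula. All names are hypothetical.
STOPWORDS = {"is", "in", "and", "for"}  # tiny stand-in list


def tokenize(text):
    """Lowercase word tokens with stand-in stopwords removed."""
    return [t for t in re.findall(r"\w+", text.lower()) if t not in STOPWORDS]


def corpus_tfidf(texts, target_text, top_k):
    """Rank target terms by term frequency times smoothed inverse doc frequency."""
    n_docs = len(texts)
    df = Counter()
    for doc in texts:
        df.update(set(tokenize(doc)))  # count each term once per document
    tokens = tokenize(target_text)
    tf = Counter(tokens)
    scores = {
        term: (count / len(tokens)) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
        for term, count in tf.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


corpus = [
    "Python is used in data science.",
    "Java is used in enterprise environments.",
    "Python is popular for AI.",
]
for term, score in corpus_tfidf(corpus, "Python is heavily used in AI and ML.", 3):
    print(term, round(score, 3))
```

In this sketch, terms that are rare or absent in the reference corpus outrank terms such as "python" that appear in most documents, which is the effect a corpus brings over single-document scoring.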
Performance Snapshot
Recent C++ benchmark run on the development server:
- Stopword lookups: about 0.16-0.63 microseconds per lookup.
- YAKE short text extraction: about 16.6 microseconds per extraction.
- BK-tree fuzzy snapping at 10,000 terms: about 243 microseconds per snap.
- Concurrent tenant stopword isolation: 0 failures across 160,000 operations.
Exact timings depend on hardware, compiler, build type, and input shape.
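To get a rough per-call figure on your own hardware, a simple wall-clock harness is enough. The sketch below times a Python `frozenset` lookup as a stand-in, not the C++ core, so its numbers are not comparable to the figures above; `microseconds_per_call` is a hypothetical helper that can wrap any single-argument call.

```python
import time


def microseconds_per_call(fn, arg, repeats=100_000):
    """Average wall-clock microseconds per call of fn(arg)."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn(arg)
    elapsed = time.perf_counter() - start
    return elapsed / repeats * 1e6


# Stand-in workload: membership test on a small Python frozenset.
stopwords = frozenset(["can", "get", "account", "fees"])
per_lookup = microseconds_per_call(stopwords.__contains__, "account")
print(f"{per_lookup:.3f} microseconds per lookup")
```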
License
MIT License.