Fast multi-language keyword extraction with tenant-aware stopwords and fuzzy dictionary snapping.

Nyansasua

Fast multi-language keyword extraction for Python, powered by the C++17 Cire core.

Nyansasua installs as the cire Python module and provides TF-IDF, YAKE, TextRank, RAKE, and ensemble keyword extraction with UTF-8 tokenization, stopword filtering, tenant-aware configuration, and fuzzy dictionary snapping.

Features

  • 18 language profiles: English, Spanish, French, German, Italian, Portuguese, Dutch, Russian, Chinese, Japanese, Korean, Arabic, Indonesian, Twi/Akan, Ga, Ewe, Hausa, and Fante.
  • 4 extraction algorithms (TF-IDF, YAKE, TextRank, RAKE), plus an ensemble mode.
  • Tenant-aware stopwords: isolate domain- or agent-specific stopwords per tenant, for example Banking, Health, Legal, and Education.
  • BK-tree fuzzy snapping: fast tenant-scoped correction to canonical terms like NHIS, GHS, or domain vocabulary.
  • Unicode-native: handles UTF-8 text, including characters used in Ghanaian languages (such as ɛ, ɔ, ŋ, and ƒ) and the CJK, Cyrillic, Arabic, Hangul, Hiragana, Katakana, Thai, and Devanagari scripts.
  • No Python runtime dependencies after installation.

Installation

pip install nyansasua

Quick Start

import cire

cfg = cire.ExtractConfig()
cfg.language = cire.Language.English
cfg.algorithm = cire.Algorithm.YAKE
cfg.top_k = 5

for kw in cire.extract_keywords("Machine learning is a branch of AI.", cfg):
    print(kw.text, kw.score)

High-Level Extractor

import cire

ext = cire.Extractor(language="auto", algorithm="ensemble", top_k=10)

keywords = ext.extract(
    "Natural language processing has seen rapid growth in education tools."
)

for kw in keywords:
    print(kw.text, kw.score)

Ghanaian Language Detection

import cire

print(cire.detect_language("ame ƒe nu"))        # Language.Ewe
print(cire.detect_language("ɗan makaranta"))    # Language.Hausa
print(cire.detect_language("ŋɔɔ kɛ sane"))      # Language.Ga
print(cire.detect_language("me dɛ hom nyina"))  # Language.Fante

Detection is heuristic. It is much more reliable for text that contains diagnostic Unicode characters such as ƒ, ʋ, ɗ, ɓ, ƙ, ŋ, ɛ, and ɔ than for plain ASCII text.
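Because reliability depends on those characters, it can help to check whether a text contains any of them before trusting an automatic result. The following is a minimal sketch, not Cire's internal logic: the helper names diagnostic_chars_in and has_strong_signal are hypothetical, and the character set is simply the list given above.

```python
# Characters named above as diagnostic. This is an illustrative set,
# not the table the Cire core actually uses.
DIAGNOSTIC_CHARS = frozenset("ƒʋɗɓƙŋɛɔ")


def diagnostic_chars_in(text: str) -> set[str]:
    """Return the diagnostic characters that occur in text (case-insensitive)."""
    return {ch for ch in text.lower() if ch in DIAGNOSTIC_CHARS}


def has_strong_signal(text: str) -> bool:
    """True when text contains at least one diagnostic character."""
    return bool(diagnostic_chars_in(text))
```

A caller might use has_strong_signal to decide between language="auto" and an explicitly configured language for short ASCII-only inputs.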

Tenant-Aware Stopwords

Use tenant IDs to keep domain-specific stopwords isolated across agents.

import cire

cire.load_tenant_stopwords(
    "banking",
    cire.Language.English,
    ["can", "get", "account", "fees"],
)

cfg = cire.ExtractConfig()
cfg.language = cire.Language.English
cfg.algorithm = cire.Algorithm.RAKE
cfg.tenant_id = "banking"
cfg.top_k = 5

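keywords = cire.extract_keywords(
    "Can I get account fees for a mobile money loan?",
    cfg,
)

# Print each keyword with its score, as in the Quick Start example.
for kw in keywords:
    print(kw.text, kw.score)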

Tenant stopwords are additive: built-in language stopwords still apply, and each tenant gets its own isolated overlay.

Tenant Fuzzy Dictionary Snapping

Nyansasua can keep separate canonical dictionaries in memory for different tenants or domains.

import cire

cire.load_tenant_dictionary("health", ["NHIS", "GHS", "malaria treatment"])

print(cire.snap_term("health", "nhsi"))  # NHIS
print(cire.snap_term("legal", "nhsi"))   # nhsi, no cross-tenant leakage

The snapper uses a BK-tree per tenant, so large dictionaries avoid a full linear scan for every query.
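To show why a BK-tree avoids scanning every term, here is a compact pure-Python sketch. It is not the Cire C++ implementation: the BKTree class, its snap method, the choice of plain Levenshtein distance, and the default max_distance of 2 are all assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insert, delete, substitute), row by row."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


class BKTree:
    """Each node is [lowercase key, canonical term, {distance: child node}]."""

    def __init__(self):
        self.root = None

    def add(self, term: str) -> None:
        key = term.lower()
        if self.root is None:
            self.root = [key, term, {}]
            return
        node = self.root
        while True:
            d = levenshtein(key, node[0])
            if d == 0:
                return  # term already present
            child = node[2].get(d)
            if child is None:
                node[2][d] = [key, term, {}]
                return
            node = child

    def snap(self, query: str, max_distance: int = 2) -> str:
        """Return the closest canonical term within max_distance, else query."""
        if self.root is None:
            return query
        q = query.lower()
        best, best_d = query, max_distance + 1
        stack = [self.root]
        while stack:
            key, canonical, children = stack.pop()
            d = levenshtein(q, key)
            if d < best_d:  # ties keep the term visited first
                best, best_d = canonical, d
            # Triangle inequality: only these subtrees can hold a match.
            for edge, child in children.items():
                if d - max_distance <= edge <= d + max_distance:
                    stack.append(child)
        return best


health = BKTree()
for term in ["NHIS", "GHS", "malaria treatment"]:
    health.add(term)
```

Keeping one tree per tenant, for example in a dict keyed by tenant ID, gives the isolation described above: a tenant with no dictionary has an empty tree, and snap returns the query unchanged.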

Batch Processing And Corpus TF-IDF

import cire

ext = cire.Extractor(language="english", algorithm="ensemble", top_k=5)

batch = ext.extract_many([
    "Python is widely used in data science.",
    "Climate change is a significant global challenge.",
])

corpus = [
    "Python is used in data science.",
    "Java is used in enterprise environments.",
    "Python is popular for AI.",
]

kws = ext.extract_corpus_tfidf(
    texts=corpus,
    target_text="Python is heavily used in AI and ML.",
    top_k=3,
)
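For readers who want to see what corpus-level TF-IDF scoring computes, the sketch below scores the target text against document frequencies from the corpus. It is an illustration only, not what extract_corpus_tfidf does internally: the tokenizer, the smoothed IDF formula ln((1 + N) / (1 + df)) + 1, and the function name corpus_tfidf are assumptions, and no stopword filtering is applied.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase word tokens; a deliberately simple stand-in tokenizer."""
    return re.findall(r"\w+", text.lower())


def corpus_tfidf(corpus: list[str], target_text: str, top_k: int = 3):
    """Rank the target's terms by term frequency times smoothed IDF."""
    n_docs = len(corpus)
    df = Counter()
    for doc in corpus:
        df.update(set(tokenize(doc)))  # count each term once per document

    tf = Counter(tokenize(target_text))
    total = sum(tf.values())
    scores = {
        term: (count / total) * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
        for term, count in tf.items()
    }
    # Highest score first; alphabetical order breaks ties deterministically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
```

Terms that appear in every corpus document, such as "is", receive the lowest weight, while terms absent from the corpus receive the highest. Without stopword filtering, a word like "and" can rank near the top, which is one reason the library combines scoring with stopword lists.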

Performance Snapshot

Results from a recent C++ benchmark run on the development server:

  • Stopword lookups: about 0.16-0.63 microseconds per lookup.
  • YAKE short text extraction: about 16.6 microseconds per extraction.
  • BK-tree fuzzy snapping at 10,000 terms: about 243 microseconds per snap.
  • Concurrent tenant stopword isolation: 0 failures across 160,000 operations.

Exact timings depend on hardware, compiler, build type, and input shape.
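To get comparable numbers on your own hardware, a small timing helper is enough. This is a sketch using only the standard library; microseconds_per_call is a hypothetical helper name, and timings taken from Python include interpreter and binding overhead, so they should not be expected to match the C++ figures above.

```python
import time


def microseconds_per_call(fn, *args, repeat: int = 10_000) -> float:
    """Average wall-clock time of fn(*args) in microseconds."""
    fn(*args)  # warm-up call so one-time setup is not measured
    start = time.perf_counter()
    for _ in range(repeat):
        fn(*args)
    elapsed = time.perf_counter() - start
    return elapsed / repeat * 1e6
```

For example, after loading a tenant dictionary, microseconds_per_call(cire.snap_term, "health", "nhsi") would report the average cost of one snap as seen from Python.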

License

MIT License.

Download files

Source Distribution

nyansasua-0.2.2.tar.gz (20.8 MB)

Uploaded Source

Built Distribution

nyansasua-0.2.2-cp312-cp312-manylinux_2_34_x86_64.whl (252.8 kB)

Uploaded CPython 3.12, manylinux: glibc 2.34+, x86-64

File details

Details for the file nyansasua-0.2.2.tar.gz.

File metadata

  • Download URL: nyansasua-0.2.2.tar.gz
  • Upload date:
  • Size: 20.8 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for nyansasua-0.2.2.tar.gz
Algorithm Hash digest
SHA256 27484a2a049fa1432149ab3fab0306493362c53ff6ffbf3c6d9dda773bb6e5a4
MD5 67d0559f394528396ceede6c394154f8
BLAKE2b-256 872b647fab7cffc7c3bf5e9dfa79a4c771b74048b90dde5dc33ffc4be7ddc16b

File details

Details for the file nyansasua-0.2.2-cp312-cp312-manylinux_2_34_x86_64.whl.

File hashes

Hashes for nyansasua-0.2.2-cp312-cp312-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 0a47370db16549d63351df289af9bf0b8fdb7ffd0baafc07ccdc43074a26af82
MD5 3b409713ca87d3ce82a85afbc47e2a60
BLAKE2b-256 f3f04b37ff0a8c4ecb0791b48bf5318639dd177040d68ed8052fb5e23df2cfe9
