Fast multi-language keyword extraction with tenant-aware stopwords and fuzzy dictionary snapping.

Nyansasua

Fast multi-language keyword extraction for Python, powered by the C++17 Cire core.

Nyansasua installs as the cire Python module and provides TF-IDF, YAKE, TextRank, RAKE, and ensemble keyword extraction with UTF-8 tokenization, stopword filtering, tenant-aware configuration, and fuzzy dictionary snapping.

Features

  • 18 language profiles: English, Spanish, French, German, Italian, Portuguese, Dutch, Russian, Chinese, Japanese, Korean, Arabic, Indonesian, Twi/Akan, Ga, Ewe, Hausa, and Fante.
  • 4 extraction algorithms: TF-IDF, YAKE, TextRank, RAKE, plus ensemble mode.
  • Tenant-aware stopwords: keep domain- or agent-specific stopword lists isolated per tenant, for example Banking, Health, Legal, and Education.
  • BK-tree fuzzy snapping: fast, tenant-scoped correction of near-miss spellings to canonical terms such as NHIS, GHS, or other domain vocabulary.
  • Unicode-native: handles UTF-8 text, including Ghanaian-language characters and the CJK, Cyrillic, Arabic, Hangul, Hiragana, Katakana, Thai, and Devanagari scripts.
  • No Python runtime dependencies after installation.

Installation

pip install nyansasua

Quick Start

import cire

cfg = cire.ExtractConfig()
cfg.language = cire.Language.English
cfg.algorithm = cire.Algorithm.YAKE
cfg.top_k = 5

for kw in cire.extract_keywords("Machine learning is a branch of AI.", cfg):
    print(kw.text, kw.score)

High-Level Extractor

import cire

ext = cire.Extractor(language="auto", algorithm="ensemble", top_k=10)

keywords = ext.extract(
    "Natural language processing has seen rapid growth in education tools."
)

for kw in keywords:
    print(kw.text, kw.score)
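
This README does not describe how ensemble mode combines the individual algorithms. As an illustration only, one common way to merge several ranked keyword lists is reciprocal rank fusion; the sketch below shows that technique in plain Python. The function name, the constant k=60, and the sample rankings are all assumptions, and nothing here is claimed to match what cire does internally.

```python
def reciprocal_rank_fusion(rankings, k=60, top_k=5):
    """Merge ranked term lists; terms ranked high in many lists score best.

    Hypothetical helper for illustration, not part of the cire API.
    """
    scores = {}
    for ranking in rankings:
        for rank, term in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for the terms it contains.
            scores[term] = scores.get(term, 0.0) + 1.0 / (k + rank)
    # Sort by descending score, then alphabetically for a stable order.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ordered[:top_k]


# Made-up rankings standing in for the output of three algorithms.
rankings = [
    ["machine learning", "ai", "branch"],
    ["ai", "machine learning"],
    ["ai", "branch", "machine learning"],
]
fused = reciprocal_rank_fusion(rankings)
for term, score in fused:
    print(term, round(score, 4))
```

Rank-based fusion avoids having to normalise scores that live on different scales, which is why it is a frequent default when combining heterogeneous rankers.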

Ghanaian Language Detection

import cire

print(cire.detect_language("ame ƒe nu"))        # Language.Ewe
print(cire.detect_language("ɗan makaranta"))    # Language.Hausa
print(cire.detect_language("ŋɔɔ kɛ sane"))      # Language.Ga
print(cire.detect_language("me dɛ hom nyina"))  # Language.Fante

Detection is heuristic. Text containing diagnostic Unicode characters such as ƒ, ʋ, ɗ, ɓ, ƙ, ŋ, ɛ, and ɔ is detected much more reliably than plain ASCII text.
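
To make the idea of diagnostic characters concrete, here is a minimal sketch of a character-hint check in plain Python. It is not the library's detector. The hint table is an assumption based on orthography: ƒ and ʋ are letters of the Ewe alphabet, and ɗ, ɓ, and ƙ are the hooked letters of Hausa, while ŋ, ɛ, and ɔ occur in several Ghanaian languages and so cannot single one out.

```python
# Hypothetical hint table for illustration; not taken from cire.
STRONG_HINTS = {
    "ƒ": "ewe", "ʋ": "ewe",
    "ɗ": "hausa", "ɓ": "hausa", "ƙ": "hausa",
}
# Characters shared by several languages: useful signal, but ambiguous.
SHARED_CHARS = set("ŋɛɔ")


def diagnostic_hints(text):
    """Return (languages hinted by unique letters, whether shared letters occur)."""
    lowered = text.lower()
    languages = {STRONG_HINTS[ch] for ch in lowered if ch in STRONG_HINTS}
    has_shared = any(ch in SHARED_CHARS for ch in lowered)
    return languages, has_shared


print(diagnostic_hints("ame ƒe nu"))
print(diagnostic_hints("ŋɔɔ kɛ sane"))
print(diagnostic_hints("plain ascii text"))
```

A check like this can serve as a cheap confidence signal: when both return values are empty, the input carries no diagnostic characters and any detected language deserves less trust.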

Tenant-Aware Stopwords

Use tenant IDs to keep domain-specific stopwords isolated across agents.

import cire

cire.load_tenant_stopwords(
    "banking",
    cire.Language.English,
    ["can", "get", "account", "fees"],
)

cfg = cire.ExtractConfig()
cfg.language = cire.Language.English
cfg.algorithm = cire.Algorithm.RAKE
cfg.tenant_id = "banking"
cfg.top_k = 5

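keywords = cire.extract_keywords(
    "Can I get account fees for a mobile money loan?",
    cfg,
)

# Print each keyword with its score, as in the Quick Start example.
for kw in keywords:
    print(kw.text, kw.score)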

Tenant stopwords are additive: built-in language stopwords still apply, and each tenant gets its own isolated overlay.
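
The additive behaviour can be pictured with a small pure-Python model: one shared built-in set plus a separate overlay set per tenant. This is a sketch of the concept, not the C++ implementation; the names BUILTIN_STOPWORDS, load_overlay, is_stopword, and content_words are hypothetical, and the built-in list is a tiny sample.

```python
import re

# Tiny illustrative sample, standing in for the built-in English list.
BUILTIN_STOPWORDS = {"i", "a", "for", "the"}
# One overlay set per tenant ID; overlays never mix.
_tenant_overlays = {}


def load_overlay(tenant_id, words):
    """Add words to one tenant's overlay without touching other tenants."""
    _tenant_overlays.setdefault(tenant_id, set()).update(w.lower() for w in words)


def is_stopword(word, tenant_id=None):
    """A word is a stopword if it is built in or in this tenant's overlay."""
    w = word.lower()
    return w in BUILTIN_STOPWORDS or w in _tenant_overlays.get(tenant_id, set())


def content_words(text, tenant_id=None):
    """Tokenize on word characters and drop stopwords for the given tenant."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if not is_stopword(t, tenant_id)]


load_overlay("banking", ["can", "get", "account", "fees"])
question = "Can I get account fees for a mobile money loan?"
print(content_words(question, "banking"))
print(content_words(question, "health"))
```

The second call shows the isolation: a tenant with no overlay still gets the built-in filtering but keeps the words that only the banking tenant suppresses.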

Tenant Fuzzy Dictionary Snapping

Nyansasua can keep separate canonical dictionaries in memory for different tenants or domains.

import cire

cire.load_tenant_dictionary("health", ["NHIS", "GHS", "malaria treatment"])

print(cire.snap_term("health", "nhsi"))  # NHIS
print(cire.snap_term("legal", "nhsi"))   # nhsi, no cross-tenant leakage

The snapper uses a BK-tree per tenant, so large dictionaries avoid a full linear scan for every query.
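
For readers unfamiliar with BK-trees, the following is a minimal pure-Python sketch of the data structure using plain Levenshtein distance. It illustrates why a lookup can skip most of the dictionary; it is not the Cire implementation, and the class name, the max_distance default of 2, and the sample dictionary are assumptions.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


class BKTree:
    """Each node is (lowercase key, canonical term, children keyed by distance)."""

    def __init__(self):
        self.root = None

    def add(self, term):
        key = term.lower()
        if self.root is None:
            self.root = (key, term, {})
            return
        node = self.root
        while True:
            d = levenshtein(key, node[0])
            if d == 0:
                return  # already present
            child = node[2].get(d)
            if child is None:
                node[2][d] = (key, term, {})
                return
            node = child

    def snap(self, query, max_distance=2):
        """Return the closest canonical term within max_distance, else the query."""
        if self.root is None:
            return query
        q = query.lower()
        best_distance, best_term = max_distance + 1, query
        stack = [self.root]
        while stack:
            key, term, children = stack.pop()
            d = levenshtein(q, key)
            if d < best_distance:
                best_distance, best_term = d, term
            # Triangle inequality: only subtrees whose edge label lies in
            # [d - max_distance, d + max_distance] can contain a match.
            for edge, child in children.items():
                if d - max_distance <= edge <= d + max_distance:
                    stack.append(child)
        return best_term


tree = BKTree()
for term in ["NHIS", "malaria", "clinic"]:
    tree.add(term)

print(tree.snap("nhsi"))
print(tree.snap("xyzxyzxyz"))
```

Keeping one such tree per tenant ID, for instance in a dictionary keyed by tenant, is enough to reproduce the no-leakage behaviour: a query against a tenant with no tree simply returns the input unchanged.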

Batch Processing And Corpus TF-IDF

import cire

ext = cire.Extractor(language="english", algorithm="ensemble", top_k=5)

batch = ext.extract_many([
    "Python is widely used in data science.",
    "Climate change is a significant global challenge.",
])

corpus = [
    "Python is used in data science.",
    "Java is used in enterprise environments.",
    "Python is popular for AI.",
]

kws = ext.extract_corpus_tfidf(
    texts=corpus,
    target_text="Python is heavily used in AI and ML.",
    top_k=3,
)
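
As background on what corpus-level TF-IDF computes, here is a small pure-Python sketch: inverse document frequency comes from the reference corpus, term frequency from the target text. The weighting used is the smoothed form ln((1 + N) / (1 + df)) + 1, which is an assumption chosen for illustration; this README does not state which variant the library uses, and the helper names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"\w+", text.lower())


def tfidf_scores(corpus, target_text):
    """Score each term of target_text against document frequencies in corpus."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(tokenize(doc)))  # count each term once per document
    tokens = tokenize(target_text)
    scores = {}
    for term, count in Counter(tokens).items():
        tf = count / len(tokens)
        idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
        scores[term] = tf * idf
    return scores


def top_terms(scores, top_k=3):
    # Descending score, then alphabetical, so ties resolve deterministically.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:top_k]


corpus = [
    "Python is used in data science.",
    "Java is used in enterprise environments.",
    "Python is popular for AI.",
]
scores = tfidf_scores(corpus, "Python is heavily used in AI and ML.")
print(top_terms(scores))
```

Terms that appear in every corpus document, such as "is", receive the lowest weight, while terms absent from the corpus receive the highest. This sketch applies no stopword filtering, so a real pipeline would remove function words before ranking.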

Performance Snapshot

Results from a recent C++ benchmark run on the development server:

  • Stopword lookups: about 0.16-0.63 microseconds per lookup.
  • YAKE short text extraction: about 16.6 microseconds per extraction.
  • BK-tree fuzzy snapping at 10,000 terms: about 243 microseconds per snap.
  • Concurrent tenant stopword isolation: 0 failures across 160,000 operations.

Exact timings depend on hardware, compiler, build type, and input shape.
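
To get comparable per-operation figures on your own machine, a micro-benchmark along the following lines can be used. This sketch times membership tests on a plain Python set with the standard timeit module, purely to show the measurement method; it does not exercise the C++ core, and the word lists and function names are made up for the example.

```python
import timeit

# Stand-in data; swap in the call you actually want to measure.
stopwords = {"the", "a", "is", "of", "in", "and"}
words = ["machine", "the", "learning", "is", "branch", "of"]


def lookup_all():
    """Perform one stopword lookup per word and count the hits."""
    return sum(1 for w in words if w in stopwords)


def microseconds_per_lookup(repeats=5, number=10_000):
    """Best-of-N timing, divided by the number of individual lookups."""
    best = min(timeit.repeat(lookup_all, repeat=repeats, number=number))
    return best / (number * len(words)) * 1e6


print(f"{microseconds_per_lookup():.3f} microseconds per lookup")
```

Taking the minimum over several repeats reduces noise from other processes, which matters when the operation being measured takes well under a microsecond.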

License

MIT License.
