Fast multi-language keyword extraction with tenant-aware stopwords and fuzzy dictionary snapping.

Cire / Nyansasua

Cire (Hausa) — knowledge / wisdom. Nyansasua (Twi) — learning / wisdom.

PyPI package: nyansasua · C++ library: Cire

A self-contained, fast C++17 library for multi-language keyword extraction, with first-class Python bindings.

  • No external dependencies for the C++ core (no ICU, no Boost).
  • 18 languages with stopword lists: English, Spanish, French, German, Italian, Portuguese, Dutch, Russian, Chinese, Japanese, Korean, Arabic, Indonesian, Twi/Akan, Ga, Ewe, Hausa, Fante.
  • Tenant-aware stopword overlays for isolated domain/agent dictionaries.
  • BK-tree fuzzy snapping for tenant-scoped canonical term correction.
  • 4 algorithms (run any one, or combine via ensemble):
    • TF-IDF — single-doc entropy fallback or corpus-driven
    • YAKE — statistical (Campos et al., 2020)
    • TextRank — graph-based PageRank (Mihalcea & Tarau, 2004)
    • RAKE — rapid automatic keyword extraction (Rose et al., 2010)
  • UTF-8 everywhere — proper Unicode tokenizer with CJK / Cyrillic / Arabic / Hangul / Devanagari / Thai / Hiragana / Katakana support.
  • C++17 + clean public API; pybind11 Python module ships in python/.

Project layout

cire/
├── cpp/                  C++ core
│   ├── include/cire/     Public headers
│   ├── src/              Implementations
│   ├── examples/         Demo program (9 languages)
│   ├── tests/            C++ test suite
│   └── CMakeLists.txt
├── python/               pybind11 Python wrapper
│   ├── bindings.cpp
│   ├── cire/             Python package
│   └── tests/            pytest suite
├── CMakeLists.txt        Top-level build (optional)
├── pyproject.toml        Python packaging metadata
└── README.md

C++ quick start

#include <cire/extractor.hpp>
#include <cstdio>
#include <string>

int main() {
    std::string text = "Natural language processing (NLP) is a subfield of "
                       "linguistics, computer science, and artificial "
                       "intelligence concerned with the interactions between "
                       "computers and human language. Transformers have "
                       "revolutionized NLP.";

    cire::EnsembleConfig cfg;
    cfg.language = cire::Language::Auto;
    cfg.top_k = 5;

    for (const auto& k : cire::extract_keywords_ensemble(text, cfg)) {
        std::printf("%-20s  score=%.3f\n", k.text.c_str(), k.score);
    }
}

Build (C++ only)

cd cpp
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build . -j
./cire_tests
./cire_bench
./cire_demo           # multi-language demo

Python quick start

Install from PyPI

pip install nyansasua

Nyansasua installs as the cire module:

import cire

print(cire.__version__)

Install from source

cd Cire
pip install -e .

This uses scikit-build-core + pybind11 to compile the C++ core and produce a wheel that bundles the compiled extension. Once installed:

import cire

# One-liner
for k in cire.extract_keywords("Hello world", top_k=5):
    print(k.text, k.score)

# Or use the high-level Extractor class
ext = cire.Extractor(language="auto", algorithm="ensemble", top_k=10)
for k in ext.extract("Machine learning is a branch of AI."):
    print(k.text, k.score)

# Batch processing
results = ext.extract_many([
    "Python is widely used in data science and machine learning.",
    "Climate change is one of the biggest challenges facing humanity.",
])

# Corpus-driven TF-IDF (uses document frequency across many texts)
corpus = ["Python is used in data science.", "Java is used in enterprise.",
          "Python is great for scripting.", "Java runs on the JVM."]
kws = ext.extract_corpus_tfidf(corpus, "Python is popular for ML and AI.",
                               top_k=5)
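
To make the "document frequency across many texts" idea concrete, here is a minimal pure-Python sketch of corpus-driven TF-IDF. It does not call Cire and is not the library's implementation: the `tokenize`, `build_df`, and `tfidf_keywords` helpers are hypothetical, the smoothed IDF formula is one common choice, and no stopword filtering is applied.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; a deliberately naive stand-in for a real tokenizer."""
    return re.findall(r"\w+", text.lower())


def build_df(corpus):
    """Count, for each term, how many documents contain it at least once."""
    df = Counter()
    for doc in corpus:
        df.update(set(tokenize(doc)))
    return df


def tfidf_keywords(corpus, text, top_k=5):
    """Rank the terms of `text` by TF * smoothed IDF computed over `corpus`."""
    df = build_df(corpus)
    n_docs = len(corpus)
    tf = Counter(tokenize(text))
    total = sum(tf.values())
    scores = {
        term: (count / total) * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
        for term, count in tf.items()
    }
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


corpus = ["Python is used in data science.", "Java is used in enterprise.",
          "Python is great for scripting.", "Java runs on the JVM."]
ranked = tfidf_keywords(corpus, "Python is popular for ML and AI.", top_k=5)
for term, score in ranked:
    print(f"{term:10s} {score:.3f}")
```

Terms that never occur in the corpus get the largest IDF, so "and" ranks as high as "ml" in this sketch; that is exactly the noise a stopword list removes.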

Tenant dictionaries for domains

Use tenant dictionaries when different November agents need isolated domain vocabulary in memory at the same time.

import cire

cire.load_tenant_dictionary(
    "education",
    ["mathematics", "english", "fractions", "lesson_note", "B2"],
)
cire.load_tenant_dictionary(
    "banking",
    ["mobile_money", "microloan", "GHS", "susu"],
)

print(cire.snap_term("education", "mathematic"))  # mathematics
print(cire.snap_term("education", "fracions"))    # fractions
print(cire.snap_term("banking", "micro-loan"))    # microloan
print(cire.snap_term("education", "micro-loan"))  # micro-loan, no banking leakage
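
For readers unfamiliar with BK-trees, the following is a rough sketch of how tenant-scoped fuzzy snapping can work in principle. It is pure Python and independent of Cire; the `BKTree` class, the `load_dictionary` and `snap` helpers, the Levenshtein metric, and the default distance of 2 are all assumptions made for illustration, not the library's internals.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


class BKTree:
    """Metric tree: each child edge is labelled with its distance to the parent."""

    def __init__(self):
        self.root = None  # node = (term, {distance: child_node})

    def add(self, term):
        if self.root is None:
            self.root = (term, {})
            return
        node = self.root
        while True:
            d = levenshtein(term, node[0])
            if d == 0:
                return  # already present
            child = node[1].get(d)
            if child is None:
                node[1][d] = (term, {})
                return
            node = child

    def nearest(self, query, max_dist):
        """Return the closest stored term within max_dist, or None."""
        best = None  # (distance, term)
        stack = [self.root] if self.root else []
        while stack:
            term, children = stack.pop()
            d = levenshtein(query, term)
            if d <= max_dist and (best is None or (d, term) < best):
                best = (d, term)
            # Triangle inequality: only edges in [d - max_dist, d + max_dist] can match.
            for edge, child in children.items():
                if d - max_dist <= edge <= d + max_dist:
                    stack.append(child)
        return best[1] if best else None


# One tree per tenant keeps vocabularies isolated from each other.
tenants = {}


def load_dictionary(tenant_id, terms):
    tree = tenants.setdefault(tenant_id, BKTree())
    for term in terms:
        tree.add(term.lower())


def snap(tenant_id, term, max_dist=2):
    """Snap to the tenant's closest canonical term; fall back to the input."""
    tree = tenants.get(tenant_id)
    match = tree.nearest(term.lower(), max_dist) if tree else None
    return match if match is not None else term


load_dictionary("education", ["mathematics", "english", "fractions"])
load_dictionary("banking", ["microloan", "susu"])
print(snap("banking", "micro-loan"))    # microloan
print(snap("education", "micro-loan"))  # micro-loan (unchanged)
```

Keeping one tree per tenant is what prevents leakage: a query against "education" never visits nodes that belong to "banking".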

Tenant stopwords

Tenant stopwords are isolated overlays on top of built-in language stopwords.

import cire

cire.load_tenant_stopwords(
    "health",
    cire.Language.English,
    ["please", "show", "patient", "case"],
)

cfg = cire.ExtractConfig()
cfg.language = cire.Language.English
cfg.algorithm = cire.Algorithm.YAKE
cfg.tenant_id = "health"
cfg.top_k = 5

for kw in cire.extract_keywords("Please show malaria treatment for this patient case", cfg):
    print(kw.text, kw.score)
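
Conceptually, an overlay means a word is a stopword if it appears in the built-in list or in the calling tenant's own list. The sketch below illustrates that lookup in plain Python without using Cire; the data structures, the tiny built-in list, and the function names are hypothetical and only mirror the behavior described above.

```python
BUILTIN_STOPWORDS = {
    "en": {"a", "for", "the", "this", "of", "is"},  # tiny illustrative subset
}

# Overlays are keyed by (tenant, language) so tenants never see each other's words.
tenant_overlays = {}


def load_tenant_stopwords(tenant_id, lang, words):
    overlay = tenant_overlays.setdefault((tenant_id, lang), set())
    overlay.update(w.lower() for w in words)


def is_stopword(word, lang, tenant_id=None):
    """Check the built-in list first, then the tenant's overlay (if any)."""
    w = word.lower()
    if w in BUILTIN_STOPWORDS.get(lang, set()):
        return True
    return w in tenant_overlays.get((tenant_id, lang), set())


def content_words(text, lang, tenant_id=None):
    """Whitespace-split the text and drop every stopword visible to this tenant."""
    return [w for w in text.lower().split() if not is_stopword(w, lang, tenant_id)]


load_tenant_stopwords("health", "en", ["please", "show", "patient", "case"])
query = "Please show malaria treatment for this patient case"

print(content_words(query, "en", "health"))   # ['malaria', 'treatment']
print(content_words(query, "en", "banking"))  # overlay words survive for other tenants
```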

Education lesson-note query example

This mirrors a November Education agent that extracts expected filters first, uses an alias map for semantic aliases, and lets Nyansasua snap remaining spelling variants with the Education tenant dictionary.

import cire

cire.load_tenant_dictionary(
    "education",
    [
        "B2",
        "english",
        "lesson_note",
        "GES",
        "core_competencies",
        "assessment_task",
    ],
)

aliases = {
    "basic 2": "B2",
    "english language": "english",
}

entities = {
    "grade": "Basic 2",
    "subject": "Englsh",
}

normalized = {}
for field, value in entities.items():
    key = value.lower()
    normalized[field] = aliases.get(key) or cire.snap_term("education", key, 2)

print(normalized)
# {'grade': 'B2', 'subject': 'english'}

Ghanaian language detection

import cire

samples = {
    "ewe": "ame ƒe nu",
    "hausa": "ɗan makaranta yana karatu",
    "ga": "ŋɔɔ kɛ sane",
    "fante": "me dɛ hom nyina",
}

for label, text in samples.items():
    lang = cire.detect_language(text)
    print(label, cire.language_name(lang), cire.language_code(lang))
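
The sketch below illustrates why preserving native characters matters for heuristic detection. It is not how `cire.detect_language` is implemented: the `MARKERS` table and both functions are hypothetical, and only letters that point strongly at one language are included (ƒ and ʋ for Ewe, ɗ, ɓ, and ƙ for Hausa). Letters such as ŋ, ɛ, and ɔ are shared by more than one supported language, so a real detector needs further evidence to separate those.

```python
# Marker letters grouped by the language they most strongly suggest.
# The grouping is an illustrative assumption, not the library's actual table.
MARKERS = {
    "ewe": set("ƒʋ"),
    "hausa": set("ɗɓƙ"),
}


def marker_votes(text):
    """Count how many marker letters of each language occur in the text."""
    lowered = text.lower()
    return {lang: sum(ch in chars for ch in lowered) for lang, chars in MARKERS.items()}


def guess_by_markers(text):
    """Return the language with the most marker hits, or None when there are none."""
    votes = marker_votes(text)
    lang, hits = max(votes.items(), key=lambda kv: kv[1])
    return lang if hits > 0 else None


print(guess_by_markers("ame ƒe nu"))                  # ewe
print(guess_by_markers("ɗan makaranta yana karatu"))  # hausa
print(guess_by_markers("dan makaranta yana karatu"))  # None: ASCII folding removed the signal
```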

Build the Python module directly (no scikit-build)

cd cpp
mkdir build && cd build
cmake .. -DCIRE_BUILD_PYTHON=ON -Dpybind11_DIR=$(python3 -m pybind11 --cmakedir)
cmake --build . -j
# The .so is dropped into python/cire/ by the top-level CMake hook.

Run the Python tests

cd python
pytest tests/

API surface

C++

Header               What it does
cire/types.hpp       Language, Token, Keyword, Sentence
cire/tokenizer.hpp   UTF-8 tokenizer + sentence splitter
cire/stopwords.hpp   Stopword lists + tenant overlays
cire/snapper.hpp     Tenant fuzzy dictionary snapping
cire/tfidf.hpp       TF-IDF extractor (corpus + single-doc fallback)
cire/yake.hpp        YAKE statistical extractor
cire/textrank.hpp    TextRank PageRank extractor
cire/rake.hpp        RAKE phrase extractor
cire/extractor.hpp   Top-level facade + ensemble + language detect

Python (import cire)

Symbol                              What it does
Extractor(language, algorithm, …)   High-level facade class
Language, Algorithm                 Enums (with string aliases)
ExtractConfig, EnsembleConfig       Per-call configuration
extract_keywords(text, config)      Run one algorithm
extract_keywords_ensemble(…)        Run all four, merge
tokenize, split_sentences           Low-level token utilities
is_stopword, add_stopword           Stopword inspection
load_tenant_stopwords               Tenant-specific stopword overlays
load_tenant_dictionary, snap_term   Tenant fuzzy dictionary snapping
detect_language                     Heuristic script detection
build_corpus_df                     Build a DF table for TF-IDF

Algorithm selection

Use case                                   Pick
Large corpus, need IDF signal              TFIDF
Single document, no corpus                 YAKE
Want graph-based co-occurrence ranking     TextRank
Domain phrases (e.g. legal, medical)       RAKE
Combined ranking from all four algorithms  ensemble
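
This README does not spell out how the ensemble merges the four rankings. As an illustration of what such a merge can look like, the sketch below uses reciprocal rank fusion, a generic rank-combination technique chosen here as a stand-in; it should not be read as Cire's actual merge rule. The function name, the constant `k=60`, and the per-algorithm keyword lists are all hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_k=5):
    """Merge ranked keyword lists: each list adds 1 / (k + rank) to a term's score."""
    fused = defaultdict(float)
    for ranked_terms in rankings.values():
        for rank, term in enumerate(ranked_terms, start=1):
            fused[term] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Hypothetical per-algorithm outputs, best keyword first.
rankings = {
    "tfidf": ["transformers", "nlp", "linguistics"],
    "yake": ["nlp", "transformers", "language"],
    "textrank": ["nlp", "language", "computers"],
    "rake": ["natural language processing", "nlp", "transformers"],
}
merged = reciprocal_rank_fusion(rankings, top_k=3)
print([term for term, _ in merged])  # ['nlp', 'transformers', 'language']
```

Rank-based fusion sidesteps the fact that the four algorithms produce scores on different scales, which is the main difficulty any ensemble of this kind has to handle.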

Multi-language behavior

  • The tokenizer handles mixed-script text (e.g. "Hello世界world") correctly.
  • For Chinese / Japanese / Korean, the stopword list contains function words; the tokenizer also splits every CJK character into its own 1-char token, a common dictionary-free approach for these scripts.
  • The casing salience feature in YAKE is automatically a no-op for caseless scripts (CJK, Hangul, Arabic).
  • Use cire.detect_language(text) if you don't want to specify the language up front.
  • Ghanaian language detection is heuristic and works best when native Unicode characters such as ƒ, ʋ, ɗ, ɓ, ƙ, ŋ, ɛ, and ɔ are preserved.
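
The mixed-script and per-character behavior described above can be approximated in a few lines of Python. This is a simplified sketch rather than Cire's tokenizer: the function names are hypothetical, and only the main CJK Unified Ideographs block (U+4E00 to U+9FFF) is treated as single-character tokens, whereas a full tokenizer covers more ranges.

```python
def is_cjk_ideograph(ch):
    """True for characters in the main CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def tokenize_mixed(text):
    """Emit each CJK ideograph as its own token; group other letters/digits into words."""
    tokens, word = [], []
    for ch in text:
        if is_cjk_ideograph(ch):
            if word:  # flush the word collected so far
                tokens.append("".join(word))
                word = []
            tokens.append(ch)
        elif ch.isalnum():
            word.append(ch)
        else:  # whitespace or punctuation ends the current word
            if word:
                tokens.append("".join(word))
                word = []
    if word:
        tokens.append("".join(word))
    return tokens


print(tokenize_mixed("Hello世界world"))  # ['Hello', '世', '界', 'world']
```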

Distribution

The Python package is built with scikit-build-core; for cross-platform wheels, configure cibuildwheel:

[tool.cibuildwheel]
build = ["cp38-*", "cp39-*", "cp310-*", "cp311-*", "cp312-*"]

Then:

python -m cibuildwheel --output-dir wheelhouse
twine upload wheelhouse/*

License

MIT.
