Fast multi-language keyword extraction with tenant-aware stopwords and fuzzy dictionary snapping.

Cire / Nyansasua

Cire (Hausa) — knowledge / wisdom. Nyansasua (Twi) — learning / wisdom.

PyPI package: nyansasua · C++ library: Cire

A self-contained, fast C++17 library for multi-language keyword extraction, with first-class Python bindings.

- No external dependencies for the C++ core (no ICU, no Boost).
- 18 languages with stopword lists: English, Spanish, French, German, Italian, Portuguese, Dutch, Russian, Chinese, Japanese, Korean, Arabic, Indonesian, Twi/Akan, Ga, Ewe, Hausa, Fante.
- Tenant-aware stopword overlays for isolated domain/agent dictionaries.
- BK-tree fuzzy snapping for tenant-scoped canonical term correction.
- 4 algorithms (run any one, or combine via ensemble):
  - TF-IDF: single-doc entropy fallback or corpus-driven
  - YAKE: statistical (Campos et al., 2020)
  - TextRank: graph-based PageRank (Mihalcea & Tarau, 2004)
  - RAKE: rapid automatic keyword extraction (Rose et al., 2010)
- UTF-8 everywhere: proper Unicode tokenizer with CJK / Cyrillic / Arabic / Hangul / Devanagari / Thai / Hiragana / Katakana support.
- C++17 + clean public API; pybind11 Python module ships in python/.
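
To make the RAKE entry in the algorithm list concrete, here is a minimal pure-Python sketch of the general RAKE idea: split text into candidate phrases at stopwords and punctuation, then score each phrase by summing the degree/frequency ratio of its words. This is an illustration only, not the Cire implementation; the `rake` function and the tiny `STOPWORDS` set are hypothetical names chosen for this example.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword set (assumption); real lists are far larger.
STOPWORDS = {"a", "an", "and", "the", "of", "is", "in", "for", "with", "to", "on"}

def rake(text, stopwords=STOPWORDS, top_k=5):
    """Score candidate phrases RAKE-style (hypothetical helper, not the cire API)."""
    phrases = []
    # Punctuation ends a phrase, so handle one punctuation-free fragment at a time.
    for fragment in re.split(r"[^\w\s-]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in stopwords:
                # A stopword closes the phrase collected so far.
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))

    freq = defaultdict(int)    # how often each word occurs in candidate phrases
    degree = defaultdict(int)  # co-occurrence count, including the word itself
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)

    # Phrase score = sum of degree/frequency over its words.
    scores = {" ".join(p): sum(degree[w] / freq[w] for w in p) for p in phrases}
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]

keywords = rake("Keyword extraction with tenant stopwords is fast. "
                "Fuzzy dictionary snapping corrects the spelling of keywords.")
for phrase, score in keywords:
    print(f"{phrase:40s} {score:.1f}")
```

Longer phrases of rare words score highest under this scheme, which is why RAKE tends to surface multi-word domain phrases.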

Project layout

cire/
├── cpp/                  C++ core
│   ├── include/cire/     Public headers
│   ├── src/              Implementations
│   ├── examples/         Demo program (9 languages)
│   ├── tests/            C++ test suite
│   └── CMakeLists.txt
├── python/               pybind11 Python wrapper
│   ├── bindings.cpp
│   ├── cire/             Python package
│   └── tests/            pytest suite
├── CMakeLists.txt        Top-level build (optional)
├── pyproject.toml        Python packaging metadata
└── README.md

C++ quick start

#include <cire/extractor.hpp>
#include <cstdio>
#include <string>

int main() {
    std::string text = "Natural language processing (NLP) is a subfield of "
                       "linguistics, computer science, and artificial "
                       "intelligence concerned with the interactions between "
                       "computers and human language. Transformers have "
                       "revolutionized NLP.";

    cire::EnsembleConfig cfg;
    cfg.language = cire::Language::Auto;
    cfg.top_k = 5;

    for (const auto& k : cire::extract_keywords_ensemble(text, cfg)) {
        std::printf("%-20s  score=%.3f\n", k.text.c_str(), k.score);
    }
}

Build (C++ only)

cd cpp
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build . -j
./cire_tests
./cire_bench
./cire_demo           # multi-language demo

Python quick start

Install from PyPI

pip install nyansasua

Nyansasua installs as the cire module:

import cire

print(cire.__version__)

Install from source

cd Cire
pip install -e .

This uses scikit-build-core + pybind11 to compile the C++ core and produce a wheel that bundles the compiled extension. Once installed:

import cire

# One-liner
for k in cire.extract_keywords("Hello world", top_k=5):
    print(k.text, k.score)

# Or use the high-level Extractor class
ext = cire.Extractor(language="auto", algorithm="ensemble", top_k=10)
for k in ext.extract("Machine learning is a branch of AI."):
    print(k.text, k.score)

# Batch processing
results = ext.extract_many([
    "Python is widely used in data science and machine learning.",
    "Climate change is one of the biggest challenges facing humanity.",
])

# Corpus-driven TF-IDF (uses document frequency across many texts)
corpus = ["Python is used in data science.", "Java is used in enterprise.",
          "Python is great for scripting.", "Java runs on the JVM."]
kws = ext.extract_corpus_tfidf(corpus, "Python is popular for ML and AI.",
                               top_k=5)
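
For readers unfamiliar with corpus-driven TF-IDF, the following pure-Python sketch shows the general technique: build a document-frequency table from the corpus, then weight each term of the target text by its term frequency times a smoothed inverse document frequency. It does not use or reproduce the cire internals; `build_df`, `tfidf_scores`, and the smoothing formula are assumptions for illustration, and stopword filtering is left out to keep it short.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase word tokenizer (a simplification for this sketch)."""
    return re.findall(r"\w+", text.lower())

def build_df(corpus):
    """Document frequency: the number of documents that contain each term."""
    df = Counter()
    for doc in corpus:
        df.update(set(tokenize(doc)))  # count each term once per document
    return df

def tfidf_scores(text, df, n_docs):
    """TF-IDF with a smoothed IDF so unseen terms never divide by zero."""
    tf = Counter(tokenize(text))
    total = sum(tf.values())
    return {
        term: (count / total) * (math.log((1 + n_docs) / (1 + df.get(term, 0))) + 1)
        for term, count in tf.items()
    }

corpus = ["Python is used in data science.", "Java is used in enterprise.",
          "Python is great for scripting.", "Java runs on the JVM."]
df = build_df(corpus)
scores = tfidf_scores("Python is popular for ML and AI.", df, len(corpus))

# Terms absent from the corpus ("popular") outrank common ones ("is").
for term, score in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{term:10s} {score:.3f}")
```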

Tenant dictionaries for domains

Use tenant dictionaries when different November agents need isolated domain vocabulary in memory at the same time.

import cire

cire.load_tenant_dictionary(
    "education",
    ["mathematics", "english", "fractions", "lesson_note", "B2"],
)
cire.load_tenant_dictionary(
    "banking",
    ["mobile_money", "microloan", "GHS", "susu"],
)

print(cire.snap_term("education", "mathematic"))  # mathematics
print(cire.snap_term("education", "fracions"))    # fractions
print(cire.snap_term("banking", "micro-loan"))    # microloan
print(cire.snap_term("education", "micro-loan"))  # micro-loan, no banking leakage
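
The sketch below illustrates how BK-tree snapping with per-tenant isolation can work in principle: each tenant gets its own tree keyed by Levenshtein distance, and a lookup only walks the tree of the tenant it was asked about. It is a generic pure-Python illustration, not the Cire implementation; `BKTree`, `load_dictionary`, `snap`, and the default distance of 2 are hypothetical, and case folding is omitted.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

class BKTree:
    """Each node is [term, {edge_distance: child_node}]."""

    def __init__(self, terms=()):
        self.root = None
        for term in terms:
            self.add(term)

    def add(self, term):
        if self.root is None:
            self.root = [term, {}]
            return
        node = self.root
        while True:
            d = levenshtein(term, node[0])
            if d == 0:
                return  # already present
            child = node[1].get(d)
            if child is None:
                node[1][d] = [term, {}]
                return
            node = child

    def nearest(self, query, max_distance):
        """Return (distance, term) for the closest term within max_distance, or None."""
        best = None
        stack = [self.root] if self.root else []
        while stack:
            term, children = stack.pop()
            d = levenshtein(query, term)
            if d <= max_distance and (best is None or (d, term) < best):
                best = (d, term)
            # Triangle inequality: only edges in [d - max, d + max] can hold matches.
            for edge, child in children.items():
                if d - max_distance <= edge <= d + max_distance:
                    stack.append(child)
        return best

TENANTS = {}  # tenant id -> BKTree; one tree per tenant keeps vocabularies isolated

def load_dictionary(tenant, terms):
    TENANTS[tenant] = BKTree(terms)

def snap(tenant, term, max_distance=2):
    tree = TENANTS.get(tenant)
    hit = tree.nearest(term, max_distance) if tree else None
    return hit[1] if hit else term  # no close match: return the input unchanged

load_dictionary("education", ["mathematics", "english", "fractions", "lesson_note", "B2"])
load_dictionary("banking", ["mobile_money", "microloan", "GHS", "susu"])
print(snap("education", "fracions"), snap("banking", "micro-loan"))
```

Because each lookup consults only the named tenant's tree, a banking term can never be returned for an education query, whatever the edit distance.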

Tenant stopwords

Tenant stopwords are isolated overlays on top of built-in language stopwords.

import cire

cire.load_tenant_stopwords(
    "health",
    cire.Language.English,
    ["please", "show", "patient", "case"],
)

cfg = cire.ExtractConfig()
cfg.language = cire.Language.English
cfg.algorithm = cire.Algorithm.YAKE
cfg.tenant_id = "health"
cfg.top_k = 5

for kw in cire.extract_keywords("Please show malaria treatment for this patient case", cfg):
    print(kw.text, kw.score)
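
Conceptually, an overlay is an extra per-tenant set consulted in addition to the shared built-in list. The following pure-Python sketch shows that idea under stated assumptions: `BASE_STOPWORDS`, `load_overlay`, `is_stopword`, and `content_words` are hypothetical names for this example, the base list is a toy, and nothing here reflects how Cire stores its overlays internally.

```python
# Toy built-in list (assumption); shared by every tenant and never mutated.
BASE_STOPWORDS = {"en": frozenset({"the", "for", "this", "a", "of", "is"})}

# (tenant_id, language) -> extra stopwords visible only to that tenant.
_overlays = {}

def load_overlay(tenant_id, language, words):
    """Add tenant-specific stopwords without touching the base list."""
    _overlays.setdefault((tenant_id, language), set()).update(w.lower() for w in words)

def is_stopword(word, language, tenant_id=None):
    w = word.lower()
    if w in BASE_STOPWORDS.get(language, ()):
        return True
    return w in _overlays.get((tenant_id, language), ())

def content_words(text, language, tenant_id=None):
    """Whitespace tokenization is a simplification for this sketch."""
    return [w for w in text.lower().split() if not is_stopword(w, language, tenant_id)]

load_overlay("health", "en", ["please", "show", "patient", "case"])
text = "Please show malaria treatment for this patient case"

print(content_words(text, "en", "health"))   # overlay applied
print(content_words(text, "en", "banking"))  # other tenants see only the base list
```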

Education lesson-note query example

This mirrors a November Education agent that extracts expected filters first, uses an alias map for semantic aliases, and lets Nyansasua snap remaining spelling variants with the Education tenant dictionary.

import cire

cire.load_tenant_dictionary(
    "education",
    [
        "B2",
        "english",
        "lesson_note",
        "GES",
        "core_competencies",
        "assessment_task",
    ],
)

aliases = {
    "basic 2": "B2",
    "english language": "english",
}

entities = {
    "grade": "Basic 2",
    "subject": "Englsh",
}

normalized = {}
for field, value in entities.items():
    key = value.lower()
    normalized[field] = aliases.get(key) or cire.snap_term("education", key, 2)

print(normalized)
# {'grade': 'B2', 'subject': 'english'}

Ghanaian language detection

import cire

samples = {
    "ewe": "ame ƒe nu",
    "hausa": "ɗan makaranta yana karatu",
    "ga": "ŋɔɔ kɛ sane",
    "fante": "me dɛ hom nyina",
}

for label, text in samples.items():
    lang = cire.detect_language(text)
    print(label, cire.language_name(lang), cire.language_code(lang))
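
Input that has passed through ASCII-only keyboards or sanitizers often loses the special letters used in the samples above (for example ƒ typed as f). A small pre-check like the sketch below can flag such text before it reaches a detector. It is a standalone helper, not part of the cire API; `HINT_CHARS` and `native_hint_chars` are hypothetical names, and the character set is simply the one this README mentions.

```python
# Letters this README names as useful hints for Ghanaian language detection.
HINT_CHARS = frozenset("ƒʋɗɓƙŋɛɔ")

def native_hint_chars(text):
    """Return the hint letters found in text, case-insensitively, in code point order."""
    return sorted(set(text.lower()) & HINT_CHARS)

for sample in ["ame ƒe nu", "ame fe nu", "ŋɔɔ kɛ sane"]:
    found = native_hint_chars(sample)
    status = "native letters kept" if found else "no native letters; detection may be weaker"
    print(f"{sample!r}: {found} ({status})")
```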

Build the Python module directly (no scikit-build)

cd cpp
mkdir build && cd build
cmake .. -DCIRE_BUILD_PYTHON=ON -Dpybind11_DIR=$(python3 -m pybind11 --cmakedir)
cmake --build . -j
# The .so is dropped into python/cire/ by the top-level CMake hook.

Run the Python tests

cd python
pytest tests/

API surface

C++

| Header | What it does |
| --- | --- |
| `cire/types.hpp` | `Language`, `Token`, `Keyword`, `Sentence` |
| `cire/tokenizer.hpp` | UTF-8 tokenizer + sentence splitter |
| `cire/stopwords.hpp` | Stopword lists + tenant overlays |
| `cire/snapper.hpp` | Tenant fuzzy dictionary snapping |
| `cire/tfidf.hpp` | TF-IDF extractor (corpus + single-doc fallback) |
| `cire/yake.hpp` | YAKE statistical extractor |
| `cire/textrank.hpp` | TextRank PageRank extractor |
| `cire/rake.hpp` | RAKE phrase extractor |
| `cire/extractor.hpp` | Top-level facade + ensemble + language detect |

Python (`import cire`)

| Symbol | What it does |
| --- | --- |
| `Extractor(language, algorithm, …)` | High-level facade class |
| `Language`, `Algorithm` | Enums (with string aliases) |
| `ExtractConfig`, `EnsembleConfig` | Per-call configuration |
| `extract_keywords(text, config)` | Run one algorithm |
| `extract_keywords_ensemble(…)` | Run all four, merge |
| `tokenize`, `split_sentences` | Low-level token utilities |
| `is_stopword`, `add_stopword` | Stopword inspection |
| `load_tenant_stopwords` | Tenant-specific stopword overlays |
| `load_tenant_dictionary`, `snap_term` | Tenant fuzzy dictionary snapping |
| `detect_language` | Heuristic script detection |
| `build_corpus_df` | Build a DF table for TF-IDF |

Algorithm selection

| Use case | Pick |
| --- | --- |
| Large corpus, need IDF signal | `TFIDF` |
| Single document, no corpus | `YAKE` |
| Want graph-based co-occurrence ranking | `TextRank` |
| Domain phrases (e.g. legal, medical) | `RAKE` |
| Combined signal from all four algorithms | `ensemble` |
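
This README does not describe how the ensemble merges the four result lists. As one plausible way to think about it, the sketch below uses reciprocal rank fusion, a generic technique that combines rankings without needing comparable score scales. It is an assumption for illustration, not the Cire merge strategy; `fuse_rankings`, the constant `k=60`, and the sample rankings are all hypothetical.

```python
from collections import defaultdict

def fuse_rankings(rankings, k=60, top_k=5):
    """Reciprocal rank fusion: each list adds 1 / (k + rank) to a keyword's total."""
    fused = defaultdict(float)
    for ranked in rankings.values():
        for rank, keyword in enumerate(ranked, start=1):
            fused[keyword] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]

# Made-up per-algorithm rankings, best keyword first (not real library output).
rankings = {
    "tfidf": ["transformers", "nlp", "language"],
    "yake": ["nlp", "transformers", "linguistics"],
    "textrank": ["nlp", "language", "transformers"],
    "rake": ["natural language processing", "nlp"],
}

merged = fuse_rankings(rankings)
for keyword, score in merged:
    print(f"{keyword:30s} {score:.4f}")
```

Keywords that several algorithms agree on rise to the top, while a term ranked first by only one algorithm stays below them.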

Multi-language behavior

- The tokenizer handles mixed-script text (e.g. "Hello世界world") correctly.
- For Chinese, Japanese, and Korean, the stopword list contains function words. The tokenizer also splits every CJK character into its own one-character token, a common baseline for scripts written without spaces between words.
- The casing salience feature in YAKE is automatically a no-op for caseless scripts (CJK, Hangul, Arabic).
- Use `cire.detect_language(text)` if you don't want to specify the language up front.
- Ghanaian language detection is heuristic and works best when native Unicode characters such as ƒ, ʋ, ɗ, ɓ, ƙ, ŋ, ɛ, and ɔ are preserved.
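
The mixed-script behavior described above can be pictured with a short pure-Python sketch: runs of alphanumeric characters become one token, while each CJK ideograph or kana character becomes its own token. This is a simplified illustration, not the Cire tokenizer; `is_cjk`, `tokenize`, and the chosen code point ranges are assumptions, and Hangul is left grouped into runs here.

```python
def is_cjk(ch):
    """True for ideographs and kana (a deliberately small set of ranges)."""
    cp = ord(ch)
    return (0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
            or 0x3400 <= cp <= 0x4DBF   # CJK Unified Ideographs Extension A
            or 0x3040 <= cp <= 0x30FF)  # Hiragana and Katakana

def tokenize(text):
    """Group alphanumeric runs; emit each CJK character as its own token."""
    tokens, word = [], []

    def flush():
        if word:
            tokens.append("".join(word))
            word.clear()

    for ch in text:
        if is_cjk(ch):
            flush()            # a CJK character ends any pending word
            tokens.append(ch)
        elif ch.isalnum():
            word.append(ch)
        else:
            flush()            # whitespace and punctuation separate tokens
    flush()
    return tokens

print(tokenize("Hello世界world"))
```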

Distribution

The Python package is built with scikit-build-core; for cross-platform wheels, configure cibuildwheel:

[tool.cibuildwheel]
build = ["cp38-*", "cp39-*", "cp310-*", "cp311-*", "cp312-*"]

Then:

python -m cibuildwheel --output-dir wheelhouse
twine upload wheelhouse/*

License

MIT.

Release history

- 0.2.4 (this version): Jun 19, 2026
- 0.2.3: Jun 19, 2026
- 0.2.2: Jun 19, 2026
- 0.1.2: Jun 15, 2026
- 0.1.1: Jun 15, 2026
- 0.1.0: Jun 15, 2026
