Fast multi-language keyword extraction with tenant-aware stopwords and fuzzy dictionary snapping.
Cire / Nyansasua
Cire (Hausa) — knowledge / wisdom. Nyansasua (Twi) — learning / wisdom.
PyPI package: nyansasua · C++ library: Cire
A self-contained, fast C++17 library for multi-language keyword extraction, with first-class Python bindings.
- No external dependencies for the C++ core (no ICU, no Boost).
- 18 languages with stopword lists: English, Spanish, French, German, Italian, Portuguese, Dutch, Russian, Chinese, Japanese, Korean, Arabic, Indonesian, Twi/Akan, Ga, Ewe, Hausa, Fante.
- Tenant-aware stopword overlays for isolated domain/agent dictionaries.
- BK-tree fuzzy snapping for tenant-scoped canonical term correction.
- 4 algorithms (run any one, or combine via ensemble):
  - TF-IDF: corpus-driven, with a single-document entropy fallback
  - YAKE: statistical (Campos et al., 2020)
  - TextRank: graph-based PageRank (Mihalcea & Tarau, 2004)
  - RAKE: rapid automatic keyword extraction (Rose et al., 2010)
- UTF-8 everywhere — proper Unicode tokenizer with CJK / Cyrillic / Arabic / Hangul / Devanagari / Thai / Hiragana / Katakana support.
- C++17 with a clean public API; the pybind11 Python module ships in python/.
Project layout
cire/
├── cpp/                    C++ core
│   ├── include/cire/       Public headers
│   ├── src/                Implementations
│   ├── examples/           Demo program (9 languages)
│   ├── tests/              C++ test suite
│   └── CMakeLists.txt
├── python/                 pybind11 Python wrapper
│   ├── bindings.cpp
│   ├── cire/               Python package
│   └── tests/              pytest suite
├── CMakeLists.txt          Top-level build (optional)
├── pyproject.toml          Python packaging metadata
└── README.md
C++ quick start
#include <cire/extractor.hpp>
#include <cstdio>
#include <string>

int main() {
    std::string text = "Natural language processing (NLP) is a subfield of "
                       "linguistics, computer science, and artificial "
                       "intelligence concerned with the interactions between "
                       "computers and human language. Transformers have "
                       "revolutionized NLP.";

    cire::EnsembleConfig cfg;
    cfg.language = cire::Language::Auto;  // detect the language from the text
    cfg.top_k = 5;                        // keep the five best keywords

    // Print each keyword with its ensemble score.
    for (const auto& k : cire::extract_keywords_ensemble(text, cfg)) {
        std::printf("%-20s score=%.3f\n", k.text.c_str(), k.score);
    }
}
Build (C++ only)
cd cpp
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build . -j
./cire_tests
./cire_bench
./cire_demo # multi-language demo
Python quick start
Install from PyPI
pip install nyansasua
Nyansasua installs as the cire module:
import cire
print(cire.__version__)
Install from source
cd Cire
pip install -e .
This uses scikit-build-core + pybind11 to compile the C++ core and produce
a wheel that bundles the compiled extension. Once installed:
import cire
# One-liner
for k in cire.extract_keywords("Hello world", top_k=5):
    print(k.text, k.score)

# Or use the high-level Extractor class
ext = cire.Extractor(language="auto", algorithm="ensemble", top_k=10)
for k in ext.extract("Machine learning is a branch of AI."):
    print(k.text, k.score)

# Batch processing
results = ext.extract_many([
    "Python is widely used in data science and machine learning.",
    "Climate change is one of the biggest challenges facing humanity.",
])

# Corpus-driven TF-IDF (uses document frequency across many texts)
corpus = [
    "Python is used in data science.",
    "Java is used in enterprise.",
    "Python is great for scripting.",
    "Java runs on the JVM.",
]
kws = ext.extract_corpus_tfidf(corpus, "Python is popular for ML and AI.", top_k=5)
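
To make the corpus-driven idea concrete, here is a minimal pure-Python sketch of document-frequency weighting. It is not Cire's implementation: the regex tokenizer, the smoothed IDF formula, and the helper names build_df and tfidf_scores are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; a stand-in for a real Unicode tokenizer."""
    return re.findall(r"\w+", text.lower())


def build_df(corpus):
    """Count, for each term, how many documents contain it."""
    df = Counter()
    for doc in corpus:
        df.update(set(tokenize(doc)))
    return df


def tfidf_scores(text, df, n_docs):
    """Score each term of `text` by term frequency times smoothed IDF."""
    tf = Counter(tokenize(text))
    total = sum(tf.values())
    scores = {}
    for term, count in tf.items():
        # Smoothed IDF (an assumed variant): rarer terms get larger weights.
        idf = math.log((1 + n_docs) / (1 + df.get(term, 0))) + 1.0
        scores[term] = (count / total) * idf
    return scores


corpus = [
    "Python is used in data science.",
    "Java is used in enterprise.",
    "Python is great for scripting.",
    "Java runs on the JVM.",
]
df = build_df(corpus)
scores = tfidf_scores("Python is popular for ML and AI.", df, len(corpus))
for term, score in sorted(scores.items(), key=lambda kv: -kv[1])[:3]:
    print(f"{term:10s} {score:.3f}")
```

Terms that appear in many corpus documents (such as "is") are pushed down, while terms the corpus has never seen (such as "popular") rise to the top.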
Tenant dictionaries for domains
Use tenant dictionaries when different November agents need isolated domain vocabulary in memory at the same time.
import cire
cire.load_tenant_dictionary(
    "education",
    ["mathematics", "english", "fractions", "lesson_note", "B2"],
)
cire.load_tenant_dictionary(
    "banking",
    ["mobile_money", "microloan", "GHS", "susu"],
)
print(cire.snap_term("education", "mathematic")) # mathematics
print(cire.snap_term("education", "fracions")) # fractions
print(cire.snap_term("banking", "micro-loan")) # microloan
print(cire.snap_term("education", "micro-loan")) # micro-loan, no banking leakage
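
For readers unfamiliar with BK-trees, the sketch below shows how tenant-scoped snapping could work in principle. It is a simplified pure-Python illustration, not the library's C++ code: the BKTree class, the load_dictionary and snap helpers, and the default distance threshold of 2 are all assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


class BKTree:
    """Each node is [term, {edge_distance: child_node}]."""

    def __init__(self):
        self.root = None

    def add(self, term):
        if self.root is None:
            self.root = [term, {}]
            return
        node = self.root
        while True:
            d = edit_distance(term, node[0])
            if d == 0:
                return  # already present
            child = node[1].get(d)
            if child is None:
                node[1][d] = [term, {}]
                return
            node = child

    def nearest(self, query, max_dist):
        """Return the closest term within max_dist, or None."""
        best = None
        stack = [self.root] if self.root else []
        while stack:
            term, children = stack.pop()
            d = edit_distance(query, term)
            if d <= max_dist and (best is None or d < best[0]):
                best = (d, term)
            # Triangle inequality: a match can only sit under an edge
            # whose label lies in [d - max_dist, d + max_dist].
            for edge, child in children.items():
                if d - max_dist <= edge <= d + max_dist:
                    stack.append(child)
        return best[1] if best else None


tenants = {}  # tenant id -> BKTree, so vocabularies never mix


def load_dictionary(tenant, terms):
    tree = tenants.setdefault(tenant, BKTree())
    for term in terms:
        tree.add(term)


def snap(tenant, term, max_dist=2):
    """Snap to the tenant's nearest term, or return the input unchanged."""
    tree = tenants.get(tenant)
    match = tree.nearest(term, max_dist) if tree else None
    return match if match is not None else term


load_dictionary("education", ["mathematics", "english", "fractions"])
load_dictionary("banking", ["mobile_money", "microloan", "susu"])
print(snap("education", "fracions"))    # fractions
print(snap("banking", "micro-loan"))    # microloan
print(snap("education", "micro-loan"))  # micro-loan (no banking leakage)
```

Keeping one tree per tenant is what provides the isolation: a lookup never visits another tenant's terms, so a banking term cannot be returned for an education query.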
Tenant stopwords
Tenant stopwords are isolated overlays on top of built-in language stopwords.
import cire
cire.load_tenant_stopwords(
    "health",
    cire.Language.English,
    ["please", "show", "patient", "case"],
)

cfg = cire.ExtractConfig()
cfg.language = cire.Language.English
cfg.algorithm = cire.Algorithm.YAKE
cfg.tenant_id = "health"
cfg.top_k = 5

for kw in cire.extract_keywords("Please show malaria treatment for this patient case", cfg):
    print(kw.text, kw.score)
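
The overlay idea can be pictured as a per-tenant set consulted in addition to the built-in list. The following pure-Python sketch illustrates that lookup order only; the data structures, the tiny built-in list, and the helper names load_overlay, is_stopword, and content_words are assumptions, not the library's internals.

```python
BUILTIN_STOPWORDS = {
    "en": {"the", "a", "for", "this", "of"},  # tiny illustrative subset
}
tenant_overlays = {}  # (tenant_id, language) -> extra stopwords


def load_overlay(tenant_id, language, words):
    """Add tenant-specific stopwords without touching the built-in list."""
    key = (tenant_id, language)
    tenant_overlays.setdefault(key, set()).update(w.lower() for w in words)


def is_stopword(word, language, tenant_id=None):
    """Check the built-in list first, then the tenant's overlay if given."""
    w = word.lower()
    if w in BUILTIN_STOPWORDS.get(language, set()):
        return True
    return w in tenant_overlays.get((tenant_id, language), set())


def content_words(text, language, tenant_id=None):
    """Whitespace-split the text and drop stopwords."""
    words = text.lower().split()
    return [w for w in words if not is_stopword(w, language, tenant_id)]


load_overlay("health", "en", ["please", "show", "patient", "case"])
query = "Please show malaria treatment for this patient case"
print(content_words(query, "en", "health"))  # ['malaria', 'treatment']
print(content_words(query, "en"))            # overlay is not applied
```

Because the overlay is keyed by tenant and language, a call made without a tenant id (or with a different one) sees only the built-in stopwords.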
Education lesson-note query example
This mirrors a November Education agent that extracts expected filters first, uses an alias map for semantic aliases, and lets Nyansasua snap remaining spelling variants with the Education tenant dictionary.
import cire
cire.load_tenant_dictionary(
    "education",
    [
        "B2",
        "english",
        "lesson_note",
        "GES",
        "core_competencies",
        "assessment_task",
    ],
)

# Semantic aliases that fuzzy matching alone cannot resolve.
aliases = {
    "basic 2": "B2",
    "english language": "english",
}

# Entities extracted upstream; "Englsh" is a deliberate misspelling.
entities = {
    "grade": "Basic 2",
    "subject": "Englsh",
}

normalized = {}
for field, value in entities.items():
    key = value.lower()
    # Try the alias map first, then snap within edit distance 2.
    normalized[field] = aliases.get(key) or cire.snap_term("education", key, 2)

print(normalized)
# {'grade': 'B2', 'subject': 'english'}
Ghanaian language detection
import cire
samples = {
    "ewe": "ame ƒe nu",
    "hausa": "ɗan makaranta yana karatu",
    "ga": "ŋɔɔ kɛ sane",
    "fante": "me dɛ hom nyina",
}

for label, text in samples.items():
    lang = cire.detect_language(text)
    print(label, cire.language_name(lang), cire.language_code(lang))
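
Detection of these languages leans on distinctive Unicode letters, so lossy preprocessing can erase the signal before the text reaches the detector. The sketch below is an illustration under assumptions, not Cire's detector: it only counts the marker characters named in this README and shows that an ASCII-folding cleanup step removes all of them. The helper names marker_count and ascii_fold are hypothetical.

```python
import unicodedata

# Characters named in this README as useful signals. Which language each
# one points to is left to the library and is not modeled here.
MARKERS = set("ƒʋɗɓƙŋɛɔ")


def marker_count(text):
    """Number of marker characters in text (case-insensitive)."""
    return sum(1 for ch in text.lower() if ch in MARKERS)


def ascii_fold(text):
    """A lossy cleanup some pipelines apply: drop everything non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


samples = [
    "ame ƒe nu",
    "ɗan makaranta yana karatu",
    "ŋɔɔ kɛ sane",
    "me dɛ hom nyina",
]
for text in samples:
    folded = ascii_fold(text)
    print(f"{text!r}: {marker_count(text)} markers, "
          f"{marker_count(folded)} after folding -> {folded!r}")
```

If an upstream step normalizes input this way, consider passing the original text to the detector and folding only afterwards.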
Build the Python module directly (no scikit-build)
cd cpp
mkdir build && cd build
cmake .. -DCIRE_BUILD_PYTHON=ON -Dpybind11_DIR=$(python3 -m pybind11 --cmakedir)
cmake --build . -j
# The .so is dropped into python/cire/ by the top-level CMake hook.
Run the Python tests
cd python
pytest tests/
API surface
C++
| Header | What it does |
|---|---|
| `cire/types.hpp` | Language, Token, Keyword, Sentence |
| `cire/tokenizer.hpp` | UTF-8 tokenizer + sentence splitter |
| `cire/stopwords.hpp` | Stopword lists + tenant overlays |
| `cire/snapper.hpp` | Tenant fuzzy dictionary snapping |
| `cire/tfidf.hpp` | TF-IDF extractor (corpus + single-doc fallback) |
| `cire/yake.hpp` | YAKE statistical extractor |
| `cire/textrank.hpp` | TextRank PageRank extractor |
| `cire/rake.hpp` | RAKE phrase extractor |
| `cire/extractor.hpp` | Top-level facade + ensemble + language detect |
Python (import cire)
| Symbol | What it does |
|---|---|
| `Extractor(language, algorithm, …)` | High-level facade class |
| `Language`, `Algorithm` | Enums (with string aliases) |
| `ExtractConfig`, `EnsembleConfig` | Per-call configuration |
| `extract_keywords(text, config)` | Run one algorithm |
| `extract_keywords_ensemble(…)` | Run all four, merge |
| `tokenize`, `split_sentences` | Low-level token utilities |
| `is_stopword`, `add_stopword` | Stopword inspection |
| `load_tenant_stopwords` | Tenant-specific stopword overlays |
| `load_tenant_dictionary`, `snap_term` | Tenant fuzzy dictionary snapping |
| `detect_language` | Heuristic script detection |
| `build_corpus_df` | Build a DF table for TF-IDF |
Algorithm selection
| Use case | Pick |
|---|---|
| Large corpus, need IDF signal | TFIDF |
| Single document, no corpus | YAKE |
| Want graph-based co-occurrence ranking | TextRank |
| Domain phrases (e.g. legal, medical) | RAKE |
| Best of all worlds | ensemble |
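
This README does not spell out how the ensemble merges the four rankings. As one plausible illustration, the sketch below uses reciprocal rank fusion (RRF), a generic rank-merging technique; it is an assumption swapped in for explanation, not a description of Cire's merge step. The function name fuse_rankings and the sample rankings are invented.

```python
from collections import defaultdict


def fuse_rankings(rankings, top_k=5, k=60):
    """Merge ranked keyword lists with reciprocal rank fusion (RRF).

    `rankings` maps an algorithm name to its keywords, best first.
    Each keyword earns 1 / (k + rank) from every list it appears in.
    """
    fused = defaultdict(float)
    for keywords in rankings.values():
        for rank, keyword in enumerate(keywords, start=1):
            fused[keyword.lower()] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties alphabetically for stable output.
    ordered = sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))
    return ordered[:top_k]


# Hypothetical per-algorithm outputs, invented for illustration.
rankings = {
    "tfidf": ["transformers", "nlp", "linguistics"],
    "yake": ["nlp", "natural language processing", "transformers"],
    "textrank": ["nlp", "language", "transformers"],
    "rake": ["natural language processing", "human language", "nlp"],
}
for keyword, score in fuse_rankings(rankings, top_k=3):
    print(f"{keyword:30s} {score:.4f}")
```

Rank-based fusion sidesteps the fact that raw scores from TF-IDF, YAKE, TextRank, and RAKE live on different scales; keywords that several algorithms agree on accumulate the highest fused score.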
Multi-language behavior
- The tokenizer handles mixed-script text (e.g. "Hello世界world") correctly.
- For Chinese, Japanese, and Korean, the stopword lists contain function words; the tokenizer also splits every CJK character into its own one-character token, a common baseline approach for these scripts.
- The casing salience feature in YAKE is automatically a no-op for caseless scripts (CJK, Hangul, Arabic).
- Use cire.detect_language(text) if you don't want to specify the language up front.
- Ghanaian language detection is heuristic and works best when native Unicode characters such as ƒ, ʋ, ɗ, ɓ, ƙ, ŋ, ɛ, and ɔ are preserved.
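
The mixed-script behavior described above can be sketched in a few lines of Python. This is a simplified illustration rather than the library's tokenizer: it treats only CJK Unified Ideographs (the basic block and Extension A) as single-character tokens, and the helper names is_cjk and tokenize are assumptions.

```python
def is_cjk(ch):
    """True for CJK Unified Ideographs (basic block and Extension A only)."""
    code = ord(ch)
    return 0x4E00 <= code <= 0x9FFF or 0x3400 <= code <= 0x4DBF


def tokenize(text):
    """Split into words, emitting each CJK ideograph as its own token."""
    tokens, word = [], []
    for ch in text:
        if is_cjk(ch):
            if word:  # flush the pending non-CJK word
                tokens.append("".join(word))
                word = []
            tokens.append(ch)
        elif ch.isalnum():
            word.append(ch)
        else:  # whitespace or punctuation ends the current word
            if word:
                tokens.append("".join(word))
                word = []
    if word:
        tokens.append("".join(word))
    return tokens


print(tokenize("Hello世界world"))  # ['Hello', '世', '界', 'world']
```

A script change acts as a token boundary even without whitespace, which is why "Hello", the two ideographs, and "world" come out as four separate tokens.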
Distribution
The Python package is built with scikit-build-core; for cross-platform
wheels, configure cibuildwheel:
[tool.cibuildwheel]
build = ["cp38-*", "cp39-*", "cp310-*", "cp311-*", "cp312-*"]
Then:
python -m cibuildwheel --output-dir wheelhouse
twine upload wheelhouse/*
License
MIT.