AI-powered semantic keyword extraction using sentence embeddings and MMR
semantic-keywords
AI-powered semantic keyword extraction: offline, fast, and actually useful.
Landing Page · PyPI · Issues
TF-IDF counts words. semantic-keywords understands meaning.
It uses sentence embeddings (all-MiniLM-L6-v2 by default) and Maximal Marginal Relevance (MMR) to return keywords that are both relevant and diverse, not just the most frequent phrases. Works fully offline after a one-time model download. No API key. No rate limits.
Input → "Tanzania is a hub for mobile money and fintech startups in East Africa."

Output →

```
mobile money        0.5134  ████████████████████
fintech startups    0.4901  ███████████████████
east africa         0.4710  ██████████████████
financial access    0.4502  █████████████████
agricultural tools  0.4388  ████████████████
```
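For intuition, here is a minimal MMR re-ranker in pure Python over toy 2-D vectors. It is illustrative only, not the package's implementation (which embeds real candidate phrases with a sentence transformer); it just shows the scoring idea: each pick maximizes `(1 - diversity) * relevance - diversity * redundancy`, matching the documented meaning of the `diversity` knob.

```python
import math

def cosine(a, b):
    # cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def mmr(doc_vec, cand_vecs, top_n=2, diversity=0.7):
    # Maximal Marginal Relevance: trade off relevance to the document
    # against similarity to already-selected candidates.
    selected, remaining = [], list(range(len(cand_vecs)))
    while remaining and len(selected) < top_n:
        best, best_score = None, float("-inf")
        for i in remaining:
            relevance = cosine(doc_vec, cand_vecs[i])
            redundancy = max(
                (cosine(cand_vecs[i], cand_vecs[j]) for j in selected),
                default=0.0,
            )
            score = (1 - diversity) * relevance - diversity * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        remaining.remove(best)
    return selected

# toy 2-D "embeddings": candidates 0 and 1 are near-duplicates
doc = [1.0, 1.0]
cands = [[1.0, 0.9], [1.0, 0.8], [0.0, 1.0]]
print(mmr(doc, cands, top_n=2, diversity=0.7))  # picks 0, then the dissimilar 2
```

With `diversity=0.0` the same call would return the two near-duplicates `[0, 1]`, which is exactly the paraphrase problem MMR exists to avoid.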
Table of contents
- Install
- Quick start
- File extraction (PDF, TXT, MD)
- CLI reference
- Python API reference
- Model options
- Configuration
- Developer guide
- Project structure
- Changelog
Install
pip install semantic-keywords
With PDF support:
pip install "semantic-keywords[files]"
Download a model (one-time, then fully offline):
# Quickest: 90 MB, works great for most use cases
python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('all-MiniLM-L6-v2')"
Or use the interactive downloader bundled with the repo:
python download_model.py
Quick start
Python API
```python
from semantic_keywords import extract

# Basic: returns top 5 keywords
results = extract("Tanzania is a hub for mobile money and fintech startups.")
for r in results:
    print(r["score"], r["keyword"])
# 0.5134 mobile money
# 0.4901 fintech startups
# 0.4710 east africa

# Full control
results = extract(
    text="your paragraph or document here",
    top_n=10,          # how many keywords to return
    min_score=0.25,    # only keep keywords above this similarity score
    diversity=0.7,     # 0.0 = most relevant, 1.0 = most varied
    model="balanced",  # "fast" | "balanced" | "accurate"
)
```
CLI
```shell
# Interactive guided mode: prompts you for text or a file path
semkw

# Inline text
semkw "Tanzania fintech mobile money startups"

# Top N with score table
semkw "climate change arctic ice melting" --top 8 --scores

# Pipe from stdin
echo "neural networks deep learning transformers" | semkw -n 3
```
File extraction
Extract keywords directly from .pdf, .txt, and .md files.
Python API
```python
from semantic_keywords import extract_file

# One-call file extraction
result = extract_file("annual_report.pdf", top_n=10)
print(result["file"])     # "annual_report.pdf"
print(result["size_kb"])  # 284.1
print(result["words"])    # 6203
for kw in result["keywords"]:
    print(kw["score"], kw["keyword"])

# Two-step: read then extract separately
from semantic_keywords import read_file, extract
text = read_file("notes.txt")  # returns raw string
results = extract(text, top_n=5)
```
extract_file() returns:

| Key | Type | Description |
|---|---|---|
| `file` | `str` | Filename (not full path) |
| `size_kb` | `float` | File size in KB |
| `words` | `int` | Word count of extracted text |
| `model` | `str` | Model alias used |
| `keywords` | `list[dict]` | `[{"keyword": str, "score": float}, ...]` |
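Because the return value is a plain dict, it is easy to post-process. A small sketch that dumps the keyword list to CSV with the standard library; the `result` dict here is hand-built in the documented shape, not a real extraction:

```python
import csv
import io

# a result dict shaped like the documented extract_file() return value
result = {
    "file": "report.pdf",
    "size_kb": 142.3,
    "words": 4821,
    "model": "fast",
    "keywords": [
        {"keyword": "mobile money", "score": 0.5134},
        {"keyword": "fintech startups", "score": 0.4901},
    ],
}

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["keyword", "score"])
for kw in result["keywords"]:
    writer.writerow([kw["keyword"], kw["score"]])
print(buf.getvalue())
```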
CLI
```shell
# Extract from a PDF
semkw --file report.pdf

# Top 10 with scores
semkw --file report.pdf --top 10 --scores

# Drag and drop the path in interactive mode
semkw
# → choose [2] Load from file
# → paste or drag the file path
```
PDF requirements
PDF support requires pypdf:
```shell
pip install pypdf
# or
pip install "semantic-keywords[files]"
```
Note: Image-only / scanned PDFs contain no extractable text. Run them through OCR (e.g. Adobe Acrobat, Tesseract) before using this package. Password-protected PDFs must be decrypted first.
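A cheap heuristic for catching scanned PDFs before running extraction: if a "successful" read yields almost no words, the file is probably image-only. This helper is not part of the package API, just an illustrative check on whatever text you have read:

```python
def probably_scanned(text: str, min_words: int = 10) -> bool:
    # Heuristic: an extraction that yields fewer than min_words words
    # usually means an image-only (scanned) PDF that needs OCR first.
    return len(text.split()) < min_words

print(probably_scanned(""))                                        # True: run OCR first
print(probably_scanned("Annual report for fiscal year 2024 " * 5)) # False: real text
```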
CLI reference
semkw [TEXT] [OPTIONS]
| Argument / Flag | Default | Description |
|---|---|---|
| `TEXT` | - | Inline text to extract from. Omit for interactive mode. |
| `--file, -f PATH` | - | Path to a `.pdf`, `.txt`, or `.md` file. |
| `--top, -n N` | `5` | Maximum keywords to return. |
| `--model, -m MODEL` | auto | `fast` · `balanced` · `accurate` |
| `--min-score FLOAT` | `0.20` | Minimum cosine similarity threshold (0.0–1.0). |
| `--diversity FLOAT` | `0.70` | MMR balance: 0.0 = most relevant, 1.0 = most varied. |
| `--scores` | off | Print ranked score table instead of plain list. |
| `--list-models` | - | Show all models and download status, then exit. |
Examples:
```shell
semkw                                           # interactive guided mode
semkw "your text here"                          # inline, default top 5
semkw "your text here" -n 3                     # top 3
semkw "your text here" --scores                 # with score table
semkw --file report.pdf                         # from PDF
semkw --file report.pdf -n 10 --model accurate  # PDF, top 10, best model
semkw --file notes.txt --scores                 # txt with scores
semkw --list-models                             # show downloaded models
echo "deep learning transformers" | semkw -n 3  # pipe
```
Python API reference
extract(text, **kwargs) → list[dict]

```python
from semantic_keywords import extract

results = extract(
    text,                # str: input document
    top_n=5,             # int: max keywords to return
    min_score=0.20,      # float: minimum cosine similarity (0.0–1.0)
    max_words=3,         # int: max words per keyword phrase
    model="fast",        # str: model alias or HuggingFace model name
    diversity=0.7,       # float: MMR diversity factor (0.0–1.0)
)
# → [{"keyword": "mobile money", "score": 0.5134}, ...]
```
extract_file(file_path, **kwargs) → dict

```python
from semantic_keywords import extract_file

result = extract_file(
    file_path,        # str | Path: path to .pdf, .txt, or .md
    top_n=5,
    min_score=0.20,
    max_words=3,
    model="fast",
    diversity=0.7,
)
# → {"file": "report.pdf", "size_kb": 142.3, "words": 4821,
#    "model": "fast", "keywords": [...]}
```
read_file(file_path) → str

```python
from semantic_keywords import read_file

text = read_file("report.pdf")  # raw extracted text string
```
detect_available_models() → dict

```python
from semantic_keywords import detect_available_models

available = detect_available_models()
# → {"fast": {"hf_name": "all-MiniLM-L6-v2", "size": "90MB", ...}}
```
list_models() → dict

```python
from semantic_keywords import list_models

all_models = list_models()
# → full MODEL_REGISTRY dict including models not yet downloaded
```
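One way to use the detection result is to prefer the best tier that is already on disk. The helper below is hypothetical (not part of the package API); it only assumes the documented dict-keyed-by-alias shape of `detect_available_models()`:

```python
def pick_model(available: dict, preference=("accurate", "balanced", "fast")) -> str:
    # Prefer the most accurate tier that is already downloaded;
    # fall back to "fast", which triggers a download on first use.
    for alias in preference:
        if alias in available:
            return alias
    return "fast"

# shaped like a detect_available_models() result (illustrative values)
available = {"fast": {"hf_name": "all-MiniLM-L6-v2", "size": "90MB"}}
print(pick_model(available))  # fast
```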
Model options
| Alias | HuggingFace model | Size | Speed | Best for |
|---|---|---|---|---|
| `fast` (default) | `all-MiniLM-L6-v2` | 90 MB | fastest | Most use cases |
| `balanced` | `all-MiniLM-L12-v2` | 120 MB | medium | Better accuracy |
| `accurate` | `all-mpnet-base-v2` | 420 MB | slowest | Research / high precision |
| (custom) | any HuggingFace model name | varies | varies | Advanced users |
All models run fully offline after the first download. The package auto-detects which models are present and shows a menu when multiple are available.
Download additional models:
python download_model.py
Use a custom HuggingFace model:
results = extract("your text", model="BAAI/bge-small-en-v1.5")
Configuration
min_score: precision vs. recall

| Value | Effect |
|---|---|
| `0.10` | Very broad: returns many keywords, some loosely related |
| `0.20` | Default: balanced precision |
| `0.30` | Strict: only highly relevant keywords |
| `0.40+` | Very strict: few but precise keywords |
diversity: MMR balance

| Value | Effect |
|---|---|
| `0.0` | Pure relevance: top keywords may paraphrase each other |
| `0.7` | Default: relevant and varied |
| `1.0` | Pure diversity: maximally varied, may miss the most relevant phrase |
max_words: phrase length

| Value | Effect |
|---|---|
| `1` | Single words only |
| `2` | Up to bigrams (e.g. "mobile money") |
| `3` | Up to trigrams (default): catches most meaningful phrases |
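The `min_score` threshold can be pictured as plain post-filtering over a scored list: drop everything below the cutoff, then keep the top N. This sketch mirrors the documented semantics but is illustrative, not the package's internal code:

```python
def postfilter(scored: list[dict], top_n: int = 5, min_score: float = 0.20) -> list[dict]:
    # Keep only keywords at or above the threshold, then truncate to top_n.
    kept = [kw for kw in scored if kw["score"] >= min_score]
    kept.sort(key=lambda kw: kw["score"], reverse=True)
    return kept[:top_n]

scored = [
    {"keyword": "mobile money", "score": 0.51},
    {"keyword": "weather", "score": 0.12},   # below the 0.20 default cutoff
    {"keyword": "fintech startups", "score": 0.49},
]
print(postfilter(scored, top_n=2, min_score=0.20))
```

Raising `min_score` can therefore return fewer than `top_n` results, which is exactly the precision/recall trade-off the table above describes.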
Developer guide
Fork and set up locally
```shell
# 1. Fork on GitHub, then clone your fork
git clone https://github.com/<your-username>/semantic-keywords.git
cd semantic-keywords

# 2. Create and activate a virtual environment
python -m venv .venv
# Windows
.venv\Scripts\activate
# macOS / Linux
source .venv/bin/activate

# 3. Install in editable mode with all dev dependencies
pip install -e ".[dev]"

# 4. Download at least one model
python download_model.py
```
Run the test suite
```shell
python test_extractor.py
```
This detects downloaded models, prompts you to pick one, runs all automated tests (text + file extraction + edge cases + error handling), then drops into a live interactive demo.
Linting and formatting
```shell
# Check for issues
ruff check semantic_keywords/
black --check semantic_keywords/
mypy semantic_keywords/

# Auto-fix what can be fixed automatically
ruff check --fix semantic_keywords/
black semantic_keywords/
```
All three must pass before opening a pull request. The CI workflow runs them automatically on every push.
Running the CLI locally (editable install)
After `pip install -e .`, the `semkw` command is live and points at your source files; any edit you make is reflected immediately without reinstalling.
```shell
semkw "your test text here" --scores
semkw --file path/to/test.pdf -n 10
semkw --list-models
```
Making a release
```shell
# Bump version in pyproject.toml and __init__.py
# Then tag and push: the publish workflow fires automatically
git add .
git commit -m "release: v0.2.0"
git tag v0.2.0
git push && git push --tags
```
The publish.yml workflow builds the wheel and uploads to PyPI using OIDC trusted publishing; no API token needed.
GitHub Actions workflows
| Workflow | Trigger | What it does |
|---|---|---|
| `ci.yml` | Every push / PR to `main` | ruff + black + mypy |
| `publish.yml` | Push a `v*.*.*` tag | Build wheel + upload to PyPI |
| `pages.yml` | Every push to `main` | Deploy `docs/` to GitHub Pages |
Adding a new model
Open semantic_keywords/extractor.py and add an entry to MODEL_REGISTRY:
```python
MODEL_REGISTRY: dict[str, dict[str, str]] = {
    "fast": {"hf_name": "all-MiniLM-L6-v2", "size": "90MB", "note": "..."},
    "balanced": {"hf_name": "all-MiniLM-L12-v2", "size": "120MB", "note": "..."},
    "accurate": {"hf_name": "all-mpnet-base-v2", "size": "420MB", "note": "..."},
    "your-alias": {"hf_name": "org/model-name", "size": "???MB", "note": "..."},  # ← add here
}
```
No other changes needed: the CLI menu, detection logic, and API all pick it up automatically.
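If you want a quick sanity check that a new entry carries the fields the rest of the code expects, a small hypothetical validator (not part of the package; the key set is taken from the registry entries shown above):

```python
REQUIRED_KEYS = {"hf_name", "size", "note"}

def validate_entry(alias: str, entry: dict) -> bool:
    # Fail loudly if a registry entry is missing a field that the
    # CLI menu or detection logic would later try to read.
    missing = REQUIRED_KEYS - entry.keys()
    if missing:
        raise ValueError(f"{alias}: missing keys {sorted(missing)}")
    return True

print(validate_entry("fast", {"hf_name": "all-MiniLM-L6-v2", "size": "90MB", "note": "default"}))
```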
Contributing
- Fork the repo
- Create a feature branch: `git checkout -b feat/your-feature`
- Make your changes and ensure all linters pass
- Open a pull request against `main`
Please open an issue first for significant changes so we can discuss the approach.
Project structure
```
semantic-keywords/
├── semantic_keywords/        # installable package
│   ├── __init__.py           # public API surface
│   ├── extractor.py          # embeddings, MMR, model registry
│   ├── reader.py             # PDF / txt / md file reading
│   ├── file_api.py           # extract_file(): reader + extractor combined
│   └── cli.py                # semkw CLI entry point
├── docs/
│   └── index.html            # GitHub Pages landing page
├── .github/
│   └── workflows/
│       ├── ci.yml            # lint on every push
│       ├── publish.yml       # publish to PyPI on version tag
│       └── pages.yml         # deploy docs on push to main
├── pyproject.toml            # package metadata + tool config
├── README.md
├── test_extractor.py         # test suite + interactive demo
└── download_model.py         # interactive model downloader
```
Changelog
v0.2.0
- Added `extract_file()`: keyword extraction directly from `.pdf`, `.txt`, `.md`
- Added `read_file()` and `file_info()` utilities
- Added `--file` / `-f` flag to the CLI
- Interactive mode now offers text input or file path as input options
- `pypdf` added as optional dependency (`pip install "semantic-keywords[files]"`)
- Bumped `__version__` to `0.2.0`

v0.1.0
- Initial release
- `extract()` with MMR ranking
- Three model tiers: `fast`, `balanced`, `accurate`
- Auto model detection from HuggingFace cache
- Interactive CLI (`semkw`) with guided prompts
- Stdin pipe support
License
MIT © Ronald Isack Gosso