semantic-keywords

AI-powered semantic keyword extraction: offline, fast, and actually useful.

CI · PyPI version · Python 3.9+ · License: MIT · Downloads


📖 Landing Page · 📦 PyPI · 🐛 Issues



TF-IDF counts words. semantic-keywords understands meaning.

It uses sentence embeddings (all-MiniLM-L6-v2 by default) and Maximal Marginal Relevance (MMR) to return keywords that are both relevant and diverse, not just the most frequent phrases. Works fully offline after a one-time model download. No API key. No rate limits.

Input  → "Tanzania is a hub for mobile money and fintech startups in East Africa."

Output → mobile money       0.5134  ████████████████░░░░░░░░
         fintech startups   0.4901  ██████████████░░░░░░░░░░
         east africa        0.4710  █████████████░░░░░░░░░░░
         financial access   0.4502  ████████████░░░░░░░░░░░░
         agricultural tools 0.4388  ████████████░░░░░░░░░░░░
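The MMR step can be sketched in a few lines: each round it picks the candidate whose relevance to the document best outweighs its similarity to keywords already chosen, with the diversity weight controlling the trade-off. A toy illustration with hand-made vectors standing in for sentence embeddings (not the package's actual implementation):

```python
# Minimal MMR sketch. Toy 3-d vectors stand in for sentence embeddings;
# this illustrates the ranking idea only, not semantic-keywords internals.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def mmr(doc_vec, candidates, top_n=2, diversity=0.7):
    """candidates: list of (phrase, vector). Higher diversity penalizes
    candidates similar to phrases already selected."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < top_n:
        def score(item):
            _, vec = item
            relevance = cosine(vec, doc_vec)
            redundancy = max((cosine(vec, sv) for _, sv in selected), default=0.0)
            return (1 - diversity) * relevance - diversity * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return [phrase for phrase, _ in selected]

doc = [1.0, 1.0, 0.0]
cands = [
    ("mobile money", [1.0, 0.9, 0.0]),  # highly relevant
    ("mobile cash",  [0.9, 1.0, 0.0]),  # near-duplicate of the first
    ("east africa",  [0.5, 0.0, 1.0]),  # different topic
]
print(mmr(doc, cands, top_n=2, diversity=0.7))
# → ['mobile money', 'east africa']
```

With diversity at 0.7, the near-duplicate "mobile cash" is skipped in favor of the less relevant but distinct "east africa"; at diversity 0.0 the two most relevant (near-duplicate) phrases would win.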

Install

pip install semantic-keywords

With PDF support:

pip install "semantic-keywords[files]"

Download a model (one-time, then fully offline):

# Quickest: 90 MB, works great for most use cases
python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('all-MiniLM-L6-v2')"

Or use the interactive downloader bundled with the repo:

python download_model.py

Docker (quick start)

No Python install needed; run directly in a container:

# Pull and run inline text
docker run --rm ronaldgosso/semantic-keywords "Tanzania fintech mobile money"

# Extract from a file
docker run --rm -v ./documents:/data ronaldgosso/semantic-keywords --file /data/report.pdf

# Interactive mode
docker run --rm -it ronaldgosso/semantic-keywords

Full Docker guide: See README_DOCKER.md for build instructions, compose usage, and production deployment.


Quick start

Python API

from semantic_keywords import extract

# Basic: returns the top 5 keywords
results = extract("Tanzania is a hub for mobile money and fintech startups.")

for r in results:
    print(r["score"], r["keyword"])

# 0.5134  mobile money
# 0.4901  fintech startups
# 0.4710  east africa

# Full control
results = extract(
    text      = "your paragraph or document here",
    top_n     = 10,          # how many keywords to return
    min_score = 0.25,        # only keep keywords above this similarity score
    diversity = 0.7,         # 0.0 = most relevant, 1.0 = most varied
    model     = "balanced",  # "fast" | "balanced" | "accurate"
)

CLI

# Interactive guided mode: prompts you for text or a file path
semkw

# Inline text
semkw "Tanzania fintech mobile money startups"

# Top N with score table
semkw "climate change arctic ice melting" --top 8 --scores

# Pipe from stdin
echo "neural networks deep learning transformers" | semkw -n 3

File extraction

Extract keywords directly from .pdf, .txt, and .md files.

Python API

from semantic_keywords import extract_file

# One-call file extraction
result = extract_file("annual_report.pdf", top_n=10)

print(result["file"])      # "annual_report.pdf"
print(result["size_kb"])   # 284.1
print(result["words"])     # 6203

for kw in result["keywords"]:
    print(kw["score"], kw["keyword"])

# Two-step: read then extract separately
from semantic_keywords import read_file, extract

text    = read_file("notes.txt")        # returns raw string
results = extract(text, top_n=5)

extract_file() returns:

Key       Type        Description
file      str         Filename (not full path)
size_kb   float       File size in KB
words     int         Word count of extracted text
model     str         Model alias used
keywords  list[dict]  [{"keyword": str, "score": float}, ...]

CLI

# Extract from a PDF
semkw --file report.pdf

# Top 10 with scores
semkw --file report.pdf --top 10 --scores

# Drag and drop the path in interactive mode
semkw
# → choose [2] Load from file
# → paste or drag the file path

PDF requirements

PDF support requires pypdf:

pip install pypdf
# or
pip install "semantic-keywords[files]"

Note: Image-only / scanned PDFs contain no extractable text. Run them through OCR (e.g. Adobe Acrobat, Tesseract) before using this package. Password-protected PDFs must be decrypted first.


CLI reference

semkw [TEXT] [OPTIONS]
Argument / Flag     Default  Description
TEXT                -        Inline text to extract from. Omit for interactive mode.
--file, -f PATH     -        Path to a .pdf, .txt, or .md file.
--top, -n N         5        Maximum keywords to return.
--model, -m MODEL   auto     fast · balanced · accurate
--min-score FLOAT   0.20     Minimum cosine similarity threshold (0.0–1.0).
--diversity FLOAT   0.70     MMR balance: 0.0 = most relevant, 1.0 = most varied.
--scores            off      Print ranked score table instead of plain list.
--list-models       -        Show all models and download status, then exit.

Examples:

semkw                                              # interactive guided mode
semkw "your text here"                             # inline, default top 5
semkw "your text here" -n 3                        # top 3
semkw "your text here" --scores                    # with score table
semkw --file report.pdf                            # from PDF
semkw --file report.pdf -n 10 --model accurate     # PDF, top 10, best model
semkw --file notes.txt --scores                    # txt with scores
semkw --list-models                                # show downloaded models
echo "deep learning transformers" | semkw -n 3     # pipe

Python API reference

Google Colab Example Link

extract(text, **kwargs) → list[dict]

from semantic_keywords import extract

results = extract(
    text      : str,            # input document
    top_n     : int   = 5,      # max keywords to return
    min_score : float = 0.20,   # minimum cosine similarity (0.0–1.0)
    max_words : int   = 3,      # max words per keyword phrase
    model     : str   = "fast", # model alias or HuggingFace model name
    diversity : float = 0.7,    # MMR diversity factor (0.0–1.0)
)
# → [{"keyword": "mobile money", "score": 0.5134}, ...]

extract_file(file_path, **kwargs) → dict

from semantic_keywords import extract_file

result = extract_file(
    file_path : str | Path,     # path to .pdf, .txt, or .md
    top_n     : int   = 5,
    min_score : float = 0.20,
    max_words : int   = 3,
    model     : str   = "fast",
    diversity : float = 0.7,
)
# → {"file": "report.pdf", "size_kb": 142.3, "words": 4821,
#    "model": "fast", "keywords": [...]}

read_file(file_path) → str

from semantic_keywords import read_file

text = read_file("report.pdf")   # raw extracted text string

detect_available_models() → dict

from semantic_keywords import detect_available_models

available = detect_available_models()
# → {"fast": {"hf_name": "all-MiniLM-L6-v2", "size": "90MB", ...}}

list_models() → dict

from semantic_keywords import list_models

all_models = list_models()
# → full MODEL_REGISTRY dict including models not yet downloaded

Model options

Alias            HuggingFace model           Size    Speed    Best for
fast (default)   all-MiniLM-L6-v2            90 MB   fastest  Most use cases
balanced         all-MiniLM-L12-v2           120 MB  medium   Better accuracy
accurate         all-mpnet-base-v2           420 MB  slowest  Research / high precision
(custom)         any HuggingFace model name  varies  varies   Advanced users

All models run fully offline after the first download. The package auto-detects which models are present and shows a menu when multiple are available.

Download additional models:

python download_model.py

Use a custom HuggingFace model:

results = extract("your text", model="BAAI/bge-small-en-v1.5")

Configuration

min_score: precision vs recall

Value   Effect
0.10    Very broad: returns many keywords, some loosely related
0.20    Default: balanced precision
0.30    Strict: only highly relevant keywords
0.40+   Very strict: few but precise keywords

diversity: MMR balance

Value   Effect
0.0     Pure relevance: top keywords may paraphrase each other
0.7     Default: relevant and varied
1.0     Pure diversity: maximally varied, may miss the most relevant phrase

max_words: phrase length

Value   Effect
1       Single words only
2       Up to bigrams (e.g. "mobile money")
3       Up to trigrams: the default, catches most meaningful phrases
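Candidate phrases are typically generated as word n-grams up to max_words before they are embedded and ranked. A rough sketch of that candidate-generation step (the candidate_phrases name is hypothetical, not the package API; real extractors also filter stopwords):

```python
# Sketch: generate lowercase word n-grams (1..max_words) as keyword
# candidates. Illustration only; not the package's actual candidate logic.
import re

def candidate_phrases(text, max_words=3):
    words = re.findall(r"[a-z0-9]+", text.lower())
    phrases = []
    for n in range(1, max_words + 1):
        for i in range(len(words) - n + 1):
            phrases.append(" ".join(words[i:i + n]))
    return phrases

print(candidate_phrases("Mobile money grows fast", max_words=2))
# → ['mobile', 'money', 'grows', 'fast', 'mobile money',
#    'money grows', 'grows fast']
```

Raising max_words grows the candidate pool roughly linearly per extra n-gram order, which is why 3 is a practical default.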

Contributing

Contributions are welcome! See CONTRIBUTING.md for the full developer guide, including:

  • Fork and local setup instructions
  • Running tests and linters
  • Making a release
  • Adding new models
  • Docker development workflow

Quick contributor setup

# Fork on GitHub, then clone your fork
git clone https://github.com/<your-username>/semantic-keywords.git
cd semantic-keywords

# Create and activate a virtual environment
python -m venv .venv
# Windows: .venv\Scripts\activate
# macOS/Linux: source .venv/bin/activate

# Install in editable mode with dev dependencies
pip install -e ".[dev]"

# Download a model
python download_model.py

Project structure

semantic-keywords/
├── semantic_keywords/          # installable package
│   ├── __init__.py             # public API surface
│   ├── extractor.py            # embeddings, MMR, model registry
│   ├── reader.py               # PDF / txt / md file reading
│   ├── file_api.py             # extract_file(): reader + extractor combined
│   └── cli.py                  # semkw CLI entry point
├── docs/
│   └── index.html              # GitHub Pages landing page
├── .github/
│   └── workflows/
│       ├── ci.yml              # lint on every push
│       ├── publish.yml         # publish to PyPI on version tag
│       ├── docker.yml          # build & push Docker image
│       └── pages.yml           # deploy docs on push to main
├── pyproject.toml              # package metadata + tool config
├── Dockerfile                  # multi-stage Docker build
├── docker-compose.yml          # Docker Compose for local usage
├── .dockerignore               # files to exclude from Docker build
├── README.md                   # this file: user documentation
├── README_DOCKER.md            # Docker-specific instructions
├── CONTRIBUTING.md             # developer guide
├── test_extractor.py           # test suite + interactive demo
└── download_model.py           # interactive model downloader

Changelog

v0.2.0

  • Added extract_file(): keyword extraction directly from .pdf, .txt, .md
  • Added read_file() and file_info() utilities
  • Added --file / -f flag to the CLI
  • Interactive mode now offers text input or file path as input options
  • pypdf added as optional dependency (pip install semantic-keywords[files])
  • Bumped __version__ to 0.2.0

v0.1.0

  • Initial release
  • extract() with MMR ranking
  • Three model tiers: fast, balanced, accurate
  • Auto model detection from HuggingFace cache
  • Interactive CLI (semkw) with guided prompts
  • Stdin pipe support

Links

Resource            URL
Landing page        https://ronaldgosso.github.io/semantic-keywords
PyPI                https://pypi.org/project/semantic-keywords/
GitHub              https://github.com/ronaldgosso/semantic-keywords
Issues              https://github.com/ronaldgosso/semantic-keywords/issues
CI status           https://github.com/ronaldgosso/semantic-keywords/actions
Contributing guide  CONTRIBUTING.md
Docker guide        README_DOCKER.md

License

MIT © Ronald Isack Gosso
