Modification of the KeyBERT method to extract keywords and keyphrases using chunks. This provides better results, especially when handling long documents.

ChunkeyBERT - Unsupervised Keyword Extraction from Long Documents

Overview

ChunkeyBERT is a minimal, easy-to-use technique that leverages embeddings for unsupervised keyword and keyphrase extraction from text documents. It modifies the KeyBERT method to handle documents of arbitrary length with better results. ChunkeyBERT chunks each document, uses KeyBERT to extract candidate keywords/keyphrases from every chunk, and then applies a configurable merge stage to produce the final keywords for the entire document. Any document chunking method can be used as long as it can be wrapped in a simple function; ChunkeyBERT can also work without a chunker and process the entire document as a single chunk. It works with any configuration of KeyBERT and can handle batches of documents.

Installation

Install from PyPI using pip (preferred method):

pip install chunkey-bert

Use Cases

ChunkeyBERT is most useful when you want lightweight, deterministic keyword extraction for long documents without relying on an LLM. Typical use cases include:

tagging and indexing large document collections
search, clustering, and downstream retrieval features
privacy-sensitive or offline processing pipelines
batch extraction where LLM latency or cost would be too high
generating candidate phrases for later reranking or normalization by an LLM

In LLM-based systems, ChunkeyBERT is often best used as a cheap first-stage candidate extraction layer rather than as a replacement for generative semantic analysis.

Details

How does ChunkeyBERT differ from KeyBERT?

ChunkeyBERT differs from KeyBERT primarily in its approach to handling long documents for keyword extraction. While KeyBERT directly applies keyword extraction techniques to the entire document, ChunkeyBERT introduces an additional step of chunking the document into smaller, manageable pieces before applying KeyBERT's keyword extraction methods. This modification aims to improve the performance and relevance of the extracted keywords, especially for longer documents where directly applying KeyBERT might not yield optimal results due to the complexity and size of the document. Here are the key differences:

Document chunking: ChunkeyBERT uses a chunking method to divide a long document into smaller chunks. This is done through the chunker parameter in the extract_keywords method. The chunker can be any callable that takes a string (the document) and returns a list of strings (the chunks). This allows ChunkeyBERT to process each chunk independently, making it more effective at handling long documents. A chunker could be as simple as

chunker: Callable[[str], List[str]] = lambda text: [t for t in text.split("\n\n") if len(t) > 25]

or it can wrap more complicated logic, for example a LangChain chunker.

Handling of chunks: After chunking, ChunkeyBERT applies KeyBERT's keyword extraction to each chunk separately.

Keyword scoring and selection: ChunkeyBERT introduces additional logic to score and select keywords based on their occurrence across different chunks and their similarity.

Merge strategies: ChunkeyBERT supports multiple strategies for merging chunk-level keywords into document-level results. The default "similarity" strategy ranks keywords by semantic centrality, while "count" ranks them by repetition across chunks. Custom merge callables are also supported.
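To make the two built-in ranking ideas concrete, here is a minimal NumPy sketch. It is not ChunkeyBERT's implementation: the function names rank_by_centrality and rank_by_count are hypothetical, and "semantic centrality" is assumed here to mean the mean cosine similarity of a keyword embedding to the other pooled keyword embeddings.

```python
import numpy as np


def rank_by_centrality(embeddings: np.ndarray, top_k: int) -> np.ndarray:
    """Indices of the top_k rows with the highest mean cosine similarity to the other rows.

    Assumption: this is one plausible reading of "semantic centrality",
    not necessarily the formula the library uses.
    """
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # guard against zero vectors
    sim = unit @ unit.T                              # pairwise cosine similarities
    np.fill_diagonal(sim, 0.0)                       # ignore self-similarity
    centrality = sim.sum(axis=1) / max(len(unit) - 1, 1)
    return np.argsort(-centrality, kind="stable")[:top_k]


def rank_by_count(counts: np.ndarray, top_k: int) -> np.ndarray:
    """Indices of the top_k keywords that were repeated across the most chunks."""
    return np.argsort(-counts, kind="stable")[:top_k]


# Made-up toy data: four pooled keywords with 2-D embeddings and chunk counts.
toy_embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0]])
toy_counts = np.array([1, 1, 1, 5])

central = rank_by_centrality(toy_embeddings, top_k=2)  # indices 2 and 1
frequent = rank_by_count(toy_counts, top_k=2)          # indices 3 and 0
```

In this toy data the fourth keyword is repeated most often but points away from the others, so the two rankings disagree. That is the kind of trade-off the merge strategy setting controls.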

Flexibility in keyword extraction

ChunkeyBERT offers flexibility in keyword extraction in a number of ways. It works with any configuration of KeyBERT and exposes a superset of KeyBERT's extract_keywords() API, which allows the extraction process to be fine-tuned to the characteristics of the chunks and the overall document. It also works with any chunking method, including semantic chunking, chunk filtering, and even sampling from the document. ChunkeyBERT can be configured to consider the multiplicity of keywords across chunks to account for repetitions, and the merge strategy can be selected explicitly depending on the behavior you want.
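As an illustration of a chunker that combines windowing with chunk filtering, the following sketch builds an overlapping word-window chunker. The factory name make_window_chunker and its parameters are hypothetical; the only thing taken from the text above is the contract that a chunker accepts a string and returns a list of strings.

```python
from typing import Callable, List


def make_window_chunker(
    window: int = 200, overlap: int = 50, min_words: int = 20
) -> Callable[[str], List[str]]:
    """Build a chunker that emits overlapping windows of whitespace-separated words."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    step = window - overlap

    def chunker(text: str) -> List[str]:
        words = text.split()
        chunks: List[str] = []
        for start in range(0, len(words), step):
            piece = words[start:start + window]
            if len(piece) >= min_words:  # chunk filtering: drop very short tails
                chunks.append(" ".join(piece))
            if start + window >= len(words):
                break  # the last window already reached the end of the text
        return chunks

    return chunker


# Tiny demonstration with a 4-word window and 2 words of overlap.
demo_chunker = make_window_chunker(window=4, overlap=2, min_words=3)
demo_chunks = demo_chunker("a b c d e f g")
```

The returned function can then be supplied wherever a chunker callable is expected. Overlap reduces the chance that a keyphrase is split across a chunk boundary, at the cost of processing some words twice.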

Batching and GPU support

ChunkeyBERT works with document batches and attempts to process these batches in parallel on the GPU if possible.

Compatible with KeyBERT return values

ChunkeyBERT returns results in a format similar to KeyBERT but can also optionally return the embeddings for each of the keywords extracted.
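KeyBERT returns a list of (keyword, score) tuples for a single document and a list of such lists for several documents. Assuming ChunkeyBERT follows the same shape when embeddings are not requested (the text above only says the format is similar), a small helper like the hypothetical normalize_results below lets downstream code always iterate per document. The sample keywords and scores are made up for illustration.

```python
from typing import List, Tuple, Union

Keyword = Tuple[str, float]


def normalize_results(
    results: Union[List[Keyword], List[List[Keyword]]]
) -> List[List[Keyword]]:
    """Wrap single-document output in a list so callers can loop over documents.

    An empty list is returned unchanged because its shape is ambiguous.
    """
    if results and isinstance(results[0], tuple):
        return [results]  # single document: one list of (keyword, score) pairs
    return results        # already one list per document


# Made-up values, shaped like KeyBERT output.
single_doc = [("keyword extraction", 0.71), ("long documents", 0.64)]
batch = [single_doc, [("embeddings", 0.50)]]

for doc_keywords in normalize_results(single_doc):
    for phrase, score in doc_keywords:
        print(f"{phrase}: {score:.2f}")
```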

Usage

The following steps describe a basic example of how to use ChunkeyBert for keyword extraction:

Install ChunkeyBert: First, ensure that ChunkeyBert is installed in your environment. You can install it using pip as shown below:

pip install chunkey-bert

Import required libraries: Import the necessary libraries including ChunkeyBert, KeyBERT, and any other dependencies you might need for your specific use case.

from keybert import KeyBERT
from sentence_transformers import SentenceTransformer
from chunkey_bert.model import ChunkeyBert

Initialize KeyBERT: This can be done, for example, with a Sentence Transformer model that generates embeddings for the text. Note that the quality of the extracted keywords depends greatly on how KeyBERT is configured, so it is important to understand how to use KeyBERT effectively.

sentence_model = SentenceTransformer(model_name_or_path="all-MiniLM-L6-v2")
keybert = KeyBERT(model=sentence_model)

Define a chunker function (optional): If you want to chunk your text into smaller parts (which is the main feature of ChunkeyBert), define a chunker function. This function takes a string and returns a list of strings (chunks). If you don't provide a chunker, ChunkeyBert processes the entire document as a single chunk but still applies a keyword selection method that differs from KeyBERT's. Here is an example of a very simple chunker:

chunker = lambda text: [t for t in text.split("\n\n") if len(t) > 25]  # Example chunker that splits text into paragraphs

Create a ChunkeyBert instance: Initialize ChunkeyBert with the KeyBERT instance you created earlier.

chunkey_bert = ChunkeyBert(keybert=keybert)

Extract keywords: Use the extract_keywords method of ChunkeyBert to extract keywords from your document. You can specify the number of keywords, whether to use the chunker, the merge strategy, and other parameters related to keyword extraction and to KeyBERT.extract_keywords.

text = "Your long document text goes here..."

keywords = chunkey_bert.extract_keywords(
    docs=text,
    num_keywords=10,
    chunker=chunker,  # Pass your chunker here. If None, the entire document is treated as a single chunk.
    merge_strategy="similarity",  # Built-in strategies: "similarity" (default) or "count".
    top_n=3,  # Number of keywords to extract from each chunk
    nr_candidates=20,  # Number of candidate keywords/keyphrases to consider from each chunk
)

print(keywords)

To rank keywords by repetition across chunks instead of semantic centrality:

keywords = chunkey_bert.extract_keywords(
    docs=text,
    num_keywords=10,
    chunker=chunker,
    merge_strategy="count",
    use_count_weights=True,  # Required by the built-in "count" strategy.
    top_n=3,
)

Custom merge logic can also be provided as a callable:

import numpy as np
from typing import Optional


def my_merge_strategy(
    embeddings_doc: np.ndarray,
    counts_doc: Optional[np.ndarray],
    top_k: Optional[int],
) -> tuple[np.ndarray, np.ndarray]:
    ...

See a more advanced example in this notebook: https://nbviewer.org/github/yaniv-shulman/chunkey-bert/tree/main/src/experiments/

Experimental results

Example notebooks are available at https://nbviewer.org/github/yaniv-shulman/chunkey-bert/tree/main/src/experiments/.

Benchmarks

The repository now includes a reproducible long-document benchmark runner:

source configure.sh
python src/experiments/run_keyword_benchmarks.py

After sourcing configure.sh, benchmark outputs are written to ${REPO_DIR}/out/benchmarks and ${OUT_DIR} is exported automatically. The runner writes:

per-profile JSON results with per-document metrics
a markdown summary report
simple SVG comparison charts

The current built-in benchmark profiles focus on long scientific papers from midas/krapivin, which is a better fit for ChunkeyBert than short web snippets because the method is explicitly designed for long documents.

Current benchmark snapshot

On the 100 longest documents from the midas/krapivin test split:

average document length: 14,937.68 tokens
exact F1@10: 0.0117 for KeyBERT vs 0.0756 for ChunkeyBert
stemmed F1@10: 0.0182 for KeyBERT vs 0.0896 for ChunkeyBert
exact hit@10 improvement: +0.39
bootstrap 95% CI for exact F1@10 delta: [0.0487, 0.0802]
bootstrap 95% CI for stemmed F1@10 delta: [0.0554, 0.0876]

This is enough to say, with reasonable confidence, that ChunkeyBert is materially stronger than plain KeyBERT on very long scientific documents under present-keyphrase matching. It is not evidence that ChunkeyBert is universally better across all keyword extraction settings or short-document benchmarks.
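For readers unfamiliar with the metrics in this snapshot, the sketch below shows one common way to compute an exact-match F1@k and a paired percentile bootstrap interval for the per-document difference between two systems. The benchmark runner's own definitions may differ (for example in how precision is normalized or how resampling is done); the function names f1_at_k and paired_bootstrap_ci and all sample values are hypothetical.

```python
from typing import Sequence, Set, Tuple

import numpy as np


def f1_at_k(predicted: Sequence[str], gold: Set[str], k: int = 10) -> float:
    """Exact-match F1 between the top-k predicted phrases and the gold keyphrases."""
    # Lowercase, strip, and de-duplicate while keeping rank order.
    top = list(dict.fromkeys(p.lower().strip() for p in predicted))[:k]
    gold_norm = {g.lower().strip() for g in gold}
    hits = sum(1 for p in top if p in gold_norm)
    if hits == 0:
        return 0.0
    precision = hits / len(top)
    recall = hits / len(gold_norm)
    return 2 * precision * recall / (precision + recall)


def paired_bootstrap_ci(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI for mean(scores_b - scores_a), resampling documents."""
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("both systems must be scored on the same documents")
    deltas = b - a
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(deltas), size=(n_resamples, len(deltas)))
    means = deltas[idx].mean(axis=1)
    low, high = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(low), float(high)


# Made-up example: 1 hit out of 3 predictions against 2 gold phrases.
example_f1 = f1_at_k(
    ["Keyword Extraction", "bert", "chunking"],
    {"keyword extraction", "embeddings"},
)
```

A stemmed variant would apply a stemmer to both predicted and gold phrases before matching. An interval for the delta that excludes zero, as in the snapshot above, indicates that the improvement is unlikely to be an artifact of which documents were sampled.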

For more detail, see the generated markdown report and SVG charts, which are written to out/benchmarks/ when the benchmark runner is executed.

Contribution and feedback

Contributions and feedback are most welcome. Please see CONTRIBUTING.md for further details.

Release history

1.0.0 (this version): Apr 12, 2026
0.2.0: Jun 7, 2024
0.1.0: Jun 7, 2024

