Semantic segmentation and topic boundary detection


Project Goal
How Do We Determine Semantic Similarity?
Cosine Similarity Example
Sliding window mechanism
Challenge
Coding Plan
File Structure
TODO

Semantic Chunker

🚀 Project Goal

The goal of this project is to automatically find topic boundaries within a document. It identifies the points where the semantic content of the text shifts noticeably, using cosine similarity and a sliding‑window mechanism.

How do we determine whether sentences have similar meaning?

Natural Language Processing (NLP) models are trained on massive amounts of text and convert the meaning of words and sentences into mathematical representations called vectors. These vectors can be thought of as points in a multidimensional coordinate space. When we give such a model a word, it returns the word's numerical representation as a vector. We can then provide a second word, and the model generates another vector. These two vectors allow us to perform mathematical operations such as addition and subtraction.

For example, if we take the vector of the word “king”, subtract the vector of “man”, add the vector of “woman”, and then look up the word whose vector is closest to the result, we obtain “queen”.

Figure 1: A geometric illustration of word‑vector relationships showing how semantic transformations appear in vector space.


Using this approach, we can also find synonyms and other semantically related words.

We can also convert sentences into vectors and compare them to understand how similar they are in meaning. Vectors of texts with similar meaning end up close to each other, and vectors of texts with different meaning end up far apart. To measure this, we use cosine similarity.


Figure 2: Conceptual explanation of cosine similarity as the angle between vectors.

So What Does Cosine Similarity Do?

Cosine similarity measures how similar two vectors are by checking the angle between them. Think of each word as an arrow (a vector) in a many‑dimensional space:

If two arrows point in almost the same direction, their meanings are similar.
If they point in different directions, their meanings are different.

Mathematically, cosine similarity is the cosine of the angle between the two vectors: their dot product divided by the product of their lengths.

Cosine Similarity Values

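Cosine similarity always returns a value between –1 and 1:

1.0 → the vectors point in the same direction; the words mean almost the same.
0.0 → the vectors are perpendicular; the words are unrelated.
–1.0 → the vectors point in opposite directions.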
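To make the value range concrete, here is a minimal sketch that computes cosine similarity by hand with NumPy. The helper name cosine_sim and the two‑dimensional toy vectors are made up for this illustration; real embeddings have hundreds of dimensions.

```python
import numpy as np


def cosine_sim(a, b):
    """Dot product divided by the product of the vector lengths."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# toy vectors, chosen only to hit the three characteristic values
same = cosine_sim([1.0, 2.0], [2.0, 4.0])        # same direction, different length
unrelated = cosine_sim([1.0, 0.0], [0.0, 1.0])   # perpendicular
opposite = cosine_sim([1.0, 2.0], [-1.0, -2.0])  # opposite direction

print(same, unrelated, opposite)
```

Up to floating‑point rounding, the three results are 1, 0 and –1. Note that the length of a vector does not matter, only its direction.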

Let's see an example of cosine similarity between sentences:

# (1)
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# (2)
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# (3)
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]

# (4)
embeddings = model.encode(sentences)

# (5)
# cosine similarity between index 0 and index 1
first_sim = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]
# cosine similarity between index 1 and index 2
second_sim = cosine_similarity([embeddings[1]], [embeddings[2]])[0][0]
print(first_sim)
print(second_sim)

1 import libraries

2 initialize model

3 initialize sample input sentences

4 encode sentences to get embeddings

5 find cosine similarities

It will output:

0.81397283
0.15795702

So the first two sentences are semantically similar (both talk about the weather), while the third sentence is quite different (it talks about driving to a stadium).

We can visualize this as follows:

Figure 3: Basic example — strong similarity (S0–S1) vs. weak similarity (S1–S2).

So the relationship between the sentence at index 0 ("The weather is lovely today.") and the sentence at index 1 ("It's so sunny outside!") is strong, while the relationship between the sentence at index 1 ("It's so sunny outside!") and the sentence at index 2 ("He drove to the stadium.") is weak.

Such a visualization helps when we have a lot of sentences and want to quickly see where the topic changes.

Figure 4: Visualization of cosine similarity across a large number of sentences.

Sliding Window Mechanism

So far, so good. But comparing each sentence only to its neighbor is sometimes not enough to detect topic changes. Adjacent sentences may belong to the same topic and still have low cosine similarity. For example: "The cat is on the roof." "The children are going to school."

Or the opposite situation: two sentences at the boundary between topics may belong to different topics, but their cosine similarity is high. For example: “The cat is on the roof.” “The dog is on the roof.” These two sentences may be from completely different topics (for instance, one about a family’s pets and the other about guard dogs), but they will have high cosine similarity because of the shared phrase “on the roof”. This results in a misleading similarity plot:


Figure 5: Example of noisy results with many sentences.

To solve this problem, we need to include nearby sentences by merging them into a single context. For example:

“They were a wonderful big family; grandpa taught them to be kind to everyone.”
“They had several animals — cows, dogs, chickens — and the children treated them well.”
“The cat is on the roof.”
“The children are going to school.”
“The cat was watching them leave, saying goodbye with his eyes.”
“One of the children noticed the cat and waved at him.”

If we take 3 sentences to the left and 3 sentences to the right of the current sentence and compare cosine similarity between these windows, we can better understand whether a topic shift occurs. In the example above, we can see that the first three sentences are related to each other because they describe a family with animals, and thus their cosine similarity will be high.

Note: We do not expect to find the exact boundary position. Instead, we consider a prediction correct if the true boundary lies within a tolerance window of ±3 sentences around the detected boundary.
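The tolerance rule from the note can be sketched as follows. This is a minimal illustration, not the project's evaluation code: the function name match_boundaries and the sample boundary lists are made up for the example, and only the ±3 tolerance comes from the text.

```python
def match_boundaries(expected, predicted, tolerance=3):
    """Mark each expected boundary as found if a prediction lies within +/- tolerance."""
    matches = {
        true_idx: any(abs(true_idx - p) <= tolerance for p in predicted)
        for true_idx in expected
    }
    # share of expected boundaries that were found, as a percentage
    percentage = 100.0 * sum(matches.values()) / len(matches) if matches else 0.0
    return matches, percentage


# hypothetical ground truth and predictions (sentence indices)
matches, percentage = match_boundaries(expected=[10, 25, 40], predicted=[9, 31, 41])
print(matches)                 # {10: True, 25: False, 40: True}
print(round(percentage, 1))    # 66.7
```

A measure like this only counts missed boundaries. Extra predictions that match nothing are not penalized, which is worth keeping in mind when comparing methods that predict many boundaries.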

Challenge

We have a list of models, and we don’t know which window size and which min_gap value will work best for each model. This means we need to test all combinations of these parameters and evaluate their performance. Additionally, there are third-party tools, such as the semantic splitters in LlamaIndex, that can also detect topic boundaries. We want to compare our results against these baselines and see whether our method can perform better. The idea is to run our algorithm:

for each model,
for each window size,
and for each min_gap value,

and then evaluate the results using metrics such as:

the percentage of correctly detected boundaries,
and visualizations that allow us to compare different configurations side-by-side.

We use news articles from the WDR NRW archive, where each file contains five news stories. For every news story, we have ground‑truth annotations that mark the exact topic boundaries. We compare our predicted boundaries with these annotations and measure how accurately each model and parameter combination performs.

Coding plan

Next, we describe how we prepare the data, run the algorithm, evaluate the predictions, save the results, visualize them, and finally summarize our findings.

Data preparation

Test data preparation

The details of the test data preparation are not essential here. We start with the original JSON files, parse them, and then write the cleaned version back in JSON format. All processed files are stored in the data/ directory. For debugging purposes, the same data is also converted into .txt format. In these text files:

every sentence is indexed,
topic boundaries are marked with an asterisk *.

These debug-friendly files are located in computer/content/.

Algorithm input

In total, we use 13 different models. For each model, we test 5 window sizes and 5 gap values, which results in: 13 × 5 × 5 = 325 possible parameter combinations. These combinations are evaluated independently, allowing us to analyze how each model behaves under different configurations.

See main.py.
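The parameter grid can be sketched with itertools.product. The model names, window sizes and gap values below are placeholders invented for this example, not the values used in the project; only the counts (13 models, 5 window sizes, 5 gap values) come from the text.

```python
from itertools import product

# hypothetical placeholders, not the project's real configuration
model_names = [f"placeholder-model-{n}" for n in range(13)]
window_sizes = [2, 3, 4, 5, 6]
min_gaps = [1, 2, 3, 4, 5]

# every (model, window size, min_gap) triple exactly once
combinations = list(product(model_names, window_sizes, min_gaps))
print(len(combinations))  # 325

# each combination can then be evaluated independently
for model_name, window_size, min_gap in combinations[:3]:
    print(f"model_{model_name}_w_{window_size}_m_{min_gap}")
```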

Running Algorithm

Sliding Window Mechanism Implementation

In the previous example, we took three sentences and compared them with each other. Here we use more sentences and adapt the code accordingly, but the main idea remains the same. The main code of the sliding‑window mechanism is in the file slid_win.py:

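import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def segment_topics_window(blocks, window_size, min_gap, model):
    # (1) encode every sentence once
    embeddings = model.encode(blocks)

    # (2) one similarity score per candidate position, plus the position itself
    scores = []
    indices = []

    # (3) visit every position that has a full window on both sides
    for i in range(window_size, len(blocks) - window_size):
        # (4) window_size sentences before i, and window_size sentences from i on
        left = embeddings[i - window_size:i]
        right = embeddings[i:i + window_size]

        # (5) optimize_embddings is a project helper that is not shown here;
        # the call in (6) expects it to return a 2-D array of shape (1, dim)
        left_mean = optimize_embddings(left)
        right_mean = optimize_embddings(right)

        # (6)
        sim = cosine_similarity(left_mean, right_mean)[0][0]
        # (7)
        scores.append(sim)
        indices.append(i)

    # too few sentences for a single comparison; avoids a NumPy warning below
    if not scores:
        return [], scores, indices

    # (8) dynamic threshold: 1.2 standard deviations below the mean score
    threshold = np.mean(scores) - 1.2 * np.std(scores)

    boundaries = []
    last = 0

    # (9) keep low-similarity positions that are at least min_gap apart
    for idx, score in zip(indices, scores):
        if score < threshold and idx - last >= min_gap:
            boundaries.append(idx)
            last = idx

    return boundaries, scores, indices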

1 - Encode sentences

2 - Initialize arrays to store the similarity scores and the sentence indices.

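3 - Iterate over every sentence position that has a full window on both sides (from window_size to len(blocks) - window_size), moving one sentence at a time.

4 - Take the window_size embeddings to the left of the current position and the window_size embeddings starting at it.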

5 - Apply embedding optimization — this helps reduce noise and capture the overall topic of each window more robustly.

6 - Compute the cosine similarity.

7 - Store the similarity scores and the corresponding indices in the arrays.

8 - Compute a dynamic threshold based on the distribution of similarity scores. This helps identify unusually low similarity values that may indicate potential topic shifts.

9 - Detect topic boundaries where the similarity score falls below the threshold and the distance from the last detected boundary is at least min_gap. This prevents overly dense or noisy boundary detection.
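The steps above can be tried without downloading a model. The following sketch runs the same loop, threshold and min_gap logic on hand‑made two‑dimensional "embeddings". Two things are assumptions for the sake of a self‑contained example: plain mean pooling stands in for optimize_embddings, and a small NumPy helper replaces scikit‑learn's cosine_similarity.

```python
import numpy as np


def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def detect_boundaries(embeddings, window_size, min_gap):
    scores, indices = [], []
    for i in range(window_size, len(embeddings) - window_size):
        # assumption: mean pooling stands in for optimize_embddings
        left_mean = embeddings[i - window_size:i].mean(axis=0)
        right_mean = embeddings[i:i + window_size].mean(axis=0)
        scores.append(cosine(left_mean, right_mean))
        indices.append(i)

    # same dynamic threshold as in step 8
    threshold = np.mean(scores) - 1.2 * np.std(scores)

    boundaries, last = [], 0
    for idx, score in zip(indices, scores):
        if score < threshold and idx - last >= min_gap:
            boundaries.append(idx)
            last = idx
    return boundaries, scores, indices


# toy data: 10 "sentences" about topic A followed by 10 about topic B
topic_a = np.tile([1.0, 0.0], (10, 1))
topic_b = np.tile([0.0, 1.0], (10, 1))
embeddings = np.vstack([topic_a, topic_b])

boundaries, scores, indices = detect_boundaries(embeddings, window_size=3, min_gap=3)
print(boundaries)                          # [9]
print(indices[int(np.argmin(scores))])     # 10
```

On this toy input the sketch reports position 9, one sentence before the true change at position 10. Position 9 is the first one to fall below the threshold, and min_gap then suppresses positions 10 and 11, even though the lowest score is at 10. This is one reason why the evaluation accepts a tolerance of ±3 sentences.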

Main Code

The hardest part is over — from here, it’s all smooth sailing.

def compute(window_size, min_gap, model_name):
    model = SentenceTransformer(model_name)
    combination_name = f"model_{model_name}_w_{window_size}_m_{min_gap}"
    # (1)
    for i in range(0, 100):
        file_name = f"merged_filtered_{i}.json"
        # (2)
        blocks, expected_boundary, source_count, _ = extract_texts_and_write_to_file(file_name, False)

        # (3)
        boundaries, scores, indices = segment_topics_window(blocks, ...)

        # (4)
        plot_sliding_window(...)

        # (5)
        save_pair_to_csv(...)

    # (6)
    df = pd.read_csv(get_path_for_csv(combination_name), usecols=[MATCH_PERCENTAGE])
    # (7)
    save_result_tocsv(combination_name, df.mean().iloc[0])

1 After defining the model and the combination name, we loop through 100 test samples.

2 We extract the text blocks and the expected boundaries.

3 This step does not need further explanation; we described it in detail above.

4 We generate and save a visualization of the sliding‑window results. This helps us visually inspect where and why the algorithm decided that the topic changes.

5 We save the per-sample results to CSV.

6-7 After processing all samples for the current combination, we read the per-sample match percentages and save their average to a final CSV file for later analysis.
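Steps 6 and 7 amount to averaging one column of the per-sample CSV. The sketch below shows the idea with the standard library instead of pandas, so that it runs without extra dependencies. The three rows are invented toy values, and percentage2 is assumed to be the column that MATCH_PERCENTAGE refers to.

```python
import csv
import io
from statistics import mean

# toy stand-in for a per-combination CSV file (values are made up)
sample_csv = io.StringIO(
    "File,percentage2\n"
    "merged_filtered_0.json,100.0\n"
    "merged_filtered_1.json,75.0\n"
    "merged_filtered_2.json,50.0\n"
)

rows = list(csv.DictReader(sample_csv))
# average match percentage over all samples of this combination
average = mean(float(row["percentage2"]) for row in rows)
print(average)  # 75.0
```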

Visualization

For each test case we generate a visualization like this:

Figure 6: Sliding‑window similarity plot. The blue line shows similarity scores, green dashed lines show ground truth, and red points show detected boundaries.

The red points represent the detected boundaries, the blue line represents the similarity scores across the text, and the vertical green dashed lines indicate the expected boundaries (ground truth).

It is saved in the result folder, in a subfolder named after the model and parameter combination. For example, this one is saved in computer/result/model_all-MiniLM-L12-v2_w_3_m_3/merged_filtered_4/merged_filtered_4.json.png

The source code for the visualization is in

computer/plotter.py

Results Evaluation

Model Results

After each run of the algorithm — for every model and every parameter configuration — we save the results to a CSV file. The files are stored in the result/ directory and each one is named according to the model and the parameters used. For example:

model_paraphrase-multilingual-mpnet-base-v2_w_3_m_3.csv


Figure 7: Example of per‑model and per‑parameter evaluation results stored in CSV format.

The structure of this file includes the following columns:

the name of the test file - File,
the expected boundaries - boundary,
the predicted boundaries - possible_breaks,
a dictionary indicating whether each boundary was detected correctly - matches2,
and the overall match percentage - percentage2.

The code responsible for saving the results to a CSV file is located in slid_win.py inside the function save_pair_to_csv(...). To keep the documentation simple, we do not include the full implementation here, but the function itself is straightforward. And if needed, feel free to ask an AI for help — (p.s. that’s where I copied it from myself :)).

3rd party library results

We also tested third‑party libraries for semantic segmentation, specifically the LlamaIndex node parsers SemanticSplitterNodeParser and SemanticDoubleMergingSplitterNodeParser. We used the same test dataset, and the results were saved in CSV files with the same structure as our own algorithm’s output. However, these libraries did not perform well. Although they detected all real boundaries, they also generated a large number of incorrect ones, which significantly reduced their overall usefulness.

Overall Evaluation

After running all combinations of models and parameters, we compiled the results into a final CSV file that summarizes the performance of each configuration. This allows us to compare different models and parameter settings side by side and identify which ones are most effective at detecting topic boundaries in our test dataset.


Figure 9: Comparison of all model and parameter combinations, showing boundary‑detection accuracy.

With a window size of 3 and a min_gap of 3, our top performers were the models paraphrase-multilingual-mpnet-base-v2 and distiluse-base-multilingual-cased-v1.

File Structure


Figure 10: Directory layout.

Artikel_WDR_NRW/: Contains the raw test data. After extraction and text cleaning, the processed data is saved into the data/ folder.
data/: Stores the cleaned and preprocessed data generated from the raw inputs. This folder is used as the main input source for the processing pipeline.
computer/: Contains the core application logic. All main processing steps are implemented here.
content/, result/ and grafic/: Primarily used for debugging and inspection. All output data is stored in one of these folders depending on its type.
text_util/ and util/: Contain helper and utility functions, including:
- Text cleaning and normalization
- Format conversion
- Shared helper logic used across the project

TODO

Fine‑tune the model — Hugging Face provides tools to further train embedding models on custom datasets, which may significantly improve boundary‑detection accuracy for our domain.
Experiment with alternative approaches such as agglomerative clustering — instead of using a sliding window, clustering algorithms could group semantically similar sentences and identify topic boundaries between clusters.
Extend the algorithm to find the exact boundary position. The idea is as follows: we have a predicted boundary X, and we assume that the true boundary lies within a window of ±3 sentences around X. We take the contextual text to the left of (X − 3) and compare it with each sentence in that window. Then we do the same with the contextual text to the right of (X + 3). This should produce a pattern like the one below: for the left context, the similarity values are high at first and then drop sharply; for the right context, they behave in the opposite way. The exact boundary is then at the point where the similarity drops for the left context and rises for the right context.


Figure 11: Similarity between the left and right context and each sentence within the approximate boundary range.


Figure 12: If the left‑side similarities are low while the right‑side similarities are high, then the true boundary is likely located at (X − 3).

If both sides show consistently high similarity, then the prediction is likely ambiguous. In this case, a more advanced approach (for example, using an OpenAI LLM) may be required to determine the exact boundary with higher accuracy.
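The refinement idea could look roughly like the sketch below. It is only one possible reading of the plan, not existing project code: the function refine_boundary, its parameters, the crossover rule (first sentence that is more similar to the right context than to the left one) and the toy vectors are all assumptions. It also assumes that X is far enough from both ends of the text for the two context windows to be non-empty.

```python
import numpy as np


def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def refine_boundary(embeddings, x, tolerance=3, context_size=3):
    # context to the left of (x - tolerance) and to the right of (x + tolerance)
    left_start = max(0, x - tolerance - context_size)
    left_context = embeddings[left_start:x - tolerance].mean(axis=0)
    right_start = x + tolerance + 1
    right_context = embeddings[right_start:right_start + context_size].mean(axis=0)

    for j in range(x - tolerance, x + tolerance + 1):
        left_sim = cosine(embeddings[j], left_context)
        right_sim = cosine(embeddings[j], right_context)
        # first sentence that resembles the right context more than the left one
        if right_sim > left_sim:
            return j
    return None  # no crossover inside the window: the ambiguous case


# toy data: topic A for sentences 0-9, topic B for sentences 10-19
embeddings = np.vstack([np.tile([1.0, 0.0], (10, 1)), np.tile([0.0, 1.0], (10, 1))])

refined = refine_boundary(embeddings, x=9)
print(refined)  # 10
```

With real embeddings the two similarity curves are noisy, so a simple crossover rule like this one may need smoothing or a margin before it is reliable.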

Project details

Release history

This version

0.1.1

Feb 20, 2026

0.1.0

Feb 20, 2026


wdr-article-semantic-chunking-2 0.1.1
