Semantic segmentation and topic boundary detection
Table of Contents
- Project Goal
- How Do We Determine Semantic Similarity?
- Cosine Similarity Example
- Sliding window mechanism
- Challenge
- Coding Plan
- File Structure
- TODO
Semantic Chunker
🚀 Project Goal
The goal of this project is to automatically find topic boundaries within a document. It identifies the points where the semantic content of the text shifts noticeably, using cosine similarity and a sliding‑window mechanism.
How do we determine whether sentences have similar meaning?
Natural Language Processing (NLP) models are trained on massive amounts of text and convert the meaning of words and sentences into mathematical representations called vectors. These vectors can be thought of as points in a multidimensional coordinate space. When we give such a model a word, it returns the word's numerical representation as a vector. We can then provide a second word and get another vector. With these two vectors we can perform mathematical operations such as addition and subtraction.
For example, if we take the vector of the word “king”, subtract the vector of “man”, add the vector of “woman”, and then look up the word whose vector is closest to the result, we typically obtain “queen”.
Figure 1: A geometric illustration of word‑vector relationships showing how semantic transformations appear in vector space.
Using this approach, we can also find synonyms and other semantically related words.
We can also convert sentences into vectors and compare them to understand how similar they are in meaning. To do this, we use cosine similarity. Words with similar meaning end up close to each other. Words with different meaning end up far apart.
Figure 2: Conceptual explanation of cosine similarity as the angle between vectors.
So What Does Cosine Similarity Do?
Cosine similarity measures how similar two word‑vectors are by checking the angle between them. Think of each word as an arrow (a vector) in a many‑dimensional space:
- If two arrows point in almost the same direction, their meanings are similar.
- If they point in different directions, their meanings are different.
Mathematically, cosine similarity looks at the cosine of the angle between the vectors.
Cosine Similarity Values
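Cosine similarity always returns a value between –1 and 1:
- 1.0 → the vectors point in the same direction (the words mean almost the same),
- 0.0 → the vectors are orthogonal (the words are unrelated),
- –1.0 → the vectors point in opposite directions.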
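To make the value range concrete, here is a minimal sketch that computes the cosine of the angle by hand as the dot product divided by the product of the vector lengths. The function name `cosine` and the toy vectors are illustrative assumptions; they are not taken from the project code.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same = cosine([1.0, 2.0], [2.0, 4.0])        # same direction
orthogonal = cosine([1.0, 0.0], [0.0, 3.0])  # perpendicular
opposite = cosine([1.0, 2.0], [-1.0, -2.0])  # opposite direction

# Rounded to hide floating-point noise.
print(round(same, 6), round(orthogonal, 6), round(opposite, 6))
```

Note that the length of the vectors does not matter: `[1.0, 2.0]` and `[2.0, 4.0]` differ in magnitude but point the same way, so their similarity is 1.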
Let's look at an example that computes the cosine similarity between pairs of sentences:
from sentence_transformers import SentenceTransformer  # (1)
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # (2)

sentences = [  # (3)
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]

embeddings = model.encode(sentences)  # (4)

# (5) cosine similarity between index 0 and index 1, then between index 1 and index 2
first_sim = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]
second_sim = cosine_similarity([embeddings[1]], [embeddings[2]])[0][0]
print(first_sim)
print(second_sim)
1 - Import the libraries.
2 - Initialize the model.
3 - Define the sample input sentences.
4 - Encode the sentences to get their embeddings.
5 - Compute the cosine similarities.
The output looks like this (exact values can vary slightly with the model and library version):
0.81397283
0.15795702
So the first two sentences are semantically similar (both talk about the weather), while the third sentence is quite different (it talks about driving to a stadium).
We can visualize this as follows:
Figure 3: Basic example with strong similarity (S0–S1) vs. weak similarity (S1–S2).
So the relationship between sentence 0 ("The weather is lovely today.") and sentence 1 ("It's so sunny outside!") is strong, while the relationship between sentence 1 ("It's so sunny outside!") and sentence 2 ("He drove to the stadium.") is weak.
Such a visualization helps when we have a lot of sentences and want to quickly see where the topic changes.
Example:
Figure 4: Visualization of cosine similarity across a large number of sentences.
Sliding Window Mechanism
So far, so good. But comparing each sentence only to its neighbor is sometimes not enough to detect topic changes. Adjacent sentences may belong to the same topic and still have a low cosine similarity. For example: "The cat is on the roof." "The children are going to school."
Or the opposite situation: two sentences at the boundary between topics may belong to different topics, but their cosine similarity is high. For example: “The cat is on the roof.” “The dog is on the roof.” These two sentences may be from completely different topics (for instance, one about a family’s pets and the other about guard dogs), but they will have high cosine similarity because of the shared phrase “on the roof”. This results in a misleading similarity plot:
Figure 5: Example of noisy results with many sentences.
To solve this problem, we need to include nearby sentences by merging them into a single context. For example:
- “They were a wonderful big family; grandpa taught them to be kind to everyone.”
- “They had several animals — cows, dogs, chickens — and the children treated them well.”
- “The cat is on the roof.”
- “The children are going to school.”
- “The cat was watching them leave, saying goodbye with his eyes.”
- “One of the children noticed the cat and waved at him.”
If we take 3 sentences to the left and 3 sentences to the right of the current sentence and compare cosine similarity between these windows, we can better understand whether a topic shift occurs. In the example above, we can see that the first three sentences are related to each other because they describe a family with animals, and thus their cosine similarity will be high.
** Note: We do not expect to find the exact boundary position. Instead, we consider a prediction correct if the true boundary lies within a tolerance window of ±3 sentences around the detected boundary.
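The tolerance rule from the note can be sketched as follows. This is an illustrative sketch, not the project's evaluation code: the function names `match_boundaries` and `match_percentage` and all sentence indices are made up for the example.

```python
def match_boundaries(expected, predicted, tolerance=3):
    """Map each expected boundary to True if a prediction lies within +/- tolerance."""
    return {
        e: any(abs(e - p) <= tolerance for p in predicted)
        for e in expected
    }

def match_percentage(matches):
    """Share of expected boundaries that were found, in percent."""
    if not matches:
        return 0.0
    return 100.0 * sum(matches.values()) / len(matches)

expected = [12, 27, 41, 58]   # hypothetical ground-truth sentence indices
predicted = [10, 30, 50, 59]  # hypothetical detected boundaries

matches = match_boundaries(expected, predicted)
print(matches)                    # {12: True, 27: True, 41: False, 58: True}
print(match_percentage(matches))  # 75.0
```

In this sketch the boundary at 41 counts as missed because the closest prediction (50) is nine sentences away. Extra predictions are not penalized by this measure.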
Challenge
We have a list of models, and we don’t know which window size and which min_gap value will work best for each model. This means we need to test all combinations of these parameters and evaluate their performance. Additionally, there are third-party tools, such as the semantic splitters in LlamaIndex, that can also detect topic boundaries. We want to compare our results against these baselines and see whether our method can perform better. The idea is to run our algorithm:
- for each model,
- for each window size,
- and for each min_gap value,
and then evaluate the results using metrics such as:
- the percentage of correctly detected boundaries,
- and visualizations that allow us to compare different configurations side-by-side.
We use news articles from the WDR NRW archive, where each file contains five news stories. For every news story, we have ground‑truth annotations that mark the exact topic boundaries. We compare our predicted boundaries with these annotations and measure how accurately each model and parameter combination performs.
Coding plan
Next, we describe how we prepare the data, run the algorithm, evaluate the predictions, save the results, visualize them, and finally summarize our findings.
Data preparation
Test data preparation
The details of the test data preparation are not essential here. We start with the original JSON files, parse them, and then write the cleaned version back into JSON format. All processed files are stored in the data/ directory. For debugging purposes, the same data is also converted into .txt format. In these text files:
- every sentence is indexed,
- topic boundaries are marked with an asterisk *.
These debug-friendly files are located in computer/content/.
Algorithm input
In total, we use 13 different models. For each model, we test 5 window sizes and 5 gap values, which results in: 13 × 5 × 5 = 325 possible parameter combinations. These combinations are evaluated independently, allowing us to analyze how each model behaves under different configurations.
*see main.py
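The parameter grid can be enumerated with `itertools.product`. The sketch below only illustrates the counting; the model names, window sizes, and gap values are placeholders, since the text states how many of each are tested but not which ones.

```python
from itertools import product

# Placeholder values: only the counts (13, 5, 5) come from the text above.
model_names = [f"placeholder-{k}" for k in range(13)]
window_sizes = [2, 3, 4, 5, 6]
min_gaps = [1, 2, 3, 4, 5]

combinations = list(product(model_names, window_sizes, min_gaps))
print(len(combinations))  # 325

for model_name, window_size, min_gap in combinations[:3]:
    # Same naming pattern as the result files, e.g. model_<name>_w_3_m_3
    print(f"model_{model_name}_w_{window_size}_m_{min_gap}")
```

Because every combination is independent of the others, the loop body could also be distributed across several processes if a full run takes too long.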
Running Algorithm
Sliding Window Mechanism Implementation
In the previous example, we took three sentences and compared them with each other. In this example, we will use more sentences and adapt our code accordingly, but the main idea will remain the same. The main code of the sliding window mechanism is in the file
slid_win.py
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def segment_topics_window(
        blocks,
        window_size,
        min_gap,
        model
):
    embeddings = model.encode(blocks)  # (1)

    scores = []  # (2)
    indices = []

    for i in range(window_size, len(blocks) - window_size):  # (3)
        left = embeddings[i - window_size:i]  # (4)
        right = embeddings[i:i + window_size]

        # optimize_embddings is a project helper that is not shown here
        left_mean = optimize_embddings(left)  # (5)
        right_mean = optimize_embddings(right)

        sim = cosine_similarity(left_mean, right_mean)[0][0]  # (6)

        scores.append(sim)  # (7)
        indices.append(i)

    threshold = np.mean(scores) - 1.2 * np.std(scores)  # (8)

    boundaries = []
    last = 0

    for idx, score in zip(indices, scores):  # (9)
        if score < threshold and idx - last >= min_gap:
            boundaries.append(idx)
            last = idx

    return boundaries, scores, indices
1 - Encode the sentences.
2 - Initialize lists to store the similarity scores and the sentence indices.
3 - Iterate over every position i that has a full window of window_size sentences on both sides. The loop advances one sentence at a time.
4 - Take the window_size embeddings to the left of position i and the window_size embeddings starting at position i.
5 - Apply embedding optimization to each window. This helps reduce noise and capture the overall topic of each window more robustly.
6 - Compute the cosine similarity between the two windows.
7 - Store the similarity score and the corresponding index.
8 - Compute a dynamic threshold from the distribution of similarity scores: the mean minus 1.2 standard deviations. This helps identify unusually low similarity values that may indicate potential topic shifts.
9 - Detect topic boundaries where the similarity score falls below the threshold and the distance from the last detected boundary is at least min_gap. This prevents overly dense or noisy boundary detection.
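The steps above can be tried without downloading a model by feeding in hand-made two-dimensional "embeddings". The sketch below is a dependency-free illustration, not the project code. It assumes that the window optimization is a plain component-wise mean (the real optimize_embddings helper is not shown in this document), and the names `mean_pool`, `cosine`, and `detect` are made up for the example. `statistics.pstdev` is used because it matches the default behavior of `np.std`.

```python
import math
import statistics

def mean_pool(vectors):
    """Average a window of vectors component-wise (assumed stand-in for optimize_embddings)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def detect(embeddings, window_size, min_gap, k=1.2):
    scores, indices = [], []
    for i in range(window_size, len(embeddings) - window_size):
        left = mean_pool(embeddings[i - window_size:i])
        right = mean_pool(embeddings[i:i + window_size])
        scores.append(cosine(left, right))
        indices.append(i)

    # Dynamic threshold: mean minus k population standard deviations.
    threshold = statistics.fmean(scores) - k * statistics.pstdev(scores)

    boundaries, last = [], 0
    for idx, score in zip(indices, scores):
        if score < threshold and idx - last >= min_gap:
            boundaries.append(idx)
            last = idx
    return boundaries, scores, indices

# Toy data: 8 "sentences" on topic A followed by 8 on topic B.
topic_a = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0], [0.8, 0.1]] * 2
topic_b = [[0.1, 1.0], [0.2, 0.9], [0.0, 1.0], [0.1, 0.8]] * 2

boundaries, scores, indices = detect(topic_a + topic_b, window_size=3, min_gap=3)
print(boundaries)  # [8]
```

Only position 8 falls below the threshold here. The positions next to it (7 and 9) also score lower than the rest, because their windows mix both topics, but they stay above the threshold.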
Main Code
The hardest part is over — from here, it’s all smooth sailing.
def compute(
        window_size,
        min_gap,
        model_name):
    model = SentenceTransformer(model_name)
    combination_name = f"model_{model_name}_w_{window_size}_m_{min_gap}"

    for i in range(0, 100):  # (1)
        file_name = f"merged_filtered_{i}.json"
        blocks, expected_boundary, source_count, _ = extract_texts_and_write_to_file(file_name, False)  # (2)
        boundaries, scores, indices = segment_topics_window(blocks, ...)  # (3)
        plot_sliding_window(...)  # (4)
        save_pair_to_csv(...)  # (5)

    df = pd.read_csv(get_path_for_csv(combination_name), usecols=[MATCH_PERCENTAGE])  # (6)
    save_result_tocsv(combination_name, df.mean().iloc[0])  # (7)
1 - After defining the model and the combination name, we loop through 100 test samples.
2 - We extract the text blocks and the expected boundaries.
3 - We run the sliding window segmentation. This step does not need further explanation, since it is described in detail above.
4 - We generate and save a visualization of the sliding window results. This helps us inspect visually where and why the algorithm decided that the topic changes.
5 - We save the per-sample results to CSV.
6-7 - After processing all samples for the current combination, we read the per-sample match percentages back from the CSV file and save their average to a final CSV file for later analysis.
Visualization
For each test case we generate a visualization like this:
Figure 6: Sliding‑window similarity plot. The blue line shows the similarity scores, the green dashed lines show the ground truth, and the red points show the detected boundaries.
Each plot is saved in the result folder, in a subfolder named after the model and parameter combination. For example, this one is saved in computer/result/model_all-MiniLM-L12-v2_w_3_m_3/merged_filtered_4/merged_filtered_4.json.png
The source code for the visualization is in
computer/plotter.py
Results Evaluation
Model Results
After each run of the algorithm — for every model and every parameter configuration — we save the results to a CSV file. The files are stored in the result/ directory and each one is named according to the model and the parameters used. For example:
model_paraphrase-multilingual-mpnet-base-v2_w_3_m_3.csv
Figure 7: Example of per‑model and per‑parameter evaluation results stored in CSV format.
The structure of this file includes the following columns:
- File: the name of the test file,
- boundary: the expected boundaries,
- possible_breaks: the predicted boundaries,
- matches2: a dictionary indicating whether each boundary was detected correctly,
- percentage2: the overall match percentage.
The code responsible for saving the results to a CSV file is located in slid_win.py inside the function save_pair_to_csv(...). To keep the documentation simple, we do not include the full implementation here, but the function itself is straightforward. And if needed, feel free to ask an AI for help — (p.s. that’s where I copied it from myself :)).
3rd party library results
We also tested third‑party libraries for semantic segmentation, specifically the LlamaIndex node parsers SemanticSplitterNodeParser and SemanticDoubleMergingSplitterNodeParser. We used the same test dataset, and the results were saved in CSV files with the same structure as our own algorithm’s output. However, these libraries did not perform well. Although they detected all real boundaries, they also generated a large number of incorrect ones, which significantly reduced their overall usefulness.
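Finding every real boundary while also producing many wrong ones corresponds to high recall with low precision. The sketch below illustrates the difference with made-up numbers; it does not reproduce the results of the libraries above, and the function name `precision_recall` and the ±3 tolerance default are assumptions for the example.

```python
def precision_recall(expected, predicted, tolerance=3):
    """Recall: share of expected boundaries found.
    Precision: share of predictions that lie near a true boundary."""
    found = sum(
        any(abs(e - p) <= tolerance for p in predicted) for e in expected
    )
    correct = sum(
        any(abs(p - e) <= tolerance for e in expected) for p in predicted
    )
    recall = found / len(expected) if expected else 0.0
    precision = correct / len(predicted) if predicted else 0.0
    return precision, recall

expected = [20, 40, 60, 80]              # hypothetical ground truth
over_segmented = list(range(5, 100, 5))  # hypothetical splitter cutting every 5 sentences

precision, recall = precision_recall(expected, over_segmented)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.21 recall=1.00
```

A splitter like this one finds all four boundaries, yet only 4 of its 19 cuts are useful, so looking at the share of detected boundaries alone would overstate its quality.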
Overall Evaluation
After running all combinations of models and parameters, we compiled the results into a final CSV file that summarizes the performance of each configuration. This allows us to compare different models and parameter settings side by side and identify which ones are most effective at detecting topic boundaries in our test dataset.
Figure 9: Comparison of all model and parameter combinations, showing boundary‑detection accuracy.
Our top performers with a window size of 3 and a min_gap of 3 were the models paraphrase-multilingual-mpnet-base-v2 and distiluse-base-multilingual-cased-v1.
File Structure
Figure 10: Directory layout.
- Artikel_WDR_NRW/: This folder contains raw test data. After extraction and text cleaning, the processed data is saved into the data/ folder.
- data/: Stores the cleaned and preprocessed data generated from the raw inputs. This folder is used as the main input source for the processing pipeline.
- computer/: Contains the core application logic. All main processing steps are implemented here.
- content/, result/ and grafic/: These folders are primarily used for debugging and inspection purposes. All output data is classified and stored in one of these folders depending on its type.
- text_util/ and util/: Contain helper and utility functions, including:
  - Text cleaning and normalization
  - Format conversion
  - Shared helper logic used across the project
TODO
- Fine‑tune the model: Hugging Face provides tools to further train embedding models on custom datasets, which may significantly improve boundary‑detection accuracy for our domain.
- Experiment with alternative approaches such as agglomerative clustering: instead of using a sliding window, clustering algorithms could group semantically similar sentences and identify topic boundaries between clusters.
- Extend the algorithm to find the exact boundary position. We want to extend the existing code so that it can identify the boundary more precisely. To do this, we use the following approach: we have a predicted boundary X, and we know that the true boundary lies within a window of ±3 sentences around X. This means we can take the contextual text to the left of (X − 3) and compare it with each sentence in that window. Then we do the same with the contextual text to the right of (X + 3). This should produce a pattern similar to the one below: for the left context, the similarity values are high at first and then drop sharply, and for the right context they behave in the opposite way. So the exact boundary is at the point where the similarity drops (for the left context) and rises (for the right context).
Figure 11: Similarity between the left and right context and each sentence within the approximate boundary range.
Figure 12: If the left‑side similarities are low while the right‑side similarities are high, then the true boundary is likely located at (X − 3).
- If both sides show consistently high similarity, then the prediction is likely ambiguous. In this case, a more advanced approach (for example, using an OpenAI LLM) may be required to determine the exact boundary with higher accuracy.
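One possible way to turn the two similarity curves into an exact position is sketched below. This is only an idea for the TODO item, not existing project code: the function `refine_boundary`, the scoring rule (pick the split that best separates "matches the left context" from "matches the right context"), the `ambiguous_above` cutoff of 0.6, and all similarity values are assumptions.

```python
def refine_boundary(x, left_sim, right_sim, radius=3, ambiguous_above=0.6):
    """Pick the most plausible exact boundary inside [x - radius, x + radius].

    left_sim[j] / right_sim[j] is the similarity of sentence (x - radius + j)
    to the left / right context. Returns the index of the first sentence of
    the new topic, or None when both contexts match every sentence.
    """
    if min(left_sim) > ambiguous_above and min(right_sim) > ambiguous_above:
        return None  # ambiguous: both sides are consistently similar

    best_k, best_score = 0, float("-inf")
    for k in range(len(left_sim) + 1):
        # Sentences before the split should match the left context,
        # sentences from the split onward should match the right context.
        score = sum(left_sim[:k]) + sum(right_sim[k:])
        if score > best_score:
            best_k, best_score = k, score
    return x - radius + best_k

# Hypothetical similarities for sentences 27..33 around a predicted boundary X = 30.
left_sim = [0.82, 0.79, 0.85, 0.80, 0.81, 0.22, 0.18]
right_sim = [0.20, 0.25, 0.19, 0.23, 0.21, 0.77, 0.83]

print(refine_boundary(30, left_sim, right_sim))  # 32
```

With these values the left-context similarity drops and the right-context similarity rises at sentence 32, so the sketch reports 32 as the first sentence of the new topic. If every left-side value is low and every right-side value is high, it returns X − 3, which matches the case described for Figure 12.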
File details
Details for the file wdr_article_semantic_chunking_2-0.1.1.tar.gz.
File metadata
- Download URL: wdr_article_semantic_chunking_2-0.1.1.tar.gz
- Upload date:
- Size: 11.4 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.10.4 (Ubuntu 24.04)
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a44833a5504e69d013fbc4a03b6aaf271877698ec7853f5b06a93dda8ee520d6 |
| MD5 | 07460a2f69241cbfecfc5a4d56a7464a |
| BLAKE2b-256 | b6b7591adbf6bb3fba9b811050bdb42500865d8273f0da00ce817d6b2c9c8e8c |
File details
Details for the file wdr_article_semantic_chunking_2-0.1.1-py3-none-any.whl.
File metadata
- Download URL: wdr_article_semantic_chunking_2-0.1.1-py3-none-any.whl
- Upload date:
- Size: 11.6 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.10.4 (Ubuntu 24.04)
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 63036de4ab77ea66dac50d1d90691affbae94188c938cc49b62bdd8bbf0c7f03 |
| MD5 | 7bc3ed64cac4b005417af5c4982d10df |
| BLAKE2b-256 | 81a397f6399a275ddc90e64f1f83371ba5978861885de133f08de52936c307d2 |