Lightweight Python library that uses sentence embeddings to create naturally coherent segments of text akin to paragraphs.
cohesive
cohesive is a lightweight segmenter that uses sentence embeddings to split documents into naturally coherent segments akin to paragraphs.
Installation
You can install 'cohesive' using pip:
pip install cohesive
Using cohesive
To start using cohesive, import the CohesiveTextSegmenter and create a new instance:
from cohesive import CohesiveTextSegmenter
# By default, cohesive uses paraphrase-MiniLM-L6-v2, which produces good
# results, but you can specify any SentenceTransformer model.
# For example, let's use all-MiniLM-L6-v2:
cohesive = CohesiveTextSegmenter("all-MiniLM-L6-v2")
# Then call the create_segments method and pass in a list of sentences.
cohesive.create_segments(sentences)
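The segmenter takes a list of sentences rather than raw text, so a document has to be split first. The sketch below is one rough way to do that with a regular expression. The helper name split_into_sentences is hypothetical and not part of cohesive, and the pattern will mis-split abbreviations such as "e.g.", so a dedicated sentence tokenizer is likely a better choice for real documents.

```python
import re


def split_into_sentences(text: str) -> list[str]:
    """Naive splitter (hypothetical helper, not part of cohesive).

    Breaks after '.', '!' or '?' when followed by whitespace.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


document = (
    "Cats sleep for most of the day. They are most active at dawn and dusk. "
    "Python is a programming language. It is widely used for data analysis."
)

# A plain list of sentence strings, ready to pass to the segmenter.
sentences = split_into_sentences(document)
```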
Finetuning cohesive
cohesive users can finetune several parameters, which all impact the final segmentation results in different ways. Here is a quick summary:
- Alpha:
  - Role: Weight used when combining global and local similarities.
  - Impact: A higher alpha places more emphasis on global similarities, making the segmentation more influenced by overall similarity between sentences. Conversely, a lower alpha gives more weight to local similarities within the context window.
  - Default: 0.5
- Context Window:
  - Role: Determines the size of the context window used to calculate local similarities between sentences.
  - Impact: A smaller context window focuses on very close neighbors, capturing fine-grained local relationships. This may suit documents where coherence is established within a small span of sentences. A larger context window considers a broader context, capturing longer-range dependencies and global patterns.
  - Default: 6
- Decay:
  - Role: Used to calculate decay factors based on distances between sentence indices when combining similarities.
  - Impact: A higher decay makes similarity fall off faster as the distance between sentences increases, so sentences further apart contribute less to the overall similarity, emphasizing local cohesion. A lower decay allows longer-range dependencies to influence the segmentation.
  - Default: 0.8
- Resolution:
  - Role: Used in the community detection algorithm that groups sentences into segments.
  - Impact: A higher resolution value leads to more and smaller communities, potentially yielding finer-grained segmentation. Conversely, a lower resolution results in fewer and larger communities, offering a more consolidated segmentation.
  - Default: 1.0
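To build intuition for how alpha, the context window, and decay interact, here is a small pure-Python sketch. The blending formula, the exponential decay, and the function name combine_similarities are all assumptions made for illustration. This is not cohesive's actual implementation, which may combine the terms differently.

```python
import math


def combine_similarities(sim, alpha=0.5, context_window=6, decay=0.8):
    """Blend global and local similarity for every sentence pair.

    `sim` is a square matrix (list of lists) of pairwise similarities.
    Assumed formula, for illustration only:
        combined = alpha * sim + (1 - alpha) * local
    where `local` is `sim` scaled by exp(-decay * distance) inside the
    context window and 0 outside it.
    """
    n = len(sim)
    combined = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            distance = abs(i - j)
            if distance <= context_window:
                # Nearby sentences keep more of their similarity.
                local = sim[i][j] * math.exp(-decay * distance)
            else:
                # Pairs outside the window have no local contribution.
                local = 0.0
            combined[i][j] = alpha * sim[i][j] + (1 - alpha) * local
    return combined


# Toy input: four sentences that are all equally similar to each other.
sim = [[1.0] * 4 for _ in range(4)]

# alpha=1.0 uses only global similarity, so distance has no effect.
global_only = combine_similarities(sim, alpha=1.0)

# alpha=0.0 uses only local similarity within a window of one sentence.
local_only = combine_similarities(sim, alpha=0.0, context_window=1)
```

In this toy setup, global_only leaves every pair at full similarity, while local_only keeps a reduced score for adjacent sentences and zero for anything further apart. That mirrors the trade-off described in the summary.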
To modify the parameters, either pass the parameter name and value as a keyword argument when you call the create_segments method, or use the dedicated finetune_params method:
# Via create_segments
cohesive.create_segments(sentences, context_window=3)
# Via finetune_params
cohesive.finetune_params(alpha=1, decay=0.2)
Note: Any update to the parameters is stateful, so the new values persist for subsequent calls until you change them again.
At any time you can view the current parameters by calling the get_params method.
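To make the stateful behaviour concrete, the sketch below uses a small stand-in class rather than cohesive itself, so it runs without downloading a model. The class StatefulParams is hypothetical. It only borrows the finetune_params and get_params method names and the default values listed above, and the real methods may accept or return something different.

```python
class StatefulParams:
    """Hypothetical stand-in (not part of cohesive) with stateful parameters."""

    def __init__(self):
        # Defaults copied from the parameter summary in this README.
        self._params = {
            "alpha": 0.5,
            "context_window": 6,
            "decay": 0.8,
            "resolution": 1.0,
        }

    def finetune_params(self, **updates):
        """Overwrite the named parameters and leave the rest unchanged."""
        unknown = set(updates) - set(self._params)
        if unknown:
            raise ValueError(f"Unknown parameter(s): {sorted(unknown)}")
        self._params.update(updates)

    def get_params(self):
        """Return a copy of the current parameter values."""
        return dict(self._params)


demo = StatefulParams()
demo.finetune_params(alpha=1, decay=0.2)

# The update sticks: later reads see the new values, not the defaults.
current = demo.get_params()
```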
Viewing the segments
When create_segments has finished, cohesive will print a summary of the total number of segments that were created.
To view the segments, simply call the print_segments method.
You can also view the start and end indices of the sentences within each segment via the print_segment_boundaries method.
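As a rough illustration of what segment boundaries mean, the sketch below derives start and end sentence indices from a list of segments. It assumes each segment is a contiguous run of sentences in document order and that indices are zero-based and inclusive. The function name segment_boundaries is hypothetical, and the output of print_segment_boundaries may be formatted differently.

```python
def segment_boundaries(segments):
    """Return (start, end) sentence indices for each segment.

    Assumes segments are contiguous and in document order. Indices are
    zero-based and inclusive. Empty segments are skipped.
    """
    boundaries = []
    start = 0
    for segment in segments:
        if not segment:
            continue
        end = start + len(segment) - 1
        boundaries.append((start, end))
        start = end + 1
    return boundaries


# Toy segments: six sentences grouped into three segments.
example_segments = [
    ["Sentence A.", "Sentence B."],
    ["Sentence C."],
    ["Sentence D.", "Sentence E.", "Sentence F."],
]

boundaries = segment_boundaries(example_segments)
```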