Use sentence embeddings to create naturally coherent segments of text akin to paragraphs.
cohesive
cohesive is a lightweight segmenter that uses sentence embeddings to split documents into naturally coherent segments akin to paragraphs.
Installation
You can install cohesive using pip:
pip install cohesive
Using cohesive
To start using cohesive, simply import Cohesive and create a new instance of the client:
from cohesive import Cohesive
# By default, cohesive uses the paraphrase-MiniLM-L6-v2 model, which produces good
# results, but you can pass the name of any sentence-transformers model to the
# Cohesive constructor.
cohesive = Cohesive("msmarco-distilbert-cos-v5")
# Then call the create_segments method with a list of sentences.
# The sentences below are placeholder examples.
sentences = ["The sky is blue.", "Clouds drift slowly overhead.", "Pasta cooks in ten minutes."]
cohesive.create_segments(sentences)
Currently, cohesive is only compatible with the sentence-transformers library, but support for additional encoders is planned.
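Since create_segments expects a list of sentences, raw text has to be split first. The following is a minimal sketch of one way to do that using only the standard library; split_into_sentences is a hypothetical helper, not part of cohesive, and the regex is deliberately naive (it will mis-split abbreviations such as "e.g."). A dedicated sentence tokenizer is likely a better choice for real documents.

```python
import re


def split_into_sentences(text: str) -> list[str]:
    """Naively split text after '.', '!' or '?' followed by whitespace.

    Hypothetical helper for illustration; not part of cohesive.
    """
    # Collapse newlines and repeated spaces into single spaces.
    normalized = " ".join(text.split())
    # Split at whitespace that directly follows sentence-ending punctuation.
    parts = re.split(r"(?<=[.!?])\s+", normalized)
    # Drop empty strings (for example, when the input is empty).
    return [part for part in parts if part]


text = "Cats sleep a lot. They also hunt at night! Do dogs do the same?"
sentences = split_into_sentences(text)
print(sentences)
```

The resulting list can then be passed to create_segments as shown above.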
Finetuning cohesive
cohesive exposes several parameters, each of which affects the final segmentation in a different way. Here is a quick summary:
- window_size: Size of the context window used when generating segments. Defaults to 4.
- louvain_resolution: Resolution parameter used by the Louvain community detection algorithm when partitioning sentences into segments. Defaults to 1.
- framework: Framework used to calculate similarity scores, either "scipy" or "sklearn". Defaults to "scipy".
- show_progress_bar: Whether to display the sentence-transformers progress bar while generating embeddings. Defaults to False.
- balanced_window: If True, the context window is split evenly between preceding and subsequent sentences; otherwise it only covers subsequent sentences. Defaults to False.
- exponential_scaling: Whether to apply exponential scaling when calculating similarity scores. Defaults to False.
- max_sentences_per_segment: Maximum number of sentences per segment. Defaults to None.
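To make window_size and balanced_window more concrete, here is a rough sketch of how a context window could select the neighbours each sentence is compared against. This is not cohesive's actual implementation: context_indices and cosine_similarity are hypothetical helpers, the toy embeddings are made up, and the handling of odd window sizes and of the sentence itself are assumptions.

```python
import math


def context_indices(i, n, window_size=4, balanced_window=False):
    """Return the indices of sentences in the context window of sentence i.

    Assumptions (not confirmed by cohesive): the sentence itself is excluded,
    and an odd balanced window gives the extra slot to subsequent sentences.
    """
    if balanced_window:
        before = window_size // 2
        after = window_size - before
    else:
        before, after = 0, window_size
    start = max(0, i - before)
    end = min(n - 1, i + after)
    return [j for j in range(start, end + 1) if j != i]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy 2-dimensional "embeddings"; real sentence embeddings have hundreds of dimensions.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 0.1]]

for i, vector in enumerate(embeddings):
    neighbours = context_indices(i, len(embeddings), window_size=2, balanced_window=True)
    scores = {j: round(cosine_similarity(vector, embeddings[j]), 2) for j in neighbours}
    print(i, scores)
```

With balanced_window=False, the same helper would only look ahead, so the last sentence in a document would have no neighbours in its window.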
To change a parameter, pass its name and value as a keyword argument when you call the create_segments method:
# Via create_segments
cohesive.create_segments(sentences, window_size=3, exponential_scaling=True)
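The effect of max_sentences_per_segment can be pictured as a cap applied to each segment. The sketch below shows one possible way such a cap could work, by cutting oversized segments into consecutive chunks. How cohesive enforces the limit internally is not described here, so cap_segment_length is a hypothetical helper and the chunking strategy is an assumption.

```python
def cap_segment_length(segments, max_sentences_per_segment=None):
    """Split any segment longer than the cap into consecutive chunks.

    Hypothetical illustration; cohesive may enforce the limit differently.
    """
    if max_sentences_per_segment is None:
        # No cap: return copies of the segments unchanged.
        return [list(segment) for segment in segments]
    if max_sentences_per_segment < 1:
        raise ValueError("max_sentences_per_segment must be at least 1")
    capped = []
    for segment in segments:
        # Step through the segment in chunks of at most the cap size.
        for start in range(0, len(segment), max_sentences_per_segment):
            capped.append(list(segment[start:start + max_sentences_per_segment]))
    return capped


segments = [["s1", "s2", "s3", "s4", "s5"], ["s6"]]
print(cap_segment_length(segments, max_sentences_per_segment=2))
```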
Viewing the segments
When create_segments finishes, cohesive prints a summary of the total number of segments created.
There are several ways to interact with the generated segments:
# View a string representation of the consolidated Segment and Sentence objects.
cohesive.segments
# Get a list containing the content of each segment.
cohesive.get_segment_contents()
# Get the start and end indices of the sentences within each segment.
cohesive.get_segment_boundaries()
# Print the contents of each segment to the console or notebook.
cohesive.print_segment_contents()
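Start and end indices like those returned by get_segment_boundaries can be used to slice the original sentence list. The sketch below assumes the boundaries are inclusive (start, end) pairs of sentence indices; the exact return format is not specified here, so both that shape and the slice_by_boundaries helper are hypothetical.

```python
def slice_by_boundaries(sentences, boundaries):
    """Rebuild each segment's text from inclusive (start, end) sentence indices.

    Assumes inclusive index pairs; adjust if the real boundaries differ.
    """
    return [" ".join(sentences[start:end + 1]) for start, end in boundaries]


sentences = [
    "The sky is blue.",
    "Clouds drift slowly overhead.",
    "Pasta cooks in ten minutes.",
    "Salt the water first.",
]
# Hypothetical boundaries: two segments of two sentences each.
boundaries = [(0, 1), (2, 3)]

for segment_text in slice_by_boundaries(sentences, boundaries):
    print(segment_text)
```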
File details
Details for the file cohesive-0.1.6.tar.gz.
File metadata
- Download URL: cohesive-0.1.6.tar.gz
- Upload date:
- Size: 10.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.7.1 CPython/3.11.5 Windows/10
File hashes
Algorithm | Hash digest
---|---
SHA256 | 2a0669695782367f24f69c52a9a1f4b05ab2ced9140068ea857cfc34a61b9c71
MD5 | 43ca6c202f46bf5799b20c7b3a03e510
BLAKE2b-256 | fdbdb2f7536152c6392a6e7f533d14bf4ed6109ba62007316d11560d16494e8f
File details
Details for the file cohesive-0.1.6-py3-none-any.whl.
File metadata
- Download URL: cohesive-0.1.6-py3-none-any.whl
- Upload date:
- Size: 11.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.7.1 CPython/3.11.5 Windows/10
File hashes
Algorithm | Hash digest
---|---
SHA256 | 223028072353733fd5efc9970a212b5a2425546f54eb2580ec25423b44e34b5b
MD5 | d8d10a837c07dfbb6d3da21355c13fd7
BLAKE2b-256 | 3427dc63e7dc6b3f6677a69be1583c1cf6a61e13301ba20ab430161dfeb04e31