Use sentence embeddings to create naturally coherent segments of text akin to paragraphs.

cohesive

cohesive is a lightweight segmenter that uses sentence embeddings to split documents into naturally coherent segments akin to paragraphs.

Installation

You can install cohesive using pip:

pip install cohesive

Using cohesive

To start using cohesive, import Cohesive and create an instance:

from cohesive import Cohesive

# By default, cohesive uses the paraphrase-MiniLM-L6-v2 model, which produces good
# results, but you can pass the name of any model into the Cohesive constructor.
cohesive = Cohesive("msmarco-distilbert-cos-v5")

# Then call the create_segments method and pass in a list of sentences
# (one string per sentence).
cohesive.create_segments(sentences)
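
The snippet above assumes a `sentences` list already exists. As a minimal sketch of one way to build it, the helper below splits raw text on sentence-ending punctuation. The name `split_into_sentences` and the regex are illustrative assumptions and not part of cohesive; a dedicated sentence tokenizer will handle abbreviations and other edge cases better.

```python
import re


def split_into_sentences(text: str) -> list[str]:
    """Naively split text wherever '.', '!' or '?' is followed by whitespace.

    Hypothetical helper, not part of cohesive.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


text = (
    "Cats sleep a lot. Dogs like walks! "
    "Is the stock market open today? Bonds fell."
)
sentences = split_into_sentences(text)
print(sentences)
```

The resulting list can then be passed to `create_segments` as shown above.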

cohesive currently works only with models from the sentence-transformers library; additional encoders are planned.

Finetuning cohesive

You can tune several parameters, each of which affects the final segmentation in a different way. Here is a quick summary:

  • window_size: Sets the size of the context window for generating segments. Defaults to 4.
  • louvain_resolution: Used by the Louvain community detection algorithm to partition sentences into segments. Defaults to 1.
  • framework: The framework to use for calculating similarity scores. Choose between "scipy" and "sklearn". Defaults to "scipy".
  • show_progress_bar: Flag to display the progress bar from sentence-transformers whilst generating embeddings. Defaults to False.
  • balanced_window: If True, the context window is split evenly between preceding and subsequent sentences; otherwise it only looks at subsequent sentences. Defaults to False.
  • exponential_scaling: Flag to use exponential scaling when calculating similarity scores. Defaults to False.
  • max_sentences_per_segment: Maximum number of sentences per segment. Defaults to None.
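
To make the interaction between window_size and balanced_window more concrete, here is a rough sketch of which neighbouring sentences a context window could cover under each setting. The function `context_indices` is hypothetical and is not taken from cohesive's source; the actual implementation may treat edges and odd window sizes differently.

```python
def context_indices(i, n, window_size=4, balanced_window=False):
    """Return the indices of the sentences compared with sentence i.

    Hypothetical illustration only, not cohesive's implementation.
    i: index of the current sentence, n: total number of sentences.
    """
    if balanced_window:
        # Split the window evenly between preceding and subsequent sentences.
        half = window_size // 2
        start, end = max(0, i - half), min(n - 1, i + half)
    else:
        # Look only at the sentences that follow sentence i.
        start, end = i + 1, min(n - 1, i + window_size)
    return [j for j in range(start, end + 1) if j != i]


print(context_indices(5, 10))                        # subsequent sentences only
print(context_indices(5, 10, balanced_window=True))  # two before, two after
```

Under this reading, a larger window_size lets each sentence be compared with more distant neighbours, while balanced_window changes only where those neighbours sit relative to the current sentence.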

To change a parameter, pass its name and value as a keyword argument when you call the create_segments method:

# Via create_segments
cohesive.create_segments(sentences, window_size=3, exponential_scaling=True)

Viewing the segments

When create_segments has finished, cohesive will print a summary of the total number of segments that were created.

There are several methods for interacting with the generated segments.

# View a string representation of the consolidated Segment and Sentence objects
cohesive.segments

# Get a list that contains the content of each segment.
cohesive.get_segment_contents()

# View the start and end indices of sentences within a segment.
cohesive.get_segment_boundaries()

# Print the contents of each segment to the console or Notebook.
cohesive.print_segment_contents()
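
As an illustration of how boundaries relate to contents, the sketch below rebuilds segment text from a sentence list and a list of boundaries. It assumes that boundaries are (start, end) pairs of sentence indices with an inclusive end; the README does not specify the exact return format of get_segment_boundaries, so treat that shape, the sample data, and the helper name `slice_by_boundaries` as assumptions.

```python
def slice_by_boundaries(sentences, boundaries):
    """Join the sentences of each segment into a single string.

    Assumes each boundary is a (start, end) pair with an inclusive end index.
    Hypothetical helper, not part of cohesive.
    """
    return [" ".join(sentences[start:end + 1]) for start, end in boundaries]


# Made-up example data, not output produced by cohesive.
example_sentences = [
    "Cats sleep a lot.",
    "They also purr.",
    "Bonds fell today.",
    "Stocks rose.",
]
example_boundaries = [(0, 1), (2, 3)]

for segment in slice_by_boundaries(example_sentences, example_boundaries):
    print(segment)
```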
