cohesive

cohesive is a lightweight segmenter that uses sentence embeddings to split documents into naturally coherent segments akin to paragraphs.

Installation

You can install cohesive using pip:

pip install cohesive

Using cohesive

To start using cohesive, import Cohesive and the relevant text embedding class. Choose from OpenAI, SentenceTransformers, TensorFlow, or Transformers:

from cohesive import Cohesive

# By default, cohesive uses the paraphrase-MiniLM-L6-v2 model, which produces good results.
# To use a different model, pass its name to the Cohesive constructor.
cohesive = Cohesive("msmarco-distilbert-cos-v5")

# Then call the create_segments method and pass in a list of sentences.
cohesive.create_segments(sentences)
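
The sentences variable above is not defined in the snippet; create_segments expects the document to be split into sentences already. As a minimal sketch, assuming plain English prose, a naive regex splitter can produce that list. The helper split_into_sentences is hypothetical and not part of cohesive, and a dedicated sentence tokenizer will handle abbreviations and quotations better.

```python
import re


def split_into_sentences(text: str) -> list[str]:
    """Naively split text after '.', '!' or '?' followed by whitespace.

    Hypothetical helper for illustration; it is not part of cohesive.
    """
    # Collapse runs of whitespace (including newlines) into single spaces.
    normalized = re.sub(r"\s+", " ", text).strip()
    if not normalized:
        return []
    # Split on a space that directly follows sentence-ending punctuation.
    return re.split(r"(?<=[.!?]) ", normalized)


sentences = split_into_sentences(
    "Cats are mammals. They sleep a lot!  Do they dream?\nNobody knows."
)
print(sentences)
```

The resulting list can then be passed to create_segments as shown above.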

Finetuning cohesive

You can fine-tune several parameters, each of which affects the final segmentation in a different way. Here is a quick summary:

  • window_size: Sets the size of the context window for generating segments. Defaults to 4.
  • louvain_resolution: Used by the Louvain community detection algorithm to partition sentences into segments. Defaults to 1.
  • framework: The framework to use for calculating similarity scores. Choose between scipy and sklearn. Defaults to "scipy".
  • show_progress_bar: Flag to display the progress bar from sentence-transformers while generating embeddings. Defaults to False.
  • balanced_window: If True, the context window is split evenly between preceding and subsequent sentences; otherwise, it only looks at subsequent sentences. Defaults to False.
  • exponential_scaling: Flag to use exponential scaling when calculating similarity scores. Defaults to False.
  • max_sentences_per_segment: Maximum number of sentences per segment. Defaults to None.
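
To make window_size and balanced_window more concrete, the sketch below lists which neighbouring sentence indices would fall inside the context window of a given sentence. It is an illustration of the descriptions above, not cohesive's actual implementation: the function context_window is hypothetical, and the handling of document edges and of odd window sizes is an assumption.

```python
def context_window(
    index: int,
    n_sentences: int,
    window_size: int = 4,
    balanced_window: bool = False,
) -> list[int]:
    """Return the indices of the sentences compared against sentence `index`.

    Hypothetical illustration; edge handling is an assumption.
    """
    if balanced_window:
        # Split the window evenly between preceding and subsequent sentences.
        half = window_size // 2
        start, stop = index - half, index + half
    else:
        # Look only at the sentences that follow.
        start, stop = index + 1, index + window_size
    # Clamp to the document bounds and skip the sentence itself.
    first = max(start, 0)
    last = min(stop, n_sentences - 1)
    return [j for j in range(first, last + 1) if j != index]


# Sentence 2 of a 10-sentence document, default window of 4.
print(context_window(2, 10))                        # [3, 4, 5, 6]
print(context_window(2, 10, balanced_window=True))  # [0, 1, 3, 4]
```

With a balanced window, a sentence is compared with its neighbours on both sides, which may help when a topic shift is signalled by what came before as much as by what follows.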

To change a parameter, pass its name and value as a keyword argument when you call the create_segments method:

# Override parameters for this call to create_segments
cohesive.create_segments(sentences, window_size=3, exponential_scaling=True)

Viewing the segments

When create_segments has finished, cohesive prints a summary of the total number of segments created.

There are several methods for interacting with the generated segments.

# View a string representation of the consolidated Segment and Sentence objects
cohesive.segments

# Get a list that contains the content of each segment.
cohesive.get_segment_contents()

# View the start and end indices of sentences within a segment.
cohesive.get_segment_boundaries()

# Print the contents of each segment to the console or Notebook.
cohesive.print_segment_contents()
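
As a rough illustration of what "start and end indices" means, the sketch below derives boundaries from segments represented as lists of sentences. The helper boundaries_from_segments is hypothetical, and both the input shape and the use of inclusive end indices are assumptions; the values actually returned by get_segment_boundaries may be shaped differently.

```python
def boundaries_from_segments(segments: list[list[str]]) -> list[tuple[int, int]]:
    """Return (start, end) sentence indices for each non-empty segment.

    Hypothetical helper; end indices are inclusive by assumption.
    """
    boundaries = []
    start = 0
    for segment in segments:
        # Index of the last sentence in this segment (inclusive).
        end = start + len(segment) - 1
        boundaries.append((start, end))
        # The next segment begins right after this one.
        start = end + 1
    return boundaries


segments = [["A.", "B."], ["C."], ["D.", "E.", "F."]]
print(boundaries_from_segments(segments))  # [(0, 1), (2, 2), (3, 5)]
```

Boundaries like these make it possible to map each segment back to positions in the original list of sentences.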
