Lightweight Python library that uses sentence embeddings to create naturally coherent segments of text akin to paragraphs.

cohesive

cohesive is a lightweight segmenter that uses sentence embeddings to split documents into naturally coherent segments akin to paragraphs.

Installation

You can install 'cohesive' using pip:

pip install cohesive

Using cohesive

To start using cohesive, import the CohesiveTextSegmenter and create a new instance:

from cohesive import CohesiveTextSegmenter

# By default, cohesive uses paraphrase-MiniLM-L6-v2, which produces good
# results, but you can specify any SentenceTransformer model.
# For example, let's use all-MiniLM-L6-v2 ...
cohesive = CohesiveTextSegmenter("all-MiniLM-L6-v2")

# Then, all you need to do is call the create_segments method and pass in a list of sentences.
cohesive.create_segments(sentences)
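
cohesive expects a list of sentences rather than raw text, so a document has to be split before it is passed in. The following is a minimal sketch of that preparation step. It assumes a naive regular-expression splitter: split_sentences and segment_text are hypothetical helpers, not part of cohesive. The method name create_segments follows the later sections of this README, and the import sits inside the function so the splitter can be tried without cohesive installed.

```python
import re


def split_sentences(text):
    """Naively split text after '.', '!' or '?' followed by whitespace.

    Hypothetical helper, not part of cohesive. A real sentence tokenizer
    handles abbreviations and quotations far better.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def segment_text(text, model_name="all-MiniLM-L6-v2"):
    """Split raw text into sentences and pass them to cohesive."""
    # Imported lazily so split_sentences works without cohesive installed.
    from cohesive import CohesiveTextSegmenter

    segmenter = CohesiveTextSegmenter(model_name)
    segmenter.create_segments(split_sentences(text))
    return segmenter
```

For real documents, a dedicated sentence tokenizer is likely a better choice than the regular expression used here.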

Finetuning cohesive

cohesive users can finetune several parameters, which all impact the final segmentation results in different ways. Here is a quick summary:

  • Alpha:

    • Role: Used as a weight in combining global and local similarities.
    • Impact: A higher alpha places more emphasis on global similarities, making the segmentation more influenced by overall similarity between sentences. Conversely, a lower alpha gives more weight to local similarities within the context window.
    • Default: 0.5
  • Context Window:

    • Role: Determines the size of the context window used to calculate local similarities between sentences.
    • Impact: A smaller context window focuses on very close neighbors, capturing fine-grained local relationships. This may be suitable for documents where coherence is established within a small span of sentences. On the other hand, a larger context window considers a broader context, capturing longer-range dependencies and global patterns.
    • Default: 6
  • Decay:

    • Role: Used to calculate decay factors based on distances between sentence indices when combining similarities.
    • Impact: A higher decay value makes similarity fall off faster as the distance between sentences grows. This means that sentences further apart contribute less to the overall similarity, emphasizing local cohesion. A lower decay value allows longer-range dependencies to influence the segmentation.
    • Default: 0.8
  • Resolution:

    • Role: Controls the granularity of the community detection algorithm that groups sentences into segments.
    • Impact: A higher resolution value leads to more and smaller communities, potentially yielding finer-grained segmentation. Conversely, a lower resolution results in fewer and larger communities, offering a more consolidated segmentation.
    • Default: 1.0
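
To make the interaction between alpha, the context window, and decay more concrete, here is a small illustrative sketch. It is not cohesive's actual implementation: the exponential decay and the blending formula in the docstring are assumptions chosen to match the descriptions above, and combined_similarity is a hypothetical function name. It works on toy embedding vectors in pure Python.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def combined_similarity(embeddings, alpha=0.5, context_window=6, decay=0.8):
    """Return an n x n matrix of blended similarity scores.

    Assumed formula (illustrative only, not cohesive's actual code):
      global = cosine similarity between sentences i and j
      local  = global * exp(-decay * |i - j|) inside the context window, else 0
      score  = alpha * global + (1 - alpha) * local
    """
    n = len(embeddings)
    scores = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # self-similarity is left at zero
            distance = abs(i - j)
            global_sim = cosine(embeddings[i], embeddings[j])
            if distance <= context_window:
                local_sim = global_sim * math.exp(-decay * distance)
            else:
                local_sim = 0.0
            scores[i][j] = alpha * global_sim + (1 - alpha) * local_sim
    return scores
```

In this sketch, setting alpha to 1 makes the context window and decay irrelevant, while setting alpha to 0 leaves only the distance-weighted local term. That mirrors the trade-off described in the list above.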

To modify the parameters, either pass the parameter name and value as a keyword argument when you call the create_segments method, or use the dedicated finetune_params method:

# Via create_segments
cohesive.create_segments(sentences, context_window=3)

# Via finetune_params
cohesive.finetune_params(alpha=1, decay=0.2)

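Note: Any update to the parameters is stateful. Once changed, the new values persist on the segmenter instance and apply to subsequent calls until you change them again.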

At any time you can view the current parameters by calling the get_params method.

Viewing the segments

When create_segments has finished, cohesive prints a summary with the total number of segments created.

To view the segments, simply call the print_segments method.

You can also view the start and end indices of the sentences within each segment via the print_segment_boundaries method.
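
If you record the segment boundaries as (start, end) pairs of sentence indices, you can rebuild each segment's text from the original sentence list. The sketch below assumes inclusive end indices and a plain list of tuples. The exact format reported by print_segment_boundaries is not documented here, so boundaries and segments_from_boundaries are hypothetical names used only for illustration.

```python
def segments_from_boundaries(sentences, boundaries):
    """Join sentences[start..end] (end inclusive) for each (start, end) pair."""
    segments = []
    for start, end in boundaries:
        # Reject pairs that fall outside the sentence list or are reversed.
        if not 0 <= start <= end < len(sentences):
            raise ValueError(f"invalid boundary: ({start}, {end})")
        segments.append(" ".join(sentences[start:end + 1]))
    return segments
```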
