Lightweight Python library that uses sentence embeddings to create naturally coherent segments of text akin to paragraphs.

cohesive

cohesive is a lightweight segmenter that uses sentence embeddings to split documents into naturally coherent segments akin to paragraphs.

Installation

You can install cohesive using pip:

pip install cohesive

Using cohesive

To start using cohesive, import the CohesiveTextSegmenter and create a new instance:

from cohesive import CohesiveTextSegmenter

# By default, cohesive uses paraphrase-MiniLM-L6-v2, which produces good
# results, but you can specify any SentenceTransformer model.
# For example, let's use all-MiniLM-L6-v2:
cohesive = CohesiveTextSegmenter("all-MiniLM-L6-v2")

# Then call the create_segments method and pass in a list of sentences
# (sentences is a list of strings, one sentence per item).
cohesive.create_segments(sentences)
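The segmenter expects its input already split into sentences. As a minimal sketch of that preparation step, the helper below breaks raw text after sentence-ending punctuation. The split_sentences name and the regular expression are illustrative assumptions, not part of cohesive; a dedicated sentence tokenizer will handle abbreviations and quotations far better.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter (hypothetical helper): break after '.', '!' or '?'
    when the punctuation is followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    # Drop the empty string that re.split returns for empty input.
    return [part for part in parts if part]


document = (
    "Cats are popular pets. They sleep for most of the day. "
    "Python is a programming language! Is it widely used for data science?"
)
sentences = split_sentences(document)
```

The resulting list of strings, one sentence per item, is the kind of input you pass to the segmenter.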

Finetuning cohesive

You can finetune four parameters, each of which affects the final segmentation in a different way. Here is a quick summary:

  • Alpha:

    • Role: Used as a weight in combining global and local similarities.
    • Impact: A higher alpha places more emphasis on global similarities, making the segmentation more influenced by overall similarity between sentences. Conversely, a lower alpha gives more weight to local similarities within the context window.
    • Default: 0.5
  • Context Window:

    • Role: Determines the size of the context window used to calculate local similarities between sentences.
    • Impact: A smaller context window focuses on very close neighbors, capturing fine-grained local relationships. This may be suitable for documents where coherence is established within a small span of sentences. On the other hand, a larger context window considers a broader context, capturing longer-range dependencies and global patterns.
    • Default: 6
  • Decay:

    • Role: Used to calculate decay factors based on distances between sentence indices when combining similarities.
    • Impact: A higher decay value makes similarity fall off faster as the distance between sentences increases. Sentences that are further apart then contribute less to the overall similarity, which emphasizes local cohesion. A lower decay value allows longer-range dependencies to influence the segmentation.
    • Default: 0.8
  • Resolution:

    • Role: Controls the granularity of the community detection algorithm that groups sentences into segments.
    • Impact: A higher resolution value leads to more and smaller communities, potentially yielding finer-grained segmentation. Conversely, a lower resolution results in fewer and larger communities, offering a more consolidated segmentation.
    • Default: 1.0
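To make the interplay of alpha, context window, and decay more concrete, here is one possible way such a combined similarity matrix could be computed. This is an illustrative sketch only and is not taken from cohesive's source: the function names, the exponential decay formula, and the choice to zero out local similarity outside the window are all assumptions. Resolution is not covered, since it only matters in the later community detection step.

```python
import math


def cosine(u, v):
    """Cosine similarity between two plain-Python vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def combined_similarity(embeddings, alpha=0.5, context_window=6, decay=0.8):
    """Hypothetical blend of global and local similarity with distance decay."""
    n = len(embeddings)
    combined = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            distance = abs(i - j)
            # Global similarity: compare every pair of sentences.
            global_sim = cosine(embeddings[i], embeddings[j])
            # Local similarity (assumption): only pairs inside the window count.
            local_sim = global_sim if distance <= context_window else 0.0
            # Decay factor (assumption): exponential fall-off with index distance.
            weight = math.exp(-decay * distance)
            combined[i][j] = (alpha * global_sim + (1 - alpha) * local_sim) * weight
    return combined


# Toy 2-D "embeddings": the first two and the last two sentences are similar.
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
matrix = combined_similarity(toy, context_window=1)
```

In this sketch, raising alpha lets pairs outside the context window keep some weight, while raising decay shrinks every off-diagonal entry, which matches the qualitative behavior described in the summary.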

To modify the parameters, either pass the parameter name and value as a keyword argument when you call the create_segments method, or use the dedicated finetune_params method:

# Via create_segments
cohesive.create_segments(sentences, context_window=3)

# Via finetune_params
cohesive.finetune_params(alpha=1, decay=0.2)

Note: Any update to the parameters is stateful: a value set through either route stays in effect for later calls until you change it again.

At any time you can view the current parameters by calling the get_params method.
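The stateful behavior can be illustrated with a toy stand-in. ToySegmenter below is a hypothetical class written for this example, not cohesive's implementation: it borrows the method names and default values from this page, and the dictionary returned by get_params is an assumption about the return type.

```python
class ToySegmenter:
    """Hypothetical stand-in that only mimics stateful parameter handling."""

    def __init__(self):
        # Defaults copied from the parameter summary on this page.
        self.params = {
            "alpha": 0.5,
            "context_window": 6,
            "decay": 0.8,
            "resolution": 1.0,
        }

    def finetune_params(self, **overrides):
        unknown = set(overrides) - set(self.params)
        if unknown:
            raise ValueError(f"unknown parameters: {sorted(unknown)}")
        # Updates are stored on the instance, so they persist.
        self.params.update(overrides)

    def create_segments(self, sentences, **overrides):
        # Per-call overrides are saved too, which is what "stateful" means here.
        self.finetune_params(**overrides)
        return [list(sentences)]  # placeholder: no real segmentation happens

    def get_params(self):
        return dict(self.params)


toy_segmenter = ToySegmenter()
toy_segmenter.create_segments(["One.", "Two."], context_window=3)
toy_segmenter.finetune_params(alpha=1, decay=0.2)
params = toy_segmenter.get_params()
```

After both calls, params holds the overridden context_window, alpha, and decay alongside the untouched default resolution, so an override made in one call carries over to the next.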

Viewing the segments

When create_segments has finished, cohesive will print a summary of the total number of segments that were created.

To view the segments, simply call the print_segments method.

You can also view the start and end indices of the sentences within each segment via the print_segment_boundaries method.

References

cohesive is inspired by an article written by Massimiliano Costacurta, published in Towards Data Science in June 2023: Text Tiling Done Right: Building Solid Foundations for your Personal LLM. The source code for this article can be accessed here.
