
Python package for extractive text summarization using various embeddings and methods.


MultiExtractiveSummarizer

MultiExtractiveSummarizer is a Python package for extractive text summarization. It combines sentence embedding techniques with sentence ranking algorithms to produce high-quality summaries of text documents. The package currently supports SBERT and TF-IDF embeddings, with sentence ranking via LexRank and K-means clustering. Future updates will add embedding methods such as Word2Vec, GloVe, and BERT, as well as further ranking algorithms such as TextRank and KLA.

Installation

You can install MultiExtractiveSummarizer from PyPI using pip:

pip install MultiExtractiveSummarizer

Description

Extractive Summarization

Extractive summarization involves selecting sentences from a document to create a summary that retains the most important information. Unlike abstractive summarization, which generates new sentences, extractive summarization works by identifying and extracting existing sentences.
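As a toy illustration of this idea (not the package's implementation), extractive summarization reduces to three steps: score each sentence, keep the top-k, and restore document order. The length-based scorer here is a deliberately crude stand-in for a real ranking method such as LexRank:

```python
# Toy extractive summarizer: score sentences, pick the top-k,
# and return them in their original document order.
def extract_summary(sentences, k=2, score=len):
    # Rank sentence indices by score, highest first
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    # Keep the top-k indices, restored to document order for coherence
    chosen = sorted(ranked[:k])
    return [sentences[i] for i in chosen]

sentences = [
    "Cats are popular pets.",
    "They were domesticated thousands of years ago in the Near East.",
    "Many households keep at least one cat.",
    "Short one.",
]
print(extract_summary(sentences, k=2))  # keeps the two longest, in document order
```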

Embedding Methods

  1. SBERT (Sentence-BERT): SBERT is a modification of BERT that uses Siamese and triplet networks to derive semantically meaningful sentence embeddings.
  2. TF-IDF (Term Frequency-Inverse Document Frequency): TF-IDF is a numerical statistic that reflects the importance of a word in a document relative to a corpus.
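To make the TF-IDF idea concrete, here is a minimal standard-library sketch of turning sentences into TF-IDF vectors. Real pipelines typically use scikit-learn's TfidfVectorizer, and this package's internals may differ:

```python
# Minimal TF-IDF sentence embeddings: each sentence becomes a vector
# over the vocabulary, weighting terms by frequency in the sentence
# (tf) and rarity across sentences (idf).
import math
from collections import Counter

def tfidf_embed(sentences):
    docs = [s.lower().split() for s in sentences]
    vocab = sorted({w for d in docs for w in d})
    n = len(docs)
    # idf: rarer terms get higher weight
    df = {w: sum(w in d for d in docs) for w in vocab}
    idf = {w: math.log(n / df[w]) + 1.0 for w in vocab}
    vectors = []
    for d in docs:
        tf = Counter(d)
        vectors.append([tf[w] / len(d) * idf[w] for w in vocab])
    return vectors, vocab

vecs, vocab = tfidf_embed(["the cat sat", "the dog ran", "the cat ran"])
print(len(vecs), len(vocab))  # → 3 5 (three sentences, five vocabulary terms)
```

Note how "sat" (appearing in one sentence) receives a higher weight than "the" (appearing in all three) — exactly the behavior TF-IDF is designed for.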

Sentence Ranking Methods

  1. LexRank: LexRank is a graph-based algorithm for computing sentence importance based on eigenvector centrality in a similarity graph.
  2. K-means Clustering: K-means is a clustering algorithm that partitions sentences into k clusters, and representative sentences from each cluster are selected for the summary.
  3. K-means Clustering-2: A variant that applies K-means clustering to identify key sentences in a document. The process involves the following steps:
    • Cluster Determination: Based on the number of sentences in the document, the number of clusters is set to 2 if there are fewer than 10 sentences, and 3 if there are 10 or more sentences.
    • Clustering: The KMeans algorithm is applied to the sentence embeddings, grouping the sentences into clusters based on their similarity.
    • Distance Calculation: For each sentence, the Euclidean distance to its respective cluster centroid is calculated.
    • Sentence Selection: The sentences are then ranked based on their proximity to the cluster centroids. The top num_sentences closest sentences are selected for the summary.
    • Original Order: The selected sentences are sorted to maintain their original order in the document, ensuring a coherent summary.
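The steps above can be sketched roughly as follows. This is a hand-rolled illustration on toy 2-D "embeddings", not the package's code; a real implementation would cluster SBERT or TF-IDF vectors with sklearn.cluster.KMeans:

```python
# Sketch of the K-means Clustering-2 selection steps: cluster sentence
# embeddings, rank sentences by distance to their nearest centroid,
# and return the closest ones in original document order.
import math

def kmeans(points, k, iters=20):
    # Deterministic farthest-point initialization (a simple stand-in
    # for k-means++), followed by standard Lloyd iterations.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(math.dist(p, c)
                                                       for c in centroids)))
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k),
                         key=lambda i: math.dist(p, centroids[i]))].append(p)
        centroids = [[sum(col) / len(cl) for col in zip(*cl)] if cl else centroids[i]
                     for i, cl in enumerate(clusters)]
    return centroids

def select_sentences(embeddings, num_sentences):
    # Cluster-count rule from the text: 2 clusters for <10 sentences, else 3
    k = 2 if len(embeddings) < 10 else 3
    centroids = kmeans(embeddings, k)
    # Distance of each sentence to its nearest centroid
    def dist_to_centroid(i):
        return min(math.dist(embeddings[i], c) for c in centroids)
    ranked = sorted(range(len(embeddings)), key=dist_to_centroid)
    # Top-n closest sentences, restored to original document order
    return sorted(ranked[:num_sentences])

embeddings = [[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.0, 5.4], [0.05, 0.0]]
print(select_sentences(embeddings, num_sentences=2))  # → [0, 4]
```

Sentences 0 and 4 sit closest to their cluster centroid, so they are selected and returned in document order.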

Features

  • Flexible Embedding Methods: Choose between SBERT and TF-IDF for embedding sentences.
  • Multiple Sentence Ranking Algorithms: Use LexRank or K-means clustering to rank sentences and create summaries.
  • Modular and Extensible: Designed to easily incorporate new embedding methods and ranking algorithms.

Usage

Basic Usage

Here's an example of how to use the MultiExtractiveSummarizer package to create a summary of a text document.

from MultiExtractiveSummarizer import MultiExtractiveSummarizer

# Initialize the summarizer
summarizer = MultiExtractiveSummarizer(embedding_method='sbert', summarization_method='lexrank')

# Example text document
text = """
Your text document goes here...
"""

# Generate a summary with a fixed number of sentences
summary = summarizer.summarize(text, num_sentences=5)

print("Summary:")
print(summary)

# Generate a summary covering a given ratio of the text
summary = summarizer.summarize(text, ratio=0.5)

print("Summary:")
print(summary)

Advanced Usage

For advanced usage, you can specify different parameters for embedding methods and sentence ranking algorithms.

from MultiExtractiveSummarizer import MultiExtractiveSummarizer

# Initialize the summarizer with TF-IDF and K-means
summarizer = MultiExtractiveSummarizer(embedding_method='tfidf', summarization_method='kmeans')

# Example text document
text = """
Your long text document goes here...
"""

# Generate the summary
summary = summarizer.summarize(text, num_sentences=5)

print("Summary:")
print(summary)

Future Work

I plan to expand the capabilities of the MultiExtractiveSummarizer package by including:

  • Additional embedding methods: Word2Vec, GloVe, and BERT embeddings.
  • New sentence ranking algorithms: TextRank, KLA, and others.

Stay tuned for updates and new features!

Contributing

We welcome contributions from the community. If you have suggestions or would like to contribute, please fork the repository and create a pull request.

License

This project is licensed under the MIT License. See the LICENSE file for details.
