
Python package for extractive text summarization using various embeddings and methods.


MultiExtractiveSummarizer

MultiExtractiveSummarizer is a Python package for extractive text summarization. It combines sentence embedding techniques with sentence ranking algorithms to produce high-quality summaries of text documents. The package currently supports SBERT and TF-IDF embeddings, with sentence ranking via LexRank and K-means clustering. Future updates will add embedding methods such as Word2Vec, GloVe, and BERT, as well as further ranking algorithms such as TextRank and KLA.

Installation

You can install MultiExtractiveSummarizer from PyPI using pip:

pip install MultiExtractiveSummarizer

Description

Extractive Summarization

Extractive summarization involves selecting sentences from a document to create a summary that retains the most important information. Unlike abstractive summarization, which generates new sentences, extractive summarization works by identifying and extracting existing sentences.
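As a toy illustration of this idea (not the package's implementation), extractive summarization reduces to three steps: score each sentence, keep the top-k, and restore document order. The length-based scorer here is a deliberately crude stand-in for a real ranking method such as LexRank:

```python
# Toy extractive summarizer: score sentences, pick the top-k,
# and return them in their original document order.
def extract_summary(sentences, k=2, score=len):
    # Rank sentence indices by score, highest first
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    # Keep the top-k indices, restored to document order for coherence
    chosen = sorted(ranked[:k])
    return [sentences[i] for i in chosen]

sentences = [
    "Cats are popular pets.",
    "They were domesticated thousands of years ago in the Near East.",
    "Many households keep at least one cat.",
    "Short one.",
]
print(extract_summary(sentences, k=2))  # keeps the two longest, in document order
```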

Embedding Methods

  1. SBERT (Sentence-BERT): SBERT is a modification of BERT that uses Siamese and triplet networks to derive semantically meaningful sentence embeddings.
  2. TF-IDF (Term Frequency-Inverse Document Frequency): TF-IDF is a numerical statistic that reflects the importance of a word in a document relative to a corpus.
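To make the TF-IDF idea concrete, here is a minimal standard-library sketch of turning sentences into TF-IDF vectors. Real pipelines typically use scikit-learn's TfidfVectorizer, and this package's internals may differ:

```python
# Minimal TF-IDF sentence embeddings: each sentence becomes a vector
# over the vocabulary, weighting terms by frequency in the sentence
# (tf) and rarity across sentences (idf).
import math
from collections import Counter

def tfidf_embed(sentences):
    docs = [s.lower().split() for s in sentences]
    vocab = sorted({w for d in docs for w in d})
    n = len(docs)
    # idf: rarer terms get higher weight
    df = {w: sum(w in d for d in docs) for w in vocab}
    idf = {w: math.log(n / df[w]) + 1.0 for w in vocab}
    vectors = []
    for d in docs:
        tf = Counter(d)
        vectors.append([tf[w] / len(d) * idf[w] for w in vocab])
    return vectors, vocab

vecs, vocab = tfidf_embed(["the cat sat", "the dog ran", "the cat ran"])
print(len(vecs), len(vocab))  # → 3 5 (three sentences, five vocabulary terms)
```

Note how "sat" (appearing in one sentence) receives a higher weight than "the" (appearing in all three) — exactly the behavior TF-IDF is designed for.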

Sentence Ranking Methods

  1. LexRank: LexRank is a graph-based algorithm for computing sentence importance based on eigenvector centrality in a similarity graph.
  2. K-means Clustering: K-means is a clustering algorithm that partitions sentences into k clusters, and representative sentences from each cluster are selected for the summary.
  3. K-means Clustering-2: A variant that applies K-means clustering to identify key sentences in a document. The process involves the following steps:
    • Cluster Determination: Based on the number of sentences in the document, the number of clusters is set to 2 if there are fewer than 10 sentences, and 3 if there are 10 or more sentences.
    • Clustering: The KMeans algorithm is applied to the sentence embeddings, grouping the sentences into clusters based on their similarity.
    • Distance Calculation: For each sentence, the Euclidean distance to its respective cluster centroid is calculated.
    • Sentence Selection: The sentences are then ranked based on their proximity to the cluster centroids. The top num_sentences closest sentences are selected for the summary.
    • Original Order: The selected sentences are sorted to maintain their original order in the document, ensuring a coherent summary.
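The steps above can be sketched roughly as follows. This is a hand-rolled illustration on toy 2-D "embeddings", not the package's code; a real implementation would cluster SBERT or TF-IDF vectors with sklearn.cluster.KMeans:

```python
# Sketch of the K-means Clustering-2 selection steps: cluster sentence
# embeddings, rank sentences by distance to their nearest centroid,
# and return the closest ones in original document order.
import math

def kmeans(points, k, iters=20):
    # Deterministic farthest-point initialization (a simple stand-in
    # for k-means++), followed by standard Lloyd iterations.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(math.dist(p, c)
                                                       for c in centroids)))
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k),
                         key=lambda i: math.dist(p, centroids[i]))].append(p)
        centroids = [[sum(col) / len(cl) for col in zip(*cl)] if cl else centroids[i]
                     for i, cl in enumerate(clusters)]
    return centroids

def select_sentences(embeddings, num_sentences):
    # Cluster-count rule from the text: 2 clusters for <10 sentences, else 3
    k = 2 if len(embeddings) < 10 else 3
    centroids = kmeans(embeddings, k)
    # Distance of each sentence to its nearest centroid
    def dist_to_centroid(i):
        return min(math.dist(embeddings[i], c) for c in centroids)
    ranked = sorted(range(len(embeddings)), key=dist_to_centroid)
    # Top-n closest sentences, restored to original document order
    return sorted(ranked[:num_sentences])

embeddings = [[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.0, 5.4], [0.05, 0.0]]
print(select_sentences(embeddings, num_sentences=2))  # → [0, 4]
```

Sentences 0 and 4 sit closest to their cluster centroid, so they are selected and returned in document order.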

Features

  • Flexible Embedding Methods: Choose between SBERT and TF-IDF for embedding sentences.
  • Multiple Sentence Ranking Algorithms: Use LexRank or K-means clustering to rank sentences and create summaries.
  • Modular and Extensible: Designed to easily incorporate new embedding methods and ranking algorithms.

Usage

Basic Usage

Here's an example of how to use the MultiExtractiveSummarizer package to create a summary of a text document.

from MultiExtractiveSummarizer import MultiExtractiveSummarizer

# Initialize the summarizer
summarizer = MultiExtractiveSummarizer(embedding_method='sbert', summarization_method='lexrank')

# Example text document
text = """
Your text document goes here...
"""

# Generate a summary with a fixed number of sentences
summary = summarizer.summarize(text, num_sentences=5)

print("Summary:")
print(summary)

# Generate a summary covering a given ratio of the text
summary = summarizer.summarize(text, ratio=0.5)

print("Summary:")
print(summary)

Advanced Usage

For advanced usage, you can specify different parameters for embedding methods and sentence ranking algorithms.

from MultiExtractiveSummarizer import MultiExtractiveSummarizer

# Initialize the summarizer with TF-IDF and K-means
summarizer = MultiExtractiveSummarizer(embedding_method='tfidf', summarization_method='kmeans')

# Example text document
text = """
Your long text document goes here...
"""

# Generate the summary
summary = summarizer.summarize(text, num_sentences=5)

print("Summary:")
print(summary)

Future Work

I plan to expand the capabilities of the MultiExtractiveSummarizer package by including:

  • Additional embedding methods: Word2Vec, GloVe, and BERT embeddings.
  • New sentence ranking algorithms: TextRank, KLA, and others.

Stay tuned for updates and new features!

Contributing

We welcome contributions from the community. If you have suggestions or would like to contribute, please fork the repository and create a pull request.

License

This project is licensed under the MIT License. See the LICENSE file for details.
