
Text Summarizer

A Python-based text summarization tool that uses GloVe word embeddings and the PageRank algorithm to generate extractive summaries of documents.

Features

  • Extractive Summarization: Uses sentence similarity and PageRank to identify the most important sentences
  • GloVe Embeddings: Leverages pre-trained GloVe word vectors for semantic similarity calculation
  • Multiple Input Methods: Support for single documents, CSV files, or interactive creation
  • GUI Interface: User-friendly Tkinter-based graphical interface
  • Command Line Interface: Scriptable command-line tool for automation
  • Batch Processing: Process multiple documents at once

Installation

Prerequisites

  • Python 3.8 or higher
  • Required packages (automatically installed): pandas, numpy, nltk, scikit-learn, networkx

Install from PyPI

pip install text-summarizer-aweebtaku

Install from Source

  1. Clone the repository:
git clone https://github.com/AWeebTaku/Summarizer.git
cd Summarizer
  2. Install the package:
pip install -e .

Download GloVe Embeddings

The tool requires GloVe word embeddings. Download the 100d version:

wget http://nlp.stanford.edu/data/glove.6B.zip
unzip glove.6B.zip

Place the glove.6B.100d.txt file in the project root or specify the path.
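The embeddings file is plain text: one word per line, followed by its vector components separated by spaces. Loading it into a word-to-vector dictionary can be sketched as below (a minimal illustration run on an in-memory sample in the same format, rather than the real glove.6B.100d.txt file; `load_glove` is a hypothetical helper, not part of the package's API):

```python
import io

import numpy as np

def load_glove(lines):
    """Parse GloVe-format lines into a {word: vector} dict."""
    embeddings = {}
    for line in lines:
        word, *values = line.rstrip().split(" ")
        embeddings[word] = np.asarray(values, dtype="float32")
    return embeddings

# Two-word sample in the same layout as glove.6B.100d.txt (3-d for brevity;
# the real vectors have 100 components).
sample = io.StringIO("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")
vectors = load_glove(sample)
print(vectors["cat"])
```

To load the real file, pass `open('glove.6B.100d.txt', encoding='utf-8')` instead of the sample.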

Usage

Command Line Interface

# Summarize a CSV file
text-summarizer-aweebtaku --csv-file data/tennis.csv --article-id 1

# Interactive mode
text-summarizer-aweebtaku

GUI Interface

python -m text_summarizer.ui

Python API

from text_summarizer import TextSummarizer
import pandas as pd

# Initialize summarizer
summarizer = TextSummarizer(glove_path='glove.6B.100d.txt')

# Load data
df = pd.DataFrame([{'article_id': 1, 'article_text': 'Your text here...'}])

# Run summarization
scored_sentences = summarizer.run_summarization(df)

# Get summary for article ID 1
article_text, summary = summarizer.summarize_article(scored_sentences, 1, df)
print(summary)

Data Format

Input data should be in CSV format with columns:

  • article_id: Unique identifier for each document
  • article_text: The full text of the document

Example:

article_id,article_text
1,"This is the first article. It contains multiple sentences..."
2,"This is the second article. It also has several sentences..."
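A file in this format can also be produced programmatically with pandas, for example (the column names match the required schema; the article texts here are placeholders):

```python
import pandas as pd

# Build a frame with the two required columns.
df = pd.DataFrame({
    "article_id": [1, 2],
    "article_text": [
        "This is the first article. It contains multiple sentences.",
        "This is the second article. It also has several sentences.",
    ],
})

# Render as CSV (write to disk with df.to_csv("articles.csv", index=False)).
csv_text = df.to_csv(index=False)
print(csv_text)
```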

Algorithm

The summarization process follows these steps:

  1. Sentence Tokenization: Split documents into individual sentences
  2. Text Cleaning: Remove punctuation, convert to lowercase, remove stopwords
  3. Sentence Vectorization: Convert sentences to vectors using GloVe embeddings
  4. Similarity Calculation: Compute cosine similarity between all sentence pairs
  5. PageRank Scoring: Apply PageRank algorithm to identify important sentences
  6. Summary Extraction: Select top-ranked sentences in original order
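The steps above can be sketched end to end as follows. This is a minimal, self-contained illustration, not the package's actual implementation: it uses a tiny hypothetical embedding table in place of GloVe, skips stopword removal in the cleaning step, and relies on networkx's pagerank over a cosine-similarity graph.

```python
import re

import networkx as nx
import numpy as np

# Hypothetical 3-d "embeddings" standing in for real 100-d GloVe vectors.
EMBEDDINGS = {
    "cat": np.array([0.9, 0.1, 0.0]),
    "dog": np.array([0.8, 0.2, 0.0]),
    "stock": np.array([0.0, 0.1, 0.9]),
    "market": np.array([0.1, 0.0, 0.8]),
}

def clean(sentence):
    # Step 2: lowercase and strip punctuation (stopword removal omitted here).
    return re.sub(r"[^a-z ]", " ", sentence.lower()).split()

def sentence_vector(sentence, dim=3):
    # Step 3: average the vectors of the words we have embeddings for.
    known = [EMBEDDINGS[w] for w in clean(sentence) if w in EMBEDDINGS]
    return np.mean(known, axis=0) if known else np.zeros(dim)

def summarize(sentences, num_sentences=2):
    vectors = np.array([sentence_vector(s) for s in sentences])
    # Step 4: cosine similarity between all sentence pairs.
    norms = np.clip(np.linalg.norm(vectors, axis=1, keepdims=True), 1e-12, None)
    sim = (vectors / norms) @ (vectors / norms).T
    np.fill_diagonal(sim, 0.0)
    # Step 5: PageRank over the weighted similarity graph.
    scores = nx.pagerank(nx.from_numpy_array(sim))
    # Step 6: keep the top-ranked sentences, emitted in original order.
    top = sorted(sorted(scores, key=scores.get, reverse=True)[:num_sentences])
    return [sentences[i] for i in top]

sentences = [
    "The cat chased the dog.",
    "The dog barked at the cat.",
    "The stock market fell sharply.",
]
print(summarize(sentences, num_sentences=2))
```

The two animal sentences are near-duplicates in embedding space, so they reinforce each other's PageRank scores and are selected over the outlier sentence.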

Configuration

  • glove_path: Path to GloVe embeddings file (default: 'glove.6B.100d.txt/glove.6B.100d.txt')
  • num_sentences: Number of sentences in summary (default: 5)

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Citation

If you use this tool in your research, please cite:

@software{text_summarizer,
  title = {Text Summarizer},
  author = {Your Name},
  url = {https://github.com/AWeebTaku/Summarizer},
  year = {2024}
}
