Multilanguage summarizer, intended to improve text readability

Multilang Summarizer

This package implements an online multi-document summarization algorithm, intended to improve text readability. It supports the following languages:

  • 'de': 'German'
  • 'en': 'English'
  • 'es': 'Spanish'
  • 'fr': 'French'
  • 'hu': 'Hungarian'
  • 'it': 'Italian'
  • 'pt': 'Portuguese'
  • 'ro': 'Romanian'
  • 'sv': 'Swedish'

This work was partially supported by the National Council of Science and Technology (CONACYT) of Mexico, as part of the Cátedras CONACYT project Infraestructura para agilizar el desarrollo de sistemas centrados en el usuario, Ref. 3053.

Prerequisites

This project has the following dependencies:

  • Pyphen
  • TextStat
  • sentence-splitter
  • numpy
  • NLTK (its tokenization corpora must be downloaded)

Installing

The package is distributed via pip:

pip install multilang-summarizer

Use

The summarizer function directly implements the algorithm.

from multilang_summarizer.summarizer import summarizer

# summarizer(D_path, f_method, seq_method, lemmatizer, session_id=1)

It receives the path to a single document (D_path), a choice among three sentence relevance functions (f_method), a method for selecting relevant terms (seq_method), a lemmatizer, and a session number (session_id, used to keep per-session memory).

The choice of f_method can be one of three:

  • 'f1' : uses mean term likelihood as an indicator of relevance.
  • 'f2' : uses past term use and syllabic entropy to measure relevance and sentence complexity, respectively.
  • 'f3' : uses a weighted tfidf-based approach to measure relevance.
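
To make the idea behind 'f1' concrete, the following is a minimal sketch of scoring a sentence by mean term likelihood. It is not the package's implementation: the helper names term_likelihoods and mean_term_likelihood are hypothetical, sentences are assumed to be pre-tokenized lists of terms, and likelihood is assumed to be plain relative frequency over the documents seen so far.

```python
from collections import Counter


def term_likelihoods(documents):
    """Estimate each term's likelihood as its relative frequency (assumption)."""
    counts = Counter(token for doc in documents for token in doc)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


def mean_term_likelihood(sentence, likelihoods):
    """Score a tokenized sentence by the average likelihood of its terms."""
    if not sentence:
        return 0.0
    # Terms never seen before contribute a likelihood of zero.
    return sum(likelihoods.get(t, 0.0) for t in sentence) / len(sentence)


# Two tiny, already tokenized documents (illustrative data only).
docs = [["shuttle", "launch", "delayed"], ["shuttle", "docks", "station"]]
likelihoods = term_likelihoods(docs)
score = mean_term_likelihood(["shuttle", "station"], likelihoods)
```

Under these assumptions, sentences built from frequently seen terms score higher than sentences made of rare or unseen terms.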

The choice of seq_method can be one of three:

  • 'partial' : uses simple matching between the last generated summary and the new input to identify relevant terms.
  • 'probability' : uses past term likelihoods to identify relevant terms.
  • 'lcs' : uses the Longest Common Subsequence algorithm to identify relevant terms between the last generated summary and the new input.
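
As an illustration of the 'lcs' option, here is a minimal sketch of the standard dynamic-programming Longest Common Subsequence over two token lists. It shows the general technique only; the function name longest_common_subsequence and the sample tokens are hypothetical, and the package may tokenize, lemmatize, or post-process differently.

```python
def longest_common_subsequence(a, b):
    """Return one longest common subsequence of the token lists a and b."""
    m, n = len(a), len(b)
    # table[i][j] holds the LCS length of a[:i] and b[:j].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Walk back through the table to recover the shared tokens.
    result = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            result.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return result[::-1]


# Tokens of the last summary and of the new input (illustrative data only).
summary_tokens = ["the", "shuttle", "reached", "the", "station"]
new_tokens = ["the", "crew", "left", "the", "station", "today"]
shared = longest_common_subsequence(summary_tokens, new_tokens)
```

Unlike simple matching, the subsequence preserves the order in which shared terms appear in both texts.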

The lemmatizer object contains the lemmatization rules for the selected language. For English, it can be instantiated as follows:

from multilang_summarizer.lemmatizer import Lemmatizer


lemmatizer = Lemmatizer.for_language("en")

Finally, session_id tells the algorithm which running summary the input document D will be added to. Several sessions can be open at once. To clear the cache for all sessions, use the following function:

from multilang_summarizer.summarizer import clean_working_memory

clean_working_memory()

In the end, summarizer returns a Document object containing all the sentences selected from all previous documents in the named session, and the f score with which each sentence was selected.
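
Putting the pieces together, the sketch below feeds several documents into one session. The wrapper run_session and the constants VALID_F_METHODS and VALID_SEQ_METHODS are hypothetical helpers, not part of the package; only summarizer and Lemmatizer.for_language, with the signatures shown above, are taken from this README.

```python
VALID_F_METHODS = ("f1", "f2", "f3")
VALID_SEQ_METHODS = ("partial", "probability", "lcs")


def run_session(paths, f_method="f3", seq_method="lcs", language="en", session_id=1):
    """Feed documents one by one into a session and return the last summary."""
    if f_method not in VALID_F_METHODS:
        raise ValueError(f"unknown f_method: {f_method!r}")
    if seq_method not in VALID_SEQ_METHODS:
        raise ValueError(f"unknown seq_method: {seq_method!r}")

    # Imported here so that option errors surface before the package is loaded.
    from multilang_summarizer.lemmatizer import Lemmatizer
    from multilang_summarizer.summarizer import summarizer

    lemmatizer = Lemmatizer.for_language(language)
    summary = None
    for path in paths:
        # Each call adds one document to the running summary of this session.
        summary = summarizer(path, f_method, seq_method, lemmatizer,
                             session_id=session_id)
    return summary
```

Called with a list of document paths, the wrapper returns the Document produced by the last call, or None if the list is empty.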

Running the tests

Two example scripts are provided in the repo:

  • tests/test_english.py
  • tests/test_spanish.py

They require the documents in the test_documents folder. Simply execute

python tests/test_english.py

from the root folder after setup.

Example results

The following summary was obtained using f_method = 'f3' and seq_method = 'lcs' over the 10 news items in the test_documents folder.

For the second day in a row, astronauts boarded space shuttle Endeavour 
on Friday for liftoff on NASA's first space station construction flight.
The decision, which followed ``frank and 
candid'' discussions between the two partners, was not imposed by 
the United States, he said.
The main cargo Thursday was the Unity module, the first U.S.-built 
station part.
The shuttle contains 
the second station component.
The mechanical arm has never before moved anything so big.
The bigger worry, by far, was over Endeavour's pursuit and capture 
of Zarya, and its coupling with Unity.

In Spanish, the following summary was obtained using f_method = 'f1' and seq_method = 'lcs' over the 12 Spanish-language news items in the test_documents folder.

Tras una intensa búsqueda llevada a cabo por rescatistas, los 12 niños y su profesor fueron encontrados con vida y en buen estado de salud.
El rescate de los 12 niños y su entrenador que quedaron atrapados en una cueva inundada, en el norte de Tailandia, podría tomar semanas o incluso meses.
Pero aunque los 13 pudieran bucear, algunas partes de la cueva son demasiado estrechas,lo que exige mucho entrenamiento para poder pasar con tanques de buceo.
Los niños fueron encontrados, 200 metros más adelante.
Están cansados y necesitan un tiempo para reponerse.
La primera etapa del rescate es hacerles recuperar fuerzas.
Los 13 miembros están bien.

Authors

License

This project is licensed under the GPLv3 License; see the LICENSE file for details.

Acknowledgments
