Multilanguage summarizer, intended to improve text readability

Multilang Summarizer

This package implements an online multi-document summarization algorithm, intended to improve text readability. It supports the following languages:

'de': 'German'
'en': 'English'
'es': 'Spanish'
'fr': 'French'
'hu': 'Hungarian'
'it': 'Italian'
'pt': 'Portuguese'
'ro': 'Romanian'
'sv': 'Swedish'

This work was partially supported by the National Council of Science and Technology (CONACYT) of Mexico, as part of the Cátedras CONACYT project Infraestructura para agilizar el desarrollo de sistemas centrados en el usuario, Ref. 3053.

Prerequisites

This project has the following dependencies:

Pyphen
TextStat
sentence-splitter
numpy
NLTK (needs to download tokenization corpora)

Installing

The package is distributed via pip:

pip install multilang-summarizer

Use

The summarizer function directly implements the algorithm.

from multilang_summarizer.summarizer import summarizer

# summarizer(D_path, f_method, seq_method, lemmatizer, session_id=1)

It receives the path to a single document, one of three sentence relevance functions (f_method), a method for selecting relevant terms (seq_method), a lemmatizer, and a session number (which identifies the running summary kept in memory).

The choice of f_method can be one of three:

'f1' : uses mean term likelihood as an indicator of relevance.
'f2' : uses past term use and syllabic entropy to measure relevance and sentence complexity, respectively.
'f3' : uses a weighted TF-IDF-based approach to measure relevance.
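
As a rough illustration of the idea behind 'f1', and not the package's actual implementation, the sketch below scores a sentence by the mean likelihood of its terms. The helper names term_likelihoods and mean_term_likelihood are hypothetical, and likelihoods are assumed here to be plain relative frequencies over previously seen tokens.

```python
from collections import Counter


def term_likelihoods(tokens):
    """Relative frequency of each term (an assumed stand-in for 'likelihood')."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def mean_term_likelihood(sentence_tokens, likelihoods):
    """Average the likelihood of the sentence's terms; unseen terms count as 0."""
    if not sentence_tokens:
        return 0.0
    score = sum(likelihoods.get(term, 0.0) for term in sentence_tokens)
    return score / len(sentence_tokens)


# Toy data: terms seen so far, then two candidate sentences.
seen = ["shuttle", "shuttle", "station", "crew"]
likelihoods = term_likelihoods(seen)
frequent = mean_term_likelihood(["shuttle", "station"], likelihoods)
unseen = mean_term_likelihood(["weather"], likelihoods)
```

Under this toy definition, a sentence built from frequently seen terms scores higher than one made of unseen terms.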

The choice of seq_method can be one of three:

'partial' : uses simple matching between the last generated summary and the new input to identify relevant terms.
'probability' : uses past term likelihoods to identify relevant terms.
'lcs' : uses the Longest Common Subsequence algorithm to identify relevant terms between the last generated summary and the new input.
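
For readers unfamiliar with the technique named by 'lcs', here is a minimal, generic sketch of the Longest Common Subsequence computed over two token lists with dynamic programming. It is a textbook version, not the package's code; the function name longest_common_subsequence and the whitespace tokenization are assumptions made for the example.

```python
def longest_common_subsequence(a, b):
    """Return one longest common subsequence of the token lists a and b."""
    m, n = len(a), len(b)
    # table[i][j] holds the LCS length of a[:i] and b[:j].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j])

    # Walk back from the bottom-right corner to recover the tokens.
    result = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            result.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return result[::-1]


last_summary = "the shuttle docked with the station".split()
new_input = "the crew said the shuttle reached the station".split()
shared_terms = longest_common_subsequence(last_summary, new_input)
```

Here shared_terms keeps the tokens that appear in both texts in the same relative order, which is the kind of overlap the 'lcs' option is described as using.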

The lemmatizer object contains the lemmatization rules for the selected language. For English, it can be instantiated as follows:

from multilang_summarizer.lemmatizer import Lemmatizer


lemmatizer = Lemmatizer.for_language("en")

Finally, session_id tells the algorithm which running summary the input document D will be added to. Several sessions can be open at once. To clear the cache for all sessions, use the following function:

from multilang_summarizer.summarizer import clean_working_memory

clean_working_memory()

In the end, summarizer returns a Document object containing all the sentences selected from all previous documents in the named session, and the f score with which each sentence was selected.
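
Since each call adds one document to a session, a multi-document summary comes from calling the function once per file with the same session_id. The sketch below shows one way to wrap that loop. The helper summarize_stream and its summarize parameter are hypothetical; the call convention is taken from the signature comment shown earlier, and the function is passed in as an argument so the sketch does not depend on the package being installed.

```python
F_METHODS = {"f1", "f2", "f3"}
SEQ_METHODS = {"partial", "probability", "lcs"}


def summarize_stream(paths, summarize, lemmatizer,
                     f_method="f1", seq_method="lcs", session_id=1):
    """Feed documents one by one into the same session; return the last result."""
    if f_method not in F_METHODS:
        raise ValueError(f"unknown f_method: {f_method!r}")
    if seq_method not in SEQ_METHODS:
        raise ValueError(f"unknown seq_method: {seq_method!r}")

    result = None
    for path in paths:
        # Same argument order as summarizer(D_path, f_method, seq_method,
        # lemmatizer, session_id=1).
        result = summarize(path, f_method, seq_method, lemmatizer,
                           session_id=session_id)
    return result
```

With the package installed, summarize would be the summarizer function and lemmatizer the object returned by Lemmatizer.for_language("en"); the value returned after the last document is then the running summary for that session.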

Running the tests

Two example scripts are provided in the repo:

tests/test_english.py
tests/test_spanish.py

To run them, the documents in the test_documents folder are required. Simply execute

python tests/test_english.py

from the root folder after setup.

Example results

The following summary was obtained using f_method = 'f3' and seq_method = 'lcs' over the 10 news items in the test_documents folder.

For the second day in a row, astronauts boarded space shuttle Endeavour 
on Friday for liftoff on NASA's first space station construction flight.
The decision, which followed ``frank and 
candid'' discussions between the two partners, was not imposed by 
the United States, he said.
The main cargo Thursday was the Unity module, the first U.S.-built 
station part.
The shuttle contains 
the second station component.
The mechanical arm has never before moved anything so big.
The bigger worry, by far, was over Endeavour's pursuit and capture 
of Zarya, and its coupling with Unity.

In Spanish, the following summary was obtained using f_method = 'f1' and seq_method = 'lcs' over the 12 Spanish-language news items in the test_documents folder.

Tras una intensa búsqueda llevada a cabo por rescatistas, los 12 niños y su profesor fueron encontrados con vida y en buen estado de salud.
El rescate de los 12 niños y su entrenador que quedaron atrapados en una cueva inundada, en el norte de Tailandia, podría tomar semanas o incluso meses.
Pero aunque los 13 pudieran bucear, algunas partes de la cueva son demasiado estrechas,lo que exige mucho entrenamiento para poder pasar con tanques de buceo.
Los niños fueron encontrados, 200 metros más adelante.
Están cansados y necesitan un tiempo para reponerse.
La primera etapa del rescate es hacerles recuperar fuerzas.
Los 13 miembros están bien.

Authors

Arturo Curiel - arturocuriel.com

License

This project is licensed under the GPLv3 License - see the LICENSE file for details

Acknowledgments

Thanks to Claudio Gutierrez Soto and Rafael Rojano for their input in the development of the algorithm.

Project details

This version

1.7

Jul 2, 2019

Download files

Source Distribution

multilang-summarizer-1.7.tar.gz (15.5 kB)

Uploaded Jul 2, 2019 Source

Built Distribution

multilang_summarizer-1.7-py3-none-any.whl (20.9 MB)

Uploaded Jul 2, 2019 Python 3

File details

Details for the file multilang-summarizer-1.7.tar.gz.

File metadata

Download URL: multilang-summarizer-1.7.tar.gz
Upload date: Jul 2, 2019
Size: 15.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.32.2 CPython/3.7.3

File hashes

Hashes for multilang-summarizer-1.7.tar.gz
Algorithm	Hash digest
SHA256	`0c436c76ad87733ba4c9308f4b178759723ada8a1fc91697a38bf31a417f010f`
MD5	`84b9eff73c6ed34bdac64eb400f9e4d7`
BLAKE2b-256	`831cd10c577becd3bf190398946aab8540f8987805d7f0c0e4a705f56951f8e2`

File details

Details for the file multilang_summarizer-1.7-py3-none-any.whl.

File metadata

Download URL: multilang_summarizer-1.7-py3-none-any.whl
Upload date: Jul 2, 2019
Size: 20.9 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.32.2 CPython/3.7.3

File hashes

Hashes for multilang_summarizer-1.7-py3-none-any.whl
Algorithm	Hash digest
SHA256	`3142cf181f6cb26e19bdb298c0fdee8d4e8a850285c80a543bb408043bc12bb0`
MD5	`0d19ead3108bbd4b3445407c72abc95b`
BLAKE2b-256	`369cdc47d63c609d290733a29f21893a0c2af5cfe5dc33a4df976c83cc7f67bf`
