Clustering similar sentences based on their fuzzy similarity.

Project description

For word stemming, this package uses the Snowball stemmers from the NLTK library, so the following languages are supported:

  • Arabic

  • Danish

  • Dutch

  • English

  • Finnish

  • French

  • German

  • Hungarian

  • Italian

  • Norwegian

  • Portuguese

  • Romanian

  • Russian

  • Spanish

  • Swedish
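For intuition, the NLTK Snowball stemmers behind this language list can be used directly; a minimal illustration of what they do to words:

```python
from nltk.stem.snowball import SnowballStemmer

# Each supported language name maps to an NLTK Snowball stemmer.
english = SnowballStemmer("english")
print(english.stem("running"))  # "run"
print(english.stem("cities"))   # "citi"

german = SnowballStemmer("german")
print(german.stem("kaufen"))    # "kauf"
```

Note that stems are normalized tokens, not dictionary words ("citi" is expected output, not a typo).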

Purpose of the Package

There are popular algorithms for mining topics from a text collection, such as LDA, but they do not work well on small datasets, say a thousand sentences or fewer.

This package tries to solve this for small datasets by making the following naive assumption:

If, after removing all stopwords and stemming the remaining words, two sentences become similar, they are probably talking about the same, or a similar, subject.

The goal is to form clusters/groups of at least two similar sentences; isolated sentences (sentences that do not resemble any other in the set) will not get a cluster of their own. In those cases, the sentence receives the -1 tag.
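The naive assumption can be sketched roughly as follows. This is a toy illustration of the idea, not the package's code: it uses a small hand-written stopword list and the standard library's difflib in place of a real stopword corpus and fuzzy-matching backend, and sorts the stems to mimic token-sort-style comparison.

```python
from difflib import SequenceMatcher
from nltk.stem.snowball import SnowballStemmer

STOPWORDS = {"i", "a", "in", "to", "the", "would", "like"}  # toy list

def normalize(sentence: str) -> str:
    """Lowercase, drop stopwords, stem the rest, and sort the stems."""
    stemmer = SnowballStemmer("english")
    words = [w for w in sentence.lower().split() if w not in STOPWORDS]
    return " ".join(sorted(stemmer.stem(w) for w in words))

# Both sentences reduce to nearly the same normalized form...
a = normalize("I want to buy a car")         # "buy car want"
b = normalize("a car I would like to buy")   # "buy car"

# ...so a plain string-similarity ratio comes out high.
score = SequenceMatcher(None, a, b).ratio()
```

With this normalization, two sentences phrased quite differently end up comparable, which is what makes a simple similarity threshold workable on small datasets.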

Usage

You can choose among several methods to compare the similarity between sentences:

  • ratio

  • partial_ratio

  • token_sort_ratio (the default one)

  • token_set_ratio

These methods come from the fuzzywuzzy string-matching library; see its documentation to learn more about them.

>>> from fuzzy_sentences_clustering import look_for_clusters
>>> eng_sentences = [
...     "I live in New York",
...     "I want to buy a car",
...     "a car I would like to buy",
...     "Ohh New York, I lived there in 2005",
...     "I have a dog",
... ]
>>> ger_sentences = [
...     "ich lebe in New York",
...     "Ich möchte ein Auto kaufen",
...     "ein Auto, das ich kaufen möchte",
...     "Oh New York, Ich habe dort 2005 gelebt",
...     "Ich habe einen Hund",
... ]
>>> look_for_clusters(eng_sentences, similarity_threshold=60)
[1, 2, 2, 1, -1]
>>> look_for_clusters(ger_sentences, language="german", method="token_set_ratio", similarity_threshold=80)
[1, 2, 2, 1, -1]
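For intuition about the output format, labels like [1, 2, 2, 1, -1] can be produced by a naive greedy grouping over pairwise similarity scores. The sketch below is a hypothetical reconstruction of the labeling scheme, not the package's actual algorithm; it compares raw strings with difflib and uses a 0–1 threshold instead of the package's 0–100 scale.

```python
from difflib import SequenceMatcher

def naive_clusters(sentences, threshold=0.6):
    """Greedily group sentences whose pairwise similarity meets the threshold.

    Sentences that never match anything keep the -1 tag, mirroring the
    package's convention for isolated sentences.
    """
    labels = [-1] * len(sentences)
    next_label = 1
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            sim = SequenceMatcher(None, sentences[i], sentences[j]).ratio()
            if sim >= threshold:
                if labels[i] == -1:          # open a new cluster on first match
                    labels[i] = next_label
                    next_label += 1
                labels[j] = labels[i]        # greedy: later matches may relabel j
    return labels
```

Raising the threshold makes clusters stricter, so more sentences fall back to the -1 tag.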

Contribution

Contributions are welcome.

If you find a bug, please let me know.

Author

Cloves Paiva.
