Clustering similar sentences based on their fuzzy similarity.
To extract word stems I use the Snowball stemmers from the NLTK library, so the following languages are supported:
Arabic
Danish
Dutch
English
Finnish
French
German
Hungarian
Italian
Norwegian
Portuguese
Romanian
Russian
Spanish
Swedish
Purpose of the Package
There are popular algorithms for mining topics in a collection of texts, such as LDA, but they don’t work very well on a small dataset, say a thousand sentences.
This package tries to solve the problem for small datasets by making the following naive assumption:
If two sentences become similar once I remove all the stopwords and reduce the remaining words to their stems, they are probably talking about the same, or a similar, subject.
The goal is to form clusters (groups) of at least two similar sentences. Isolated sentences (sentences that don’t look like any other in the set) do not get a cluster of their own; they receive the -1 tag instead.
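To make the assumption concrete, here is a minimal sketch of the normalization step. It is not the package's code: the stopword list and the suffix stripper below are hypothetical stand-ins for NLTK's stopword lists and Snowball stemmers, kept tiny so the example runs with the standard library only.

```python
import re

# Hypothetical stand-ins: the package relies on NLTK stopwords and Snowball
# stemmers; this short list and suffix stripper only illustrate the idea.
STOPWORDS = {"i", "a", "to", "in", "the", "would", "like", "want", "there", "have"}
SUFFIXES = ("ing", "ed", "s")


def crude_stem(word: str) -> str:
    """Strip one common English suffix (a rough stand-in for Snowball)."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(sentence: str) -> str:
    """Lowercase, drop stopwords, stem, and join the remaining tokens."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    return " ".join(crude_stem(t) for t in tokens if t not in STOPWORDS)


print(normalize("I want to buy a car"))        # buy car
print(normalize("a car I would like to buy"))  # car buy
```

After normalization the two sentences contain the same words in a different order, which is the kind of pair a word-order-insensitive comparison should treat as similar.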
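The labeling rule can be sketched as follows. This is an assumption about one reasonable way to do it, not the package's actual algorithm: sentences whose similarity reaches the threshold are linked, linked groups with at least two members get a positive label, and everything else gets -1. The `label_clusters` and `token_overlap` helpers are hypothetical, and the input strings are normalized by hand for illustration.

```python
from itertools import combinations
from typing import Callable, Dict, List


def label_clusters(
    items: List[str],
    similarity: Callable[[str, str], float],
    threshold: float,
) -> List[int]:
    """Link items whose similarity reaches the threshold; singletons get -1."""
    parent = list(range(len(items)))

    def find(i: int) -> int:
        # Follow parent links to the group's root, shortening the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(items)), 2):
        if similarity(items[i], items[j]) >= threshold:
            parent[find(j)] = find(i)

    # Count how many items ended up in each group.
    sizes: Dict[int, int] = {}
    for i in range(len(items)):
        root = find(i)
        sizes[root] = sizes.get(root, 0) + 1

    labels: List[int] = []
    assigned: Dict[int, int] = {}
    for i in range(len(items)):
        root = find(i)
        if sizes[root] < 2:
            labels.append(-1)  # isolated sentence: no cluster of its own
            continue
        if root not in assigned:
            assigned[root] = len(assigned) + 1  # number groups by first appearance
        labels.append(assigned[root])
    return labels


def token_overlap(a: str, b: str) -> float:
    """Hypothetical similarity: shared words over all words, on a 0-100 scale."""
    words_a, words_b = set(a.split()), set(b.split())
    union = words_a | words_b
    return 100 * len(words_a & words_b) / len(union) if union else 0.0


normalized = ["live new york", "buy car", "car buy", "new york live 2005", "dog"]
print(label_clusters(normalized, token_overlap, threshold=60))  # [1, 2, 2, 1, -1]
```

Numbering groups by first appearance is a choice made for this sketch; the package may number its clusters differently.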
Usage
You can choose among several methods to compare the similarity between sentences:
ratio
partial_ratio
token_sort_ratio (the default one)
token_set_ratio
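These names match the scorers of the same names in fuzzy string matching libraries such as thefuzz (formerly fuzzywuzzy); see that documentation to learn more about each method.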
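To give an intuition for why token_sort_ratio is a sensible default, here is a rough sketch of the difference between a plain ratio and a token-sorted ratio. It uses `difflib.SequenceMatcher` from the standard library as a stand-in scorer, so the exact numbers may differ from those the package computes; both functions below are illustrative, not the package's implementation.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Character-level similarity of two strings on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_sort_ratio(a: str, b: str) -> int:
    """Sort the words of each string alphabetically, then compare."""
    sorted_a = " ".join(sorted(a.lower().split()))
    sorted_b = " ".join(sorted(b.lower().split()))
    return simple_ratio(sorted_a, sorted_b)


print(simple_ratio("buy car", "car buy"))      # below 100: word order differs
print(token_sort_ratio("buy car", "car buy"))  # 100: same words once sorted
```

Sorting the tokens first makes the score insensitive to word order, which matters when two sentences say the same thing with their words rearranged.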
>>> from fuzzy_sentences_clustering import look_for_clusters
>>> eng_sentences = [
...     "I live in New York",
...     "I want to buy a car",
...     "a car I would like to buy",
...     "Ohh New York, I lived there in 2005",
...     "I have a dog",
... ]
>>> ger_sentences = [
...     "ich lebe in New York",
...     "Ich möchte ein Auto kaufen",
...     "ein Auto, das ich kaufen möchte",
...     "Oh New York, Ich habe dort 2005 gelebt",
...     "Ich habe einen Hund",
... ]
>>> look_for_clusters(eng_sentences, similarity_threshold=60)
output: [1, 2, 2, 1, -1]
>>> look_for_clusters(ger_sentences, language="german", method="token_set_ratio", similarity_threshold=80)
output: [1, 2, 2, 1, -1]
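Since the result is one tag per sentence, in the same order as the input, a small helper can turn it back into groups of sentences. The `group_by_label` function below is a hypothetical convenience, not part of the package, and the labels are copied from the English example rather than computed here.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_label(sentences: List[str], labels: List[int]) -> Dict[int, List[str]]:
    """Collect sentences under their cluster tag; -1 holds the isolated ones."""
    clusters: Dict[int, List[str]] = defaultdict(list)
    for sentence, label in zip(sentences, labels):
        clusters[label].append(sentence)
    return dict(clusters)


eng_sentences = [
    "I live in New York",
    "I want to buy a car",
    "a car I would like to buy",
    "Ohh New York, I lived there in 2005",
    "I have a dog",
]
labels = [1, 2, 2, 1, -1]  # copied from the English example, not computed here

for label, members in sorted(group_by_label(eng_sentences, labels).items()):
    print(label, members)
```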
Contribution
Contributions are welcome.
If you find a bug, please let me know.