Clustering similar sentences based on their fuzzy similarity.
Project description
Purpose of the Package
Popular topic-mining algorithms such as LDA do not work well on small datasets, for example a set of about a thousand sentences.
This package tries to solve this for a small dataset by making the following naive assumption:
If we remove the stopwords from two sentences, reduce their words to stems, and then find similar phrases shared by the two sentences, the sentences are probably about the same, or a similar, subject.
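The assumption above can be illustrated with a minimal sketch using only the Python standard library. It is not the package's implementation: the stopword list, the suffix-stripping `crude_stem` helper, the token sorting, and the use of `difflib.SequenceMatcher` as the fuzzy score are all assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical, tiny Portuguese stopword list (illustration only).
STOPWORDS = {"a", "o", "de", "em", "um", "uma", "eu", "do", "da"}

# Hypothetical crude stemmer: strips the first matching ending,
# as long as at least three characters remain.
SUFFIXES = ("ava", "ar", "as", "os", "a", "o")


def crude_stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(sentence):
    """Lowercase, drop stopwords, stem, and sort the remaining stems."""
    tokens = re.findall(r"\w+", sentence.lower())
    stems = [crude_stem(t) for t in tokens if t not in STOPWORDS]
    # Sorting makes the comparison insensitive to word order (an assumption).
    return " ".join(sorted(stems))


def similarity(a, b):
    """Fuzzy similarity of two sentences on a 0-100 scale."""
    return 100 * SequenceMatcher(None, normalize(a), normalize(b)).ratio()


print(normalize("comprar um carro"))                          # carr compr
print(similarity("comprar um carro", "compra de um carro"))   # 100.0
```

With this sketch, "comprar um carro" and "compra de um carro" reduce to the same stems and score 100, while unrelated sentences score much lower.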
The goal is to form clusters (groups) of at least two similar sentences. An isolated sentence, one that does not resemble any other sentence in the set, does not get a cluster of its own; it receives the label -1 instead.
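The labeling rule can be sketched as follows. This is one possible reading, not the package's actual algorithm: it assumes clusters are formed transitively (union-find over every pair that reaches the threshold), that labels are numbered from 1 in order of first appearance, and that `difflib.SequenceMatcher` stands in for the fuzzy score. The function name `label_clusters` and the pre-normalized input strings are hypothetical.

```python
from difflib import SequenceMatcher


def label_clusters(items, similarity, threshold):
    """Return one label per item; items with no similar partner get -1."""
    parent = list(range(len(items)))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge every pair whose similarity reaches the threshold.
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if similarity(items[i], items[j]) >= threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[max(ri, rj)] = min(ri, rj)

    roots = [find(i) for i in range(len(items))]
    sizes = {r: roots.count(r) for r in set(roots)}

    labels, assigned, next_label = [], {}, 1
    for r in roots:
        if sizes[r] < 2:
            labels.append(-1)  # isolated item: no cluster of its own
        else:
            if r not in assigned:
                assigned[r] = next_label
                next_label += 1
            labels.append(assigned[r])
    return labels


def ratio(a, b):
    return 100 * SequenceMatcher(None, a, b).ratio()


# Hypothetical input: sentences already reduced to sorted stems.
items = ["florianópolis mor", "carr compr", "carr compr",
         "florianópolis mor", "gost samb", "comer quer tapioc"]
print(label_clusters(items, ratio, 90))  # [1, 2, 2, 1, -1, -1]
```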
For now, it works only for Portuguese.
Installation
You can install it using pip:
pip3 install fuzzy-sentences-clustering
Usage
>>> from fuzzy_sentences_clustering import look_for_clusters
>>> sentences = ["morava em florianópolis", "comprar um carro", "compra de um carro", "em florianópolis eu moro", "gosto de samba", "quero comer tapioca"]
>>> res = look_for_clusters(sentences=sentences, similarity_threshold=90)
>>> print(res)
output: [1, 2, 2, 1, -1, -1]
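Judging from the example, the returned list holds one label per input sentence, in the same order. Under that assumption, a short sketch for grouping the sentences by cluster could look like this (the labels are copied from the output above rather than computed, so the snippet runs without the package installed):

```python
from collections import defaultdict

sentences = ["morava em florianópolis", "comprar um carro", "compra de um carro",
             "em florianópolis eu moro", "gosto de samba", "quero comer tapioca"]
labels = [1, 2, 2, 1, -1, -1]  # output of the usage example above

clusters = defaultdict(list)
unclustered = []
for sentence, label in zip(sentences, labels):
    if label == -1:
        unclustered.append(sentence)  # isolated sentence, no cluster
    else:
        clusters[label].append(sentence)

for label, members in sorted(clusters.items()):
    print(label, members)
print("unclustered:", unclustered)
```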
Contribution
Contributions are welcome.
If you find a bug, please let me know.
Hashes for fuzzy-sentences-clustering-0.0.0.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 1cf0877863eff4324ed4287797104e265dcadf4e5613c0e4041179f5bc699f5c
MD5 | 42af5bca5b91240ed224331c9f07573e
BLAKE2b-256 | 38488e7e815db2965cd71e623b64edbe3395da874cede3759131edaaf07c1181
Hashes for fuzzy_sentences_clustering-0.0.0-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | b80843f711572dcfee918c560d2d06e1a4b36cb0a00fd18bc22cbbb0db40231f
MD5 | 650a793843bf87abfb549dd0cd7c445f
BLAKE2b-256 | 922ac47df7b1cb07988895a2576d1d284f207119fb5affc934632b1a4cab1ee0