NLP text similarity calculation
Intro
Sim•phile = "the love of similarities"
The aim is to provide easy access to text similarity methods that are language-agnostic and (ideally) much faster in execution time than methods that employ text embeddings.
- Compression Similarity – leverages the pattern recognition of compression algorithms
- Euclidian Similarity – treats texts like points in multi-dimensional space and calculates their closeness
- Jaccard Similarity – texts are more similar the more their words overlap
Use Cases:
- When speed is required
- As a fast pre-filter that narrows a result set before it is passed to more CPU-intensive methods (e.g. embeddings)
- When the language is unknown
- Non-language comparisons (e.g. URL clustering)
- Language detection (e.g. compare a text to Spanish, English, French, etc. lexicons and return the match with the highest score)
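The pre-filter use case can be sketched roughly as follows. This is an illustration only, not part of simphile: `prefilter` and `token_overlap` are hypothetical names, and the scorer is a simple stand-in that could be swapped for any cheap similarity function.

```python
import heapq
from collections import Counter


def token_overlap(a: str, b: str) -> float:
    """Hypothetical cheap scorer: shared word occurrences / total word occurrences."""
    counts_a = Counter(a.lower().split())
    counts_b = Counter(b.lower().split())
    total = sum((counts_a | counts_b).values())
    return sum((counts_a & counts_b).values()) / total if total else 1.0


def prefilter(query, candidates, scorer=token_overlap, keep=2):
    """Keep only the `keep` candidates that score highest against `query`."""
    return heapq.nlargest(keep, candidates, key=lambda c: scorer(query, c))


docs = [
    "how to train a dog",
    "stock market news today",
    "train your dog to sit",
    "best pasta recipes",
]

# Only the shortlist would then be handed to a slower, embedding-based ranker.
shortlist = prefilter("dog training tips", docs, keep=2)
print(shortlist)
```

The point of the design is that the expensive method only ever sees `keep` items, so its cost no longer grows with the size of the full candidate set.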
Usage:
pip install simphile
Documentation
Simphile text similarity documentation
E-Z ways to help
- Give this repo a ⭐️
- Vote up this answer on Stack Overflow!
Brief Explanations
Compression Similarity
Compression algorithms find patterns in files in order to shrink them. This method uses that pattern detection to measure similarity. If a compressor can use the patterns that it found in text_a to also decently compress text_b, then that means there are similar patterns in both files. The crux of the similarity score is computed akin to this pseudocode example:
length(compress(concatenate(text_a, text_b))) / (length(compress(text_a)) + length(compress(text_b)))
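A minimal runnable sketch of that ratio, assuming `zlib` as the compressor (the text above does not say which compressor the library uses). `compressed_size` and `compression_ratio` are hypothetical helper names. A lower ratio means the two texts share more patterns; how the library turns this ratio into a final score is not shown here.

```python
import zlib


def compressed_size(text: str) -> int:
    """Number of bytes in the zlib-compressed UTF-8 encoding of `text`."""
    return len(zlib.compress(text.encode("utf-8")))


def compression_ratio(text_a: str, text_b: str) -> float:
    """Size of the texts compressed together over the sum of their separate sizes."""
    combined = compressed_size(text_a + text_b)
    separate = compressed_size(text_a) + compressed_size(text_b)
    return combined / separate


repetitive = "the quick brown fox jumps over the lazy dog " * 20
unrelated = "".join(str(i * i) for i in range(300))

print(compression_ratio(repetitive, repetitive))  # shared patterns: lower ratio
print(compression_ratio(repetitive, unrelated))   # few shared patterns: higher ratio
```

Very short inputs are dominated by the compressor's fixed overhead, so the ratio is most informative on texts of at least a few sentences.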
Further Reading:
- "The Similarity Metric" - the origin of this method
- a nice writeup
Jaccard Similarity
All of the write-ups I have seen for Jaccard get it wrong in the implementation. They all use set() data structures. At a quick glance this makes sense because the method uses set arithmetic (e.g. union, intersection). However, sets don't allow duplicate elements, so this is unsatisfactory for text analysis. For example "dog cat cat cat" and "dog dog dog cat" are two very different types of pet owners, but using sets would see that as {"dog", "cat"} and another {"dog", "cat"} and 100% similar.
This implementation of Jaccard uses set arithmetic on lists, so duplicate words are counted.
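To make the difference concrete, here is one possible way to do duplicate-aware Jaccard, using `collections.Counter` as a multiset. This is a sketch under that assumption and may not match the library's own code; `multiset_jaccard` and `set_jaccard` are hypothetical names.

```python
from collections import Counter


def set_jaccard(text_a: str, text_b: str) -> float:
    """The common set-based version, which ignores duplicate words."""
    words_a, words_b = set(text_a.split()), set(text_b.split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 1.0


def multiset_jaccard(text_a: str, text_b: str) -> float:
    """Jaccard over word counts: min counts for intersection, max counts for union."""
    counts_a = Counter(text_a.split())
    counts_b = Counter(text_b.split())
    intersection = sum((counts_a & counts_b).values())
    union = sum((counts_a | counts_b).values())
    return intersection / union if union else 1.0


owner_a = "dog cat cat cat"
owner_b = "dog dog dog cat"

print(set_jaccard(owner_a, owner_b))       # 1.0, duplicates are lost
print(multiset_jaccard(owner_a, owner_b))  # 2 shared occurrences out of 6
```

With counts kept, the two pet owners share one "dog" and one "cat" out of three of each, giving 2/6 instead of a perfect match.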
Further Reading:
- Vote up this answer on Stack Overflow!
- Jaccard Index on Wikipedia
Euclidian Similarity
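As a rough illustration of the idea described in the intro (texts as points in multi-dimensional space), the sketch below represents each text as a vector of word counts and measures the straight-line distance between them. The word-count vectors, the `1 / (1 + distance)` mapping to a score, and the name `euclidean_word_similarity` are all assumptions made for this example, not the library's documented behavior.

```python
import math
from collections import Counter


def euclidean_word_similarity(text_a: str, text_b: str) -> float:
    """Map the Euclidean distance between word-count vectors into (0, 1]."""
    counts_a = Counter(text_a.split())
    counts_b = Counter(text_b.split())
    # Each distinct word is one dimension; a missing word counts as 0.
    vocabulary = set(counts_a) | set(counts_b)
    distance = math.sqrt(
        sum((counts_a[word] - counts_b[word]) ** 2 for word in vocabulary)
    )
    # Assumed mapping: identical texts give 1.0, distant texts approach 0.
    return 1.0 / (1.0 + distance)


print(euclidean_word_similarity("dog cat cat cat", "dog cat cat cat"))
print(euclidean_word_similarity("dog cat cat cat", "dog dog dog cat"))
```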