
NLP text similarity calculation


Simphile

Python Text Similarity NLP Library


Intro

Sim•phile = "the love of similarities"

The aim is to provide easy access to text similarity methods that are language-agnostic and (ideally) much faster in execution time than methods that employ text embeddings.

  • Compression Similarity – leverages the pattern recognition of compression algorithms
  • Euclidian Similarity – treats texts like points in multi-dimensional space and calculates their closeness
  • Jaccard Similarity – texts are more similar the more their words overlap

Use Cases:

  • when speed is required
    • as a fast pre-filter that narrows the result set before it is passed to more CPU-intensive methods (e.g. embeddings)
  • when language is unknown
  • non-language comparisons (e.g. URL clustering)
  • language detection (e.g. compare a text to Spanish, English, French, etc. lexicons and return match with highest score)
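
The language detection use case can be sketched roughly as follows. This is only an illustration: the tiny LEXICONS table, the word_overlap helper, and guess_language are all hypothetical names made up for this example, not part of Simphile, and a real lexicon would be far larger.

```python
from collections import Counter

# Hypothetical, deliberately tiny lexicons of common words per language.
LEXICONS = {
    "english": "the and is of to in that it with for",
    "spanish": "el la y es de en que los con para",
    "french": "le la et est de en que les avec pour",
}


def word_overlap(text_a: str, text_b: str) -> float:
    """Word-overlap score in [0, 1] that keeps duplicate words."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    union = sum((counts_a | counts_b).values())
    # Assumption: two empty texts score 0.0 here.
    return sum((counts_a & counts_b).values()) / union if union else 0.0


def guess_language(text: str) -> str:
    """Return the lexicon name whose words overlap most with the text."""
    return max(LEXICONS, key=lambda lang: word_overlap(text, LEXICONS[lang]))


print(guess_language("the cat is in the house with the dog"))
print(guess_language("el gato es de la casa que"))
```

Any of the similarity methods described here could stand in for word_overlap; the point is only that the highest-scoring lexicon wins.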

Usage:

pip install simphile

Documentation

Simphile text similarity documentation

E-Z ways to help

Brief Explanations

Compression Similarity

Compression algorithms find patterns in files in order to shrink them. This method uses that pattern detection to measure similarity. If a compressor can use the patterns that it found in text_a to also decently compress text_b, then that means there are similar patterns in both files. The crux of the similarity score is computed akin to this pseudocode example:

length(compress(concatenate(text_a, text_b))) / (length(compress(text_a)) + length(compress(text_b)))
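
As a rough illustration of the pseudocode above, the sketch below uses Python's standard zlib module as the compressor. The names compressed_size and compression_ratio are made up for this example and are not Simphile's API, and the choice of zlib is an assumption. Note that this raw ratio gets smaller as the texts become more alike, so a similarity score would still need to invert or rescale it.

```python
import zlib


def compressed_size(text: str) -> int:
    """Number of bytes the text occupies after zlib compression."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def compression_ratio(text_a: str, text_b: str) -> float:
    """Size of the texts compressed together, relative to compressing them apart.

    Lower values mean the compressor could reuse patterns from text_a
    while compressing text_b, i.e. the texts share more structure.
    """
    joint = compressed_size(text_a + text_b)
    separate = compressed_size(text_a) + compressed_size(text_b)
    return joint / separate


fox = "The quick brown fox jumps over the lazy dog near the river bank."
revenue = "Quarterly revenue grew twelve percent, driven by cloud subscriptions."

print(compression_ratio(fox, fox))      # identical texts: the smaller ratio
print(compression_ratio(fox, revenue))  # unrelated texts: the larger ratio
```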


Jaccard Similarity

Jaccard formula: J(A, B) = |A ∩ B| / |A ∪ B|

All of the write-ups I have seen for Jaccard get it wrong in the implementation. They all use set() data structures. At a quick glance this makes sense because the method uses set arithmetic (e.g. union, intersection). However, sets don't allow duplicate elements, so this is unsatisfactory for text analysis. For example, "dog cat cat cat" and "dog dog dog cat" describe two very different types of pet owners, but using sets would reduce both to {"dog", "cat"} and score them as 100% similar.

This implementation of Jaccard uses set arithmetic on lists, so duplicate words still count.
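
To make the difference concrete, here is a minimal sketch that compares a plain set-based Jaccard with one that keeps duplicates. It uses collections.Counter as a multiset, which is one way to get "set arithmetic on lists"; it is not necessarily how Simphile does it internally, and the function names set_jaccard and multiset_jaccard are hypothetical.

```python
from collections import Counter


def set_jaccard(text_a: str, text_b: str) -> float:
    """Classic Jaccard on sets of words: duplicates are thrown away."""
    words_a, words_b = set(text_a.split()), set(text_b.split())
    union = words_a | words_b
    # Assumption: two empty texts are treated as identical.
    return len(words_a & words_b) / len(union) if union else 1.0


def multiset_jaccard(text_a: str, text_b: str) -> float:
    """Jaccard on word counts: every repeated word contributes."""
    counts_a, counts_b = Counter(text_a.split()), Counter(text_b.split())
    intersection = sum((counts_a & counts_b).values())  # per-word minimum
    union = sum((counts_a | counts_b).values())         # per-word maximum
    # Assumption: two empty texts are treated as identical.
    return intersection / union if union else 1.0


print(set_jaccard("dog cat cat cat", "dog dog dog cat"))       # 1.0
print(multiset_jaccard("dog cat cat cat", "dog dog dog cat"))  # 0.333...
```

With counts, the two pet owners share one "dog" and one "cat" out of a combined three of each, giving 2/6 rather than a perfect match.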


Euclidian Similarity
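
The idea of treating texts like points in multi-dimensional space can be sketched as below. Everything here is an assumption made for illustration: each distinct word is one dimension, its count is the coordinate, and the distance is mapped to a score with 1 / (1 + distance). The function name euclidean_similarity_sketch is hypothetical, and Simphile's own vectorisation and scaling may differ.

```python
import math
from collections import Counter


def euclidean_similarity_sketch(text_a: str, text_b: str) -> float:
    """Score in (0, 1]; 1.0 means the word-count vectors are identical."""
    counts_a, counts_b = Counter(text_a.split()), Counter(text_b.split())
    vocabulary = set(counts_a) | set(counts_b)
    # Straight-line distance between the two word-count vectors.
    distance = math.sqrt(
        sum((counts_a[word] - counts_b[word]) ** 2 for word in vocabulary)
    )
    # Assumed mapping from distance to similarity: closer points score higher.
    return 1.0 / (1.0 + distance)


print(euclidean_similarity_sketch("I love dogs", "I love dogs"))  # 1.0
print(euclidean_similarity_sketch("I love dogs", "I love cats"))
```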

