NLP text similarity calculation

These details have not been verified by PyPI

Project links

Homepage

Intended Audience
- Developers
License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language

Project description

Simphile

Python Text Similarity NLP Library

Intro

Sim•phile = "the love of similarities"

The aim is to provide easy access to text similarity methods that are language-agnostic and (ideally) much faster in execution time than methods that employ text embeddings.

Compression Similarity – leverages the pattern recognition of compression algorithms
Euclidian Similarity – treats texts as points in multi-dimensional space and calculates how close they are
Jaccard Similarity – texts are more similar the more their words overlap

Use Cases:

When speed is required
- as a fast pre-filter that reduces the result set before it is fed to more CPU-intensive methods (e.g. embeddings)
When the language is unknown
Non-language comparisons (e.g. URL clustering)
Language detection (e.g. compare a text to Spanish, English, French, etc. lexicons and return the match with the highest score)
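
The language detection use case can be sketched as follows. This is only an illustration: the tiny `LEXICONS` table and the helpers `word_overlap` and `detect_language` are hypothetical names invented for this example and are not part of Simphile. Any similarity function could stand in for the simple overlap score used here.

```python
# Hypothetical, deliberately tiny lexicons; a real one would hold many more words.
LEXICONS = {
    "english": {"the", "and", "is", "of", "dog", "house"},
    "spanish": {"el", "la", "y", "es", "de", "perro", "casa"},
    "french": {"le", "la", "et", "est", "de", "chien", "maison"},
}


def word_overlap(text, lexicon):
    """Fraction of the words in text that appear in the lexicon."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(1 for word in words if word in lexicon) / len(words)


def detect_language(text, lexicons=LEXICONS):
    """Return the name of the lexicon with the highest overlap score."""
    scores = {name: word_overlap(text, lexicon) for name, lexicon in lexicons.items()}
    return max(scores, key=scores.get)


print(detect_language("el perro es de la casa"))   # spanish
print(detect_language("the dog is in the house"))  # english
```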

Usage:

pip install simphile

Documentation

Simphile text similarity documentation

E-Z ways to help

Give this repo a ⭐️
Vote up this answer on Stack Overflow!

Brief Explanations

Compression Similarity

Compression algorithms find patterns in files in order to shrink them. This method uses that pattern detection to measure similarity. If a compressor can reuse the patterns it found in text_a to compress text_b reasonably well, then the two texts contain similar patterns. The crux of the similarity score is computed akin to this pseudocode example:

length(compress(concatenate(text_a, text_b))) / (length(compress(text_a)) + length(compress(text_b)))
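
A minimal Python sketch of that pseudocode, assuming zlib as the compressor (the pseudocode does not name one, and Simphile may use a different compressor or post-process the ratio). The function names `compressed_size` and `compression_ratio` are made up for this example. Note that this raw ratio gets smaller as the texts become more alike.

```python
import zlib


def compressed_size(text):
    """Number of bytes in the zlib-compressed UTF-8 encoding of text."""
    return len(zlib.compress(text.encode("utf-8")))


def compression_ratio(text_a, text_b):
    """Compressed size of the concatenation over the sum of the separate compressed sizes."""
    combined = compressed_size(text_a + text_b)
    separate = compressed_size(text_a) + compressed_size(text_b)
    return combined / separate


fox = "the quick brown fox jumps over the lazy dog " * 20
lorem = "lorem ipsum dolor sit amet consectetur adipiscing elit " * 20

# Shared patterns let the compressor reuse work, so the first ratio is the smaller one.
print(compression_ratio(fox, fox))
print(compression_ratio(fox, lorem))
```

Short inputs are dominated by the compressor's fixed overhead, so a ratio like this is more informative on longer texts.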

Jaccard Similarity

Jaccard formula: J(A, B) = |A ∩ B| / |A ∪ B|

All of the write-ups I have seen for Jaccard get the implementation wrong. They all use set() data structures. At a quick glance this makes sense because the method uses set arithmetic (e.g. union, intersection). However, sets don't allow duplicate elements, so this is unsatisfactory for text analysis. For example, "dog cat cat cat" and "dog dog dog cat" describe two very different types of pet owners, but sets would reduce both to {"dog", "cat"} and score them as 100% similar.

This implementation of Jaccard uses set arithmetic on lists, so duplicate words are preserved.
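
To illustrate the difference, here is a sketch that compares a plain set() version with a duplicate-preserving version. It assumes one common multiset definition (intersection takes the minimum count of each word, union takes the maximum) and uses `collections.Counter`. This is not necessarily how Simphile implements it, and the function names are invented for this example.

```python
from collections import Counter


def set_jaccard(text_a, text_b):
    """Jaccard on plain sets: duplicate words are discarded."""
    words_a, words_b = set(text_a.split()), set(text_b.split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 1.0


def multiset_jaccard(text_a, text_b):
    """Jaccard on word counts: duplicate words are kept."""
    counts_a, counts_b = Counter(text_a.split()), Counter(text_b.split())
    intersection = sum((counts_a & counts_b).values())  # minimum count per word
    union = sum((counts_a | counts_b).values())         # maximum count per word
    return intersection / union if union else 1.0


print(set_jaccard("dog cat cat cat", "dog dog dog cat"))       # 1.0
print(multiset_jaccard("dog cat cat cat", "dog dog dog cat"))  # 0.333... (2 / 6)
```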

Euclidian Similarity
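
As described in the intro, this method treats texts as points in multi-dimensional space and measures how close they are. The sketch below is one possible reading of that idea, not Simphile's actual code: it assumes each distinct word is a dimension, each text's coordinate is that word's count, and the distance d is mapped to a similarity with 1 / (1 + d). The function name `euclidean_similarity_sketch` is hypothetical.

```python
import math
from collections import Counter


def euclidean_similarity_sketch(text_a, text_b):
    """Map the Euclidean distance between word-count vectors into the range (0, 1]."""
    counts_a, counts_b = Counter(text_a.split()), Counter(text_b.split())
    vocabulary = set(counts_a) | set(counts_b)
    # A Counter returns 0 for words it has not seen, so missing dimensions are handled.
    distance = math.sqrt(sum((counts_a[w] - counts_b[w]) ** 2 for w in vocabulary))
    return 1.0 / (1.0 + distance)


print(euclidean_similarity_sketch("dog cat", "dog cat"))                  # 1.0
print(euclidean_similarity_sketch("dog cat cat cat", "dog dog dog cat"))  # below 1.0
```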

Release history

1.0.2

Sep 30, 2022

1.0.1

Sep 27, 2022

1.0.0

Sep 27, 2022

0.1.12

Sep 27, 2022

0.1.11

Sep 26, 2022

0.1.10

Sep 26, 2022

0.1.9

Sep 26, 2022

0.1.8

Sep 26, 2022

0.1.7

Sep 26, 2022

0.1.5

Sep 26, 2022

This version

0.1.4

Sep 26, 2022

0.1.3

Sep 26, 2022

0.1.2

Sep 25, 2022

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

simphile-0.1.4.tar.gz (5.8 kB)

Uploaded Sep 26, 2022 Source

Built Distribution

simphile-0.1.4-py3-none-any.whl (6.2 kB)

Uploaded Sep 26, 2022 Python 3

Hashes for simphile-0.1.4.tar.gz
Algorithm	Hash digest
SHA256	`9111d1561906797db9ad4e09d813b6e5fc268b9f852e5c1c825ea8a761ca10e5`
MD5	`e5ee3061bce2895161b60f5b5f4bd37b`
BLAKE2b-256	`1c1e83d3bb77094cbe759aa4a5c324fcd17b88e7eec8ab6e02d28869731a36da`

Hashes for simphile-0.1.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`bed2a1174cb6100c42af633624778bc0938579814921a6d8ca258ccac922c59f`
MD5	`76e5bd0ecd1685e8defbbf43a6e3bd98`
BLAKE2b-256	`9718f379f1ed3872bdc52d84b8e5c3faecf18ab8386713fce51c3a25c6381b07`