package for measuring the similarity of two texts

These details have not been verified by PyPI

Project description

similarity_check

similarity_check is a Python package for measuring the similarity of two texts.

Installation

Use the package manager pip to install similarity_check.

pip install similarity_check

Usage

sentence transformer

documentation

sentence_tranformer( targets: Union[List[str], pd.DataFrame], target_group: Optional[Union[List[str], pd.DataFrame]]=None, target_col: Optional[str]=None, device: Optional[str] = None, model: Optional[str]=None, lang: Optional[str]='en', only_include: Optional[List[str]]=None, encode_batch: Optional[int] = 32, encode_target: Optional[bool] = True, remove_punct: Optional[bool]=True, remove_stop_words: Optional[bool]=True, stemm: Optional[bool]=False ):
- parameters:
  - targets: a dataframe or list of target texts to compare against.
  - target_group (optional): group ids for the targets, used to match only a single target per group; provide either a list of ids or the name of a column in the target dataframe.
  - target_col (partially optional): the target column name to match on; required when targets is a dataframe.
  - device: the device to run the encoding on ('cpu'|'cuda').
  - model (optional): the name of a sentence transformer model to use instead of the default one.
  - lang (optional): the language of the model ('en'|'ar').
  - only_include (optional): used only for dataframe matching; a list of target column names to include in the matches. Provide an empty list to get only target_col.
  - encode_batch (optional): the number of sentences to encode in a batch.
  - encode_target: boolean flag indicating whether to encode the targets when initializing the object (to cache the target encodings).
  - remove_punct: boolean flag indicating whether to remove punctuation.
  - remove_stop_words: boolean flag indicating whether to remove stop words.
  - stemm: boolean flag indicating whether to apply stemming.
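
One way to read the target_group option is that, among the candidate targets for a source text, only the best-scoring target of each group is kept. The sketch below illustrates that reading on plain Python data. The helper name best_per_group and the sample scores are hypothetical and not part of similarity_check, and the package may implement grouping differently.

```python
from typing import Dict, Hashable, List, Tuple


def best_per_group(
    candidates: List[Tuple[str, float]],
    groups: List[Hashable],
) -> List[Tuple[str, float]]:
    """Keep the highest-scoring candidate of each group, best first.

    Hypothetical helper, not part of similarity_check.
    candidates[i] is (target_text, score) and groups[i] is its group id.
    """
    best: Dict[Hashable, Tuple[str, float]] = {}
    for (text, score), group in zip(candidates, groups):
        # Replace the stored candidate only when this one scores higher.
        if group not in best or score > best[group][1]:
            best[group] = (text, score)
    return sorted(best.values(), key=lambda pair: pair[1], reverse=True)


# Made-up scores, for illustration only.
candidates = [("tests", 0.92), ("stop the test", 0.72), ("testing", 0.91)]
groups = ["pos", "neg", "pos"]
kept = best_per_group(candidates, groups)
print(kept)
```

Here 'tests' and 'testing' share the group 'pos', so only the higher-scoring of the two survives alongside the single 'neg' target.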
sentence_tranformer.match( source: Union[List[str], pd.DataFrame], source_col: Optional[str]=None, topn: Optional[int]=1, return_match_idx: Optional[bool]=False, threshold: Optional[float]=0.5, batch_size: Optional[int]=128 ) -> pd.DataFrame:
- parameters:
  - source: a dataframe or list of input texts to find the closest matches for.
  - source_col (partially optional): the source column name to match on; required when source is a dataframe.
  - topn: the number of matches to return.
  - threshold: the minimum score; matches scoring below it are ignored.
  - return_match_idx: return an extra column for each match containing the index of the match within the targets.
  - batch_size: the number of inputs matched against the targets per batch (to limit memory usage).
- returns:
  - a dataframe with 3 columns (source, target, score), plus two extra columns for each additional match (target_2, score_2, ...), and an optional extra column per match containing the match index if return_match_idx is set to True.
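
To make the roles of topn and threshold concrete, here is a minimal sketch of top-n matching over embedding vectors. It assumes cosine similarity as the score, which this page does not confirm, and uses invented 2-d vectors in place of real sentence embeddings. The names cosine and top_matches are hypothetical and not part of similarity_check.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_matches(
    source_vec: Sequence[float],
    target_vecs: Dict[str, Sequence[float]],
    topn: int = 1,
    threshold: float = 0.5,
) -> List[Tuple[str, float]]:
    """Return up to topn (target, score) pairs whose score is at least threshold."""
    scored = [(name, cosine(source_vec, vec)) for name, vec in target_vecs.items()]
    # Drop weak matches first, then rank the rest from best to worst.
    scored = [pair for pair in scored if pair[1] >= threshold]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Toy 2-d "embeddings", invented for illustration.
targets = {"tests": [1.0, 0.1], "stop the test": [0.5, 0.9], "testing": [0.9, 0.3]}
matches = top_matches([1.0, 0.0], targets, topn=4, threshold=0.6)
print([name for name, _ in matches])
```

Although topn is 4, only two targets are returned because the third scores below the threshold, which mirrors the empty trailing columns in the example output further down.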

examples

The examples below use English only, so that the output displays in the correct format. To use Arabic matching, set the lang parameter of the sentence_tranformer object to 'ar'.

using lists

from similarity_check.checkers import sentence_tranformer_checker

X = ['test', 'remove test']
y =  ['tests', 'stop the test', 'testing']

### arabic example:
# X = ['حذف الاختبار', 'اختبار']
# y =  ['اختبارات', 'ايقاف الاختبار']
# st = sentence_tranformer_checker(y, lang='ar')

# y holds the targets; X holds the source texts to match against them
st = sentence_tranformer_checker(y)
match_df = st.match(X, topn=4, return_match_idx=True, threshold=0.6)

output:

source	score	prediction	match_idx	score_2	prediction_2	match_idx_2	score_3	prediction_3	match_idx_3	score_4	prediction_4	match_idx_4
test	0.922843	tests	0	0.908599	testing	2	0.721023	stop the test	1
remove test	0.728872	stop the test	1	nan		nan	nan		nan

using dataframes

import pandas as pd
from similarity_check.checkers import sentence_tranformer_checker

X = pd.DataFrame({
    'text': ['Cholera, a unspecified', 'remove test'],
    'id': [1, 2],
}
)

y = pd.DataFrame({
    'new_text': ['Cholera', 'stop the test', 'testing'],
    'new_id': [1, 2, 3],
    'tags': ['pos', 'neg', 'pos'],
    'num': [10, 22, 40],
    'day': [3, 5, 2],
}
)

st = sentence_tranformer_checker(y, target_col='new_text',target_group='tags', only_include=['new_id'])
match_df = st.match(X, source_col='text', topn=4, threshold=0.6, batch_size=1)

output:

text	id	score_1	new_text_1	new_id_1	score_2	new_text_2	new_id_2	score_3	new_text_3	new_id_3	score_4	new_text_4	new_id_4
test	1	0.922843	tests	1	0.908599	testing	3	0.721023	stop the test	2
remove test	2	0.728872	stop the test	2

word mover distance (deprecated)

english

# for medical use #
# from gensim.models import KeyedVectors
# download the model from here: https://github.com/ncbi-nlp/BioSentVec
# model = KeyedVectors.load_word2vec_format('BioWordVec_PubMed_MIMICIII_d200.vec.bin', binary=True)

# for general usage #
import gensim.downloader as api
from similarity_check.checkers import word_mover_distance

model = api.load('glove-wiki-gigaword-300')

X = ['test now', 'remove test']
y =  ['tests', 'stop the test']

wmd = word_mover_distance(X, y, model)
wmd.clean_data()
match_df = wmd.match(topn=3)

arabic

from gensim.models import Word2Vec
from similarity_check.checkers import word_mover_distance

# download the embedding from here: https://github.com/bakrianoo/aravec (N-Grams Models, Wikipedia-SkipGram, Vec-Size:300)
model = Word2Vec.load('full_grams_sg_300_wiki/full_grams_sg_300_wiki.mdl')
# use the KeyedVectors as the model
model = model.wv

X = ['حذف الاختبار', 'اختبار']
y =  ['اختبارات', 'ايقاف الاختبار']

wmd = word_mover_distance(X, y, model)
wmd.clean_data()
match_df = wmd.match(topn=3)
match_df

word_mover_distance(source_names, target_names, model):
- parameters:
  - source_names: a list of input texts to find closest match for.
  - target_names: a list of target texts to compare against.
  - model (optional): a KeyedVectors model (word embeddings) to use.
word_mover_distance.clean_data(remove_punct=True, remove_stop_words=True, stemm=False, lang='en'):
- parameters:
  - remove_punct: boolean flag indicating whether to remove punctuation.
  - remove_stop_words: boolean flag indicating whether to remove stop words.
  - stemm: boolean flag indicating whether to apply stemming.
  - lang: language of the text to clean ('en'|'ar').
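
The cleaning flags above can be pictured with a small standalone sketch. It is not the package's implementation: the function name clean_text, the lowercasing step, and the tiny stop-word list are assumptions made for illustration, stemming is left out, and only ASCII punctuation is handled (so Arabic punctuation would need extra care).

```python
import string
from typing import List

# Tiny illustrative stop-word list; a real list would be much longer.
STOP_WORDS = {"a", "an", "the", "of", "to", "is"}


def clean_text(text: str, remove_punct: bool = True, remove_stop_words: bool = True) -> str:
    """Lowercase text, then optionally strip ASCII punctuation and stop words."""
    text = text.lower()
    if remove_punct:
        # Delete every ASCII punctuation character.
        text = text.translate(str.maketrans("", "", string.punctuation))
    words: List[str] = text.split()
    if remove_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    return " ".join(words)


print(clean_text("Stop the test!"))
print(clean_text("Stop the test!", remove_stop_words=False))
```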
word_mover_distance.match(topn=1, return_match_idx=False):
- parameters:
  - topn: the number of matches to return.
  - return_match_idx: return an extra column for each match containing the index of the match within the target_names.
- returns:
  - a dataframe with 3 columns (source, target, score), plus two extra columns for each additional match (target_2, score_2, ...), and an optional extra column per match containing the match index if return_match_idx is set to True.
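
For intuition about what a word mover distance measures, the sketch below computes the relaxed variant, in which every word of one text moves entirely to its nearest word in the other text and the larger of the two directions is taken. This is a lower bound on the full distance, not the optimal-transport computation itself, and it is not what gensim or similarity_check runs. The 2-d vectors and the function names are invented for illustration, and every word is assumed to be present in the vector table.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def one_sided_cost(src: List[str], dst: List[str], vectors: Dict[str, Sequence[float]]) -> float:
    """Cost of moving each source word entirely to its nearest destination word."""
    counts = Counter(src)
    total = sum(counts.values())
    return sum(
        (count / total) * min(euclidean(vectors[word], vectors[other]) for other in set(dst))
        for word, count in counts.items()
    )


def relaxed_wmd(a: List[str], b: List[str], vectors: Dict[str, Sequence[float]]) -> float:
    """Relaxed word mover distance: the larger of the two one-sided costs."""
    return max(one_sided_cost(a, b, vectors), one_sided_cost(b, a, vectors))


# Toy embeddings, invented for illustration.
vectors = {"remove": [0.0, 1.0], "stop": [0.0, 2.0], "test": [3.0, 0.0]}
distance = relaxed_wmd(["remove", "test"], ["stop", "test"], vectors)
print(distance)
```

A smaller value means the two texts are closer, so identical word lists give a distance of 0.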

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.

License

MIT

Release history

Version	Release date
0.2.12	Nov 27, 2023
0.2.11	Nov 27, 2023
0.2.9	Nov 27, 2023
0.2.8	Nov 23, 2023
0.2.7	Nov 16, 2023
0.2.6	Nov 14, 2023
0.2.5	Nov 1, 2023
0.2.4	Oct 25, 2023
0.2.3	Oct 23, 2023
0.2.2	Jul 20, 2023
0.2.1	Jul 4, 2023
0.2.0	Jul 4, 2023
0.1.18	Jul 3, 2023
0.1.17	Jul 2, 2023
0.1.16	Feb 5, 2023 (this version)
0.1.15	Dec 16, 2022
0.1.14	Dec 16, 2022
0.1.13	Dec 16, 2022
0.1.12	Dec 16, 2022
0.1.11	Dec 16, 2022
0.1.10	Dec 16, 2022
0.1.9	Dec 16, 2022
0.1.8	Dec 16, 2022
0.1.7	Dec 15, 2022
0.1.6	Dec 15, 2022
0.1.5	Dec 15, 2022
0.1.4	Nov 24, 2022
0.1.3	Nov 24, 2022
0.1.2	Nov 23, 2022
0.1.1	Nov 23, 2022
0.0.17	Oct 26, 2022
0.0.16	Sep 29, 2022
0.0.15	Sep 29, 2022
0.0.14	Sep 29, 2022
0.0.13	Sep 4, 2022
0.0.12	Aug 31, 2022
0.0.11	Aug 31, 2022
0.0.10	Aug 29, 2022
0.0.9	Aug 29, 2022
0.0.8	Aug 23, 2022
0.0.7	Aug 23, 2022

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

similarity_check-0.1.16.tar.gz (7.8 kB)

Uploaded Feb 5, 2023 Source

Built Distribution

similarity_check-0.1.16-py3-none-any.whl (8.8 kB)

Uploaded Feb 5, 2023 Python 3

Hashes for similarity_check-0.1.16.tar.gz

Algorithm	Hash digest
SHA256	`bde432d6dcbd2ce165d48ad3666ef524434a7645dd1334c5861f487eb51e4ad3`
MD5	`3df8c6cd9b58cb2b3a4d63ccdf43cab0`
BLAKE2b-256	`29595db37f257b160e803129c37680fc8584df7b459fa01c338a948177482b5a`

Hashes for similarity_check-0.1.16-py3-none-any.whl

Algorithm	Hash digest
SHA256	`becb238e624af34078461ce6cfaf409c664a6ee140a71a006f9c458f14988850`
MD5	`c1d44c0b6904dcaee38b55251c77026d`
BLAKE2b-256	`7bdd2bf544515c72c8fb53c5a88689ef80fc82782ab0d6a46908de8d3adc72f4`