Corpus level similarity measures.

comparing-corpora

A Python library of similarity measures for quantifying the perceptual similarity between corpora, each represented as a set of embeddings.

Calculating distances between two text corpora

import compcor.corpus_metrics as corpus_metrics
from compcor.text_tokenizer_embedder import TextTokenizerEmbedder

setA = ['can you tell me how i would normally say thank you as a french person',
        'can you translate hi into spanish for me',
        'can you translate milk into spanish for me',
        'how can i say thank you very much in chinese',
        'how can i thank somebody in italian',
        'how could i say twin in chinese',
        'how do germans say goodnight',
        'how do i ask about the weather in chinese',
        'how do i say hotel in finnish',
        'how do i say bathroom in italian']

setB = ['how can i say thank you very much in chinese',
        'how can i thank somebody in italian',
        'how could i say twin in chinese',
        'how do they say tacos in mexico',
        'how do they say yes in brazil',
        'how do vietnameses people say hello',
        'how do you say cat in spanish',
        'how do you say dog in spanish',
        'how do you say fast in spanish',
        'how do you say good bye in french',
        'how do you say goodbye in spanish',
        'how do you say hello in french',
        'how do you say hello in japanese',
        'how do you say hello in mexico']

Given two corpora of strings, we want to calculate the distance between them. comparing-corpora provides the following distance metrics. The first eight metrics operate on sentence-level embeddings, while the last two operate on token frequencies.

| Name | Function | Level | Description |
|---|---|---|---|
| T-test | ttest_distance | sentence | T-test p-value on difference in elementwise means |
| Medoid | medoid_distance | sentence | Cosine distance between elementwise means |
| Classifier | classifier_distance | sentence | Classifiability between reference and target |
| PR (precision and recall) | PR_distance | sentence | Assessing distributional precision and recall |
| IRPR (information-retrieval precision and recall) | IRPR_distance | sentence | Average distance between closest sample pairs |
| DC (density and coverage) | dc_distance | sentence | Estimating manifold density and coverage |
| MAUVE | mauve_distance | sentence | Quality and diversity via divergence frontiers |
| FID (Fréchet Inception Distance) | fid_distance | sentence | Wasserstein distance between densities |
| Chi-squared ($\chi^2$) | chi_square_distance | token | Word/Token count comparison |
| Zipf | zipf_distance | token | Unigram rank-frequency statistics |
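
To make the "cosine distance between elementwise means" description more concrete, here is a minimal pure-Python sketch of that idea. It is an illustration only, not the code behind medoid_distance; the helper names (`elementwise_mean`, `cosine_distance`, `mean_cosine_distance`) and the toy 2-dimensional vectors are made up for this example.

```python
import math


def elementwise_mean(vectors):
    """Average a list of equal-length vectors, dimension by dimension."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_cosine_distance(embeddings_a, embeddings_b):
    """Cosine distance between the elementwise means of two embedded corpora."""
    return cosine_distance(elementwise_mean(embeddings_a),
                           elementwise_mean(embeddings_b))


# Toy "embeddings" (hypothetical values, chosen so the result is easy to check)
toy_a = [[1.0, 0.0], [1.0, 0.0]]
toy_b = [[0.0, 1.0], [0.0, 3.0]]

print(mean_cosine_distance(toy_a, toy_a))  # same corpus: 0.0
print(mean_cosine_distance(toy_a, toy_b))  # orthogonal means: 1.0
```

Real sentence embeddings have hundreds of dimensions, but the computation is the same: collapse each corpus to a single mean vector, then compare the directions of the two means.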

To calculate the inter-corpora distance, the raw corpora can be passed directly, in which case they are embedded inside the function. For distances that operate on the sentence level, the embedding is determined by the 'model' argument (by default, 'all-MiniLM-L6-v2'); for distances that operate on the token level, the tokenization is determined by the 'tokenizer' argument.

distance = corpus_metrics.fid_distance(corpus1=setA, corpus2=setB)
print("fid_distance={}".format(distance))
#fid_distance=0.7498146370755459

The metrics also accept already-embedded corpora (possibly using a custom embedding):

embedder = TextTokenizerEmbedder(embedding_model_name = "all-MiniLM-L12-v2")

# a different embedding model is used here, so the results differ from the default
embeddingA = embedder.embed_sentences(setA)
embeddingB = embedder.embed_sentences(setB)

distance = corpus_metrics.fid_distance(corpus1=embeddingA, corpus2=embeddingB)
print("fid_distance={}".format(distance))
#fid_distance=0.7671713341732544

tokensA = embedder.tokenize_sentences(setA)
tokensB = embedder.tokenize_sentences(setB)

distance = corpus_metrics.chi_square_distance(corpus1=tokensA, corpus2=tokensB)
print("chi_square_distance={}".format(distance))
#chi_square_distance=0.9987177546738071
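
As a rough illustration of what a "word/token count comparison" can look like, the sketch below computes a Pearson chi-square statistic over the unigram counts of two corpora. It is an assumption-laden example, not the computation performed by chi_square_distance (whose value above may be normalized or derived differently); the whitespace tokenizer and the helper names `token_counts` and `chi_square_statistic` are hypothetical.

```python
from collections import Counter


def token_counts(sentences):
    """Count whitespace-separated tokens (a stand-in for a real tokenizer)."""
    return Counter(token for sentence in sentences for token in sentence.split())


def chi_square_statistic(counts_a, counts_b):
    """Pearson chi-square statistic of the 2 x V table of token counts.

    Assumes both corpora contain at least one token.
    """
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    total = total_a + total_b
    statistic = 0.0
    for token in set(counts_a) | set(counts_b):
        observed_a = counts_a[token]  # Counter returns 0 for unseen tokens
        observed_b = counts_b[token]
        token_total = observed_a + observed_b
        # Expected counts if both corpora shared one token distribution
        expected_a = token_total * total_a / total
        expected_b = token_total * total_b / total
        statistic += (observed_a - expected_a) ** 2 / expected_a
        statistic += (observed_b - expected_b) ** 2 / expected_b
    return statistic


same = chi_square_statistic(token_counts(["a b"]), token_counts(["a b"]))
disjoint = chi_square_statistic(token_counts(["a a"]), token_counts(["b b"]))
print(same)      # identical token counts: 0.0
print(disjoint)  # no shared tokens: 4.0
```

The statistic is zero when both corpora use every token in the same proportion and grows as their vocabularies diverge.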
