Measuring corpus similarity in Python

corpus_similarity

Measure the similarity between two corpora (text datasets). The measures work best when each corpus is at least 10k words.

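from corpus_similarity import Similarity

# Create a similarity measure for English, using its three-letter code
cs = Similarity(language="eng")

# corpus1 and corpus2 are each a list of strings or a filename
result = cs.calculate(corpus1, corpus2)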

The package handles all preprocessing and training internally; only the language needs to be specified. Supported languages are listed below.
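Because the measures work best with at least 10k words per corpus, it can help to check corpus size before comparing. The sketch below is not part of the package: `count_words` and `is_large_enough` are hypothetical helpers, and whitespace splitting is assumed as a rough proxy for word count.

```python
def count_words(corpus):
    """Count whitespace-separated tokens in a list of strings.

    Assumption: whitespace splitting is only a rough proxy, and it is not
    meaningful for languages written without spaces (e.g. zho, jpn, tha).
    """
    return sum(len(text.split()) for text in corpus)


def is_large_enough(corpus, minimum=10_000):
    """Return True if the corpus meets the suggested minimum size."""
    return count_words(corpus) >= minimum


sample = ["the quick brown fox", "jumps over the lazy dog"]
print(count_words(sample))      # 9
print(is_large_enough(sample))  # False
```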

Input

The Similarity.calculate method requires two input corpora. Each corpus can be either a list of strings or a filename (.txt and .gz files are supported).
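The two input forms can be prepared with the standard library alone, as in this minimal sketch. `write_corpus` and `compare` are hypothetical helpers, and the one-text-per-line file layout is an assumption, since the description does not say how a file is split into texts.

```python
import gzip
import tempfile
from pathlib import Path


def write_corpus(texts, path):
    """Write one text per line to a .txt or .gz file (hypothetical helper).

    Assumption: one text per line is a convenient layout; the package
    description does not specify how a file is split into texts.
    """
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "wt", encoding="utf-8") as handle:
        for text in texts:
            # Collapse internal whitespace so each text stays on one line
            handle.write(" ".join(text.split()) + "\n")
    return str(path)


def compare(corpus_a, corpus_b, language="eng"):
    """Call the package as in the usage example; imported lazily."""
    from corpus_similarity import Similarity
    return Similarity(language=language).calculate(corpus_a, corpus_b)


texts = ["First text in the corpus.", "Second text,\nwith a line break."]
folder = Path(tempfile.mkdtemp())
txt_file = write_corpus(texts, folder / "corpus.txt")
gz_file = write_corpus(texts, folder / "corpus.gz")

# Two lists of strings, or two filenames, could then be passed to compare()
```

In practice the two arguments would be different corpora; the same texts are written twice here only to show both file types.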

Output

The output is a scalar measure of how similar the two corpora are, ranging from 0 (very different) to 1 (very similar). Values are consistent within a language but not across languages. For example, Swedish has higher relative similarity than Estonian, so a score from one language should not be compared directly with a score from the other.
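Since scores are only comparable within one language, a natural use is ranking several same-language corpora against a reference. In the sketch below, `rank_candidates` is a hypothetical helper that accepts any scoring callable, such as the `calculate` method of a `Similarity` object. `toy_score` is a stand-in based on word overlap so the example runs without the package; it is not the package's measure.

```python
def rank_candidates(reference, candidates, score_fn):
    """Rank named corpora by similarity to a reference corpus.

    score_fn is any callable that takes two corpora and returns a float,
    for example the calculate method of a Similarity object.
    """
    scores = {name: score_fn(reference, corpus)
              for name, corpus in candidates.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def toy_score(corpus_a, corpus_b):
    """Toy stand-in (Jaccard overlap of word types), NOT the package's measure."""
    words_a = {w for text in corpus_a for w in text.lower().split()}
    words_b = {w for text in corpus_b for w in text.lower().split()}
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


reference = ["the cat sat on the mat"]
candidates = {
    "near": ["the cat sat on a mat"],
    "far": ["stock prices fell sharply today"],
}
# Most similar corpus first
print(rank_candidates(reference, candidates, toy_score))
```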

Installation

Install the released package from PyPI:

pip install corpus_similarity

Or install directly from the GitHub repository:

pip install git+https://github.com/jonathandunn/corpus_similarity.git
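After installing, one way to confirm that the package is visible to the current Python environment is to query its metadata. This is a minimal sketch that uses only the standard library (Python 3.8 or later); `installed_version` is a hypothetical helper.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution):
    """Return the installed version string, or None if it is not installed."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Prints a version string if the package is installed, otherwise None
print(installed_version("corpus_similarity"))
```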

Languages

Pacific Languages

haw, Hawaiian (Polynesian)

mri, te reo Māori (Polynesian)

smo, Samoan (Polynesian)

ton, Tongan (Polynesian)

ceb, Cebuano (Austronesian)

mlg, Malagasy (Austronesian)

msa, Malay (Austronesian)

tgl, Tagalog (Austronesian)

Other Languages

vie, Vietnamese

ind, Indonesian

tam, Tamil

tel, Telugu

bul, Bulgarian

ces, Czech

lav, Latvian

pol, Polish

rus, Russian

slv, Slovenian

ukr, Ukrainian

dan, Danish

deu, German

eng, English

nld, Dutch

nor, Norwegian

swe, Swedish

ell, Greek

fas, Farsi

hin, Hindi

urd, Urdu

cat, Catalan

fra, French

glg, Galician

ita, Italian

por, Portuguese

ron, Romanian

spa, Spanish

jpn, Japanese

kor, Korean

ara, Arabic

heb, Hebrew

zho, Chinese

tha, Thai

tur, Turkish

est, Estonian

fin, Finnish

hun, Hungarian
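The codes listed above can be collected into a set to validate a language code before constructing a Similarity object. This is a sketch only: `SUPPORTED_LANGUAGES` is transcribed from the list in this description, `check_language` is a hypothetical helper, and how the package itself reacts to an unsupported code is not stated here.

```python
# Codes transcribed from the language list in this description
SUPPORTED_LANGUAGES = frozenset("""
    haw mri smo ton ceb mlg msa tgl
    vie ind tam tel
    bul ces lav pol rus slv ukr
    dan deu eng nld nor swe
    ell fas hin urd
    cat fra glg ita por ron spa
    jpn kor ara heb zho tha tur
    est fin hun
""".split())


def check_language(code):
    """Return the normalized code, or raise ValueError if it is not listed."""
    normalized = code.strip().lower()
    if normalized not in SUPPORTED_LANGUAGES:
        raise ValueError(f"Unsupported language code: {code!r}")
    return normalized


print(check_language("eng"))      # eng
print(len(SUPPORTED_LANGUAGES))   # 46
```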
