Measuring corpus similarity in Python
corpus_similarity
Measure the similarity between two corpora (text datasets). The measures work best when each corpus is at least 10k words.
from corpus_similarity import Similarity

# corpus1 and corpus2 are each a list of strings or a filename (.txt or .gz)
cs = Similarity(language="eng")
result = cs.calculate(corpus1, corpus2)
The package handles all preprocessing and training internally, so only the language needs to be specified. Supported languages are listed below.
Input
The Similarity.calculate method requires two input corpora. Each can be either a list of strings or a filename (.txt and .gz files are supported).
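As a minimal sketch of preparing the list-of-strings form of input, the helpers below read a .txt or .gz file into a list of lines and count words, so a corpus can be checked against the 10k-word guideline before comparison. The names read_corpus, word_count, and compare_files are hypothetical and not part of corpus_similarity, and splitting on whitespace is only a rough word count that assumes a space-delimited language.

```python
import gzip
from pathlib import Path


def read_corpus(path):
    """Read a .txt or .gz file into a list of non-empty lines (hypothetical helper)."""
    path = Path(path)
    # gzip.open in text mode decompresses transparently; plain files use open()
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def word_count(lines):
    """Rough word count: whitespace-separated tokens across all lines."""
    return sum(len(line.split()) for line in lines)


def compare_files(path1, path2, language="eng", min_words=10_000):
    """Load two files, warn if either is small, then call Similarity.calculate."""
    # Imported here so the helpers above work without the package installed
    from corpus_similarity import Similarity

    corpus1, corpus2 = read_corpus(path1), read_corpus(path2)
    for name, corpus in ((path1, corpus1), (path2, corpus2)):
        if word_count(corpus) < min_words:
            print(f"Warning: {name} has fewer than {min_words} words")
    return Similarity(language=language).calculate(corpus1, corpus2)
```

Passing the filenames straight to Similarity.calculate should be equivalent for simple cases; loading the lines yourself is mainly useful when you want to inspect or filter a corpus first.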
Output
The output is a scalar measure of how similar the two corpora are, ranging from 0 (very different) to 1 (very similar). Values are consistent within a language but not across languages, so scores should only be compared between corpus pairs in the same language. For example, Swedish has higher relative similarity than Estonian.
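Because scores are only comparable within a language, one natural use is ranking several corpus pairs in the same language. The sketch below assumes nothing about the package beyond a two-argument scoring callable such as cs.calculate. Both rank_pairs and toy_score are hypothetical: toy_score (word-set overlap) is a stand-in so the example runs without the package, and it is not the measure corpus_similarity computes.

```python
from itertools import combinations


def rank_pairs(corpora, score):
    """Score every pair of named corpora and sort from most to least similar.

    corpora: dict mapping a name to a corpus (list of strings)
    score:   callable taking two corpora and returning a number,
             e.g. Similarity(language="eng").calculate
    """
    results = [
        (name_a, name_b, score(corpora[name_a], corpora[name_b]))
        for name_a, name_b in combinations(sorted(corpora), 2)
    ]
    return sorted(results, key=lambda row: row[2], reverse=True)


def toy_score(corpus_a, corpus_b):
    """Stand-in scorer (Jaccard overlap of word sets), NOT the package's measure."""
    words_a = {w for line in corpus_a for w in line.split()}
    words_b = {w for line in corpus_b for w in line.split()}
    return len(words_a & words_b) / len(words_a | words_b)


# Tiny illustrative corpora; real inputs should be far larger
corpora = {
    "news": ["the market rose today", "the bank cut rates"],
    "finance": ["the bank raised rates", "the market fell"],
    "recipes": ["whisk the eggs", "bake until golden"],
}
for name_a, name_b, value in rank_pairs(corpora, toy_score):
    print(f"{name_a} vs {name_b}: {value:.2f}")
```

With the package installed, pass Similarity(language="eng").calculate as the score argument, keeping every corpus in the same language.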
Installation
pip install corpus_similarity
Or, to install directly from GitHub:
pip install git+https://github.com/jonathandunn/corpus_similarity.git
Languages
Pacific Languages
haw, Hawaiian (Polynesian)
mri, te reo Māori (Polynesian)
smo, Samoan (Polynesian)
ton, Tongan (Polynesian)
ceb, Cebuano (Austronesian)
mlg, Malagasy (Austronesian)
msa, Malay (Austronesian)
tgl, Tagalog (Austronesian)
Other Languages
vie, Vietnamese
ind, Indonesian
tam, Tamil
tel, Telugu
bul, Bulgarian
ces, Czech
lav, Latvian
pol, Polish
rus, Russian
slv, Slovenian
ukr, Ukrainian
dan, Danish
deu, German
eng, English
nld, Dutch
nor, Norwegian
swe, Swedish
ell, Greek
fas, Farsi
hin, Hindi
urd, Urdu
cat, Catalan
fra, French
glg, Galician
ita, Italian
por, Portuguese
ron, Romanian
spa, Spanish
jpn, Japanese
kor, Korean
ara, Arabic
heb, Hebrew
zho, Chinese
tha, Thai
tur, Turkish
est, Estonian
fin, Finnish
hun, Hungarian
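A small sketch of checking a language code against the list above before constructing Similarity. The set is transcribed from this page; SUPPORTED_LANGUAGES and check_language are hypothetical names rather than anything exposed by corpus_similarity, and how the package itself responds to an unsupported code is not covered here.

```python
# Language codes transcribed from the list above (hypothetical constant,
# not an attribute exposed by corpus_similarity)
SUPPORTED_LANGUAGES = frozenset("""
haw mri smo ton ceb mlg msa tgl
vie ind tam tel
bul ces lav pol rus slv ukr
dan deu eng nld nor swe
ell fas hin urd
cat fra glg ita por ron spa
jpn kor ara heb zho tha tur
est fin hun
""".split())


def check_language(code):
    """Return the normalised code if it is in the list, else raise ValueError."""
    normalised = code.strip().lower()
    if normalised not in SUPPORTED_LANGUAGES:
        raise ValueError(f"Unsupported language code: {code!r}")
    return normalised


print(check_language("ENG "))
print(len(SUPPORTED_LANGUAGES))
```

The returned code can then be passed as the language argument, for example Similarity(language=check_language("swe")).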
File details
Details for the file corpus_similarity-1.0.1-py2.py3-none-any.whl.
File metadata
- Download URL: corpus_similarity-1.0.1-py2.py3-none-any.whl
- Upload date:
- Size: 2.8 MB
- Tags: Python 2, Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.1 importlib_metadata/3.10.0 pkginfo/1.7.1 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.2 CPython/3.8.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | 68948866e71b21482e6322e2e50bb66448be0b2debe9273064e1c3e490f6a5fb
MD5 | 8b1eec26780cb1c39928fad693fdf780
BLAKE2b-256 | 210aea64551fe2a10a215e24f300c82ce1efb93580a54dd4d0a41ddec119e1c6