Determine language similarity by comparing translation pairs.
langsim
Compare languages even with just a few translation pairs.
NOTE: This project is very much a work in progress. It is not recommended for use in a production environment. If you do use it, please open an issue and leave your impressions. I've only tested this on an AArch64 (Apple Silicon) MacBook Pro.
How to install from source
Clone the repository:
git clone https://github.com/ryderwishart/langsim.git
Run pip install -r requirements-dev.txt to install the dependencies.
Run pip install -e . to install the package in editable mode.
How to run
Run python examples/basic_usage.py to run the basic usage example.
Run python examples/using_debug_mode.py to run the basic usage example with debug mode.
Tests
Run pytest to test the code.
Note: tests are currently failing apparently due to a mismatch with the hydra-core version.
Example output
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
┃ Metric ┃ Score ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
│ Dist │ 1.000 │
│ LenRat │ 0.076 │
│ WSDiff │ 0.003 │
│ WSKS │ 0.003 │
│ PunctJS │ 0.000 │
│ EntDiff │ 0.013 │
│ LexSim │ 0.600 │
│ CogProp │ 0.769 │
│ MorphComp │ 0.474 │
│ CogDist │ 1.000 │
│ OverallSim │ 0.628 │
└────────────────────────────────┴───────┘
╭─────────────────────────────────────────────── Metric Legend ────────────────────────────────────────────────╮
│ Line: Line │
│ │
│ Dist: Distortion │
│ │
│ LenRat: Length ratio std │
│ │
│ WSDiff: Whitespace ratio diff │
│ │
│ WSKS: Whitespace KS statistic │
│ │
│ PunctJS: Punctuation JS divergence │
│ │
│ EntDiff: Entropy diff │
│ │
│ LexSim: Lexical similarity │
│ │
│ CogProp: Cognate proportion │
│ │
│ CogDist: Cognate-based distortion │
│ │
│ MorphComp: Morphological complexity │
│ │
│ OverallSim: Overall Similarity │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Pairwise Line Scores
┏━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┓
┃ Line ┃ Dist ┃ LenRat ┃ WSDiff ┃ WSKS ┃ Punct… ┃ EntDi… ┃ LexSim ┃ CogPr… ┃ CogDist ┃ Morph… ┃ Overal… ┃
┡━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━┩
│ 1 │ 1.000 │ 0.037 │ 0.006 │ 0.006 │ 0.000 │ 0.005 │ 0.667 │ 0.800 │ 1.000 │ 0.560 │ 0.656 │
│ 2 │ 1.000 │ 0.107 │ 0.009 │ 0.009 │ 0.000 │ 0.121 │ 0.500 │ 0.667 │ 1.000 │ 0.551 │ 0.589 │
│ 3 │ 1.000 │ 0.067 │ 0.008 │ 0.008 │ 0.000 │ 0.208 │ 0.667 │ 0.800 │ 1.000 │ 0.442 │ 0.626 │
└────────┴────────┴────────┴────────┴────────┴────────┴────────┴─────────┴────────┴─────────┴────────┴─────────┘
Metrics Explanation
Before listing the comparison metrics, a few caveats are necessary. The comparisons employed in this library are intended to serve as a starting point. Please open issues or PRs with any suggested revisions. I have attempted to be language-agnostic, so you could really compare any two sets of strings. Accordingly, I have avoided comparisons that rely on structured knowledge. Tokenization is completely naive, for example, and syntax is not considered.
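To make the "naive tokenization" caveat concrete, here is a minimal sketch of what whitespace-only tokenization does. This is an assumption about what "naive" means here, not a copy of the library's tokenizer, and the function name naive_tokenize is hypothetical.

```python
def naive_tokenize(text: str) -> list[str]:
    """Split on runs of whitespace only; no language-specific rules.

    Hypothetical helper for illustration; langsim's own tokenizer may differ.
    """
    return text.split()


# Punctuation stays attached to the neighbouring word.
print(naive_tokenize("Hello, world!"))  # ['Hello,', 'world!']

# Scripts written without spaces come back as a single token.
print(naive_tokenize("你好世界"))  # ['你好世界']
```

Both behaviours affect any word-level metric, so word-based scores for unsegmented scripts should be read with caution.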
I would like to add phonological similarity metrics in the future, and am waiting for code to be released from the Greek Room library.
The following metrics are used in metrics.py to compare language samples. Each metric provides a different perspective on the similarity or difference between the samples:
- Distortion (Dist): Measures the alignment of lines between two samples. A value of 1.0 indicates perfect alignment, while lower values indicate greater distortion.
- Length Ratio Standard Deviation (LenRat): Calculates the standard deviation of the length ratios of corresponding lines in the samples. A lower value indicates more consistent line lengths between the samples.
- Whitespace Ratio Difference (WSDiff): Compares the proportion of whitespace characters in the samples. A lower value indicates more similar whitespace usage.
- Whitespace KS Statistic (WSKS): Uses the Kolmogorov-Smirnov statistic to compare the distribution of whitespace characters in the samples. A lower value indicates more similar distributions.
- Punctuation Jensen-Shannon Divergence (PunctJS): Measures the divergence in punctuation usage between the samples using the Jensen-Shannon divergence. A lower value indicates more similar punctuation distributions.
- Entropy Difference (EntDiff): Compares the entropy (a measure of randomness) of the character distributions in the samples. A lower value indicates more similar entropy values.
- Lexical Similarity (LexSim): Measures the proportion of shared words between the samples. A higher value indicates greater lexical similarity.
- Cognate Proportion (CogProp): Calculates the proportion of cognates (words with a common etymological origin) between the samples. A higher value indicates a higher proportion of cognates.
- Cognate-based Distortion (CogDist): Measures the alignment of cognates between the samples. A value of 1.0 indicates perfect alignment, while lower values indicate greater distortion.
- Morphological Complexity (MorphComp): Compares the complexity of word forms in the samples by analyzing the distribution of word tokens. A higher value indicates more similar morphological complexity.
- Overall Similarity (OverallSim): A weighted combination of the above metrics to provide an overall similarity score between the samples. Higher values indicate greater overall similarity.
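For readers who want a feel for the character-level and word-level metrics above, the following is a rough standard-library sketch of five of them. It is not the code in metrics.py: every function name here is hypothetical, and several details are assumptions (population standard deviation for LenRat, ASCII-only punctuation for PunctJS, base-2 logarithms, and Jaccard overlap of lowercased whitespace-split words for LexSim).

```python
import math
import statistics
import string
from collections import Counter


def length_ratio_std(src, tgt):
    """Std. dev. of per-line character-length ratios (source / target)."""
    ratios = [len(s) / len(t) for s, t in zip(src, tgt) if t]
    return statistics.pstdev(ratios) if ratios else 0.0


def whitespace_ratio(lines):
    """Fraction of characters in the sample that are whitespace."""
    text = "".join(lines)
    return sum(ch.isspace() for ch in text) / len(text) if text else 0.0


def whitespace_ratio_diff(src, tgt):
    return abs(whitespace_ratio(src) - whitespace_ratio(tgt))


def char_entropy(lines):
    """Shannon entropy (bits) of the sample's character distribution."""
    counts = Counter("".join(lines))
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def entropy_diff(src, tgt):
    return abs(char_entropy(src) - char_entropy(tgt))


def punctuation_js_divergence(src, tgt):
    """Jensen-Shannon divergence (base 2, range 0..1) of punctuation usage.

    Assumption: only ASCII punctuation from string.punctuation is counted.
    """
    p = Counter(ch for ch in "".join(src) if ch in string.punctuation)
    q = Counter(ch for ch in "".join(tgt) if ch in string.punctuation)
    p_total, q_total = sum(p.values()), sum(q.values())
    if p_total == 0 or q_total == 0:
        # Both empty: identical. One empty: treat as maximally different.
        return 0.0 if p_total == q_total else 1.0
    js = 0.0
    for ch in set(p) | set(q):
        pi, qi = p[ch] / p_total, q[ch] / q_total
        mi = (pi + qi) / 2  # mixture distribution
        if pi:
            js += 0.5 * pi * math.log2(pi / mi)
        if qi:
            js += 0.5 * qi * math.log2(qi / mi)
    return js


def lexical_similarity(src, tgt):
    """Jaccard overlap of lowercased, whitespace-split word sets (assumed)."""
    a = {w.lower() for line in src for w in line.split()}
    b = {w.lower() for line in tgt for w in line.split()}
    return len(a & b) / len(a | b) if a | b else 0.0


# Illustrative input only; these sentences are not from the project's examples.
english = ["The cat sleeps.", "Where is the house?"]
spanish = ["El gato duerme.", "¿Dónde está la casa?"]
for metric in (length_ratio_std, whitespace_ratio_diff, entropy_diff,
               punctuation_js_divergence, lexical_similarity):
    print(f"{metric.__name__}: {metric(english, spanish):.3f}")
```

Comparing a sample with itself should give 0 for the four difference metrics and 1 for lexical similarity, which is a quick sanity check when adapting the sketch.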