Determine language similarity by comparing translation pairs.

Project description

langsim

Compare languages even with just a few translation pairs.

NOTE: This project is very much a work in progress and is not recommended for production use. If you do use it, please open an issue and share your impressions. I have only tested it on an ARM-based (AArch64) MacBook Pro.

How to install from source

Clone the repository:

git clone https://github.com/ryderwishart/langsim.git

Run pip install -r requirements-dev.txt to install the dependencies.

Run pip install -e . to install the package in editable mode.

How to run

Run python examples/basic_usage.py to try the basic usage example.

Run python examples/using_debug_mode.py to try the basic usage example with debug mode enabled.

Tests

Run pytest to test the code.

Note: tests are currently failing, apparently because of a mismatch with the hydra-core version.

Example output

┏━━━━━━━━━━━━┳━━━━━━━┓
┃ Metric     ┃ Score ┃
┡━━━━━━━━━━━━╇━━━━━━━┩
│ Dist       │ 1.000 │
│ LenRat     │ 0.076 │
│ WSDiff     │ 0.003 │
│ WSKS       │ 0.003 │
│ PunctJS    │ 0.000 │
│ EntDiff    │ 0.013 │
│ LexSim     │ 0.600 │
│ CogProp    │ 0.769 │
│ MorphComp  │ 0.474 │
│ CogDist    │ 1.000 │
│ OverallSim │ 0.628 │
└────────────┴───────┘
╭─────────────────────────────────────────────── Metric Legend ────────────────────────────────────────────────╮
│       Line: Line                                                                                             │
│                                                                                                              │
│       Dist: Distortion                                                                                       │
│                                                                                                              │
│     LenRat: Length ratio std                                                                                 │
│                                                                                                              │
│     WSDiff: Whitespace ratio diff                                                                            │
│                                                                                                              │
│       WSKS: Whitespace KS statistic                                                                          │
│                                                                                                              │
│    PunctJS: Punctuation JS divergence                                                                        │
│                                                                                                              │
│    EntDiff: Entropy diff                                                                                     │
│                                                                                                              │
│     LexSim: Lexical similarity                                                                               │
│                                                                                                              │
│    CogProp: Cognate proportion                                                                               │
│                                                                                                              │
│    CogDist: Cognate-based distortion                                                                         │
│                                                                                                              │
│  MorphComp: Morphological complexity                                                                         │
│                                                                                                              │
│ OverallSim: Overall Similarity                                                                               │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
                                             Pairwise Line Scores
┏━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┓
┃   Line ┃   Dist ┃ LenRat ┃ WSDiff ┃   WSKS ┃ Punct… ┃ EntDi… ┃ LexSim ┃ CogPr… ┃ CogDist ┃ Morph… ┃ Overal… ┃
┡━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━┩
│      1 │  1.000 │  0.037 │  0.006 │  0.006 │  0.000 │  0.005 │  0.667 │  0.800 │   1.000 │  0.560 │   0.656 │
│      2 │  1.000 │  0.107 │  0.009 │  0.009 │  0.000 │  0.121 │  0.500 │  0.667 │   1.000 │  0.551 │   0.589 │
│      3 │  1.000 │  0.067 │  0.008 │  0.008 │  0.000 │  0.208 │  0.667 │  0.800 │   1.000 │  0.442 │   0.626 │
└────────┴────────┴────────┴────────┴────────┴────────┴────────┴────────┴────────┴─────────┴────────┴─────────┘

Metrics Explanation

Before listing the comparison metrics, a few caveats are necessary. The comparisons in this library are intended as a starting point; please open issues or PRs with any suggested revisions. I have tried to keep them language-agnostic, so in principle you can compare any two sets of strings. Accordingly, I have avoided comparisons that rely on structured knowledge: tokenization is completely naive, for example, and syntax is not considered.
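To make the tokenization caveat concrete, here is a minimal sketch of what naive tokenization can look like. It assumes "naive" means splitting on whitespace only; the tokenizer actually used in langsim may differ, and naive_tokenize is a hypothetical helper, not part of the package.

```python
def naive_tokenize(text):
    """Split on runs of whitespace; no language-specific rules are applied.

    Hypothetical helper for illustration, not langsim's own tokenizer.
    """
    return text.split()


# Punctuation stays attached to the neighbouring word.
print(naive_tokenize("Compare languages, even with few pairs."))
# ['Compare', 'languages,', 'even', 'with', 'few', 'pairs.']

# Text written without spaces comes back as a single token.
print(naive_tokenize("我喜欢学习语言"))
# ['我喜欢学习语言']
```

Under this assumption, word-level scores for scripts that do not separate words with spaces should be read with caution.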

I would like to add phonological similarity metrics in the future, and am waiting for code to be released from the Greek Room library.

The following metrics are used in metrics.py to compare language samples. Each metric provides a different perspective on the similarity or difference between the samples:

  1. Distortion (Dist): Measures the alignment of lines between two samples. A value of 1.0 indicates perfect alignment, while lower values indicate greater distortion.

  2. Length Ratio Standard Deviation (LenRat): Calculates the standard deviation of the length ratios of corresponding lines in the samples. A lower value indicates more consistent line lengths between the samples.

  3. Whitespace Ratio Difference (WSDiff): Compares the proportion of whitespace characters in the samples. A lower value indicates more similar whitespace usage.

  4. Whitespace KS Statistic (WSKS): Uses the Kolmogorov-Smirnov statistic to compare the distribution of whitespace characters in the samples. A lower value indicates more similar distributions.

  5. Punctuation Jensen-Shannon Divergence (PunctJS): Measures the divergence in punctuation usage between the samples using the Jensen-Shannon divergence. A lower value indicates more similar punctuation distributions.

  6. Entropy Difference (EntDiff): Compares the entropy (a measure of randomness) of the character distributions in the samples. A lower value indicates more similar entropy values.

  7. Lexical Similarity (LexSim): Measures the proportion of shared words between the samples. A higher value indicates greater lexical similarity.

  8. Cognate Proportion (CogProp): Calculates the proportion of cognates (words with a common etymological origin) between the samples. A higher value indicates a higher proportion of cognates.

  9. Cognate-based Distortion (CogDist): Measures the alignment of cognates between the samples. A value of 1.0 indicates perfect alignment, while lower values indicate greater distortion.

  10. Morphological Complexity (MorphComp): Compares the complexity of word forms in the samples by analyzing the distribution of word tokens. A higher value indicates more similar morphological complexity.

  11. Overall Similarity (OverallSim): A weighted combination of the above metrics to provide an overall similarity score between the samples. Higher values indicate greater overall similarity.
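
A few of the simpler metrics above can be sketched in plain Python. This is an illustrative approximation, not the code in metrics.py. All function names are hypothetical. Lexical similarity is assumed to be the Jaccard overlap of whitespace-separated tokens, punctuation is assumed to mean characters in the Unicode "P" categories, and the handling of empty inputs is a choice made for this sketch.

```python
import math
import statistics
import unicodedata
from collections import Counter


def length_ratio_std(src_lines, tgt_lines):
    """LenRat-style score: std. dev. of target/source length ratios per line pair."""
    ratios = [len(t) / len(s) for s, t in zip(src_lines, tgt_lines) if s]
    return statistics.pstdev(ratios) if ratios else 0.0


def whitespace_ratio(text):
    """Fraction of characters that are whitespace (WSDiff compares two of these)."""
    return sum(ch.isspace() for ch in text) / len(text) if text else 0.0


def char_entropy(text):
    """Shannon entropy, in bits, of the character distribution (used for EntDiff)."""
    counts = Counter(text)
    total = len(text)
    return -sum(n / total * math.log2(n / total) for n in counts.values())


def punctuation_counts(text):
    """Count characters whose Unicode category starts with 'P' (assumption)."""
    return Counter(ch for ch in text if unicodedata.category(ch).startswith("P"))


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, range 0..1) between two count mappings."""
    p_total, q_total = sum(p.values()), sum(q.values())
    if p_total == 0 and q_total == 0:
        return 0.0  # sketch choice: two samples without punctuation do not diverge
    if p_total == 0 or q_total == 0:
        return 1.0  # sketch choice: one-sided punctuation is maximally divergent
    divergence = 0.0
    for key in set(p) | set(q):
        p_k = p.get(key, 0) / p_total
        q_k = q.get(key, 0) / q_total
        m_k = 0.5 * (p_k + q_k)  # mixture distribution
        if p_k > 0:
            divergence += 0.5 * p_k * math.log2(p_k / m_k)
        if q_k > 0:
            divergence += 0.5 * q_k * math.log2(q_k / m_k)
    return divergence


def lexical_similarity(src_text, tgt_text):
    """Jaccard overlap of lowercased whitespace tokens (assumed LexSim definition)."""
    src_words = set(src_text.lower().split())
    tgt_words = set(tgt_text.lower().split())
    union = src_words | tgt_words
    return len(src_words & tgt_words) / len(union) if union else 0.0


# Made-up translation pairs, for illustration only.
src = ["The cat sat on the mat.", "Where is the dog?"]
tgt = ["De kat zat op de mat.", "Waar is de hond?"]
src_text, tgt_text = " ".join(src), " ".join(tgt)

print("LenRat ", round(length_ratio_std(src, tgt), 3))
print("WSDiff ", round(abs(whitespace_ratio(src_text) - whitespace_ratio(tgt_text)), 3))
print("PunctJS", round(js_divergence(punctuation_counts(src_text), punctuation_counts(tgt_text)), 3))
print("EntDiff", round(abs(char_entropy(src_text) - char_entropy(tgt_text)), 3))
print("LexSim ", round(lexical_similarity(src_text, tgt_text), 3))
```

The scores printed by this sketch will not necessarily match what langsim reports for the same input, since the library's normalization and weighting are not reproduced here.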

Download files

Source Distribution

langsim-0.1.0.tar.gz (12.8 kB)

Uploaded Source

Built Distribution

langsim-0.1.0-py3-none-any.whl (10.0 kB)

Uploaded Python 3

File details

Details for the file langsim-0.1.0.tar.gz.

File metadata

  • Download URL: langsim-0.1.0.tar.gz
  • Upload date:
  • Size: 12.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.6

File hashes

Hashes for langsim-0.1.0.tar.gz
Algorithm Hash digest
SHA256 f8097aef2db02f7c06f7b26e2a706a6bd75cb27da1c1c51d2cd83981df089739
MD5 918086f17929cedc480807d7cafa7186
BLAKE2b-256 a66e9ed3d4d606d85c7b1caa6ebc0086917f17e5136207d4dbffeaf5661b570c

File details

Details for the file langsim-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: langsim-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 10.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.6

File hashes

Hashes for langsim-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 15b36a9ad2e30bfc3fab82bcc038e16ff3e91eb4ea8bdbc59d93281e086e143f
MD5 0ad28ad04a8091d2b1f12d1513884244
BLAKE2b-256 1a06f31aa20724821ab09c0bcb5b209d9a699c12cbd0eb96ecf5f1150f9fd295
