Python package for measuring distance between the lects represented by small raw corpora

Maintainers

ilia-afanasev-1997

License
- OSI Approved :: Mozilla Public License 2.0 (MPL 2.0)
Operating System
- OS Independent
Programming Language
- Python :: 3


What is it?

corpus_distance is a Python package for measuring the distance between lects that are represented only by small (down to extremely small, <1000 tokens) raw corpora, that is, corpora without any morphological tagging, lemmatisation, or dependency parsing, and for classifying these lects. It combines frequency-based metrics and string similarity measurements into a hybrid distance scorer.

corpus_distance operates on 3-shingles, sequences of 3 symbols into which words are split (Zelenkov and Segalovich, 2007). This helps to spot more intricate patterns and correspondences within raw data, and it also increases the effective size of the dataset.
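As an illustration, splitting a word into 3-shingles can be sketched as follows. This is a minimal sketch, not the package's own implementation: the function name shingles and the handling of words shorter than 3 symbols (kept whole here) are assumptions.

```python
def shingles(word: str, k: int = 3) -> list[str]:
    """Split a word into overlapping sequences of k symbols."""
    if not word:
        return []
    # Assumption: a word of k symbols or fewer is kept as a single shingle.
    if len(word) <= k:
        return [word]
    # Slide a window of width k over the word, one symbol at a time.
    return [word[i:i + k] for i in range(len(word) - k + 1)]


print(shingles("slovo"))
```

A word of n symbols (n >= 3) yields n - 2 shingles, which is why shingling increases the number of units available for comparison.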

NB!

The classification is only (and extremely) preliminary, as the package is language-agnostic by default and does not use prior expert judgements or linguistic information. Basically, the more effort is put into the actual data, the more reliable the final results are.

In addition, the results may not be used as proof of language relationship (external classification), only as supporting evidence for a tree topology (internal classification), as is the case with any phylogenetic method in historical comparative studies.

One more important point is that one should be very careful when using this package for distantly related lects. Like any language-agnostic method, it loses precision as the distance between the analysed groups increases (Holman et al., 2008).

How to install

Version warning

corpus_distance is currently not available for Python>=3.14. Use Python 3.12 for the smoothest possible experience.

From TestPyPI (development version; requires manual installation of dependencies; may contain bugs)

pip install biopython
pip install Levenshtein
pip install gensim
pip install pyjarowinkler

python3 -m pip install --index-url https://test.pypi.org/simple/ --no-deps --force-reinstall corpus_distance

From PyPI (release version)

pip install corpus_distance

From Docker

1. Clone the repository.

git clone https://github.com/The-One-Who-Speaks-and-Depicts/corpus_distance.git

2. Change to the repository root directory.
3. (Optionally) Add your data and configuration file (see How to use/Preparation below) to the repository directory.
4. Create the Docker image.

sudo docker build --tag IMAGE_NAME .

IMAGE_NAME is a name of your choice for the Docker image; Docker requires image names to be lowercase.

5. Run the Docker image.

sudo docker run -i -t IMAGE_NAME /bin/bash

How to use

Preparation

Create a virtual environment with python3 -m venv ENV, where ENV is the name of your virtual environment.
Install the package, following the instructions above.
Create a folder for your data.
Put the files with your texts into this folder. Texts should be tokenised manually and then joined into a single string. File names should follow the format TEXT.LECT.EXTENSION, where:
- EXTENSION is preferably .txt, as the package works with files containing raw text data;
- LECT is the name of the lect (idiolect, dialect, sociolect, regiolect, standard, etc.; any given variety of a language, such as English, or Polish, or Khislavichi, or Napoleon's French) that is the object of the classification;
- TEXT is a unique identifier of the text within a given lect (for instance, NapoleonSpeech1, or John_Gospel).
Set up a configuration .json file (an example is in the repository). The parameters are:
- metrics_name: a user-defined name for the metrics
- store_path: the path to the folder where results are stored
- content_path: the path to the data folder
- split: the share of tokens from your files that is taken into consideration (useful for exploring size effects)
- topic_modelling: controls the treatment of topic words. With the value substitute, the model deletes topic words; with the value not_substitute, it keeps them. This heuristic helps to exclude the words that characterise the text rather than the language. Alternatively, with the value topic_words_only, the model deletes all non-topic words.
- lda_params: a set of parameters for the Latent Dirichlet Allocation model from the gensim package, plus two custom parameters, required_topics_start and required_topics_num, which determine the topics whose tokens the model uses further. required_topics_start is the first topic (defaults to 0); required_topics_num is the number of topics after required_topics_start that the model collects (defaults to 10).
- fasttext_params: a set of parameters for the FastText model that provides the classifier with symbol embeddings
- soerensen: normalisation of frequency-based metrics by the Soerensen-Dice coefficient
- hybridisation: a flag that enables or disables the string similarity measure for non-coinciding 3-shingles
- hybridisation_as_array: regulates the way of hybridisation. Either the values of frequency-based metrics and string similarity measures are taken as a single array, over which the mean score is computed, or they are taken separately, and their means are multiplied by each other. soerensen normalisation applies only when this parameter is false.
- metrics: a particular string similarity measure. It may be user-defined; the built-in options are corpus_distance.distance_measurement.string_similarity.levenshtein_wrapper (simple edit distance), corpus_distance.distance_measurement.string_similarity.weighted_jaro_winkler_wrapper (edit distance, weighted by Jaro-Winkler distance), corpus_distance.distance_measurement.string_similarity.vector_measure_wrapper (differences counted by Euclidean distance between vectors of symbols), and corpus_distance.distance_measurement.string_similarity.jaro_vector_wrapper (differences counted by Euclidean distance between vectors of symbols, weighted by Jaro distance in order to account for symbol order)
- alphabet_normalisation: normalisation of vector-based metrics by the difference in alphabet entropy between the given lects
- data_name: the name of the dataset for visualisation (for example, South Slavic)
- outgroup: the name of the lect that is the farthest from the others
- metrics: the name of the metrics combination, by default containing all the given parameters
- classification_method: the method for building the tree, upgma or nj: either Unweighted Pair Group Method with Arithmetic Mean, or Neighbor-Joining
- store_path: the same as store_path at the top.
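The soerensen and hybridisation_as_array options can be pictured with a small sketch. It is only an illustration under stated assumptions: the function names are hypothetical, the inputs are plain lists of per-shingle scores, and the package's actual computation may differ in detail.

```python
from statistics import mean


def soerensen_dice(a: set[str], b: set[str]) -> float:
    """Soerensen-Dice coefficient of two sets: 2 * |A & B| / (|A| + |B|)."""
    if not a and not b:
        return 1.0  # convention chosen here for two empty sets
    return 2 * len(a & b) / (len(a) + len(b))


def combine_as_array(freq_scores: list[float], string_scores: list[float]) -> float:
    """Pool both kinds of scores into one array and take a single mean."""
    return mean([*freq_scores, *string_scores])


def combine_separately(freq_scores: list[float], string_scores: list[float]) -> float:
    """Take the mean of each kind of score, then multiply the two means."""
    return mean(freq_scores) * mean(string_scores)


# Hypothetical scores for two lects.
freq = [0.2, 0.4]
string = [0.6]
print(combine_as_array(freq, string))
print(combine_separately(freq, string))
print(soerensen_dice({"slo", "lov", "ovo"}, {"slo", "lov", "ova"}))
```

The two modes generally give different values for the same scores, so results obtained with different hybridisation_as_array settings are not directly comparable.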
An example of config.json:

    {
        "metrics_name": "default_metrics_name",
        "data": {
            "dataset_params": {
                "store_path": "default",
                "split": 1,
                "content_path": "default",
                "topic_modelling": "not_substitute"
            },
            "lda_params": {
                "num_topics": 10,
                "alpha": "auto",
                "epochs": 300,
                "passes": 500,
                "random_state": 1,
                "required_topics_start": 0,
                "required_topics_num": 10
            },
            "fasttext_params": {
                "vector_size": 128,
                "window": 15,
                "min_count": 3,
                "workers": 1,
                "epochs": 300,
                "seed": 42,
                "sg": 1
            }
        },
        "hybridisation_parameters": {
            "soerensen": true,
            "hybridisation": true,
            "hybridisation_as_array": true,
            "metrics": "corpus_distance.distance_measurement.string_similarity.jaro_vector_wrapper",
            "alphabet_normalisation": true
        },
        "clusterisation_parameters": {
            "data_name": "Modern Standard Slavic",
            "outgroup": "Slovak",
            "metrics": "default_metrics_name",
            "classification_method": "upgma"
        }
    }
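A configuration file with the structure above can be sanity-checked before a run. The sketch below only verifies that the four top-level keys of the example are present; the helper names are hypothetical, and whether the package itself requires every one of these keys is an assumption.

```python
import json

# Top-level keys taken from the example config.json above.
REQUIRED_SECTIONS = (
    "metrics_name",
    "data",
    "hybridisation_parameters",
    "clusterisation_parameters",
)


def missing_sections(config: dict) -> list[str]:
    """List the top-level keys of the example that are absent from config."""
    return [key for key in REQUIRED_SECTIONS if key not in config]


def load_config(path: str) -> dict:
    """Read a JSON configuration file and fail early if a section is missing."""
    with open(path, encoding="utf-8") as handle:
        config = json.load(handle)
    missing = missing_sections(config)
    if missing:
        raise ValueError(f"configuration lacks sections: {', '.join(missing)}")
    return config
```

Calling load_config("config.json") either returns the parsed dictionary or raises an error that names the missing sections.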

If you are using Docker, do not forget to put your data and configuration file into the repository directory before building the Docker image.
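The TEXT.LECT.EXTENSION naming scheme from the preparation steps can be checked before running the package. The following is a minimal sketch with hypothetical helper names; it assumes that a valid file name contains exactly two dots, which may be stricter than what the package itself accepts.

```python
from collections import defaultdict
from pathlib import Path


def parse_name(filename: str) -> tuple[str, str, str]:
    """Split a file name of the form TEXT.LECT.EXTENSION into its parts."""
    parts = filename.split(".")
    # Assumption: exactly three non-empty, dot-separated parts.
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected TEXT.LECT.EXTENSION, got {filename!r}")
    text, lect, extension = parts
    return text, lect, extension


def group_by_lect(folder: str) -> dict[str, list[str]]:
    """Map each lect name to the text identifiers found in the data folder."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            text, lect, _ = parse_name(path.name)
            grouped[lect].append(text)
    return dict(grouped)


print(parse_name("John_Gospel.Slovak.txt"))
```

Running group_by_lect on the data folder gives a quick overview of how many texts each lect contributes and flags misnamed files early.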

Running the code

There are two ways of running the code: with the prepared Jupyter Notebook, or independently.

Ready-made Jupyter Notebook

The example folder contains a tutorial notebook that outlines the inner workings of the package.

Using your own file (or from Docker console)

Once the data and configuration are ready, open the Python interpreter:

python

Run the following commands:

from corpus_distance.pipeline import perform_clusterisation
perform_clusterisation(PATH_TO_CONFIG)

PATH_TO_CONFIG here is the path to config.json. This parameter may be left empty; the model then performs clusterisation with the default dataset (Modern Standard Slavic Gospels of John: Croatian, Slovak, Slovenian) and default parameters.
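The same call can be wrapped in a small helper that checks the configuration file first. This is a sketch rather than part of the package: check_config_path and run are hypothetical names, and passing the path as a plain string is an assumption about what perform_clusterisation accepts.

```python
import json
from pathlib import Path


def check_config_path(path_text: str) -> Path:
    """Return the path if it points to a readable, well-formed JSON file."""
    path = Path(path_text)
    if not path.is_file():
        raise FileNotFoundError(f"no configuration file at {path}")
    with path.open(encoding="utf-8") as handle:
        json.load(handle)  # raises json.JSONDecodeError on malformed JSON
    return path


def run(path_text: str) -> None:
    """Check the configuration file, then hand it to the package."""
    config_path = check_config_path(path_text)
    # Imported here so that check_config_path works without the package installed.
    from corpus_distance.pipeline import perform_clusterisation
    # Assumption: the function accepts the path as a plain string.
    perform_clusterisation(str(config_path))
```

With the package installed, run("config.json") fails fast on a missing or malformed file instead of partway through the pipeline.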

Release history

- 0.7.1 (this version): Dec 23, 2025
- 0.6.7: May 7, 2025
- 0.5: Oct 20, 2024
- 0.4.0: Jun 1, 2024
- 0.3.3: Jun 1, 2024

Download files

Source Distribution

corpus_distance-0.7.1.tar.gz (128.8 kB)

Uploaded Dec 23, 2025 Source

Built Distribution

corpus_distance-0.7.1-py3-none-any.whl (133.2 kB)

Uploaded Dec 23, 2025 Python 3

File details

Details for the file corpus_distance-0.7.1.tar.gz.

File metadata

Download URL: corpus_distance-0.7.1.tar.gz
Upload date: Dec 23, 2025
Size: 128.8 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for corpus_distance-0.7.1.tar.gz
Algorithm	Hash digest
SHA256	`b39e4eaebf55702e30456315473afc2a58f3d256751c3529de98a41756ee8a98`
MD5	`cf5ba2df812c1e0fdd49adcdccc81b03`
BLAKE2b-256	`f5ead4580f1321e88df586447227566f6aba3309cba829150469766341a2e05c`

Provenance

The following attestation bundles were made for corpus_distance-0.7.1.tar.gz:

Publisher: release_post_merge.yml on The-One-Who-Speaks-and-Depicts/corpus_distance

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: corpus_distance-0.7.1.tar.gz
- Subject digest: b39e4eaebf55702e30456315473afc2a58f3d256751c3529de98a41756ee8a98
- Sigstore transparency entry: 778005857
- Sigstore integration time: Dec 23, 2025
Source repository:
- Permalink: The-One-Who-Speaks-and-Depicts/corpus_distance@1f7650215587fcd8055940cb292d20c4df51f16d
- Branch / Tag: refs/tags/v0.7.1
- Owner: https://github.com/The-One-Who-Speaks-and-Depicts
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release_post_merge.yml@1f7650215587fcd8055940cb292d20c4df51f16d
- Trigger Event: push

File details

Details for the file corpus_distance-0.7.1-py3-none-any.whl.

File metadata

Download URL: corpus_distance-0.7.1-py3-none-any.whl
Upload date: Dec 23, 2025
Size: 133.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for corpus_distance-0.7.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`adfa97d50a83860681d79247fd7bc6ccb23ecb68c19bd638ac0aa08221ded6fb`
MD5	`151fd7603cfdde1d65ae13e38067f546`
BLAKE2b-256	`6eef0ed3f333fbcff7ce2bcb8dc25ecb3c4ab731d3c41c99cb971513806db35b`

Provenance

The following attestation bundles were made for corpus_distance-0.7.1-py3-none-any.whl:

Publisher: release_post_merge.yml on The-One-Who-Speaks-and-Depicts/corpus_distance

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: corpus_distance-0.7.1-py3-none-any.whl
- Subject digest: adfa97d50a83860681d79247fd7bc6ccb23ecb68c19bd638ac0aa08221ded6fb
- Sigstore transparency entry: 778005879
- Sigstore integration time: Dec 23, 2025
Source repository:
- Permalink: The-One-Who-Speaks-and-Depicts/corpus_distance@1f7650215587fcd8055940cb292d20c4df51f16d
- Branch / Tag: refs/tags/v0.7.1
- Owner: https://github.com/The-One-Who-Speaks-and-Depicts
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release_post_merge.yml@1f7650215587fcd8055940cb292d20c4df51f16d
- Trigger Event: push
