Python package for measuring distance between the lects represented by small raw corpora
What is it?
corpus_distance is a Python package for measuring distance between lects that are represented only by small (down to extremely small, <1000 tokens) raw corpora (without any kind of morphological tagging, lemmatisation, or dependency parsing), and for classifying them. It joins frequency-based metrics and string similarity measurements into a hybrid distance scorer.
corpus_distance operates on 3-shingles, sequences of 3 symbols into which words are split. This helps to spot more intricate patterns and correspondences within raw data, as well as to enhance the dataset size.
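For illustration, a minimal sketch of 3-shingling (the package's internal splitting logic may differ in details, e.g. in how word boundaries are marked):

```python
def to_shingles(word: str, n: int = 3) -> list[str]:
    """Split a word into overlapping n-symbol shingles."""
    if len(word) < n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(to_shingles("water"))  # ['wat', 'ate', 'ter']
```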
NB!
The classification is going to be only (and extremely) preliminary, as it is language-agnostic by default and does not use prior expert judgements or linguistic information. Basically, the more effort is put into the actual data, the more reliable the final results are.
In addition, the results may not be used as proof of language relationship (external classification), only as supporting evidence for a tree topology (internal classification), as is the case with any kind of phylogenetic method in historical comparative studies.
One more important caveat: one should be very careful when applying this package to distantly related lects. As with any language-agnostic method, precision drops as the distance between the analysed groups increases.
How to install
From TestPyPI (development version; requires manual installation of dependencies; may contain bugs)
```
pip install biopython
pip install Levenshtein
pip install gensim
pip install pyjarowinkler
python3 -m pip install --index-url https://test.pypi.org/simple/ --no-deps --force-reinstall corpus_distance
```
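After installation, a quick smoke test can be run; this is only a sketch that checks that the documented entry point imports:

```python
# If this import succeeds, the package and its dependencies are in place.
from corpus_distance.pipeline import perform_clusterisation

print("corpus_distance is ready")
```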
How to use
Preparation
- Create a virtual environment with `python3 -m venv ENV`, where ENV is the name of your virtual environment.
- Install the package, following the instructions above.
- Create a folder for your data.
- Put files with your texts into this folder. Texts should be tokenised manually and joined into a single string afterwards. The names of the files should follow the format TEXT.LECT.EXTENSION, where:
  - EXTENSION is preferably .txt, as the package works with raw text data;
  - LECT is the name of the lect (idiolect, dialect, sociolect, regiolect, standard, etc.; any given variety of a language, such as English, or Polish, or Khislavichi, or Napoleon's French) that is an object of the classification;
  - TEXT is a unique identifier of the text within a given lect (for instance, NapoleonSpeech1, or John_Gospel).
- Set up a configuration .json file (an example is in the repository). The parameters are:
  - `store_path`: a path to the folder where results are stored
  - `content_path`: a path to the data folder
  - `split`: the share of tokens from your files that is taken into consideration (useful for exploring size effects)
  - `lda_params`: a set of parameters for a Latent Dirichlet Allocation model from the `gensim` package
  - `topic_modelling`: if `true`, the model deletes topic words; if `false`, it does not. This heuristic helps to exclude the words that define the text, as opposed to the ones that define the language
  - `fasttext_params`: a set of parameters for a FastText model that provides the classifier with symbol embeddings
  - `soerensen`: normalisation of frequency-based metrics by the Sørensen-Dice coefficient
  - `hybridisation`: flag for using (or not using) a string similarity measure for non-coinciding 3-shingles
  - `hybridisation_as_array`: regulates the way of hybridisation: either the frequency-based metric and string similarity values are taken as a single array, over which one mean is computed, or they are taken separately and their means are multiplied by each other (see the sketch after the example configuration). `soerensen` normalisation applies only when this parameter is `false`
  - `metrics`: a particular string similarity measure. May be user-defined; the available defaults are `corpus_distance.distance_measurement.string_similarity.levenshtein_wrapper` (simple edit distance), `corpus_distance.distance_measurement.string_similarity.weighted_jaro_winkler_wrapper` (edit distance, weighted by Jaro-Winkler distance), `corpus_distance.distance_measurement.string_similarity.vector_measure_wrapper` (differences counted as Euclidean distance between vectors of symbols), and `corpus_distance.distance_measurement.string_similarity.jaro_vector_wrapper` (Euclidean distance between vectors of symbols, weighted by Jaro distance, in order to account for symbol order)
  - `alphabet_normalisation`: normalisation of vector-based metrics by the difference in alphabet entropy between given lects
  - `data_name`: a name of the dataset for visualisation (for example, South Slavic)
  - `outgroup`: the name of the lect that is the farthest from the others
  - `metrics`: a name for the metrics combination, by default containing all the given parameters
  - `classification_method`: the classification method for building the tree, `upgma` or `nj`: either Unweighted Pair Group Method with Arithmetic Mean, or Neighbour-Joining
  - `store_path`: the same as `store_path` at the top
- An example `config.json`:
```json
{
"store_path": "default",
"metrics_name": "default_metrics_name",
"data": {
"content_path": "default",
"split": 1,
"lda_params": {
"num_topics": 10,
"alpha": "auto",
"epochs": 300,
"passes": 500
},
"topic_modelling": false,
"fasttext_params": {
"vector_size": 128,
"window": 15,
"min_count": 3,
"workers": 4,
"epochs": 300,
"seed": 42,
"sg": 1
}
},
"hybridisation_parameters": {
"soerensen": true,
"hybridisation": true,
"hybridisation_as_array": true,
"metrics": "corpus_distance.distance_measurement.string_similarity.jaro_vector_wrapper",
"alphabet_normalisation": true
},
"clusterisation_parameters": {
"data_name": "Modern Standard Slavic",
"outgroup": "Slovak",
"metrics": "default_metrics_name",
"classification_method": "upgma",
"store_path": "default"
}
}
```
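As promised above, here is a schematic sketch of the two hybridisation modes controlled by `hybridisation_as_array` (illustrative numbers only; this is not the package's actual implementation):

```python
from statistics import mean

# Example values only: frequency-based metric values and string
# similarity values for some compared pairs of 3-shingle inventories.
frequency_scores = [0.42, 0.37, 0.51]
similarity_scores = [0.30, 0.25, 0.44]

# "hybridisation_as_array": true -- pool everything into a single array
score_as_array = mean(frequency_scores + similarity_scores)

# "hybridisation_as_array": false -- separate means, multiplied;
# only in this mode may Soerensen-Dice normalisation be applied.
score_separate = mean(frequency_scores) * mean(similarity_scores)

print(score_as_array, score_separate)
```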
Running the code
There are two ways of running the code: with prepared Jupyter Notebook, or independently.
Ready-made Jupyter Notebook
In the folder `example`, there is a tutorial notebook that outlines the inner workings of the package.
Using your own file
After the data and configuration are ready, open the Python interpreter:

```
python
```

Run the following commands:

```python
from corpus_distance.pipeline import perform_clusterisation
perform_clusterisation(PATH_TO_CONFIG)
```

Here PATH_TO_CONFIG is the path to your `config.json`.
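The same call also works from a standalone script; a minimal sketch (the script name and config path are just examples):

```python
# run_clusterisation.py -- hypothetical driver script
from corpus_distance.pipeline import perform_clusterisation

if __name__ == "__main__":
    # Adjust the path to point at your own configuration file.
    perform_clusterisation("config.json")
```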