Classification algorithm based on finding shortest paths
Project description
Shortest Path Classification algorithm
Introduction
Suppose you have a dataset in which you know label of at least one – but not many more – label. Given this one labelled example, you want to classify all the other points in the dataset as either belonging to the class of that element, or the other class.
If your data is composed of (feature) vectors in and if you're not assuming anything about your data, and you're supposing the dataset is balanced, you might opt for the following classification rule: supposing you call known example's feature vector v, then compute the Euclidean distance of each element in the dataset and the vector v; the closer half is classified as the class of v, the farther half is classified as the other class.
That is not the case that is particularly interesting to solve, but consider now a similar problem: a dataset to which you have a reasonable metric in mind, but most elements are incomparable. (You can think of a metric whose domain has been extended with positive infinity.) What is, then, the intuitive counterpart of the above algorithm?
The answer that this repo proposes is to turn the dataset into a graph where each data point is a node, and the edges between them either have a finite positive weight if two data points are comparable; otherwise infinite weight (which is basically equivalent to them not being connected, but it is slighlty more convenient to put infinite weight to avoid cumbersome situations when the graph ends up unconnected). Then, find the shortest path from our known example to each of other data point; classify the closer half (in terms of the weight of the shortest path) to the known example's class, the farther half to the other class.
Local-global relationship
One of the reasons why I went to the trouble of implementing this model – besides the "because it was there" reason – is because I find it aesthetically pleasing how the model recovers global information from purely local relationships. This is something that seems somewhat absent in the rest of the machine learning (except in the trivial sense of models being trained on batches of data, etc.), so it seemed at least worth investigating. If you also find it aesthetically pleasing, see List of Local to Global principles. (I don't know of a really nice writeup of local-to-global principles that's not just about the number theoretical one, but, maybe one day–)
Demo
When I originally conceived this algorithm, I had tried it out on a set of my own Facebook messages – I had a bunch that were in Croatian and a bunch that were in English, so I taught the model to differentiate between them. Since I would rather not share my personal Facebook messages, the demonstration which I put in this repo is that of the model learning to differentiate between languages in the European Parliamentary Proceedings dataset. As you can see in the demo/languages.ipynb, the model does really well, approaching very close to 100% in a lot of language pairs and/or hyperparameter settings.
In order to run the demo/languages.ipynb notebook yourself, you have to first call the scripts which download and prepare the dataset. You need to execute these two scripts in this order:
- demo/dataset_utils/get_dataset.py
- demo/dataset_utils/extract_language_text.py
Installing
Just run
pip install shapaclass
Or alternatively, clone this repository. If you want to run the demo, you will have to clone the repository because only the algorithm part is on PiPy.
Dependencies
In order to run the algorithm itself, you need the following (these are installed automatically with pip)
- NumPy (>= 1.19.2)
- NetworkX (>= 2.5)
Additionally, to run the example provided in the GitHub repo, and all its constituent parts, you need
- BeautifulSoup4 (>= 4.10.0)
- ProgressBar33 (>= 2.4)
- Matplotlib (>= 3.3.2)
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file shapaclass-0.1.2.tar.gz
.
File metadata
- Download URL: shapaclass-0.1.2.tar.gz
- Upload date:
- Size: 7.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.2 importlib_metadata/4.8.1 pkginfo/1.6.1 requests/2.24.0 requests-toolbelt/0.9.1 tqdm/4.50.2 CPython/3.8.5
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 73240876951ade9dec9c17d6a49fe8e49178454a4dd71af837bce606184a048c |
|
MD5 | 8880fbfd3e650ff212fe17520386f24b |
|
BLAKE2b-256 | daf3f2ecfb65b80bcc0d662672e0b2b18bfb7e8d143007093049bb0eef6efee9 |
File details
Details for the file shapaclass-0.1.2-py3-none-any.whl
.
File metadata
- Download URL: shapaclass-0.1.2-py3-none-any.whl
- Upload date:
- Size: 6.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.2 importlib_metadata/4.8.1 pkginfo/1.6.1 requests/2.24.0 requests-toolbelt/0.9.1 tqdm/4.50.2 CPython/3.8.5
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 15b375d75014c99368c83dfc93cd0a2a5fbefa15b1019abd64b282f84ac0a4df |
|
MD5 | fe46af0978c3cee4bc7ba41a33aadf95 |
|
BLAKE2b-256 | f95b51fbca3c4dc02bd8a11f3cc9395f9352ad531009dc0bd140db6014ccd6ae |