Classification algorithm based on finding shortest paths

Project description

Shortest Path Classification algorithm

Introduction

Suppose you have a dataset in which you know the label of at least one element – but not many more. Given this one labelled example, you want to classify all the other points in the dataset as either belonging to the class of that element, or to the other class.

If your data consists of (feature) vectors in ℝ^n, you're not assuming anything else about your data, and you suppose the dataset is balanced, you might opt for the following classification rule: call the known example's feature vector v, compute the Euclidean distance between each element in the dataset and the vector v, and classify the closer half as the class of v and the farther half as the other class.
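
For concreteness, here is a minimal sketch of that baseline rule in NumPy (the function name is illustrative and not part of this package):

    import numpy as np

    def euclidean_halves(X, v):
        # X: (n, d) array of feature vectors; v: (d,) feature vector of the known example.
        # Returns a boolean array with True for the points assigned to v's class.
        dists = np.linalg.norm(X - v, axis=1)   # Euclidean distance to the known example
        threshold = np.median(dists)            # balanced dataset: split at the median
        return dists <= threshold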

That case is not particularly interesting to solve, but now consider a similar problem: a dataset for which you have a reasonable metric in mind, but under which most pairs of elements are incomparable. (You can think of a metric that is allowed to take the value positive infinity.) What, then, is the intuitive counterpart of the above algorithm?

The answer this repo proposes is to turn the dataset into a graph in which each data point is a node, and the edge between two data points has a finite positive weight if they are comparable and infinite weight otherwise. (Infinite weight is essentially equivalent to the points not being connected, but it is slightly more convenient, since it avoids cumbersome special cases when the graph ends up disconnected.) Then, find the shortest path from the known example to every other data point, and classify the closer half (in terms of the weight of the shortest path) as the known example's class and the farther half as the other class.
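
To make the idea concrete, here is a minimal sketch of that procedure using NetworkX. It illustrates the approach rather than this package's actual API; pairwise_distance is an assumed user-supplied function that returns a positive weight, or infinity for incomparable pairs:

    import math
    import networkx as nx

    def shortest_path_split(points, known_index, pairwise_distance):
        # Build a complete graph: comparable pairs get finite weights,
        # incomparable pairs get weight math.inf.
        G = nx.Graph()
        G.add_nodes_from(range(len(points)))
        for i in range(len(points)):
            for j in range(i + 1, len(points)):
                G.add_edge(i, j, weight=pairwise_distance(points[i], points[j]))

        # Weighted shortest-path lengths from the known example (Dijkstra).
        lengths = nx.single_source_dijkstra_path_length(G, known_index, weight="weight")
        dists = [lengths.get(i, math.inf) for i in range(len(points))]

        # Balanced-dataset assumption: the closer half gets the known example's class.
        threshold = sorted(dists)[len(points) // 2]
        return [d <= threshold for d in dists]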

Local-global relationship

One of the reasons I went to the trouble of implementing this model – besides the "because it was there" reason – is that I find it aesthetically pleasing how the model recovers global information from purely local relationships. This is something that seems somewhat absent from the rest of machine learning (except in the trivial sense of models being trained on batches of data, etc.), so it seemed at least worth investigating. If you also find it aesthetically pleasing, see List of Local to Global principles. (I don't know of a really nice writeup of local-to-global principles that isn't just about the number-theoretic one, but maybe one day–)

Demo

When I originally conceived this algorithm, I tried it out on a set of my own Facebook messages – I had a bunch that were in Croatian and a bunch that were in English, so I taught the model to differentiate between them. Since I would rather not share my personal Facebook messages, the demonstration in this repo instead has the model learn to differentiate between languages in the European Parliamentary Proceedings dataset. As you can see in demo/languages.ipynb, the model does really well, approaching 100% accuracy for many language pairs and hyperparameter settings.

In order to run the demo/languages.ipynb notebook yourself, you first have to run the scripts that download and prepare the dataset. Execute these two scripts in this order (see the commands below the list):

  1. demo/dataset_utils/get_dataset.py
  2. demo/dataset_utils/extract_language_text.py
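
Assuming they are run as plain Python scripts from the repository root (the exact invocation may differ in your setup):

    python demo/dataset_utils/get_dataset.py
    python demo/dataset_utils/extract_language_text.py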

Installing

Just run

pip install shapaclass

Or, alternatively, clone this repository. If you want to run the demo, you will have to clone the repository, because only the algorithm itself is on PyPI.

Dependencies

In order to run the algorithm itself, you need the following (these are installed automatically with pip):

  • NumPy (>= 1.19.2)
  • NetworkX (>= 2.5)

Additionally, to run the example provided in the GitHub repo, and all its constituent parts, you need:

  • BeautifulSoup4 (>= 4.10.0)
  • ProgressBar33 (>= 2.4)
  • Matplotlib (>= 3.3.2)

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

shapaclass-0.1.2.tar.gz (7.0 kB)

Uploaded Source

Built Distribution

shapaclass-0.1.2-py3-none-any.whl (6.7 kB)

Uploaded Python 3

File details

Details for the file shapaclass-0.1.2.tar.gz.

File metadata

  • Download URL: shapaclass-0.1.2.tar.gz
  • Upload date:
  • Size: 7.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/4.8.1 pkginfo/1.6.1 requests/2.24.0 requests-toolbelt/0.9.1 tqdm/4.50.2 CPython/3.8.5

File hashes

Hashes for shapaclass-0.1.2.tar.gz

  • SHA256: 73240876951ade9dec9c17d6a49fe8e49178454a4dd71af837bce606184a048c
  • MD5: 8880fbfd3e650ff212fe17520386f24b
  • BLAKE2b-256: daf3f2ecfb65b80bcc0d662672e0b2b18bfb7e8d143007093049bb0eef6efee9

See more details on using hashes here.

File details

Details for the file shapaclass-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: shapaclass-0.1.2-py3-none-any.whl
  • Upload date:
  • Size: 6.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/4.8.1 pkginfo/1.6.1 requests/2.24.0 requests-toolbelt/0.9.1 tqdm/4.50.2 CPython/3.8.5

File hashes

Hashes for shapaclass-0.1.2-py3-none-any.whl

  • SHA256: 15b375d75014c99368c83dfc93cd0a2a5fbefa15b1019abd64b282f84ac0a4df
  • MD5: fe46af0978c3cee4bc7ba41a33aadf95
  • BLAKE2b-256: f95b51fbca3c4dc02bd8a11f3cc9395f9352ad531009dc0bd140db6014ccd6ae

See more details on using hashes here.
