Hubness reduction and analysis tools

Project description

scikit-hubness

scikit-hubness comprises tools for the analysis and reduction of hubness in high-dimensional data. Hubness is an aspect of the curse of dimensionality and is detrimental to many machine learning and data mining tasks.
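
To make the notion of hubness concrete, the following is a minimal pure-Python sketch, not the library's implementation. It counts how often each point appears among the k nearest neighbors of the other points (the k-occurrence) and reports the skewness of that distribution. The helper names (k_occurrence, skewness, random_points) and the choice of Gaussian toy data are assumptions made for illustration only. On data like this, the skewness is typically larger in the high-dimensional case, but the exact values depend on the random sample.

```python
import random


def k_occurrence(points, k):
    """Count how often each point is among the k nearest neighbors of the others."""
    n = len(points)
    counts = [0] * n
    for i in range(n):
        # squared Euclidean distances from point i to every other point
        dists = [
            (sum((a - b) ** 2 for a, b in zip(points[i], points[j])), j)
            for j in range(n)
            if j != i
        ]
        dists.sort()
        for _, j in dists[:k]:
            counts[j] += 1
    return counts


def skewness(values):
    """Third standardized moment (population skewness) of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5 if m2 > 0 else 0.0


def random_points(n, dim, rng):
    """Draw n points with independent standard normal coordinates (toy data)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
for dim in (2, 100):
    pts = random_points(150, dim, rng)
    occ = k_occurrence(pts, k=5)
    print(f"dim={dim:3d}  k-skewness={skewness(occ):.3f}  max k-occurrence={max(occ)}")
```

A strongly right-skewed k-occurrence distribution means that a few points (hubs) appear in many neighbor lists while many points appear in few or none.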

The skhubness.analysis and skhubness.reduction packages allow you to

analyze whether your data sets show hubness
reduce hubness via a variety of different techniques
perform downstream analysis (performance assessment) with scikit-learn, thanks to compatible data structures
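
As an illustration of what one such reduction technique does, here is a minimal sketch of empirical mutual proximity on a small, precomputed distance matrix. It is written from the general idea in the hubness literature and is not the code used by scikit-hubness. The function name mutual_proximity_distances and the normalization by n - 2 (the number of other objects) are assumptions made for this example. Two objects are considered close if most other objects are farther away from both of them.

```python
def mutual_proximity_distances(dist):
    """Turn a symmetric distance matrix into empirical mutual proximity distances.

    For each pair (x, y), count the other objects j that are farther from both
    x and y than x and y are from each other. The share of such objects is a
    similarity in [0, 1]; one minus that share is returned as the new distance.
    """
    n = len(dist)
    if n < 3:
        raise ValueError("need at least three objects")
    secondary = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for y in range(x + 1, n):
            d_xy = dist[x][y]
            both_farther = sum(
                1
                for j in range(n)
                if j != x and j != y and dist[x][j] > d_xy and dist[y][j] > d_xy
            )
            # assumption: normalize by the n - 2 objects other than x and y
            similarity = both_farther / (n - 2)
            secondary[x][y] = secondary[y][x] = 1.0 - similarity
    return secondary


# three points on a line at positions 0, 1 and 5
toy = [
    [0.0, 1.0, 5.0],
    [1.0, 0.0, 4.0],
    [5.0, 4.0, 0.0],
]
for row in mutual_proximity_distances(toy):
    print(row)
```

This brute-force version needs cubic time in the number of objects, so it is only meant to show the idea on tiny inputs.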

The skhubness.neighbors package acts as a drop-in replacement for sklearn.neighbors. In addition to the functionality inherited from scikit-learn, it also features

approximate nearest neighbor search
hubness reduction
and combinations,

which allows for fast hubness-reduced neighbor search in large datasets (tested with >1M objects).

We follow the API conventions and code style of scikit-learn.

Installation

Make sure you have a working Python 3 environment (version 3.7 or later).

Use pip to install the latest stable version of scikit-hubness from PyPI:

pip install scikit-hubness

Dependencies are installed automatically, if necessary. scikit-hubness requires numpy, scipy and scikit-learn. Approximate nearest neighbor search and approximate hubness reduction additionally require at least one of the following packages:

nmslib for hierarchical navigable small-world graphs ('hnsw')
ngtpy for nearest neighbor graphs ('nng'), and variants (ANNG, ONNG)
puffinn for locality-sensitive hashing ('lsh')
falconn for alternative LSH ('falconn_lsh'), or
annoy for random projection forests ('rptree').

Some modules require tqdm or joblib. All these packages are available from open repositories, such as PyPI.
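
To see which of the optional packages are present in the current environment, a small check along the following lines may help. It is a sketch that uses only the standard library. It assumes that each package is imported under the same name it is listed with above, and the function name available_ann_methods is made up for this example.

```python
from importlib.util import find_spec

# optional package (assumed import name) -> method name as listed above
ANN_BACKENDS = {
    "nmslib": "hnsw",
    "ngtpy": "nng",
    "puffinn": "lsh",
    "falconn": "falconn_lsh",
    "annoy": "rptree",
}


def available_ann_methods():
    """Return a dict mapping each method name to True if its package can be found."""
    return {
        method: find_spec(package) is not None
        for package, method in ANN_BACKENDS.items()
    }


for method, installed in available_ann_methods().items():
    status = "available" if installed else "not installed"
    print(f"{method:12s} {status}")
```

The check only looks for the packages; it does not import them or verify that they work.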

For more details and alternatives, please see the Installation instructions.

Documentation

Documentation is available online: http://scikit-hubness.readthedocs.io/en/latest/index.html

What's new

See the changelog to find what's new in the latest package version.

Quickstart

Users of scikit-hubness may want to

analyze whether their data show hubness
reduce hubness
perform learning (classification, regression, ...)

The following example shows all these steps for an example dataset from the text domain (dexter). Please make sure you have installed scikit-hubness before running it.

# load the example dataset 'dexter'
from skhubness.data import load_dexter
X, y = load_dexter()

# dexter is embedded in a high-dimensional space,
# and could, thus, be prone to hubness
print(f'X.shape = {X.shape}, y.shape={y.shape}')

# assess the actual degree of hubness in dexter
from skhubness import Hubness
hub = Hubness(k=10, metric='cosine')
hub.fit(X)
k_skew = hub.score()
print(f'Skewness = {k_skew:.3f}')

# additional hubness indices are available, for example:
print(f'Robin hood index: {hub.robinhood_index:.3f}')
print(f'Antihub occurrence: {hub.antihub_occurrence:.3f}')
print(f'Hub occurrence: {hub.hub_occurrence:.3f}')

# There is considerable hubness in dexter.
# Let's see whether hubness reduction can improve
# kNN classification performance
from sklearn.model_selection import cross_val_score
from skhubness.neighbors import KNeighborsClassifier

# vanilla kNN
knn_standard = KNeighborsClassifier(n_neighbors=5,
                                    metric='cosine')
acc_standard = cross_val_score(knn_standard, X, y, cv=5)

# kNN with hubness reduction (mutual proximity)
knn_mp = KNeighborsClassifier(n_neighbors=5,
                              metric='cosine',
                              hubness='mutual_proximity')
acc_mp = cross_val_score(knn_mp, X, y, cv=5)

print(f'Accuracy (vanilla kNN): {acc_standard.mean():.3f}')
print(f'Accuracy (kNN with hubness reduction): {acc_mp.mean():.3f}')

# Accuracy was considerably improved by mutual proximity.
# Did it actually reduce hubness?
hub_mp = Hubness(k=10, metric='cosine',
                 hubness='mutual_proximity')
hub_mp.fit(X)
k_skew_mp = hub_mp.score()
print(f'Skewness after MP: {k_skew_mp:.3f} '
      f'(reduction of {k_skew - k_skew_mp:.3f})')
print(f'Robin hood: {hub_mp.robinhood_index:.3f} '
      f'(reduction of {hub.robinhood_index - hub_mp.robinhood_index:.3f})')

# The neighbor graph can also be created directly,
# with or without hubness reduction
from skhubness.neighbors import kneighbors_graph
neighbor_graph = kneighbors_graph(X, n_neighbors=5, hubness='mutual_proximity')

Check the User Guide for additional example usage.
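
The additional hubness indices printed in the example can all be derived from the k-occurrence counts, that is, how often each object appears in the neighbor lists of the others. The sketch below follows definitions commonly used in the hubness literature and is not taken from the library's source. The function names and the hub threshold of hub_size * k (with hub_size=2.0) are assumptions made for illustration.

```python
def robinhood_index(k_occurrence):
    """Share of neighbor slots that would need to be redistributed
    for all objects to have the same k-occurrence."""
    total = sum(k_occurrence)
    mean = total / len(k_occurrence)
    return 0.5 * sum(abs(c - mean) for c in k_occurrence) / total


def antihub_occurrence(k_occurrence):
    """Fraction of objects that never appear in any neighbor list."""
    return sum(1 for c in k_occurrence if c == 0) / len(k_occurrence)


def hub_occurrence(k_occurrence, k, hub_size=2.0):
    """Fraction of all neighbor slots that are filled by hubs.

    Assumption: an object counts as a hub if its k-occurrence is at least
    hub_size * k.
    """
    threshold = hub_size * k
    hub_slots = sum(c for c in k_occurrence if c >= threshold)
    return hub_slots / (k * len(k_occurrence))


# toy k-occurrence counts for six objects with k=2 (the counts sum to 6 * 2)
counts = [0, 0, 1, 1, 2, 8]
print(f"Robin hood index:   {robinhood_index(counts):.3f}")
print(f"Antihub occurrence: {antihub_occurrence(counts):.3f}")
print(f"Hub occurrence:     {hub_occurrence(counts, k=2):.3f}")
```

In this toy case one object fills 8 of the 12 neighbor slots, so all three indices are high.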

Development

The developers of scikit-hubness welcome all kinds of contributions! Get in touch with us if you have comments, would like to see an additional feature implemented, would like to contribute code or have any other kind of issue. Don't hesitate to file an issue here on GitHub.

For more information about contributing, please have a look at the contributors guidelines.

(c) 2018-2019, Roman Feldbauer
Austrian Research Institute for Artificial Intelligence (OFAI) and
University of Vienna, Division of Computational Systems Biology (CUBE)
Contact: <roman.feldbauer@univie.ac.at>

Citation

A software publication paper is currently in preparation. Until then, if you use scikit-hubness in your scientific publication, please cite:

@INPROCEEDINGS{8588814,
author={R. {Feldbauer} and M. {Leodolter} and C. {Plant} and A. {Flexer}},
booktitle={2018 IEEE International Conference on Big Knowledge (ICBK)},
title={Fast Approximate Hubness Reduction for Large High-Dimensional Data},
year={2018},
volume={},
number={},
pages={358-367},
keywords={computational complexity;data analysis;data mining;mobile computing;public domain software;software packages;mobile device;open source software package;high-dimensional data mining;fast approximate hubness reduction;massive mobility data;linear complexity;quadratic algorithmic complexity;dimensionality curse;Complexity theory;Indexes;Estimation;Data mining;Approximation algorithms;Time measurement;curse of dimensionality;high-dimensional data mining;hubness;linear complexity;interpretability;smartphones;transport mode detection},
doi={10.1109/ICBK.2018.00055},
ISSN={},
month={Nov},}

The technical report Fast approximate hubness reduction for large high-dimensional data is available at OFAI.

Additional reading

Local and Global Scaling Reduce Hubs in Space, Journal of Machine Learning Research 2012, Link.

A comprehensive empirical comparison of hubness reduction in high-dimensional spaces, Knowledge and Information Systems 2018, DOI.

License

scikit-hubness is licensed under the terms of the BSD-3-Clause license. The skhubness.neighbors package was modified from sklearn.neighbors, distributed under the same license. Users can, therefore, safely use scikit-hubness in the same way they use scikit-learn.

Note: Individual files contain the following tag instead of the full license text.

    SPDX-License-Identifier: BSD-3-Clause

This enables machine processing of license information based on the SPDX License Identifiers, which are available here: https://spdx.org/licenses/
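
As a small illustration of such machine processing, the following sketch extracts the tag from the text of a source file with a regular expression. The function name find_spdx_identifier is hypothetical, and the pattern only handles a single identifier such as BSD-3-Clause, not compound SPDX license expressions.

```python
import re

# matches e.g. "SPDX-License-Identifier: BSD-3-Clause"
SPDX_PATTERN = re.compile(r"SPDX-License-Identifier:\s*([A-Za-z0-9.+\-]+)")


def find_spdx_identifier(source_text):
    """Return the first SPDX license identifier found in the text, or None."""
    match = SPDX_PATTERN.search(source_text)
    return match.group(1) if match else None


example = "# SPDX-License-Identifier: BSD-3-Clause\nimport numpy as np\n"
print(find_spdx_identifier(example))
```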

Acknowledgements

Several parts of scikit-hubness adapt code from scikit-learn. We thank all the authors and contributors of this project for the tremendous work they have done.

PyVmMonitor is being used to support the development of this free open source software package. For more information go to http://www.pyvmmonitor.com

Project details

Release history

0.21.2 (this version): Jan 14, 2020
0.21.1: Dec 10, 2019
0.21.0: Nov 25, 2019
0.21.0a9 (pre-release): Oct 30, 2019
0.21.0a8 (pre-release): Sep 12, 2019
0.21.0a7 (pre-release): Jul 17, 2019
0.21.0a6 (pre-release): Jul 17, 2019
0.21.0a5 (pre-release): Jul 17, 2019
0.21.0a4 (pre-release): Jul 16, 2019
0.21.0a2 (pre-release): Jul 16, 2019

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release. See tutorial on generating distribution archives.

Built Distributions

scikit_hubness-0.21.2-py3-py37-win_amd64.whl (401.0 kB)

Uploaded Jan 14, 2020, Python 3, Windows x86-64

scikit_hubness-0.21.2-py3-none-any.whl (517.5 kB)

Uploaded Jan 14, 2020, Python 3

File details

Details for the file scikit_hubness-0.21.2-py3-py37-win_amd64.whl.

File metadata

Download URL: scikit_hubness-0.21.2-py3-py37-win_amd64.whl
Upload date: Jan 14, 2020
Size: 401.0 kB
Tags: Python 3, Windows x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.6.0.post20191030 requests-toolbelt/0.9.1 tqdm/4.36.1 CPython/3.7.5

File hashes

Hashes for scikit_hubness-0.21.2-py3-py37-win_amd64.whl
Algorithm	Hash digest
SHA256	`09997223bd4e0e1a100433cc5054694c51b41fb30009207c3cca71eacc5bca6f`
MD5	`211bf0ba432c5971b35c334372e9dfa4`
BLAKE2b-256	`efa5180bb0b1b0ee5e76c90bf3e13eaf5ef0a479ea1e76a0eec73a94ae372012`

See more details on using hashes here.

File details

Details for the file scikit_hubness-0.21.2-py3-none-any.whl.

File metadata

Download URL: scikit_hubness-0.21.2-py3-none-any.whl
Upload date: Jan 14, 2020
Size: 517.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.6.0.post20191030 requests-toolbelt/0.9.1 tqdm/4.36.1 CPython/3.7.5

File hashes

Hashes for scikit_hubness-0.21.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9fd87fc69fefad6a192edc0ded497b922391379c85ed435cb151f8bd68430471`
MD5	`6be95310ef862363f12110a1eb4257ca`
BLAKE2b-256	`038820f43676fbe1528fcfaff217601789a498c11e021404c6f24f48a73922b6`

See more details on using hashes here.
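
A downloaded wheel can be checked against the SHA256 digests listed above with the standard library. The following is a minimal sketch. The function names are made up for this example, and the file path in the usage comment is hypothetical and depends on where the wheel was saved.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Return True if the file's SHA256 digest equals the expected hex string."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Usage (hypothetical local path, digest copied from the listing above):
# matches_expected(
#     "scikit_hubness-0.21.2-py3-none-any.whl",
#     "9fd87fc69fefad6a192edc0ded497b922391379c85ed435cb151f8bd68430471",
# )
```

If the function returns False, the file differs from the one that was uploaded and should not be installed.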
