Skip to main content

Fit & Predict Utils

Project description

fitxf

Simple math utility library

  • Basic math that don't exist in numpy as single function call
  • Simple math for optimal clusters not in sklearn
  • Simple transform wrappers to simplify tensor transforms or compression via clustering (Euclid distance or cosine) or PCA
  • allow to do searches using transformed data
  • allow to save/load transform model to/from string & fine-tuning
pip install fitxf

Basic Math

Cosine or dot similarity between multi-dim vectors

import numpy as np
from fitxf import TensorUtils
ts = TensorUtils()
x = np.random.rand(5,3)
y = np.random.rand(10,3)

# Cosine similarity, find closest matches of each vector in x
# with all vectors in ref
# For Euclidean distance, just replace with "similarity_distance"
ts.similarity_cosine(x=x, ref=x, return_tensors='np')
ts.similarity_cosine(x=y, ref=y, return_tensors='np')
matches, dotsim = ts.similarity_cosine(x=x, ref=y, return_tensors='np')
print("matches",matches)
print("dot similarities",dotsim)

Clustering

Auto clustering into optimal n clusters, via heuristic manner

Case 1: All clusters (think towns) are almost-equally spaced

  • in this case, suppose optimal cluster centers=n (think salesmen)
  • if number of clusters k<n, then each salesman need to cover a larger area, and their average distances from each other is smaller
  • if number of clusters k>n, then things become a bit crowded, with more than 1 salesman covering a single town
  • Thus at transition from n --> n+1 clusters, the average distance between cluster centers will decrease

Case 2: Some clusters are spaced much larger apart

In this case, there will be multiple turning points, and we may take an earlier turning point or later turning points

Optimal cluster by Euclidean Distance

from fitxf import Cluster
x = np.array([
    [5, 1, 1], [8, 2, 1], [6, 0, 2],
    [1, 5, 1], [2, 7, 1], [0, 6, 2],
    [1, 1, 5], [2, 1, 8], [0, 2, 6],
])
obj = Cluster()
obj.kmeans_optimal(
    x = x,
    estimate_min_max = True,
)

Optimal cluster by cosine distance

from fitxf import ClusterCosine
x = np.random.rand(20,3)
ClusterCosine().kmeans_optimal(x=x)

Fit Transform

Convenient wrapper

  • fit a set of vectors into compressed PCA, clusters, etc.
  • predict via cosine similarity, Euclidean distance of arbitrary vectors
  • fine tune

Sample code for basic training to transform data -

from fitxf import FitXformPca, FitXformCluster
import numpy as np
x = np.array([
    [5, 1, 1], [8, 2, 1], [6, 0, 2],
    [1, 5, 1], [2, 7, 1], [0, 6, 2],
    [1, 1, 5], [2, 1, 8], [0, 2, 6],
])
user_labels = [
    'a', 'a', 'a',
    'b', 'b', 'b',
    'c', 'c', 'c',
]
pca = FitXformPca()
res_fit_pca = pca.fit_optimal(X=x, X_labels=user_labels)
print('X now reduced to\n',res_fit_pca['X_transform'])

cls = FitXformCluster()
res_fit_cls = cls.fit_optimal(X=x, X_labels=user_labels)
print('X now reduced to\n',res_fit_cls['X_transform'])

pca.predict(X=x+np.random.rand(9,3))
cls.predict(X=x+np.random.rand(9,3))

From above code same you will see the original X of 3 dimensions were reduced to 2.

X now reduced to
 [[ 3.31282121 -0.0595825 ]
 [ 5.50813438  0.41970015]
 [ 4.04773021 -1.54803569]
 [-1.35557188  3.11843837]
 [-1.16795985  4.47103187]
 [-2.95485944  3.21899561]
 [-1.77834736 -2.52015183]
 [-2.23431234 -4.68080139]
 [-3.37763491 -2.41959458]]

whereas for clustering

X now reduced to
 [2 2 2 0 0 0 1 1 1]

where each point in 3 dimensions is represented by only a scalar center label.

Save Model to String & Load Back

Sample code to save and load model -

# Save this Base64 string somewhere
model_save = pca.model_to_b64json(numpy_to_base64_str=True, dump_to_b64json_str=True)

# Load back into new instance
new = FitXformPca()
new.load_model_from_b64json(model_b64json=model_save)
new.predict(X=x+np.random.rand(9,3))

Fine Tuning

After saving models & loading back, one can fine tune with new points.

Test fine tune with same centers and data

cls = FitXformCluster()
res_fit_cls = cls.fit_optimal(X=x, X_labels=user_labels)
centers = res_fit_cls["centers"]

res = cls.fine_tune(
   X = x,
   X_labels = user_labels,
   n_components = 3,
)
[print(k,v) for k,v in res.items()]
print('Expect 1 iteration, got ', res["n_iter"])

print('old centers',centers)
print('new centers',res['centers'])

Fine tune with new points

x = np.array([
    [5, 1, 1], [8, 2, 1], [6, 0, 2],
    [1, 5, 1], [2, 7, 1], [0, 6, 2],
    # remove last 2 points
    [1, 1, 5], # [2, 1, 8], [0, 2, 6],
    # new points
    [4, 2, 1], [0, 6, 2], [1, 1, 7],
])
user_labels = [
    'a', 'a', 'a',
    'b', 'b', 'b',
    # remove last 2 points
    'c', # 'c', 'c',
    'a', 'b', 'c',
]
res_fit_new = cls.fine_tune(X=x, X_labels=user_labels, n_components=3)

print('old centers',centers)
print('new centers',res_fit_new['centers'])

Tensor DB Models

The math is simple, but the technical implementation is orders of magnitude far more complicated. Transformed data needs to be updated to storage for subsequent level searches, model & data needs to be kept in sync, etc.

To be updated..

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

fitxf-0.2.53.tar.gz (256.2 kB view hashes)

Uploaded Source

Built Distribution

fitxf-0.2.53-py3-none-any.whl (277.6 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page