Simple Similarity Service
simsity
Simsity is a Super Simple Similarities Service[tm].
It's all about building a neighborhood. Literally!
This repository contains simple tools for similarity retrieval. It provides a convenient wrapper around encoding strategies and nearest neighbor approaches. Typical use cases include early-stage bulk labelling and duplicate discovery.
Install
You can install simsity via pip.
python -m pip install simsity
Quickstart
This is the basic setup for this package.
from simsity.service import Service
from simsity.datasets import fetch_clinc
from simsity.indexer import PyNNDescentIndexer
from simsity.preprocessing import Identity, ColumnLister
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
# The encoder defines how we encode the data going in.
encoder = make_pipeline(
    ColumnLister(column="text"),
    CountVectorizer()
)
# The indexer handles the nearest neighbor lookup.
indexer = PyNNDescentIndexer(metric="euclidean", n_neighbors=2)
# The service combines the two into a single object.
service_clinc = Service(
    encoder=encoder,
    indexer=indexer,
)
# We can now train the service using this data.
df_clinc = fetch_clinc()
# Important for later: we're only passing the 'text' column to encode
service_clinc.train_from_dataf(df_clinc, features=["text"])
# Query the datapoints
# Note that the keyword argument here refers to the 'text' column
service_clinc.query(text="give me directions", n_neighbors=20)
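To make the encode-then-lookup idea concrete, here is a minimal standard-library sketch of the same two steps. It is not simsity's implementation. The helper names (`encode`, `euclidean`, `query`) and the tiny corpus are hypothetical, and an exact brute-force search stands in for the approximate search that PyNNDescent performs.

```python
import math
from collections import Counter


def encode(text):
    """Hypothetical encoder: bag-of-words counts, similar in spirit to CountVectorizer."""
    return Counter(text.lower().split())


def euclidean(a, b):
    """Euclidean distance between two sparse count vectors."""
    keys = set(a) | set(b)
    # Counter returns 0 for missing keys, so absent words count as zero.
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in keys))


def query(corpus, text, n_neighbors=2):
    """Brute-force lookup: score every document and keep the closest ones."""
    encoded_query = encode(text)
    scored = [(euclidean(encoded_query, encode(doc)), doc) for doc in corpus]
    scored.sort(key=lambda pair: pair[0])
    return scored[:n_neighbors]


# Illustrative data only; not taken from the clinc dataset.
corpus = [
    "give me directions to the airport",
    "what is my account balance",
    "give me directions home",
]
results = query(corpus, "give me directions", n_neighbors=2)
for distance, doc in results:
    print(round(distance, 3), doc)
```

Brute force scales linearly with the corpus size, which is why a dedicated indexer is worth having for larger datasets.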
If you'd like, you can also save the service to disk and load it again.
# Save the entire system
service_clinc.save("/tmp/simple-model")
# You can also load the model now.
reloaded = Service.load("/tmp/simple-model")
You could even run it as a webservice if you were so inclined.
reloaded.serve(host='0.0.0.0', port=8080)
You can now POST to http://0.0.0.0:8080/query with payload:
{"query": {"text": "hello there"}, "n_neighbors": 20}
Note that the query content here refers to the "text" column once again.
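As a sketch of what a client call could look like, the snippet below builds the POST request with Python's standard library. The helper names `build_query_request` and `post_query` are hypothetical. It assumes the service is reachable on `localhost:8080`, that the endpoint accepts a JSON body, and that it responds with JSON. None of this is confirmed beyond the payload shape shown above.

```python
import json
from urllib import request


def build_query_request(text, n_neighbors=20, base_url="http://localhost:8080"):
    """Build (but do not send) a POST request for the /query endpoint."""
    # The payload mirrors the example above: the "text" key matches the column name.
    payload = {"query": {"text": text}, "n_neighbors": n_neighbors}
    return request.Request(
        url=f"{base_url}/query",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_query(text, n_neighbors=20, base_url="http://localhost:8080"):
    """Send the request; assumes the server answers with a JSON body."""
    req = build_query_request(text, n_neighbors, base_url)
    with request.urlopen(req) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the service running, `post_query("hello there")` would send the same payload as the example above.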
Examples
Check the examples folder for some interesting use-cases and tool integrations.
In particular:
- benchmark.ipynb demonstrates how you might benchmark simsity
- votes-example.ipynb demonstrates how to label similar data using pigeon and simsity
- text-widget-example.ipynb demonstrates how to add interactivity with ipywidgets
Built Distribution
Hashes for simsity-0.1.1-py2.py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 2c2b773248bb97e631b8a89e1ee12e5a7519b9a636a68b0160b4b33f2c845b60
MD5 | a45502c5aac50468934dcc1a50f20649
BLAKE2b-256 | a825f76a13f962f144db400883d682fcc97814e5918633b33b2fd6c3e3c5b0e1