Search for the strings most similar to a query in Python 3. State-of-the-art algorithms and data structures are adopted for best efficiency. For both flexibility and efficiency, only set-based similarities are supported right now, including Jaccard and Tversky.
This package is available on PyPI. To install it, run:

    pip3 install -U TopSim
You can use the algorithm directly from the terminal.
    Usage:
      topsim-cli <query> [options] [<file>]

    Options:
      -I                     Case-sensitive matching.
      -k <k>                 Maximum number of search results. [default: 1]
      --tie                  Include all the results with the same similarity of
                             the "k"-th result. May return more than "k" results.
      -s, --search           Search the query within each line rather than against
                             the whole line, by preferring partial matching of the
                             line. Tversky similarity is used instead of Jaccard
                             similarity.
      -e <e>                 Parameter for "tversky" similarity. [default: 0.001]
      --mapping=<mapping>    Map each string to a set of either "gram"s or "word"s.
                             [default: gram]
      --numgrams=<numgrams>  Number of characters for each gram when mapping by
                             "gram". [default: 2]
      --quiet                Do not print additional information to standard error.
- The query is matched against each line of the input file (or standard input).
- In the output, each line and its similarity are separated by a tab character.
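The interaction between -k and --tie can be sketched in a few lines of Python. This is a brute-force illustration only, not the algorithm the package uses; the function name top_k and the (line, similarity) pair layout are hypothetical, and the scores are taken from the Jaccard example in this README.

```python
def top_k(scored, k=1, tie=False):
    """Return the k best (line, similarity) pairs, best first.

    With tie=True, every pair whose similarity equals that of the
    k-th result is kept as well, so more than k pairs may be returned.
    """
    if k <= 0:
        return []
    # sorted() is stable, so equally scored lines keep their input order.
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    if not tie or len(ranked) <= k:
        return ranked[:k]
    cutoff = ranked[k - 1][1]
    return [pair for pair in ranked if pair[1] >= cutoff]


scored = [
    ("git", 1.0),
    ("wait", 0.2857),
    ("git-shell", 0.2727),
    ("pluginkit", 0.2727),
    ("kinit", 0.25),
]
print(top_k(scored, k=3))            # three results
print(top_k(scored, k=3, tie=True))  # four results: "pluginkit" ties with the third
```

Sorting every line is the simplest way to show the semantics; a real implementation would avoid scoring and sorting the whole input.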
Alternatively, you can use the algorithm through the Python API.
    from topsim import TopSim

    ts = TopSim([
        "python2",
        "python2.7",
        "python3",
        "python3.6",
    ])

    print(ts.search("python", k=3))
    # Return each similarity and the respective line numbers.
- Please check topsim.py for more optional parameters, such as the similarity function.
- Search for the most similar line.

    ls /usr/bin | topsim-cli "top"
- Search for the three most similar lines.

    ls /usr/bin | topsim-cli "top" -k 3

    top     1.0
    tops    0.5
    iotop   0.4286
- Jaccard similarity is used by default, which puts equal weight on matching the query and matching the line.

    ls /usr/bin | topsim-cli "git" -k 5

    git         1.0
    wait        0.2857
    git-shell   0.2727
    pluginkit   0.2727
    kinit       0.25
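The scores above can be reproduced with a small sketch of Jaccard similarity over character bigrams. Two details are assumptions inferred from the example output rather than taken from the package's source: each string is padded with a boundary marker (here "$") before being split into grams, and matching is case-insensitive unless -I is given. The names grams and jaccard are illustrative.

```python
def grams(text, n=2, pad="$"):
    """Map a string to its set of character n-grams.

    Assumption: the string is lowercased and padded with n - 1
    boundary markers on each side before splitting.
    """
    padded = pad * (n - 1) + text.lower() + pad * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(query, line, n=2):
    """Shared grams divided by all distinct grams of both strings."""
    a, b = grams(query, n), grams(line, n)
    return len(a & b) / len(a | b)


for line in ["git", "wait", "git-shell", "pluginkit", "kinit"]:
    print(line, round(jaccard("git", line), 4), sep="\t")
```

For example, "git" and "wait" share the grams "it" and "t$" out of seven distinct grams, giving 2/7, which rounds to 0.2857.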
- Use Tversky similarity, which puts most weight on matching the query. Ideal when searching within long lines.

    ls /usr/bin | topsim-cli "git" -k 5 -s

    git               1.0
    git-shell         0.7489
    pluginkit         0.7489
    git-cvsserver     0.7481
    git-upload-pack   0.7478
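A minimal sketch of the Tversky index shows why long lines are barely penalized. The weighting used here, 1 - e on grams found only in the query and e on grams found only in the line, is an assumption inferred from the example output with the default e of 0.001; it is not confirmed against the package's source. The sketch uses plain sets and does not attempt to model how the package treats grams that repeat within a line, so scores for such lines (for example "git-cvsserver") may differ slightly.

```python
def grams(text, n=2, pad="$"):
    """Set of padded character n-grams (padding scheme is an assumption)."""
    padded = pad * (n - 1) + text.lower() + pad * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def tversky(query, line, e=0.001, n=2):
    """Tversky index with assumed weights alpha = 1 - e and beta = e."""
    q_grams, line_grams = grams(query, n), grams(line, n)
    common = len(q_grams & line_grams)
    only_query = len(q_grams - line_grams)  # weighted heavily
    only_line = len(line_grams - q_grams)   # almost ignored
    return common / (common + (1 - e) * only_query + e * only_line)


for line in ["git", "git-shell", "pluginkit", "git-upload-pack"]:
    print(line, round(tversky("git", line), 4), sep="\t")
```

With a small e, the score is close to the fraction of the query's grams found in the line, and the line's extra grams act mostly as a tie-breaker in favor of shorter lines.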
- For n-gram mapping, a higher n can result in better accuracy but fewer matches.

    ls /usr/bin | topsim-cli "git" -k 5 -s --numgrams=3

    git                1.0
    git-shell          0.5993
    git-cvsserver      0.5988
    git-upload-pack    0.5986
    git-receive-pack   0.5984
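The effect of the gram size can be illustrated by listing the grams two lines share with the query. The helper names and the padding with "$" are assumptions inferred from the example output, not the package's documented behavior.

```python
def grams(text, n, pad="$"):
    """Set of padded character n-grams (padding scheme is an assumption)."""
    padded = pad * (n - 1) + text.lower() + pad * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def shared(query, line, n):
    """Sorted list of the grams that the query and the line have in common."""
    return sorted(grams(query, n) & grams(line, n))


for n in (2, 3):
    print(n, "git-shell", shared("git", "git-shell", n))
    print(n, "pluginkit", shared("git", "pluginkit", n))
```

With bigrams, both lines share three grams with "git". With trigrams, "git-shell" still shares three ("$$g", "$gi", "git") while "pluginkit" shares only two ("it$", "t$$"), which is consistent with "pluginkit" dropping out of the top five in the output above.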
- Full support of Chinese/Japanese/Korean.

    地三鲜
    红烧肉
    烤全牛
    木须肉
    土豆炖牛肉

    cat test | topsim-cli "牛肉" -k 3 -s

    土豆炖牛肉   0.666
    红烧肉       0.3332
    木须肉       0.3332
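Character grams work for CJK text because a Python 3 str is a sequence of Unicode code points, so slicing never splits a character. The sketch below ranks the five lines above against the query; the padding marker "$" and the Tversky weights 1 - e and e are assumptions inferred from the example output, and the function names are illustrative.

```python
def grams(text, n=2, pad="$"):
    """Set of n-grams over Unicode code points, padded at both ends."""
    padded = pad * (n - 1) + text + pad * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def tversky(query, line, e=0.001):
    """Tversky index with assumed weights 1 - e (query) and e (line)."""
    q_grams, line_grams = grams(query), grams(line)
    common = len(q_grams & line_grams)
    return common / (
        common + (1 - e) * len(q_grams - line_grams) + e * len(line_grams - q_grams)
    )


dishes = ["地三鲜", "红烧肉", "烤全牛", "木须肉", "土豆炖牛肉"]
# sorted() is stable, so equally scored lines keep their file order.
ranked = sorted(dishes, key=lambda dish: tversky("牛肉", dish), reverse=True)
for dish in ranked[:3]:
    print(dish, round(tversky("牛肉", dish), 4), sep="\t")
```

Note that "烤全牛" scores zero here: its final gram is "牛$", while the query contains "$牛" and "牛肉", so no gram is shared.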
I strongly encourage using PyPy instead of CPython to run the script for best performance.