Search for the strings most similar to a query in Python 3.

Project description


TopSim searches for the strings most similar to a query in Python 3. It adopts state-of-the-art algorithms and data structures for efficiency. To balance flexibility and efficiency, only set-based similarities are currently supported: Jaccard and Tversky.

Installation

This package is available on PyPI. Install it with pip3 install -U TopSim.

CLI Usage

You can run the algorithm directly from the terminal.

Usage:
    topsim-cli <query> [options] [<file>]


Options:
    -I                     Case-sensitive matching.
    -k <k>                 Maximum number of search results. [default: 1]
    --tie                  Include all the results with the same similarity of the "k"-th result. May return more than "k" results.

    -s, --search           Search the query within each line rather than against the whole line, by preferring partial matching of the line.
                           Tversky similarity is used instead of Jaccard similarity.
    -e <e>                 Parameter for "tversky" similarity. [default: 0.001]

    --mapping=<mapping>    Map each string to a set of either "gram"s or "word"s. [default: gram]
    --numgrams=<numgrams>  Number of characters for each gram when mapping by "gram". [default: 2]

    --quiet                Do not print additional information to standard error.
  • The query is matched against each line of the input file (or of standard input if no file is given).
  • In the output, each line and its similarity are separated by a tab character (\t).
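The --mapping and --numgrams options control how each string becomes a set before any similarity is computed. The following is a minimal sketch of that idea, not the package's actual code: the function names, the "$" padding character, padding at both ends, and whitespace splitting for words are all assumptions made for illustration.

```python
def to_grams(text, n=2, pad="$"):
    """Map a string to its set of character n-grams (hypothetical helper).

    Assumption: the string is padded with n - 1 boundary characters on each
    side, so prefixes and suffixes get grams of their own. A real
    implementation would need a pad character that cannot occur in the input.
    """
    # Lowercase to mimic the default case-insensitive matching (no -I flag).
    padded = pad * (n - 1) + text.lower() + pad * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def to_words(text):
    """Map a string to its set of words (assumption: split on whitespace)."""
    return set(text.lower().split())


print(sorted(to_grams("top")))            # the padded bigrams of "top"
print(sorted(to_grams("top", n=3)))       # the padded trigrams of "top"
print(sorted(to_words("Hello hello world")))
```

Because the result is a set, repeated grams or words count only once, which is what makes set-based measures such as Jaccard applicable.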

API Usage

Alternatively, you can use the algorithm through its Python API.

from topsim import TopSim

ts = TopSim([
    "python2",
    "python2.7",
    "python3",
    "python3.6",
])

print(ts.search("python", k=3)) # Return each similarity and the respective line numbers.
  • See topsim.py for further optional parameters, such as the similarity function.
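To make the meaning of k and of the --tie option concrete, here is a brute-force reference sketch. It is not how TopSim works internally (the package uses a more efficient algorithm), and brute_force_search, its parameters, its return shape, and the padded-bigram mapping are all hypothetical names and assumptions chosen for illustration.

```python
def to_grams(text, n=2, pad="$"):
    # Assumed mapping: lowercase, pad both ends, collect character n-grams.
    padded = pad * (n - 1) + text.lower() + pad * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    # Size of the intersection divided by the size of the union.
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def brute_force_search(lines, query, k=1, tie=False):
    """Score every line against the query and keep the k best.

    Returns (similarity, line_index) pairs, best first. With tie=True,
    results that share the k-th similarity are kept too, so more than
    k pairs may be returned.
    """
    q = to_grams(query)
    scored = sorted(
        ((jaccard(q, to_grams(line)), i) for i, line in enumerate(lines)),
        key=lambda pair: (-pair[0], pair[1]),
    )
    top = scored[:k]
    if tie and top:
        kth = top[-1][0]
        top += [pair for pair in scored[k:] if pair[0] == kth]
    return top


lines = ["python2", "python2.7", "python3", "python3.6"]
for score, index in brute_force_search(lines, "python", k=1, tie=True):
    print(round(score, 4), index, lines[index])
```

Under these assumptions "python2" and "python3" score the same against "python", so k=1 with tie=True yields both of them, while tie=False yields only the first.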

Examples

  • Search the most similar line.

ls /usr/bin | topsim-cli "top"

top	1.0
  • Search the three most similar lines.

ls /usr/bin | topsim-cli "top" -k 3

top	1.0
tops	0.5
iotop	0.4286
  • Jaccard similarity is used by default; it puts the same weight on matching the query and matching the line.

ls /usr/bin | topsim-cli "git" -k 5

git	1.0
wait	0.2857
git-shell	0.2727
pluginkit	0.2727
kinit	0.25
  • Use Tversky similarity, which puts most weight on matching the query. Ideal when searching within long lines.

ls /usr/bin | topsim-cli "git" -k 5 -s

git	1.0
git-shell	0.7489
pluginkit	0.7489
git-cvsserver	0.7481
git-upload-pack	0.7478
  • For n-gram mapping, a larger n can give more accurate matches but fewer of them.

ls /usr/bin | topsim-cli "git" -k 5 -s --numgrams=3

git	1.0
git-shell	0.5993
git-cvsserver	0.5988
git-upload-pack	0.5986
git-receive-pack	0.5984
  • Full support for Chinese/Japanese/Korean text.

cat test

地三鲜
红烧肉
烤全牛
木须肉
土豆炖牛肉

cat test | topsim-cli "牛肉" -k 3 -s

土豆炖牛肉	0.666
红烧肉	0.3332
木须肉	0.3332

Tip

I strongly encourage using PyPy instead of CPython to run the script for best performance.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Filename: TopSim-0.1.5.tar.gz (5.7 kB)
File type: Source
Python version: None
