Search the most similar strings against the query in Python 3.

Featured in ImportPython Issue 171. Thank you so much for the support!

TopSim searches for the strings most similar to a query, in Python 3. State-of-the-art algorithms and data structures are used for efficiency.

  • For both flexibility and efficiency, only set-based similarities are currently supported, namely Jaccard and Tversky.

  • For simpler code, some general-purpose functions have been moved into a new library, extratools.

  • TopEmoji is an interesting application of this library, which searches for the emojis most similar to a query.

topemoji-cli "baby" -k 5
👶	baby	1.0
👼	baby angel	0.666
🐤	baby chick	0.666
🍼	baby bottle	0.6659
🚼	baby symbol	0.6659

Reference

This library was originally part of the implementation for our research paper.

Preference-driven similarity join.
Chuancong Gao, Jiannan Wang, Jian Pei, Rui Li, Yi Chang.
Proceedings of the International Conference on Web Intelligence, 2017.

Installation

This package is available on PyPI. Just use pip3 install -U TopSim to install it.

Note that starting with version 0.2.0, only Python 3.12+ is supported.

CLI Usage

You can use the algorithm directly from the terminal.

Usage:
    topsim-cli <query> [options] [<file>]


Options:
    -I                     Case-sensitive matching.
    -k <k>                 Maximum number of search results. [default: 1]
    --tie                  Include all the results with the same similarity of the "k"-th result. May return more than "k" results.

    -s, --search           Search the query within each line rather than against the whole line, by preferring partial matching of the line.
                           Tversky similarity is used instead of Jaccard similarity.
    -e <e>                 Parameter for "tversky" similarity. [default: 0.001]

    --mapping=<mapping>    Map each string to a set of either "gram"s or "word"s. [default: gram]
    --numgrams=<numgrams>  Number of characters for each gram when mapping by "gram". [default: 2]

    --quiet                Do not print additional information to standard error.
  • The query is matched against each line of the input file (or standard input if no file is given).
  • In the output, each line and its similarity are separated by a tab character (\t).
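
The interaction between -k and --tie can be sketched in a few lines of Python. This is only an illustration of the behavior described in the option text, not the code TopSim uses; the function names top_k and format_results are hypothetical, and the scores are copied from the examples in this README.

```python
def top_k(scored, k=1, tie=False):
    """Return the k best (line, similarity) pairs, best first.

    With tie=True, every further result whose similarity equals that of
    the k-th result is kept as well, so more than k pairs may come back.
    """
    # sorted() is stable, so equally scored lines keep their input order.
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    if k <= 0 or not ranked:
        return []
    best = ranked[:k]
    if tie:
        cutoff = best[-1][1]
        for pair in ranked[k:]:
            if pair[1] != cutoff:
                break
            best.append(pair)
    return best


def format_results(results):
    """Render results the way the CLI output is described: line, tab, similarity."""
    return "\n".join(f"{line}\t{similarity}" for line, similarity in results)


scored = [
    ("git", 1.0),
    ("git-shell", 0.7489),
    ("pluginkit", 0.7489),
    ("git-cvsserver", 0.7481),
]
print(format_results(top_k(scored, k=2)))
print(format_results(top_k(scored, k=2, tie=True)))
```

With k=2 the first call keeps git and git-shell only, while the second also keeps pluginkit because it shares the similarity of the second result.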

API Usage

Alternatively, you can use the algorithm via the API.

from topsim import TopSim

ts = TopSim([
    "python2",
    "python2.7",
    "python3",
    "python3.6",
])

# Return each similarity and the respective line numbers.
ts.search("python", k=3)
  • Please check the source code for more optional parameters, such as the similarity function.

Examples

  • Search for the most similar line.

ls /usr/bin | topsim-cli "top"

top	1.0
  • Search for the three most similar lines.

ls /usr/bin | topsim-cli "top" -k 3

top	1.0
tops	0.5
iotop	0.4286
  • Jaccard similarity is used by default, which puts equal weight on matching both the query and the lines.

ls /usr/bin | topsim-cli "git" -k 5

git	1.0
wait	0.2857
git-shell	0.2727
pluginkit	0.2727
kinit	0.25
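
The scores above are consistent with a Jaccard similarity computed over multisets of character bigrams, where each string is padded with a boundary marker at both ends. The following is a minimal sketch under that assumption. The names grams, jaccard and PAD are illustrative, and the padding scheme is inferred from the example output rather than taken from the TopSim source.

```python
from collections import Counter

# Assumed boundary marker; any character that never occurs in the data works.
PAD = "\0"


def grams(text, n=2):
    """Multiset of character n-grams, padding each end with n - 1 markers."""
    padded = PAD * (n - 1) + text + PAD * (n - 1)
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def jaccard(query, line, n=2):
    """Multiset Jaccard: number of shared grams over the size of the union."""
    a, b = grams(query, n), grams(line, n)
    shared = sum((a & b).values())  # multiset intersection
    total = sum((a | b).values())   # multiset union
    return shared / total if total else 0.0


for line in ["git", "wait", "git-shell", "pluginkit", "kinit"]:
    print(f"{line}\t{round(jaccard('git', line), 4)}")
```

Run as a script, this prints the same five similarities as the listing above. For example, "git" and "wait" share 2 of 7 distinct padded bigrams, which gives 0.2857.
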
  • Use Tversky similarity, which puts most weight on matching the query. Ideal when searching within long lines.

ls /usr/bin | topsim-cli "git" -k 5 -s

git	1.0
git-shell	0.7489
pluginkit	0.7489
git-cvsserver	0.7481
git-upload-pack	0.7478
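
One parameterization of Tversky similarity that reproduces the scores above weights grams found only in the query by 1 - e and grams found only in the line by e, with e = 0.001 as in the -e default. The sketch below makes that assumption explicit. The names grams, tversky and PAD are illustrative, and both the weighting and the padded-bigram mapping are inferred from the example output, not read from the TopSim source.

```python
from collections import Counter

# Assumed boundary marker; any character that never occurs in the data works.
PAD = "\0"


def grams(text, n=2):
    """Multiset of character n-grams, padding each end with n - 1 markers."""
    padded = PAD * (n - 1) + text + PAD * (n - 1)
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def tversky(query, line, e=0.001, n=2):
    """Tversky index that mostly penalizes grams missing from the line.

    Assumed weights: (1 - e) for query-only grams, e for line-only grams.
    """
    a, b = grams(query, n), grams(line, n)
    shared = sum((a & b).values())
    only_query = sum((a - b).values())
    only_line = sum((b - a).values())
    denominator = shared + (1 - e) * only_query + e * only_line
    return shared / denominator if denominator else 0.0


for line in ["git", "git-shell", "pluginkit", "git-cvsserver", "git-upload-pack"]:
    print(f"{line}\t{round(tversky('git', line), 4)}")
```

Because extra grams in the line cost almost nothing, a long line such as git-upload-pack scores nearly as high as git-shell, which is why this mode suits searching within long lines.
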
  • For n-gram mapping, a higher value of n can result in better accuracy but fewer matches.

ls /usr/bin | topsim-cli "git" -k 5 -s --numgrams=3

git	1.0
git-shell	0.5993
git-cvsserver	0.5988
git-upload-pack	0.5986
git-receive-pack	0.5984
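
To see why a larger n yields fewer matches, it helps to list the grams themselves. The sketch below assumes each string is padded with n - 1 boundary markers at both ends, which is inferred from the example scores and not confirmed by the TopSim source. The helper names char_grams and shared_grams are hypothetical, and "_" is used as a visible marker purely for readability.

```python
def char_grams(text, n, pad="_"):
    """List the character n-grams of text after padding both ends.

    The "_" marker is an assumption for display; it would collide with
    real data that contains underscores.
    """
    padded = pad * (n - 1) + text + pad * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def shared_grams(query, line, n):
    """Grams of the query that also occur in the line (ignoring counts)."""
    line_grams = set(char_grams(line, n))
    return [gram for gram in char_grams(query, n) if gram in line_grams]


for n in (2, 3):
    print(n, char_grams("git", n), shared_grams("git", "pluginkit", n))
```

With n=2, "pluginkit" contains three of the four grams of "git" (gi, it, t_). With n=3 it contains only two of five (it_, t__), since the trigram "git" itself never appears in "pluginkit". This matches pluginkit dropping out of the results above.
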
  • Full support for Chinese/Japanese/Korean text.

cat test

地三鲜
红烧肉
烤全牛
木须肉
土豆炖牛肉

cat test | topsim-cli "牛肉" -k 3 -s

土豆炖牛肉	0.666
红烧肉	0.3332
木须肉	0.3332
