Search for the strings most similar to a query in Python 3.

Featured on ImportPython Issue 171. Thank you so much for the support!

TopSim searches for the strings most similar to a query, using state-of-the-art algorithms and data structures for best efficiency.

For both flexibility and efficiency, only set-based similarities are currently included, namely Jaccard and Tversky.
For simpler code, some general-purpose functions have been moved into a new library, extratools.
TopEmoji is an interesting application of this library: it searches for the emojis most similar to the query.

topemoji-cli "baby" -k 5

👶	baby	1.0
👼	baby angel	0.666
🐤	baby chick	0.666
🍼	baby bottle	0.6659
🚼	baby symbol	0.6659

Reference

This library was originally part of the implementation for our research paper.

Preference-driven similarity join.
Chuancong Gao, Jiannan Wang, Jian Pei, Rui Li, Yi Chang.
Proceedings of the International Conference on Web Intelligence, 2017.

Installation

This package is available on PyPI. Just use pip3 install -U TopSim to install it.

Note that starting with version 0.2.0, only Python 3.12+ is supported.

CLI Usage

You can use the algorithm directly from the terminal.

Usage:
    topsim-cli <query> [options] [<file>]


Options:
    -I                     Case-sensitive matching.
    -k <k>                 Maximum number of search results. [default: 1]
    --tie                  Include all the results with the same similarity of the "k"-th result. May return more than "k" results.

    -s, --search           Search the query within each line rather than against the whole line, by preferring partial matching of the line.
                           Tversky similarity is used instead of Jaccard similarity.
    -e <e>                 Parameter for "tversky" similarity. [default: 0.001]

    --mapping=<mapping>    Map each string to a set of either "gram"s or "word"s. [default: gram]
    --numgrams=<numgrams>  Number of characters for each gram when mapping by "gram". [default: 2]

    --quiet                Do not print additional information to standard error.

The query is matched against each line of the input file (or standard input).
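The interaction between -k and --tie can be sketched in a few lines of plain Python. This is only an illustration of the behaviour described in the options above, not the library's implementation: the function name top_k and its parameters are hypothetical, and the scores are copied from the Jaccard example in the Examples section.

```python
def top_k(scored: list[tuple[str, float]], k: int = 1,
          tie: bool = False) -> list[tuple[str, float]]:
    """Return the k best (line, similarity) pairs, best first.

    With tie=True, every pair whose similarity equals that of the
    k-th result is also kept, so more than k pairs may be returned.
    """
    if k <= 0:
        return []
    # Stable sort: lines with equal scores keep their input order.
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    if not tie or len(ranked) <= k:
        return ranked[:k]
    threshold = ranked[k - 1][1]
    return [pair for pair in ranked if pair[1] >= threshold]


# Scores taken from the Jaccard example for the query "git".
scored = [
    ("git", 1.0),
    ("wait", 0.2857),
    ("git-shell", 0.2727),
    ("pluginkit", 0.2727),
    ("kinit", 0.25),
]

print(top_k(scored, k=3))            # three results
print(top_k(scored, k=3, tie=True))  # four results: "pluginkit" ties with "git-shell"
```
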

Each line and its similarity are separated by a tab character (\t).
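Because the output is tab-separated, it is easy to consume from a script. The following is a minimal sketch that assumes the captured output has exactly the shape shown in the examples (one result per line, similarity in the last tab-separated field). The helper name parse_results is hypothetical and not part of TopSim.

```python
def parse_results(output: str) -> list[tuple[str, float]]:
    """Split captured topsim-cli output into (line, similarity) pairs."""
    results = []
    for row in output.splitlines():
        if not row:
            continue  # skip empty rows
        # Split on the last tab so that tabs inside the matched line survive.
        line, _, similarity = row.rpartition("\t")
        results.append((line, float(similarity)))
    return results


sample = "top\t1.0\ntops\t0.5\niotop\t0.4286\n"
print(parse_results(sample))
```
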

API Usage

Alternatively, you can use the algorithm via API.

from topsim import TopSim

ts = TopSim([
    "python2",
    "python2.7",
    "python3",
    "python3.6",
])

# Return each similarity and the respective line numbers.
ts.search("python", k=3)

Please check the code for more optional parameters, such as the similarity function.

Examples

Search for the most similar line.

ls /usr/bin | topsim-cli "top"

top	1.0

Search for the three most similar lines.

ls /usr/bin | topsim-cli "top" -k 3

top	1.0
tops	0.5
iotop	0.4286

Jaccard similarity is used by default, which puts the same weight on matching the query and on matching the lines.

ls /usr/bin | topsim-cli "git" -k 5

git	1.0
wait	0.2857
git-shell	0.2727
pluginkit	0.2727
kinit	0.25
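The scores above can be reproduced by hand with a short brute-force sketch. It rests on assumptions inferred from the sample output rather than taken from the library's code: each string is lowercased, padded with one extra character on each side, split into 2-grams, and repeated grams are counted separately. The names PAD, grams and jaccard are hypothetical.

```python
from collections import Counter

PAD = "\0"  # hypothetical padding character, assumed absent from the input


def grams(text: str, n: int = 2) -> Counter:
    """Map a string to a multiset of n-grams, padding both ends with n-1 characters."""
    padded = PAD * (n - 1) + text.lower() + PAD * (n - 1)
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def jaccard(query: str, line: str, n: int = 2) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    a, b = grams(query, n), grams(line, n)
    common = sum((a & b).values())
    return common / (sum(a.values()) + sum(b.values()) - common)


# Prints the same five scores as the example above.
for line in ["git", "wait", "git-shell", "pluginkit", "kinit"]:
    print(line, round(jaccard("git", line), 4), sep="\t")
```

For example, "git" and "wait" share 2 of their 7 distinct padded 2-grams, which gives 2/7 ≈ 0.2857.
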

Use Tversky similarity, which puts most of the weight on matching the query. This is ideal when searching within long lines.

ls /usr/bin | topsim-cli "git" -k 5 -s

git	1.0
git-shell	0.7489
pluginkit	0.7489
git-cvsserver	0.7481
git-upload-pack	0.7478
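A brute-force sketch of the Tversky scores above follows. The weighting is an assumption inferred from the sample output, not read from the library's code: grams found only in the query are weighted by 1 - e, grams found only in the line by e, with e = 0.001 as in the -e default. The padded 2-gram mapping, the multiset treatment of repeated grams, and the names PAD, grams and tversky are likewise assumptions.

```python
from collections import Counter

PAD = "\0"  # hypothetical padding character, assumed absent from the input


def grams(text: str, n: int = 2) -> Counter:
    """Map a string to a multiset of n-grams, padding both ends with n-1 characters."""
    padded = PAD * (n - 1) + text.lower() + PAD * (n - 1)
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def tversky(query: str, line: str, e: float = 0.001, n: int = 2) -> float:
    """Tversky similarity that mostly penalizes grams missing from the line."""
    a, b = grams(query, n), grams(line, n)
    common = sum((a & b).values())
    only_query = sum((a - b).values())  # query grams the line lacks
    only_line = sum((b - a).values())   # extra grams in the line
    return common / (common + (1 - e) * only_query + e * only_line)


# Prints the same five scores as the example above.
for line in ["git", "git-shell", "pluginkit", "git-cvsserver", "git-upload-pack"]:
    print(line, round(tversky("git", line), 4), sep="\t")
```

Since extra grams in the line cost almost nothing, long lines that contain most of the query score nearly as well as short ones, which matches the ranking shown above.
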

For n-gram mapping, a higher n can result in better accuracy but fewer matches.

ls /usr/bin | topsim-cli "git" -k 5 -s --numgrams=3

git	1.0
git-shell	0.5993
git-cvsserver	0.5988
git-upload-pack	0.5986
git-receive-pack	0.5984
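To see why "pluginkit" drops out of the results with --numgrams=3, it helps to list the grams each line shares with the query. This is a sketch under the assumption that strings are padded with n-1 characters on each side before being split. The visible "_" padding and the names grams and shared are hypothetical and chosen for readability.

```python
def grams(text: str, n: int, pad: str = "_") -> set[str]:
    """Return the set of n-grams of a string padded with n-1 characters per side."""
    padded = pad * (n - 1) + text + pad * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def shared(query: str, line: str, n: int) -> list[str]:
    """Return the grams that the query and the line have in common."""
    return sorted(grams(query, n) & grams(line, n))


for n in (2, 3):
    for line in ("git-shell", "pluginkit"):
        print(n, line, shared("git", line, n))
```

With 2-grams both lines share three grams with "git". With 3-grams "git-shell" still shares three, while "pluginkit" shares only two, because it never contains "git" as a contiguous substring.
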

Full support for Chinese/Japanese/Korean.

cat test

地三鲜
红烧肉
烤全牛
木须肉
土豆炖牛肉

cat test | topsim-cli "牛肉" -k 3 -s

土豆炖牛肉	0.666
红烧肉	0.3332
木须肉	0.3332

Release history

0.2.0 (this version): Apr 5, 2025
0.1.5: May 5, 2018
0.1.4: May 1, 2018
0.1.3: Apr 30, 2018
0.1: Apr 11, 2018

