token-distance is a Python library for fuzzy token matching within text documents. It lets developers and data scientists search for and compare tokens using flexible criteria rather than exact matches. The library supports tokenization by whitespace, regular expressions, or custom functions, and provides weighted comparisons for nuanced analysis.

Project description

token-distance is a versatile library designed for fuzzy token searching within texts. It can be used as a standalone command line tool or integrated into other software as a library. This tool is particularly useful for applications in data mining, natural language processing, and information retrieval where matching based on exact tokens is insufficient.

The process begins by tokenizing the input texts, typically using whitespace, though regular expressions and custom functions can also be employed. Following tokenization, each token from the search query is assigned a weight that reflects its importance, which could depend on factors like token length or predefined criteria.
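The tokenization and weighting steps can be sketched in plain Python. This is an illustration only and does not use the token-distance API: the helper names (`whitespace_tokenizer`, `regex_tokenizer`, `length_weights`) are hypothetical, and weighting by token length is just one of the factors mentioned above, chosen here as an assumption.

```python
import re
from typing import Callable, List

# A tokenizer is any function that turns a text into a list of tokens,
# so a custom function can be swapped in for the two shown here.
Tokenizer = Callable[[str], List[str]]


def whitespace_tokenizer(text: str) -> List[str]:
    """Split on runs of whitespace."""
    return text.split()


def regex_tokenizer(pattern: str) -> Tokenizer:
    """Build a tokenizer that splits on a regular expression."""
    compiled = re.compile(pattern)

    def tokenize(text: str) -> List[str]:
        # Adjacent separators produce empty strings; drop them.
        return [token for token in compiled.split(text) if token]

    return tokenize


def length_weights(tokens: List[str]) -> List[float]:
    """Assumed weighting: each token's share of the total character count."""
    total = sum(len(token) for token in tokens)
    if total == 0:
        return []
    return [len(token) / total for token in tokens]
```

With this weighting, longer search tokens contribute more to the final score than short ones, and the weights always sum to 1.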

For each search token, token-distance identifies the most similar token in the target text based on these weights. The core of the library's functionality lies in how it calculates similarity: it pairs each search token with the best matching token in the target text and computes a weighted average of these pairings to produce a final similarity score.
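The matching step described above can be approximated with the standard library. The sketch below is not the library's implementation: it assumes whitespace tokenization, length-based weights, and `difflib.SequenceMatcher` as a stand-in similarity measure, and the function names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; the library may use a different measure."""
    return SequenceMatcher(None, a, b).ratio()


def best_match(token: str, candidates: List[str]) -> Tuple[str, float]:
    """Return the candidate most similar to `token` together with its score."""
    scored = ((candidate, similarity(token, candidate)) for candidate in candidates)
    return max(scored, key=lambda pair: pair[1])


def fuzzy_score(search: str, target: str) -> float:
    """Weighted average of each search token's best match in the target."""
    search_tokens = search.split()
    target_tokens = target.split()
    if not search_tokens or not target_tokens:
        return 0.0
    # Assumed weighting: longer search tokens count for more.
    weights = [len(token) for token in search_tokens]
    best_scores = [best_match(token, target_tokens)[1] for token in search_tokens]
    return sum(w * s for w, s in zip(weights, best_scores)) / sum(weights)
```

Because every search token is paired with its best counterpart independently, token order in the target text does not affect the score in this sketch.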

The operations of token-distance are summarized in the chart below, which illustrates the step-by-step process from tokenization to the calculation of similarity scores.

Installation

Installation of token-distance is straightforward using pip, the Python package installer. This method ensures that the library and its dependencies are correctly configured. Ensure you have Python and pip installed on your system before proceeding.

    pip install token-distance

Usage

token-distance is flexible, functioning both as a command-line tool and as a library for integration into your software.

Console

To compare two text files for token similarity, use the following command:

    token_distance_compare <path_to_token_file> <path_to_search_target_file>

For more complex tokenization, such as splitting text on commas and periods as well as whitespace, you can use a regular expression:

    token_distance_compare <path_to_token_file> <path_to_search_target_file> \
    --tokenize-by "[\s,\.]" --regex 1

This command tokenizes the input texts at whitespace characters, commas, and periods, enhancing the flexibility of the search.
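To preview what a separator pattern does before passing it to the command, the same expression can be tried with Python's `re` module. This sketch only shows how the pattern behaves under `re.split`; whether the command line tool discards empty tokens in the same way is an assumption.

```python
import re

# The separator pattern from the command above: whitespace, comma, or period.
PATTERN = r"[\s,\.]"


def tokenize(text: str) -> list:
    # re.split yields empty strings between adjacent separators; drop them.
    return [token for token in re.split(PATTERN, text) if token]


tokens = tokenize("red, green.blue  yellow")
print(tokens)
```

Characters that are not in the character class, such as `!`, stay attached to their token, so the class has to be extended to split on them as well.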

As Library

token-distance can also be configured programmatically to suit specific needs, such as integrating custom similarity algorithms. Here's how you can set up a token distance calculation function using a configuration object:

    from collections.abc import Callable
    from token_distance import from_config, Config

    # Build a function that takes two texts and returns a float score.
    calculate_distance: Callable[[str, str], float] = from_config(Config(mean='geometric'))

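This configuration uses a geometric mean to compute the similarity score. A geometric mean penalizes a single poorly matched token more strongly than an arithmetic mean does, which is useful when every search token is expected to have a close counterpart.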
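The effect of choosing a geometric mean can be seen by comparing the two textbook definitions of a weighted mean. The functions below are standard formulas written for illustration; they are hypothetical helpers, not part of token-distance, and how the library computes its means internally is not documented here.

```python
import math
from typing import Sequence


def weighted_arithmetic_mean(values: Sequence[float], weights: Sequence[float]) -> float:
    """sum(w_i * v_i) / sum(w_i)"""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)


def weighted_geometric_mean(values: Sequence[float], weights: Sequence[float]) -> float:
    """exp(sum(w_i * ln(v_i)) / sum(w_i)); zero if any value is zero."""
    if any(v == 0 for v in values):
        return 0.0
    total = sum(weights)
    return math.exp(sum(w * math.log(v) for v, w in zip(values, weights)) / total)


# One perfect match and one weak match, equally weighted.
scores = [1.0, 0.25]
weights = [1.0, 1.0]
arithmetic = weighted_arithmetic_mean(scores, weights)
geometric = weighted_geometric_mean(scores, weights)
```

For these inputs the arithmetic mean is 0.625 while the geometric mean is about 0.5, and a single similarity of zero drives the geometric mean to zero regardless of the other tokens.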

token-distance can also report which tokens were actually matched, if that is of interest:

    from collections.abc import Callable, Collection
    from token_distance import match_from_config, MatchConfig, RecordingToken

    # Build a function that takes two texts and returns the recorded token matches.
    get_best_matches: Callable[[str, str], Collection[RecordingToken]] = match_from_config(MatchConfig())

Source Distributions

No source distribution files available for this release.

Built Distribution

token_distance-0.2.3-py3-none-any.whl (17.7 kB view details)

Uploaded Python 3

File details

Details for the file token_distance-0.2.3-py3-none-any.whl.

File hashes

Hashes for token_distance-0.2.3-py3-none-any.whl

Algorithm    Hash digest
SHA256       4d335ebaa96d013b85f697aa5143b1d349ef0e36e8e83e5f4d6bd3a6aefb889e
MD5          bf463519da6cb804c2629c132b910f1a
BLAKE2b-256  9362c4cbab4893615b229d2062bf94beb472f6a48130846d53fd1a3ae7bdbe60