
Approximate String Matching using libsdcxx


Approximate (aka "fuzzy") matching of sequences of tokens, implemented on top of the libsdcxx library.

The library offers higher-level (more user-friendly) interfaces for libsdcxx use:

* approxism.Extractor as the high-level, dictionary-based information extractor
* approxism.Matcher as the middle-ground, generic approximate string matcher

The project uses the NLTK punkt sentence splitter.

To speed up the matching (or for other purposes), one may use the provided collections of stop words for almost 50 languages. As the intended use is typically named-entity recognition, stop words are probably safe to disregard. Note, though, that stop words are not removed from the text altogether; the matches are just prevented from beginning or ending with a stop word. Matches may still contain them (and they do contribute to the matching score).
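
For illustration, a minimal sketch of this behaviour, using the approxism.Matcher interface described under Usage below (what exactly gets matched naturally depends on the stop words collection and the threshold):

from approxism import Matcher

matcher = Matcher("english")  # stop words stripping is on by default

# "the" and "of" are English stop words, so no match may begin or end with them;
# a match of the phrase below may still contain them (and they contribute to the score)
for match in matcher.text("A report on the state of the art.").match("state of the art", 0.8):
    print(match)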

Also, token lowercasing is provided, with an option for keeping short acronyms in uppercase (in order to prevent matches with common short words). Injection of custom token transforms is supported, too (see Usage below).

See http://github.com/vencik/libsdcxx

Build and installation

Python v3.7 or newer is supported. For the Python package build, you shall need pip, Python distutils and the wheel package. If you wish to run the Python UTs (which is highly recommended), you shall also need pytest.

E.g. on Debian-based (or similar, apt-using) systems, the following should get you the required prerequisites (unless you wish to use pyenv):

# apt-get install git
# apt-get install python3-pip python3-distutils    # unless you use pyenv
$ pip install wheel pytest    # ditto, better do that in pyenv sandbox

On Mac OS X, you’ll need Xcode tools and Homebrew. Then, install the required prerequisites by

$ brew install git

If you do wish to use pyenv to create and manage the project sandbox (which is advisable), see the short intro in the subsection below.

Anyhow, with all the prerequisites installed, clone the project:

$ git clone https://github.com/vencik/approxism.git

Build the project, run UTs and build packages:

$ cd approxism
$ ./build.sh -ugp

Note that the build.sh script has options; run $ ./build.sh -h to see them.

If you wish, use pip to install the Python package:

# pip install approxism-*.whl

Note that it’s recommended to use pyenv, especially for development purposes.

Managing project sandbox with pyenv

First, install pyenv. You may use either your OS package repo (Homebrew on Mac OS X) or the web pyenv installer. Set up pyenv (set the environment variables) as instructed.

Then create the approxism project sandbox, thus:

$ pyenv install 3.9.6    # your choice of the Python interpreter version, >= 3.7
$ pyenv virtualenv 3.9.6 approxism

Now, you can always (de)activate the virtual environment (switch to the sandbox) by

$ pyenv activate approxism
$ pyenv deactivate

In the sandbox, install Python packages using pip as usual.

$ pip install wheel pytest

Usage

High-level, NER-like extractor

A very simple use case, assuming that you have collected the terms of interest (those you need to search for in your texts) in a JSON file with the following structure:

my_dictionary.json:

{
    "multi-word term" : {
        "_matching_threshold" : 0.8,
        "whatever" : "other fields you wish",
        "etc" : "really, whatever you like"
    },
    "another term..." : {
    }
}

The _matching_threshold field is optional; see approxism.Extractor.Dictionary.Record.

from dataclasses import dataclass
import json
from approxism import Extractor
from approxism.transforms import Lowercase

@dataclass
class MyRecord(Extractor.Dictionary.Record):
    whatever: str   # or whatever type you wish, as long as it's JSON (de)serialisable
    etc: str        # ditto

with open("my_dictionary.json", "r", encoding="utf-8") as json_fd:
    dictionary = {
        term: MyRecord(**record)
        for term, record in json.load(json_fd).items()
    }

extractor = Extractor(
    dictionary,
    default_threshold=0.8,
    language="english",
    token_transform=[Lowercase(min_len=4, except_caps=True)],
)

for match in extractor.extract(my_text):    # my_text being your input text (str)
    print(match)

Approximate string matching (generic) layer

from approxism import Matcher

matcher = Matcher("english")    # that's the default

for match in matcher.text("Hello world!").match("worl", 0.8):  # score >= 0.8 is required
    print(match)
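
# For intuition about the score: libsdcxx scores candidates by the Sørensen–Dice
# coefficient, SDC(X, Y) = 2 * |X ∩ Y| / (|X| + |Y|), computed on bigram multisets.
# A rough illustration on character bigram sets (a sketch of the principle only,
# NOT the library's exact implementation):
def dice(a: str, b: str) -> float:
    x = {a[i:i + 2] for i in range(len(a) - 1)}
    y = {b[i:i + 2] for i in range(len(b) - 1)}
    return 2 * len(x & y) / (len(x) + len(y))

dice("worl", "world")   # 2*3 / (3+4) = 0.857... >= 0.8, hence the match above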

# Of course, one may like to store the (pre-processed) text and/or patterns:

txt1 = matcher.text("My text about Sørensen–Dice coefficient.")     # text preprocessing
txt2 = matcher.text("And one about correlation coefficient.")
bgr1 = matcher.sequence_bigrams("Sørensen–Dice")                    # pattern preproc.
bgr2 = matcher.sequence_bigrams("coefficient")

for text in (txt1, txt2):
    print(f"Searching in \"{text}\"")
    for bgr in (bgr1, bgr2):
        for match in text.match(bgr):                               # pattern matching
            print(f"Found {bgr}: {match}")

# Long texts preprocessing produces relatively large data structures
# (space complexity is O(n^2) where n is number of tokens in a sentence).
# Matcher.text splits the text into sentences.
# However, if you already have the text split, or you prefer to process it
# sentence-by-sentence (which is recommended), you may use Matcher.sentences to split it
# and pre-process them using Matcher.sentence for matching:

for sentence in matcher.sentences(my_long_text):    # sentence string generator
    sentence = matcher.sentence(sentence)           # sentence preprocessing
    for bgr in (bgr1, bgr2):
        for match in sentence.match(bgr):           # pattern matching
            print(f"Found {bgr}: {match}")

# Should you like to lowercase tokens, simply pass the matcher token transform(s)
from approxism.transforms import Lowercase

matcher = Matcher(
    language="french",
    strip_stopwords=False,          # the default is True
    token_transform=[Lowercase()],  # lowercase tokens
)

# The lowercase transformer supports keeping short acronyms in uppercase:

Lowercase(min_len=4, except_caps=True)  # this will lowercase tokens of at least 4 chars,
                                        # but will also lowercase shorter ones UNLESS
                                        # they are in all CAPS so e.g. "AMI" (AWS machine
                                        # image) shall be kept as is and therefore won't
                                        # get mistaken for a friend...

# You may add more transforms of yours; just implement Matcher.TokenTransform interface
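
# E.g. a hypothetical ASCII-folding transform (a sketch only; it assumes the
# interface boils down to a callable mapping a token string to a token string):
import unicodedata

class AsciiFold:
    def __call__(self, token: str) -> str:
        # Strip diacritics from decomposable characters, e.g. "café" -> "cafe"
        return unicodedata.normalize("NFKD", token).encode("ascii", "ignore").decode("ascii")

matcher = Matcher(language="english", token_transform=[AsciiFold()])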

# Lastly, when specifying language, note that not all languages may be available.
# List of available tokeniser languages is obtained by calling
from approxism import Tokeniser
Tokeniser.available()

# Similarly, list of available stop words languages is obtained by calling
from approxism import Stopwords
Stopwords.available()

# The matcher allows you to specify how to proceed if your language is not available.
# By default, an exception is thrown.
# However, passing the strict_language=False parameter suppresses it, using the default
# language for tokenisation (and no stop words, if they are not available).
Matcher(language="martian", strict_language=False)

# The above shan't throw; instead, Matcher.default_language shall be used
# for tokenisation (and no stop words shall be used, unless somebody collects Martian
# stop words any time soon... ;-))

For more elaborate examples of use, check out the matcher unit tests in src/approxism/unit_test/test_matcher.py.

License

The software is available open-source under the terms of the 3-clause BSD license.

Author

Václav Krpec <vencik@razdva.cz>
