
Approximate String Matching using libsdcxx


Approximate (aka "fuzzy") matching of sequences of tokens, implemented on top of the libsdcxx library.

The library offers higher-level (more user-friendly) interfaces for libsdcxx use:

* approxism.Extractor as the high-level, dictionary-based information extractor
* approxism.Matcher as the middle-ground, generic approximate string matcher

The project uses the NLTK punkt sentence splitter.

To speed up the matching (or for other purposes), one may use the provided collections of stop words for almost 50 languages. As the intended use is typically named-entity recognition, stop words are probably safe to disregard. Note, though, that stop words are not removed from the text altogether; the matches are just prevented from beginning or ending with a stop word. Matches may still contain them (and they do contribute to the matching score).
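
For illustration, a minimal sketch of this behaviour, using the approxism.Matcher interface described under Usage below (what exactly gets matched naturally depends on the stop words collection and the threshold):

from approxism import Matcher

matcher = Matcher("english")  # stop words stripping is on by default

# "the" and "of" are English stop words, so no match may begin or end with them;
# a match of the phrase below may still contain them (and they contribute to the score)
for match in matcher.text("A report on the state of the art.").match("state of the art", 0.8):
    print(match)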

Also, token lowercasing is provided, with an option for keeping short acronyms in uppercase (in order to prevent matches with common short words). Injection of custom token transforms is supported, too (see Usage below).

See http://github.com/vencik/libsdcxx

Build and installation

Python v3.7 or newer is supported. For the Python package build, you shall need pip, Python distutils and the wheel package. If you wish to run the Python UTs (which is highly recommended), you shall also need pytest.

E.g. on Debian-based (or similar, apt-using) systems, the following should get you the required prerequisites (unless you wish to use pyenv):

# apt-get install git
# apt-get install python3-pip python3-distutils    # unless you use pyenv
$ pip install wheel pytest    # ditto, better do that in pyenv sandbox

On Mac OS X, you’ll need Xcode tools and Homebrew. Then, install the required prerequisites by

$ brew install git

If you do wish to use pyenv to create and manage the project sandbox (which is advisable), see the short intro in the subsection below.

Anyhow, with all the prerequisites installed, clone the project:

$ git clone https://github.com/vencik/approxism.git

Build the project, run UTs and build packages:

$ cd approxism
$ ./build.sh -ugp

Note that the build.sh script has options; run $ ./build.sh -h to see them.

If you wish, use pip to install the Python package:

# pip install approxism-*.whl

Note that it’s recommended to use pyenv, especially for development purposes.

Managing project sandbox with pyenv

First, install pyenv. You may use either your OS package repo (Homebrew on Mac OS X) or the web pyenv installer. Set up pyenv (set the environment variables) as instructed.

Then create the approxism project sandbox, thus:

$ pyenv install 3.9.6    # your choice of the Python interpreter version, >= 3.7
$ pyenv virtualenv 3.9.6 approxism

Now, you can always (de)activate the virtual environment (switch to the sandbox) by

$ pyenv activate approxism
$ pyenv deactivate

In the sandbox, install Python packages using pip as usual.

$ pip install wheel pytest

Usage

High-level, NER-like extractor

A very simple use case, assuming that you have collected the terms of interest (those you need to search for in your texts) in a JSON file with the following structure:

my_dictionary.json:

{
    "multi-word term" : {
        "_matching_threshold" : 0.8,
        "whatever" : "other fields you wish",
        "etc" : "really, whatever you like"
    },
    "another term..." : {
    }
}

The _matching_threshold field is optional; see approxism.Extractor.Dictionary.Record.

from dataclasses import dataclass
import json
from approxism import Extractor
from approxism.transforms import Lowercase

@dataclass
class MyRecord(Extractor.Dictionary.Record):
    whatever: str   # or whatever type you wish, as long as it's JSON (de)serialisable
    etc: str        # ditto

with open("my_dictionary.json", "r", encoding="utf-8") as json_fd:
    dictionary = {
        term: MyRecord(**record)
        for term, record in json.load(json_fd).items()
    }

extractor = Extractor(
    dictionary,
    default_threshold=0.8,
    language="english",
    token_transform=[Lowercase(min_len=4, except_caps=True)],
)

for match in extractor.extract(my_text):    # my_text being your input text (str)
    print(match)

Approximate string matching (generic) layer

from approxism import Matcher

matcher = Matcher("english")    # that's the default

for match in matcher.text("Hello world!").match("worl", 0.8):  # score >= 0.8 is required
    print(match)
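
# For intuition about the score: libsdcxx scores candidates by the Sørensen–Dice
# coefficient, SDC(X, Y) = 2 * |X ∩ Y| / (|X| + |Y|), computed on bigram multisets.
# A rough illustration on character bigram sets (a sketch of the principle only,
# NOT the library's exact implementation):
def dice(a: str, b: str) -> float:
    x = {a[i:i + 2] for i in range(len(a) - 1)}
    y = {b[i:i + 2] for i in range(len(b) - 1)}
    return 2 * len(x & y) / (len(x) + len(y))

dice("worl", "world")   # 2*3 / (3+4) = 0.857... >= 0.8, hence the match above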

# Of course, one may like to store the (pre-processed) text and/or patterns:

txt1 = matcher.text("My text about Sørensen–Dice coefficient.")     # text preprocessing
txt2 = matcher.text("And one about correlation coefficient.")
bgr1 = matcher.sequence_bigrams("Sørensen–Dice")                    # pattern preproc.
bgr2 = matcher.sequence_bigrams("coefficient")

for text in (txt1, txt2):
    print(f"Searching in \"{text}\"")
    for bgr in (bgr1, bgr2):
        for match in text.match(bgr):                               # pattern matching
            print(f"Found {bgr}: {match}")

# Long texts preprocessing produces relatively large data structures
# (space complexity is O(n^2) where n is number of tokens in a sentence).
# Matcher.text splits the text into sentences.
# However, if you already have the text split, or you prefer to process it
# sentence-by-sentence (which is recommended), you may use Matcher.sentences to split it
# and pre-process them using Matcher.sentence for matching:

for sentence in matcher.sentences(my_long_text):    # sentence string generator
    sentence = matcher.sentence(sentence)           # sentence preprocessing
    for bgr in (bgr1, bgr2):
        for match in sentence.match(bgr):           # pattern matching
            print(f"Found {bgr}: {match}")

# Should you like to lowercase tokens, simply pass the matcher token transform(s)
from approxism.transforms import Lowercase

matcher = Matcher(
    language="french",
    strip_stopwords=False,          # the default is True
    token_transform=[Lowercase()],  # lowercase tokens
)

# The lowercase transformer supports keeping short acronyms in uppercase:

Lowercase(min_len=4, except_caps=True)  # this will lowercase tokens of at least 4 chars,
                                        # but will also lowercase shorter ones UNLESS
                                        # they are in all CAPS so e.g. "AMI" (AWS machine
                                        # image) shall be kept as is and therefore won't
                                        # get mistaken for a friend...

# You may add more transforms of yours; just implement Matcher.TokenTransform interface
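
# E.g. a hypothetical ASCII-folding transform (a sketch only; it assumes the
# interface boils down to a callable mapping a token string to a token string):
import unicodedata

class AsciiFold:
    def __call__(self, token: str) -> str:
        # Strip diacritics from decomposable characters, e.g. "café" -> "cafe"
        return unicodedata.normalize("NFKD", token).encode("ascii", "ignore").decode("ascii")

matcher = Matcher(language="english", token_transform=[AsciiFold()])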

# Lastly, when specifying language, note that not all languages may be available.
# List of available tokeniser languages is obtained by calling
from approxism import Tokeniser
Tokeniser.available()

# Similarly, list of available stop words languages is obtained by calling
from approxism import Stopwords
Stopwords.available()

# The matcher allows you to specify how to proceed if your language is not available.
# By default, an exception is thrown.
# However, passing the strict_language=False parameter suppresses it, using the default
# language for tokenisation (and no stop words, if they are not available).
Matcher(language="martian", strict_language=False)

# The above shan't throw; instead, Matcher.default_language shall be used
# for tokenisation (and no stop words shall be used, unless somebody collects Martian
# stop words any time soon... ;-))

For more elaborate examples of use, check out the matcher unit tests in src/approxism/unit_test/test_matcher.py.

License

The software is available open-source under the terms of the 3-clause BSD license.

Author

Václav Krpec <vencik@razdva.cz>
