
Tool for fuzzy searching in texts with historical language use and OCR/HTR errors

fuzzy-search

Fuzzy search module for finding lists of words and phrases in low-quality OCR and HTR text.

Usage

from fuzzy_search.fuzzy_phrase_searcher import FuzzyPhraseSearcher
from fuzzy_search.fuzzy_phrase_model import PhraseModel

# use higher matching thresholds for higher-quality OCR/HTR (improves precision; recall should be good anyway)
# use lower matching thresholds for lower-quality OCR/HTR (improves recall, which is the main problem there)
config = {
    "char_match_threshold": 0.8,
    "ngram_threshold": 0.6,
    "levenshtein_threshold": 0.8,
    "ignorecase": False,
    "ngram_size": 3,
    "skip_size": 0,
}

# initialize a new searcher instance with the config
fuzzy_searcher = FuzzyPhraseSearcher(config)

# create a list of domain keywords and phrases
domain_phrases = [
    # terms for the chair and attendants of a meeting
    "PRAESIDE",
    "PRAESENTIBUS",
    # some weekdays in Latin
    "Veneris",
    "Mercurii",
    # some date phrase where any date in January 1725 should match
    "den .. Januarii 1725"
]

# create a PhraseModel object from the domain phrases
phrase_model = PhraseModel(phrases=domain_phrases)

# register the phrase model with the searcher
fuzzy_searcher.index_phrase_model(phrase_model)

# take some example texts: meetings of the Dutch States General in January 1725
text1 = "ie Veucris den 5. Januaris 1725. PR&ASIDE, Den Heere Bentinck. PRASENTIEBUS, De Heeren Jan Welderen , van Dam, Torck , met een extraordinaris Gedeputeerde uyt de Provincie van Gelderlandt. Van Maasdam , vanden Boeizelaar , Raadtpenfionaris van Hoornbeeck , met een extraordinaris Gedeputeerde uyt de Provincie van Hollandt ende Welt-Vrieslandt. Velters, Ockere , Noey; van Hoorn , met een extraordinaris Gedeputeerde uyt de Provincie van Zeelandt. Van Renswoude , van Voor{t. Van Schwartzenbergh, vander Waayen, Vegilin Van I{elmuden. Van Iddekinge ‚ van Tamminga."

text2 = "Mercuri: den 10. Jangarii , | 1725. ia PRESIDE, Den Heere an Iddekinge. PRA&SENTIBUS, De Heeren /an Welderen , van Dam, van Wynbergen, Torck, met een extraordinaris Gedeputeerde uyt de Provincie van Gelderland. Van Maasdam , Raadtpenfionaris van Hoorn=beeck. Velters, Ockerfe, Noey. Taats van Amerongen, van Renswoude. Vander Waasen , Vegilin, ’ Bentinck, van I(elmaden. Van Tamminga."
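
The ngram_size and skip_size settings in the config suggest that strings are compared on character n-grams, optionally with skipped characters. The sketch below only illustrates that general idea. The function name char_skipgrams is hypothetical and not part of fuzzy-search, and the library may pad, weight or count its n-grams differently.

```python
from itertools import combinations


def char_skipgrams(text, ngram_size=3, skip_size=0):
    """Return character n-grams of text, allowing up to skip_size skipped characters.

    Hypothetical helper for illustration only, not the fuzzy-search implementation.
    """
    grams = []
    window = ngram_size + skip_size
    for start in range(len(text) - ngram_size + 1):
        # positions that may follow the first character within the window
        rest = range(start + 1, min(start + window, len(text)))
        for tail in combinations(rest, ngram_size - 1):
            grams.append(text[start] + "".join(text[i] for i in tail))
    return grams


# plain trigrams (no skips)
print(char_skipgrams("Veneris", ngram_size=3, skip_size=0))
# ['Ven', 'ene', 'ner', 'eri', 'ris']

# trigrams that may skip one character
print(char_skipgrams("Vene", ngram_size=3, skip_size=1))
# ['Ven', 'Vee', 'Vne', 'ene']
```

Allowing skips produces more n-grams per string, so a single misrecognised character destroys a smaller share of them.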

The find_matches method returns match objects:

# look for matches in the first example text
for match in fuzzy_searcher.find_matches(text1):
    print(match)

Printing the matches directly yields the following output:

Match(phrase: "Veneris", variant: "Veneris",string: "Veucris", offset: 3)
Match(phrase: "den .. Januarii 1725", variant: "den .. Januarii 1725",string: "den 5. Januaris 1725.", offset: 11)
Match(phrase: "PRAESIDE", variant: "PRAESIDE",string: "PR&ASIDE,", offset: 33)
Match(phrase: "PRAESENTIBUS", variant: "PRAESENTIBUS",string: "PRASENTIEBUS,", offset: 63)

Alternatively, each match object can generate a JSON representation of the match containing all information:

# look for matches in the first example text
for match in fuzzy_searcher.find_matches(text1):
    print(match.json())

This yields more detailed output:

{'match_keyword': 'Veneris', 'match_term': 'Veneris', 'match_string': 'Veucris', 'match_offset': 3, 'char_match': 0.7142857142857143, 'ngram_match': 0.625, 'levenshtein_distance': 0.7142857142857143}
{'match_keyword': 'den .. Januarii 1725', 'match_term': 'den .. Januarii 1725', 'match_string': 'den 5. Januaris 1725', 'match_offset': 11, 'char_match': 0.9, 'ngram_match': 0.8095238095238095, 'levenshtein_distance': 0.9}
{'match_keyword': 'PRAESIDE', 'match_term': 'PRAESIDE', 'match_string': 'PR&ASIDE', 'match_offset': 33, 'char_match': 0.875, 'ngram_match': 0.6666666666666666, 'levenshtein_distance': 0.75}
{'match_keyword': 'PRAESENTIBUS', 'match_term': 'PRAESENTIBUS', 'match_string': 'PRASENTIEBUS', 'match_offset': 63, 'char_match': 1.0, 'ngram_match': 0.7692307692307693, 'levenshtein_distance': 0.8333333333333334}
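
The values reported under levenshtein_distance above behave like a similarity score, where 1.0 would mean identical strings. As a rough sketch, and assuming the score is the edit distance normalised by the length of the longer string, the numbers can be reproduced as follows. The function names are hypothetical, and the library's own implementation may differ.

```python
def levenshtein_distance(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a, b):
    """Assumed normalisation: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein_distance(a, b) / longest


pairs = [
    ("Veneris", "Veucris"),                             # about 0.714
    ("den .. Januarii 1725", "den 5. Januaris 1725"),   # about 0.9
    ("PRAESIDE", "PR&ASIDE"),                           # 0.75
    ("PRAESENTIBUS", "PRASENTIEBUS"),                   # about 0.833
]
for phrase, string in pairs:
    print(phrase, string, levenshtein_similarity(phrase, string))
```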

Running the searcher on the second text:

# look for matches in the second example text
for match in fuzzy_searcher.find_candidates(text2):
    print(match.json())

This yields output in the following format (the sample shown here contains the strings and offsets of the first example text):

{'phrase': 'Veneris', 'variant': 'Veneris', 'string': 'Veucris', 'offset': 3, 'match_scores': {'char_match': 0.7142857142857143, 'ngram_match': 0.625, 'levenshtein_similarity': 0.7142857142857143}}
{'phrase': 'den .. Januarii 1725', 'variant': 'den .. Januarii 1725', 'string': 'den 5. Januaris 1725.', 'offset': 11, 'match_scores': {'char_match': 0.95, 'ngram_match': 0.7619047619047619, 'levenshtein_similarity': 0.8571428571428572}}
{'phrase': 'PRAESIDE', 'variant': 'PRAESIDE', 'string': 'PR&ASIDE,', 'offset': 33, 'match_scores': {'char_match': 0.875, 'ngram_match': 0.5555555555555556, 'levenshtein_similarity': 0.6666666666666667}}
{'phrase': 'PRAESENTIBUS', 'variant': 'PRAESENTIBUS', 'string': 'PRASENTIEBUS,', 'offset': 63, 'match_scores': {'char_match': 1.0, 'ngram_match': 0.6923076923076923, 'levenshtein_similarity': 0.7692307692307692}}
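
Because each match is available as a plain dictionary, it can be post-filtered with ordinary Python. The sketch below works on copies of the dictionaries shown above and needs no library calls. The helper filter_by_score and the cutoff of 0.75 are illustrative assumptions.

```python
# match dictionaries copied from the output above
match_dicts = [
    {'phrase': 'Veneris', 'variant': 'Veneris', 'string': 'Veucris', 'offset': 3,
     'match_scores': {'char_match': 0.7142857142857143, 'ngram_match': 0.625,
                      'levenshtein_similarity': 0.7142857142857143}},
    {'phrase': 'den .. Januarii 1725', 'variant': 'den .. Januarii 1725',
     'string': 'den 5. Januaris 1725.', 'offset': 11,
     'match_scores': {'char_match': 0.95, 'ngram_match': 0.7619047619047619,
                      'levenshtein_similarity': 0.8571428571428572}},
    {'phrase': 'PRAESIDE', 'variant': 'PRAESIDE', 'string': 'PR&ASIDE,', 'offset': 33,
     'match_scores': {'char_match': 0.875, 'ngram_match': 0.5555555555555556,
                      'levenshtein_similarity': 0.6666666666666667}},
    {'phrase': 'PRAESENTIBUS', 'variant': 'PRAESENTIBUS', 'string': 'PRASENTIEBUS,', 'offset': 63,
     'match_scores': {'char_match': 1.0, 'ngram_match': 0.6923076923076923,
                      'levenshtein_similarity': 0.7692307692307692}},
]


def filter_by_score(matches, score_name, minimum):
    """Keep only the match dictionaries whose named score is at least minimum."""
    return [m for m in matches if m['match_scores'][score_name] >= minimum]


# hypothetical stricter cutoff applied after searching
strict = filter_by_score(match_dicts, 'levenshtein_similarity', 0.75)
for m in strict:
    print(m['phrase'], '->', m['string'], 'at offset', m['offset'])
# den .. Januarii 1725 -> den 5. Januaris 1725. at offset 11
# PRAESENTIBUS -> PRASENTIEBUS, at offset 63
```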

Matches as Web Annotations

If texts are passed to find_matches as dictionaries with an identifier, the resulting matches include the text identifier and can generate Web Annotation representations:

# create a dictionary for the second text and add an identifier
text2_with_id = {
    "text": text2,
    "id": "urn:republic:3783_0076:page=151:para=4"
}
matches = fuzzy_searcher.find_matches(text2_with_id)

import json

# use json.dumps to pretty print the first match as Web Annotation
print(json.dumps(matches[0].as_web_anno(), indent=2))

Output:

{
  "@context": "http://www.w3.org/ns/anno.jsonld",
  "id": "cca6740d-e584-4322-b517-67d92e0e508a",
  "type": "Annotation",
  "motivation": "classifying",
  "created": "2020-12-08T10:22:26.838154",
  "generator": {
    "id": "https://github.com/marijnkoolen/fuzzy-search",
    "type": "Software",
    "name": "FuzzySearcher"
  },
  "target": {
    "source": "urn:republic:3783_0076:page=151:para=4",
    "selector": {
      "type": "TextPositionSelector",
      "start": 0,
      "end": 8
    }
  },
  "body": {
    "type": "Dataset",
    "value": {
      "match_phrase": "Mercurii",
      "match_variant": "Mercurii",
      "match_string": "Mercuri:",
      "phrase_metadata": {
        "phrase": "Mercurii"
      }
    }
  }
}
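
The TextPositionSelector in the annotation points back into the source text: start is inclusive and end is exclusive, which lines up with Python slicing. As a minimal sketch, the matched string can be recovered from the text and the selector alone. The helper selected_text is hypothetical, and the annotation dictionary is a trimmed copy of the output above.

```python
def selected_text(annotation, source_text):
    """Return the part of source_text that the annotation's selector points at."""
    selector = annotation["target"]["selector"]
    if selector["type"] != "TextPositionSelector":
        raise ValueError("unsupported selector type: " + selector["type"])
    return source_text[selector["start"]:selector["end"]]


# trimmed copy of the Web Annotation shown above
annotation = {
    "target": {
        "source": "urn:republic:3783_0076:page=151:para=4",
        "selector": {"type": "TextPositionSelector", "start": 0, "end": 8},
    },
    "body": {"value": {"match_string": "Mercuri:"}},
}

# the beginning of the second example text
text2_start = "Mercuri: den 10. Jangarii , | 1725."

print(selected_text(annotation, text2_start))
# Mercuri:
```

Checking the extracted span against match_string in the annotation body is a cheap way to confirm that an annotation still fits the text it targets.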

Documentation To Do

  • adding variant phrases and distractors
  • multiple searchers and searching in the context of other matches
