Tool for fuzzy searching in texts with historical language use and OCR/HTR errors

fuzzy-search

Fuzzy search module for searching lists of words in low-quality OCR and HTR text.

Usage

from fuzzy_search.fuzzy_phrase_searcher import FuzzyPhraseSearcher
from fuzzy_search.fuzzy_phrase_model import PhraseModel

# use higher matching thresholds for higher-quality OCR/HTR (raises precision; recall should be good anyway)
# use lower matching thresholds for lower-quality OCR/HTR (raises recall, which is the main problem there)
config = {
    "char_match_threshold": 0.8,
    "ngram_threshold": 0.6,
    "levenshtein_threshold": 0.8,
    "ignorecase": False,
    "ngram_size": 3,
    "skip_size": 0,
}

# initialize a new searcher instance with the config
fuzzy_searcher = FuzzyPhraseSearcher(config)

# create a list of domain keywords and phrases
domain_phrases = [
    # terms for the chair and attendants of a meeting
    "PRAESIDE",
    "PRAESENTIBUS",
    # some weekdays in Latin
    "Veneris",
    "Mercurii",
    # some date phrase where any date in January 1725 should match
    "den .. Januarii 1725"
]

# create a PhraseModel object from the domain phrases
phrase_model = PhraseModel(phrases=domain_phrases)

# register the phrase model with the searcher
fuzzy_searcher.index_phrase_model(phrase_model)

# take some example texts: meetings of the Dutch States General in January 1725
text1 = "ie Veucris den 5. Januaris 1725. PR&ASIDE, Den Heere Bentinck. PRASENTIEBUS, De Heeren Jan Welderen , van Dam, Torck , met een extraordinaris Gedeputeerde uyt de Provincie van Gelderlandt. Van Maasdam , vanden Boeizelaar , Raadtpenfionaris van Hoornbeeck , met een extraordinaris Gedeputeerde uyt de Provincie van Hollandt ende Welt-Vrieslandt. Velters, Ockere , Noey; van Hoorn , met een extraordinaris Gedeputeerde uyt de Provincie van Zeelandt. Van Renswoude , van Voor{t. Van Schwartzenbergh, vander Waayen, Vegilin Van I{elmuden. Van Iddekinge ‚ van Tamminga."

text2 = "Mercuri: den 10. Jangarii , | 1725. ia PRESIDE, Den Heere an Iddekinge. PRA&SENTIBUS, De Heeren /an Welderen , van Dam, van Wynbergen, Torck, met een extraordinaris Gedeputeerde uyt de Provincie van Gelderland. Van Maasdam , Raadtpenfionaris van Hoorn=beeck. Velters, Ockerfe, Noey. Taats van Amerongen, van Renswoude. Vander Waasen , Vegilin, ’ Bentinck, van I(elmaden. Van Tamminga."

The find_matches method returns match objects:

# look for matches in the first example text
for match in fuzzy_searcher.find_matches(text1):
    print(match)

Printing the matches directly yields the following output:

Match(phrase: "Veneris", variant: "Veneris",string: "Veucris", offset: 3)
Match(phrase: "den .. Januarii 1725", variant: "den .. Januarii 1725",string: "den 5. Januaris 1725.", offset: 11)
Match(phrase: "PRAESIDE", variant: "PRAESIDE",string: "PR&ASIDE,", offset: 33)
Match(phrase: "PRAESENTIBUS", variant: "PRAESENTIBUS",string: "PRASENTIEBUS,", offset: 63)

Alternatively, each match object can generate a JSON representation of the match containing all information:

# look for matches in the first example text
for match in fuzzy_searcher.find_matches(text1):
    print(match.json())

This yields more detailed output:

{'match_keyword': 'Veneris', 'match_term': 'Veneris', 'match_string': 'Veucris', 'match_offset': 3, 'char_match': 0.7142857142857143, 'ngram_match': 0.625, 'levenshtein_distance': 0.7142857142857143}
{'match_keyword': 'den .. Januarii 1725', 'match_term': 'den .. Januarii 1725', 'match_string': 'den 5. Januaris 1725', 'match_offset': 11, 'char_match': 0.9, 'ngram_match': 0.8095238095238095, 'levenshtein_distance': 0.9}
{'match_keyword': 'PRAESIDE', 'match_term': 'PRAESIDE', 'match_string': 'PR&ASIDE', 'match_offset': 33, 'char_match': 0.875, 'ngram_match': 0.6666666666666666, 'levenshtein_distance': 0.75}
{'match_keyword': 'PRAESENTIBUS', 'match_term': 'PRAESENTIBUS', 'match_string': 'PRASENTIEBUS', 'match_offset': 63, 'char_match': 1.0, 'ngram_match': 0.7692307692307693, 'levenshtein_distance': 0.8333333333333334}
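
The char_match and levenshtein_distance values above can be reproduced by hand. The following is a minimal sketch, not the library's implementation. It assumes that char_match is the share of the phrase's characters (counted as a multiset) that also occur in the matched string, and that the Levenshtein score is 1 minus the edit distance divided by the longer length. Both assumptions agree with the four outputs shown above, but the function names are hypothetical and the library may compute its scores differently.

```python
from collections import Counter


def char_match(phrase: str, string: str) -> float:
    """Share of the phrase's characters (as a multiset) also found in string.

    Assumption: this mirrors the 'char_match' score; it is not taken
    from the fuzzy-search source code.
    """
    if not phrase:
        return 0.0
    overlap = Counter(phrase) & Counter(string)  # multiset intersection
    return sum(overlap.values()) / len(phrase)


def levenshtein_distance(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein_distance(a, b) / longest


# phrase / matched-string pairs copied from the output above
pairs = [
    ("Veneris", "Veucris"),
    ("den .. Januarii 1725", "den 5. Januaris 1725"),
    ("PRAESIDE", "PR&ASIDE"),
    ("PRAESENTIBUS", "PRASENTIEBUS"),
]
for phrase, string in pairs:
    print(phrase,
          round(char_match(phrase, string), 4),
          round(levenshtein_similarity(phrase, string), 4))
```

Note that "Veucris" is reported as a match even though its scores fall below the configured 0.8 thresholds, so the thresholds are evidently not applied as simple cutoffs on these final values.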

Running find_candidates on the first example text:

# look for candidate matches in the first example text
for match in fuzzy_searcher.find_candidates(text1):
    print(match.json())

This yields the following output:

{'phrase': 'Veneris', 'variant': 'Veneris', 'string': 'Veucris', 'offset': 3, 'match_scores': {'char_match': 0.7142857142857143, 'ngram_match': 0.625, 'levenshtein_similarity': 0.7142857142857143}}
{'phrase': 'den .. Januarii 1725', 'variant': 'den .. Januarii 1725', 'string': 'den 5. Januaris 1725.', 'offset': 11, 'match_scores': {'char_match': 0.95, 'ngram_match': 0.7619047619047619, 'levenshtein_similarity': 0.8571428571428572}}
{'phrase': 'PRAESIDE', 'variant': 'PRAESIDE', 'string': 'PR&ASIDE,', 'offset': 33, 'match_scores': {'char_match': 0.875, 'ngram_match': 0.5555555555555556, 'levenshtein_similarity': 0.6666666666666667}}
{'phrase': 'PRAESENTIBUS', 'variant': 'PRAESENTIBUS', 'string': 'PRASENTIEBUS,', 'offset': 63, 'match_scores': {'char_match': 1.0, 'ngram_match': 0.6923076923076923, 'levenshtein_similarity': 0.7692307692307692}}
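
The dictionaries shown above are plain Python data, so they can be post-processed without the library. The following is a minimal sketch that keeps only the entries whose score reaches a stricter cutoff. The helper name filter_by_score and the cutoff of 0.75 are illustrative assumptions; the sample data is abridged from the output above.

```python
# abridged copies of the dictionaries printed above
candidates = [
    {"phrase": "Veneris", "string": "Veucris", "offset": 3,
     "match_scores": {"levenshtein_similarity": 0.7142857142857143}},
    {"phrase": "den .. Januarii 1725", "string": "den 5. Januaris 1725.", "offset": 11,
     "match_scores": {"levenshtein_similarity": 0.8571428571428572}},
    {"phrase": "PRAESIDE", "string": "PR&ASIDE,", "offset": 33,
     "match_scores": {"levenshtein_similarity": 0.6666666666666667}},
    {"phrase": "PRAESENTIBUS", "string": "PRASENTIEBUS,", "offset": 63,
     "match_scores": {"levenshtein_similarity": 0.7692307692307692}},
]


def filter_by_score(entries, score_name, minimum):
    """Keep entries whose named score is at least `minimum` (hypothetical helper)."""
    return [e for e in entries if e["match_scores"][score_name] >= minimum]


strict = filter_by_score(candidates, "levenshtein_similarity", 0.75)
print([entry["phrase"] for entry in strict])
# prints ['den .. Januarii 1725', 'PRAESENTIBUS']
```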

Matches as Web Annotations

If texts are passed to find_matches as dictionaries with an identifier, the resulting matches include the text identifier and can generate Web Annotation representations:

# create a dictionary for the second text and add an identifier
text2_with_id = {
    "text": text2,
    "id": "urn:republic:3783_0076:page=151:para=4"
}
matches = fuzzy_searcher.find_matches(text2_with_id)

import json

# use json.dumps to pretty print the first match as Web Annotation
print(json.dumps(matches[0].as_web_anno(), indent=2))

Output:

{
  "@context": "http://www.w3.org/ns/anno.jsonld",
  "id": "cca6740d-e584-4322-b517-67d92e0e508a",
  "type": "Annotation",
  "motivation": "classifying",
  "created": "2020-12-08T10:22:26.838154",
  "generator": {
    "id": "https://github.com/marijnkoolen/fuzzy-search",
    "type": "Software",
    "name": "FuzzySearcher"
  },
  "target": {
    "source": "urn:republic:3783_0076:page=151:para=4",
    "selector": {
      "type": "TextPositionSelector",
      "start": 0,
      "end": 8
    }
  },
  "body": {
    "type": "Dataset",
    "value": {
      "match_phrase": "Mercurii",
      "match_variant": "Mercurii",
      "match_string": "Mercuri:",
      "phrase_metadata": {
        "phrase": "Mercurii"
      }
    }
  }
}
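
The TextPositionSelector in the annotation target gives character offsets into the source text, with start inclusive and end exclusive, which maps directly onto a Python slice. The following is a minimal sketch of resolving an annotation back to the matched string. The helper name selected_text is a hypothetical illustration, the text is the opening of text2, and the annotation is reduced to the fields needed here.

```python
def selected_text(text: str, annotation: dict) -> str:
    """Return the part of `text` that a TextPositionSelector points at."""
    selector = annotation["target"]["selector"]
    if selector["type"] != "TextPositionSelector":
        raise ValueError(f"unsupported selector type: {selector['type']}")
    # start is inclusive and end is exclusive, like Python slicing
    return text[selector["start"]:selector["end"]]


# opening of the second example text
text = "Mercuri: den 10. Jangarii , | 1725."

# reduced copy of the annotation shown above
annotation = {
    "target": {
        "source": "urn:republic:3783_0076:page=151:para=4",
        "selector": {"type": "TextPositionSelector", "start": 0, "end": 8},
    }
}

print(selected_text(text, annotation))
# prints Mercuri:
```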

Documentation To Do

adding variant phrases and distractors
multiple searchers and searching in the context of other matches

Download files

Source Distribution

fuzzy_search-0.2.0.tar.gz (30.9 kB), uploaded Dec 8, 2020, Source

Built Distribution

fuzzy_search-0.2.0-py3-none-any.whl (32.7 kB), uploaded Dec 8, 2020, Python 3

File details

Details for the file fuzzy_search-0.2.0.tar.gz.

File metadata

Download URL: fuzzy_search-0.2.0.tar.gz
Upload date: Dec 8, 2020
Size: 30.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.25.0 setuptools/51.0.0 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.8.5

File hashes

Hashes for fuzzy_search-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`ecbdab392e8db0f7b77844cd9eb925b4356fd5af86bdd273d919faeedbf735f6`
MD5	`917ad6396552d557595fe95d1f63f3c3`
BLAKE2b-256	`62719ad1149e1e746d8649c95d21be5419bbd7a43464ba60c0b22841f77b46d6`

File details

Details for the file fuzzy_search-0.2.0-py3-none-any.whl.

File metadata

Download URL: fuzzy_search-0.2.0-py3-none-any.whl
Upload date: Dec 8, 2020
Size: 32.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.25.0 setuptools/51.0.0 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.8.5

File hashes

Hashes for fuzzy_search-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`11add2195b13cc12f2f4e33e2bc8e52015afbc70360e4e4e87bc334cd8c05b91`
MD5	`88363f13968f7cfaf944c363d27402b9`
BLAKE2b-256	`20986f043d768564b3deb2e621cd03a056bb91c122a7b4b01888685b66ee5a9d`
