# fuzzy-search

Tool for fuzzy searching in texts with historical language use and OCR/HTR errors.
A fuzzy search module for searching lists of words and phrases in low-quality OCR and HTR text.
## Usage
```python
from fuzzy_search.fuzzy_phrase_searcher import FuzzyPhraseSearcher
from fuzzy_search.fuzzy_phrase_model import PhraseModel

# higher matching thresholds for higher quality OCR/HTR (higher precision, recall should be good anyway)
# lower matching thresholds for lower quality OCR/HTR (higher recall, as that's the main problem)
# (an example of a lower-threshold config is sketched after this code block)
config = {
    "char_match_threshold": 0.8,
    "ngram_threshold": 0.6,
    "levenshtein_threshold": 0.8,
    "ignorecase": False,
    "ngram_size": 3,
    "skip_size": 0,
}

# initialize a new searcher instance with the config
fuzzy_searcher = FuzzyPhraseSearcher(config)

# create a list of domain keywords and phrases
domain_phrases = [
    # terms for the chair and attendants of a meeting
    "PRAESIDE",
    "PRAESENTIBUS",
    # some weekdays in Latin
    "Veneris",
    "Mercurii",
    # a date phrase where any date in January 1725 should match
    "den .. Januarii 1725"
]

# create a PhraseModel object from the domain phrases
phrase_model = PhraseModel(phrases=domain_phrases)

# register the phrase model with the searcher
fuzzy_searcher.index_phrase_model(phrase_model)
```
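The thresholds in the config above are the main tuning knobs: for noisier material you would lower them to favour recall over precision. A possible lower-threshold setup is sketched below; the values are illustrative only, not recommendations from the library.

```python
# an illustrative, more permissive config for noisier OCR/HTR
# (threshold values are examples only; tune them against your own material)
noisy_config = {
    "char_match_threshold": 0.6,
    "ngram_threshold": 0.5,
    "levenshtein_threshold": 0.6,
    "ignorecase": False,
    "ngram_size": 3,
    "skip_size": 0,
}
noisy_searcher = FuzzyPhraseSearcher(noisy_config)
noisy_searcher.index_phrase_model(phrase_model)
```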
```python
# take some example texts: meetings of the Dutch States General in January 1725
text1 = "ie Veucris den 5. Januaris 1725. PR&ASIDE, Den Heere Bentinck. PRASENTIEBUS, De Heeren Jan Welderen , van Dam, Torck , met een extraordinaris Gedeputeerde uyt de Provincie van Gelderlandt. Van Maasdam , vanden Boeizelaar , Raadtpenfionaris van Hoornbeeck , met een extraordinaris Gedeputeerde uyt de Provincie van Hollandt ende Welt-Vrieslandt. Velters, Ockere , Noey; van Hoorn , met een extraordinaris Gedeputeerde uyt de Provincie van Zeelandt. Van Renswoude , van Voor{t. Van Schwartzenbergh, vander Waayen, Vegilin Van I{elmuden. Van Iddekinge ‚ van Tamminga."
text2 = "Mercuri: den 10. Jangarii , | 1725. ia PRESIDE, Den Heere an Iddekinge. PRA&SENTIBUS, De Heeren /an Welderen , van Dam, van Wynbergen, Torck, met een extraordinaris Gedeputeerde uyt de Provincie van Gelderland. Van Maasdam , Raadtpenfionaris van Hoorn=beeck. Velters, Ockerfe, Noey. Taats van Amerongen, van Renswoude. Vander Waasen , Vegilin, ’ Bentinck, van I(elmaden. Van Tamminga."
```

The `find_matches` method returns match objects:
```python
# look for matches in the first example text
for match in fuzzy_searcher.find_matches(text1):
    print(match)
```
Printing the matches directly yields the following output:
Match(phrase: "Veneris", variant: "Veneris",string: "Veucris", offset: 3)
Match(phrase: "den .. Januarii 1725", variant: "den .. Januarii 1725",string: "den 5. Januaris 1725.", offset: 11)
Match(phrase: "PRAESIDE", variant: "PRAESIDE",string: "PR&ASIDE,", offset: 33)
Match(phrase: "PRAESENTIBUS", variant: "PRAESENTIBUS",string: "PRASENTIEBUS,", offset: 63)
Alternatively, each match object can generate a JSON representation of the match containing all information:
```python
# look for matches in the first example text
for match in fuzzy_searcher.find_matches(text1):
    print(match.json())
```
This yields more detailed output:
```
{'match_keyword': 'Veneris', 'match_term': 'Veneris', 'match_string': 'Veucris', 'match_offset': 3, 'char_match': 0.7142857142857143, 'ngram_match': 0.625, 'levenshtein_distance': 0.7142857142857143}
{'match_keyword': 'den .. Januarii 1725', 'match_term': 'den .. Januarii 1725', 'match_string': 'den 5. Januaris 1725', 'match_offset': 11, 'char_match': 0.9, 'ngram_match': 0.8095238095238095, 'levenshtein_distance': 0.9}
{'match_keyword': 'PRAESIDE', 'match_term': 'PRAESIDE', 'match_string': 'PR&ASIDE', 'match_offset': 33, 'char_match': 0.875, 'ngram_match': 0.6666666666666666, 'levenshtein_distance': 0.75}
{'match_keyword': 'PRAESENTIBUS', 'match_term': 'PRAESENTIBUS', 'match_string': 'PRASENTIEBUS', 'match_offset': 63, 'char_match': 1.0, 'ngram_match': 0.7692307692307693, 'levenshtein_distance': 0.8333333333333334}
```
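Because `match.json()` returns a plain dictionary, matches are easy to post-process or store with the standard library alone. As a minimal sketch (the output filename is arbitrary), the matches for the first text could be written to a JSON Lines file:

```python
import json

# collect the match dictionaries for the first example text
match_records = [match.json() for match in fuzzy_searcher.find_matches(text1)]

# write one JSON object per line (JSON Lines) for later analysis
with open("matches_text1.jsonl", "w") as fh:
    for record in match_records:
        fh.write(json.dumps(record) + "\n")
```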
Running the searcher's `find_candidates` method on the first example text returns the candidate matches with their individual scores:

```python
# look for candidate matches in the first example text
for candidate in fuzzy_searcher.find_candidates(text1):
    print(candidate.json())
```
This yields the following output:
```
{'phrase': 'Veneris', 'variant': 'Veneris', 'string': 'Veucris', 'offset': 3, 'match_scores': {'char_match': 0.7142857142857143, 'ngram_match': 0.625, 'levenshtein_similarity': 0.7142857142857143}}
{'phrase': 'den .. Januarii 1725', 'variant': 'den .. Januarii 1725', 'string': 'den 5. Januaris 1725.', 'offset': 11, 'match_scores': {'char_match': 0.95, 'ngram_match': 0.7619047619047619, 'levenshtein_similarity': 0.8571428571428572}}
{'phrase': 'PRAESIDE', 'variant': 'PRAESIDE', 'string': 'PR&ASIDE,', 'offset': 33, 'match_scores': {'char_match': 0.875, 'ngram_match': 0.5555555555555556, 'levenshtein_similarity': 0.6666666666666667}}
{'phrase': 'PRAESENTIBUS', 'variant': 'PRAESENTIBUS', 'string': 'PRASENTIEBUS,', 'offset': 63, 'match_scores': {'char_match': 1.0, 'ngram_match': 0.6923076923076923, 'levenshtein_similarity': 0.7692307692307692}}
```
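Since each result carries its individual scores under `match_scores`, stricter filtering can also be applied after the fact, independent of the searcher's own thresholds. A small sketch, using an arbitrary 0.8 cut-off on the Levenshtein similarity:

```python
# keep only candidates whose Levenshtein similarity clears a stricter cut-off
strict_results = [
    record for record in (c.json() for c in fuzzy_searcher.find_candidates(text1))
    if record["match_scores"]["levenshtein_similarity"] >= 0.8
]
for record in strict_results:
    print(record["phrase"], "->", record["string"], record["match_scores"])
```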
## Matches as Web Annotations

If texts are passed to `find_matches` as dictionaries with an identifier, the resulting matches include the text identifier and can generate Web Annotation representations:
```python
import json

# create a dictionary for the second text and add an identifier
text2_with_id = {
    "text": text2,
    "id": "urn:republic:3783_0076:page=151:para=4"
}

matches = fuzzy_searcher.find_matches(text2_with_id)

# use json.dumps to pretty print the first match as a Web Annotation
print(json.dumps(matches[0].as_web_anno(), indent=2))
```
Output:
```json
{
  "@context": "http://www.w3.org/ns/anno.jsonld",
  "id": "cca6740d-e584-4322-b517-67d92e0e508a",
  "type": "Annotation",
  "motivation": "classifying",
  "created": "2020-12-08T10:22:26.838154",
  "generator": {
    "id": "https://github.com/marijnkoolen/fuzzy-search",
    "type": "Software",
    "name": "FuzzySearcher"
  },
  "target": {
    "source": "urn:republic:3783_0076:page=151:para=4",
    "selector": {
      "type": "TextPositionSelector",
      "start": 0,
      "end": 8
    }
  },
  "body": {
    "type": "Dataset",
    "value": {
      "match_phrase": "Mercurii",
      "match_variant": "Mercurii",
      "match_string": "Mercuri:",
      "phrase_metadata": {
        "phrase": "Mercurii"
      }
    }
  }
}
```
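The same pattern extends to a batch of texts: pass each text as a dictionary with its own identifier and collect the annotations per match. A minimal sketch, where the identifier for the first text is made up for illustration:

```python
import json

# pair each text with an identifier (the id for text1 is a hypothetical example)
documents = [
    {"text": text1, "id": "urn:republic:3783_0076:page=151:para=3"},
    {"text": text2, "id": "urn:republic:3783_0076:page=151:para=4"},
]

# gather one Web Annotation per match, across all documents
annotations = []
for document in documents:
    for match in fuzzy_searcher.find_matches(document):
        annotations.append(match.as_web_anno())

# store them as a single JSON array
with open("annotations.json", "w") as fh:
    json.dump(annotations, fh, indent=2)
```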
## Documentation To Do

- adding variant phrases and distractors
- multiple searchers and searching in the context of other matches (see the sketch below)
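Until that documentation is written, the "multiple searchers" idea can be roughly sketched with the API shown above: separate searcher instances, each indexed with its own phrase model, run over the same text, with their results pooled afterwards. The phrase groupings below are illustrative only.

```python
# one searcher dedicated to attendance-related phrases (illustrative grouping)
attendance_searcher = FuzzyPhraseSearcher(config)
attendance_searcher.index_phrase_model(PhraseModel(phrases=["PRAESIDE", "PRAESENTIBUS"]))

# a separate searcher for date phrases
date_searcher = FuzzyPhraseSearcher(config)
date_searcher.index_phrase_model(PhraseModel(phrases=["den .. Januarii 1725"]))

# run both over the same text and pool the matches
all_matches = []
for searcher in (attendance_searcher, date_searcher):
    all_matches.extend(searcher.find_matches(text1))

for match in all_matches:
    print(match)
```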