Tamil spell checker

The idea for building a simple Tamil spell checker came from a conversation with T Shrinivasan of the open-tamil team.

Tamil Spell Checker uses the following approach to suggest alternative spellings for a word:

  • Check whether the word is a valid Tamil word using a Bloom filter.

  • If it is not a Tamil word, use Levenshtein distance (edit distance of 2) to suggest words.
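
As a rough illustration of the second step, the sketch below computes the Levenshtein distance between two sequences and keeps the dictionary words that lie within a given distance. The names `levenshtein` and `suggest` and the tiny word list are hypothetical and are not part of this package. Comparing raw code points instead of Tamil letters, and treating "edit distance of 2" as "at most 2", are simplifying assumptions.

```python
def levenshtein(a, b):
    """Return the Levenshtein (edit) distance between sequences a and b."""
    # previous[j] holds the distance between a[:i-1] and b[:j].
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def suggest(word, vocabulary, max_distance=2):
    """Return vocabulary words within max_distance edits, closest first."""
    scored = [(levenshtein(word, w), w) for w in vocabulary]
    return [w for d, w in sorted(scored) if d <= max_distance]


# Hypothetical two-word vocabulary, for illustration only.
print(suggest("மேக்ம்", ["மேகம்", "தமிழ்நாடு"]))
```

Scanning the whole vocabulary like this is the exhaustive approach; it is simple but grows linearly with the size of the word list.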

Project Madurai Crawler

Project Madurai has a good collection of Tamil works. Use the Project Madurai Crawler to generate a list of unique Tamil words.

To run it, use the command below: `python ProjectMaduraiCrawler.py`
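
To give a feel for what building a unique word list involves, here is a minimal sketch that extracts Tamil-script tokens from a piece of text. It is not the crawler's actual code: the function name `unique_tamil_words` and the sample text are made up for illustration, and matching the whole Tamil Unicode block (which also contains Tamil digits and symbols) is a simplifying assumption.

```python
import re

# Runs of code points from the Tamil Unicode block (U+0B80 to U+0BFF).
# Assumption: anything in this block counts as part of a word.
TAMIL_WORD = re.compile(r"[\u0B80-\u0BFF]+")


def unique_tamil_words(text):
    """Return the sorted list of distinct Tamil-script tokens in text."""
    return sorted(set(TAMIL_WORD.findall(text)))


# Hypothetical sample text mixing Tamil, Latin letters and digits.
sample = "மேகம் cloud மேகம், தமிழ்நாடு 2020"
print(unique_tamil_words(sample))
```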

Create Bloom Filter File

A Bloom filter is a space-efficient, computationally cheap probabilistic data structure designed to tell whether an item is present in a set or not. It can report false positives but never false negatives. More information on Bloom filters can be found on [Wikipedia](https://en.wikipedia.org/wiki/Bloom_filter).

  • The spell checker uses a Bloom filter to check whether a word is a valid Tamil word.

  • The Bloom filter data structure file has to be created before it can be used to check the validity of a word.

To generate the Bloom filter file, use the command below:

` python TamilBloomFilterCreator.py `
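
For readers new to the data structure, the following toy Bloom filter shows how membership checks work. It is only a sketch and makes no claim about how `TamilBloomFilterCreator.py` stores its data: the class `SimpleBloomFilter`, the bit-array size, the number of hashes and the use of salted SHA-256 are all assumptions chosen for readability.

```python
import hashlib


class SimpleBloomFilter:
    """A toy Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(item))


words = SimpleBloomFilter()
for w in ["மேகம்", "தமிழ்நாடு"]:
    words.add(w)
print("மேகம்" in words)  # an added word is always reported as present
```

A real filter for a word list of this size would pack bits tightly and choose the bit count and hash count from the expected number of words and the acceptable false-positive rate.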

Sample code to check whether a word is a valid Tamil word

from TamilwordChecker import TamilwordChecker
unique_word_count = 2043478
tamilwordchecker = TamilwordChecker(unique_word_count, "tamil_bloom_filter.txt")
print(tamilwordchecker.tamil_word_exists("மேகம்"))

Sample code to get spell check corrections

from TamilSpellingAutoCorrect import TamilSpellingAutoCorrect
spellchecker = TamilSpellingAutoCorrect("tamil_bloom_filter.txt", "tamilwordlist.txt")
from_spell_checker_list = spellchecker.tamil_correct_spelling("மேக்ம்")
print(from_spell_checker_list)

Norvig Algorithm

The Norvig algorithm can run faster than the exhaustive search method. You can use it as follows:

from tamilspellchecker.TamilSpellingAutoCorrect import TamilSpellingAutoCorrect, get_data
from pprint import pprint
from tamil.utf8 import get_letters
spellchecker = TamilSpellingAutoCorrect(get_data("tamil_bloom_filter.txt"), get_data("tamilwordlist.txt"))
results = spellchecker.tamil_Norvig_correct_spelling("தமிழ்னாடு")  # the correct word is தமிழ்நாடு
results = list(filter(lambda x: len(get_letters(x)) >= 4, results))  # keep words with at least 4 letters
results = list(filter(lambda x: len(get_letters(x)) <= 6, results))  # and at most 6 letters
pprint(results)
assert 'தமிழ்நாடு' in results
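
The speed-up in Norvig's approach comes from generating candidate words from the misspelling itself (deletes, transposes, replaces and inserts) and keeping only those that are known, instead of measuring the distance to every word in the list. The sketch below illustrates the idea at edit distance 1 and may differ from what `tamil_Norvig_correct_spelling` does internally. The helpers `split_letters`, `edits1` and `candidates` are hypothetical; `split_letters` is a rough stand-in for open-tamil's `get_letters`, and the alphabet is assumed to be the letters seen in the known words.

```python
import unicodedata


def split_letters(word):
    """Group each base character with its following combining marks.

    Rough stand-in for open-tamil's get_letters; assumed adequate for
    simple words only.
    """
    letters = []
    for ch in word:
        if letters and unicodedata.category(ch).startswith("M"):
            letters[-1] += ch  # vowel sign or pulli joins the previous letter
        else:
            letters.append(ch)
    return letters


def edits1(letters, alphabet):
    """Return all strings one letter-level edit away from letters."""
    splits = [(letters[:i], letters[i:]) for i in range(len(letters) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + [r[1], r[0]] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + [c] + r[1:] for l, r in splits if r for c in alphabet]
    inserts = [l + [c] + r for l, r in splits for c in alphabet]
    return {"".join(w) for w in deletes + transposes + replaces + inserts}


def candidates(word, known_words):
    """Return the word itself if known, else known words one edit away."""
    if word in known_words:
        return {word}
    # Assumed alphabet: every letter that occurs in the known words.
    alphabet = sorted({l for w in known_words for l in split_letters(w)})
    return edits1(split_letters(word), alphabet) & known_words


# Hypothetical two-word dictionary, for illustration only.
known = {"மேகம்", "தமிழ்நாடு"}
print(candidates("தமிழ்னாடு", known))
```

The number of generated candidates depends on the word length and alphabet size rather than on the dictionary size, which is why this can beat an exhaustive scan on a large word list.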

Accuracy Issues

The accuracy of TamilwordChecker depends on the list of unique words in tamilwordlist.txt. More unique words from other sources need to be added.
