Finding sentences from raw text and tokenizing
Project description
Tokenizer:
Transforms text data into structured output.
API:
Four functions:
tokenize_sentence

[tokens, start_indexes, end_indexes] = tokenize_sentence(sentence, language)

sentence: a string containing one sentence, e.g. 'Hallo, wie geht es dir?'
tokens: list of strings, e.g. ['Hallo', ',', 'wie', 'geht', 'es', 'dir', '?']
start_indexes: list of the indexes of the first letter of each token: [0, 5, 7, 11, 16, 19, 22]
end_indexes: list of the indexes of the last letter of each token: [4, 5, 9, 14, 17, 21, 22]
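The signature above can be sketched as follows. This is a minimal illustration using a regular expression, not the package's actual implementation; the `language` parameter is accepted but unused here.

```python
import re

def tokenize_sentence(sentence, language='DE'):
    """Sketch: split a sentence into word and punctuation tokens and
    record the (inclusive) start/end index of each token.  The real
    package may apply language-specific rules via `language`."""
    tokens, start_indexes, end_indexes = [], [], []
    # a token is either a run of word characters or a single punctuation mark
    for match in re.finditer(r"\w+|[^\w\s]", sentence, re.UNICODE):
        tokens.append(match.group())
        start_indexes.append(match.start())
        end_indexes.append(match.end() - 1)  # index of the last letter
    return tokens, start_indexes, end_indexes

tokens, starts, ends = tokenize_sentence('Hallo, wie geht es dir?')
# tokens == ['Hallo', ',', 'wie', 'geht', 'es', 'dir', '?']
```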
find_sentences

sentence_list = find_sentences(raw_string)

raw_string = ' Hallo wie geht es dir? Heute ist ein schöner Tag! Das sehe ich auch so.'
sentence_list = ['Hallo wie geht es dir?', 'Heute ist ein schöner Tag!', 'Das sehe ich auch so.']
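A minimal sketch of this behaviour, splitting at '.', '!' and '?' and trimming surrounding whitespace. The real package may handle abbreviations, quotes, and other edge cases differently.

```python
import re

def find_sentences(raw_string):
    """Sketch: extract sentences as runs of text ending in ., ! or ?
    (hypothetical splitting rule, for illustration only)."""
    sentences = re.findall(r'[^.!?]+[.!?]', raw_string)
    return [s.strip() for s in sentences]

find_sentences(' Hallo wie geht es dir? Heute ist ein schöner Tag! Das sehe ich auch so.')
# ['Hallo wie geht es dir?', 'Heute ist ein schöner Tag!', 'Das sehe ich auch so.']
```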
spellcheck

word_correct = spellcheck(word, language_dict, lower_case_language_dict)

word_correct = 'ich'
word = 'ick'
language_dict = {'ich': 10, 'heisse': 4, 'Hans': 3, 'Hanf': 1, 'gehen': 2}
lower_case_language_dict = {'ich': 10, 'heisse': 4, 'hans': 3, 'hanf': 1, 'gehen': 2}
(the values of the language dicts are occurrence counts)
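One way the example above could work is a similarity search over the dictionary keys, breaking ties by occurrence count. This sketch uses `difflib` as a stand-in; the package's actual matching algorithm is not documented here.

```python
import difflib

def spellcheck(word, language_dict, lower_case_language_dict):
    """Sketch: return the word unchanged if it is known, otherwise the
    most similar known word, preferring higher occurrence counts.
    (difflib-based matching is an assumption for illustration.)"""
    if word in language_dict or word.lower() in lower_case_language_dict:
        return word
    candidates = difflib.get_close_matches(
        word.lower(), list(lower_case_language_dict), n=3)
    if not candidates:
        return word  # no plausible correction found
    # among similar candidates, pick the most frequent one
    return max(candidates, key=lambda c: lower_case_language_dict[c])

language_dict = {'ich': 10, 'heisse': 4, 'Hans': 3, 'Hanf': 1, 'gehen': 2}
lower_case_language_dict = {'ich': 10, 'heisse': 4, 'hans': 3, 'hanf': 1, 'gehen': 2}
spellcheck('ick', language_dict, lower_case_language_dict)
# 'ich'
```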
determine_language

probable_lang = determine_language(word_list, language_col_dict)

probable_lang = 'DE'
word_list = ['ich']
language_col_dict = {'DE': DE, 'EN': EN}
(where DE and EN are language dictionaries of the respective languages)
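A simple scoring scheme consistent with the example above: count how many words from the list appear in each language dictionary and return the best-scoring language. This is a hypothetical sketch; the package may weight matches by occurrence counts instead.

```python
def determine_language(word_list, language_col_dict):
    """Sketch: score each language by the number of words from
    word_list found in its dictionary; return the highest scorer.
    (Plain hit-counting is an assumption for illustration.)"""
    def score(lang_dict):
        return sum(1 for w in word_list if w in lang_dict)
    return max(language_col_dict,
               key=lambda lang: score(language_col_dict[lang]))

DE = {'ich': 10, 'heisse': 4, 'gehen': 2}   # toy dictionaries
EN = {'i': 12, 'am': 5, 'go': 3}
determine_language(['ich'], {'DE': DE, 'EN': EN})
# 'DE'
```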
ToDo:
Project details
Download files
Source Distribution
Built Distribution
Hashes for tokenizer_cstm-0.5-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | cef73ba6a605decdfe1a59d7d1fb9799195e7e748c5f5b9e628808ac2de9f66a
MD5 | a450862076ba7f56cb330417b38df422
BLAKE2b-256 | 04c2571341641e42f39e92efad85e174fce20596a5a1498b59bdd4639c0e539f