Finding sentences from raw text and tokenizing
Project description
Tokenizer:
Transforms text data into structured output.
API:
Four functions:
tokenize_sentence

[tokens, start_indexes, end_indexes] = tokenize_sentence(sentence, language)

sentence: a string containing one sentence, e.g. 'Hallo, wie geht es dir?'
tokens: list of strings, e.g. ['Hallo', ',', 'wie', 'geht', 'es', 'dir', '?']
start_indexes: list of the indexes of the first letter of each token: [0, 5, 7, 11, 16, 19, 22]
end_indexes: list of the indexes of the last letter of each token: [4, 5, 9, 14, 17, 21, 22]
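The signature above can be sketched as follows. This is a minimal illustration using a regular expression, not the package's actual implementation; the `language` parameter is accepted but unused here.

```python
import re

def tokenize_sentence(sentence, language='DE'):
    """Sketch: split a sentence into word and punctuation tokens and
    record the (inclusive) start/end index of each token.  The real
    package may apply language-specific rules via `language`."""
    tokens, start_indexes, end_indexes = [], [], []
    # a token is either a run of word characters or a single punctuation mark
    for match in re.finditer(r"\w+|[^\w\s]", sentence, re.UNICODE):
        tokens.append(match.group())
        start_indexes.append(match.start())
        end_indexes.append(match.end() - 1)  # index of the last letter
    return tokens, start_indexes, end_indexes

tokens, starts, ends = tokenize_sentence('Hallo, wie geht es dir?')
# tokens == ['Hallo', ',', 'wie', 'geht', 'es', 'dir', '?']
```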
find_sentences

sentence_list = find_sentences(raw_string)

raw_string = ' Hallo wie geht es dir? Heute ist ein schöner Tag! Das sehe ich auch so.'
sentence_list = ['Hallo wie geht es dir?', 'Heute ist ein schöner Tag!', 'Das sehe ich auch so.']
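A minimal sketch of this behaviour, splitting at '.', '!' and '?' and trimming surrounding whitespace. The real package may handle abbreviations, quotes, and other edge cases differently.

```python
import re

def find_sentences(raw_string):
    """Sketch: extract sentences as runs of text ending in ., ! or ?
    (hypothetical splitting rule, for illustration only)."""
    sentences = re.findall(r'[^.!?]+[.!?]', raw_string)
    return [s.strip() for s in sentences]

find_sentences(' Hallo wie geht es dir? Heute ist ein schöner Tag! Das sehe ich auch so.')
# ['Hallo wie geht es dir?', 'Heute ist ein schöner Tag!', 'Das sehe ich auch so.']
```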
spellcheck

word_correct = spellcheck(word, language_dict, lower_case_language_dict)

word_correct = 'ich'
word = 'ick'
language_dict = {'ich': 10, 'heisse': 4, 'Hans': 3, 'Hanf': 1, 'gehen': 2}
lower_case_language_dict = {'ich': 10, 'heisse': 4, 'hans': 3, 'hanf': 1, 'gehen': 2}
(the values of the language dicts are occurrence counts)
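One way the example above could work is a similarity search over the dictionary keys, breaking ties by occurrence count. This sketch uses `difflib` as a stand-in; the package's actual matching algorithm is not documented here.

```python
import difflib

def spellcheck(word, language_dict, lower_case_language_dict):
    """Sketch: return the word unchanged if it is known, otherwise the
    most similar known word, preferring higher occurrence counts.
    (difflib-based matching is an assumption for illustration.)"""
    if word in language_dict or word.lower() in lower_case_language_dict:
        return word
    candidates = difflib.get_close_matches(
        word.lower(), list(lower_case_language_dict), n=3)
    if not candidates:
        return word  # no plausible correction found
    # among similar candidates, pick the most frequent one
    return max(candidates, key=lambda c: lower_case_language_dict[c])

language_dict = {'ich': 10, 'heisse': 4, 'Hans': 3, 'Hanf': 1, 'gehen': 2}
lower_case_language_dict = {'ich': 10, 'heisse': 4, 'hans': 3, 'hanf': 1, 'gehen': 2}
spellcheck('ick', language_dict, lower_case_language_dict)
# 'ich'
```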
determine_language

probable_lang = determine_language(word_list, language_col_dict)

probable_lang = 'DE'
word_list = ['ich']
language_col_dict = {'DE': DE, 'EN': EN}
(where DE and EN are language dictionaries of the respective languages)
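A simple scoring scheme consistent with the example above: count how many words from the list appear in each language dictionary and return the best-scoring language. This is a hypothetical sketch; the package may weight matches by occurrence counts instead.

```python
def determine_language(word_list, language_col_dict):
    """Sketch: score each language by the number of words from
    word_list found in its dictionary; return the highest scorer.
    (Plain hit-counting is an assumption for illustration.)"""
    def score(lang_dict):
        return sum(1 for w in word_list if w in lang_dict)
    return max(language_col_dict,
               key=lambda lang: score(language_col_dict[lang]))

DE = {'ich': 10, 'heisse': 4, 'gehen': 2}   # toy dictionaries
EN = {'i': 12, 'am': 5, 'go': 3}
determine_language(['ich'], {'DE': DE, 'EN': EN})
# 'DE'
```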
ToDo:
Project details
Download files
Source Distribution
Built Distribution
Hashes for tokenizer_cstm-0.5-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | cef73ba6a605decdfe1a59d7d1fb9799195e7e748c5f5b9e628808ac2de9f66a
MD5 | a450862076ba7f56cb330417b38df422
BLAKE2b-256 | 04c2571341641e42f39e92efad85e174fce20596a5a1498b59bdd4639c0e539f