ANYKS Language Model (ALM)
Requirements
Install PyBind11
$ python3 -m pip install pybind11
Description of Methods
Methods:
- idw - Word ID retrieval method
- idt - Token ID retrieval method
Example:
>>> import alm
>>> alm.idt("1424")
2
>>> alm.idt("hello")
0
>>> alm.idw("hello")
1794085167
>>> alm.idw("<s>")
1
>>> alm.idw("</s>")
19
>>> alm.idw("<unk>")
3
Description
Name | Description
〈s〉 | Sentence beginning token
〈/s〉 | Sentence end token
〈url〉 | URL-address token
〈num〉 | Number (Arabic or Roman) token
〈unk〉 | Unknown word token
〈date〉 | Date token (18.07.2004 ¦ 07/18/2004)
〈time〉 | Time token (15:44:56)
〈abbr〉 | Abbreviation token (1-й ¦ 2-е ¦ 20-я ¦ p.s ¦ p.s.)
〈anum〉 | Pseudo-number token (combination of numbers and other symbols) (T34 ¦ 895-M-86 ¦ 39km)
〈math〉 | Mathematical operation token (+ ¦ - ¦ = ¦ / ¦ * ¦ ^)
〈range〉 | Range of numbers token (1-2 ¦ 100-200 ¦ 300-400)
〈aprox〉 | Approximate number token (~93 ¦ 95.86 ¦ 1020)
〈score〉 | Score count token (4:3 ¦ 01:04)
〈dimen〉 | Dimensions token (200x300 ¦ 1920x1080)
〈fract〉 | Fraction token (5/20 ¦ 192/864)
〈punct〉 | Punctuation token (. ¦ ... ¦ , ¦ ! ¦ ? ¦ : ¦ ;)
〈specl〉 | Special character token (~ ¦ @ ¦ # ¦ № ¦ % ¦ & ¦ $ ¦ § ¦ © )
〈isolat〉 | Isolation/quotation token (" ¦ ' ¦ « ¦ » ¦ „ ¦ “ ¦ ` ¦ ( ¦ ) ¦ [ ¦ ] ¦ { ¦ })
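To make the numeric token classes in the table above more concrete, here is a minimal sketch that classifies a word with regular expressions. The patterns and the `classify` helper are hypothetical, inferred only from the example values in the table; ALM's actual tokenizer rules are not shown in this document and may differ.

```python
import re

# Hypothetical patterns inferred from the examples in the token table.
# They are NOT taken from ALM's implementation.
TOKEN_PATTERNS = [
    ("<date>", re.compile(r"\d{2}[./]\d{2}[./]\d{4}")),
    ("<time>", re.compile(r"\d{1,2}:\d{2}:\d{2}")),
    ("<score>", re.compile(r"\d+:\d+")),
    ("<dimen>", re.compile(r"\d+x\d+")),
    ("<fract>", re.compile(r"\d+/\d+")),
    ("<range>", re.compile(r"\d+-\d+")),
    ("<num>", re.compile(r"\d+")),
]


def classify(word):
    """Return the first token name whose pattern matches the whole word, else None."""
    for name, pattern in TOKEN_PATTERNS:
        if pattern.fullmatch(word):
            return name
    return None
```

Because every pattern must match the whole word, the order of the list does not matter for these examples; a real tokenizer would also need the Roman-numeral, abbreviation and pseudo-number cases.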
Methods:
- setZone - Method for setting the user zone
Example:
>>> import alm
>>> alm.setZone("com")
Methods:
- clear - Method for clearing all data
- setAlphabet - Method for setting the alphabet
- getAlphabet - Method for retrieving the alphabet
Example:
>>> import alm
>>> alm.getAlphabet()
'abcdefghijklmnopqrstuvwxyz'
>>> alm.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>> alm.getAlphabet()
'abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя'
>>> alm.clear()
>>> alm.getAlphabet()
'abcdefghijklmnopqrstuvwxyz'
Methods:
- setUnknown - Method for setting the unknown word
Example:
>>> import alm
>>> alm.setUnknown("word")
Methods:
- getUnknown - Method for retrieving the unknown word
Example:
>>> import alm
>>> alm.setUnknown("word")
>>> alm.getUnknown()
'word'
Methods:
- sentences - Sentences generation method
- readLM - Method for reading data from an ARPA file
- sentencesToFile - Method for assembling a specified number of sentences and writing to a file
Example:
>>> import alm
>>> def sentencesFn(text):
...     print("Sentences:", text)
...     return True
...
>>> alm.setOption(alm.options_t.confidence)
>>> alm.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>> alm.readLM('./lm.arpa')
>>> alm.sentences(sentencesFn)
Sentences: <s> В общем </s>
Sentences: <s> С лязгом выкатился и остановился возле мальчика </s>
Sentences: <s> У меня нет </s>
Sentences: <s> Я вообще не хочу </s>
Sentences: <s> Да и в общем </s>
Sentences: <s> Не могу </s>
Sentences: <s> Ну в общем </s>
Sentences: <s> Так что я вообще не хочу </s>
Sentences: <s> Потому что я вообще не хочу </s>
Sentences: <s> Продолжение следует </s>
Sentences: <s> Неожиданно из подворотни в олега ударил яркий прожектор патрульный трактор </s>
>>> alm.sentencesToFile(5, "./result.txt")
Methods:
- findNgram - N-gram search method in text
Example:
>>> import alm
>>> def callbackFn(text):
...     print(text)
...
>>> alm.setOption(alm.options_t.confidence)
>>> alm.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>> alm.readLM('./lm.arpa')
>>> alm.findNgram("Особое место занимает чудотворная икона Лобзание Христа Иудою", callbackFn)
<s> Особое
Особое место
место занимает
занимает чудотворная
чудотворная икона
икона Лобзание
Лобзание Христа
Христа Иудою
Иудою </s>
>>>
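The bigrams printed above follow a sliding-window pattern over the sentence padded with `<s>` and `</s>`. A minimal sketch of that windowing, assuming simple whitespace tokenization (the `ngrams` helper is hypothetical and is not part of ALM; `findNgram` itself reports the N-grams it finds in the loaded language model):

```python
def ngrams(text, n=2):
    """Yield space-joined n-grams of a whitespace-split sentence padded with <s> and </s>."""
    words = ["<s>"] + text.split() + ["</s>"]
    # Slide a window of n words across the padded sentence.
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])


if __name__ == "__main__":
    for gram in ngrams("Особое место занимает"):
        print(gram)
```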
Methods:
- setOption - Method for setting module options
Example:
>>> import alm
>>> alm.setOption(alm.options_t.debug)
>>> alm.setOption(alm.options_t.mixdicts)
>>> alm.setOption(alm.options_t.onlyGood)
>>> alm.setOption(alm.options_t.confidence)
Methods:
- unsetOption - Method for disabling a module option
Example:
>>> import alm
>>> alm.unsetOption(alm.options_t.debug)
>>> alm.unsetOption(alm.options_t.mixdicts)
>>> alm.unsetOption(alm.options_t.onlyGood)
>>> alm.unsetOption(alm.options_t.confidence)
Description
Name | Description
debug | Debug mode flag
mixdicts | Flag allowing the use of words consisting of mixed dictionaries
onlyGood | Flag restricting processing to words from the whitelist only
confidence | Flag for loading the ARPA file without pre-processing the words
Methods:
- size - Method for obtaining the N-gram size of the language model
Example:
>>> import alm
>>> alm.setOption(alm.options_t.confidence)
>>> alm.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>> alm.readLM('./lm.arpa')
>>> alm.size()
3
Methods:
- textToJson - Method to convert text to JSON
- isAllowApostrophe - Apostrophe permission check method
- switchAllowApostrophe - Method for permitting or denying an apostrophe as part of a word
Example:
>>> import alm
>>> def callbackFn(text):
...     print(text)
...
>>> alm.isAllowApostrophe()
False
>>> alm.switchAllowApostrophe()
>>> alm.isAllowApostrophe()
True
>>> alm.textToJson("«On nous dit qu'aujourd'hui c'est le cas, encore faudra-t-il l'évaluer» l'astronomie", callbackFn)
[["«","On","nous","dit","qu'aujourd'hui","c'est","le","cas",",","encore","faudra-t-il","l'évaluer","»","l'astronomie"]]
Methods:
- jsonToText - Method to convert JSON to text
Example:
>>> import alm
>>> def callbackFn(text):
...     print(text)
...
>>> alm.jsonToText('[["«","On","nous","dit","qu\'aujourd\'hui","c\'est","le","cas",",","encore","faudra-t-il","l\'évaluer","»","l\'astronomie"]]', callbackFn)
«On nous dit qu'aujourd'hui c'est le cas, encore faudra-t-il l'évaluer» l'astronomie
Methods:
- restore - Method for restoring text from context
Example:
>>> import alm
>>> alm.restore(["«","On","nous","dit","qu\'aujourd\'hui","c\'est","le","cas",",","encore","faudra-t-il","l\'évaluer","»","l\'astronomie"])
"«On nous dit qu'aujourd'hui c'est le cas, encore faudra-t-il l'évaluer» l'astronomie"
Methods:
- addBadword - Method for adding a word to the blacklist
- setBadwords - Method for setting the blacklist words
- getBadwords - Method for retrieving the blacklist words
Example:
>>> import alm
>>> alm.setBadwords(["hello", "world", "test"])
>>> alm.getBadwords()
{24227504, 1219922507, 1794085167}
>>> alm.addBadword("test2")
>>> alm.getBadwords()
{24227504, 5035487504, 1219922507, 1794085167}
Example:
>>> import alm
>>> alm.setBadwords({24227504, 1219922507, 1794085167})
>>> alm.getBadwords()
{24227504, 1219922507, 1794085167}
Methods:
- addGoodword - Method for adding a word to the whitelist
- setGoodwords - Method for setting the whitelist words
- getGoodwords - Method for retrieving the whitelist words
Example:
>>> import alm
>>> alm.setGoodwords(["hello", "world", "test"])
>>> alm.getGoodwords()
{24227504, 1219922507, 1794085167}
>>> alm.addGoodword("test2")
>>> alm.getGoodwords()
{24227504, 5035487504, 1219922507, 1794085167}
Example:
>>> import alm
>>> alm.setGoodwords({24227504, 1219922507, 1794085167})
>>> alm.getGoodwords()
{24227504, 1219922507, 1794085167}
Methods:
- setUserToken - Method for adding a user token
- getUserTokens - Method for retrieving the list of user tokens
- getUserTokenId - Method for obtaining a user token identifier
- getUserTokenWord - Method for obtaining a user token by its identifier
Example:
>>> import alm
>>> alm.setUserToken("usa")
>>> alm.setUserToken("russia")
>>> alm.getUserTokenId("usa")
4188610529
>>> alm.getUserTokenId("russia")
47207634939
>>> alm.getUserTokens()
['usa', 'russia']
>>> alm.getUserTokenWord(4188610529)
'usa'
>>> alm.getUserTokenWord(47207634939)
'russia'
Methods:
- setUserTokenMethod - Method for setting a custom token processing function
Example:
>>> import alm
>>> def fn(token, word):
...     if token and (token == "<usa>"):
...         if word and (word.lower() == "usa"):
...             return True
...     elif token and (token == "<russia>"):
...         if word and (word.lower() == "russia"):
...             return True
...     return False
...
>>> alm.setUserToken("usa")
>>> alm.setUserToken("russia")
>>> alm.setUserTokenMethod("usa", fn)
>>> alm.setUserTokenMethod("russia", fn)
>>> alm.idw("usa")
346562990
>>> alm.idw("russia")
3602214519
>>> alm.getUserTokenWord(346562990)
'usa'
>>> alm.getUserTokenWord(3602214519)
'russia'
Methods:
- setWordPreprocessingMethod - Method for setting the word preprocessing function
Example:
>>> import alm
>>> def run(word, context):
...     if word == "возле":
...         word = "около"
...     return word
...
>>> alm.setOption(alm.options_t.debug)
>>> alm.setOption(alm.options_t.confidence)
>>> alm.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>> alm.readLM('./lm.arpa')
>>> alm.setWordPreprocessingMethod(run)
>>> a = alm.perplexity("неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор???с лязгом выкатился и остановился возле мальчика....")
info: <s> Неожиданно из подворотни в олега ударил яркий прожектор патрульный трактор <punct> <punct> <punct> </s>
info: p( неожиданно | <s> ) = [2gram] 0.00250617 [ -2.60098900 ] / 0.99999999
info: p( из | неожиданно ...) = [3gram] 0.84584931 [ -0.07270700 ] / 1.00000081
info: p( подворотни | из ...) = [3gram] 0.73518561 [ -0.13360300 ] / 0.99999924
info: p( в | подворотни ...) = [3gram] 0.93193581 [ -0.03061400 ] / 0.99999960
info: p( олега | в ...) = [3gram] 0.72047846 [ -0.14237900 ] / 1.00000026
info: p( ударил | олега ...) = [3gram] 0.89971301 [ -0.04589600 ] / 1.00000043
info: p( яркий | ударил ...) = [3gram] 0.92987592 [ -0.03157500 ] / 0.99999918
info: p( прожектор | яркий ...) = [3gram] 0.92987592 [ -0.03157500 ] / 0.99999918
info: p( патрульный | прожектор ...) = [3gram] 0.92987592 [ -0.03157500 ] / 0.99999918
info: p( трактор | патрульный ...) = [3gram] 0.92987592 [ -0.03157500 ] / 0.99999918
info: p( <punct> | трактор ...) = [OOV] 0.00000000 [ -inf ] / 0.99999999
info: p( <punct> | <punct> ...) = [OOV] 0.00000000 [ -inf ] / 1.00000011
info: p( <punct> | <punct> ...) = [OOV] 0.00000000 [ -inf ] / 1.00000011
info: p( </s> | <punct> ...) = [1gram] 0.07816800 [ -1.10697100 ] / 1.00000011
info: 1 sentences, 13 words, 0 OOVs
info: 3 zeroprobs, logprob= -4.25945900 ppl= 2.01487019 ppl1= 2.12642805
info: <s> С лязгом выкатился и остановился около мальчика <punct> <punct> <punct> <punct> </s>
info: p( с | <s> ) = [2gram] 0.01301973 [ -1.88539800 ] / 0.99999999
info: p( лязгом | с ...) = [3gram] 0.21850984 [ -0.66052900 ] / 1.00000061
info: p( выкатился | лязгом ...) = [3gram] 0.92987592 [ -0.03157500 ] / 0.99999918
info: p( и | выкатился ...) = [3gram] 0.93211608 [ -0.03053000 ] / 0.99999926
info: p( остановился | и ...) = [3gram] 0.72065433 [ -0.14227300 ] / 0.99999975
info: p( около | остановился ...) = [1gram] 0.00003415 [ -4.46662200 ] / 1.00000027
info: p( мальчика | около ...) = [1gram] 0.00023364 [ -3.63146100 ] / 0.99999938
info: p( <punct> | мальчика ...) = [OOV] 0.00000000 [ -inf ] / 0.99999965
info: p( <punct> | <punct> ...) = [OOV] 0.00000000 [ -inf ] / 1.00000011
info: p( <punct> | <punct> ...) = [OOV] 0.00000000 [ -inf ] / 1.00000011
info: p( <punct> | <punct> ...) = [OOV] 0.00000000 [ -inf ] / 1.00000011
info: p( </s> | <punct> ...) = [1gram] 0.07816800 [ -1.10697100 ] / 1.00000011
info: 1 sentences, 11 words, 0 OOVs
info: 4 zeroprobs, logprob= -11.95535900 ppl= 9.91470774 ppl1= 12.21380039
>>> print(a.logprob)
-16.214818
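The per-sentence figures in the debug output above are consistent with computing perplexity from the base-10 `logprob` as `10 ** (-logprob / N)`, where `N` is `words + sentences` for `ppl` and `words` for `ppl1`. The sketch below reproduces those numbers; the formula is inferred from the printed values only, the function name is hypothetical, and ALM's internal handling of OOVs and zero-probability words is not covered.

```python
def perplexity_from_logprob(logprob, words, sentences):
    """Return (ppl, ppl1) for a base-10 log probability.

    Assumption inferred from the debug output: ppl normalises by
    words + sentences, ppl1 normalises by words only.
    """
    ppl = 10 ** (-logprob / (words + sentences))
    ppl1 = 10 ** (-logprob / words)
    return ppl, ppl1


if __name__ == "__main__":
    # Values copied from the two "info:" summary lines in the debug output.
    print(perplexity_from_logprob(-4.259459, 13, 1))
    print(perplexity_from_logprob(-11.955359, 11, 1))
```

The final `a.logprob` value of -16.214818 is the sum of the two per-sentence log probabilities.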
Methods:
- initScripts - Method for initializing Python scripts
- setWordScript - Method for setting the word processing script
- getWordScript - Method for retrieving the word processing script
- setUserTokenScript - Method for setting the user token processing script
- getUserTokenScript - Method for retrieving the user token processing script
Example:
>>> import alm
>>> alm.setUserToken("usa")
>>> alm.setUserToken("russia")
>>> alm.setUserTokenScript("./script1.py")
>>> alm.getUserTokenScript()
'./script1.py'
>>> alm.setWordScript("./script2.py")
>>> alm.getWordScript()
'./script2.py'
>>> alm.initScripts()
Python script format for preprocessing the received words:
# -*- coding: utf-8 -*-
def init():
    """
    Initialization method: runs only once at application startup
    """
def run(word, context):
    """
    Processing method: called when a word is extracted from the text
    @word    word for processing
    @context sequence of previous words as an array
    """
    return word
Python script format for defining word features:
# -*- coding: utf-8 -*-
def init():
    """
    Initialization method: runs only once at application startup
    """
def run(token, word):
    """
    Processing method: called when a word is extracted from the text
    @token word token name
    @word  word for processing
    """
    if token and (token == "<usa>"):
        if word and (word.lower() == "usa"): return "ok"
    elif token and (token == "<russia>"):
        if word and (word.lower() == "russia"): return "ok"
    return "no"
Methods:
- setLogfile - Method for setting the log output file
- setOOvFile - Method for setting the file for saving OOV words
Example:
>>> import alm
>>> alm.setLogfile("./log.txt")
>>> alm.setOOvFile("./oov.txt")
Methods:
- perplexity - Perplexity calculation method
- pplConcatenate - Method for combining perplexity results
- pplByFiles - Method for calculating perplexity over a file or group of files
Example:
>>> import alm
>>> alm.setOption(alm.options_t.confidence)
>>> alm.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>> alm.readLM('./lm.arpa')
>>> a = alm.perplexity("неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор???с лязгом выкатился и остановился возле мальчика....")
>>> print(a.logprob)
-8.238353
>>> print(a.oovs)
0
>>> print(a.words)
24
>>> print(a.sentences)
2
>>> print(a.zeroprobs)
7
>>> print(a.ppl)
2.135669866658319
>>> print(a.ppl1)
2.204269585673276
>>> b = alm.pplByFiles("./text.txt")
>>> c = alm.pplConcatenate(a, b)
>>> print(c.ppl)
7.384123548831112
Description
Name | Description
ppl | Perplexity value without considering the beginning of the sentence
ppl1 | Perplexity value taking into account the beginning of the sentence
oovs | Number of OOV (out-of-vocabulary) words
words | Number of words
logprob | Log probability of the word sequence
sentences | Number of sentences
zeroprobs | Number of zero-probability words
Methods:
- tokenization - Method for breaking text into tokens
Example:
>>> import alm
>>> def tokensFn(word, context, reset, stop):
...     print(word, " => ", context)
...     return True
...
>>> alm.switchAllowApostrophe()
>>> alm.tokenization("«On nous dit qu'aujourd'hui c'est le cas, encore faudra-t-il l'évaluer» l'astronomie", tokensFn)
« => []
On => ['«']
nous => ['«', 'On']
dit => ['«', 'On', 'nous']
qu'aujourd'hui => ['«', 'On', 'nous', 'dit']
c'est => ['«', 'On', 'nous', 'dit', "qu'aujourd'hui"]
le => ['«', 'On', 'nous', 'dit', "qu'aujourd'hui", "c'est"]
cas => ['«', 'On', 'nous', 'dit', "qu'aujourd'hui", "c'est", 'le']
, => ['«', 'On', 'nous', 'dit', "qu'aujourd'hui", "c'est", 'le', 'cas']
encore => ['«', 'On', 'nous', 'dit', "qu'aujourd'hui", "c'est", 'le', 'cas', ',']
faudra-t-il => ['«', 'On', 'nous', 'dit', "qu'aujourd'hui", "c'est", 'le', 'cas', ',', 'encore']
l'évaluer => ['«', 'On', 'nous', 'dit', "qu'aujourd'hui", "c'est", 'le', 'cas', ',', 'encore', 'faudra-t-il']
» => ['«', 'On', 'nous', 'dit', "qu'aujourd'hui", "c'est", 'le', 'cas', ',', 'encore', 'faudra-t-il', "l'évaluer"]
l'astronomie => ['«', 'On', 'nous', 'dit', "qu'aujourd'hui", "c'est", 'le', 'cas', ',', 'encore', 'faudra-t-il', "l'évaluer", '»']
Methods:
- fixUppers - Method for correcting letter case in the text
- fixUppersByFiles - Method for correcting letter case in a text file
Example:
>>> import alm
>>> alm.setOption(alm.options_t.confidence)
>>> alm.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>> alm.readLM('./lm.arpa')
>>> alm.fixUppers("неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор???с лязгом выкатился и остановился возле мальчика....")
'Неожиданно из подворотни в олега ударил яркий прожектор патрульный трактор??? С лязгом выкатился и остановился возле мальчика....'
>>> alm.fixUppersByFiles("./corpus", "./result.txt", "txt")
Methods:
- checkHypLat - Hyphen and latin character search method
Example:
>>> import alm
>>> alm.checkHypLat("Hello-World")
(True, True)
>>> alm.checkHypLat("Hello")
(False, True)
>>> alm.checkHypLat("Привет")
(False, False)
>>> alm.checkHypLat("так-как")
(True, False)
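The four results above suggest that the returned pair means (contains a hyphen, contains a Latin letter). A minimal pure-Python sketch of that reading follows; `check_hyp_lat` is a hypothetical stand-in written for illustration, not ALM's implementation.

```python
def check_hyp_lat(word):
    """Return (has_hyphen, has_latin) for a word.

    Assumption: "Latin" means the basic letters a-z in either case.
    """
    has_hyphen = "-" in word
    has_latin = any("a" <= ch <= "z" for ch in word.lower())
    return has_hyphen, has_latin
```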
Methods:
- getUppers - Method for extracting the letter case of each word
Example:
>>> import alm
>>> alm.setOption(alm.options_t.confidence)
>>> alm.readLM('./lm.arpa')
>>> alm.idw("Living")
48384019276
>>> alm.idw("in")
2833
>>> alm.idw("the")
175734
>>> alm.idw("USA")
147770
>>> alm.getUppers([48384019276, 2833, 175734, 147770])
[1, 0, 0, 7]
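The values 1 for "Living" and 7 for "USA" look like bitmasks in which bit i is set when the i-th letter is uppercase. This is an inference from the two non-zero values in the example, not something the document states. A sketch under that assumption, with the hypothetical helper `upper_mask` working on words rather than on word IDs:

```python
def upper_mask(word):
    """Return an integer whose bit i is set when word[i] is an uppercase letter."""
    mask = 0
    for i, ch in enumerate(word):
        if ch.isupper():
            mask |= 1 << i
    return mask


if __name__ == "__main__":
    print([upper_mask(w) for w in ["Living", "in", "the", "USA"]])
```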
Methods:
- urls - Method for extracting URL address coordinates in a string
Example:
>>> import alm
>>> alm.urls("This website: example.com was designed with ...")
{14: 25}
>>> alm.urls("This website: https://a.b.c.example.net?id=52#test-1 was designed with ...")
{14: 52}
>>> alm.urls("This website: https://a.b.c.example.net?id=52#test-1 and 127.0.0.1 was designed with ...")
{14: 52, 57: 66}
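The returned dictionary maps the start offset of each URL to the offset just past its end. A rough sketch that produces the same mapping for the three strings above uses a single regular expression. The pattern and the `find_urls` name are assumptions made for illustration; the pattern is far simpler than real URL detection and does not handle IPv6 or user-defined zones.

```python
import re

# Simplified, hypothetical pattern: optional scheme, then an IPv4 address
# or a dotted host name, then an optional path/query/fragment.
URL_PATTERN = re.compile(
    r"(?:https?://)?"
    r"(?:\d{1,3}(?:\.\d{1,3}){3}|(?:[a-z0-9-]+\.)+[a-z]{2,})"
    r"(?:[/?#]\S*)?",
    re.IGNORECASE,
)


def find_urls(text):
    """Return {start: end} character offsets for every match in text."""
    return {m.start(): m.end() for m in URL_PATTERN.finditer(text)}
```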
Methods:
- roman2Arabic - Method for translating Roman numerals to Arabic
Example:
>>> import alm
>>> alm.roman2Arabic("XVI")
16
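For readers unfamiliar with the conversion, the standard subtractive rule can be sketched in a few lines: a symbol is subtracted when it is smaller than the symbol that follows it, and added otherwise. The `roman_to_arabic` function below is an illustrative stand-alone version, not ALM's code, and it does not validate malformed numerals.

```python
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_arabic(numeral):
    """Convert a Roman numeral string to an integer using the subtractive rule."""
    values = [ROMAN_VALUES[ch] for ch in numeral.upper()]
    total = 0
    for i, value in enumerate(values):
        # A smaller value before a larger one is subtracted (e.g. IV = 4).
        if i + 1 < len(values) and value < values[i + 1]:
            total -= value
        else:
            total += value
    return total
```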
Methods:
- rest - Method for correction and detection of words with mixed alphabets
- setSubstitutes - Method for setting the letters used to correct words from mixed alphabets
- getSubstitutes - Method for retrieving the letters used to correct words from mixed alphabets
Example:
>>> import alm
>>> alm.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>> alm.setSubstitutes({'p':'р','c':'с','o':'о','t':'т','k':'к','e':'е','a':'а','h':'н','x':'х','b':'в','m':'м'})
>>> alm.getSubstitutes()
{'a': 'а', 'b': 'в', 'c': 'с', 'e': 'е', 'h': 'н', 'k': 'к', 'm': 'м', 'o': 'о', 'p': 'р', 't': 'т', 'x': 'х'}
>>> word = "ПPИBETИК"
>>> word.lower()
'пpиbetик'
>>> alm.rest(word)
'приветик'
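The idea behind the substitution table can be sketched without ALM: lowercase the text, then replace Latin look-alike letters with their Cyrillic counterparts in words that already contain Cyrillic letters. The helper names and the "only touch mixed words" rule below are assumptions for illustration. The Cyrillic letters are written as escapes because they are visually identical to the Latin keys.

```python
# Latin letter -> Cyrillic look-alike, mirroring the dictionary passed to
# setSubstitutes above (values written as escapes to keep them unambiguous).
SUBSTITUTES = str.maketrans({
    "p": "\u0440", "c": "\u0441", "o": "\u043e", "t": "\u0442",
    "k": "\u043a", "e": "\u0435", "a": "\u0430", "h": "\u043d",
    "x": "\u0445", "b": "\u0432", "m": "\u043c",
})


def has_cyrillic(word):
    """Return True when the word contains a character from the Cyrillic block."""
    return any("\u0400" <= ch <= "\u04ff" for ch in word)


def fix_mixed(text):
    """Lowercase text and replace Latin look-alikes in words that contain Cyrillic."""
    fixed = []
    for word in text.lower().split():
        fixed.append(word.translate(SUBSTITUTES) if has_cyrillic(word) else word)
    return " ".join(fixed)
```

Purely Latin words are left alone in this sketch, so ordinary English text is not rewritten.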
Methods:
- setTokensDisable - Method for setting the list of forbidden tokens
- setTokensUnknown - Method for setting the list of tokens cast to 〈unk〉
- setTokenDisable - Method for setting the list of unidentifiable tokens
- setTokenUnknown - Method for setting the list of tokens to be identified as 〈unk〉
- getTokensDisable - Method for retrieving the list of forbidden tokens
- getTokensUnknown - Method for retrieving the list of tokens cast to 〈unk〉
- setAllTokenDisable - Method for marking all tokens as unidentifiable
- setAllTokenUnknown - Method for marking all tokens as 〈unk〉
Example:
>>> import alm
>>> alm.idw("<date>")
6
>>> alm.idw("<time>")
7
>>> alm.idw("<abbr>")
5
>>> alm.idw("<math>")
9
>>> alm.setTokenDisable("date|time|abbr|math")
>>> alm.getTokensDisable()
{9, 5, 6, 7}
>>> alm.setTokensDisable({6, 7, 5, 9})
>>> alm.setTokenUnknown("date|time|abbr|math")
>>> alm.getTokensUnknown()
{9, 5, 6, 7}
>>> alm.setTokensUnknown({6, 7, 5, 9})
>>> alm.setAllTokenDisable()
>>> alm.getTokensDisable()
{2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18}
>>> alm.setAllTokenUnknown()
>>> alm.getTokensUnknown()
{2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18}
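The example shows two equivalent ways to name tokens: a `|`-separated string of token names and a set of numeric IDs. The relationship can be sketched as below, using only the four IDs printed by `alm.idw` in this example; the `TOKEN_IDS` table and `parse_token_names` helper are hypothetical and cover just those four tokens.

```python
# IDs copied from the alm.idw() calls in the example above.
TOKEN_IDS = {"abbr": 5, "date": 6, "time": 7, "math": 9}


def parse_token_names(spec):
    """Turn a string such as "date|time" into the matching set of token IDs."""
    return {TOKEN_IDS[name] for name in spec.split("|") if name}
```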
Methods:
- countAlphabet - Method for obtaining the number of letters in the alphabet
Example:
>>> import alm
>>> alm.getAlphabet()
'abcdefghijklmnopqrstuvwxyz'
>>> alm.countAlphabet()
26
>>> alm.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>> alm.countAlphabet()
59
Methods:
- countBigrams - Method for counting bigrams
- countTrigrams - Method for counting trigrams
- countGrams - Method for counting N-grams according to the language model size
Example:
>>> import alm
>>> alm.setOption(alm.options_t.confidence)
>>> alm.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>> alm.readLM('./lm.arpa')
>>> alm.countBigrams("неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор???с лязгом выкатился и остановился возле мальчика....")
12
>>> alm.countTrigrams("неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор???с лязгом выкатился и остановился возле мальчика....")
10
>>> alm.size()
3
>>> alm.countGrams("неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор???с лязгом выкатился и остановился возле мальчика....")
10
>>> alm.idw("неожиданно")
30444893210
>>> alm.idw("из")
4645
>>> alm.idw("подворотни")
7494072262
>>> alm.idw("в")
48
>>> alm.idw("Олега")
2431694341
>>> alm.idw("ударил")
54100711961
>>> alm.countBigrams([30444893210, 4645, 7494072262, 48, 2431694341, 54100711961])
5
>>> alm.countTrigrams([30444893210, 4645, 7494072262, 48, 2431694341, 54100711961])
4
>>> alm.countGrams([30444893210, 4645, 7494072262, 48, 2431694341, 54100711961])
4
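For a plain sequence of word IDs, the counts above match the usual window arithmetic: a sequence of k items contains k - n + 1 contiguous N-grams. A tiny sketch with the hypothetical helper `count_ngrams` is shown below. It does not reproduce the counts for raw text, where ALM also splits sentences and handles punctuation.

```python
def count_ngrams(sequence, n):
    """Return the number of contiguous n-grams in a sequence (0 if it is too short)."""
    return max(len(sequence) - n + 1, 0)


if __name__ == "__main__":
    ids = [30444893210, 4645, 7494072262, 48, 2431694341, 54100711961]
    print(count_ngrams(ids, 2), count_ngrams(ids, 3))
```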
Methods:
- arabic2Roman - Method for converting an Arabic number to a Roman numeral
Example:
>>> import alm
>>> alm.arabic2Roman(23)
'XXIII'
>>> alm.arabic2Roman("33")
'XXXIII'
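The reverse conversion is commonly done greedily: repeatedly take the largest Roman value that fits, including the subtractive pairs such as CM and IV. The sketch below is a stand-alone illustration, not ALM's implementation. Like the example above it accepts either an integer or a numeric string, and the 1..3999 limit is an assumption of this sketch.

```python
ROMAN_PAIRS = [
    (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
    (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
    (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
]


def arabic_to_roman(number):
    """Convert an integer (or numeric string) in 1..3999 to a Roman numeral."""
    number = int(number)
    if not 0 < number < 4000:
        raise ValueError("number must be between 1 and 3999")
    parts = []
    for value, symbol in ROMAN_PAIRS:
        # Take as many copies of this symbol as fit, then continue with the rest.
        count, number = divmod(number, value)
        parts.append(symbol * count)
    return "".join(parts)
```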
Methods:
- setLocale - Method for setting the locale (default: en_US.UTF-8)
Example:
>>> import alm
>>> alm.setLocale("ru_RU.UTF-8")
Methods:
- setThreads - Method for setting the number of threads (0 - all threads)
Example:
>>> import alm
>>> alm.setOption(alm.options_t.confidence)
>>> alm.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>> alm.readLM('./lm.arpa')
>>> alm.setThreads(3)
>>> a = alm.pplByFiles("./text.txt")
>>> print(a.logprob)
-48201.29481399994
Methods:
- fti - Method for removing the fractional part of a number
Example:
>>> import alm
>>> alm.fti(5892.4892)
5892489200000
>>> alm.fti(5892.4892, 4)
58924892
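Judging by the two results above, the number is scaled by a power of ten (nine digits when no second argument is given) and the remaining fractional part is dropped. The default of 9 is inferred from the first output, not stated in this document. A sketch of that behaviour using `decimal` to avoid binary floating-point drift, with a hypothetical function name:

```python
from decimal import Decimal


def fraction_to_int(number, digits=9):
    """Shift the decimal point right by `digits` places and truncate the remainder."""
    # str() gives the shortest decimal representation of the float,
    # so the shift is exact for inputs such as 5892.4892.
    return int(Decimal(str(number)).scaleb(digits))
```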
Methods:
- context - Method for assembling text context from a sequence
Example:
>>> import alm
>>> alm.setOption(alm.options_t.confidence)
>>> alm.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>> alm.readLM('./lm.arpa')
>>> alm.idw("неожиданно")
30444893210
>>> alm.idw("из")
4645
>>> alm.idw("подворотни")
7494072262
>>> alm.idw("в")
48
>>> alm.idw("Олега")
2431694341
>>> alm.idw("ударил")
54100711961
>>> alm.context([30444893210, 4645, 7494072262, 48, 2431694341, 54100711961])
'Неожиданно из подворотни в олега ударил'
Methods:
- findByFiles - Method for searching N-grams in a text file
Example:
>>> import alm
>>> alm.setOption(alm.options_t.debug)
>>> alm.setOption(alm.options_t.confidence)
>>> alm.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>> alm.readLM('./lm.arpa')
>>> alm.findByFiles("./text.txt", "./result.txt")
info: <s> Кукай
сари кукай
сари японские
японские каллиграфы
каллиграфы я
я постоянно
постоянно навещал
навещал их
их тайно
тайно от
от людей
людей </s>
info: <s> Неожиданно из
Неожиданно из подворотни
из подворотни в
подворотни в Олега
в Олега ударил
Олега ударил яркий
ударил яркий прожектор
яркий прожектор патрульный
прожектор патрульный трактор
патрульный трактор
<s> С лязгом
С лязгом выкатился
лязгом выкатился и
выкатился и остановился
и остановился возле
остановился возле мальчика
возле мальчика
Methods:
- checkSequence - Method for checking whether a sequence exists
- checkByFiles - Method for checking whether sequences from a text file exist
Example:
>>> import alm
>>> alm.setOption(alm.options_t.debug)
>>> alm.setOption(alm.options_t.confidence)
>>> alm.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>> alm.readLM('./lm.arpa')
>>> alm.checkSequence("Неожиданно из подворотни в олега ударил")
(True, 0)
>>> alm.checkSequence("<s> Сегодня сыграл и в Олега ударил яркий прожектор патрульный трактор с корпоративным сектором </s>")
(True, 0)
>>> alm.checkSequence("<s> Сегодня сыграл и в Олега ударил яркий прожектор патрульный трактор с корпоративным сектором </s>", True)
(False, 0)
>>> alm.checkSequence("<s> в Олега ударил яркий </s>")
(True, 0)
>>> alm.checkSequence("<s> в Олега ударил яркий </s>", True)
(True, 0)
>>> alm.checkSequence("от госсекретаря США")
(True, 7)
>>> alm.checkSequence("от госсекретаря США", True)
(False, 0)
>>> alm.idw("от")
5586
>>> alm.idw("госсекретаря")
10074609004
>>> alm.idw("США")
338449
>>> alm.checkSequence([5586, 10074609004, 338449])
(True, 7)
>>> alm.checkSequence([5586, 10074609004, 338449], True)
(False, 0)
>>> alm.checkByFiles("./text.txt", "./result.txt")
info: 1999 | YES | Какой-то период времени мы вообще не общались
info: 2000 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор.С лязгом выкатился и остановился возле мальчика.
info: 2001 | YES | Так как эти яйца жалко есть а хочется все больше любоваться их можно покрыть лаком даже прозрачным лаком для ногтей
info: 2002 | NO | кукай <unk> <unk> сари кукай <unk> <unk> сари японские каллиграфы я постоянно навещал их тайно от людей
info: 2003 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор???С лязгом выкатился и остановился возле мальчика....
info: 2004 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор?С лязгом выкатился и остановился возле мальчика.
info: 2005 | YES | Сегодня яичницей никто не завтракал как впрочем и вчера на ближайшем к нам рынке мы ели фруктовый салат со свежевыжатым соком как в старые добрые времена в Бразилии
info: 2006 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор!С лязгом выкатился и остановился возле мальчика.
info: 2007 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор.с лязгом выкатился и остановился возле мальчика.
All texts: 2007
Exists texts: 1359
Not exists texts: 648
>>> alm.checkByFiles("./corpus", "./result.txt", False, "txt")
info: 1999 | YES | Какой-то период времени мы вообще не общались
info: 2000 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор.С лязгом выкатился и остановился возле мальчика.
info: 2001 | YES | Так как эти яйца жалко есть а хочется все больше любоваться их можно покрыть лаком даже прозрачным лаком для ногтей
info: 2002 | NO | кукай <unk> <unk> сари кукай <unk> <unk> сари японские каллиграфы я постоянно навещал их тайно от людей
info: 2003 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор???С лязгом выкатился и остановился возле мальчика....
info: 2004 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор?С лязгом выкатился и остановился возле мальчика.
info: 2005 | YES | Сегодня яичницей никто не завтракал как впрочем и вчера на ближайшем к нам рынке мы ели фруктовый салат со свежевыжатым соком как в старые добрые времена в Бразилии
info: 2006 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор!С лязгом выкатился и остановился возле мальчика.
info: 2007 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор.с лязгом выкатился и остановился возле мальчика.
All texts: 2007
Exists texts: 1359
Not exists texts: 648
>>> alm.checkByFiles("./corpus", "./result.txt", True, "txt")
info: 2000 | NO | Так как эти яйца жалко есть а хочется все больше любоваться их можно покрыть лаком даже прозрачным лаком для ногтей
info: 2001 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор.С лязгом выкатился и остановился возле мальчика.
info: 2002 | NO | Сегодня яичницей никто не завтракал как впрочем и вчера на ближайшем к нам рынке мы ели фруктовый салат со свежевыжатым соком как в старые добрые времена в Бразилии
info: 2003 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор!С лязгом выкатился и остановился возле мальчика.
info: 2004 | NO | кукай <unk> <unk> сари кукай <unk> <unk> сари японские каллиграфы я постоянно навещал их тайно от людей
info: 2005 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор?С лязгом выкатился и остановился возле мальчика.
info: 2006 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор???С лязгом выкатился и остановился возле мальчика....
info: 2007 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор.с лязгом выкатился и остановился возле мальчика.
All texts: 2007
Exists texts: 0
Not exists texts: 2007
Methods:
- check - String check method
- match - String matching method
- setAbbr - Method for setting an abbreviation
Example:
>>> import alm
>>> alm.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>> alm.setSubstitutes({'p':'р','c':'с','o':'о','t':'т','k':'к','e':'е','a':'а','h':'н','x':'х','b':'в','m':'м'})
>>> alm.check("Дом-2", alm.check_t.home2)
True
>>> alm.check("Дом2", alm.check_t.home2)
False
>>> alm.check("Дом-2", alm.check_t.latian)
False
>>> alm.check("Hello", alm.check_t.latian)
True
>>> alm.check("прiвет", alm.check_t.latian)
True
>>> alm.check("Дом-2", alm.check_t.hyphen)
True
>>> alm.check("Дом2", alm.check_t.hyphen)
False
>>> alm.check("Д", alm.check_t.letter)
True
>>> alm.check("$", alm.check_t.letter)
False
>>> alm.check("-", alm.check_t.letter)
False
>>> alm.check("просtоквaшино", alm.check_t.similars)
True
>>> alm.match("my site http://example.ru, it's true", alm.match_t.url)
True
>>> alm.match("по вашему ip адресу 46.40.123.12 проводится проверка", alm.match_t.url)
True
>>> alm.match("мой адрес в формате IPv6: http://[2001:0db8:11a3:09d7:1f34:8a2e:07a0:765d]/", alm.match_t.url)
True
>>> alm.match("13-я", alm.match_t.abbr)
True
>>> alm.match("13-я-й", alm.match_t.abbr)
False
>>> alm.match("т.д", alm.match_t.abbr)
True
>>> alm.match("т.п.", alm.match_t.abbr)
True
>>> alm.match("С.Ш.А.", alm.match_t.abbr)
True
>>> alm.setAbbr("сша")
>>> alm.match("США", alm.match_t.abbr)
True
>>> alm.match("Hello", alm.match_t.latian)
True
>>> alm.match("прiвет", alm.match_t.latian)
False
>>> alm.match("23424", alm.match_t.number)
True
>>> alm.match("hello", alm.match_t.number)
False
>>> alm.match("23424.55", alm.match_t.number)
False
>>> alm.match("23424", alm.match_t.decimal)
False
>>> alm.match("23424.55", alm.match_t.decimal)
True
>>> alm.match("23424,55", alm.match_t.decimal)
True
>>> alm.match("-23424.55", alm.match_t.decimal)
True
>>> alm.match("+23424.55", alm.match_t.decimal)
True
>>> alm.match("+23424.55", alm.match_t.anumber)
True
>>> alm.match("15T-34", alm.match_t.anumber)
True
>>> alm.match("hello", alm.match_t.anumber)
False
>>> alm.match("hello", alm.match_t.allowed)
True
>>> alm.match("évaluer", alm.match_t.allowed)
False
>>> alm.match("13", alm.match_t.allowed)
True
>>> alm.match("Hello-World", alm.match_t.allowed)
True
>>> alm.match("Hello", alm.match_t.math)
False
>>> alm.match("+", alm.match_t.math)
True
>>> alm.match("=", alm.match_t.math)
True
>>> alm.match("Hello", alm.match_t.upper)
True
>>> alm.match("hello", alm.match_t.upper)
False
>>> alm.match("hellO", alm.match_t.upper)
False
>>> alm.match("a", alm.match_t.punct)
False
>>> alm.match(",", alm.match_t.punct)
True
>>> alm.match(" ", alm.match_t.space)
True
>>> alm.match("a", alm.match_t.space)
False
>>> alm.match("a", alm.match_t.special)
False
>>> alm.match("±", alm.match_t.special)
True
>>> alm.match("[", alm.match_t.isolation)
True
>>> alm.match("a", alm.match_t.isolation)
False
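The `number` and `decimal` results above can be reproduced with two whole-string regular expressions: integers are digits only, and decimals have an optional sign, digits, a dot or comma, and more digits. The sketch below is hypothetical and covers only these two match types; it is not ALM's matching code.

```python
import re

# Hypothetical patterns that agree with the number/decimal examples above.
NUMBER = re.compile(r"\d+")
DECIMAL = re.compile(r"[+-]?\d+[.,]\d+")


def is_number(text):
    """True when the whole string consists of digits only."""
    return NUMBER.fullmatch(text) is not None


def is_decimal(text):
    """True for an optionally signed number with a '.' or ',' fractional part."""
    return DECIMAL.fullmatch(text) is not None
```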
Methods:
- delInText - Method for deleting letters in text
Example:
>>> import alm
>>> alm.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>> alm.delInText("неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор??? с лязгом выкатился и остановился возле мальчика....", alm.wdel_t.punct)
'неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор с лязгом выкатился и остановился возле мальчика'
>>> alm.delInText("hello-world-hello-world", alm.wdel_t.hyphen)
'helloworldhelloworld'
>>> alm.delInText("неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор??? с лязгом выкатился и остановился возле мальчика....", alm.wdel_t.broken)
'неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор с лязгом выкатился и остановился возле мальчика'
>>> alm.delInText("«On nous dit qu'aujourd'hui c'est le cas, encore faudra-t-il l'évaluer» l'astronomie", alm.wdel_t.broken)
'On nous dit quaujourdhui cest le cas encore faudra-t-il lvaluer lastronomie'
Methods:
- countsByFiles - Method for counting the number of n-grams in a text file
Example:
>>> import alm
>>> alm.setOption(alm.options_t.debug)
>>> alm.setOption(alm.options_t.confidence)
>>> alm.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>> alm.readLM('./lm.arpa')
>>> alm.countsByFiles("./text.txt", "./result.txt", 3)
info: 0 | Сегодня яичницей никто не завтракал как впрочем и вчера на ближайшем к нам рынке мы ели фруктовый салат со свежевыжатым соком как в старые добрые времена в Бразилии
info: 10 | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор?С лязгом выкатился и остановился возле мальчика.
info: 10 | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор!С лязгом выкатился и остановился возле мальчика.
info: 0 | Так как эти яйца жалко есть а хочется все больше любоваться их можно покрыть лаком даже прозрачным лаком для ногтей
info: 10 | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор???С лязгом выкатился и остановился возле мальчика....
Counts 3grams: 471
>>> alm.countsByFiles("./corpus", "./result.txt", 2, "txt")
info: 19 | Так как эти яйца жалко есть а хочется все больше любоваться их можно покрыть лаком даже прозрачным лаком для ногтей
info: 12 | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор.с лязгом выкатился и остановился возле мальчика.
info: 12 | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор!С лязгом выкатился и остановился возле мальчика.
info: 10 | кукай <unk> <unk> сари кукай <unk> <unk> сари японские каллиграфы я постоянно навещал их тайно от людей
info: 12 | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор???С лязгом выкатился и остановился возле мальчика....
info: 12 | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор?С лязгом выкатился и остановился возле мальчика.
info: 27 | Сегодня яичницей никто не завтракал как впрочем и вчера на ближайшем к нам рынке мы ели фруктовый салат со свежевыжатым соком как в старые добрые времена в Бразилии
Counts 2grams: 20270
Description
N-gram size | Description
1 | language model size
2 | bigram
3 | trigram