Skip to main content
Join the official 2020 Python Developers SurveyStart the survey!

Smart language model

Project description

ANYKS Language Model (ALM)

Requirements

Install PyBind11

$ python3 -m pip install pybind11

Description of Methods

Methods:

  • idw - Word ID retrieval method
  • idt - Token ID retrieval method
  • ids - Sequence ID retrieval method

Example:

>>> import alm
>>> alm.idw("hello")
1794085167
>>> alm.idw("<s>")
1
>>> alm.idw("</s>")
19
>>> alm.idw("<unk>")
3
>>> alm.idt("1424")
2
>>> alm.idt("hello")
0
>>> alm.idw("Living")
48384019276
>>> alm.idw("in")
2833
>>> alm.idw("the")
175734
>>> alm.idw("USA")
147770
>>> alm.ids([48384019276, 2833, 175734, 147770])
2514129976

Description

Name Description
〈s〉 Sentence beginning token
〈/s〉 Sentence end token
〈url〉 URL-address token
〈num〉 Number (arabic or roman) token
〈unk〉 Unknown word token
〈date〉 Date token (18.07.2004 ¦ 07/18/2004)
〈time〉 Time token (15:44:56)
〈abbr〉 Abbreviation token (1-й ¦ 2-е ¦ 20-я ¦ p.s ¦ p.s.)
〈anum〉 Pseudo-number token (combination of numbers and other symbols) (T34 ¦ 895-M-86 ¦ 39km)
〈math〉 Mathematical operation token (+ ¦ - ¦ = ¦ / ¦ * ¦ ^)
〈range〉 Range of numbers token (1-2 ¦ 100-200 ¦ 300-400)
〈aprox〉 Approximate number token (~93 ¦ 95.86 ¦ 1020)
〈score〉 Score count token (4:3 ¦ 01:04)
〈dimen〉 Dimensions token (200x300 ¦ 1920x1080)
〈fract〉 Fraction token (5/20 ¦ 192/864)
〈punct〉 Punctuation token (. ¦ ... ¦ , ¦ ! ¦ ? ¦ : ¦ ;)
〈specl〉 Special character token (~ ¦ @ ¦ # ¦ № ¦ % ¦ & ¦ $ ¦ § ¦ © )
〈isolat〉 Isolation/quotation token (" ¦ ' ¦ « ¦ » ¦ „ ¦ “ ¦ ` ¦ ( ¦ ) ¦ [ ¦ ] ¦ { ¦ })

Methods:

  • setZone - User zone set method

Example:

>>> import alm
>>> alm.setZone("com")

Methods:

  • clear - Method clear all data
  • setAlphabet - Method set alphabet
  • getAlphabet - Method get alphabet

Example:

>>> import alm
>>> alm.getAlphabet()
'abcdefghijklmnopqrstuvwxyz'
>>> alm.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>> alm.getAlphabet()
'abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя'
>>> alm.clear()
>>> alm.getAlphabet()
'abcdefghijklmnopqrstuvwxyz'

Methods:

  • setUnknown - Method set unknown word

Example:

>>> import alm
>>> alm.setUnknown("word")

Methods:

  • getUnknown - Method extraction unknown word

Example:

>>> import alm
>>> alm.setUnknown("word")
>>> alm.getUnknown()
'word'

Methods:

  • sentences - Sentences generation method
  • readLM - Method for reading data from arpa file
  • sentencesToFile - Method for assembling a specified number of sentences and writing to a file

Example:

>>> import alm
>>> def sentencesFn(text):
...     print("Sentences:", text)
...     return True
...
>>> alm.setOption(alm.options_t.confidence)
>>> alm.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>> alm.readLM('./lm.arpa')
>>> alm.sentences(sentencesFn)
Sentences: <s> В общем </s>
Sentences: <s> С лязгом выкатился и остановился возле мальчика </s>
Sentences: <s> У меня нет </s>
Sentences: <s> Я вообще не хочу </s>
Sentences: <s> Да и в общем </s>
Sentences: <s> Не могу </s>
Sentences: <s> Ну в общем </s>
Sentences: <s> Так что я вообще не хочу </s>
Sentences: <s> Потому что я вообще не хочу </s>
Sentences: <s> Продолжение следует </s>
Sentences: <s> Неожиданно из подворотни в олега ударил яркий прожектор патрульный трактор </s>
>>> alm.sentencesToFile(5, "./result.txt")

Read binary file of language model

Example:

>>> import alm
>>> def sentencesFn(text):
...     print("Sentences:", text)
...     return True
...
>>> meta = {
... "aes": 128,
... "name": "Test",
... "author": "Test",
... "lictype": "MIT"
... }
>>> alm.setOption(alm.options_t.confidence)
>>> alm.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>> alm.readLM('./lm.alm', meta)
>>> alm.sentences(sentencesFn)
Sentences: <s> В общем </s>
Sentences: <s> С лязгом выкатился и остановился возле мальчика </s>
Sentences: <s> У меня нет </s>
Sentences: <s> Я вообще не хочу </s>
Sentences: <s> Да и в общем </s>
Sentences: <s> Не могу </s>
Sentences: <s> Ну в общем </s>
Sentences: <s> Так что я вообще не хочу </s>
Sentences: <s> Потому что я вообще не хочу </s>
Sentences: <s> Продолжение следует </s>
Sentences: <s> Неожиданно из подворотни в олега ударил яркий прожектор патрульный трактор </s>
>>> alm.sentencesToFile(5, "./result.txt")

Binary container metadata

{
	"aes": 128,
	"name": "Name dictionary",
	"author": "Name author",
	"lictype": "License type",
	"lictext": "License text",
	"contacts": "Contacts data",
	"password": "Password if needed",
	"copyright": "Copyright author"
}

Description:

  • aes - AES Encryption Size (128, 192, 256) bits
  • name - The dictionary name
  • author - The author of the dictionary
  • lictype - The license type
  • lictext - The license text
  • contacts - The author contact info
  • password - An encryption password (if required), encryption is performed only when setting a password
  • copyright - Copyright of the dictionary owner

Methods:

  • findNgram - N-gram search method in text

Example:

>>> import alm
>>> def callbackFn(text):
...     print(text)
... 
>>> alm.setOption(alm.options_t.confidence)
>>> alm.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>> alm.readLM('./lm.arpa')
>>> alm.findNgram("Особое место занимает чудотворная икона Лобзание Христа Иудою", callbackFn)
<s> Особое
Особое место
место занимает
занимает чудотворная
чудотворная икона
икона Лобзание
Лобзание Христа
Христа Иудою
Иудою </s>


>>>

Methods:

  • setOption - Method for set module options

Example:

>>> import alm
>>> alm.setOption(alm.options_t.debug)
>>> alm.setOption(alm.options_t.mixdicts)
>>> alm.setOption(alm.options_t.onlyGood)
>>> alm.setOption(alm.options_t.confidence)

Methods:

  • unsetOption - Disable module option method

Example:

>>> import alm
>>> alm.unsetOption(alm.options_t.debug)
>>> alm.unsetOption(alm.options_t.mixdicts)
>>> alm.unsetOption(alm.options_t.onlyGood)
>>> alm.unsetOption(alm.options_t.confidence)

Description

Name Description
debug Flag debug mode
mixdicts Flag allowing the use of words consisting of mixed dictionaries
onlyGood Flag allowing to consider words from the white list only
confidence Flag arpa file loading without pre-processing the words

Methods:

  • size - Method of obtaining the size of the N-gram

Example:

>>> import alm
>>> alm.setOption(alm.options_t.confidence)
>>> alm.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>> alm.readLM('./lm.arpa')
>>> alm.size()
3

Methods:

  • textToJson - Method to convert text to JSON
  • isAllowApostrophe - Apostrophe permission check method
  • switchAllowApostrophe - Method for permitting or denying an apostrophe as part of a word

Example:

>>> import alm
>>> def callbackFn(text):
...     print(text)
... 
>>> alm.isAllowApostrophe()
False
>>> alm.switchAllowApostrophe()
>>> alm.isAllowApostrophe()
True
>>> alm.textToJson("«On nous dit qu'aujourd'hui c'est le cas, encore faudra-t-il l'évaluer» l'astronomie", callbackFn)
[["«","On","nous","dit","qu'aujourd'hui","c'est","le","cas",",","encore","faudra-t-il","l'évaluer","»","l'astronomie"]]

Methods:

  • jsonToText - Method to convert JSON to text

Example:

>>> import alm
>>> def callbackFn(text):
...     print(text)
... 
>>> alm.jsonToText('[["«","On","nous","dit","qu\'aujourd\'hui","c\'est","le","cas",",","encore","faudra-t-il","l\'évaluer","»","l\'astronomie"]]', callbackFn)
«On nous dit qu'aujourd'hui c'est le cas, encore faudra-t-il l'évaluer» l'astronomie

Methods:

  • restore - Method for restore text from context

Example:

>>> import alm
>>> alm.restore(["«","On","nous","dit","qu\'aujourd\'hui","c\'est","le","cas",",","encore","faudra-t-il","l\'évaluer","»","l\'astronomie"])
"«On nous dit qu'aujourd'hui c'est le cas, encore faudra-t-il l'évaluer» l'astronomie"

Methods:

  • addBadword - Method add bad word
  • setBadwords - Method set words to blacklist
  • getBadwords - Method get words in blacklist

Example:

>>> import alm
>>> alm.setBadwords(["hello", "world", "test"])
>>> alm.getBadwords()
{24227504, 1219922507, 1794085167}
>>> alm.addBadword("test2")
>>> alm.getBadwords()
{24227504, 5035487504, 1219922507, 1794085167}

Example:

>>> import alm
>>> alm.setBadwords({24227504, 1219922507, 1794085167})
>>> alm.getBadwords()
{24227504, 1219922507, 1794085167}

Methods:

  • addGoodword - Method add good word
  • setGoodwords - Method set words to whitelist
  • getGoodwords - Method get words in whitelist

Example:

>>> import alm
>>> alm.setGoodwords(["hello", "world", "test"])
>>> alm.getGoodwords()
{24227504, 1219922507, 1794085167}
>>> alm.addGoodword("test2")
>>> alm.getGoodwords()
{24227504, 5035487504, 1219922507, 1794085167}

Example:

>>> import alm
>>> alm.setGoodwords({24227504, 1219922507, 1794085167})
>>> alm.getGoodwords()
{24227504, 1219922507, 1794085167}

Methods:

  • setUserToken - Method for adding user token
  • getUserTokens - User token list retrieval method
  • getUserTokenId - Method for obtaining user token identifier
  • getUserTokenWord - Method for obtaining a custom token by its identifier

Example:

>>> import alm
>>> alm.setUserToken("usa")
>>> alm.setUserToken("russia")
>>> alm.getUserTokenId("usa")
4188610529
>>> alm.getUserTokenId("russia")
47207634939
>>> alm.getUserTokens()
['usa', 'russia']
>>> alm.getUserTokenWord(4188610529)
'usa'
>>> alm.getUserTokenWord(47207634939)
'russia'

Methods:

  • setUserTokenMethod - Method for set a custom token processing function

Example:

>>> import alm
>>> def fn(token, word):
...     if token and (token == "<usa>"):
...         if word and (word.lower() == "usa"):
...             return True
...     elif token and (token == "<russia>"):
...         if word and (word.lower() == "russia"):
...             return True
...     return False
... 
>>> alm.setUserToken("usa")
>>> alm.setUserToken("russia")
>>> alm.setUserTokenMethod("usa", fn)
>>> alm.setUserTokenMethod("russia", fn)
>>> alm.idw("usa")
346562990
>>> alm.idw("russia")
3602214519
>>> alm.getUserTokenWord(346562990)
'usa'
>>> alm.getUserTokenWord(3602214519)
'russia'

Methods:

  • setWordPreprocessingMethod - Method for set the word preprocessing function

Example:

>>> import alm
>>> def run(word, context):
...     if word == "возле": word = "около"
...     return word
... 
>>> alm.setOption(alm.options_t.debug)
>>> alm.setOption(alm.options_t.confidence)
>>> alm.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>> alm.readLM('./lm.arpa')
>>> alm.setWordPreprocessingMethod(run)
>>> a = alm.perplexity("неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор???с лязгом выкатился и остановился возле мальчика....")
info: <s> Неожиданно из подворотни в олега ударил яркий прожектор патрульный трактор <punct> <punct> <punct> </s>

info: p( неожиданно | <s> ) 	= [2gram] 0.00250617 [ -2.60098900 ] / 0.99999999
info: p( из | неожиданно ...) 	= [3gram] 0.84584931 [ -0.07270700 ] / 1.00000081
info: p( подворотни | из ...) 	= [3gram] 0.73518561 [ -0.13360300 ] / 0.99999924
info: p( в | подворотни ...) 	= [3gram] 0.93193581 [ -0.03061400 ] / 0.99999960
info: p( олега | в ...) 	= [3gram] 0.72047846 [ -0.14237900 ] / 1.00000026
info: p( ударил | олега ...) 	= [3gram] 0.89971301 [ -0.04589600 ] / 1.00000043
info: p( яркий | ударил ...) 	= [3gram] 0.92987592 [ -0.03157500 ] / 0.99999918
info: p( прожектор | яркий ...) 	= [3gram] 0.92987592 [ -0.03157500 ] / 0.99999918
info: p( патрульный | прожектор ...) 	= [3gram] 0.92987592 [ -0.03157500 ] / 0.99999918
info: p( трактор | патрульный ...) 	= [3gram] 0.92987592 [ -0.03157500 ] / 0.99999918
info: p( <punct> | трактор ...) 	= [OOV] 0.00000000 [ -inf ] / 0.99999999
info: p( <punct> | <punct> ...) 	= [OOV] 0.00000000 [ -inf ] / 1.00000011
info: p( <punct> | <punct> ...) 	= [OOV] 0.00000000 [ -inf ] / 1.00000011
info: p( </s> | <punct> ...) 	= [1gram] 0.07816800 [ -1.10697100 ] / 1.00000011

info: 1 sentences, 13 words, 0 OOVs
info: 3 zeroprobs, logprob= -4.25945900 ppl= 2.01487019 ppl1= 2.12642805

info: <s> С лязгом выкатился и остановился около мальчика <punct> <punct> <punct> <punct> </s>

info: p( с | <s> ) 	= [2gram] 0.01301973 [ -1.88539800 ] / 0.99999999
info: p( лязгом | с ...) 	= [3gram] 0.21850984 [ -0.66052900 ] / 1.00000061
info: p( выкатился | лязгом ...) 	= [3gram] 0.92987592 [ -0.03157500 ] / 0.99999918
info: p( и | выкатился ...) 	= [3gram] 0.93211608 [ -0.03053000 ] / 0.99999926
info: p( остановился | и ...) 	= [3gram] 0.72065433 [ -0.14227300 ] / 0.99999975
info: p( около | остановился ...) 	= [1gram] 0.00003415 [ -4.46662200 ] / 1.00000027
info: p( мальчика | около ...) 	= [1gram] 0.00023364 [ -3.63146100 ] / 0.99999938
info: p( <punct> | мальчика ...) 	= [OOV] 0.00000000 [ -inf ] / 0.99999965
info: p( <punct> | <punct> ...) 	= [OOV] 0.00000000 [ -inf ] / 1.00000011
info: p( <punct> | <punct> ...) 	= [OOV] 0.00000000 [ -inf ] / 1.00000011
info: p( <punct> | <punct> ...) 	= [OOV] 0.00000000 [ -inf ] / 1.00000011
info: p( </s> | <punct> ...) 	= [1gram] 0.07816800 [ -1.10697100 ] / 1.00000011

info: 1 sentences, 11 words, 0 OOVs
info: 4 zeroprobs, logprob= -11.95535900 ppl= 9.91470774 ppl1= 12.21380039
>>> print(a.logprob)
-16.214818

Methods:

  • setLogfile - Method of set the file for log output
  • setOOvFile - Method set file for saving OOVs words

Example:

>>> import alm
>>> alm.setLogfile("./log.txt")
>>> alm.setOOvFile("./oov.txt")

Methods:

  • perplexity - Perplexity calculation
  • pplConcatenate - Method of combining perplexia
  • pplByFiles - Method for reading perplexity calculation by file or group of files

Example:

>>> import alm
>>> alm.setOption(alm.options_t.confidence)
>>> alm.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>> alm.readLM('./lm.arpa')
>>> a = alm.perplexity("неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор???с лязгом выкатился и остановился возле мальчика....")
>>> print(a.logprob)
-8.238353
>>> print(a.oovs)
0
>>> print(a.words)
24
>>> print(a.sentences)
2
>>> print(a.zeroprobs)
7
>>> print(a.ppl)
2.135669866658319
>>> print(a.ppl1)
2.204269585673276
>>> b = alm.pplByFiles("./text.txt")
>>> c = alm.pplConcatenate(a, b)
>>> print(c.ppl)
7.384123548831112

Description

Name Description
ppl The meaning of perplexity without considering the beginning of the sentence
ppl1 The meaning of perplexion taking into account the beginning of the sentence
oovs Count of oov words
words Count of words in sentence
logprob Word sequence frequency
sentences Count of sequences
zeroprobs Count of zero probs

Methods:

  • tokenization - Method for breaking text into tokens

Example:

>>> import alm
>>> def tokensFn(word, context, reset, stop):
...     print(word, " => ", context)
...     return True
...
>>> alm.switchAllowApostrophe()
>>> alm.tokenization("«On nous dit qu'aujourd'hui c'est le cas, encore faudra-t-il l'évaluer» l'astronomie", tokensFn)
«  =>  []
On  =>  ['«']
nous  =>  ['«', 'On']
dit  =>  ['«', 'On', 'nous']
qu'aujourd'hui  =>  ['«', 'On', 'nous', 'dit']
c'est  =>  ['«', 'On', 'nous', 'dit', "qu'aujourd'hui"]
le  =>  ['«', 'On', 'nous', 'dit', "qu'aujourd'hui", "c'est"]
cas  =>  ['«', 'On', 'nous', 'dit', "qu'aujourd'hui", "c'est", 'le']
,  =>  ['«', 'On', 'nous', 'dit', "qu'aujourd'hui", "c'est", 'le', 'cas']
encore  =>  ['«', 'On', 'nous', 'dit', "qu'aujourd'hui", "c'est", 'le', 'cas', ',']
faudra-t-il  =>  ['«', 'On', 'nous', 'dit', "qu'aujourd'hui", "c'est", 'le', 'cas', ',', 'encore']
l'évaluer  =>  ['«', 'On', 'nous', 'dit', "qu'aujourd'hui", "c'est", 'le', 'cas', ',', 'encore', 'faudra-t-il']
»  =>  ['«', 'On', 'nous', 'dit', "qu'aujourd'hui", "c'est", 'le', 'cas', ',', 'encore', 'faudra-t-il', "l'évaluer"]
l'astronomie  =>  ['«', 'On', 'nous', 'dit', "qu'aujourd'hui", "c'est", 'le', 'cas', ',', 'encore', 'faudra-t-il', "l'évaluer", '»']

Methods:

  • setTokenizerFn - Method for set the function of an external tokenizer

Example:

>>> import alm
>>> def tokenizerFn(text, callback):
...     word = ""
...     context = []
...     for letter in text:
...         if letter == " " and len(word) > 0:
...             if not callback(word, context, False, False): return
...             context.append(word)
...             word = ""
...         elif letter == "." or letter == "!" or letter == "?":
...             if not callback(word, context, True, False): return
...             word = ""
...             context = []
...         else:
...             word += letter
...     if len(word) > 0:
...         if not callback(word, context, False, True): return
...
>>> def tokensFn(word, context, reset, stop):
...     print(word, " => ", context)
...     return True
...
>>> alm.setTokenizerFn(tokenizerFn)
>>> alm.tokenization("Hello World today!", tokensFn)
Hello  =>  []
World  =>  ['Hello']
today  =>  ['Hello', 'World']

Methods:

  • fixUppers - Method for correcting registers in the text
  • fixUppersByFiles - Method for correcting text registers in a text file

Example:

>>> import alm
>>> alm.setOption(alm.options_t.confidence)
>>> alm.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>> alm.readLM('./lm.arpa')
>>> alm.fixUppers("неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор???с лязгом выкатился и остановился возле мальчика....")
'Неожиданно из подворотни в олега ударил яркий прожектор патрульный трактор??? С лязгом выкатился и остановился возле мальчика....'
>>> alm.fixUppersByFiles("./corpus", "./result.txt", "txt")

Methods:

  • checkHypLat - Hyphen and latin character search method

Example:

>>> import alm
>>> alm.checkHypLat("Hello-World")
(True, True)
>>> alm.checkHypLat("Hello")
(False, True)
>>> alm.checkHypLat("Привет")
(False, False)
>>> alm.checkHypLat("так-как")
(True, False)

Methods:

  • getUppers - Method for extracting registers for each word

Example:

>>> import alm
>>> alm.setOption(alm.options_t.confidence)
>>> alm.readLM('./lm.arpa')
>>> alm.idw("Living")
48384019276
>>> alm.idw("in")
2833
>>> alm.idw("the")
175734
>>> alm.idw("USA")
147770
>>> alm.getUppers([48384019276, 2833, 175734, 147770])
[1, 0, 0, 7]

Methods:

  • urls - Method for extracting URL address coordinates in a string

Example:

>>> import alm
>>> alm.urls("This website: example.com was designed with ...")
{14: 25}
>>> alm.urls("This website: https://a.b.c.example.net?id=52#test-1 was designed with ...")
{14: 52}
>>> alm.urls("This website: https://a.b.c.example.net?id=52#test-1 and 127.0.0.1 was designed with ...")
{14: 52, 57: 66}

Methods:

  • roman2Arabic - Method for translating Roman numerals to Arabic

Example:

>>> import alm
>>> alm.roman2Arabic("XVI")
16

Methods:

  • rest - Method for correction and detection of words with mixed alphabets
  • setSubstitutes - Method for set letters to correct words from mixed alphabets
  • getSubstitutes - Method of extracting letters to correct words from mixed alphabets

Example:

>>> import alm
>>> alm.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>> alm.setSubstitutes({'p':'р','c':'с','o':'о','t':'т','k':'к','e':'е','a':'а','h':'н','x':'х','b':'в','m':'м'})
>>> alm.getSubstitutes()
{'a': 'а', 'b': 'в', 'c': 'с', 'e': 'е', 'h': 'н', 'k': 'к', 'm': 'м', 'o': 'о', 'p': 'р', 't': 'т', 'x': 'х'}
>>> str = "ПPИBETИК"
>>> str.lower()
'пpиbetик'
>>> alm.rest(str)
'приветик'

Methods:

  • setTokensDisable - Method for set the list of forbidden tokens
  • setTokensUnknown - Method for set the list of tokens cast to 〈unk〉
  • setTokenDisable - Method for set the list of unidentifiable tokens
  • setTokenUnknown - Method of set the list of tokens that need to be identified as 〈unk〉
  • getTokensDisable - Method for retrieving the list of forbidden tokens
  • getTokensUnknown - Method for extracting a list of tokens reducible to 〈unk〉
  • setAllTokenDisable - Method for set all tokens as unidentifiable
  • setAllTokenUnknown - The method of set all tokens identified as 〈unk〉

Example:

>>> import alm
>>> alm.idw("<date>")
6
>>> alm.idw("<time>")
7
>>> alm.idw("<abbr>")
5
>>> alm.idw("<math>")
9
>>> alm.setTokenDisable("date|time|abbr|math")
>>> alm.getTokensDisable()
{9, 5, 6, 7}
>>> alm.setTokensDisable({6, 7, 5, 9})
>>> alm.setTokenUnknown("date|time|abbr|math")
>>> alm.getTokensUnknown()
{9, 5, 6, 7}
>>> alm.setTokensUnknown({6, 7, 5, 9})
>>> alm.setAllTokenDisable()
>>> alm.getTokensDisable()
{2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18}
>>> alm.setAllTokenUnknown()
>>> alm.getTokensUnknown()
{2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18}

Methods:

  • countAlphabet - Method of obtaining the number of letters in the dictionary

Example:

>>> import alm
>>> alm.getAlphabet()
'abcdefghijklmnopqrstuvwxyz'
>>> alm.countAlphabet()
26
>>> alm.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>> alm.countAlphabet()
59

Methods:

  • countBigrams - Method get count bigrams
  • countTrigrams - Method get count trigrams
  • countGrams - Method get count N-gram by lm size

Example:

>>> import alm
>>> alm.setOption(alm.options_t.confidence)
>>> alm.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>> alm.readLM('./lm.arpa')
>>> alm.countBigrams("неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор???с лязгом выкатился и остановился возле мальчика....")
12
>>> alm.countTrigrams("неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор???с лязгом выкатился и остановился возле мальчика....")
10
>>> alm.size()
3
>>> alm.countGrams("неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор???с лязгом выкатился и остановился возле мальчика....")
10
>>> alm.idw("неожиданно")
30444893210
>>> alm.idw("из")
4645
>>> alm.idw("подворотни")
7494072262
>>> alm.idw("в")
48
>>> alm.idw("Олега")
2431694341
>>> alm.idw("ударил")
54100711961
>>> alm.countBigrams([30444893210, 4645, 7494072262, 48, 2431694341, 54100711961])
5
>>> alm.countTrigrams([30444893210, 4645, 7494072262, 48, 2431694341, 54100711961])
4
>>> alm.countGrams([30444893210, 4645, 7494072262, 48, 2431694341, 54100711961])
4

Methods:

  • arabic2Roman - Convert arabic number to roman number

Example:

>>> import alm
>>> alm.arabic2Roman(23)
'XXIII'
>>> alm.arabic2Roman("33")
'XXXIII'

Methods:

  • setLocale - Method set locale (Default: en_US.UTF-8)

Example:

>>> import alm
>>> alm.setLocale("ru_RU.UTF-8")

Methods:

  • setThreads - Method for set the number of threads (0 - all threads)

Example:

>>> import alm
>>> alm.setOption(alm.options_t.confidence)
>>> alm.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>> alm.readLM('./lm.arpa')
>>> alm.setThreads(3)
>>> a = alm.pplByFiles("./text.txt")
>>> print(a.logprob)
-48201.29481399994

Methods:

  • fti - Method for removing the fractional part of a number

Example:

>>> import alm
>>> alm.fti(5892.4892)
5892489200000
>>> alm.fti(5892.4892, 4)
58924892

Methods:

  • context - Method for assembling text context from a sequence

Example:

>>> import alm
>>> alm.setOption(alm.options_t.confidence)
>>> alm.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>> alm.readLM('./lm.arpa')
>>> alm.idw("неожиданно")
30444893210
>>> alm.idw("из")
4645
>>> alm.idw("подворотни")
7494072262
>>> alm.idw("в")
48
>>> alm.idw("Олега")
2431694341
>>> alm.idw("ударил")
54100711961
>>> alm.context([30444893210, 4645, 7494072262, 48, 2431694341, 54100711961])
'Неожиданно из подворотни в олега ударил'

Methods:

  • findByFiles - Method search N-grams in a text file

Example:

>>> import alm
>>> alm.setOption(alm.options_t.debug)
>>> alm.setOption(alm.options_t.confidence)
>>> alm.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>> alm.readLM('./lm.arpa')
>>> alm.findByFiles("./text.txt", "./result.txt")
info: <s> Кукай
сари кукай
сари японские
японские каллиграфы
каллиграфы я
я постоянно
постоянно навещал
навещал их
их тайно
тайно от
от людей
людей </s>


info: <s> Неожиданно из
Неожиданно из подворотни
из подворотни в
подворотни в Олега
в Олега ударил
Олега ударил яркий
ударил яркий прожектор
яркий прожектор патрульный
прожектор патрульный трактор
патрульный трактор

<s> С лязгом
С лязгом выкатился
лязгом выкатился и
выкатился и остановился
и остановился возле
остановился возле мальчика
возле мальчика

Methods:

  • checkSequence - Sequence Existence Method
  • checkByFiles - Method for checking if a sequence exists in a text file

Example:

>>> import alm
>>> alm.setOption(alm.options_t.debug)
>>> alm.setOption(alm.options_t.confidence)
>>> alm.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>> alm.readLM('./lm.arpa')
>>> alm.checkSequence("Неожиданно из подворотни в олега ударил")
(True, 0)
>>> alm.checkSequence("<s> Сегодня сыграл и в Олега ударил яркий прожектор патрульный трактор с корпоративным сектором </s>")
(True, 0)
>>> alm.checkSequence("<s> Сегодня сыграл и в Олега ударил яркий прожектор патрульный трактор с корпоративным сектором </s>", True)
(False, 0)
>>> alm.checkSequence("<s> в Олега ударил яркий </s>")
(True, 0)
>>> alm.checkSequence("<s> в Олега ударил яркий </s>", True)
(True, 0)
>>> alm.checkSequence("от госсекретаря США")
(True, 7)
>>> alm.checkSequence("от госсекретаря США", True)
(False, 0)
>>> alm.idw("от")
5586
>>> alm.idw("госсекретаря")
10074609004
>>> alm.idw("США")
338449
>>> alm.checkSequence([5586, 10074609004, 338449])
(True, 7)
>>> alm.checkSequence([5586, 10074609004, 338449], True)
(False, 0)
>>> alm.checkSequence(["от", "госсекретаря", "США"])
(True, 7)
>>> alm.checkSequence(["от", "госсекретаря", "США"], True)
(False, 0)
>>> alm.checkByFiles("./text.txt", "./result.txt")
info: 1999 | YES | Какой-то период времени мы вообще не общались

info: 2000 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор.С лязгом выкатился и остановился возле мальчика.

info: 2001 | YES | Так как эти яйца жалко есть а хочется все больше любоваться их можно покрыть лаком даже прозрачным лаком для ногтей

info: 2002 | NO | кукай <unk> <unk> сари кукай <unk> <unk> сари японские каллиграфы я постоянно навещал их тайно от людей

info: 2003 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор???С лязгом выкатился и остановился возле мальчика....

info: 2004 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор?С лязгом выкатился и остановился возле мальчика.

info: 2005 | YES | Сегодня яичницей никто не завтракал как впрочем и вчера на ближайшем к нам рынке мы ели фруктовый салат со свежевыжатым соком как в старые добрые времена в Бразилии

info: 2006 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор!С лязгом выкатился и остановился возле мальчика.

info: 2007 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор.с лязгом выкатился и остановился возле мальчика.

All texts: 2007
Exists texts: 1359
Not exists texts: 648
>>> alm.checkByFiles("./corpus", "./result.txt", False, "txt")
info: 1999 | YES | Какой-то период времени мы вообще не общались

info: 2000 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор.С лязгом выкатился и остановился возле мальчика.

info: 2001 | YES | Так как эти яйца жалко есть а хочется все больше любоваться их можно покрыть лаком даже прозрачным лаком для ногтей

info: 2002 | NO | кукай <unk> <unk> сари кукай <unk> <unk> сари японские каллиграфы я постоянно навещал их тайно от людей

info: 2003 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор???С лязгом выкатился и остановился возле мальчика....

info: 2004 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор?С лязгом выкатился и остановился возле мальчика.

info: 2005 | YES | Сегодня яичницей никто не завтракал как впрочем и вчера на ближайшем к нам рынке мы ели фруктовый салат со свежевыжатым соком как в старые добрые времена в Бразилии

info: 2006 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор!С лязгом выкатился и остановился возле мальчика.

info: 2007 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор.с лязгом выкатился и остановился возле мальчика.

All texts: 2007
Exists texts: 1359
Not exists texts: 648
>>> alm.checkByFiles("./corpus", "./result.txt", True, "txt")
info: 2000 | NO | Так как эти яйца жалко есть а хочется все больше любоваться их можно покрыть лаком даже прозрачным лаком для ногтей

info: 2001 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор.С лязгом выкатился и остановился возле мальчика.

info: 2002 | NO | Сегодня яичницей никто не завтракал как впрочем и вчера на ближайшем к нам рынке мы ели фруктовый салат со свежевыжатым соком как в старые добрые времена в Бразилии

info: 2003 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор!С лязгом выкатился и остановился возле мальчика.

info: 2004 | NO | кукай <unk> <unk> сари кукай <unk> <unk> сари японские каллиграфы я постоянно навещал их тайно от людей

info: 2005 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор?С лязгом выкатился и остановился возле мальчика.

info: 2006 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор???С лязгом выкатился и остановился возле мальчика....

info: 2007 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор.с лязгом выкатился и остановился возле мальчика.

All texts: 2007
Exists texts: 0
Not exists texts: 2007

Methods:

  • check - String Check Method
  • match - String Matching Method
  • setAbbr - Method set abbreviation

Example:

>>> import alm
>>> alm.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>> alm.setSubstitutes({'p':'р','c':'с','o':'о','t':'т','k':'к','e':'е','a':'а','h':'н','x':'х','b':'в','m':'м'})
>>> alm.check("Дом-2", alm.check_t.home2)
True
>>> alm.check("Дом2", alm.check_t.home2)
False
>>> alm.check("Дом-2", alm.check_t.latian)
False
>>> alm.check("Hello", alm.check_t.latian)
True
>>> alm.check("прiвет", alm.check_t.latian)
True
>>> alm.check("Дом-2", alm.check_t.hyphen)
True
>>> alm.check("Дом2", alm.check_t.hyphen)
False
>>> alm.check("Д", alm.check_t.letter)
True
>>> alm.check("$", alm.check_t.letter)
False
>>> alm.check("-", alm.check_t.letter)
False
>>> alm.check("просtоквaшино", alm.check_t.similars)
True
>>> alm.match("my site http://example.ru, it's true", alm.match_t.url)
True
>>> alm.match("по вашему ip адресу 46.40.123.12 проводится проверка", alm.match_t.url)
True
>>> alm.match("мой адрес в формате IPv6: http://[2001:0db8:11a3:09d7:1f34:8a2e:07a0:765d]/", alm.match_t.url)
True
>>> alm.match("13-я", alm.match_t.abbr)
True
alm.match("13-я-й", alm.match_t.abbr)
False
alm.match("т.д", alm.match_t.abbr)
True
alm.match("т.п.", alm.match_t.abbr)
True
>>> alm.match("С.Ш.А.", alm.match_t.abbr)
True
>>> alm.setAbbr("сша")
>>> alm.match("США", alm.match_t.abbr)
True
>>> alm.match("Hello", alm.match_t.latian)
True
>>> alm.match("прiвет", alm.match_t.latian)
False
>>> alm.match("23424", alm.match_t.number)
True
>>> alm.match("hello", alm.match_t.number)
False
>>> alm.match("23424.55", alm.match_t.number)
False
>>> alm.match("23424", alm.match_t.decimal)
False
>>> alm.match("23424.55", alm.match_t.decimal)
True
>>> alm.match("23424,55", alm.match_t.decimal)
True
>>> alm.match("-23424.55", alm.match_t.decimal)
True
>>> alm.match("+23424.55", alm.match_t.decimal)
True
>>> alm.match("+23424.55", alm.match_t.anumber)
True
>>> alm.match("15T-34", alm.match_t.anumber)
True
>>> alm.match("hello", alm.match_t.anumber)
False
>>> alm.match("hello", alm.match_t.allowed)
True
>>> alm.match("évaluer", alm.match_t.allowed)
False
>>> alm.match("13", alm.match_t.allowed)
True
>>> alm.match("Hello-World", alm.match_t.allowed)
True
>>> alm.match("Hello", alm.match_t.math)
False
>>> alm.match("+", alm.match_t.math)
True
>>> alm.match("=", alm.match_t.math)
True
>>> alm.match("Hello", alm.match_t.upper)
True
>>> alm.match("hello", alm.match_t.upper)
False
>>> alm.match("hellO", alm.match_t.upper)
False
>>> alm.match("a", alm.match_t.punct)
False
>>> alm.match(",", alm.match_t.punct)
True
>>> alm.match(" ", alm.match_t.space)
True
>>> alm.match("a", alm.match_t.space)
False
>>> alm.match("a", alm.match_t.special)
False
>>> alm.match("±", alm.match_t.special)
True
>>> alm.match("[", alm.match_t.isolation)
True
>>> alm.match("a", alm.match_t.isolation)
False

Methods:

  • delInText - Method for delete letter in text

Example:

>>> import alm
>>> alm.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>> alm.delInText("неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор??? с лязгом выкатился и остановился возле мальчика....", alm.wdel_t.punct)
'неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор с лязгом выкатился и остановился возле мальчика'
>>> alm.delInText("hello-world-hello-world", alm.wdel_t.hyphen)
'helloworldhelloworld'
>>> alm.delInText("неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор??? с лязгом выкатился и остановился возле мальчика....", alm.wdel_t.broken)
'неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор с лязгом выкатился и остановился возле мальчика'
>>> alm.delInText("«On nous dit qu'aujourd'hui c'est le cas, encore faudra-t-il l'évaluer» l'astronomie", alm.wdel_t.broken)
'On nous dit quaujourdhui cest le cas encore faudra-t-il lvaluer lastronomie'

Methods:

  • countsByFiles - Method for counting the number of n-grams in a text file

Example:

>>> import alm
>>> alm.setOption(alm.options_t.debug)
>>> alm.setOption(alm.options_t.confidence)
>>> alm.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>> alm.readLM('./lm.arpa')
>>> alm.countsByFiles("./text.txt", "./result.txt", 3)
info: 0 | Сегодня яичницей никто не завтракал как впрочем и вчера на ближайшем к нам рынке мы ели фруктовый салат со свежевыжатым соком как в старые добрые времена в Бразилии

info: 10 | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор?С лязгом выкатился и остановился возле мальчика.

info: 10 | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор!С лязгом выкатился и остановился возле мальчика.

info: 0 | Так как эти яйца жалко есть а хочется все больше любоваться их можно покрыть лаком даже прозрачным лаком для ногтей

info: 10 | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор???С лязгом выкатился и остановился возле мальчика....

Counts 3grams: 471
>>> alm.countsByFiles("./corpus", "./result.txt", 2, "txt")
info: 19 | Так как эти яйца жалко есть а хочется все больше любоваться их можно покрыть лаком даже прозрачным лаком для ногтей

info: 12 | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор.с лязгом выкатился и остановился возле мальчика.

info: 12 | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор!С лязгом выкатился и остановился возле мальчика.

info: 10 | кукай <unk> <unk> сари кукай <unk> <unk> сари японские каллиграфы я постоянно навещал их тайно от людей

info: 12 | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор???С лязгом выкатился и остановился возле мальчика....

info: 12 | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор?С лязгом выкатился и остановился возле мальчика.

info: 27 | Сегодня яичницей никто не завтракал как впрочем и вчера на ближайшем к нам рынке мы ели фруктовый салат со свежевыжатым соком как в старые добрые времена в Бразилии

Counts 2grams: 20270

Description

N-gram size Description
1 language model size
2 bigram
3 trigram

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Files for anyks-lm, version 3.0.4
Filename, size File type Python version Upload date Hashes
Filename, size anyks_lm-3.0.4-cp37-cp37m-macosx_10_15_x86_64.whl (906.7 kB) File type Wheel Python version cp37 Upload date Hashes View
Filename, size anyks-lm-3.0.4.tar.gz (382.8 kB) File type Source Python version None Upload date Hashes View

Supported by

Pingdom Pingdom Monitoring Google Google Object Storage and Download Analytics Sentry Sentry Error logging AWS AWS Cloud computing DataDog DataDog Monitoring Fastly Fastly CDN DigiCert DigiCert EV certificate StatusPage StatusPage Status page