ANYKS Language Model (ALM)
Project goals and features
The are many toolkits capable of creating language models: (KenLM, SriLM, IRSTLM), and each of those toolkits may have a reason to exist. But our language model creation toolkit has the following goals and features:
- UTF-8 support: Full UTF-8 support without third-party dependencies.
- Support of many data formats: ARPA, Vocab, Map Sequence, N-grams, Binary alm dictionary.
- Smoothing algorithms: Kneser-Nay, Modified Kneser-Nay, Witten-Bell, Additive, Good-Turing, Absolute discounting.
- Normalisation and preprocessing for corpora: Transferring corpus to lowercase, smart tokenization, ability to create black - and white - lists for n-grams.
- ARPA modification: Frequencies and n-grams replacing, adding new n-grams with frequencies, removing n-grams.
- Pruning: N-gram removal based on specified criteria.
- Removal of low-probability n-grams: Removal of n-grams which backoff probability is higher than standard probability.
- ARPA recovery: Recovery of damaged n-grams in ARPA with subsequent recalculation of their backoff probabilities.
- Support of additional word features: Feature extraction: (numbers, roman numbers, ranges of numbers, numeric abbreviations, any other custom attributes) using scripts written in Python3.
- Text preprocessing: Unlike all other language model toolkits, ALM can extract correct context from files with unnormalized texts.
- Unknown word token accounting: Accounting of 〈unk〉 token as full n-gram.
- Redefinition of 〈unk〉 token: Ability to redefine an attribute of an unknown token.
- N-grams preprocessing: Ability to pre-process n-grams before adding them to ARPA using custom Python3 scripts.
- Binary container for Language Models: The binary container supports compression, encryption and installation of copyrights.
- Convenient visualization of the Language model assembly process: ALM implements several types of visualizations: textual, graphic, process indicator, and logging to files or console.
- Collection of all n-grams: Unlike other language model toolkits, ALM is guaranteed to extract all possible n-grams from the corpus, regardless of their length (except for Modified Kneser-Nay); you can also force all n-grams to be taken into account even if they occured only once.
Requirements
Install PyBind11
$ python3 -m pip install pybind11
Description of Methods
Methods:
- idw - Word ID retrieval method
- idt - Token ID retrieval method
- ids - Sequence ID retrieval method
Example:
>>> import alm
>>>
>>> alm.idw("hello")
313191024
>>>
>>> alm.idw("<s>")
1
>>>
>>> alm.idw("</s>")
22
>>>
>>> alm.idw("<unk>")
3
>>>
>>> alm.idt("1424")
2
>>>
>>> alm.idt("hello")
0
>>>
>>> alm.idw("Living")
13268942501
>>>
>>> alm.idw("in")
2047
>>>
>>> alm.idw("the")
83201
>>>
>>> alm.idw("USA")
72549
>>>
>>> alm.ids([13268942501, 2047, 83201, 72549])
16314074810955466382
Description
Name |
Description |
〈s〉 |
Sentence beginning token |
〈/s〉 |
Sentence end token |
〈url〉 |
URL-address token |
〈num〉 |
Number (arabic or roman) token |
〈unk〉 |
Unknown word token |
〈time〉 |
Time token (15:44:56) |
〈score〉 |
Score count token (4:3 ¦ 01:04) |
〈fract〉 |
Fraction token (5/20 ¦ 192/864) |
〈date〉 |
Date token (18.07.2004 ¦ 07/18/2004) |
〈abbr〉 |
Abbreviation token (1-й ¦ 2-е ¦ 20-я ¦ p.s ¦ p.s.) |
〈dimen〉 |
Dimensions token (200x300 ¦ 1920x1080) |
〈range〉 |
Range of numbers token (1-2 ¦ 100-200 ¦ 300-400) |
〈aprox〉 |
Approximate number token (~93 ¦ 95.86 ¦ 1020) |
〈anum〉 |
Pseudo-number token (combination of numbers and other symbols) (T34 ¦ 895-M-86 ¦ 39km) |
〈pcards〉 |
Symbols of the play cards (♠ ¦ ♣ ¦ ♥ ¦ ♦ ) |
〈punct〉 |
Punctuation token (. ¦ , ¦ ? ¦ ! ¦ : ¦ ; ¦ … ¦ ¡ ¦ ¿) |
〈route〉 |
Direction symbols (arrows) (← ¦ ↑ ¦ ↓ ¦ ↔ ¦ ↵ ¦ ⇐ ¦ ⇑ ¦ ⇒ ¦ ⇓ ¦ ⇔ ¦ ◄ ¦ ▲ ¦ ► ¦ ▼) |
〈greek〉 |
Symbols of the Greek alphabet (Α ¦ Β ¦ Γ ¦ Δ ¦ Ε ¦ Ζ ¦ Η ¦ Θ ¦ Ι ¦ Κ ¦ Λ ¦ Μ ¦ Ν ¦ Ξ ¦ Ο ¦ Π ¦ Ρ ¦ Σ ¦ Τ ¦ Υ ¦ Φ ¦ Χ ¦ Ψ ¦ Ω) |
〈isolat〉 |
Isolation/quotation token (( ¦ ) ¦ [ ¦ ] ¦ { ¦ } ¦ " ¦ « ¦ » ¦ „ ¦ “ ¦ ` ¦ ⌈ ¦ ⌉ ¦ ⌊ ¦ ⌋ ¦ ‹ ¦ › ¦ ‚ ¦ ’ ¦ ′ ¦ ‛ ¦ ″ ¦ ‘ ¦ ” ¦ ‟ ¦ ' ¦〈 ¦ 〉) |
〈specl〉 |
Special character token (_ ¦ @ ¦ # ¦ № ¦ © ¦ ® ¦ & ¦ § ¦ æ ¦ ø ¦ Þ ¦ – ¦ ‾ ¦ ‑ ¦ — ¦ ¯ ¦ ¶ ¦ ˆ ¦ ˜ ¦ † ¦ ‡ ¦ • ¦ ‰ ¦ ⁄ ¦ ℑ ¦ ℘ ¦ ℜ ¦ ℵ ¦ ◊ ¦ \ ) |
〈currency〉 |
Symbols of world currencies ($ ¦ € ¦ ₽ ¦ ¢ ¦ £ ¦ ₤ ¦ ¤ ¦ ¥ ¦ ℳ ¦ ₣ ¦ ₴ ¦ ₸ ¦ ₹ ¦ ₩ ¦ ₦ ¦ ₭ ¦ ₪ ¦ ৳ ¦ ƒ ¦ ₨ ¦ ฿ ¦ ₫ ¦ ៛ ¦ ₮ ¦ ₱ ¦ ﷼ ¦ ₡ ¦ ₲ ¦ ؋ ¦ ₵ ¦ ₺ ¦ ₼ ¦ ₾ ¦ ₠ ¦ ₧ ¦ ₯ ¦ ₢ ¦ ₳ ¦ ₥ ¦ ₰ ¦ ₿ ¦ ұ) |
〈math〉 |
Mathematical operation token (+ ¦ - ¦ = ¦ / ¦ * ¦ ^ ¦ × ¦ ÷ ¦ − ¦ ∕ ¦ ∖ ¦ ∗ ¦ √ ¦ ∝ ¦ ∞ ¦ ∠ ¦ ± ¦ ¹ ¦ ² ¦ ³ ¦ ½ ¦ ⅓ ¦ ¼ ¦ ¾ ¦ % ¦ ~ ¦ · ¦ ⋅ ¦ ° ¦ º ¦ ¬ ¦ ƒ ¦ ∀ ¦ ∂ ¦ ∃ ¦ ∅ ¦ ∇ ¦ ∈ ¦ ∉ ¦ ∋ ¦ ∏ ¦ ∑ ¦ ∧ ¦ ∨ ¦ ∩ ¦ ∪ ¦ ∫ ¦ ∴ ¦ ∼ ¦ ≅ ¦ ≈ ¦ ≠ ¦ ≡ ¦ ≤ ¦ ≥ ¦ ª ¦ ⊂ ¦ ⊃ ¦ ⊄ ¦ ⊆ ¦ ⊇ ¦ ⊕ ¦ ⊗ ¦ ⊥ ¦ ¨) |
Methods:
- setZone - User zone set method
Example:
>>> import alm
>>>
>>> alm.setZone("com")
>>> alm.setZone("ru")
>>> alm.setZone("org")
>>> alm.setZone("net")
Methods:
- clear - Method clear all data
- setAlphabet - Method set alphabet
- getAlphabet - Method get alphabet
Example:
>>> import alm
>>>
>>> alm.getAlphabet()
'abcdefghijklmnopqrstuvwxyz'
>>>
>>> alm.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>>
>>> alm.getAlphabet()
'abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя'
>>>
>>> alm.clear()
>>>
>>> alm.getAlphabet()
'abcdefghijklmnopqrstuvwxyz'
Methods:
- setUnknown - Method set unknown word
- getUnknown - Method extraction unknown word
Example:
>>> import alm
>>>
>>> alm.setUnknown("word")
>>>
>>> alm.getUnknown()
'word'
Methods:
- info - Dictionary information output method
- init - Language Model Initialization Method signature: [smoothing = wittenBell, modified = False, prepares = False, mod = 0.0]
- token - Method for determining the type of the token words
- addText - Method of adding text for estimate
- collectCorpus - Training method of assembling the text data for ALM
- pruneVocab - Dictionary pruning method
- buildArpa - Method for build ARPA
- writeALM - Method for writing data from ARPA file to binary container
- writeWords - Method for writing these words to a file
- writeVocab - Method for writing dictionary data to a file
- writeNgrams - Method of writing data to NGRAMs files
- writeMap - Method of writing sequence map to file
- writeSuffix - Method for writing data to a suffix file for digital abbreviations
- writeAbbrs - Method for writing data to an abbreviation file
- getSuffixes - Method for extracting the list of suffixes of digital abbreviations
- writeArpa - Method of writing data to ARPA file
- setSize - Method for set size N-gram
- setLocale - Method set locale (Default: en_US.UTF-8)
- pruneArpa - Language model pruning method
- addWord - Method for add a word to the dictionary
- setThreads - Method for setting the number of threads used in work (0 - all available threads)
- setSubstitutes - Method for set letters to correct words from mixed alphabets
- addAbbr - Method add abbreviation
- setAbbrs - Method set abbreviations
- getAbbrs - Method for extracting the list of abbreviations
- addGoodword - Method add good word
- addBadword - Method add bad word
- readArpa - Method for reading an ARPA file, language model
- readVocab - Method of reading the dictionary
- setAdCw - Method for set dictionary characteristics (cw - count all words in dataset, ad - count all documents in dataset)
Description
Smoothing |
wittenBell |
addSmooth |
goodTuring |
constDiscount |
naturalDiscount |
kneserNey |
modKneserNey |
Example:
>>> import alm
>>>
>>> alm.info("./lm.alm")
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* Name: Test Language Model
* Encryption: AES128
* Alphabet: абвгдеёжзийклмнопрстуфхцчшщъыьэюяabcdefghijklmnopqrstuvwxyz
* Build date: 09/18/2020 21:52:00
* N-gram size: 3
* Words: 9373
* N-grams: 25021
* Author: Some name
* Contacts: site: https://example.com, e-mail: info@example.com
* Copyright ©: You company LLC
* License type: MIT
* License text:
... License text ...
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
>>>
Example:
>>> import alm
>>> import json
>>>
>>> alm.setSize(3)
>>> alm.setThreads(0)
>>> alm.setLocale("en_US.UTF-8")
>>> alm.setAlphabet("абвгдеёжзийклмнопрстуфхцчшщъыьэюяabcdefghijklmnopqrstuvwxyz")
>>> alm.setSubstitutes({'p':'р','c':'с','o':'о','t':'т','k':'к','e':'е','a':'а','h':'н','x':'х','b':'в','m':'м'})
>>>
>>> alm.setOption(alm.options_t.allowUnk)
>>> alm.setOption(alm.options_t.resetUnk)
>>> alm.setOption(alm.options_t.mixDicts)
>>> alm.setOption(alm.options_t.tokenWords)
>>> alm.setOption(alm.options_t.interpolate)
>>>
>>> alm.init(alm.smoothing_t.modKneserNey, True, True)
>>>
>>> p = alm.getParams()
>>> p.algorithm
4
>>> p.mod
0.0
>>> p.prepares
True
>>> p.modified
True
>>> alm.idw("Сбербанк")
13236490857
>>> alm.idw("Совкомбанк")
22287680895
>>>
>>> alm.token("Сбербанк")
'<unk>'
>>> alm.token("совкомбанк")
'<unk>'
>>>
>>> alm.setAbbrs({13236490857, 22287680895})
>>>
>>> alm.addAbbr("США")
>>> alm.addAbbr("Сбер")
>>>
>>> alm.token("Сбербанк")
'<abbr>'
>>> alm.token("совкомбанк")
'<abbr>'
>>>
>>> alm.token("сша")
'<abbr>'
>>> alm.token("СБЕР")
'<abbr>'
>>>
>>> alm.getAbbrs()
{13236490857, 189243, 22287680895, 26938511}
>>>
>>> alm.addGoodword("T-34")
>>> alm.addGoodword("АН-25")
>>>
>>> alm.addBadword("ийти")
>>> alm.addBadword("циган")
>>> alm.addBadword("апичатка")
>>>
>>> alm.addWord("министерство")
>>> alm.addWord("возмездие", 0, 1)
>>> alm.addWord("возражение", alm.idw("возражение"), 2)
>>>
>>> def status(text, status):
... print(text, status)
...
>>> def statusWriteALM(status):
... print("Write ALM", status)
...
>>> def statusWriteArpa(status):
... print("Write ARPA", status)
...
>>> def statusBuildArpa(status):
... print("Build ARPA", status)
...
>>> def statusPrune(status):
... print("Prune data", status)
...
>>> def statusWords(status):
... print("Write words", status)
...
>>> def statusVocab(status):
... print("Write vocab", status)
...
>>> def statusNgram(status):
... print("Write ngram", status)
...
>>> def statusMap(status):
... print("Write map", status)
...
>>> def statusSuffix(status):
... print("Write suffix", status)
...
>>> def statusAbbreviation(status):
... print("Write abbreviation", status)
...
>>> alm.addText("The future is now", 0)
>>>
>>> alm.collectCorpus("./correct.txt", status)
Read text corpora 0
Read text corpora 1
Read text corpora 2
Read text corpora 3
Read text corpora 4
Read text corpora 5
Read text corpora 6
...
>>> alm.pruneVocab(-15.0, 0, 0, statusPrune)
Prune data 0
Prune data 1
Prune data 2
Prune data 3
Prune data 4
Prune data 5
Prune data 6
...
>>> alm.pruneArpa(0.015, 3, statusPrune)
Prune data 0
Prune data 1
Prune data 2
Prune data 3
Prune data 4
Prune data 5
Prune data 6
...
>>> meta = {
... "aes": 128,
... "name": "Test Language Model",
... "author": "Some name",
... "lictype": "MIT",
... "password": "password",
... "copyright": "You company LLC",
... "lictext": "... License text ...",
... "contacts": "site: https://example.com, e-mail: info@example.com"
... }
>>>
>>> alm.writeALM("./lm.alm", json.dumps(meta), statusWriteALM)
Write ALM 0
Write ALM 0
Write ALM 0
Write ALM 0
Write ALM 0
Write ALM 0
...
>>> alm.writeWords("./words.txt", statusWords)
Write words 0
Write words 1
Write words 2
Write words 3
Write words 4
Write words 5
Write words 6
...
>>> alm.writeVocab("./lm.vocab", statusVocab)
Write vocab 0
Write vocab 1
Write vocab 2
Write vocab 3
Write vocab 4
Write vocab 5
Write vocab 6
...
>>> alm.writeNgrams("./lm.ngram", statusNgram)
Write ngram 0
Write ngram 1
Write ngram 2
Write ngram 3
Write ngram 4
Write ngram 5
Write ngram 6
...
>>> alm.writeMap("./lm.map", statusMap, "|")
Write map 0
Write map 1
Write map 2
Write map 3
Write map 4
Write map 5
Write map 6
...
>>> alm.writeSuffix("./suffix.txt", statusSuffix)
Write suffix 10
Write suffix 20
Write suffix 30
Write suffix 40
Write suffix 50
Write suffix 60
...
>>> alm.writeAbbrs("./words.abbr", statusAbbreviation)
Write abbreviation 25
Write abbreviation 50
Write abbreviation 75
Write abbreviation 100
...
>>> alm.getAbbrs()
{13236490857, 189243, 22287680895, 26938511}
>>>
>>> alm.getSuffixes()
{2633, 1662978425, 14279182218, 3468, 47, 28876661395, 29095464659, 2968, 57, 30}
>>>
>>> alm.buildArpa(statusBuildArpa)
Build ARPA 0
Build ARPA 1
Build ARPA 2
Build ARPA 3
Build ARPA 4
Build ARPA 5
Build ARPA 6
...
>>> alm.writeArpa("./lm.arpa", statusWriteArpa)
Write ARPA 0
Write ARPA 1
Write ARPA 2
Write ARPA 3
Write ARPA 4
Write ARPA 5
Write ARPA 6
...
Methods:
- setOption - Library options setting method
- unsetOption - Disable module option method
Example:
>>> import alm
>>>
>>> alm.unsetOption(alm.options_t.debug)
>>> alm.unsetOption(alm.options_t.mixDicts)
>>> alm.unsetOption(alm.options_t.onlyGood)
>>> alm.unsetOption(alm.options_t.confidence)
...
Description
Options |
Description |
debug |
Flag debug mode |
stress |
Flag allowing to stress in words |
uppers |
Flag that allows you to correct the case of letters |
onlyGood |
Flag allowing to consider words from the white list only |
mixDicts |
Flag allowing the use of words consisting of mixed dictionaries |
allowUnk |
Flag allowing to unknown word |
resetUnk |
Flag to reset the frequency of an unknown word |
allGrams |
Flag allowing accounting of all collected n-grams |
lowerCase |
Flag allowing to case-insensitive |
confidence |
Flag ARPA file loading without pre-processing the words |
tokenWords |
Flag that takes into account when assembling N-grams, only those tokens that match words |
interpolate |
Flag allowing to use interpolation in estimating |
Methods:
- readMap - Method for reading sequence map from file
Example:
>>> import alm
>>>
>>> alm.setLocale("en_US.UTF-8")
>>>
>>> alm.setAlphabet("абвгдеёжзийклмнопрстуфхцчшщъыьэюяabcdefghijklmnopqrstuvwxyz")
>>> alm.setSubstitutes({'p':'р','c':'с','o':'о','t':'т','k':'к','e':'е','a':'а','h':'н','x':'х','b':'в','m':'м'})
>>>
>>> alm.setOption(alm.options_t.allowUnk)
>>> alm.setOption(alm.options_t.resetUnk)
>>> alm.setOption(alm.options_t.mixDicts)
>>>
>>> def statusMap(text, status):
... print("Read map", text, status)
...
>>> def statusBuildArpa(status):
... print("Build ARPA", status)
...
>>> def statusPrune(status):
... print("Prune data", status)
...
>>> def statusVocab(text, status):
... print("Read Vocab", text, status)
...
>>> def statusWriteArpa(status):
... print("Write ARPA", status)
...
>>> alm.init(alm.smoothing_t.wittenBell)
>>>
>>> p = alm.getParams()
>>> p.algorithm
2
>>> alm.readVocab("./lm.vocab", statusVocab)
Read Vocab ./lm.vocab 0
Read Vocab ./lm.vocab 1
Read Vocab ./lm.vocab 2
Read Vocab ./lm.vocab 3
Read Vocab ./lm.vocab 4
Read Vocab ./lm.vocab 5
Read Vocab ./lm.vocab 6
...
>>> alm.readMap("./lm1.map", statusMap, "|")
Read map ./lm.map 0
Read map ./lm.map 1
Read map ./lm.map 2
Read map ./lm.map 3
Read map ./lm.map 4
Read map ./lm.map 5
Read map ./lm.map 6
...
>>> alm.readMap("./lm2.map", statusMap, "|")
Read map ./lm.map 0
Read map ./lm.map 1
Read map ./lm.map 2
Read map ./lm.map 3
Read map ./lm.map 4
Read map ./lm.map 5
Read map ./lm.map 6
...
>>> alm.pruneVocab(-15.0, 0, 0, statusPrune)
Prune data 0
Prune data 1
Prune data 2
Prune data 3
Prune data 4
Prune data 5
Prune data 6
...
>>> alm.buildArpa(statusBuildArpa)
Build ARPA 0
Build ARPA 1
Build ARPA 2
Build ARPA 3
Build ARPA 4
Build ARPA 5
Build ARPA 6
...
>>> alm.writeArpa("./lm.arpa", statusWriteArpa)
Write ARPA 0
Write ARPA 1
Write ARPA 2
Write ARPA 3
Write ARPA 4
Write ARPA 5
Write ARPA 6
...
>>> def getWords(word, idw, oc, dc, count):
... print(word, idw, oc, dc, count)
... return True
...
>>> alm.words(getWords)
а 25 244 12 9373
б 26 11 6 9373
в 27 757 12 9373
ж 32 12 7 9373
и 34 823 12 9373
к 36 102 12 9373
о 40 63 12 9373
п 41 1 1 9373
р 42 1 1 9373
с 43 290 12 9373
у 45 113 12 9373
Х 47 1 1 9373
я 57 299 12 9373
D 61 1 1 9373
I 66 1 1 9373
да 2179 32 10 9373
за 2183 92 12 9373
на 2189 435 12 9373
па 2191 1 1 9373
та 2194 4 4 9373
об 2276 20 10 9373
...
>>> alm.getStatistic()
(13, 38124)
>> alm.setAdCw(44381, 20)
>>> alm.getStatistic()
(20, 44381)
Example:
>>> import alm
>>>
>>> alm.setAlphabet("абвгдеёжзийклмнопрстуфхцчшщъыьэюяabcdefghijklmnopqrstuvwxyz")
>>> alm.setSubstitutes({'p':'р','c':'с','o':'о','t':'т','k':'к','e':'е','a':'а','h':'н','x':'х','b':'в','m':'м'})
>>>
>>> alm.setOption(alm.options_t.allowUnk)
>>> alm.setOption(alm.options_t.resetUnk)
>>> alm.setOption(alm.options_t.mixDicts)
>>>
>>> def statusBuildArpa(status):
... print("Build ARPA", status)
...
>>> def statusPrune(status):
... print("Prune data", status)
...
>>> def statusNgram(text, status):
... print("Read Ngram", text, status)
...
>>> def statusWriteArpa(status):
... print("Write ARPA", status)
...
>>> alm.init(alm.smoothing_t.addSmooth, False, False, 0.5)
>>>
>>> p = alm.getParams()
>>> p.algorithm
0
>>> p.mod
0.5
>>> p.prepares
False
>>> p.modified
False
>>>
>>> alm.readNgram("./lm.ngram", statusNgram)
Read Ngram ./lm.ngram 0
Read Ngram ./lm.ngram 1
Read Ngram ./lm.ngram 2
Read Ngram ./lm.ngram 3
Read Ngram ./lm.ngram 4
Read Ngram ./lm.ngram 5
Read Ngram ./lm.ngram 6
...
>>> alm.pruneVocab(-15.0, 0, 0, statusPrune)
Prune data 0
Prune data 1
Prune data 2
Prune data 3
Prune data 4
Prune data 5
Prune data 6
...
>>> alm.buildArpa(statusBuildArpa)
Build ARPA 0
Build ARPA 1
Build ARPA 2
Build ARPA 3
Build ARPA 4
Build ARPA 5
Build ARPA 6
...
>>> alm.writeArpa("./lm.arpa", statusWriteArpa)
Write ARPA 0
Write ARPA 1
Write ARPA 2
Write ARPA 3
Write ARPA 4
Write ARPA 5
Write ARPA 6
...
Methods:
- modify - ARPA modification method
- sweep - ARPA Low Frequency N-gram Removal Method
- repair - Method of repair of previously calculated ARPA
Example:
>>> import alm
>>>
>>> alm.setAlphabet("абвгдеёжзийклмнопрстуфхцчшщъыьэюяabcdefghijklmnopqrstuvwxyz")
>>>
>>> alm.setOption(alm.options_t.confidence)
>>>
>>> def statusSweep(text, status):
... print("Sweep n-grams", text, status)
...
>>> def statusWriteArpa(status):
... print("Write ARPA", status)
...
>>> alm.init()
>>>
>>> alm.sweep("./lm.arpa", statusSweep)
Sweep n-grams Read ARPA file 0
Sweep n-grams Read ARPA file 1
Sweep n-grams Read ARPA file 2
Sweep n-grams Read ARPA file 3
Sweep n-grams Read ARPA file 4
Sweep n-grams Read ARPA file 5
Sweep n-grams Read ARPA file 6
...
Sweep n-grams Sweep N-grams 0
Sweep n-grams Sweep N-grams 1
Sweep n-grams Sweep N-grams 2
Sweep n-grams Sweep N-grams 3
Sweep n-grams Sweep N-grams 4
Sweep n-grams Sweep N-grams 5
Sweep n-grams Sweep N-grams 6
...
>>> alm.writeArpa("./lm.arpa", statusWriteArpa)
Write ARPA 0
Write ARPA 1
Write ARPA 2
Write ARPA 3
Write ARPA 4
Write ARPA 5
Write ARPA 6
...
>>> alm.clear()
>>>
>>> alm.setAlphabet("абвгдеёжзийклмнопрстуфхцчшщъыьэюяabcdefghijklmnopqrstuvwxyz")
>>>
>>> def statusRepair(text, status):
... print("Repair n-grams", text, status)
...
>>> def statusWriteArpa(status):
... print("Write ARPA", status)
...
>>> alm.init()
>>>
>>> alm.repair("./lm.arpa", statusRepair)
Repair n-grams Read ARPA file 0
Repair n-grams Read ARPA file 1
Repair n-grams Read ARPA file 2
Repair n-grams Read ARPA file 3
Repair n-grams Read ARPA file 4
Repair n-grams Read ARPA file 5
Repair n-grams Read ARPA file 6
...
Repair n-grams Repair ARPA data 0
Repair n-grams Repair ARPA data 1
Repair n-grams Repair ARPA data 2
Repair n-grams Repair ARPA data 3
Repair n-grams Repair ARPA data 4
Repair n-grams Repair ARPA data 5
Repair n-grams Repair ARPA data 6
...
>>> alm.writeArpa("./lm.arpa", statusWriteArpa)
Write ARPA 0
Write ARPA 1
Write ARPA 2
Write ARPA 3
Write ARPA 4
Write ARPA 5
Write ARPA 6
...
>>> alm.clear()
>>>
>>> alm.setAlphabet("абвгдеёжзийклмнопрстуфхцчшщъыьэюяabcdefghijklmnopqrstuvwxyz")
>>>
>>> def statusModify(text, status):
... print("Modify ARPA data", text, status)
...
>>> def statusWriteArpa(status):
... print("Write ARPA", status)
...
>>> alm.init()
>>>
>>> alm.modify("./lm.arpa", "./remove.txt", alm.modify_t.remove, statusModify)
Modify ARPA data Read ARPA file 0
Modify ARPA data Read ARPA file 1
Modify ARPA data Read ARPA file 2
Modify ARPA data Read ARPA file 3
Modify ARPA data Read ARPA file 4
Modify ARPA data Read ARPA file 5
Modify ARPA data Read ARPA file 6
...
Modify ARPA data Modify ARPA data 3
Modify ARPA data Modify ARPA data 10
Modify ARPA data Modify ARPA data 15
Modify ARPA data Modify ARPA data 18
Modify ARPA data Modify ARPA data 24
Modify ARPA data Modify ARPA data 30
...
>>> alm.writeArpa("./lm.arpa", statusWriteArpa)
Write ARPA 0
Write ARPA 1
Write ARPA 2
Write ARPA 3
Write ARPA 4
Write ARPA 5
Write ARPA 6
...
Modification flags
Name |
Description |
emplace |
Flag of adding n-gram into existing ARPA file |
remove |
Flag of removing n-gram from existing ARPA file |
change |
Flag of changing n-gram frequency in existing ARPA file |
replace |
Flag of replacing n-gram in existing ARPA file |
File of adding n-gram into existing ARPA file
-3.002006 США
-1.365296 границ США
-0.988534 у границ США
-1.759398 замуж за
-0.092796 собираюсь замуж за
-0.474876 и тоже
-19.18453 можно и тоже
...
N-gram frequency |
Separator |
N-gram |
-0.988534 |
\t |
у границ США |
File of changing n-gram frequency in existing ARPA file
-0.6588787 получайте удовольствие </s>
-0.6588787 только в одном
-0.6588787 работа связана с
-0.6588787 мужчины и женщины
-0.6588787 говоря про то
-0.6588787 потому что я
-0.6588787 потому что это
-0.6588787 работу потому что
-0.6588787 пейзажи за окном
-0.6588787 статусы для одноклассников
-0.6588787 вообще не хочу
...
N-gram frequency |
Separator |
N-gram |
-0.6588787 |
\t |
мужчины и женщины |
File of replacing n-gram in existing ARPA file
коем случае нельзя там да тут
но тем не да ты что
неожиданный у ожидаемый к
в СМИ в ФСБ
Шах Мат
...
Existing N-gram |
Separator |
New N-gram |
но тем не |
\t |
да ты что |
File of removing n-gram from existing ARPA file
ну то есть
ну очень большой
бы было если
мы с ней
ты смеешься над
два года назад
над тем что
или еще что-то
как я понял
как ни удивительно
как вы знаете
так и не
все-таки права
все-таки болят
все-таки сдохло
все-таки встала
все-таки решился
уже
мне
мое
все
...
Methods:
- mix - Multiple ARPA Interpolation Method [backward = True, forward = False]
- mix - Interpolation method of multiple arpa algorithms (Bayesian and Logarithmic-linear) [Bayes: length > 0, Loglinear: length == 0]
Example:
>>> import alm
>>>
>>> alm.setAlphabet("абвгдеёжзийклмнопрстуфхцчшщъыьэюяabcdefghijklmnopqrstuvwxyz")
>>>
>>> alm.setOption(alm.options_t.confidence)
>>>
>>> def statusMix(text, status):
... print("Mix ARPA data", text, status)
...
>>> def statusWriteArpa(status):
... print("Write ARPA", status)
...
>>> alm.init()
>>>
>>> alm.mix(["./lm1.arpa", "./lm2.arpa"], [0.02, 0.05], True, statusMix)
Mix ARPA data ./lm1.arpa 0
Mix ARPA data ./lm1.arpa 1
Mix ARPA data ./lm1.arpa 2
Mix ARPA data ./lm1.arpa 3
Mix ARPA data ./lm1.arpa 4
Mix ARPA data ./lm1.arpa 5
Mix ARPA data ./lm1.arpa 6
...
Mix ARPA data 0
Mix ARPA data 1
Mix ARPA data 2
Mix ARPA data 3
Mix ARPA data 4
Mix ARPA data 5
Mix ARPA data 6
>>> alm.writeArpa("./lm.arpa", statusWriteArpa)
Write ARPA 0
Write ARPA 1
Write ARPA 2
Write ARPA 3
Write ARPA 4
Write ARPA 5
Write ARPA 6
...
>>> alm.clear()
>>>
>>> alm.setAlphabet("абвгдеёжзийклмнопрстуфхцчшщъыьэюяabcdefghijklmnopqrstuvwxyz")
>>>
>>> def statusMix(text, status):
... print("Mix ARPA data", text, status)
...
>>> def statusWriteArpa(status):
... print("Write ARPA", status)
...
>>> alm.init()
>>>
>>> alm.mix(["./lm1.arpa", "./lm2.arpa"], [0.02, 0.05], 0, 0.032, statusMix)
Mix ARPA data ./lm1.arpa 0
Mix ARPA data ./lm1.arpa 1
Mix ARPA data ./lm1.arpa 2
Mix ARPA data ./lm1.arpa 3
Mix ARPA data ./lm1.arpa 4
Mix ARPA data ./lm1.arpa 5
Mix ARPA data ./lm1.arpa 6
...
Mix ARPA data 0
Mix ARPA data 1
Mix ARPA data 2
Mix ARPA data 3
Mix ARPA data 4
Mix ARPA data 5
Mix ARPA data 6
>>> alm.writeArpa("./lm.arpa", statusWriteArpa)
Write ARPA 0
Write ARPA 1
Write ARPA 2
Write ARPA 3
Write ARPA 4
Write ARPA 5
Write ARPA 6
...
Methods:
- size - Method of obtaining the size of the N-gram
Example:
>>> import alm
>>>
>>> alm.setOption(alm.options_t.confidence)
>>>
>>> alm.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>>
>>> alm.readArpa('./lm.arpa')
>>>
>>> alm.size()
3
Methods:
- damerauLevenshtein - Determination of the Damerau-Levenshtein distance in phrases
- distanceLevenshtein - Determination of Levenshtein distance in phrases
- tanimoto - Method for determining Jaccard coefficient (quotient - Tanimoto coefficient)
- needlemanWunsch - Word stretching method
Example:
>>> import alm
>>> alm.damerauLevenshtein("привет", "приветик")
2
>>>
>>> alm.damerauLevenshtein("приевтик", "приветик")
1
>>>
>>> alm.distanceLevenshtein("приевтик", "приветик")
2
>>>
>>> alm.tanimoto("привет", "приветик")
0.7142857142857143
>>>
>>> alm.tanimoto("привеитк", "приветик")
0.4
>>>
>>> alm.needlemanWunsch("привеитк", "приветик")
4
>>>
>>> alm.needlemanWunsch("привет", "приветик")
2
>>>
>>> alm.damerauLevenshtein("acre", "car")
2
>>> alm.distanceLevenshtein("acre", "car")
3
>>>
>>> alm.damerauLevenshtein("anteater", "theatre")
4
>>> alm.distanceLevenshtein("anteater", "theatre")
5
>>>
>>> alm.damerauLevenshtein("banana", "nanny")
3
>>> alm.distanceLevenshtein("banana", "nanny")
3
>>>
>>> alm.damerauLevenshtein("cat", "crate")
2
>>> alm.distanceLevenshtein("cat", "crate")
2
>>>
>>> alm.mulctLevenshtein("привет", "приветик")
4
>>>
>>> alm.mulctLevenshtein("приевтик", "приветик")
1
>>>
>>> alm.mulctLevenshtein("acre", "car")
3
>>>
>>> alm.mulctLevenshtein("anteater", "theatre")
5
>>>
>>> alm.mulctLevenshtein("banana", "nanny")
4
>>>
>>> alm.mulctLevenshtein("cat", "crate")
4
Methods:
- textToJson - Method to convert text to JSON
- isAllowApostrophe - Apostrophe permission check method
- switchAllowApostrophe - Method for permitting or denying an apostrophe as part of a word
Example:
>>> import alm
>>>
>>> def callbackFn(text):
... print(text)
...
>>> alm.isAllowApostrophe()
False
>>> alm.switchAllowApostrophe()
>>>
>>> alm.isAllowApostrophe()
True
>>> alm.textToJson("«On nous dit qu'aujourd'hui c'est le cas, encore faudra-t-il l'évaluer» l'astronomie", callbackFn)
[["«","On","nous","dit","qu'aujourd'hui","c'est","le","cas",",","encore","faudra-t-il","l'évaluer","»","l'astronomie"]]
Methods:
- jsonToText - Method to convert JSON to text
Example:
>>> import alm
>>>
>>> def callbackFn(text):
... print(text)
...
>>> alm.jsonToText('[["«","On","nous","dit","qu\'aujourd\'hui","c\'est","le","cas",",","encore","faudra-t-il","l\'évaluer","»","l\'astronomie"]]', callbackFn)
«On nous dit qu'aujourd'hui c'est le cas, encore faudra-t-il l'évaluer» l'astronomie
Methods:
- restore - Method for restore text from context
Example:
>>> import alm
>>>
>>> alm.setOption(alm.options_t.uppers)
>>>
>>> alm.restore(["«","On","nous","dit","qu\'aujourd\'hui","c\'est","le","cas",",","encore","faudra-t-il","l\'évaluer","»","l\'astronomie"])
"«On nous dit qu'aujourd'hui c'est le cas, encore faudra-t-il l'évaluer» l'astronomie"
Methods:
- allowStress - Method for allow using stress in words
- disallowStress - Method for disallow using stress in words
Example:
>>> import alm
>>>
>>> alm.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>>
>>> def callbackFn(text):
... print(text)
...
>>> alm.textToJson('«Бе́лая стрела́» — согласно распространённой в 1990-е годы в России городской легенде, якобы специально организованная и подготовленная законспирированная правительственная спецслужба, сотрудники которой — бывшие и действовавшие милиционеры и спецназовцы, имеющие право на физическую ликвидацию особо опасных уголовных авторитетов и лидеров орудовавших в России ОПГ, относительно которых не представляется возможным привлечения их к уголовной ответственности законными методами[1][2][3]. Несмотря на отсутствие официальных доказательств существования организации и многочисленные опровержения со стороны силовых структур и служб безопасности[4], в российском обществе легенду считают основанной на подлинных фактах громких убийств криминальных авторитетов, совершённых в 1990-е годы, и не исключают существование реальной спецслужбы[5].', callbackFn)
[["«","Белая","стрела","»","—","согласно","распространённой","в","1990-е","годы","в","России","городской","легенде",",","якобы","специально","организованная","и","подготовленная","законспирированная","правительственная","спецслужба",",","сотрудники","которой","—","бывшие","и","действовавшие","милиционеры","и","спецназовцы",",","имеющие","право","на","физическую","ликвидацию","особо","опасных","уголовных","авторитетов","и","лидеров","орудовавших","в","России","ОПГ",",","относительно","которых","не","представляется","возможным","привлечения","их","к","уголовной","ответственности","законными","методами","[","1","]","[","2","]","[","3","]","."],["Несмотря","на","отсутствие","официальных","доказательств","существования","организации","и","многочисленные","опровержения","со","стороны","силовых","структур","и","служб","безопасности","[","4","]",",","в","российском","обществе","легенду","считают","основанной","на","подлинных","фактах","громких","убийств","криминальных","авторитетов",",","совершённых","в","1990-е","годы",",","и","не","исключают","существование","реальной","спецслужбы","[","5","]","."]]
>>>
>>> alm.jsonToText('[["«","Белая","стрела","»","—","согласно","распространённой","в","1990-е","годы","в","России","городской","легенде",",","якобы","специально","организованная","и","подготовленная","законспирированная","правительственная","спецслужба",",","сотрудники","которой","—","бывшие","и","действовавшие","милиционеры","и","спецназовцы",",","имеющие","право","на","физическую","ликвидацию","особо","опасных","уголовных","авторитетов","и","лидеров","орудовавших","в","России","ОПГ",",","относительно","которых","не","представляется","возможным","привлечения","их","к","уголовной","ответственности","законными","методами","[","1","]","[","2","]","[","3","]","."],["Несмотря","на","отсутствие","официальных","доказательств","существования","организации","и","многочисленные","опровержения","со","стороны","силовых","структур","и","служб","безопасности","[","4","]",",","в","российском","обществе","легенду","считают","основанной","на","подлинных","фактах","громких","убийств","криминальных","авторитетов",",","совершённых","в","1990-е","годы",",","и","не","исключают","существование","реальной","спецслужбы","[","5","]","."]]', callbackFn)
«Белая стрела» — согласно распространённой в 1990-е годы в России городской легенде, якобы специально организованная и подготовленная законспирированная правительственная спецслужба, сотрудники которой — бывшие и действовавшие милиционеры и спецназовцы, имеющие право на физическую ликвидацию особо опасных уголовных авторитетов и лидеров орудовавших в России ОПГ, относительно которых не представляется возможным привлечения их к уголовной ответственности законными методами [1] [2] [3].
Несмотря на отсутствие официальных доказательств существования организации и многочисленные опровержения со стороны силовых структур и служб безопасности [4], в российском обществе легенду считают основанной на подлинных фактах громких убийств криминальных авторитетов, совершённых в 1990-е годы, и не исключают существование реальной спецслужбы [5].
>>>
>>> alm.allowStress()
>>> alm.textToJson('«Бе́лая стрела́» — согласно распространённой в 1990-е годы в России городской легенде, якобы специально организованная и подготовленная законспирированная правительственная спецслужба, сотрудники которой — бывшие и действовавшие милиционеры и спецназовцы, имеющие право на физическую ликвидацию особо опасных уголовных авторитетов и лидеров орудовавших в России ОПГ, относительно которых не представляется возможным привлечения их к уголовной ответственности законными методами[1][2][3]. Несмотря на отсутствие официальных доказательств существования организации и многочисленные опровержения со стороны силовых структур и служб безопасности[4], в российском обществе легенду считают основанной на подлинных фактах громких убийств криминальных авторитетов, совершённых в 1990-е годы, и не исключают существование реальной спецслужбы[5].', callbackFn)
[["«","Бе́лая","стрела́","»","—","согласно","распространённой","в","1990-е","годы","в","России","городской","легенде",",","якобы","специально","организованная","и","подготовленная","законспирированная","правительственная","спецслужба",",","сотрудники","которой","—","бывшие","и","действовавшие","милиционеры","и","спецназовцы",",","имеющие","право","на","физическую","ликвидацию","особо","опасных","уголовных","авторитетов","и","лидеров","орудовавших","в","России","ОПГ",",","относительно","которых","не","представляется","возможным","привлечения","их","к","уголовной","ответственности","законными","методами","[","1","]","[","2","]","[","3","]","."],["Несмотря","на","отсутствие","официальных","доказательств","существования","организации","и","многочисленные","опровержения","со","стороны","силовых","структур","и","служб","безопасности","[","4","]",",","в","российском","обществе","легенду","считают","основанной","на","подлинных","фактах","громких","убийств","криминальных","авторитетов",",","совершённых","в","1990-е","годы",",","и","не","исключают","существование","реальной","спецслужбы","[","5","]","."]]
>>>
>>> alm.jsonToText('[["«","Бе́лая","стрела́","»","—","согласно","распространённой","в","1990-е","годы","в","России","городской","легенде",",","якобы","специально","организованная","и","подготовленная","законспирированная","правительственная","спецслужба",",","сотрудники","которой","—","бывшие","и","действовавшие","милиционеры","и","спецназовцы",",","имеющие","право","на","физическую","ликвидацию","особо","опасных","уголовных","авторитетов","и","лидеров","орудовавших","в","России","ОПГ",",","относительно","которых","не","представляется","возможным","привлечения","их","к","уголовной","ответственности","законными","методами","[","1","]","[","2","]","[","3","]","."],["Несмотря","на","отсутствие","официальных","доказательств","существования","организации","и","многочисленные","опровержения","со","стороны","силовых","структур","и","служб","безопасности","[","4","]",",","в","российском","обществе","легенду","считают","основанной","на","подлинных","фактах","громких","убийств","криминальных","авторитетов",",","совершённых","в","1990-е","годы",",","и","не","исключают","существование","реальной","спецслужбы","[","5","]","."]]', callbackFn)
«Бе́лая стрела́» — согласно распространённой в 1990-е годы в России городской легенде, якобы специально организованная и подготовленная законспирированная правительственная спецслужба, сотрудники которой — бывшие и действовавшие милиционеры и спецназовцы, имеющие право на физическую ликвидацию особо опасных уголовных авторитетов и лидеров орудовавших в России ОПГ, относительно которых не представляется возможным привлечения их к уголовной ответственности законными методами [1] [2] [3].
Несмотря на отсутствие официальных доказательств существования организации и многочисленные опровержения со стороны силовых структур и служб безопасности [4], в российском обществе легенду считают основанной на подлинных фактах громких убийств криминальных авторитетов, совершённых в 1990-е годы, и не исключают существование реальной спецслужбы [5].
>>>
>>> alm.disallowStress()
>>> alm.textToJson('«Бе́лая стрела́» — согласно распространённой в 1990-е годы в России городской легенде, якобы специально организованная и подготовленная законспирированная правительственная спецслужба, сотрудники которой — бывшие и действовавшие милиционеры и спецназовцы, имеющие право на физическую ликвидацию особо опасных уголовных авторитетов и лидеров орудовавших в России ОПГ, относительно которых не представляется возможным привлечения их к уголовной ответственности законными методами[1][2][3]. Несмотря на отсутствие официальных доказательств существования организации и многочисленные опровержения со стороны силовых структур и служб безопасности[4], в российском обществе легенду считают основанной на подлинных фактах громких убийств криминальных авторитетов, совершённых в 1990-е годы, и не исключают существование реальной спецслужбы[5].', callbackFn)
[["«","Белая","стрела","»","—","согласно","распространённой","в","1990-е","годы","в","России","городской","легенде",",","якобы","специально","организованная","и","подготовленная","законспирированная","правительственная","спецслужба",",","сотрудники","которой","—","бывшие","и","действовавшие","милиционеры","и","спецназовцы",",","имеющие","право","на","физическую","ликвидацию","особо","опасных","уголовных","авторитетов","и","лидеров","орудовавших","в","России","ОПГ",",","относительно","которых","не","представляется","возможным","привлечения","их","к","уголовной","ответственности","законными","методами","[","1","]","[","2","]","[","3","]","."],["Несмотря","на","отсутствие","официальных","доказательств","существования","организации","и","многочисленные","опровержения","со","стороны","силовых","структур","и","служб","безопасности","[","4","]",",","в","российском","обществе","легенду","считают","основанной","на","подлинных","фактах","громких","убийств","криминальных","авторитетов",",","совершённых","в","1990-е","годы",",","и","не","исключают","существование","реальной","спецслужбы","[","5","]","."]]
>>>
>>> alm.jsonToText('[["«","Белая","стрела","»","—","согласно","распространённой","в","1990-е","годы","в","России","городской","легенде",",","якобы","специально","организованная","и","подготовленная","законспирированная","правительственная","спецслужба",",","сотрудники","которой","—","бывшие","и","действовавшие","милиционеры","и","спецназовцы",",","имеющие","право","на","физическую","ликвидацию","особо","опасных","уголовных","авторитетов","и","лидеров","орудовавших","в","России","ОПГ",",","относительно","которых","не","представляется","возможным","привлечения","их","к","уголовной","ответственности","законными","методами","[","1","]","[","2","]","[","3","]","."],["Несмотря","на","отсутствие","официальных","доказательств","существования","организации","и","многочисленные","опровержения","со","стороны","силовых","структур","и","служб","безопасности","[","4","]",",","в","российском","обществе","легенду","считают","основанной","на","подлинных","фактах","громких","убийств","криминальных","авторитетов",",","совершённых","в","1990-е","годы",",","и","не","исключают","существование","реальной","спецслужбы","[","5","]","."]]', callbackFn)
«Белая стрела» — согласно распространённой в 1990-е годы в России городской легенде, якобы специально организованная и подготовленная законспирированная правительственная спецслужба, сотрудники которой — бывшие и действовавшие милиционеры и спецназовцы, имеющие право на физическую ликвидацию особо опасных уголовных авторитетов и лидеров орудовавших в России ОПГ, относительно которых не представляется возможным привлечения их к уголовной ответственности законными методами [1] [2] [3].
Несмотря на отсутствие официальных доказательств существования организации и многочисленные опровержения со стороны силовых структур и служб безопасности [4], в российском обществе легенду считают основанной на подлинных фактах громких убийств криминальных авторитетов, совершённых в 1990-е годы, и не исключают существование реальной спецслужбы [5].
Methods:
- addBadword - Method add bad word
- setBadwords - Method set words to blacklist
- getBadwords - Method get words in blacklist
Example:
>>> import alm
>>>
>>> alm.setBadwords(["hello", "world", "test"])
>>>
>>> alm.getBadwords()
{1554834897, 2156498622, 28307030}
>>>
>>> alm.addBadword("test2")
>>>
>>> alm.getBadwords()
{5170183734, 1554834897, 2156498622, 28307030}
Example:
>>> import alm
>>>
>>> alm.setBadwords({24227504, 1219922507, 1794085167})
>>>
>>> alm.getBadwords()
{24227504, 1219922507, 1794085167}
>>>
>>> alm.clear(alm.clear_t.badwords)
>>>
>>> alm.getBadwords()
{}
Methods:
- addGoodword - Method add good word
- setGoodwords - Method set words to whitelist
- getGoodwords - Method get words in whitelist
Example:
>>> import alm
>>>
>>> alm.setGoodwords(["hello", "world", "test"])
>>>
>>> alm.getGoodwords()
{1554834897, 2156498622, 28307030}
>>>
>>> alm.addGoodword("test2")
>>>
>>> alm.getGoodwords()
{5170183734, 1554834897, 2156498622, 28307030}
>>>
>>> alm.clear(alm.clear_t.goodwords)
>>>
>> alm.getGoodwords()
{}
Example:
>>> import alm
>>>
>>> alm.setGoodwords({24227504, 1219922507, 1794085167})
>>>
>>> alm.getGoodwords()
{24227504, 1219922507, 1794085167}
Methods:
- setUserToken - Method for adding user token
- getUserTokens - User token list retrieval method
- getUserTokenId - Method for obtaining user token identifier
- getUserTokenWord - Method for obtaining a custom token by its identifier
Example:
>>> import alm
>>>
>>> alm.setUserToken("usa")
>>>
>>> alm.setUserToken("russia")
>>>
>>> alm.getUserTokenId("usa")
5759809081
>>>
>>> alm.getUserTokenId("russia")
9910674734
>>>
>>> alm.getUserTokens()
['usa', 'russia']
>>>
>>> alm.getUserTokenWord(5759809081)
'usa'
>>>
>>> alm.getUserTokenWord(9910674734)
'russia'
>>>
>> alm.clear(alm.clear_t.utokens)
>>>
>>> alm.getUserTokens()
[]
Methods:
- findNgram - N-gram search method in text
- word - "Method to extract a word by its identifier"
Example:
>>> import alm
>>>
>>> def callbackFn(text):
... print(text)
...
>>> alm.setOption(alm.options_t.confidence)
>>> alm.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>> alm.readArpa('./lm.arpa')
>>>
>>> alm.idw("привет")
2487910648
>>> alm.word(2487910648)
'привет'
>>>
>>> alm.findNgram("Особое место занимает чудотворная икона Лобзание Христа Иудою", callbackFn)
<s> Особое
Особое место
место занимает
занимает чудотворная
чудотворная икона
икона Лобзание
Лобзание Христа
Христа Иудою
Иудою </s>
>>>
Methods:
- setUserTokenMethod - Method for set a custom token processing function
Example:
>>> import alm
>>>
>>> def fn(token, word):
... if token and (token == "<usa>"):
... if word and (word.lower() == "usa"):
... return True
... elif token and (token == "<russia>"):
... if word and (word.lower() == "russia"):
... return True
... return False
...
>>> alm.setUserToken("usa")
>>>
>>> alm.setUserToken("russia")
>>>
>>> alm.setUserTokenMethod("usa", fn)
>>>
>>> alm.setUserTokenMethod("russia", fn)
>>>
>>> alm.idw("usa")
5759809081
>>>
>>> alm.idw("russia")
9910674734
>>>
>>> alm.getUserTokenWord(5759809081)
'usa'
>>>
>>> alm.getUserTokenWord(9910674734)
'russia'
Methods:
- setAlmV2 - Method for set the language model type ALMv2
- unsetAlmV2 - Method for unset the language model type ALMv2
- readALM - Method for reading data from a binary container
- setWordPreprocessingMethod - Method for set the word preprocessing function
Example:
>>> import alm
>>>
>>> alm.setAlmV2()
>>>
>>> def run(word, context):
... if word == "возле": word = "около"
... return word
...
>>> alm.setOption(alm.options_t.debug)
>>>
>>> alm.setOption(alm.options_t.confidence)
>>>
>>> alm.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>>
>>> alm.readArpa('./lm.arpa')
>>>
>>> alm.setWordPreprocessingMethod(run)
>>>
>>> a = alm.perplexity("неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор???с лязгом выкатился и остановился возле мальчика....")
info: <s> Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор <punct> <punct> <punct> </s>
info: p( неожиданно | <s> ) = [2gram] 0.00038931 [ -3.40969900 ] / 0.99999991
info: p( из | неожиданно ...) = [2gram] 0.10110741 [ -0.99521700 ] / 0.99999979
info: p( подворотни | из ...) = [2gram] 0.00711798 [ -2.14764300 ] / 1.00000027
info: p( в | подворотни ...) = [2gram] 0.51077661 [ -0.29176900 ] / 1.00000021
info: p( олега | в ...) = [2gram] 0.00082936 [ -3.08125500 ] / 0.99999974
info: p( ударил | олега ...) = [2gram] 0.25002820 [ -0.60201100 ] / 0.99999978
info: p( яркий | ударил ...) = [2gram] 0.50002878 [ -0.30100500 ] / 1.00000034
info: p( прожектор | яркий ...) = [2gram] 0.50002878 [ -0.30100500 ] / 1.00000034
info: p( патрульный | прожектор ...) = [2gram] 0.50002878 [ -0.30100500 ] / 1.00000034
info: p( трактор | патрульный ...) = [2gram] 0.50002878 [ -0.30100500 ] / 1.00000034
info: p( <punct> | трактор ...) = [OOV] 0.00000000 [ -inf ] / 0.99999973
info: p( <punct> | <punct> ...) = [OOV] 0.00000000 [ -inf ] / 0.99999993
info: p( <punct> | <punct> ...) = [OOV] 0.00000000 [ -inf ] / 0.99999993
info: p( </s> | <punct> ...) = [1gram] 0.05693430 [ -1.24462600 ] / 0.99999993
info: 1 sentences, 13 words, 0 OOVs
info: 3 zeroprobs, logprob= -12.97624000 ppl= 8.45034200 ppl1= 9.95800426
info: <s> С лязгом выкатился и остановился около мальчика <punct> <punct> <punct> <punct> </s>
info: p( с | <s> ) = [2gram] 0.00642448 [ -2.19216200 ] / 0.99999991
info: p( лязгом | с ...) = [2gram] 0.00195917 [ -2.70792700 ] / 0.99999999
info: p( выкатился | лязгом ...) = [2gram] 0.50002878 [ -0.30100500 ] / 1.00000034
info: p( и | выкатился ...) = [2gram] 0.51169951 [ -0.29098500 ] / 1.00000024
info: p( остановился | и ...) = [2gram] 0.00143382 [ -2.84350600 ] / 0.99999975
info: p( около | остановился ...) = [1gram] 0.00011358 [ -3.94468000 ] / 1.00000003
info: p( мальчика | около ...) = [1gram] 0.00003932 [ -4.40541100 ] / 1.00000016
info: p( <punct> | мальчика ...) = [OOV] 0.00000000 [ -inf ] / 0.99999990
info: p( <punct> | <punct> ...) = [OOV] 0.00000000 [ -inf ] / 0.99999993
info: p( <punct> | <punct> ...) = [OOV] 0.00000000 [ -inf ] / 0.99999993
info: p( <punct> | <punct> ...) = [OOV] 0.00000000 [ -inf ] / 0.99999993
info: p( </s> | <punct> ...) = [1gram] 0.05693430 [ -1.24462600 ] / 0.99999993
info: 1 sentences, 11 words, 0 OOVs
info: 4 zeroprobs, logprob= -17.93030200 ppl= 31.20267541 ppl1= 42.66064865
>>> print(a.logprob)
-30.906542
Example:
>>> import alm
>>>
>>> alm.setAlmV2()
>>>
>>> alm.setOption(alm.options_t.debug)
>>>
>>> alm.setOption(alm.options_t.confidence)
>>>
>>> alm.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>>
>>> def statusAlm(status):
... print("Read ALM", status)
...
>>> alm.readALM("./lm.alm", "password", 128, statusAlm)
Read ALM 0
Read ALM 1
Read ALM 2
Read ALM 3
Read ALM 4
Read ALM 5
Read ALM 6
...
>>>
>>> a = alm.perplexity("неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор???с лязгом выкатился и остановился возле мальчика....")
info: <s> Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор <punct> <punct> <punct> </s>
info: p( неожиданно | <s> ) = [2gram] 0.00038931 [ -3.40969900 ] / 0.99999991
info: p( из | неожиданно ...) = [2gram] 0.10110741 [ -0.99521700 ] / 0.99999979
info: p( подворотни | из ...) = [2gram] 0.00711798 [ -2.14764300 ] / 1.00000027
info: p( в | подворотни ...) = [2gram] 0.51077661 [ -0.29176900 ] / 1.00000021
info: p( олега | в ...) = [2gram] 0.00082936 [ -3.08125500 ] / 0.99999974
info: p( ударил | олега ...) = [2gram] 0.25002820 [ -0.60201100 ] / 0.99999978
info: p( яркий | ударил ...) = [2gram] 0.50002878 [ -0.30100500 ] / 1.00000034
info: p( прожектор | яркий ...) = [2gram] 0.50002878 [ -0.30100500 ] / 1.00000034
info: p( патрульный | прожектор ...) = [2gram] 0.50002878 [ -0.30100500 ] / 1.00000034
info: p( трактор | патрульный ...) = [2gram] 0.50002878 [ -0.30100500 ] / 1.00000034
info: p( <punct> | трактор ...) = [OOV] 0.00000000 [ -inf ] / 0.99999973
info: p( <punct> | <punct> ...) = [OOV] 0.00000000 [ -inf ] / 0.99999993
info: p( <punct> | <punct> ...) = [OOV] 0.00000000 [ -inf ] / 0.99999993
info: p( </s> | <punct> ...) = [1gram] 0.05693430 [ -1.24462600 ] / 0.99999993
info: 1 sentences, 13 words, 0 OOVs
info: 3 zeroprobs, logprob= -12.97624000 ppl= 8.45034200 ppl1= 9.95800426
info: <s> С лязгом выкатился и остановился около мальчика <punct> <punct> <punct> <punct> </s>
info: p( с | <s> ) = [2gram] 0.00642448 [ -2.19216200 ] / 0.99999991
info: p( лязгом | с ...) = [2gram] 0.00195917 [ -2.70792700 ] / 0.99999999
info: p( выкатился | лязгом ...) = [2gram] 0.50002878 [ -0.30100500 ] / 1.00000034
info: p( и | выкатился ...) = [2gram] 0.51169951 [ -0.29098500 ] / 1.00000024
info: p( остановился | и ...) = [2gram] 0.00143382 [ -2.84350600 ] / 0.99999975
info: p( около | остановился ...) = [1gram] 0.00011358 [ -3.94468000 ] / 1.00000003
info: p( мальчика | около ...) = [1gram] 0.00003932 [ -4.40541100 ] / 1.00000016
info: p( <punct> | мальчика ...) = [OOV] 0.00000000 [ -inf ] / 0.99999990
info: p( <punct> | <punct> ...) = [OOV] 0.00000000 [ -inf ] / 0.99999993
info: p( <punct> | <punct> ...) = [OOV] 0.00000000 [ -inf ] / 0.99999993
info: p( <punct> | <punct> ...) = [OOV] 0.00000000 [ -inf ] / 0.99999993
info: p( </s> | <punct> ...) = [1gram] 0.05693430 [ -1.24462600 ] / 0.99999993
info: 1 sentences, 11 words, 0 OOVs
info: 4 zeroprobs, logprob= -17.93030200 ppl= 31.20267541 ppl1= 42.66064865
>>> print(a.logprob)
-30.906542
Methods:
- setLogfile - Method of set the file for log output
- setOOvFile - Method set file for saving OOVs words
Example:
>>> import alm
>>>
>>> alm.setLogfile("./log.txt")
>>>
>>> alm.setOOvFile("./oov.txt")
Methods:
- perplexity - Perplexity calculation
- pplConcatenate - Method of combining perplexia
- pplByFiles - Method for reading perplexity calculation by file or group of files
Example:
>>> import alm
>>>
>>> alm.setOption(alm.options_t.confidence)
>>>
>>> alm.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>>
>>> alm.readArpa('./lm.arpa')
>>>
>>> a = alm.perplexity("неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор???с лязгом выкатился и остановился возле мальчика....")
>>>
>>> print(a.logprob)
-30.906542
>>>
>>> print(a.oovs)
0
>>>
>>> print(a.words)
24
>>>
>>> print(a.sentences)
2
>>>
>>> print(a.zeroprobs)
7
>>>
>>> print(a.ppl)
17.229063831108224
>>>
>>> print(a.ppl1)
19.398698060810077
>>>
>>> b = alm.pplByFiles("./text.txt")
>>>
>>> c = alm.pplConcatenate(a, b)
>>>
>>> print(c.ppl)
7.384123548831112
Description
Name |
Description |
ppl |
The meaning of perplexity without considering the beginning of the sentence |
ppl1 |
The meaning of perplexion taking into account the beginning of the sentence |
oovs |
Count of oov words |
words |
Count of words in sentence |
logprob |
Word sequence frequency |
sentences |
Count of sequences |
zeroprobs |
Count of zero probs |
Methods:
- tokenization - Method for breaking text into tokens
Example:
>>> import alm
>>>
>>> def tokensFn(word, context, reset, stop):
... print(word, " => ", context)
... return True
...
>>> alm.switchAllowApostrophe()
>>>
>>> alm.tokenization("«On nous dit qu'aujourd'hui c'est le cas, encore faudra-t-il l'évaluer» l'astronomie", tokensFn)
« => []
On => ['«']
nous => ['«', 'On']
dit => ['«', 'On', 'nous']
qu'aujourd'hui => ['«', 'On', 'nous', 'dit']
c'est => ['«', 'On', 'nous', 'dit', "qu'aujourd'hui"]
le => ['«', 'On', 'nous', 'dit', "qu'aujourd'hui", "c'est"]
cas => ['«', 'On', 'nous', 'dit', "qu'aujourd'hui", "c'est", 'le']
, => ['«', 'On', 'nous', 'dit', "qu'aujourd'hui", "c'est", 'le', 'cas']
encore => ['«', 'On', 'nous', 'dit', "qu'aujourd'hui", "c'est", 'le', 'cas', ',']
faudra-t-il => ['«', 'On', 'nous', 'dit', "qu'aujourd'hui", "c'est", 'le', 'cas', ',', 'encore']
l => ['«', 'On', 'nous', 'dit', "qu'aujourd'hui", "c'est", 'le', 'cas', ',', 'encore', 'faudra-t-il', 'l']
' => ['«', 'On', 'nous', 'dit', "qu'aujourd'hui", "c'est", 'le', 'cas', ',', 'encore', 'faudra-t-il', 'l']
évaluer => ['«', 'On', 'nous', 'dit', "qu'aujourd'hui", "c'est", 'le', 'cas', ',', 'encore', 'faudra-t-il', 'l', "'"]
» => ['«', 'On', 'nous', 'dit', "qu'aujourd'hui", "c'est", 'le', 'cas', ',', 'encore', 'faudra-t-il', 'l', "'", 'évaluer']
l => ['«', 'On', 'nous', 'dit', "qu'aujourd'hui", "c'est", 'le', 'cas', ',', 'encore', 'faudra-t-il', 'l', "'", 'évaluer', '»']
' => ['«', 'On', 'nous', 'dit', "qu'aujourd'hui", "c'est", 'le', 'cas', ',', 'encore', 'faudra-t-il', 'l', "'", 'évaluer', '»', 'l']
astronomie => ['«', 'On', 'nous', 'dit', "qu'aujourd'hui", "c'est", 'le', 'cas', ',', 'encore', 'faudra-t-il', 'l', "'", 'évaluer', '»', 'l', "'"]
Methods:
- setTokenizerFn - Method for set the function of an external tokenizer
Example:
>>> import alm
>>>
>>> def tokenizerFn(text, callback):
... word = ""
... context = []
... for letter in text:
... if letter == " " and len(word) > 0:
... if not callback(word, context, False, False): return
... context.append(word)
... word = ""
... elif letter == "." or letter == "!" or letter == "?":
... if not callback(word, context, True, False): return
... word = ""
... context = []
... else:
... word += letter
... if len(word) > 0:
... if not callback(word, context, False, True): return
...
>>> def tokensFn(word, context, reset, stop):
... print(word, " => ", context)
... return True
...
>>> alm.setTokenizerFn(tokenizerFn)
>>>
>>> alm.tokenization("Hello World today!", tokensFn)
Hello => []
World => ['Hello']
today => ['Hello', 'World']
Methods:
- sentences - Sentences generation method
- sentencesToFile - Method for assembling a specified number of sentences and writing to a file
Example:
>>> import alm
>>>
>>> def sentencesFn(text):
... print("Sentences:", text)
... return True
...
>>> alm.setOption(alm.options_t.confidence)
>>>
>>> alm.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>>
>>> alm.readArpa('./lm.arpa')
>>>
>>> alm.sentences(sentencesFn)
Sentences: <s> В общем </s>
Sentences: <s> С лязгом выкатился и остановился возле мальчика </s>
Sentences: <s> У меня нет </s>
Sentences: <s> Я вообще не хочу </s>
Sentences: <s> Да и в общем </s>
Sentences: <s> Не могу </s>
Sentences: <s> Ну в общем </s>
Sentences: <s> Так что я вообще не хочу </s>
Sentences: <s> Потому что я вообще не хочу </s>
Sentences: <s> Продолжение следует </s>
Sentences: <s> Неожиданно из подворотни в олега ударил яркий прожектор патрульный трактор </s>
>>>
>>> alm.sentencesToFile(5, "./result.txt")
Methods:
- fixUppers - Method for correcting registers in the text
- fixUppersByFiles - Method for correcting text registers in a text file
Example:
>>> import alm
>>>
>>> alm.setOption(alm.options_t.confidence)
>>>
>>> alm.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>>
>>> alm.readArpa('./lm.arpa')
>>>
>>> alm.fixUppers("неожиданно из подворотни в олега ударил яркий прожектор патрульный трактор???с лязгом выкатился и остановился возле мальчика....")
'Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор??? С лязгом выкатился и остановился возле мальчика....'
>>>
>>> alm.fixUppersByFiles("./corpus", "./result.txt", "txt")
Methods:
- checkHypLat - Hyphen and latin character search method
Example:
>>> import alm
>>>
>>> alm.checkHypLat("Hello-World")
(True, True)
>>>
>>> alm.checkHypLat("Hello")
(False, True)
>>>
>>> alm.checkHypLat("Привет")
(False, False)
>>>
>>> alm.checkHypLat("так-как")
(True, False)
Methods:
- getUppers - Method for extracting registers for each word
- countLetter - Method for counting the amount of a specific letter in a word
Example:
>>> import alm
>>>
>>> alm.setOption(alm.options_t.confidence)
>>>
>>> alm.readArpa('./lm.arpa')
>>>
>>> alm.idw("Living")
10493385932
>>>
>>> alm.idw("in")
3301
>>>
>>> alm.idw("the")
217280
>>>
>>> alm.idw("USA")
188643
>>>
>>> alm.getUppers([10493385932, 3301, 217280, 188643])
[1, 0, 0, 7]
>>>
>>> alm.countLetter("hello-world", "-")
1
>>>
>>> alm.countLetter("hello-world", "l")
3
Methods:
- urls - Method for extracting URL address coordinates in a string
Example:
>>> import alm
>>>
>>> alm.urls("This website: example.com was designed with ...")
{14: 25}
>>>
>>> alm.urls("This website: https://a.b.c.example.net?id=52#test-1 was designed with ...")
{14: 52}
>>>
>>> alm.urls("This website: https://a.b.c.example.net?id=52#test-1 and 127.0.0.1 was designed with ...")
{14: 52, 57: 66}
Methods:
- roman2Arabic - Method for translating Roman numerals to Arabic
Example:
>>> import alm
>>>
>>> alm.roman2Arabic("XVI")
16
Methods:
- rest - Method for correction and detection of words with mixed alphabets
- setSubstitutes - Method for set letters to correct words from mixed alphabets
- getSubstitutes - Method of extracting letters to correct words from mixed alphabets
Example:
>>> import alm
>>>
>>> alm.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>>
>>> alm.setSubstitutes({'p':'р','c':'с','o':'о','t':'т','k':'к','e':'е','a':'а','h':'н','x':'х','b':'в','m':'м'})
>>>
>>> alm.getSubstitutes()
{'a': 'а', 'b': 'в', 'c': 'с', 'e': 'е', 'h': 'н', 'k': 'к', 'm': 'м', 'o': 'о', 'p': 'р', 't': 'т', 'x': 'х'}
>>>
>>> str = "ПPИBETИК"
>>>
>>> str.lower()
'пpиbetик'
>>>
>>> alm.rest(str)
'приветик'
Methods:
- setTokensDisable - Method for set the list of forbidden tokens
- setTokensUnknown - Method for set the list of tokens cast to 〈unk〉
- setTokenDisable - Method for set the list of unidentifiable tokens
- setTokenUnknown - Method of set the list of tokens that need to be identified as 〈unk〉
- getTokensDisable - Method for retrieving the list of forbidden tokens
- getTokensUnknown - Method for extracting a list of tokens reducible to 〈unk〉
- setAllTokenDisable - Method for set all tokens as unidentifiable
- setAllTokenUnknown - The method of set all tokens identified as 〈unk〉
Example:
>>> import alm
>>>
>>> alm.idw("<date>")
6
>>>
>>> alm.idw("<time>")
7
>>>
>>> alm.idw("<abbr>")
5
>>>
>>> alm.idw("<math>")
9
>>>
>>> alm.setTokenDisable("date|time|abbr|math")
>>>
>>> alm.getTokensDisable()
{9, 5, 6, 7}
>>>
>>> alm.setTokensDisable({6, 7, 5, 9})
>>>
>>> alm.setTokenUnknown("date|time|abbr|math")
>>>
>>> alm.getTokensUnknown()
{9, 5, 6, 7}
>>>
>>> alm.setTokensUnknown({6, 7, 5, 9})
>>>
>>> alm.setAllTokenDisable()
>>>
>>> alm.getTokensDisable()
{2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 23}
>>>
>>> alm.setAllTokenUnknown()
>>>
>>> alm.getTokensUnknown()
{2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 23}
Methods:
- countAlphabet - Method of obtaining the number of letters in the dictionary
Example:
>>> import alm
>>>
>>> alm.getAlphabet()
'abcdefghijklmnopqrstuvwxyz'
>>>
>>> alm.countAlphabet()
26
>>>
>>> alm.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>>
>>> alm.countAlphabet()
59
Methods:
- countBigrams - Method get count bigrams
- countTrigrams - Method get count trigrams
- countGrams - Method get count N-gram by lm size
Example:
>>> import alm
>>>
>>> alm.setOption(alm.options_t.confidence)
>>>
>>> alm.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>>
>>> alm.readArpa('./lm.arpa')
>>>
>>> alm.countBigrams("неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор???с лязгом выкатился и остановился возле мальчика....")
12
>>>
>>> alm.countTrigrams("неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор???с лязгом выкатился и остановился возле мальчика....")
10
>>>
>>> alm.size()
3
>>>
>>> alm.countGrams("неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор???с лязгом выкатился и остановился возле мальчика....")
10
>>>
>>> alm.idw("неожиданно")
3263936167
>>>
>>> alm.idw("из")
5134
>>>
>>> alm.idw("подворотни")
12535356101
>>>
>>> alm.idw("в")
53
>>>
>>> alm.idw("Олега")
2824508300
>>>
>>> alm.idw("ударил")
24816796913
>>>
>>> alm.countBigrams([3263936167, 5134, 12535356101, 53, 2824508300, 24816796913])
5
>>>
>>> alm.countTrigrams([3263936167, 5134, 12535356101, 53, 2824508300, 24816796913])
4
>>>
>>> alm.countGrams([3263936167, 5134, 12535356101, 53, 2824508300, 24816796913])
4
Methods:
- arabic2Roman - Convert arabic number to roman number
Example:
>>> import alm
>>>
>>> alm.arabic2Roman(23)
'XXIII'
>>>
>>> alm.arabic2Roman("33")
'XXXIII'
Methods:
- setThreads - Method for set the number of threads (0 - all threads)
Example:
>>> import alm
>>>
>>> alm.setOption(alm.options_t.confidence)
>>>
>>> alm.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>>
>>> alm.readArpa('./lm.arpa')
>>>
>>> alm.setThreads(3)
>>>
>>> a = alm.pplByFiles("./text.txt")
>>>
>>> print(a.logprob)
-48201.29481399994
Methods:
- fti - Method for removing the fractional part of a number
Example:
>>> import alm
>>>
>>> alm.fti(5892.4892)
5892489200000
>>>
>>> alm.fti(5892.4892, 4)
58924892
Methods:
- context - Method for assembling text context from a sequence
Example:
>>> import alm
>>>
>>> alm.setOption(alm.options_t.confidence)
>>>
>>> alm.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>>
>>> alm.readArpa('./lm.arpa')
>>>
>>> alm.idw("неожиданно")
3263936167
>>>
>>> alm.idw("из")
5134
>>>
>>> alm.idw("подворотни")
12535356101
>>>
>>> alm.idw("в")
53
>>>
>>> alm.idw("Олега")
2824508300
>>>
>>> alm.idw("ударил")
24816796913
>>>
>>> alm.context([3263936167, 5134, 12535356101, 53, 2824508300, 24816796913])
'Неожиданно из подворотни в Олега ударил'
Methods:
- isAbbr - Method of checking a word for compliance with an abbreviation
- isSuffix - Method for checking a word for a suffix of a numeric abbreviation
- isToken - Method for checking if an identifier matches a token
- isIdWord - Method for checking if an identifier matches a word
Example:
>>> import alm
>>>
>>> alm.setOption(alm.options_t.confidence)
>>>
>>> alm.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>>
>>> alm.readArpa('./lm.arpa')
>>>
>>> alm.addAbbr("США")
>>>
>>> alm.isAbbr("сша")
True
>>>
>>> alm.addSuffix("1-я")
>>>
>>> alm.isSuffix("1-я")
True
>>>
>>> alm.isToken(alm.idw("США"))
True
>>>
>>> alm.isToken(alm.idw("1-я"))
True
>>>
>>> alm.isToken(alm.idw("125"))
True
>>>
>>> alm.isToken(alm.idw("<s>"))
True
>>>
>>> alm.isToken(alm.idw("Hello"))
False
>>>
>>> alm.isIdWord(alm.idw("https://anyks.com"))
True
>>>
>>> alm.isIdWord(alm.idw("Hello"))
True
>>>
>>> alm.isIdWord(alm.idw("-"))
False
Methods:
- findByFiles - Method search N-grams in a text file
Example:
>>> import alm
>>>
>>> alm.setOption(alm.options_t.debug)
>>>
>>> alm.setOption(alm.options_t.confidence)
>>>
>>> alm.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>>
>>> alm.readArpa('./lm.arpa')
>>>
>>> alm.findByFiles("./text.txt", "./result.txt")
info: <s> Кукай
сари кукай
сари японские
японские каллиграфы
каллиграфы я
я постоянно
постоянно навещал
навещал их
их тайно
тайно от
от людей
людей </s>
info: <s> Неожиданно из
Неожиданно из подворотни
из подворотни в
подворотни в Олега
в Олега ударил
Олега ударил яркий
ударил яркий прожектор
яркий прожектор патрульный
прожектор патрульный трактор
патрульный трактор
<s> С лязгом
С лязгом выкатился
лязгом выкатился и
выкатился и остановился
и остановился возле
остановился возле мальчика
возле мальчика
Methods:
- checkSequence - Sequence Existence Method
- existSequence - Method for checking the existence of a sequence, excluding non-word tokens
- checkByFiles - Method for checking if a sequence exists in a text file
Example:
>>> import alm
>>>
>>> alm.setOption(alm.options_t.debug)
>>>
>>> alm.setOption(alm.options_t.confidence)
>>>
>>> alm.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>>
>>> alm.readArpa('./lm.arpa')
>>>
>>> alm.addAbbr("США")
>>>
>>> alm.isAbbr("сша")
>>>
>>> alm.checkSequence("Неожиданно из подворотни в олега ударил")
True
>>>
>>> alm.checkSequence("Сегодня сыграл и в Олега ударил яркий прожектор патрульный трактор с корпоративным сектором")
True
>>>
>>> alm.checkSequence("Сегодня сыграл и в Олега ударил яркий прожектор патрульный трактор с корпоративным сектором", True)
True
>>>
>>> alm.checkSequence("в Олега ударил яркий")
True
>>>
>>> alm.checkSequence("в Олега ударил яркий", True)
True
>>>
>>> alm.checkSequence("от госсекретаря США")
True
>>>
>>> alm.checkSequence("от госсекретаря США", True)
True
>>>
>>> alm.checkSequence("Неожиданно из подворотни в олега ударил", 2)
True
>>>
>>> alm.checkSequence(["Неожиданно","из","подворотни","в","олега","ударил"], 2)
True
>>>
>>> alm.existSequence("<s> Сегодня сыграл и в, Олега ударил яркий прожектор, патрульный трактор - с корпоративным сектором </s>", 2)
(True, 0)
>>>
>>> alm.existSequence(["<s>","Сегодня","сыграл","и","в",",","Олега","ударил","яркий","прожектор",",","патрульный","трактор","-","с","корпоративным","сектором","</s>"], 2)
(True, 2)
>>>
>>> alm.idw("от")
6086
>>>
>>> alm.idw("госсекретаря")
51273912082
>>>
>>> alm.idw("США")
5
>>>
>>> alm.checkSequence([6086, 51273912082, 5])
True
>>>
>>> alm.checkSequence([6086, 51273912082, 5], True)
True
>>>
>>> alm.checkSequence(["от", "госсекретаря", "США"])
True
>>>
>>> alm.checkSequence(["от", "госсекретаря", "США"], True)
True
>>>
>>> alm.checkByFiles("./text.txt", "./result.txt")
info: 1999 | YES | Какой-то период времени мы вообще не общались
info: 2000 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор.С лязгом выкатился и остановился возле мальчика.
info: 2001 | YES | Так как эти яйца жалко есть а хочется все больше любоваться их можно покрыть лаком даже прозрачным лаком для ногтей
info: 2002 | NO | кукай <unk> <unk> сари кукай <unk> <unk> сари японские каллиграфы я постоянно навещал их тайно от людей
info: 2003 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор???С лязгом выкатился и остановился возле мальчика....
info: 2004 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор?С лязгом выкатился и остановился возле мальчика.
info: 2005 | YES | Сегодня яичницей никто не завтракал как впрочем и вчера на ближайшем к нам рынке мы ели фруктовый салат со свежевыжатым соком как в старые добрые времена в Бразилии
info: 2006 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор!С лязгом выкатился и остановился возле мальчика.
info: 2007 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор.с лязгом выкатился и остановился возле мальчика.
All texts: 2007
Exists texts: 1359
Not exists texts: 648
>>>
>>> alm.checkByFiles("./corpus", "./result.txt", False, "txt")
info: 1999 | YES | Какой-то период времени мы вообще не общались
info: 2000 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор.С лязгом выкатился и остановился возле мальчика.
info: 2001 | YES | Так как эти яйца жалко есть а хочется все больше любоваться их можно покрыть лаком даже прозрачным лаком для ногтей
info: 2002 | NO | кукай <unk> <unk> сари кукай <unk> <unk> сари японские каллиграфы я постоянно навещал их тайно от людей
info: 2003 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор???С лязгом выкатился и остановился возле мальчика....
info: 2004 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор?С лязгом выкатился и остановился возле мальчика.
info: 2005 | YES | Сегодня яичницей никто не завтракал как впрочем и вчера на ближайшем к нам рынке мы ели фруктовый салат со свежевыжатым соком как в старые добрые времена в Бразилии
info: 2006 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор!С лязгом выкатился и остановился возле мальчика.
info: 2007 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор.с лязгом выкатился и остановился возле мальчика.
All texts: 2007
Exists texts: 1359
Not exists texts: 648
>>>
>>> alm.checkByFiles("./corpus", "./result.txt", True, "txt")
info: 2000 | NO | Так как эти яйца жалко есть а хочется все больше любоваться их можно покрыть лаком даже прозрачным лаком для ногтей
info: 2001 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор.С лязгом выкатился и остановился возле мальчика.
info: 2002 | NO | Сегодня яичницей никто не завтракал как впрочем и вчера на ближайшем к нам рынке мы ели фруктовый салат со свежевыжатым соком как в старые добрые времена в Бразилии
info: 2003 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор!С лязгом выкатился и остановился возле мальчика.
info: 2004 | NO | кукай <unk> <unk> сари кукай <unk> <unk> сари японские каллиграфы я постоянно навещал их тайно от людей
info: 2005 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор?С лязгом выкатился и остановился возле мальчика.
info: 2006 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор???С лязгом выкатился и остановился возле мальчика....
info: 2007 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор.с лязгом выкатился и остановился возле мальчика.
All texts: 2007
Exists texts: 0
Not exists texts: 2007
Methods:
- check - String Check Method
- match - String Matching Method
- addAbbr - Method add abbreviation
- addSuffix - Method add number suffix abbreviation
- setSuffixes - Method set number suffix abbreviations
- readSuffix - Method for reading data from a file of suffixes and abbreviations
Example:
>>> import alm
>>>
>>> alm.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>> alm.setSubstitutes({'p':'р','c':'с','o':'о','t':'т','k':'к','e':'е','a':'а','h':'н','x':'х','b':'в','m':'м'})
>>>
>>> alm.check("Дом-2", alm.check_t.home2)
True
>>>
>>> alm.check("Дом2", alm.check_t.home2)
False
>>>
>>> alm.check("Дом-2", alm.check_t.latian)
False
>>>
>>> alm.check("Hello", alm.check_t.latian)
True
>>>
>>> alm.check("прiвет", alm.check_t.latian)
True
>>>
>>> alm.check("Дом-2", alm.check_t.hyphen)
True
>>>
>>> alm.check("Дом2", alm.check_t.hyphen)
False
>>>
>>> alm.check("Д", alm.check_t.letter)
True
>>>
>>> alm.check("$", alm.check_t.letter)
False
>>>
>>> alm.check("-", alm.check_t.letter)
False
>>>
>>> alm.check("просtоквaшино", alm.check_t.similars)
True
>>>
>>> alm.match("my site http://example.ru, it's true", alm.match_t.url)
True
>>>
>>> alm.match("по вашему ip адресу 46.40.123.12 проводится проверка", alm.match_t.url)
True
>>>
>>> alm.match("мой адрес в формате IPv6: http://[2001:0db8:11a3:09d7:1f34:8a2e:07a0:765d]/", alm.match_t.url)
True
>>>
>>> alm.match("13-я", alm.match_t.abbr)
True
>>>
alm.match("13-я-й", alm.match_t.abbr)
False
>>>
alm.match("т.д", alm.match_t.abbr)
True
>>>
alm.match("т.п.", alm.match_t.abbr)
True
>>>
>>> alm.match("С.Ш.А.", alm.match_t.abbr)
True
>>>
>>> alm.addAbbr("сша")
>>> alm.match("США", alm.match_t.abbr)
True
>>>
>>> alm.addSuffix("15-летия")
>>> alm.match("15-летия", alm.match_t.abbr)
True
>>>
>>> alm.getSuffixes()
{3139900457}
>>>
>>> alm.idw("лет")
328041
>>>
>>> alm.idw("тых")
352214
>>>
>>> alm.setSuffixes({328041, 352214})
>>>
>>> alm.getSuffixes()
{328041, 352214}
>>>
>>> def status(status):
... print(status)
...
>>> alm.readSuffix("./suffix.abbr", status)
>>>
>>> alm.match("15-лет", alm.match_t.abbr)
True
>>>
>>> alm.match("20-тых", alm.match_t.abbr)
True
>>>
>>> alm.match("15-летия", alm.match_t.abbr)
False
>>>
>>> alm.match("Hello", alm.match_t.latian)
True
>>>
>>> alm.match("прiвет", alm.match_t.latian)
False
>>>
>>> alm.match("23424", alm.match_t.number)
True
>>>
>>> alm.match("hello", alm.match_t.number)
False
>>>
>>> alm.match("23424.55", alm.match_t.number)
False
>>>
>>> alm.match("23424", alm.match_t.decimal)
False
>>>
>>> alm.match("23424.55", alm.match_t.decimal)
True
>>>
>>> alm.match("23424,55", alm.match_t.decimal)
True
>>>
>>> alm.match("-23424.55", alm.match_t.decimal)
True
>>>
>>> alm.match("+23424.55", alm.match_t.decimal)
True
>>>
>>> alm.match("+23424.55", alm.match_t.anumber)
True
>>>
>>> alm.match("15T-34", alm.match_t.anumber)
True
>>>
>>> alm.match("hello", alm.match_t.anumber)
False
>>>
>>> alm.match("hello", alm.match_t.allowed)
True
>>>
>>> alm.match("évaluer", alm.match_t.allowed)
False
>>>
>>> alm.match("13", alm.match_t.allowed)
True
>>>
>>> alm.match("Hello-World", alm.match_t.allowed)
True
>>>
>>> alm.match("Hello", alm.match_t.math)
False
>>>
>>> alm.match("+", alm.match_t.math)
True
>>>
>>> alm.match("=", alm.match_t.math)
True
>>>
>>> alm.match("Hello", alm.match_t.upper)
True
>>>
>>> alm.match("hello", alm.match_t.upper)
False
>>>
>>> alm.match("hellO", alm.match_t.upper)
False
>>>
>>> alm.match("a", alm.match_t.punct)
False
>>>
>>> alm.match(",", alm.match_t.punct)
True
>>>
>>> alm.match(" ", alm.match_t.space)
True
>>>
>>> alm.match("a", alm.match_t.space)
False
>>>
>>> alm.match("a", alm.match_t.special)
False
>>>
>>> alm.match("±", alm.match_t.special)
False
>>>
>>> alm.match("[", alm.match_t.isolation)
True
>>>
>>> alm.match("a", alm.match_t.isolation)
False
>>>
>>> alm.match("a", alm.match_t.greek)
False
>>>
>>> alm.match("Ψ", alm.match_t.greek)
True
>>>
>>> alm.match("->", alm.match_t.route)
False
>>>
>>> alm.match("⇔", alm.match_t.route)
True
>>>
>>> alm.match("a", alm.match_t.letter)
True
>>>
>>> alm.match("!", alm.match_t.letter)
False
>>>
>>> alm.match("!", alm.match_t.pcards)
False
>>>
>>> alm.match("♣", alm.match_t.pcards)
True
>>>
>>> alm.match("p", alm.match_t.currency)
False
>>>
>>> alm.match("$", alm.match_t.currency)
True
>>>
>>> alm.match("€", alm.match_t.currency)
True
>>>
>>> alm.match("₽", alm.match_t.currency)
True
>>>
>>> alm.match("₿", alm.match_t.currency)
True
Methods:
- delInText - Method for delete letter in text
Example:
>>> import alm
>>>
>>> alm.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>>
>>> alm.delInText("неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор??? с лязгом выкатился и остановился возле мальчика....", alm.wdel_t.punct)
'неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор с лязгом выкатился и остановился возле мальчика'
>>>
>>> alm.delInText("hello-world-hello-world", alm.wdel_t.hyphen)
'helloworldhelloworld'
>>>
>>> alm.delInText("неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор??? с лязгом выкатился и остановился возле мальчика....", alm.wdel_t.broken)
'неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор с лязгом выкатился и остановился возле мальчика'
>>>
>>> alm.delInText("«On nous dit qu'aujourd'hui c'est le cas, encore faudra-t-il l'évaluer» l'astronomie", alm.wdel_t.broken)
"On nous dit qu'aujourd'hui c'est le cas encore faudra-t-il l'valuer l'astronomie"
Methods:
- countsByFiles - Method for counting the number of n-grams in a text file
Example:
>>> import alm
>>>
>>> alm.setOption(alm.options_t.debug)
>>>
>>> alm.setOption(alm.options_t.confidence)
>>>
>>> alm.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>>
>>> alm.readArpa('./lm.arpa')
>>>
>>> alm.countsByFiles("./text.txt", "./result.txt", 3)
info: 0 | Сегодня яичницей никто не завтракал как впрочем и вчера на ближайшем к нам рынке мы ели фруктовый салат со свежевыжатым соком как в старые добрые времена в Бразилии
info: 10 | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор?С лязгом выкатился и остановился возле мальчика.
info: 10 | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор!С лязгом выкатился и остановился возле мальчика.
info: 0 | Так как эти яйца жалко есть а хочется все больше любоваться их можно покрыть лаком даже прозрачным лаком для ногтей
info: 10 | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор???С лязгом выкатился и остановился возле мальчика....
Counts 3grams: 471
>>>
>>> alm.countsByFiles("./corpus", "./result.txt", 2, "txt")
info: 19 | Так как эти яйца жалко есть а хочется все больше любоваться их можно покрыть лаком даже прозрачным лаком для ногтей
info: 12 | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор.с лязгом выкатился и остановился возле мальчика.
info: 12 | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор!С лязгом выкатился и остановился возле мальчика.
info: 10 | кукай <unk> <unk> сари кукай <unk> <unk> сари японские каллиграфы я постоянно навещал их тайно от людей
info: 12 | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор???С лязгом выкатился и остановился возле мальчика....
info: 12 | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор?С лязгом выкатился и остановился возле мальчика.
info: 27 | Сегодня яичницей никто не завтракал как впрочем и вчера на ближайшем к нам рынке мы ели фруктовый салат со свежевыжатым соком как в старые добрые времена в Бразилии
Counts 2grams: 20270
Description
N-gram size |
Description |
1 |
language model size |
2 |
bigram |
3 |
trigram |