ANYKS Spell-checker (ASC)
Project description
There are a lot of typo and text error correction systems out there. Each one of those systems has its pros and cons, and each system has the right to live and will find its own user base. I would like to present my own version of the typo correction system with its own unique features.
List of features
- Correction of mistakes in words with a Levenshtein distance of up to 4;
- Correction of different types of typos in words: insertion, deletion, substitution, rearrangement of character;
- Ё-fication of a word given the context (letter 'ё' is commonly replaced by letter 'е' in russian typed text);
- Context-based word capitalization for proper names and titles;
- Context-based splitting for words that are missing the separating space character;
- Text analysis without correcting the original text;
- Searching the text for errors, typos, incorrect context.
Requirements
Install PyBind11
$ python3 -m pip install pybind11
Ready-to-use dictionaries
Testing
To test the system, we used data from the 2016 "spelling correction" competition organized by Dialog21.
The trained binary dictionary that was used for testing: wittenbell-3-middle.asc
Mode |
Precision |
Recall |
FMeasure |
Typo correction |
76.97 |
62.71 |
69.11 |
Error correction |
73.72 |
60.53 |
66.48 |
I think it is unnecessary to add any other data. Anyone can repeat the test if they wish (all files used for testing are attached below).
Files used for testing
Description of Methods
Methods:
- idw - Word ID retrieval method
- idt - Token ID retrieval method
- ids - Sequence ID retrieval method
Example:
>>> import asc
>>>
>>> asc.idw("hello")
313191024
>>>
>>> asc.idw("<s>")
1
>>>
>>> asc.idw("</s>")
22
>>>
>>> asc.idw("<unk>")
3
>>>
>>> asc.idt("1424")
2
>>>
>>> asc.idt("hello")
0
>>>
>>> asc.idw("Living")
13268942501
>>>
>>> asc.idw("in")
2047
>>>
>>> asc.idw("the")
83201
>>>
>>> asc.idw("USA")
72549
>>>
>>> asc.ids([13268942501, 2047, 83201, 72549])
16314074810955466382
Description
Name |
Description |
〈s〉 |
Sentence beginning token |
〈/s〉 |
Sentence end token |
〈url〉 |
URL-address token |
〈num〉 |
Number (arabic or roman) token |
〈unk〉 |
Unknown word token |
〈time〉 |
Time token (15:44:56) |
〈score〉 |
Score count token (4:3 ¦ 01:04) |
〈fract〉 |
Fraction token (5/20 ¦ 192/864) |
〈date〉 |
Date token (18.07.2004 ¦ 07/18/2004) |
〈abbr〉 |
Abbreviation token (1-й ¦ 2-е ¦ 20-я ¦ p.s ¦ p.s.) |
〈dimen〉 |
Dimensions token (200x300 ¦ 1920x1080) |
〈range〉 |
Range of numbers token (1-2 ¦ 100-200 ¦ 300-400) |
〈aprox〉 |
Approximate number token (~93 ¦ 95.86 ¦ 1020) |
〈anum〉 |
Pseudo-number token (combination of numbers and other symbols) (T34 ¦ 895-M-86 ¦ 39km) |
〈pcards〉 |
Symbols of the play cards (♠ ¦ ♣ ¦ ♥ ¦ ♦ ) |
〈punct〉 |
Punctuation token (. ¦ , ¦ ? ¦ ! ¦ : ¦ ; ¦ … ¦ ¡ ¦ ¿) |
〈route〉 |
Direction symbols (arrows) (← ¦ ↑ ¦ ↓ ¦ ↔ ¦ ↵ ¦ ⇐ ¦ ⇑ ¦ ⇒ ¦ ⇓ ¦ ⇔ ¦ ◄ ¦ ▲ ¦ ► ¦ ▼) |
〈greek〉 |
Symbols of the Greek alphabet (Α ¦ Β ¦ Γ ¦ Δ ¦ Ε ¦ Ζ ¦ Η ¦ Θ ¦ Ι ¦ Κ ¦ Λ ¦ Μ ¦ Ν ¦ Ξ ¦ Ο ¦ Π ¦ Ρ ¦ Σ ¦ Τ ¦ Υ ¦ Φ ¦ Χ ¦ Ψ ¦ Ω) |
〈isolat〉 |
Isolation/quotation token (( ¦ ) ¦ [ ¦ ] ¦ { ¦ } ¦ " ¦ « ¦ » ¦ „ ¦ “ ¦ ` ¦ ⌈ ¦ ⌉ ¦ ⌊ ¦ ⌋ ¦ ‹ ¦ › ¦ ‚ ¦ ’ ¦ ′ ¦ ‛ ¦ ″ ¦ ‘ ¦ ” ¦ ‟ ¦ ' ¦〈 ¦ 〉) |
〈specl〉 |
Special character token (_ ¦ @ ¦ # ¦ № ¦ © ¦ ® ¦ & ¦ § ¦ æ ¦ ø ¦ Þ ¦ – ¦ ‾ ¦ ‑ ¦ — ¦ ¯ ¦ ¶ ¦ ˆ ¦ ˜ ¦ † ¦ ‡ ¦ • ¦ ‰ ¦ ⁄ ¦ ℑ ¦ ℘ ¦ ℜ ¦ ℵ ¦ ◊ ¦ \ ) |
〈currency〉 |
Symbols of world currencies ($ ¦ € ¦ ₽ ¦ ¢ ¦ £ ¦ ₤ ¦ ¤ ¦ ¥ ¦ ℳ ¦ ₣ ¦ ₴ ¦ ₸ ¦ ₹ ¦ ₩ ¦ ₦ ¦ ₭ ¦ ₪ ¦ ৳ ¦ ƒ ¦ ₨ ¦ ฿ ¦ ₫ ¦ ៛ ¦ ₮ ¦ ₱ ¦ ﷼ ¦ ₡ ¦ ₲ ¦ ؋ ¦ ₵ ¦ ₺ ¦ ₼ ¦ ₾ ¦ ₠ ¦ ₧ ¦ ₯ ¦ ₢ ¦ ₳ ¦ ₥ ¦ ₰ ¦ ₿ ¦ ұ) |
〈math〉 |
Mathematical operation token (+ ¦ - ¦ = ¦ / ¦ * ¦ ^ ¦ × ¦ ÷ ¦ − ¦ ∕ ¦ ∖ ¦ ∗ ¦ √ ¦ ∝ ¦ ∞ ¦ ∠ ¦ ± ¦ ¹ ¦ ² ¦ ³ ¦ ½ ¦ ⅓ ¦ ¼ ¦ ¾ ¦ % ¦ ~ ¦ · ¦ ⋅ ¦ ° ¦ º ¦ ¬ ¦ ƒ ¦ ∀ ¦ ∂ ¦ ∃ ¦ ∅ ¦ ∇ ¦ ∈ ¦ ∉ ¦ ∋ ¦ ∏ ¦ ∑ ¦ ∧ ¦ ∨ ¦ ∩ ¦ ∪ ¦ ∫ ¦ ∴ ¦ ∼ ¦ ≅ ¦ ≈ ¦ ≠ ¦ ≡ ¦ ≤ ¦ ≥ ¦ ª ¦ ⊂ ¦ ⊃ ¦ ⊄ ¦ ⊆ ¦ ⊇ ¦ ⊕ ¦ ⊗ ¦ ⊥ ¦ ¨) |
Methods:
- setZone - User zone set method
Example:
>>> import asc
>>>
>>> asc.setZone("com")
>>> asc.setZone("ru")
>>> asc.setZone("org")
>>> asc.setZone("net")
Methods:
- clear - Method clear all data
- setAlphabet - Method set alphabet
- getAlphabet - Method get alphabet
Example:
>>> import asc
>>>
>>> asc.getAlphabet()
'abcdefghijklmnopqrstuvwxyz'
>>>
>>> asc.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>>
>>> asc.getAlphabet()
'abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя'
>>>
>>> asc.clear()
>>>
>>> asc.getAlphabet()
'abcdefghijklmnopqrstuvwxyz'
Methods:
- setUnknown - Method set unknown word
- getUnknown - Method extraction unknown word
Example:
>>> import asc
>>>
>>> asc.setUnknown("word")
>>>
>>> asc.getUnknown()
'word'
Methods:
- infoIndex - Method for print information about the dictionary
- token - Method for determining the type of the token words
- addText - Method of adding text for estimate
- collectCorpus - Training method of assembling the text data for ASC [curpus = filename or dir, smoothing = wittenBell, modified = False, prepares = False, mod = 0.0, status = Null]
- pruneVocab - Dictionary pruning method
- buildArpa - Method for build ARPA
- writeWords - Method for writing these words to a file
- writeVocab - Method for writing dictionary data to a file
- writeNgrams - Method of writing data to NGRAMs files
- writeMap - Method of writing sequence map to file
- writeSuffix - Method for writing data to a suffix file for digital abbreviations
- writeAbbrs - Method for writing data to an abbreviation file
- getSuffixes - Method for extracting the list of suffixes of digital abbreviations
- writeArpa - Method of writing data to ARPA file
- setThreads - Method for setting the number of threads used in work (0 - all available threads)
- setStemmingMethod - Method for setting external stemming function
- loadIndex - Binary index loading method
- spell - Method for performing spell-checker
- analyze - Method for analyze text
- addAlt - Method for add a word/letter with an alternative letter
- setAlphabet - Method for set Alphabet
- setPilots - Method for set pilot words
- setSubstitutes - Method for set letters to correct words from mixed alphabets
- addAbbr - Method add abbreviation
- setAbbrs - Method set abbreviations
- getAbbrs - Method for extracting the list of abbreviations
- addGoodword - Method add good word
- addBadword - Method add bad word
- addUWord - Method for add a word that always starts with a capital letter
- setUWords - Method for add a list of identifiers for words that always start with a capital letter
- readArpa - Method for reading an ARPA file, language model
- readVocab - Method of reading the dictionary
- setEmbedding - Method for set embedding
- buildIndex - Method for build spell-checker index
- setAdCw - Method for set dictionary characteristics (cw - count all words in dataset, ad - count all documents in dataset)
- setCode - Method for set code language
- addLemma - Method for add a Lemma to the dictionary
- setNSWLibCount - Method for set the maximum number of options for analysis
Example:
>>> import asc
>>>
>>> asc.infoIndex("./wittenbell-3-single.asc")
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* Code: RU
* Version: 1.0.0
* Dictionary name: Russian - single
* Locale: en_US.UTF-8
* Alphabet: абвгдеёжзийклмнопрстуфхцчшщъыьэюяabcdefghijklmnopqrstuvwxyz
* Build date: 09/08/2020 15:39:31
* Encrypted: NO
* ALM type: ALMv1
* Allow apostrophe: NO
* Count words: 106912195
* Count documents: 263998
* Only good words: NO
* Mix words in dicts: YES
* Confidence arpa: YES
* Count upper words: 841915
* Count pilots words: 15
* Count bad words: 108790
* Count good words: 124
* Count substitutes: 14
* Count abbreviations: 16532
* Alternatives: е => ё
* Count alternatives words: 58138
* Size embedding: 28
* Length n-gram: 3
* Count n-grams: 6710202
* Author: Yuriy Lobarev
* Contacts: site: https://anyks.com, e-mail: forman@anyks.com
* Copyright ©: Yuriy Lobarev
* License type: GPLv3
* License text:
The GNU General Public License is a free, copyleft license for software and other kinds of works.
The licenses for most software and other practical works are designed to take away your freedom to share and change the works. By contrast, the GNU General Public License is intended to guarantee your freedom to share and change all versions of a program--to make sure it remains free software for all its users. We, the Free Software Foundation, use the GNU General Public License for most of our software; it applies also to any other work released this way by its authors. You can apply it to your programs, too.
When we speak of free software, we are referring to freedom, not price. Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for them if you wish), that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs, and that you know you can do these things.
To protect your rights, we need to prevent others from denying you these rights or asking you to surrender the rights. Therefore, you have certain responsibilities if you distribute copies of the software, or if you modify it: responsibilities to respect the freedom of others.
For example, if you distribute copies of such a program, whether gratis or for a fee, you must pass on to the recipients the same freedoms that you received. You must make sure that they, too, receive or can get the source code. And you must show them these terms so they know their rights.
Developers that use the GNU GPL protect your rights with two steps: (1) assert copyright on the software, and (2) offer you this License giving you legal permission to copy, distribute and/or modify it.
For the developers' and authors' protection, the GPL clearly explains that there is no warranty for this free software. For both users' and authors' sake, the GPL requires that modified versions be marked as changed, so that their problems will not be attributed erroneously to authors of previous versions.
Some devices are designed to deny users access to install or run modified versions of the software inside them, although the manufacturer can do so. This is fundamentally incompatible with the aim of protecting users' freedom to change the software. The systematic pattern of such abuse occurs in the area of products for individuals to use, which is precisely where it is most unacceptable. Therefore, we have designed this version of the GPL to prohibit the practice for those products. If such problems arise substantially in other domains, we stand ready to extend this provision to those domains in future versions of the GPL, as needed to protect the freedom of users.
Finally, every program is threatened constantly by software patents. States should not allow patents to restrict development and use of software on general-purpose computers, but in those that do, we wish to avoid the special danger that patents applied to a free program could make it effectively proprietary. To prevent this, the GPL assures that patents cannot be used to render the program non-free.
The precise terms and conditions for copying, distribution and modification follow.
URL: https://www.gnu.org/licenses/gpl-3.0.ru.html
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
Example:
>>> import asc
>>> import spacy
>>> import pymorphy2
>>>
>>> asc.setThreads(0)
>>> asc.setOption(asc.options_t.ascSplit)
>>> asc.setOption(asc.options_t.ascAlter)
>>> asc.setOption(asc.options_t.ascESplit)
>>> asc.setOption(asc.options_t.ascRSplit)
>>> asc.setOption(asc.options_t.ascUppers)
>>> asc.setOption(asc.options_t.ascHyphen)
>>> asc.setOption(asc.options_t.ascWordRep)
>>> asc.setOption(asc.options_t.mixDicts)
>>> asc.setOption(asc.options_t.confidence)
>>> asc.setOption(asc.options_t.stemming)
>>>
>>> morphRu = pymorphy2.MorphAnalyzer()
>>> morphEn = spacy.load('en', disable=['parser', 'ner'])
>>>
>>> def status(text, status):
... print(text, status)
...
>>>
>>> def eng(word):
... global morphEn
... words = morphEn(word)
... word = ''.join([token.lemma_ for token in words]).strip()
... if word[0] != '-' and word[len(word) - 1] != '-':
... return word
... else:
... return ""
...
>>>
>>> def rus(word):
... global morphRu
... if morphRu != None:
... word = morphRu.parse(word)[0].normal_form
... return word
... else:
... return ""
...
>>>
>>> def run(word, lang):
... if lang == "ru":
... return rus(word.lower())
... elif lang == "en":
... return eng(word.lower())
...
>>>
>>> asc.setStemmingMethod(run)
>>>
>>> asc.loadIndex("./wittenbell-3-single.asc", "", status)
Loading dictionary 1
Loading dictionary 2
Loading dictionary 3
Loading dictionary 4
Loading dictionary 5
Loading dictionary 6
Loading dictionary 7
Loading dictionary 8
...
Loading Bloom filter 100
Loading stemming 0
Loading stemming 1
Loading stemming 2
Loading stemming 3
...
Loading language model 6
Loading language model 12
Loading language model 18
Loading language model 25
Loading language model 31
Loading language model 37
...
Loading alternative words 1
Loading alternative words 2
Loading alternative words 3
Loading alternative words 4
Loading alternative words 5
Loading alternative words 6
Loading alternative words 7
...
Loading substitutes letters 7
Loading substitutes letters 14
Loading substitutes letters 21
Loading substitutes letters 28
Loading substitutes letters 35
Loading substitutes letters 42
...
>>>
>>> res = asc.spell("начальнег зажог павзрослому", True)
>>> res
('начальник зажёг по-взрослому', [('начальнег', 'начальник'), ('зажог', 'зажёг'), ('павзрослому', 'по-взрослому')])
>>>
>>> res = asc.analyze("слзы теут на мрозе")
>>> res
[('теут', ['текут']), ('мрозе', ['мозг', 'мороз', 'морозе', 'моё']), ('слзы', ['слезы', 'слёзы'])]
Example:
>>> import asc
>>>
>>> asc.setThreads(0)
>>> asc.setOption(asc.options_t.ascSplit)
>>> asc.setOption(asc.options_t.ascAlter)
>>> asc.setOption(asc.options_t.ascESplit)
>>> asc.setOption(asc.options_t.ascRSplit)
>>> asc.setOption(asc.options_t.ascUppers)
>>> asc.setOption(asc.options_t.ascHyphen)
>>> asc.setOption(asc.options_t.ascWordRep)
>>> asc.setOption(asc.options_t.mixDicts)
>>> asc.setOption(asc.options_t.confidence)
>>>
>>> asc.addAlt("е", "ё")
>>> asc.addAlt("ежик", "ёжик")
>>> asc.addAlt("Легкий", "Лёгкий")
...
>>> asc.setAlphabet("абвгдеёжзийклмнопрстуфхцчшщъыьэюяabcdefghijklmnopqrstuvwxyz")
>>>
>>> asc.setPilots(["а","у","в","о","с","к","б","и","я","э","a","i","o","e","g"])
>>> asc.setSubstitutes({'p':'р','c':'с','o':'о','t':'т','k':'к','e':'е','a':'а','h':'н','x':'х','b':'в','m':'м'})
>>>
>>> asc.idw("Сбербанк")
13236490857
asc.idw("Совкомбанк")
22287680895
>>>
>>> asc.token("Сбербанк")
'<word>'
>>> asc.token("совкомбанк")
'<word>'
>>>
>>> asc.setAbbrs({13236490857, 22287680895})
>>>
>>> asc.addAbbr("США")
>>> asc.addAbbr("Сбер")
>>>
>>> asc.token("Сбербанк")
'<abbr>'
>>> asc.token("совкомбанк")
'<abbr>'
>>> asc.token("сша")
'<abbr>'
>>> asc.token("СБЕР")
'<abbr>'
...
>>> asc.getAbbrs()
{13236490857, 189243, 22287680895, 26938511}
>>>
>>> asc.addGoodword("T-34")
>>> asc.addGoodword("АН-25")
...
>>> asc.addBadword("ийти")
>>> asc.addBadword("циган")
>>> asc.addBadword("апичатка")
...
>>> asc.addUWord("Москва")
>>> asc.addUWord("Санкт-Петербург")
...
>>> def statusArpa(status):
... print("Read arpa", status)
...
>>> def statusVocab(status):
... print("Read vocab", status)
...
>>> def statusIndex(status):
... print("Build index", status)
...
>>> asc.readArpa("./words.arpa", statusArpa)
Read arpa 0
Read arpa 1
Read arpa 2
Read arpa 3
Read arpa 4
Read arpa 5
Read arpa 6
Read arpa 7
Read arpa 8
...
>>> asc.readVocab("./words.vocab", statusVocab)
Read vocab 0
Read vocab 1
Read vocab 2
Read vocab 3
Read vocab 4
Read vocab 5
Read vocab 6
...
>>> asc.setEmbedding({
... "а": 0, "б": 1, "в": 2, "г": 3, "д": 4, "е": 5,
... "ё": 5, "ж": 6, "з": 7, "и": 8, "й": 8, "к": 9,
... "л": 10, "м": 11, "н": 12, "о": 0, "п": 13, "р": 14,
... "с": 15, "т": 16, "у": 17, "ф": 18, "х": 19, "ц": 20,
... "ч": 21, "ш": 21, "щ": 21, "ъ": 22, "ы": 23, "ь": 22,
... "э": 5, "ю": 24, "я": 25, "<": 26, ">": 26, "~": 26,
... "-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
... "%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
... "\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
... "5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
... "b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
... "h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
... "n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
... "t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
... }, 28)
>>>
>>> asc.buildIndex(statusIndex)
Build index 0
Build index 1
Build index 2
Build index 3
Build index 4
...
>>> res = asc.spell("начальнег зажог павзрослому", True)
>>> res
('начальник зажег по-взрослому', [('начальнег', 'начальник'), ('зажог', 'зажег'), ('павзрослому', 'по-взрослому')])
>>>
>>> res = asc.analyze("слзы теут на мрозе")
>>> res
[('теут', ['текут']), ('мрозе', ['мозг', 'мороз', 'морозе', 'моё']), ('слзы', ['слезы', 'слёзы'])]
Example:
>>> import asc
>>>
>>> asc.setThreads(0)
>>> asc.setOption(asc.options_t.ascSplit)
>>> asc.setOption(asc.options_t.ascAlter)
>>> asc.setOption(asc.options_t.ascESplit)
>>> asc.setOption(asc.options_t.ascRSplit)
>>> asc.setOption(asc.options_t.ascUppers)
>>> asc.setOption(asc.options_t.ascHyphen)
>>> asc.setOption(asc.options_t.ascWordRep)
>>> asc.setOption(asc.options_t.mixDicts)
>>> asc.setOption(asc.options_t.confidence)
>>>
>>> asc.addAlt("е", "ё")
>>> asc.addAlt("ежик", "ёжик")
>>> asc.addAlt("зажег", "зажёг")
>>> asc.addAlt("Легкий", "Лёгкий")
...
>>> asc.setAlphabet("абвгдеёжзийклмнопрстуфхцчшщъыьэюяabcdefghijklmnopqrstuvwxyz")
>>>
>>> asc.setPilots(["а","у","в","о","с","к","б","и","я","э","a","i","o","e","g"])
>>> asc.setSubstitutes({'p':'р','c':'с','o':'о','t':'т','k':'к','e':'е','a':'а','h':'н','x':'х','b':'в','m':'м'})
>>>
>>> asc.addAbbr("США")
>>> asc.addAbbr("Сбер")
...
>>> asc.addGoodword("T-34")
>>> asc.addGoodword("АН-25")
...
>>> asc.addBadword("ийти")
>>> asc.addBadword("циган")
>>> asc.addBadword("апичатка")
...
>>> asc.idw("Москва")
50387419219
>>> asc.idw("Санкт-Петербург")
68256898625
>>>
>>> asc.setUWords({50387419219: 1, 68256898625: 1})
>>>
...
>>> def statusArpa(status):
... print("Read arpa", status)
...
>>> def statusIndex(status):
... print("Build index", status)
...
>>> asc.readArpa("./words.arpa", statusArpa)
Read arpa 0
Read arpa 1
Read arpa 2
Read arpa 3
Read arpa 4
Read arpa 5
Read arpa 6
Read arpa 7
Read arpa 8
...
>>> asc.setAdCw(38120, 13)
>>>
>>> asc.setEmbedding({
... "а": 0, "б": 1, "в": 2, "г": 3, "д": 4, "е": 5,
... "ё": 5, "ж": 6, "з": 7, "и": 8, "й": 8, "к": 9,
... "л": 10, "м": 11, "н": 12, "о": 0, "п": 13, "р": 14,
... "с": 15, "т": 16, "у": 17, "ф": 18, "х": 19, "ц": 20,
... "ч": 21, "ш": 21, "щ": 21, "ъ": 22, "ы": 23, "ь": 22,
... "э": 5, "ю": 24, "я": 25, "<": 26, ">": 26, "~": 26,
... "-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
... "%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
... "\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
... "5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
... "b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
... "h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
... "n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
... "t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
... }, 28)
>>>
>>> asc.buildIndex(statusIndex)
Build index 0
Build index 1
Build index 2
Build index 3
Build index 4
...
>>> res = asc.spell("начальнег зажог павзрослому", True)
>>> res
('начальник зажёг по-взрослому', [('начальнег', 'начальник'), ('зажог', 'зажёг'), ('павзрослому', 'по-взрослому')])
>>>
>>> res = asc.analyze("слзы теут на мрозе")
>>> res
[('теут', ['текут']), ('мрозе', ['мозг', 'мороз', 'морозе', 'моё']), ('слзы', ['слезы', 'слёзы'])]
Example:
>>> import asc
>>> import spacy
>>> import pymorphy2
>>>
>>> asc.setThreads(0)
>>> asc.setOption(asc.options_t.ascSplit)
>>> asc.setOption(asc.options_t.ascAlter)
>>> asc.setOption(asc.options_t.ascESplit)
>>> asc.setOption(asc.options_t.ascRSplit)
>>> asc.setOption(asc.options_t.ascUppers)
>>> asc.setOption(asc.options_t.ascHyphen)
>>> asc.setOption(asc.options_t.ascWordRep)
>>> asc.setOption(asc.options_t.mixDicts)
>>> asc.setOption(asc.options_t.confidence)
>>> asc.setOption(asc.options_t.stemming)
>>>
>>> asc.addAlt("е", "ё")
>>> asc.addAlt("ежик", "ёжик")
>>> asc.addAlt("зажег", "зажёг")
>>> asc.addAlt("Легкий", "Лёгкий")
...
>>> asc.setAlphabet("абвгдеёжзийклмнопрстуфхцчшщъыьэюяabcdefghijklmnopqrstuvwxyz")
>>>
>>> asc.setPilots(["а","у","в","о","с","к","б","и","я","э","a","i","o","e","g"])
>>> asc.setSubstitutes({'p':'р','c':'с','o':'о','t':'т','k':'к','e':'е','a':'а','h':'н','x':'х','b':'в','m':'м'})
>>>
>>> asc.addAbbr("США")
>>> asc.addAbbr("Сбер")
...
>>> asc.addGoodword("T-34")
>>> asc.addGoodword("АН-25")
...
>>> asc.addBadword("ийти")
>>> asc.addBadword("циган")
>>> asc.addBadword("апичатка")
...
>>> asc.addUWord("Москва")
>>> asc.addUWord("Санкт-Петербург")
...
>>> morphRu = pymorphy2.MorphAnalyzer()
>>> morphEn = spacy.load('en', disable=['parser', 'ner'])
>>>
>>> def statusArpa(status):
... print("Read arpa", status)
...
>>> def statusIndex(status):
... print("Build index", status)
...
>>> def statusStemming(status):
... print("Build stemming", status)
...
>>> def eng(word):
... global morphEn
... words = morphEn(word)
... word = ''.join([token.lemma_ for token in words]).strip()
... if word[0] != '-' and word[len(word) - 1] != '-':
... return word
... else:
... return ""
...
>>> def rus(word):
... global morphRu
... if morphRu != None:
... word = morphRu.parse(word)[0].normal_form
... return word
... else:
... return ""
...
>>> def run(word, lang):
... if lang == "ru":
... return rus(word.lower())
... elif lang == "en":
... return eng(word.lower())
...
>>> asc.readArpa("./words.arpa", statusArpa)
Read arpa 0
Read arpa 1
Read arpa 2
Read arpa 3
Read arpa 4
Read arpa 5
Read arpa 6
Read arpa 7
Read arpa 8
...
>>> asc.setAdCw(38120, 13)
>>>
>>> asc.setEmbedding({
... "а": 0, "б": 1, "в": 2, "г": 3, "д": 4, "е": 5,
... "ё": 5, "ж": 6, "з": 7, "и": 8, "й": 8, "к": 9,
... "л": 10, "м": 11, "н": 12, "о": 0, "п": 13, "р": 14,
... "с": 15, "т": 16, "у": 17, "ф": 18, "х": 19, "ц": 20,
... "ч": 21, "ш": 21, "щ": 21, "ъ": 22, "ы": 23, "ь": 22,
... "э": 5, "ю": 24, "я": 25, "<": 26, ">": 26, "~": 26,
... "-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
... "%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
... "\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
... "5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
... "b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
... "h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
... "n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
... "t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
... }, 28)
>>>
>>> asc.setCode("ru")
>>>
>>> asc.buildIndex(statusIndex)
Build index 0
Build index 1
Build index 2
Build index 3
Build index 4
...
>>> asc.setStemmingMethod(run)
>>>
>>> asc.buildStemming(statusStemming)
Build stemming 0
Build stemming 1
Build stemming 2
Build stemming 3
Build stemming 4
Build stemming 5
...
>>> asc.addLemma("говорил")
>>> asc.addLemma("ходить")
...
>>> asc.setNSWLibCount(50000)
>>>
>>> res = asc.spell("начальнег зажог павзрослому", True)
>>> res
('начальник зажёг по-взрослому', [('начальнег', 'начальник'), ('зажог', 'зажёг'), ('павзрослому', 'по-взрослому')])
>>>
>>> res = asc.analyze("слзы теут на мрозе")
>>> res
[('теут', ['текут']), ('мрозе', ['мозг', 'мороз', 'морозе', 'моё']), ('слзы', ['слезы', 'слёзы'])]
Methods:
- setOption - Library options setting method
- unsetOption - Disable module option method
Example:
>>> import asc
>>>
>>> asc.unsetOption(asc.options_t.debug)
>>> asc.unsetOption(asc.options_t.mixDicts)
>>> asc.unsetOption(asc.options_t.onlyGood)
>>> asc.unsetOption(asc.options_t.confidence)
...
Description
Options |
Description |
debug |
Flag debug mode |
bloom |
Flag allowed to use Bloom filter to check words |
uppers |
Flag that allows you to correct the case of letters |
stemming |
Flag for stemming activation |
onlyGood |
Flag allowing to consider words from the white list only |
mixDicts |
Flag allowing the use of words consisting of mixed dictionaries |
allowUnk |
Flag allowing to unknown word |
resetUnk |
Flag to reset the frequency of an unknown word |
allGrams |
Flag allowing accounting of all collected n-grams |
onlyTypos |
Flag to only correct typos |
lowerCase |
Flag allowing to case-insensitive |
confidence |
Flag arpa file loading without pre-processing the words |
tokenWords |
Flag that takes into account when assembling N-grams, only those tokens that match words |
interpolate |
Flag allowing to use interpolation in estimating |
ascSplit |
Flag to allow splitting of merged words |
ascAlter |
Flag that allows you to replace alternative letters in words |
ascESplit |
Flag to allow splitting of misspelled concatenated words |
ascRSplit |
Flag that allows you to combine words separated by a space |
ascUppers |
Flag that allows you to correct the case of letters |
ascHyphen |
Flag to allow splitting of concatenated words with hyphens |
ascSkipUpp |
Flag to skip uppercase words |
ascSkipLat |
Flag allowing words in the latin alphabet to be skipped |
ascSkipHyp |
Flag to skip hyphenated words |
ascWordRep |
Flag that allows you to remove duplicate words |
Methods:
- erratum - Method for search typos in text
- token - Method for determining the type of the token words
- split - Method for performing a split of clumped words
- splitByHyphens - Method for performing a split of clumped words by hyphens
- check - Method for checking a word for its existence in the dictionary
Example:
>>> import asc
>>>
>>> asc.setThreads(0)
>>> asc.setOption(asc.options_t.ascSplit)
>>> asc.setOption(asc.options_t.ascAlter)
>>> asc.setOption(asc.options_t.ascESplit)
>>> asc.setOption(asc.options_t.ascRSplit)
>>> asc.setOption(asc.options_t.ascUppers)
>>> asc.setOption(asc.options_t.ascHyphen)
>>> asc.setOption(asc.options_t.ascWordRep)
>>> asc.setOption(asc.options_t.mixDicts)
>>> asc.setOption(asc.options_t.confidence)
>>>
>>> def status(text, status):
... print(text, status)
...
>>>
>>> asc.loadIndex("./wittenbell-3-single.asc", "", status)
Loading dictionary 1
Loading dictionary 2
Loading dictionary 3
Loading dictionary 4
Loading dictionary 5
Loading dictionary 6
Loading dictionary 7
Loading dictionary 8
...
Loading Bloom filter 100
Loading stemming 100
Loading language model 6
Loading language model 12
Loading language model 18
Loading language model 25
Loading language model 31
Loading language model 37
...
Loading alternative words 1
Loading alternative words 2
Loading alternative words 3
Loading alternative words 4
Loading alternative words 5
Loading alternative words 6
Loading alternative words 7
...
Loading substitutes letters 7
Loading substitutes letters 14
Loading substitutes letters 21
Loading substitutes letters 28
Loading substitutes letters 35
Loading substitutes letters 42
...
>>>
asc.erratum("начальнег зажёг павзрослому")
['начальнег', 'павзрослому']
>>>
asc.token("word")
'<word>'
>>> asc.token("12")
'<num>'
>>> asc.token("127.0.0.1")
'<url>'
>>> asc.token("14-33")
'<range>'
>>> asc.token("14:44:22")
'<time>'
>>> asc.token("08/02/2020")
'<date>'
>>>
>>> asc.split("приветкакдела")
'привет как Дела'
>>> asc.split("былмастеромпрятатьсянонемогвоспользоватьсясвоимиталантамипотому")
'был мастером прятаться но не мог воспользоваться своими талантами потому'
>>> asc.split("Ябинатакойсоставбысходилеслиб")
'я б и на такой состав бы сходил если б'
>>> asc.split("летчерезXVIретроспективнопросматриватьэтобудет")
'лет через XVI ретроспективно просматривать это будет'
>>>
>>> asc.splitByHyphens("привет-как-дела")
'привет как дела'
>>> asc.splitByHyphens("как-то-так")
'как то так'
>>> asc.splitByHyphens("как-то")
'как-то'
>>>
>>> asc.check("hello")
True
>>> asc.check("Шварценеггер")
True
>>> asc.check("прывет")
False
Methods:
- setSize - Method for set size N-gram
- setAlmV2 - Method for set the language model type ALMv2
- unsetAlmV2 - Method for unset the language model type ALMv2
- setLocale - Method set locale (Default: en_US.UTF-8)
- setCode - Method for set code language
- setLictype - Method for set dictionary license information type
- setName - Method for set dictionary name
- setAuthor - Method for set the dictionary author
- setCopyright - Method for set copyright on a dictionary
- setLictext - Method for set license information dictionary
- setContacts - Method for set contact details of the dictionary author
- pruneArpa - Language model pruning method
- addWord - Method for add a word to the dictionary
- generateEmbedding - Method for generation embedding
- setSizeEmbedding - Method for set the embedding size
Description
Smoothing |
wittenBell |
addSmooth |
goodTuring |
constDiscount |
naturalDiscount |
kneserNey |
modKneserNey |
Example:
>>> import asc
>>>
>>> asc.setSize(3)
>>> asc.setAlmV2()
>>> asc.setThreads(0)
>>> asc.setLocale("en_US.UTF-8")
>>>
>>> asc.setOption(asc.options_t.allowUnk)
>>> asc.setOption(asc.options_t.resetUnk)
>>> asc.setOption(asc.options_t.mixDicts)
>>> asc.setOption(asc.options_t.tokenWords)
>>> asc.setOption(asc.options_t.confidence)
>>> asc.setOption(asc.options_t.interpolate)
>>>
>>> asc.addAlt("е", "ё")
>>> asc.addAlt("ежик", "ёжик")
>>> asc.addAlt("зажег", "зажёг")
>>> asc.addAlt("Легкий", "Лёгкий")
>>>
>>> asc.setAlphabet("абвгдеёжзийклмнопрстуфхцчшщъыьэюяabcdefghijklmnopqrstuvwxyz")
>>>
>>> asc.setPilots(["а","у","в","о","с","к","б","и","я","э","a","i","o","e","g"])
>>> asc.setSubstitutes({'p':'р','c':'с','o':'о','t':'т','k':'к','e':'е','a':'а','h':'н','x':'х','b':'в','m':'м'})
>>>
>>> asc.addAbbr("США")
>>> asc.addAbbr("Сбер")
>>> asc.addGoodword("T-34")
>>> asc.addGoodword("АН-25")
>>>
>>> asc.addBadword("ийти")
>>> asc.addBadword("циган")
>>> asc.addBadword("апичатка")
>>>
>>> asc.addUWord("Москва")
>>> asc.addUWord("Санкт-Петербург")
>>>
>>> def statusMap(status):
... print("Write map", status)
...
>>> def statusArpa1(status):
... print("Build arpa", status)
...
>>> def statusArpa2(status):
... print("Write arpa", status)
...
>>> def statusWords(status):
... print("Write words", status)
...
>>> def statusVocab(status):
... print("Write vocab", status)
...
>>> def statusAbbrs(status):
... print("Write abbrs", status)
...
>>> def statusPrune(status):
... print("Prune vocab", status)
...
>>> def statusNgram(status):
... print("Write ngram", status)
...
>>> def statusIndex(status):
... print("Build index", status)
...
>>> def status(text, status):
... print(text, status)
...
>>> asc.addText("The future is now", 0)
>>>
>>> asc.collectCorpus("./corpus/text.txt", asc.smoothing_t.wittenBell, 0.0, False, False, status)
Read text corpora 0
Read text corpora 1
Read text corpora 2
Read text corpora 3
Read text corpora 4
Read text corpora 5
Read text corpora 6
...
>>> asc.pruneVocab(-15.0, 0, 0, statusPrune)
Prune vocab 0
Prune vocab 1
Prune vocab 2
Prune vocab 3
Prune vocab 4
Prune vocab 5
Prune vocab 6
...
# Prune VOCAB or prune ARPA example
>>> asc.pruneArpa(0.015, 3, statusPrune)
Prune arpa 0
Prune arpa 1
Prune arpa 2
Prune arpa 3
Prune arpa 4
Prune arpa 5
Prune arpa 6
...
>>> asc.buildArpa(statusArpa1)
Build arpa 0
Build arpa 1
Build arpa 2
Build arpa 3
Build arpa 4
Build arpa 5
Build arpa 6
...
>>> asc.writeMap("./words.map", statusMap)
Write map 0
Write map 1
Write map 2
Write map 3
Write map 4
Write map 5
Write map 6
...
>>> asc.writeArpa("./words.arpa", statusArpa2)
Write arpa 0
Write arpa 1
Write arpa 2
Write arpa 3
Write arpa 4
Write arpa 5
Write arpa 6
...
>>> asc.writeWords("./words.txt", statusWords)
Write words 0
Write words 1
Write words 2
Write words 3
Write words 4
Write words 5
Write words 6
...
>>> asc.writeVocab("./words.vocab", statusVocab)
Write vocab 0
Write vocab 1
Write vocab 2
Write vocab 3
Write vocab 4
Write vocab 5
Write vocab 6
...
>>> asc.writeAbbrs("./words1.abbr", statusAbbrs)
Write abbrs 50
Write abbrs 100
>>>
>>> asc.writeSuffix("./words2.abbr", statusAbbrs)
Write abbrs 10
Write abbrs 20
Write abbrs 30
Write abbrs 40
Write abbrs 50
Write abbrs 60
...
>>> asc.writeNgrams("./words.ngram", statusNgram)
Write ngram 0
Write ngram 1
Write ngram 2
Write ngram 3
Write ngram 4
Write ngram 5
Write ngram 6
...
>>> asc.setCode("RU")
>>> asc.setLictype("MIT")
>>> asc.setName("Russian")
>>> asc.setAuthor("You name")
>>> asc.setCopyright("You company LLC")
>>> asc.setLictext("... License text ...")
>>> asc.setContacts("site: https://example.com, e-mail: info@example.com")
>>>
>>> asc.setEmbedding({
... "а": 0, "б": 1, "в": 2, "г": 3, "д": 4, "е": 5,
... "ё": 5, "ж": 6, "з": 7, "и": 8, "й": 8, "к": 9,
... "л": 10, "м": 11, "н": 12, "о": 0, "п": 13, "р": 14,
... "с": 15, "т": 16, "у": 17, "ф": 18, "х": 19, "ц": 20,
... "ч": 21, "ш": 21, "щ": 21, "ъ": 22, "ы": 23, "ь": 22,
... "э": 5, "ю": 24, "я": 25, "<": 26, ">": 26, "~": 26,
... "-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
... "%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
... "\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
... "5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
... "b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
... "h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
... "n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
... "t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
... }, 28)
>>>
>> asc.saveIndex("./3-wittenbell.asc", "", 128, status)
Read words 1
Read words 2
Read words 3
Read words 4
Read words 5
Read words 6
...
Train dictionary 0
Train dictionary 1
Train dictionary 2
Train dictionary 3
Train dictionary 4
Train dictionary 5
Train dictionary 6
...
Dump dictionary 0
Dump dictionary 1
Dump dictionary 2
Dump dictionary 3
Dump dictionary 4
Dump dictionary 5
Dump dictionary 6
...
Dump alternative letters 100
Dump alternative letters 100
Dump alternative words 200
Dump alternative words 100
Dump language model 0
Dump language model 100
Dump substitutes letters 9
Dump substitutes letters 18
Dump substitutes letters 27
Dump substitutes letters 36
Dump substitutes letters 45
Dump substitutes letters 54
Dump substitutes letters 63
Dump substitutes letters 72
Dump substitutes letters 81
Dump substitutes letters 90
Dump substitutes letters 100
Dump substitutes letters 100
>>>
>>> asc.infoIndex("./3-wittenbell.asc")
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* Code: RU
* Version: 1.0.0
* Dictionary name: Russian
* Locale: en_US.UTF-8
* Alphabet: абвгдеёжзийклмнопрстуфхцчшщъыьэюяabcdefghijklmnopqrstuvwxyz
* Build date: 09/14/2020 01:39:50
* Encrypted: NO
* ALM type: ALMv2
* Allow apostrophe: NO
* Count words: 38120
* Count documents: 13
* Only good words: NO
* Mix words in dicts: YES
* Confidence arpa: YES
* Count upper words: 2
* Count pilots words: 15
* Count bad words: 3
* Count good words: 2
* Count substitutes: 11
* Count abbreviations: 12
* Alternatives: е => ё
* Count alternatives words: 1
* Size embedding: 28
* Length n-gram: 1
* Author: You name
* Contacts: site: https://example.com, e-mail: info@example.com
* Copyright ©: You company LLC
* License type: MIT
* License text:
... License text ...
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
Example:
>>> import asc
>>>
>>> asc.setSize(3)
>>> asc.setThreads(0)
>>> asc.setLocale("en_US.UTF-8")
>>>
>>> asc.setOption(asc.options_t.allowUnk)
>>> asc.setOption(asc.options_t.resetUnk)
>>> asc.setOption(asc.options_t.mixDicts)
>>> asc.setOption(asc.options_t.tokenWords)
>>> asc.setOption(asc.options_t.confidence)
>>> asc.setOption(asc.options_t.interpolate)
>>>
>>> asc.addAlt("е", "ё")
>>> asc.addAlt("Легкий", "Лёгкий")
>>>
>>> asc.setAlphabet("абвгдеёжзийклмнопрстуфхцчшщъыьэюяabcdefghijklmnopqrstuvwxyz")
>>>
>>> asc.setPilots(["а","у","в","о","с","к","б","и","я","э","a","i","o","e","g"])
>>> asc.setSubstitutes({'p':'р','c':'с','o':'о','t':'т','k':'к','e':'е','a':'а','h':'н','x':'х','b':'в','m':'м'})
>>>
>>> asc.addAbbr("США")
>>> asc.addAbbr("Сбер")
>>> asc.addGoodword("T-34")
>>> asc.addGoodword("АН-25")
>>>
>>> asc.addBadword("ийти")
>>> asc.addBadword("циган")
>>> asc.addBadword("апичатка")
>>>
>>> asc.addUWord("Москва")
>>> asc.addUWord("Санкт-Петербург")
>>>
>>> def statusArpa(status):
... print("Read arpa", status)
...
>>> def statusVocab(status):
... print("Read vocab", status)
...
>>> def statusIndex(status):
... print("Build index", status)
...
>>> def status(text, status):
... print(text, status)
...
>>> asc.readArpa("./words.arpa", statusArpa)
Read arpa 0
Read arpa 1
Read arpa 2
Read arpa 3
Read arpa 4
Read arpa 5
Read arpa 6
...
>>> asc.readVocab("./words.vocab", statusVocab)
Read vocab 0
Read vocab 1
Read vocab 2
Read vocab 3
Read vocab 4
Read vocab 5
Read vocab 6
...
>>> asc.setCode("RU")
>>> asc.setLictype("MIT")
>>> asc.setName("Russian")
>>> asc.setAuthor("You name")
>>> asc.setCopyright("You company LLC")
>>> asc.setLictext("... License text ...")
>>> asc.setContacts("site: https://example.com, e-mail: info@example.com")
>>>
>>> asc.setEmbedding({
... "а": 0, "б": 1, "в": 2, "г": 3, "д": 4, "е": 5,
... "ё": 5, "ж": 6, "з": 7, "и": 8, "й": 8, "к": 9,
... "л": 10, "м": 11, "н": 12, "о": 0, "п": 13, "р": 14,
... "с": 15, "т": 16, "у": 17, "ф": 18, "х": 19, "ц": 20,
... "ч": 21, "ш": 21, "щ": 21, "ъ": 22, "ы": 23, "ь": 22,
... "э": 5, "ю": 24, "я": 25, "<": 26, ">": 26, "~": 26,
... "-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
... "%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
... "\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
... "5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
... "b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
... "h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
... "n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
... "t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
... }, 28)
>>>
>>> asc.buildIndex(statusIndex)
Build index 0
Build index 1
Build index 2
Build index 3
Build index 4
Build index 5
Build index 6
...
>>> asc.saveIndex("./3-wittenbell.asc", "", 128, status)
Dump dictionary 0
Dump dictionary 1
Dump dictionary 2
Dump dictionary 3
Dump dictionary 4
Dump dictionary 5
Dump dictionary 6
...
Dump alternative letters 100
Dump alternative letters 100
Dump alternative words 200
Dump alternative words 100
Dump language model 0
Dump language model 100
Dump substitutes letters 9
Dump substitutes letters 18
Dump substitutes letters 27
Dump substitutes letters 36
Dump substitutes letters 45
Dump substitutes letters 54
Dump substitutes letters 63
Dump substitutes letters 72
Dump substitutes letters 81
Dump substitutes letters 90
Dump substitutes letters 100
Dump substitutes letters 100
>>>
>>> asc.infoIndex("./3-wittenbell.asc")
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* Code: RU
* Version: 1.0.0
* Dictionary name: Russian
* Locale: en_US.UTF-8
* Alphabet: абвгдеёжзийклмнопрстуфхцчшщъыьэюяabcdefghijklmnopqrstuvwxyz
* Build date: 09/14/2020 01:58:52
* Encrypted: NO
* ALM type: ALMv1
* Allow apostrophe: NO
* Count words: 38120
* Count documents: 13
* Only good words: NO
* Mix words in dicts: YES
* Confidence arpa: YES
* Count upper words: 2
* Count pilots words: 15
* Count bad words: 3
* Count good words: 2
* Count substitutes: 11
* Count abbreviations: 2
* Alternatives: е => ё
* Count alternatives words: 1
* Size embedding: 28
* Length n-gram: 3
* Count n-grams: 353
* Author: You name
* Contacts: site: https://example.com, e-mail: info@example.com
* Copyright ©: You company LLC
* License type: MIT
* License text:
... License text ...
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
Example:
>>> import asc
>>>
>>> asc.setSize(3)
>>> asc.setAlmV2()
>>> asc.setThreads(0)
>>> asc.setLocale("en_US.UTF-8")
>>>
>>> asc.setOption(asc.options_t.allowUnk)
>>> asc.setOption(asc.options_t.resetUnk)
>>> asc.setOption(asc.options_t.mixDicts)
>>> asc.setOption(asc.options_t.tokenWords)
>>> asc.setOption(asc.options_t.confidence)
>>> asc.setOption(asc.options_t.interpolate)
>>>
>>> asc.addAlt("е", "ё")
>>> asc.addAlt("Легкий", "Лёгкий")
>>>
>>> asc.setAlphabet("абвгдеёжзийклмнопрстуфхцчшщъыьэюяabcdefghijklmnopqrstuvwxyz")
>>>
>>> asc.setPilots(["а","у","в","о","с","к","б","и","я","э","a","i","o","e","g"])
>>> asc.setSubstitutes({'p':'р','c':'с','o':'о','t':'т','k':'к','e':'е','a':'а','h':'н','x':'х','b':'в','m':'м'})
>>>
>>> asc.addAbbr("США")
>>> asc.addAbbr("Сбер")
>>> asc.addGoodword("T-34")
>>> asc.addGoodword("АН-25")
>>>
>>> asc.addBadword("ийти")
>>> asc.addBadword("циган")
>>> asc.addBadword("апичатка")
>>>
>>> asc.addUWord("Москва")
>>> asc.addUWord("Санкт-Петербург")
>>>
>>> def statusArpa(status):
... print("Read arpa", status)
...
>>> def statusIndex(status):
... print("Build index", status)
...
>>> def statusPrune(status):
... print("Prune arpa", status)
...
>>> def status(text, status):
... print(text, status)
...
>>> asc.readArpa("./words.arpa", statusArpa)
Read arpa 0
Read arpa 1
Read arpa 2
Read arpa 3
Read arpa 4
Read arpa 5
Read arpa 6
...
>>> asc.setAdCw(38120, 13)
>>>
>>> asc.addWord("министерство")
>>> asc.addWord("возмездие", 0, 1)
>>> asc.addWord("возражение", asc.idw("возражение"), 2)
...
>>>
>>> asc.setCode("RU")
>>> asc.setLictype("MIT")
>>> asc.setName("Russian")
>>> asc.setAuthor("You name")
>>> asc.setCopyright("You company LLC")
>>> asc.setLictext("... License text ...")
>>> asc.setContacts("site: https://example.com, e-mail: info@example.com")
>>>
>>> asc.setEmbedding({
... "а": 0, "б": 1, "в": 2, "г": 3, "д": 4, "е": 5,
... "ё": 5, "ж": 6, "з": 7, "и": 8, "й": 8, "к": 9,
... "л": 10, "м": 11, "н": 12, "о": 0, "п": 13, "р": 14,
... "с": 15, "т": 16, "у": 17, "ф": 18, "х": 19, "ц": 20,
... "ч": 21, "ш": 21, "щ": 21, "ъ": 22, "ы": 23, "ь": 22,
... "э": 5, "ю": 24, "я": 25, "<": 26, ">": 26, "~": 26,
... "-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
... "%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
... "\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
... "5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
... "b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
... "h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
... "n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
... "t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
... }, 28)
>>>
>>> asc.buildIndex(statusIndex)
Build index 0
Build index 1
Build index 2
Build index 3
Build index 4
Build index 5
Build index 6
...
>>> asc.saveIndex("./3-wittenbell.asc", "password", 128, status)
Dump dictionary 0
Dump dictionary 1
Dump dictionary 2
Dump dictionary 3
Dump dictionary 4
Dump dictionary 5
Dump dictionary 6
...
Dump alternative letters 100
Dump alternative letters 100
Dump alternative words 200
Dump alternative words 100
Dump language model 0
Dump language model 100
Dump substitutes letters 9
Dump substitutes letters 18
Dump substitutes letters 27
Dump substitutes letters 36
Dump substitutes letters 45
Dump substitutes letters 54
Dump substitutes letters 63
Dump substitutes letters 72
Dump substitutes letters 81
Dump substitutes letters 90
Dump substitutes letters 100
Dump substitutes letters 100
>>>
>>> asc.infoIndex("./3-wittenbell.asc")
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* Code: RU
* Version: 1.0.0
* Dictionary name: Russian
* Build date: 09/14/2020 02:09:38
* Encrypted: YES
* ALM type: ALMv2
* Allow apostrophe: NO
* Count words: 38120
* Count documents: 13
* Only good words: NO
* Mix words in dicts: YES
* Confidence arpa: YES
* Count upper words: 2
* Count pilots words: 15
* Count bad words: 3
* Count good words: 2
* Count substitutes: 11
* Count abbreviations: 2
* Alternatives: е => ё
* Count alternatives words: 1
* Size embedding: 28
* Length n-gram: 3
* Count n-grams: 353
* Author: You name
* Contacts: site: https://example.com, e-mail: info@example.com
* Copyright ©: You company LLC
* License type: MIT
* License text:
... License text ...
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
Example:
>>> import asc
>>>
>>> asc.setSize(3)
>>> asc.setAlmV2()
>>> asc.setThreads(0)
>>> asc.setLocale("en_US.UTF-8")
>>>
>>> asc.setOption(asc.options_t.allowUnk)
>>> asc.setOption(asc.options_t.resetUnk)
>>> asc.setOption(asc.options_t.mixDicts)
>>> asc.setOption(asc.options_t.tokenWords)
>>> asc.setOption(asc.options_t.confidence)
>>> asc.setOption(asc.options_t.interpolate)
>>>
>>> asc.addAlt("е", "ё")
>>> asc.addAlt("Легкий", "Лёгкий")
>>>
>>> asc.setAlphabet("абвгдеёжзийклмнопрстуфхцчшщъыьэюяabcdefghijklmnopqrstuvwxyz")
>>>
>>> asc.setPilots(["а","у","в","о","с","к","б","и","я","э","a","i","o","e","g"])
>>> asc.setSubstitutes({'p':'р','c':'с','o':'о','t':'т','k':'к','e':'е','a':'а','h':'н','x':'х','b':'в','m':'м'})
>>>
>>> asc.addAbbr("США")
>>> asc.addAbbr("Сбер")
>>> asc.addGoodword("T-34")
>>> asc.addGoodword("АН-25")
>>>
>>> asc.addBadword("ийти")
>>> asc.addBadword("циган")
>>> asc.addBadword("апичатка")
>>>
>>> asc.addUWord("Москва")
>>> asc.addUWord("Санкт-Петербург")
>>>
>>> def statusArpa(status):
... print("Read arpa", status)
...
>>> def statusIndex(status):
... print("Build index", status)
...
>>> def statusPrune(status):
... print("Prune arpa", status)
...
>>> def status(text, status):
... print(text, status)
...
>>> asc.readArpa("./words.arpa", statusArpa)
Read arpa 0
Read arpa 1
Read arpa 2
Read arpa 3
Read arpa 4
Read arpa 5
Read arpa 6
...
>>> asc.setAdCw(38120, 13)
>>>
>>> asc.addWord("министерство")
>>> asc.addWord("возмездие", 0, 1)
>>> asc.addWord("возражение", asc.idw("возражение"), 2)
...
>>>
>>> asc.setCode("RU")
>>> asc.setLictype("MIT")
>>> asc.setName("Russian")
>>> asc.setAuthor("You name")
>>> asc.setCopyright("You company LLC")
>>> asc.setLictext("... License text ...")
>>> asc.setContacts("site: https://example.com, e-mail: info@example.com")
>>>
>>> asc.setSizeEmbedding(32)
>>> asc.generateEmbedding()
>>>
>>> asc.buildIndex(statusIndex)
Build index 0
Build index 1
Build index 2
Build index 3
Build index 4
Build index 5
Build index 6
...
>>> asc.saveIndex("./3-wittenbell.asc", "password", 128, status)
Dump dictionary 0
Dump dictionary 1
Dump dictionary 2
Dump dictionary 3
Dump dictionary 4
Dump dictionary 5
Dump dictionary 6
...
Dump alternative letters 100
Dump alternative letters 100
Dump alternative words 200
Dump alternative words 100
Dump language model 0
Dump language model 100
Dump substitutes letters 9
Dump substitutes letters 18
Dump substitutes letters 27
Dump substitutes letters 36
Dump substitutes letters 45
Dump substitutes letters 54
Dump substitutes letters 63
Dump substitutes letters 72
Dump substitutes letters 81
Dump substitutes letters 90
Dump substitutes letters 100
Dump substitutes letters 100
>>>
>>> asc.infoIndex("./3-wittenbell.asc")
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* Code: RU
* Version: 1.0.0
* Dictionary name: Russian
* Build date: 09/14/2020 02:09:38
* Encrypted: YES
* ALM type: ALMv2
* Allow apostrophe: NO
* Count words: 38120
* Count documents: 13
* Only good words: NO
* Mix words in dicts: YES
* Confidence arpa: YES
* Count upper words: 2
* Count pilots words: 15
* Count bad words: 3
* Count good words: 2
* Count substitutes: 11
* Count abbreviations: 2
* Alternatives: е => ё
* Count alternatives words: 1
* Size embedding: 28
* Length n-gram: 3
* Count n-grams: 353
* Author: You name
* Contacts: site: https://example.com, e-mail: info@example.com
* Copyright ©: You company LLC
* License type: MIT
* License text:
... License text ...
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
Methods:
- size - Method of obtaining the size of the N-gram
Example:
>>> import asc
>>>
>>> asc.setOption(asc.options_t.confidence)
>>>
>>> asc.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>>
>>> asc.readArpa('./lm.arpa')
>>>
>>> asc.size()
3
Methods:
- damerauLevenshtein - Determination of the Damerau-Levenshtein distance in phrases
- distanceLevenshtein - Determination of Levenshtein distance in phrases
- tanimoto - Method for determining Jaccard coefficient (quotient - Tanimoto coefficient)
- needlemanWunsch - Word stretching method
Example:
>>> import asc
>>> asc.damerauLevenshtein("привет", "приветик")
2
>>>
>>> asc.damerauLevenshtein("приевтик", "приветик")
1
>>>
>>> asc.distanceLevenshtein("приевтик", "приветик")
2
>>>
>>> asc.tanimoto("привет", "приветик")
0.7142857142857143
>>>
>>> asc.tanimoto("привеитк", "приветик")
0.4
>>>
>>> asc.needlemanWunsch("привеитк", "приветик")
4
>>>
>>> asc.needlemanWunsch("привет", "приветик")
2
>>>
>>> asc.damerauLevenshtein("acre", "car")
2
>>> asc.distanceLevenshtein("acre", "car")
3
>>>
>>> asc.damerauLevenshtein("anteater", "theatre")
4
>>> asc.distanceLevenshtein("anteater", "theatre")
5
>>>
>>> asc.damerauLevenshtein("banana", "nanny")
3
>>> asc.distanceLevenshtein("banana", "nanny")
3
>>>
>>> asc.damerauLevenshtein("cat", "crate")
2
>>> asc.distanceLevenshtein("cat", "crate")
2
>>>
>>> asc.mulctLevenshtein("привет", "приветик")
4
>>>
>>> asc.mulctLevenshtein("приевтик", "приветик")
1
>>>
>>> asc.mulctLevenshtein("acre", "car")
3
>>>
>>> asc.mulctLevenshtein("anteater", "theatre")
5
>>>
>>> asc.mulctLevenshtein("banana", "nanny")
4
>>>
>>> asc.mulctLevenshtein("cat", "crate")
4
Methods:
- textToJson - Method to convert text to JSON
- isAllowApostrophe - Apostrophe permission check method
- switchAllowApostrophe - Method for permitting or denying an apostrophe as part of a word
Example:
>>> import asc
>>>
>>> def callbackFn(text):
... print(text)
...
>>> asc.isAllowApostrophe()
False
>>> asc.switchAllowApostrophe()
>>>
>>> asc.isAllowApostrophe()
True
>>> asc.textToJson("«On nous dit qu'aujourd'hui c'est le cas, encore faudra-t-il l'évaluer» l'astronomie", callbackFn)
[["«","On","nous","dit","qu'aujourd'hui","c'est","le","cas",",","encore","faudra-t-il","l'évaluer","»","l'astronomie"]]
Methods:
- jsonToText - Method to convert JSON to text
Example:
>>> import asc
>>>
>>> def callbackFn(text):
... print(text)
...
>>> asc.jsonToText('[["«","On","nous","dit","qu\'aujourd\'hui","c\'est","le","cas",",","encore","faudra-t-il","l\'évaluer","»","l\'astronomie"]]', callbackFn)
«On nous dit qu'aujourd'hui c'est le cas, encore faudra-t-il l'évaluer» l'astronomie
Methods:
- restore - Method for restore text from context
Example:
>>> import asc
>>>
>>> asc.setOption(asc.options_t.uppers)
>>>
>>> asc.restore(["«","On","nous","dit","qu\'aujourd\'hui","c\'est","le","cas",",","encore","faudra-t-il","l\'évaluer","»","l\'astronomie"])
"«On nous dit qu'aujourd'hui c'est le cas, encore faudra-t-il l'évaluer» l'astronomie"
Methods:
- allowStress - Method for allow using stress in words
- disallowStress - Method for disallow using stress in words
Example:
>>> import asc
>>>
>>> asc.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>>
>>> def callbackFn(text):
... print(text)
...
>>> asc.textToJson('«Бе́лая стрела́» — согласно распространённой в 1990-е годы в России городской легенде, якобы специально организованная и подготовленная законспирированная правительственная спецслужба, сотрудники которой — бывшие и действовавшие милиционеры и спецназовцы, имеющие право на физическую ликвидацию особо опасных уголовных авторитетов и лидеров орудовавших в России ОПГ, относительно которых не представляется возможным привлечения их к уголовной ответственности законными методами[1][2][3]. Несмотря на отсутствие официальных доказательств существования организации и многочисленные опровержения со стороны силовых структур и служб безопасности[4], в российском обществе легенду считают основанной на подлинных фактах громких убийств криминальных авторитетов, совершённых в 1990-е годы, и не исключают существование реальной спецслужбы[5].', callbackFn)
[["«","Белая","стрела","»","—","согласно","распространённой","в","1990-е","годы","в","России","городской","легенде",",","якобы","специально","организованная","и","подготовленная","законспирированная","правительственная","спецслужба",",","сотрудники","которой","—","бывшие","и","действовавшие","милиционеры","и","спецназовцы",",","имеющие","право","на","физическую","ликвидацию","особо","опасных","уголовных","авторитетов","и","лидеров","орудовавших","в","России","ОПГ",",","относительно","которых","не","представляется","возможным","привлечения","их","к","уголовной","ответственности","законными","методами","[","1","]","[","2","]","[","3","]","."],["Несмотря","на","отсутствие","официальных","доказательств","существования","организации","и","многочисленные","опровержения","со","стороны","силовых","структур","и","служб","безопасности","[","4","]",",","в","российском","обществе","легенду","считают","основанной","на","подлинных","фактах","громких","убийств","криминальных","авторитетов",",","совершённых","в","1990-е","годы",",","и","не","исключают","существование","реальной","спецслужбы","[","5","]","."]]
>>>
>>> asc.jsonToText('[["«","Белая","стрела","»","—","согласно","распространённой","в","1990-е","годы","в","России","городской","легенде",",","якобы","специально","организованная","и","подготовленная","законспирированная","правительственная","спецслужба",",","сотрудники","которой","—","бывшие","и","действовавшие","милиционеры","и","спецназовцы",",","имеющие","право","на","физическую","ликвидацию","особо","опасных","уголовных","авторитетов","и","лидеров","орудовавших","в","России","ОПГ",",","относительно","которых","не","представляется","возможным","привлечения","их","к","уголовной","ответственности","законными","методами","[","1","]","[","2","]","[","3","]","."],["Несмотря","на","отсутствие","официальных","доказательств","существования","организации","и","многочисленные","опровержения","со","стороны","силовых","структур","и","служб","безопасности","[","4","]",",","в","российском","обществе","легенду","считают","основанной","на","подлинных","фактах","громких","убийств","криминальных","авторитетов",",","совершённых","в","1990-е","годы",",","и","не","исключают","существование","реальной","спецслужбы","[","5","]","."]]', callbackFn)
«Белая стрела» — согласно распространённой в 1990-е годы в России городской легенде, якобы специально организованная и подготовленная законспирированная правительственная спецслужба, сотрудники которой — бывшие и действовавшие милиционеры и спецназовцы, имеющие право на физическую ликвидацию особо опасных уголовных авторитетов и лидеров орудовавших в России ОПГ, относительно которых не представляется возможным привлечения их к уголовной ответственности законными методами [1] [2] [3].
Несмотря на отсутствие официальных доказательств существования организации и многочисленные опровержения со стороны силовых структур и служб безопасности [4], в российском обществе легенду считают основанной на подлинных фактах громких убийств криминальных авторитетов, совершённых в 1990-е годы, и не исключают существование реальной спецслужбы [5].
>>>
>>> asc.allowStress()
>>> asc.textToJson('«Бе́лая стрела́» — согласно распространённой в 1990-е годы в России городской легенде, якобы специально организованная и подготовленная законспирированная правительственная спецслужба, сотрудники которой — бывшие и действовавшие милиционеры и спецназовцы, имеющие право на физическую ликвидацию особо опасных уголовных авторитетов и лидеров орудовавших в России ОПГ, относительно которых не представляется возможным привлечения их к уголовной ответственности законными методами[1][2][3]. Несмотря на отсутствие официальных доказательств существования организации и многочисленные опровержения со стороны силовых структур и служб безопасности[4], в российском обществе легенду считают основанной на подлинных фактах громких убийств криминальных авторитетов, совершённых в 1990-е годы, и не исключают существование реальной спецслужбы[5].', callbackFn)
[["«","Бе́лая","стрела́","»","—","согласно","распространённой","в","1990-е","годы","в","России","городской","легенде",",","якобы","специально","организованная","и","подготовленная","законспирированная","правительственная","спецслужба",",","сотрудники","которой","—","бывшие","и","действовавшие","милиционеры","и","спецназовцы",",","имеющие","право","на","физическую","ликвидацию","особо","опасных","уголовных","авторитетов","и","лидеров","орудовавших","в","России","ОПГ",",","относительно","которых","не","представляется","возможным","привлечения","их","к","уголовной","ответственности","законными","методами","[","1","]","[","2","]","[","3","]","."],["Несмотря","на","отсутствие","официальных","доказательств","существования","организации","и","многочисленные","опровержения","со","стороны","силовых","структур","и","служб","безопасности","[","4","]",",","в","российском","обществе","легенду","считают","основанной","на","подлинных","фактах","громких","убийств","криминальных","авторитетов",",","совершённых","в","1990-е","годы",",","и","не","исключают","существование","реальной","спецслужбы","[","5","]","."]]
>>>
>>> asc.jsonToText('[["«","Бе́лая","стрела́","»","—","согласно","распространённой","в","1990-е","годы","в","России","городской","легенде",",","якобы","специально","организованная","и","подготовленная","законспирированная","правительственная","спецслужба",",","сотрудники","которой","—","бывшие","и","действовавшие","милиционеры","и","спецназовцы",",","имеющие","право","на","физическую","ликвидацию","особо","опасных","уголовных","авторитетов","и","лидеров","орудовавших","в","России","ОПГ",",","относительно","которых","не","представляется","возможным","привлечения","их","к","уголовной","ответственности","законными","методами","[","1","]","[","2","]","[","3","]","."],["Несмотря","на","отсутствие","официальных","доказательств","существования","организации","и","многочисленные","опровержения","со","стороны","силовых","структур","и","служб","безопасности","[","4","]",",","в","российском","обществе","легенду","считают","основанной","на","подлинных","фактах","громких","убийств","криминальных","авторитетов",",","совершённых","в","1990-е","годы",",","и","не","исключают","существование","реальной","спецслужбы","[","5","]","."]]', callbackFn)
«Бе́лая стрела́» — согласно распространённой в 1990-е годы в России городской легенде, якобы специально организованная и подготовленная законспирированная правительственная спецслужба, сотрудники которой — бывшие и действовавшие милиционеры и спецназовцы, имеющие право на физическую ликвидацию особо опасных уголовных авторитетов и лидеров орудовавших в России ОПГ, относительно которых не представляется возможным привлечения их к уголовной ответственности законными методами [1] [2] [3].
Несмотря на отсутствие официальных доказательств существования организации и многочисленные опровержения со стороны силовых структур и служб безопасности [4], в российском обществе легенду считают основанной на подлинных фактах громких убийств криминальных авторитетов, совершённых в 1990-е годы, и не исключают существование реальной спецслужбы [5].
>>>
>>> asc.disallowStress()
>>> asc.textToJson('«Бе́лая стрела́» — согласно распространённой в 1990-е годы в России городской легенде, якобы специально организованная и подготовленная законспирированная правительственная спецслужба, сотрудники которой — бывшие и действовавшие милиционеры и спецназовцы, имеющие право на физическую ликвидацию особо опасных уголовных авторитетов и лидеров орудовавших в России ОПГ, относительно которых не представляется возможным привлечения их к уголовной ответственности законными методами[1][2][3]. Несмотря на отсутствие официальных доказательств существования организации и многочисленные опровержения со стороны силовых структур и служб безопасности[4], в российском обществе легенду считают основанной на подлинных фактах громких убийств криминальных авторитетов, совершённых в 1990-е годы, и не исключают существование реальной спецслужбы[5].', callbackFn)
[["«","Белая","стрела","»","—","согласно","распространённой","в","1990-е","годы","в","России","городской","легенде",",","якобы","специально","организованная","и","подготовленная","законспирированная","правительственная","спецслужба",",","сотрудники","которой","—","бывшие","и","действовавшие","милиционеры","и","спецназовцы",",","имеющие","право","на","физическую","ликвидацию","особо","опасных","уголовных","авторитетов","и","лидеров","орудовавших","в","России","ОПГ",",","относительно","которых","не","представляется","возможным","привлечения","их","к","уголовной","ответственности","законными","методами","[","1","]","[","2","]","[","3","]","."],["Несмотря","на","отсутствие","официальных","доказательств","существования","организации","и","многочисленные","опровержения","со","стороны","силовых","структур","и","служб","безопасности","[","4","]",",","в","российском","обществе","легенду","считают","основанной","на","подлинных","фактах","громких","убийств","криминальных","авторитетов",",","совершённых","в","1990-е","годы",",","и","не","исключают","существование","реальной","спецслужбы","[","5","]","."]]
>>>
>>> asc.jsonToText('[["«","Белая","стрела","»","—","согласно","распространённой","в","1990-е","годы","в","России","городской","легенде",",","якобы","специально","организованная","и","подготовленная","законспирированная","правительственная","спецслужба",",","сотрудники","которой","—","бывшие","и","действовавшие","милиционеры","и","спецназовцы",",","имеющие","право","на","физическую","ликвидацию","особо","опасных","уголовных","авторитетов","и","лидеров","орудовавших","в","России","ОПГ",",","относительно","которых","не","представляется","возможным","привлечения","их","к","уголовной","ответственности","законными","методами","[","1","]","[","2","]","[","3","]","."],["Несмотря","на","отсутствие","официальных","доказательств","существования","организации","и","многочисленные","опровержения","со","стороны","силовых","структур","и","служб","безопасности","[","4","]",",","в","российском","обществе","легенду","считают","основанной","на","подлинных","фактах","громких","убийств","криминальных","авторитетов",",","совершённых","в","1990-е","годы",",","и","не","исключают","существование","реальной","спецслужбы","[","5","]","."]]', callbackFn)
«Белая стрела» — согласно распространённой в 1990-е годы в России городской легенде, якобы специально организованная и подготовленная законспирированная правительственная спецслужба, сотрудники которой — бывшие и действовавшие милиционеры и спецназовцы, имеющие право на физическую ликвидацию особо опасных уголовных авторитетов и лидеров орудовавших в России ОПГ, относительно которых не представляется возможным привлечения их к уголовной ответственности законными методами [1] [2] [3].
Несмотря на отсутствие официальных доказательств существования организации и многочисленные опровержения со стороны силовых структур и служб безопасности [4], в российском обществе легенду считают основанной на подлинных фактах громких убийств криминальных авторитетов, совершённых в 1990-е годы, и не исключают существование реальной спецслужбы [5].
Methods:
- addBadword - Method add bad word
- setBadwords - Method set words to blacklist
- getBadwords - Method get words in blacklist
Example:
>>> import asc
>>>
>>> asc.setBadwords(["hello", "world", "test"])
>>>
>>> asc.getBadwords()
{1554834897, 2156498622, 28307030}
>>>
>>> asc.addBadword("test2")
>>>
>>> asc.getBadwords()
{5170183734, 1554834897, 2156498622, 28307030}
Example:
>>> import asc
>>>
>>> asc.setBadwords({24227504, 1219922507, 1794085167})
>>>
>>> asc.getBadwords()
{24227504, 1219922507, 1794085167}
>>>
>>> asc.clear(asc.clear_t.badwords)
>>>
>>> asc.getBadwords()
{}
Methods:
- addGoodword - Method add good word
- setGoodwords - Method set words to whitelist
- getGoodwords - Method get words in whitelist
Example:
>>> import asc
>>>
>>> asc.setGoodwords(["hello", "world", "test"])
>>>
>>> asc.getGoodwords()
{1554834897, 2156498622, 28307030}
>>>
>>> asc.addGoodword("test2")
>>>
>>> asc.getGoodwords()
{5170183734, 1554834897, 2156498622, 28307030}
>>>
>>> asc.clear(asc.clear_t.goodwords)
>>>
>> asc.getGoodwords()
{}
Example:
>>> import asc
>>>
>>> asc.setGoodwords({24227504, 1219922507, 1794085167})
>>>
>>> asc.getGoodwords()
{24227504, 1219922507, 1794085167}
Methods:
- setUserToken - Method for adding user token
- getUserTokens - User token list retrieval method
- getUserTokenId - Method for obtaining user token identifier
- getUserTokenWord - Method for obtaining a custom token by its identifier
Example:
>>> import asc
>>>
>>> asc.setUserToken("usa")
>>>
>>> asc.setUserToken("russia")
>>>
>>> asc.getUserTokenId("usa")
5759809081
>>>
>>> asc.getUserTokenId("russia")
9910674734
>>>
>>> asc.getUserTokens()
['usa', 'russia']
>>>
>>> asc.getUserTokenWord(5759809081)
'usa'
>>>
>>> asc.getUserTokenWord(9910674734)
'russia'
>>>
>> asc.clear(asc.clear_t.utokens)
>>>
>>> asc.getUserTokens()
[]
Methods:
- findNgram - N-gram search method in text
- word - "Method to extract a word by its identifier"
Example:
>>> import asc
>>>
>>> def callbackFn(text):
... print(text)
...
>>> asc.setOption(asc.options_t.confidence)
>>> asc.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>> asc.readArpa('./lm.arpa')
>>>
>>> asc.idw("привет")
2487910648
>>> asc.word(2487910648)
'привет'
>>>
>>> asc.findNgram("Особое место занимает чудотворная икона Лобзание Христа Иудою", callbackFn)
<s> Особое
Особое место
место занимает
занимает чудотворная
чудотворная икона
икона Лобзание
Лобзание Христа
Христа Иудою
Иудою </s>
>>>
Methods:
- setUserTokenMethod - Method for set a custom token processing function
Example:
>>> import asc
>>>
>>> def fn(token, word):
... if token and (token == "<usa>"):
... if word and (word.lower() == "usa"):
... return True
... elif token and (token == "<russia>"):
... if word and (word.lower() == "russia"):
... return True
... return False
...
>>> asc.setUserToken("usa")
>>>
>>> asc.setUserToken("russia")
>>>
>>> asc.setUserTokenMethod("usa", fn)
>>>
>>> asc.setUserTokenMethod("russia", fn)
>>>
>>> asc.idw("usa")
5759809081
>>>
>>> asc.idw("russia")
9910674734
>>>
>>> asc.getUserTokenWord(5759809081)
'usa'
>>>
>>> asc.getUserTokenWord(9910674734)
'russia'
Methods:
- setWordPreprocessingMethod - Method for set the word preprocessing function
Example:
>>> import asc
>>>
>>> def run(word, context):
... if word == "возле": word = "около"
... return word
...
>>> asc.setOption(asc.options_t.debug)
>>>
>>> asc.setOption(asc.options_t.confidence)
>>>
>>> asc.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>>
>>> asc.readArpa('./lm.arpa')
>>>
>>> asc.setWordPreprocessingMethod(run)
>>>
>>> a = asc.perplexity("неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор???с лязгом выкатился и остановился возле мальчика....")
info: <s> Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор <punct> <punct> <punct> </s>
info: p( неожиданно | <s> ) = [2gram] 0.00038931 [ -3.40969900 ] / 0.99999991
info: p( из | неожиданно ...) = [2gram] 0.10110741 [ -0.99521700 ] / 0.99999979
info: p( подворотни | из ...) = [2gram] 0.00711798 [ -2.14764300 ] / 1.00000027
info: p( в | подворотни ...) = [2gram] 0.51077661 [ -0.29176900 ] / 1.00000021
info: p( олега | в ...) = [2gram] 0.00082936 [ -3.08125500 ] / 0.99999974
info: p( ударил | олега ...) = [2gram] 0.25002820 [ -0.60201100 ] / 0.99999978
info: p( яркий | ударил ...) = [2gram] 0.50002878 [ -0.30100500 ] / 1.00000034
info: p( прожектор | яркий ...) = [2gram] 0.50002878 [ -0.30100500 ] / 1.00000034
info: p( патрульный | прожектор ...) = [2gram] 0.50002878 [ -0.30100500 ] / 1.00000034
info: p( трактор | патрульный ...) = [2gram] 0.50002878 [ -0.30100500 ] / 1.00000034
info: p( <punct> | трактор ...) = [OOV] 0.00000000 [ -inf ] / 0.99999973
info: p( <punct> | <punct> ...) = [OOV] 0.00000000 [ -inf ] / 0.99999993
info: p( <punct> | <punct> ...) = [OOV] 0.00000000 [ -inf ] / 0.99999993
info: p( </s> | <punct> ...) = [1gram] 0.05693430 [ -1.24462600 ] / 0.99999993
info: 1 sentences, 13 words, 0 OOVs
info: 3 zeroprobs, logprob= -12.97624000 ppl= 8.45034200 ppl1= 9.95800426
info: <s> С лязгом выкатился и остановился около мальчика <punct> <punct> <punct> <punct> </s>
info: p( с | <s> ) = [2gram] 0.00642448 [ -2.19216200 ] / 0.99999991
info: p( лязгом | с ...) = [2gram] 0.00195917 [ -2.70792700 ] / 0.99999999
info: p( выкатился | лязгом ...) = [2gram] 0.50002878 [ -0.30100500 ] / 1.00000034
info: p( и | выкатился ...) = [2gram] 0.51169951 [ -0.29098500 ] / 1.00000024
info: p( остановился | и ...) = [2gram] 0.00143382 [ -2.84350600 ] / 0.99999975
info: p( около | остановился ...) = [1gram] 0.00011358 [ -3.94468000 ] / 1.00000003
info: p( мальчика | около ...) = [1gram] 0.00003932 [ -4.40541100 ] / 1.00000016
info: p( <punct> | мальчика ...) = [OOV] 0.00000000 [ -inf ] / 0.99999990
info: p( <punct> | <punct> ...) = [OOV] 0.00000000 [ -inf ] / 0.99999993
info: p( <punct> | <punct> ...) = [OOV] 0.00000000 [ -inf ] / 0.99999993
info: p( <punct> | <punct> ...) = [OOV] 0.00000000 [ -inf ] / 0.99999993
info: p( </s> | <punct> ...) = [1gram] 0.05693430 [ -1.24462600 ] / 0.99999993
info: 1 sentences, 11 words, 0 OOVs
info: 4 zeroprobs, logprob= -17.93030200 ppl= 31.20267541 ppl1= 42.66064865
>>> print(a.logprob)
-30.906542
Methods:
- setLogfile - Method of set the file for log output
- setOOvFile - Method set file for saving OOVs words
Example:
>>> import asc
>>>
>>> asc.setLogfile("./log.txt")
>>>
>>> asc.setOOvFile("./oov.txt")
Methods:
- perplexity - Perplexity calculation
- pplConcatenate - Method of combining perplexia
- pplByFiles - Method for reading perplexity calculation by file or group of files
Example:
>>> import asc
>>>
>>> asc.setOption(asc.options_t.confidence)
>>>
>>> asc.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>>
>>> asc.readArpa('./lm.arpa')
>>>
>>> a = asc.perplexity("неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор???с лязгом выкатился и остановился возле мальчика....")
>>>
>>> print(a.logprob)
-30.906542
>>>
>>> print(a.oovs)
0
>>>
>>> print(a.words)
24
>>>
>>> print(a.sentences)
2
>>>
>>> print(a.zeroprobs)
7
>>>
>>> print(a.ppl)
17.229063831108224
>>>
>>> print(a.ppl1)
19.398698060810077
>>>
>>> b = asc.pplByFiles("./text.txt")
>>>
>>> c = asc.pplConcatenate(a, b)
>>>
>>> print(c.ppl)
7.384123548831112
Description
Name |
Description |
ppl |
The meaning of perplexity without considering the beginning of the sentence |
ppl1 |
The meaning of perplexion taking into account the beginning of the sentence |
oovs |
Count of oov words |
words |
Count of words in sentence |
logprob |
Word sequence frequency |
sentences |
Count of sequences |
zeroprobs |
Count of zero probs |
Methods:
- tokenization - Method for breaking text into tokens
Example:
>>> import asc
>>>
>>> def tokensFn(word, context, reset, stop):
... print(word, " => ", context)
... return True
...
>>> asc.switchAllowApostrophe()
>>>
>>> asc.tokenization("«On nous dit qu'aujourd'hui c'est le cas, encore faudra-t-il l'évaluer» l'astronomie", tokensFn)
« => []
On => ['«']
nous => ['«', 'On']
dit => ['«', 'On', 'nous']
qu'aujourd'hui => ['«', 'On', 'nous', 'dit']
c'est => ['«', 'On', 'nous', 'dit', "qu'aujourd'hui"]
le => ['«', 'On', 'nous', 'dit', "qu'aujourd'hui", "c'est"]
cas => ['«', 'On', 'nous', 'dit', "qu'aujourd'hui", "c'est", 'le']
, => ['«', 'On', 'nous', 'dit', "qu'aujourd'hui", "c'est", 'le', 'cas']
encore => ['«', 'On', 'nous', 'dit', "qu'aujourd'hui", "c'est", 'le', 'cas', ',']
faudra-t-il => ['«', 'On', 'nous', 'dit', "qu'aujourd'hui", "c'est", 'le', 'cas', ',', 'encore']
l => ['«', 'On', 'nous', 'dit', "qu'aujourd'hui", "c'est", 'le', 'cas', ',', 'encore', 'faudra-t-il', 'l']
' => ['«', 'On', 'nous', 'dit', "qu'aujourd'hui", "c'est", 'le', 'cas', ',', 'encore', 'faudra-t-il', 'l']
évaluer => ['«', 'On', 'nous', 'dit', "qu'aujourd'hui", "c'est", 'le', 'cas', ',', 'encore', 'faudra-t-il', 'l', "'"]
» => ['«', 'On', 'nous', 'dit', "qu'aujourd'hui", "c'est", 'le', 'cas', ',', 'encore', 'faudra-t-il', 'l', "'", 'évaluer']
l => ['«', 'On', 'nous', 'dit', "qu'aujourd'hui", "c'est", 'le', 'cas', ',', 'encore', 'faudra-t-il', 'l', "'", 'évaluer', '»']
' => ['«', 'On', 'nous', 'dit', "qu'aujourd'hui", "c'est", 'le', 'cas', ',', 'encore', 'faudra-t-il', 'l', "'", 'évaluer', '»', 'l']
astronomie => ['«', 'On', 'nous', 'dit', "qu'aujourd'hui", "c'est", 'le', 'cas', ',', 'encore', 'faudra-t-il', 'l', "'", 'évaluer', '»', 'l', "'"]
Methods:
- setTokenizerFn - Method for set the function of an external tokenizer
Example:
>>> import asc
>>>
>>> def tokenizerFn(text, callback):
... word = ""
... context = []
... for letter in text:
... if letter == " " and len(word) > 0:
... if not callback(word, context, False, False): return
... context.append(word)
... word = ""
... elif letter == "." or letter == "!" or letter == "?":
... if not callback(word, context, True, False): return
... word = ""
... context = []
... else:
... word += letter
... if len(word) > 0:
... if not callback(word, context, False, True): return
...
>>> def tokensFn(word, context, reset, stop):
... print(word, " => ", context)
... return True
...
>>> asc.setTokenizerFn(tokenizerFn)
>>>
>>> asc.tokenization("Hello World today!", tokensFn)
Hello => []
World => ['Hello']
today => ['Hello', 'World']
Methods:
- sentences - Sentences generation method
- sentencesToFile - Method for assembling a specified number of sentences and writing to a file
Example:
>>> import asc
>>>
>>> def sentencesFn(text):
... print("Sentences:", text)
... return True
...
>>> asc.setOption(asc.options_t.confidence)
>>>
>>> asc.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>>
>>> asc.readArpa('./lm.arpa')
>>>
>>> asc.sentences(sentencesFn)
Sentences: <s> В общем </s>
Sentences: <s> С лязгом выкатился и остановился возле мальчика </s>
Sentences: <s> У меня нет </s>
Sentences: <s> Я вообще не хочу </s>
Sentences: <s> Да и в общем </s>
Sentences: <s> Не могу </s>
Sentences: <s> Ну в общем </s>
Sentences: <s> Так что я вообще не хочу </s>
Sentences: <s> Потому что я вообще не хочу </s>
Sentences: <s> Продолжение следует </s>
Sentences: <s> Неожиданно из подворотни в олега ударил яркий прожектор патрульный трактор </s>
>>>
>>> asc.sentencesToFile(5, "./result.txt")
Methods:
- fixUppers - Method for correcting registers in the text
- fixUppersByFiles - Method for correcting text registers in a text file
Example:
>>> import asc
>>>
>>> asc.setOption(asc.options_t.confidence)
>>>
>>> asc.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>>
>>> asc.readArpa('./lm.arpa')
>>>
>>> asc.fixUppers("неожиданно из подворотни в олега ударил яркий прожектор патрульный трактор???с лязгом выкатился и остановился возле мальчика....")
'Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор??? С лязгом выкатился и остановился возле мальчика....'
>>>
>>> asc.fixUppersByFiles("./corpus", "./result.txt", "txt")
Methods:
- checkHypLat - Hyphen and latin character search method
Example:
>>> import asc
>>>
>>> asc.checkHypLat("Hello-World")
(True, True)
>>>
>>> asc.checkHypLat("Hello")
(False, True)
>>>
>>> asc.checkHypLat("Привет")
(False, False)
>>>
>>> asc.checkHypLat("так-как")
(True, False)
Methods:
- getUppers - Method for extracting registers for each word
- countLetter - Method for counting the amount of a specific letter in a word
Example:
>>> import asc
>>>
>>> asc.setOption(asc.options_t.confidence)
>>>
>>> asc.readArpa('./lm.arpa')
>>>
>>> asc.idw("Living")
10493385932
>>>
>>> asc.idw("in")
3301
>>>
>>> asc.idw("the")
217280
>>>
>>> asc.idw("USA")
188643
>>>
>>> asc.getUppers([10493385932, 3301, 217280, 188643])
[1, 0, 0, 7]
>>>
>>> asc.countLetter("hello-world", "-")
1
>>>
>>> asc.countLetter("hello-world", "l")
3
Methods:
- urls - Method for extracting URL address coordinates in a string
Example:
>>> import asc
>>>
>>> asc.urls("This website: example.com was designed with ...")
{14: 25}
>>>
>>> asc.urls("This website: https://a.b.c.example.net?id=52#test-1 was designed with ...")
{14: 52}
>>>
>>> asc.urls("This website: https://a.b.c.example.net?id=52#test-1 and 127.0.0.1 was designed with ...")
{14: 52, 57: 66}
Methods:
- roman2Arabic - Method for translating Roman numerals to Arabic
Example:
>>> import asc
>>>
>>> asc.roman2Arabic("XVI")
16
Methods:
- rest - Method for correction and detection of words with mixed alphabets
- setSubstitutes - Method for set letters to correct words from mixed alphabets
- getSubstitutes - Method of extracting letters to correct words from mixed alphabets
Example:
>>> import asc
>>>
>>> asc.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>>
>>> asc.setSubstitutes({'p':'р','c':'с','o':'о','t':'т','k':'к','e':'е','a':'а','h':'н','x':'х','b':'в','m':'м'})
>>>
>>> asc.getSubstitutes()
{'a': 'а', 'b': 'в', 'c': 'с', 'e': 'е', 'h': 'н', 'k': 'к', 'm': 'м', 'o': 'о', 'p': 'р', 't': 'т', 'x': 'х'}
>>>
>>> str = "ПPИBETИК"
>>>
>>> str.lower()
'пpиbetик'
>>>
>>> asc.rest(str)
'приветик'
Methods:
- setTokensDisable - Method for set the list of forbidden tokens
- setTokensUnknown - Method for set the list of tokens cast to 〈unk〉
- setTokenDisable - Method for set the list of unidentifiable tokens
- setTokenUnknown - Method of set the list of tokens that need to be identified as 〈unk〉
- getTokensDisable - Method for retrieving the list of forbidden tokens
- getTokensUnknown - Method for extracting a list of tokens reducible to 〈unk〉
- setAllTokenDisable - Method for set all tokens as unidentifiable
- setAllTokenUnknown - The method of set all tokens identified as 〈unk〉
Example:
>>> import asc
>>>
>>> asc.idw("<date>")
6
>>>
>>> asc.idw("<time>")
7
>>>
>>> asc.idw("<abbr>")
5
>>>
>>> asc.idw("<math>")
9
>>>
>>> asc.setTokenDisable("date|time|abbr|math")
>>>
>>> asc.getTokensDisable()
{9, 5, 6, 7}
>>>
>>> asc.setTokensDisable({6, 7, 5, 9})
>>>
>>> asc.setTokenUnknown("date|time|abbr|math")
>>>
>>> asc.getTokensUnknown()
{9, 5, 6, 7}
>>>
>>> asc.setTokensUnknown({6, 7, 5, 9})
>>>
>>> asc.setAllTokenDisable()
>>>
>>> asc.getTokensDisable()
{2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 23}
>>>
>>> asc.setAllTokenUnknown()
>>>
>>> asc.getTokensUnknown()
{2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 23}
Methods:
- countAlphabet - Method of obtaining the number of letters in the dictionary
Example:
>>> import asc
>>>
>>> asc.getAlphabet()
'abcdefghijklmnopqrstuvwxyz'
>>>
>>> asc.countAlphabet()
26
>>>
>>> asc.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>>
>>> asc.countAlphabet()
59
Methods:
- countBigrams - Method get count bigrams
- countTrigrams - Method get count trigrams
- countGrams - Method get count N-gram by lm size
Example:
>>> import asc
>>>
>>> asc.setOption(asc.options_t.confidence)
>>>
>>> asc.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>>
>>> asc.readArpa('./lm.arpa')
>>>
>>> asc.countBigrams("неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор???с лязгом выкатился и остановился возле мальчика....")
12
>>>
>>> asc.countTrigrams("неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор???с лязгом выкатился и остановился возле мальчика....")
10
>>>
>>> asc.size()
3
>>>
>>> asc.countGrams("неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор???с лязгом выкатился и остановился возле мальчика....")
10
>>>
>>> asc.idw("неожиданно")
3263936167
>>>
>>> asc.idw("из")
5134
>>>
>>> asc.idw("подворотни")
12535356101
>>>
>>> asc.idw("в")
53
>>>
>>> asc.idw("Олега")
2824508300
>>>
>>> asc.idw("ударил")
24816796913
>>>
>>> asc.countBigrams([3263936167, 5134, 12535356101, 53, 2824508300, 24816796913])
5
>>>
>>> asc.countTrigrams([3263936167, 5134, 12535356101, 53, 2824508300, 24816796913])
4
>>>
>>> asc.countGrams([3263936167, 5134, 12535356101, 53, 2824508300, 24816796913])
4
Methods:
- arabic2Roman - Convert arabic number to roman number
Example:
>>> import asc
>>>
>>> asc.arabic2Roman(23)
'XXIII'
>>>
>>> asc.arabic2Roman("33")
'XXXIII'
Methods:
- setThreads - Method for set the number of threads (0 - all threads)
Example:
>>> import asc
>>>
>>> asc.setOption(asc.options_t.confidence)
>>>
>>> asc.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>>
>>> asc.readArpa('./lm.arpa')
>>>
>>> asc.setThreads(3)
>>>
>>> a = asc.pplByFiles("./text.txt")
>>>
>>> print(a.logprob)
-48201.29481399994
Methods:
- fti - Method for removing the fractional part of a number
Example:
>>> import asc
>>>
>>> asc.fti(5892.4892)
5892489200000
>>>
>>> asc.fti(5892.4892, 4)
58924892
Methods:
- context - Method for assembling text context from a sequence
Example:
>>> import asc
>>>
>>> asc.setOption(asc.options_t.confidence)
>>>
>>> asc.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>>
>>> asc.readArpa('./lm.arpa')
>>>
>>> asc.idw("неожиданно")
3263936167
>>>
>>> asc.idw("из")
5134
>>>
>>> asc.idw("подворотни")
12535356101
>>>
>>> asc.idw("в")
53
>>>
>>> asc.idw("Олега")
2824508300
>>>
>>> asc.idw("ударил")
24816796913
>>>
>>> asc.context([3263936167, 5134, 12535356101, 53, 2824508300, 24816796913])
'Неожиданно из подворотни в Олега ударил'
Methods:
- isAbbr - Method of checking a word for compliance with an abbreviation
- isSuffix - Method for checking a word for a suffix of a numeric abbreviation
- isToken - Method for checking if an identifier matches a token
- isIdWord - Method for checking if an identifier matches a word
Example:
>>> import asc
>>>
>>> asc.setOption(asc.options_t.confidence)
>>>
>>> asc.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>>
>>> asc.readArpa('./lm.arpa')
>>>
>>> asc.addAbbr("США")
>>>
>>> asc.isAbbr("сша")
True
>>>
>>> asc.addSuffix("1-я")
>>>
>>> asc.isSuffix("1-я")
True
>>>
>>> asc.isToken(asc.idw("США"))
True
>>>
>>> asc.isToken(asc.idw("1-я"))
True
>>>
>>> asc.isToken(asc.idw("125"))
True
>>>
>>> asc.isToken(asc.idw("<s>"))
True
>>>
>>> asc.isToken(asc.idw("Hello"))
False
>>>
>>> asc.isIdWord(asc.idw("https://anyks.com"))
True
>>>
>>> asc.isIdWord(asc.idw("Hello"))
True
>>>
>>> asc.isIdWord(asc.idw("-"))
False
Methods:
- findByFiles - Method search N-grams in a text file
Example:
>>> import asc
>>>
>>> asc.setOption(asc.options_t.debug)
>>>
>>> asc.setOption(asc.options_t.confidence)
>>>
>>> asc.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>>
>>> asc.readArpa('./lm.arpa')
>>>
>>> asc.findByFiles("./text.txt", "./result.txt")
info: <s> Кукай
сари кукай
сари японские
японские каллиграфы
каллиграфы я
я постоянно
постоянно навещал
навещал их
их тайно
тайно от
от людей
людей </s>
info: <s> Неожиданно из
Неожиданно из подворотни
из подворотни в
подворотни в Олега
в Олега ударил
Олега ударил яркий
ударил яркий прожектор
яркий прожектор патрульный
прожектор патрульный трактор
патрульный трактор
<s> С лязгом
С лязгом выкатился
лязгом выкатился и
выкатился и остановился
и остановился возле
остановился возле мальчика
возле мальчика
Methods:
- checkSequence - Sequence Existence Method
- existSequence - Method for checking the existence of a sequence, excluding non-word tokens
- checkByFiles - Method for checking if a sequence exists in a text file
Example:
>>> import asc
>>>
>>> asc.setOption(asc.options_t.debug)
>>>
>>> asc.setOption(asc.options_t.confidence)
>>>
>>> asc.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>>
>>> asc.readArpa('./lm.arpa')
>>>
>>> asc.addAbbr("США")
>>>
>>> asc.isAbbr("сша")
>>>
>>> asc.checkSequence("Неожиданно из подворотни в олега ударил")
True
>>>
>>> asc.checkSequence("Сегодня сыграл и в Олега ударил яркий прожектор патрульный трактор с корпоративным сектором")
True
>>>
>>> asc.checkSequence("Сегодня сыграл и в Олега ударил яркий прожектор патрульный трактор с корпоративным сектором", True)
True
>>>
>>> asc.checkSequence("в Олега ударил яркий")
True
>>>
>>> asc.checkSequence("в Олега ударил яркий", True)
True
>>>
>>> asc.checkSequence("от госсекретаря США")
True
>>>
>>> asc.checkSequence("от госсекретаря США", True)
True
>>>
>>> asc.checkSequence("Неожиданно из подворотни в олега ударил", 2)
True
>>>
>>> asc.checkSequence(["Неожиданно","из","подворотни","в","олега","ударил"], 2)
True
>>>
>>> asc.existSequence("<s> Сегодня сыграл и в, Олега ударил яркий прожектор, патрульный трактор - с корпоративным сектором </s>", 2)
(True, 0)
>>>
>>> asc.existSequence(["<s>","Сегодня","сыграл","и","в",",","Олега","ударил","яркий","прожектор",",","патрульный","трактор","-","с","корпоративным","сектором","</s>"], 2)
(True, 2)
>>>
>>> asc.idw("от")
6086
>>>
>>> asc.idw("госсекретаря")
51273912082
>>>
>>> asc.idw("США")
5
>>>
>>> asc.checkSequence([6086, 51273912082, 5])
True
>>>
>>> asc.checkSequence([6086, 51273912082, 5], True)
True
>>>
>>> asc.checkSequence(["от", "госсекретаря", "США"])
True
>>>
>>> asc.checkSequence(["от", "госсекретаря", "США"], True)
True
>>>
>>> asc.checkByFiles("./text.txt", "./result.txt")
info: 1999 | YES | Какой-то период времени мы вообще не общались
info: 2000 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор.С лязгом выкатился и остановился возле мальчика.
info: 2001 | YES | Так как эти яйца жалко есть а хочется все больше любоваться их можно покрыть лаком даже прозрачным лаком для ногтей
info: 2002 | NO | кукай <unk> <unk> сари кукай <unk> <unk> сари японские каллиграфы я постоянно навещал их тайно от людей
info: 2003 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор???С лязгом выкатился и остановился возле мальчика....
info: 2004 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор?С лязгом выкатился и остановился возле мальчика.
info: 2005 | YES | Сегодня яичницей никто не завтракал как впрочем и вчера на ближайшем к нам рынке мы ели фруктовый салат со свежевыжатым соком как в старые добрые времена в Бразилии
info: 2006 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор!С лязгом выкатился и остановился возле мальчика.
info: 2007 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор.с лязгом выкатился и остановился возле мальчика.
All texts: 2007
Exists texts: 1359
Not exists texts: 648
>>>
>>> asc.checkByFiles("./corpus", "./result.txt", False, "txt")
info: 1999 | YES | Какой-то период времени мы вообще не общались
info: 2000 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор.С лязгом выкатился и остановился возле мальчика.
info: 2001 | YES | Так как эти яйца жалко есть а хочется все больше любоваться их можно покрыть лаком даже прозрачным лаком для ногтей
info: 2002 | NO | кукай <unk> <unk> сари кукай <unk> <unk> сари японские каллиграфы я постоянно навещал их тайно от людей
info: 2003 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор???С лязгом выкатился и остановился возле мальчика....
info: 2004 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор?С лязгом выкатился и остановился возле мальчика.
info: 2005 | YES | Сегодня яичницей никто не завтракал как впрочем и вчера на ближайшем к нам рынке мы ели фруктовый салат со свежевыжатым соком как в старые добрые времена в Бразилии
info: 2006 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор!С лязгом выкатился и остановился возле мальчика.
info: 2007 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор.с лязгом выкатился и остановился возле мальчика.
All texts: 2007
Exists texts: 1359
Not exists texts: 648
>>>
>>> asc.checkByFiles("./corpus", "./result.txt", True, "txt")
info: 2000 | NO | Так как эти яйца жалко есть а хочется все больше любоваться их можно покрыть лаком даже прозрачным лаком для ногтей
info: 2001 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор.С лязгом выкатился и остановился возле мальчика.
info: 2002 | NO | Сегодня яичницей никто не завтракал как впрочем и вчера на ближайшем к нам рынке мы ели фруктовый салат со свежевыжатым соком как в старые добрые времена в Бразилии
info: 2003 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор!С лязгом выкатился и остановился возле мальчика.
info: 2004 | NO | кукай <unk> <unk> сари кукай <unk> <unk> сари японские каллиграфы я постоянно навещал их тайно от людей
info: 2005 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор?С лязгом выкатился и остановился возле мальчика.
info: 2006 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор???С лязгом выкатился и остановился возле мальчика....
info: 2007 | NO | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор.с лязгом выкатился и остановился возле мальчика.
All texts: 2007
Exists texts: 0
Not exists texts: 2007
Methods:
- check - String Check Method
- match - String Matching Method
- addAbbr - Method add abbreviation
- addSuffix - Method add number suffix abbreviation
- setSuffixes - Method set number suffix abbreviations
- readSuffix - Method for reading data from a file of suffixes and abbreviations
Example:
>>> import asc
>>>
>>> asc.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>> asc.setSubstitutes({'p':'р','c':'с','o':'о','t':'т','k':'к','e':'е','a':'а','h':'н','x':'х','b':'в','m':'м'})
>>>
>>> asc.check("Дом-2", asc.check_t.home2)
True
>>>
>>> asc.check("Дом2", asc.check_t.home2)
False
>>>
>>> asc.check("Дом-2", asc.check_t.latian)
False
>>>
>>> asc.check("Hello", asc.check_t.latian)
True
>>>
>>> asc.check("прiвет", asc.check_t.latian)
True
>>>
>>> asc.check("Дом-2", asc.check_t.hyphen)
True
>>>
>>> asc.check("Дом2", asc.check_t.hyphen)
False
>>>
>>> asc.check("Д", asc.check_t.letter)
True
>>>
>>> asc.check("$", asc.check_t.letter)
False
>>>
>>> asc.check("-", asc.check_t.letter)
False
>>>
>>> asc.check("просtоквaшино", asc.check_t.similars)
True
>>>
>>> asc.match("my site http://example.ru, it's true", asc.match_t.url)
True
>>>
>>> asc.match("по вашему ip адресу 46.40.123.12 проводится проверка", asc.match_t.url)
True
>>>
>>> asc.match("мой адрес в формате IPv6: http://[2001:0db8:11a3:09d7:1f34:8a2e:07a0:765d]/", asc.match_t.url)
True
>>>
>>> asc.match("13-я", asc.match_t.abbr)
True
>>>
asc.match("13-я-й", asc.match_t.abbr)
False
>>>
asc.match("т.д", asc.match_t.abbr)
True
>>>
asc.match("т.п.", asc.match_t.abbr)
True
>>>
>>> asc.match("С.Ш.А.", asc.match_t.abbr)
True
>>>
>>> asc.addAbbr("сша")
>>> asc.match("США", asc.match_t.abbr)
True
>>>
>>> asc.addSuffix("15-летия")
>>> asc.match("15-летия", asc.match_t.abbr)
True
>>>
>>> asc.getSuffixes()
{3139900457}
>>>
>>> asc.idw("лет")
328041
>>>
>>> asc.idw("тых")
352214
>>>
>>> asc.setSuffixes({328041, 352214})
>>>
>>> asc.getSuffixes()
{328041, 352214}
>>>
>>> def status(status):
... print(status)
...
>>> asc.readSuffix("./suffix.abbr", status)
>>>
>>> asc.match("15-лет", asc.match_t.abbr)
True
>>>
>>> asc.match("20-тых", asc.match_t.abbr)
True
>>>
>>> asc.match("15-летия", asc.match_t.abbr)
False
>>>
>>> asc.match("Hello", asc.match_t.latian)
True
>>>
>>> asc.match("прiвет", asc.match_t.latian)
False
>>>
>>> asc.match("23424", asc.match_t.number)
True
>>>
>>> asc.match("hello", asc.match_t.number)
False
>>>
>>> asc.match("23424.55", asc.match_t.number)
False
>>>
>>> asc.match("23424", asc.match_t.decimal)
False
>>>
>>> asc.match("23424.55", asc.match_t.decimal)
True
>>>
>>> asc.match("23424,55", asc.match_t.decimal)
True
>>>
>>> asc.match("-23424.55", asc.match_t.decimal)
True
>>>
>>> asc.match("+23424.55", asc.match_t.decimal)
True
>>>
>>> asc.match("+23424.55", asc.match_t.anumber)
True
>>>
>>> asc.match("15T-34", asc.match_t.anumber)
True
>>>
>>> asc.match("hello", asc.match_t.anumber)
False
>>>
>>> asc.match("hello", asc.match_t.allowed)
True
>>>
>>> asc.match("évaluer", asc.match_t.allowed)
False
>>>
>>> asc.match("13", asc.match_t.allowed)
True
>>>
>>> asc.match("Hello-World", asc.match_t.allowed)
True
>>>
>>> asc.match("Hello", asc.match_t.math)
False
>>>
>>> asc.match("+", asc.match_t.math)
True
>>>
>>> asc.match("=", asc.match_t.math)
True
>>>
>>> asc.match("Hello", asc.match_t.upper)
True
>>>
>>> asc.match("hello", asc.match_t.upper)
False
>>>
>>> asc.match("hellO", asc.match_t.upper)
False
>>>
>>> asc.match("a", asc.match_t.punct)
False
>>>
>>> asc.match(",", asc.match_t.punct)
True
>>>
>>> asc.match(" ", asc.match_t.space)
True
>>>
>>> asc.match("a", asc.match_t.space)
False
>>>
>>> asc.match("a", asc.match_t.special)
False
>>>
>>> asc.match("±", asc.match_t.special)
False
>>>
>>> asc.match("[", asc.match_t.isolation)
True
>>>
>>> asc.match("a", asc.match_t.isolation)
False
>>>
>>> asc.match("a", asc.match_t.greek)
False
>>>
>>> asc.match("Ψ", asc.match_t.greek)
True
>>>
>>> asc.match("->", asc.match_t.route)
False
>>>
>>> asc.match("⇔", asc.match_t.route)
True
>>>
>>> asc.match("a", asc.match_t.letter)
True
>>>
>>> asc.match("!", asc.match_t.letter)
False
>>>
>>> asc.match("!", asc.match_t.pcards)
False
>>>
>>> asc.match("♣", asc.match_t.pcards)
True
>>>
>>> asc.match("p", asc.match_t.currency)
False
>>>
>>> asc.match("$", asc.match_t.currency)
True
>>>
>>> asc.match("€", asc.match_t.currency)
True
>>>
>>> asc.match("₽", asc.match_t.currency)
True
>>>
>>> asc.match("₿", asc.match_t.currency)
True
Methods:
- delInText - Method for delete letter in text
Example:
>>> import asc
>>>
>>> asc.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>>
>>> asc.delInText("неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор??? с лязгом выкатился и остановился возле мальчика....", asc.wdel_t.punct)
'неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор с лязгом выкатился и остановился возле мальчика'
>>>
>>> asc.delInText("hello-world-hello-world", asc.wdel_t.hyphen)
'helloworldhelloworld'
>>>
>>> asc.delInText("неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор??? с лязгом выкатился и остановился возле мальчика....", asc.wdel_t.broken)
'неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор с лязгом выкатился и остановился возле мальчика'
>>>
>>> asc.delInText("«On nous dit qu'aujourd'hui c'est le cas, encore faudra-t-il l'évaluer» l'astronomie", asc.wdel_t.broken)
"On nous dit qu'aujourd'hui c'est le cas encore faudra-t-il l'valuer l'astronomie"
Methods:
- countsByFiles - Method for counting the number of n-grams in a text file
Example:
>>> import asc
>>>
>>> asc.setOption(asc.options_t.debug)
>>>
>>> asc.setOption(asc.options_t.confidence)
>>>
>>> asc.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>>
>>> asc.readArpa('./lm.arpa')
>>>
>>> asc.countsByFiles("./text.txt", "./result.txt", 3)
info: 0 | Сегодня яичницей никто не завтракал как впрочем и вчера на ближайшем к нам рынке мы ели фруктовый салат со свежевыжатым соком как в старые добрые времена в Бразилии
info: 10 | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор?С лязгом выкатился и остановился возле мальчика.
info: 10 | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор!С лязгом выкатился и остановился возле мальчика.
info: 0 | Так как эти яйца жалко есть а хочется все больше любоваться их можно покрыть лаком даже прозрачным лаком для ногтей
info: 10 | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор???С лязгом выкатился и остановился возле мальчика....
Counts 3grams: 471
>>>
>>> asc.countsByFiles("./corpus", "./result.txt", 2, "txt")
info: 19 | Так как эти яйца жалко есть а хочется все больше любоваться их можно покрыть лаком даже прозрачным лаком для ногтей
info: 12 | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор.с лязгом выкатился и остановился возле мальчика.
info: 12 | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор!С лязгом выкатился и остановился возле мальчика.
info: 10 | кукай <unk> <unk> сари кукай <unk> <unk> сари японские каллиграфы я постоянно навещал их тайно от людей
info: 12 | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор???С лязгом выкатился и остановился возле мальчика....
info: 12 | Неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор?С лязгом выкатился и остановился возле мальчика.
info: 27 | Сегодня яичницей никто не завтракал как впрочем и вчера на ближайшем к нам рынке мы ели фруктовый салат со свежевыжатым соком как в старые добрые времена в Бразилии
Counts 2grams: 20270
Description
N-gram size |
Description |
1 |
language model size |
2 |
bigram |
3 |
trigram |