Preparing Russian hockey news for machine learning
No Water - Ice Only
Unify -> Simplify -> Preprocess text and feed your neural model.
Installation
Khl is available on PyPI:
$ pip install khl
It requires Python 3.8+ to run.
Usage
To get started right away with basic usage:
from khl import text_to_codes
coder = {
    '': 0,  # placeholder
    '???': 1,  # unknown
    '.': 2,
    'и': 3,
    'в': 4,
    '-': 5,
    ':': 6,
    'матч': 7,
    'за': 8,
    'забить': 9,
    'гол': 10,
    'per': 11,  # person entity
    'org': 12,  # organization entity
    'loc': 13,  # location entity
    'date': 14,  # date entity
    'против': 15,
    'год': 16,
    'pers': 17,  # several persons entity
    'orgs': 18,  # several organizations entity
    'свой': 19
}
text = """
1 апреля 2023 года в Москве в матче ⅛ финала против „Спартака” Иван Иванов забил свой 100—й гол за карьеру.
«Динамо Мск» - «Спартак» 2:1 ОТ (1:0 0:1 0:0 1:0) Голы забили: Иванов, Петров и Сидоров.
"""
codes = text_to_codes(
    text=text,
    coder=coder,
    stop_words_=["за", "и", "свой"],  # stop words to drop
    replace_ners_=True,  # replace named entities ("Иван Иванов" -> "per", "Спартак" -> "org", "Москва" -> "loc")
    replace_dates_=True,  # replace dates ("1 апреля 2023 года" -> "date")
    replace_penalties_=True,  # replace penalties ("5+20" -> "pen")
    exclude_unknown=True,  # drop lemmas that are not present in coder
    max_len=20,  # return a sequence of exactly 20 codes
)
# codes = [0, 0, 0, 14, 4, 13, 4, 7, 15, 12, 11, 9, 10, 2, 18, 10, 9, 6, 17, 2]
text_to_codes is a very high-level function. To see what happens under the hood, see Lower level usage.
What is coder?
coder is just a dictionary in which each lemma is represented by a unique integer code.
Note that the first two elements are reserved for the placeholder and unknown elements.
A coder can be obtained from a frequency dictionary file (see Get lemmas coder).
A frequency dictionary file is a JSON file containing a dictionary in which each key is a lemma and each value is the number of times that lemma occurred in your whole dataset.
Preferably, it should be sorted in descending order of values.
example_frequency_dictionary.json:
{
    ".": 1000,
    "и": 500,
    "в": 400,
    "-": 300,
    ":": 300,
    "матч": 290,
    "за": 250,
    "забить": 240,
    "гол": 230,
    "per": 200,
    "org": 150,
    "loc": 150,
    "date": 100,
    "против": 90,
    "год": 70,
    "pers": 40,
    "orgs": 30,
    "свой": 20
}
You can build and use your own frequency dictionary or download the dictionary created by the author.
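If you want to build your own frequency dictionary, the idea can be sketched as follows. This is a minimal illustration, not the library's implementation: the function names build_frequency_dictionary and build_coder are hypothetical, and the rule "codes 0 and 1 are reserved, the rest follow descending frequency" is inferred from the example coder and example_frequency_dictionary.json above.

```python
import json
from collections import Counter

PLACEHOLDER = ""   # reserved code 0, used for padding
UNKNOWN = "???"    # reserved code 1, used for lemmas missing from the coder


def build_frequency_dictionary(lemmatized_texts):
    """Count how often each lemma occurs across all lemmatized texts.

    Hypothetical helper: not part of khl.
    """
    counts = Counter()
    for lemmas in lemmatized_texts:
        counts.update(lemmas)
    # most_common() yields (lemma, count) pairs in descending order of count
    return dict(counts.most_common())


def build_coder(frequency_dictionary):
    """Assign codes 2, 3, ... to lemmas in descending order of frequency.

    Hypothetical helper: an assumption about how a coder could be derived.
    """
    coder = {PLACEHOLDER: 0, UNKNOWN: 1}
    ranked = sorted(frequency_dictionary.items(), key=lambda item: -item[1])
    for lemma, _count in ranked:
        # setdefault keeps the reserved codes if a lemma collides with them
        coder.setdefault(lemma, len(coder))
    return coder


# Two tiny, already lemmatized "texts"
corpus = [
    ["матч", ".", "гол", "."],
    ["гол", ".", "per"],
]
frequencies = build_frequency_dictionary(corpus)
coder = build_coder(frequencies)

# ensure_ascii=False keeps Cyrillic lemmas readable in the JSON output
print(json.dumps(frequencies, ensure_ascii=False, indent=4))
print(coder)
```

Writing the result of json.dumps to a file gives a frequency dictionary in the same shape as example_frequency_dictionary.json.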
Lower level usage
1. Make imports
from khl import stop_words
from khl import utils
from khl import preprocess
2. Get lemmas coder
coder = preprocess.get_coder("example_frequency_dictionary.json")
3. Define text
text = """
1 апреля 2023 года в Москве в матче ⅛ финала против „Спартака” Иван Иванов забил свой 100—й гол за карьеру.
«Динамо Мск» - «Спартак» 2:1 ОТ (1:0 0:1 0:0 1:0) Голы забили: Иванов, Петров и Сидоров.
"""
4. Unify
unified_text = utils.unify(text)
# "1 апреля 2023 года в Москве в матче 1/8 финала против 'Спартака' Иван Иванов забил свой 100-й гол за карьеру. 'Динамо Мск' - 'Спартак' 2:1 ОТ (1:0 0:1 0:0 1:0) Голы забили: Иванов, Петров и Сидоров."
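To make the unification step more concrete, here is a rough sketch of the kind of normalization the example output suggests. It is an assumption, not the code behind utils.unify: the name unify_sketch and the replacement table are hypothetical and cover only the characters that appear in the example text.

```python
# Hypothetical replacement table: only the characters seen in the example text
REPLACEMENTS = {
    "⅛": "1/8",  # vulgar fraction -> plain ASCII fraction
    "„": "'",    # all quote variants -> a single quote style
    "”": "'",
    "«": "'",
    "»": "'",
    "—": "-",    # long dash -> hyphen
}
TRANSLATION_TABLE = str.maketrans(REPLACEMENTS)


def unify_sketch(text):
    """Normalize typographic variants and collapse whitespace."""
    translated = text.translate(TRANSLATION_TABLE)
    # split() without arguments strips the ends and treats any run of
    # spaces or newlines as a single separator
    return " ".join(translated.split())


sample = "\n„Спартака” забил 100—й гол в ⅛ финала.\n«Динамо Мск» - «Спартак» 2:1\n"
unified = unify_sketch(sample)
print(unified)
```

The point of this step is that different spellings of the same symbol end up as one token, so the vocabulary stays small.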
5. Simplify
simplified_text = utils.simplify(
    text=unified_text,
    replace_ners_=True,
    replace_dates_=True,
    replace_penalties_=True,
)
# 'date в loc в матче финала против org per забил свой гол за карьеру. org org Голы забили: per per per.'
6. Lemmatize
lemmas = preprocess.lemmatize(text=simplified_text, stop_words_=stop_words)
# ['date', 'в', 'loc', 'в', 'матч', 'финал', 'против', 'org', 'per', 'забить', 'гол', 'карьера', '.', 'orgs', 'гол', 'забить', ':', 'pers', '.']
7. Transform to codes
codes = preprocess.lemmas_to_codes(
    lemmas=lemmas,
    coder=coder,
    exclude_unknown=True,
    max_len=20,
)
# [0, 0, 0, 14, 4, 13, 4, 7, 15, 12, 11, 9, 10, 2, 18, 10, 9, 6, 17, 2]
8. Transform codes back to lemmas (just to see which lemmas are present in the code sequence)
print(
    preprocess.codes_to_lemmas(codes=codes, coder=coder)
)
# ['', '', '', 'date', 'в', 'loc', 'в', 'матч', 'против', 'org', 'per', 'забить', 'гол', '.', 'orgs', 'гол', 'забить', ':', 'pers', '.']
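The encoding and decoding in steps 7 and 8 can be approximated in plain Python. The sketch below is not the library's code: lemmas_to_codes_sketch and codes_to_lemmas_sketch are hypothetical names, left-padding is inferred from the example output, and keeping the first max_len codes of a longer sequence is purely an assumption.

```python
def lemmas_to_codes_sketch(lemmas, coder, exclude_unknown=True, max_len=20):
    """Map lemmas to integer codes and force the result to length max_len."""
    placeholder_code = coder[""]
    unknown_code = coder["???"]
    codes = []
    for lemma in lemmas:
        code = coder.get(lemma, unknown_code)
        if exclude_unknown and code == unknown_code:
            continue  # drop lemmas that are not present in the coder
        codes.append(code)
    # Assumption: a sequence that is too long keeps its first max_len codes
    codes = codes[:max_len]
    # Left-pad with the placeholder code, as in the example output
    return [placeholder_code] * (max_len - len(codes)) + codes


def codes_to_lemmas_sketch(codes, coder):
    """Invert the coder to turn codes back into lemmas."""
    decoder = {code: lemma for lemma, code in coder.items()}
    return [decoder[code] for code in codes]


# Same vocabulary and lemmas as in the examples above
vocabulary = ["", "???", ".", "и", "в", "-", ":", "матч", "за", "забить",
              "гол", "per", "org", "loc", "date", "против", "год", "pers",
              "orgs", "свой"]
coder = {lemma: code for code, lemma in enumerate(vocabulary)}
lemmas = ["date", "в", "loc", "в", "матч", "финал", "против", "org", "per",
          "забить", "гол", "карьера", ".", "orgs", "гол", "забить", ":",
          "pers", "."]

codes = lemmas_to_codes_sketch(lemmas, coder, exclude_unknown=True, max_len=20)
restored = codes_to_lemmas_sketch(codes, coder)
print(codes)
print(restored)
```

With exclude_unknown=True the lemmas 'финал' and 'карьера' disappear because they are not in the coder, which is why the restored sequence is shorter than the original list of lemmas before padding.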