Preparing Russian hockey news for machine learning

No Water - Ice Only

Preparing Russian hockey news for machine learning.

Unify -> Simplify -> Preprocess text and feed your neural model.

Installation

Khl is available on PyPI:

$ pip install khl

It requires Python 3.8+ to run.

Usage

To get started right away with basic usage:

from khl import text_to_codes

coder = {
    '': 0,     # placeholder
    '???': 1,  # unknown
    '.': 2,
    'и': 3,
    'в': 4,
    '-': 5,
    ':': 6,
    'матч': 7,
    'за': 8,
    'забить': 9,
    'гол': 10,
    'per': 11,   # person entity
    'org': 12,   # organization entity
    'loc': 13,   # location entity
    'date': 14,  # date entity
    'против': 15,
    'год': 16,
    'pers': 17,  # few persons entity
    'orgs': 18,  # few organizations entity
    'свой': 19
}

text = """
    1 апреля 2023 года в Москве в матче ⅛ финала против „Спартака” Иван Иванов забил свой 100—й гол за карьеру.
    «Динамо Мск» - «Спартак» 2:1 ОТ (1:0 0:1 0:0 1:0) Голы забили: Иванов, Петров и Сидоров.
"""

codes = text_to_codes(
    text=text,
    coder=coder,
    stop_words_=["за", "и", "свой"],  # stop words to drop
    replace_ners_=True,               # replace named entities ("Иван Иванов" -> "per", "Спартак" -> "org", "Москва" -> "loc")
    replace_dates_=True,              # replace dates ("1 апреля 2023 года" -> "date")
    replace_penalties_=True,          # replace penalties ("5+20" -> "pen")
    exclude_unknown=True,             # drop lemmas that are not present in coder
    max_len=20,                       # return a sequence of exactly 20 codes
)
# codes = [0, 0, 0, 14, 4, 13, 4, 7, 15, 12, 11, 9, 10, 2, 18, 10, 9, 6, 17, 2]

text_to_codes is a very high-level function. To see what happens under the hood, see Lower level usage.

What is `coder`?

coder is just a dictionary in which each lemma is represented by a unique integer code. Note that the first two elements are reserved for the placeholder and for unknown lemmas.
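
The constraints above are easy to check in plain Python. The sketch below is not part of khl: `check_coder` is a hypothetical helper, and it assumes that the reserved keys are `''` (placeholder) and `'???'` (unknown), as in the usage example.

```python
def check_coder(coder):
    """Return a list of problems found in a coder dictionary (empty if none).

    Hypothetical helper, not part of khl. Assumes '' (placeholder) and
    '???' (unknown) are the reserved keys, as in the usage example.
    """
    problems = []
    if coder.get("") != 0:
        problems.append("placeholder '' should have code 0")
    if coder.get("???") != 1:
        problems.append("unknown '???' should have code 1")
    codes = list(coder.values())
    if not all(isinstance(code, int) for code in codes):
        problems.append("codes should be integers")
    if len(set(codes)) != len(codes):
        problems.append("codes are not unique")
    return problems


print(check_coder({"": 0, "???": 1, ".": 2, "и": 3}))  # well-formed coder
print(check_coder({"": 0, ".": 1, "и": 1}))            # missing '???', duplicate code
```

Running such a check before training can catch a hand-edited coder that silently maps two lemmas to the same code.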

A coder can be built from a frequency dictionary file (see Get lemmas coder). A frequency dictionary file is a JSON file containing a dictionary in which each key is a lemma and each value is the number of times that lemma occurs in your whole dataset. Preferably, it should be sorted in descending order of values.
example_frequency_dictionary.json:

{
  ".": 1000,
  "и": 500,
  "в": 400,
  "-": 300,
  ":": 300,
  "матч": 290,
  "за": 250,
  "забить": 240,
  "гол": 230,
  "per": 200,
  "org": 150,
  "loc": 150,
  "date": 100,
  "против": 90,
  "год": 70,
  "pers": 40,
  "orgs": 30,
  "свой": 20
}

You can make and use your own frequency dictionary or download this dictionary created by the author.
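
If you build your own frequency dictionary, the sketch below shows one way to do it with the standard library. It is not khl's implementation: `build_frequency_dictionary` and `build_coder` are hypothetical helpers, and the rule "codes 0 and 1 are reserved, the remaining codes follow descending frequency" is an assumption inferred from the example coder and the example frequency dictionary.

```python
import json
from collections import Counter


def build_frequency_dictionary(lemmatized_texts):
    """Count lemmas over an iterable of lemma lists, most frequent first."""
    counter = Counter()
    for lemmas in lemmatized_texts:
        counter.update(lemmas)
    # most_common() sorts by count in descending order
    return dict(counter.most_common())


def build_coder(frequencies):
    """Assign integer codes by descending frequency (assumed rule, see above)."""
    coder = {"": 0, "???": 1}  # reserved: placeholder and unknown
    ordered = sorted(frequencies.items(), key=lambda item: item[1], reverse=True)
    for code, (lemma, _count) in enumerate(ordered, start=2):
        coder[lemma] = code
    return coder


frequencies = build_frequency_dictionary([["гол", ".", "матч", "."], ["гол", "."]])
# ensure_ascii=False keeps Cyrillic lemmas readable in the JSON text
print(json.dumps(frequencies, ensure_ascii=False, indent=2))
print(build_coder(frequencies))
```

The JSON text printed here has the same shape as example_frequency_dictionary.json, so it can be written to a file and used in the same way.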

Lower level usage

1. Make imports

from khl import stop_words
from khl import utils
from khl import preprocess

2. Get lemmas coder

coder = preprocess.get_coder("example_frequency_dictionary.json")

3. Define text

text = """
    1 апреля 2023 года в Москве в матче ⅛ финала против „Спартака” Иван Иванов забил свой 100—й гол за карьеру.
    «Динамо Мск» - «Спартак» 2:1 ОТ (1:0 0:1 0:0 1:0) Голы забили: Иванов, Петров и Сидоров.
"""

4. Unify

unified_text = utils.unify(text)
# "1 апреля 2023 года в Москве в матче 1/8 финала против 'Спартака' Иван Иванов забил свой 100-й гол за карьеру. 'Динамо Мск' - 'Спартак' 2:1 ОТ (1:0 0:1 0:0 1:0) Голы забили: Иванов, Петров и Сидоров."
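
To give an idea of what unification involves, here is a minimal sketch in plain Python. It is not khl's implementation: `sketch_unify` and the character map are assumptions that cover only the kinds of characters visible in the example (typographic quotes, long dashes, vulgar fractions, extra whitespace).

```python
# Illustrative character map (an assumption, not khl's actual table)
CHAR_MAP = str.maketrans({
    "\u00ab": "'",    # left guillemet
    "\u00bb": "'",    # right guillemet
    "\u201e": "'",    # low double quote
    "\u201c": "'",    # left double quote
    "\u201d": "'",    # right double quote
    "\u2014": "-",    # em dash
    "\u2013": "-",    # en dash
    "\u215b": "1/8",  # vulgar fraction one eighth
    "\u00bc": "1/4",  # vulgar fraction one quarter
    "\u00bd": "1/2",  # vulgar fraction one half
})


def sketch_unify(text):
    """Map typographic variants to plain ASCII and collapse whitespace."""
    replaced = text.translate(CHAR_MAP)
    # split() without arguments also removes newlines and indentation
    return " ".join(replaced.split())


print(sketch_unify("  \u00abДинамо Мск\u00bb  \u2014  \u201eСпартак\u201d\n  \u215b финала "))
```

Mapping every variant to a single form keeps the vocabulary small: the same club name in different quote styles yields the same tokens.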

5. Simplify

simplified_text = utils.simplify(
    text=unified_text,
    replace_ners_=True,
    replace_dates_=True,
    replace_penalties_=True,
)
# 'date в loc в матче финала против org per забил свой гол за карьеру. org org Голы забили: per per per.'
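
The date and penalty replacements can be approximated with regular expressions, as in the sketch below. This is an illustration only, not khl's implementation: `sketch_simplify` and both patterns are assumptions, and named entity replacement (which needs an NER model) is not covered.

```python
import re

# Russian month names in the genitive case, as used in dates
MONTHS = (
    "января|февраля|марта|апреля|мая|июня|"
    "июля|августа|сентября|октября|ноября|декабря"
)
# "1 апреля", "1 апреля 2023" or "1 апреля 2023 года"
DATE_RE = re.compile(rf"\b\d{{1,2}}\s+(?:{MONTHS})\b(?:\s+\d{{4}}(?:\s+года\b)?)?")
# Penalty notation such as "5+20" or "2+10"
PENALTY_RE = re.compile(r"\b\d{1,2}\+\d{1,2}\b")


def sketch_simplify(text):
    """Replace dates with 'date' and penalties with 'pen' (assumed patterns)."""
    text = DATE_RE.sub("date", text)
    text = PENALTY_RE.sub("pen", text)
    return text


print(sketch_simplify("1 апреля 2023 года в Москве"))
print(sketch_simplify("Иванов получил 5+20 за грубость"))
```

Replacing concrete values with a shared tag lets the model learn from the pattern "a date" or "a penalty" instead of thousands of rare individual tokens.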

6. Lemmatize

lemmas = preprocess.lemmatize(text=simplified_text, stop_words_=stop_words)
# ['date', 'в', 'loc', 'в', 'матч', 'финал', 'против', 'org', 'per', 'забить', 'гол', 'карьера', '.', 'orgs', 'гол', 'забить', ':', 'pers', '.']
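
In the example output, stop words are dropped, `org org` becomes `orgs`, and `per per per` becomes `pers`. The sketch below reproduces only these two token-level effects; it is not khl's implementation, the helper names are hypothetical, and morphological lemmatization itself (for example "забил" to "забить") requires a morphology library and is not shown.

```python
from itertools import groupby

# Assumption: only these tags have a "several entities" form, as in the example coder
PLURAL_TAGS = {"per": "pers", "org": "orgs"}


def drop_stop_words(tokens, stop_words):
    """Remove tokens that appear in the stop word collection."""
    blocked = set(stop_words)
    return [token for token in tokens if token not in blocked]


def collapse_entity_runs(tokens):
    """Replace runs of the same entity tag with its plural tag."""
    result = []
    for token, run in groupby(tokens):
        run = list(run)
        if token in PLURAL_TAGS and len(run) > 1:
            result.append(PLURAL_TAGS[token])
        else:
            # Non-entity tokens and single entities are kept unchanged
            result.extend(run)
    return result


tokens = ["org", "per", "забить", "свой", "гол", ".", "org", "org", "per", "per", "per"]
print(collapse_entity_runs(drop_stop_words(tokens, ["свой"])))
```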

7. Transform to codes

codes = preprocess.lemmas_to_codes(
    lemmas=lemmas,
    coder=coder,
    exclude_unknown=True,
    max_len=20,
)
# [0, 0, 0, 14, 4, 13, 4, 7, 15, 12, 11, 9, 10, 2, 18, 10, 9, 6, 17, 2]
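
The encoding step can be pictured with the following sketch. It is not khl's code: `sketch_lemmas_to_codes` is a hypothetical function. Padding on the left with the placeholder code matches the output above, while cutting long sequences at the end and mapping unknown lemmas to code 1 when `exclude_unknown` is false are assumptions.

```python
def sketch_lemmas_to_codes(lemmas, coder, exclude_unknown=True, max_len=None):
    """Encode lemmas as integers and pad on the left to max_len."""
    placeholder_code = coder[""]
    unknown_code = coder["???"]
    codes = []
    for lemma in lemmas:
        if lemma in coder:
            codes.append(coder[lemma])
        elif not exclude_unknown:
            codes.append(unknown_code)
        # otherwise the unknown lemma is dropped
    if max_len is not None:
        codes = codes[:max_len]  # assumption: keep the beginning of long texts
        padding = [placeholder_code] * (max_len - len(codes))
        codes = padding + codes
    return codes


coder = {"": 0, "???": 1, ".": 2, "гол": 3}
print(sketch_lemmas_to_codes(["гол", "карьера", "."], coder, max_len=5))
print(sketch_lemmas_to_codes(["гол", "карьера", "."], coder, exclude_unknown=False, max_len=5))
```

A fixed `max_len` gives every text the same shape, which is what most neural network input layers expect.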

8. Transform codes back to lemmas (just to see which lemmas are present in the code sequence)

print(
    preprocess.codes_to_lemmas(codes=codes, coder=coder)
)
# ['', '', '', 'date', 'в', 'loc', 'в', 'матч', 'против', 'org', 'per', 'забить', 'гол', '.', 'orgs', 'гол', 'забить', ':', 'pers', '.']
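
Decoding amounts to inverting the coder dictionary. The sketch below assumes nothing about khl's internals; `sketch_codes_to_lemmas` is a hypothetical function that relies only on the codes in a coder being unique.

```python
def sketch_codes_to_lemmas(codes, coder):
    """Map integer codes back to lemmas by inverting the coder."""
    decoder = {code: lemma for lemma, code in coder.items()}
    return [decoder[code] for code in codes]


coder = {"": 0, "???": 1, ".": 2, "гол": 3}
print(sketch_codes_to_lemmas([0, 0, 3, 1, 2], coder))
```

The round trip is lossy: lemmas that were dropped or encoded as unknown cannot be recovered, which is why the decoded list is useful mainly for inspection.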

Release history

2.0.2: Nov 25, 2023
2.0.1: May 7, 2023
2.0.0: Apr 23, 2023 (this version)
1.0.7: Apr 22, 2023
1.0.6: Apr 20, 2023
1.0.5: Mar 10, 2023
1.0.4: Mar 5, 2023
1.0.3: Mar 5, 2023
1.0.2: Feb 25, 2023
1.0.1: Jan 29, 2023
1.0.0: Jan 25, 2023

Download files

Source Distribution

khl-2.0.0.tar.gz (21.7 kB)

Uploaded Apr 23, 2023 Source

Built Distribution

khl-2.0.0-py3-none-any.whl (20.4 kB)

Uploaded Apr 23, 2023 Python 3

File details

Details for the file khl-2.0.0.tar.gz.

File metadata

Download URL: khl-2.0.0.tar.gz
Upload date: Apr 23, 2023
Size: 21.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.3.2 CPython/3.8.10 Linux/5.15.0-69-generic

File hashes

Hashes for khl-2.0.0.tar.gz
Algorithm	Hash digest
SHA256	`831f3af2172b6913f0a4d8868b7b835c8952b74fc49b38a3ba01b14e4d29851f`
MD5	`502f8f9a7d53513801a191817cab06f7`
BLAKE2b-256	`e8a1250df30b9a5b8e5e7a752d255ae9fa694c19e5536b0e579209d78346f8a5`

File details

Details for the file khl-2.0.0-py3-none-any.whl.

File metadata

Download URL: khl-2.0.0-py3-none-any.whl
Upload date: Apr 23, 2023
Size: 20.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.3.2 CPython/3.8.10 Linux/5.15.0-69-generic

File hashes

Hashes for khl-2.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e2ad13c8065253d6fe2fa403898c4a06986a750d8f68069b9f112d47e7eacb80`
MD5	`2adcd0db1e24ef73f61a8abf21239b86`
BLAKE2b-256	`41e48e7872a173860c2c4b3bdfa016b33af65868cd66268a2c51fd9c01c06783`
