Skip to main content

Use spacy to disambiguate pymorphy3 morphology analyses.

Project description

pymorphy-spacy-disambiguation

TL;DR

Use spacy morphology to pick the correct pymorphy2 morphology out of a list of options.

from pymorphy_spacy_disambiguation.disamb import Disambiguator

d = Disambiguator()

txt = "Жив був король. У нього було царство, де жило сто корів і тридцять кіз."
doc = nlp(txt)
token = doc[12]  # корів - many options, but it's cows, not measles

res = d.get_with_disambiguation(token)

res
>>> Parse(word='корів', tag=OpencorporaTag('NOUN,anim plur,accs'), normal_form='корова', score=1.0, methods_stack=((DictionaryAnalyzer(), 'корів', 2063, 10),))

Motivation

The package pymorphy2/pymorphy2: Morphological analyzer / inflection engine for Russian and Ukrainian languages does both morphological analysis and inflection of Russian and Ukrainian words.

It does this per-word (based on both dictionary and probabilistic methods), without looking at the context. This exacerbates the problem of disambiguation.

Some words might be written identically but:

For the Russian language, pymorphy2 also gives a probability score for the different options, but this is absent for the Ukrainian language and in any case may not be enough in many cases.

How do you pick the correct morphological analysis?

Pymorphy describes (in Russian) the problem of choosing the correct morphological analysis out of multiple options: Руководство пользователя — Морфологический анализатор pymorphy2

Basically the TL;DR is either you know what you're talking about (e.g. living beings) or you use the sentence context.

Spacy also does morphology based on context, and in my experience it's much more precise in many cases.

This package uses uses spacy morphology to disambiguate between different pymorphy2 options.

Why can't we use spacy directly then?

Because pymorphy2 also does inflection, and to do it correctly it has to start with the correct form.

In the example below, to inflect корів you have to know if it's talking about cows or about measles, without the sentence around it you just don't know.

У царя не було корів.

Царю потрібні корови.

Мова йде про корову чи хвороба кір?

Howto

Basic

In: a spacy Token with morphological analysis
Out: a pymorphy2 Parse object with the most likely candidate for morphological analysis

Take a spacy token, here it's cows: 'корів'

txt = "Жив був король. У нього було царство, де жило сто корів і тридцять кіз."
doc = nlp(txt)
token = doc[12]  # корів - many options, but it's cows, not measles

Pymorphy2 would give you three different options:

pymorphy_analyzer = pymorphy2.MorphAnalyzer(lang="uk")

# options = pymorphy_analyzer.parse("корів")
options = pymorphy_analyzer.parse(token.text)

# options are:
# 	first is "кір" (measles), the other two "корова" (cow) in two different cases.
>>> [Parse(word='корів', tag=OpencorporaTag('NOUN,inan plur,gent'), normal_form='кір', score=1.0, methods_stack=((DictionaryAnalyzer(), 'корів', 498, 11),)),
 Parse(word='корів', tag=OpencorporaTag('NOUN,anim plur,gent'), normal_form='корова', score=1.0, methods_stack=((DictionaryAnalyzer(), 'корів', 2063, 8),)),
 Parse(word='корів', tag=OpencorporaTag('NOUN,anim plur,accs'), normal_form='корова', score=1.0, methods_stack=((DictionaryAnalyzer(), 'корів', 2063, 10),))]

Compare with spacy morphology:

# spacy 
token.morph
>>> Animacy=Anim|Case=Gen|Gender=Fem|Number=Plur

The disambiguator compares the normal form, POS and morphology of each and picks the one most consistent with the context-based spacy version:

from pymorphy_spacy_disambiguation.disamb import Disambiguator

d = Disambiguator()
res = d.get_with_disambiguation(token)

res
>>> Parse(word='корів', tag=OpencorporaTag('NOUN,anim plur,accs'), normal_form='корова', score=1.0, methods_stack=((DictionaryAnalyzer(), 'корів', 2063, 10),))
k
assert res.normal_form == "корова"
assert str(res.tag) == "NOUN,anim plur,accs"

With weighting

The most likely morphology analysis is picked based on a similarity score.

It takes into account the following:

The grammemes get translated to Universal Dependencies FEATS format using the package kmike/russian-tagsets.

Based on your use case you might care about some more than others. For this weighting was implemented.

Spacy can provide more or less features than pymorphy2, and missing_grammeme sets the penalty for key/value pairs missing. By default there's no penalty, just no +1 that would be added if they were present and equal.

@dataclass
class SimilarityWeighting:
    # Certainty score assigned by pymorphy2 (for Russian only!).
    # This weighting is a multiplier for that score
    score: float = 1.0

    # Whether the lemma / normal_form is equal
    normal_form: float = 1.0

    # Penalty when one of the grammemes/tags is missing in one of the two dicts
    # 0.0 means "do nothing", 1.0 means "substract 1"
    missing_grammeme: float = 0.0

    # All the other cases: the Opencorpora tag classes (below)
    normal_grammeme: float = 1.0

To use:

	# Assume those two dictionary representation of word morphology.
	# Here everything differs except the normal form
    m3 = {
        "_NORMAL_FORM": "not_whatever",  # ! changed
        "Animacy": "Anim",  # ! changed from m2
        "Case": "Gen",
        "Number": "Plur",
        "_POS": "NOUN",
    }
    m3_1 = {
        "_NORMAL_FORM": "not_whatever",  # ! changed
        "Animacy": "Inan",  # ! changed from m2
        "Case": "Not genitive",
        "Number": "Sing",
        "_POS": "VERB",
    }

	# Weighting that increases the weight for normal form
    w_normal = SimilarityWeighting(normal_form=1000)

	# This weighting would make the similarity approach 1, by decreasing the 
	# importance of all the other fields.
    assert Disambiguator.weighted_calculate_morph_similarity(m3, m3_1, w_normal) > 0.99

TODO

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pymorphy_spacy_disambiguation-0.1.4.tar.gz (10.7 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

pymorphy_spacy_disambiguation-0.1.4-py3-none-any.whl (9.3 kB view details)

Uploaded Python 3

File details

Details for the file pymorphy_spacy_disambiguation-0.1.4.tar.gz.

File metadata

File hashes

Hashes for pymorphy_spacy_disambiguation-0.1.4.tar.gz
Algorithm Hash digest
SHA256 af2cce27e9f97b5f4fa2255e352b01a060982ac2c858472883c76d946e3286f6
MD5 f71fdf7e7483652ab96e36a2f520eb35
BLAKE2b-256 31cb2c62c88ddc472d9048be855a8dd001b0e5ca3bc8d56c0f8d8ce9a5f74dfa

See more details on using hashes here.

File details

Details for the file pymorphy_spacy_disambiguation-0.1.4-py3-none-any.whl.

File metadata

File hashes

Hashes for pymorphy_spacy_disambiguation-0.1.4-py3-none-any.whl
Algorithm Hash digest
SHA256 7dd01f8aecef676b4ef0dd81c95a469391690b2664f9769c73e66d57c1c2a54c
MD5 86f38143c677729c0314f38ddfd0e59c
BLAKE2b-256 506accd575bb8c2859adbf49cc60b455e87553b625a23c81397714fdca073c06

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page