A pure-Python wrapper for the pykko Finnish morphological analyser and inflector

PyPykko

PyPykko is a wrapper around pykko. It provides the basic analysis and generation API in an easily installable package. PyPykko can be installed without compiling anything (as the transducers are pre-compiled) or pulling in any native dependencies (such as hfst).

This package contains versions, slightly modified for kfst compatibility, of all the files in the tools directory of pykko, as well as constants.py and file_tools.py from the scripts directory. utils.py from the scripts directory is included as scriptutils.py. The package also provides the novel reinflect.py and extras.py. The function utils.analyze returns a NamedTuple, as opposed to the unnamed tuple returned by upstream Pykko as of writing.

Installation

PyPykko is available on PyPI and can be installed with pip:

pip install pypykko

Usage

Two main functions, utils.analyze and generate.generate_wordform, are inherited from Pykko proper. In addition, reinflect.reinflect is perhaps a more suitable interface for general reinflection, and extras.analyze_with_compound_parts adds bolted-on alignment support.

reinflect.reinflect

reinflect.reinflect makes a best-effort attempt to reinflect a word. The target form is given either as a model word (model) or as an explicit form (new_form). If they are known ahead of time, the form the original word is in (orig_form) and the part of speech of the word (pos) can also be supplied.

>>> from pypykko.reinflect import reinflect
>>> reinflect("mökkiammeemme", model="talossa")
{'mökkiammeessa'}
>>> reinflect("esijuosta", model="katselemme")
{'esijuoksemme'}
>>> reinflect("mökkiammeemme", new_form="+sg+nom")
{'mökkiamme'}
>>> reinflect("möhkö", new_form="+pl+ine+ko")
{'möhköissäkö'}
>>> reinflect("viinissä", model="talot")
{'viinet'}
>>> reinflect("viinissä", model="talot", orig_form="+sg+ine")
{'viinit'}
>>> reinflect("hömppäämme", model="juokset", pos="verb")
{'hömppäät'}
>>> reinflect("hömppäämme", model="juokset", pos="noun")
{'hömpät'}

utils.analyze and extras.analyze_with_compound_parts

utils.analyze should be used in most cases:

>>> from pypykko.utils import analyze
>>> analyze("hätkähtäneet")
[PykkoAnalysis(wordform='hätkähtäneet', source='Lexicon', lemma='hätkähtää', pos='verb', homonym='', info='', morphtags='+past+conneg+pl', weight=0.0),
 PykkoAnalysis(wordform='hätkähtäneet', source='Lexicon', lemma='hätkähtää', pos='verb', homonym='', info='', morphtags='+part_past+pl+nom', weight=0.0),
 PykkoAnalysis(wordform='hätkähtäneet', source='Lexicon', lemma='hätkähtänyt', pos='participle', homonym='', info=' ← verb:hätkähtää:+part_past', morphtags='+pl+nom', weight=0.0)]

The fields of the returned tuple are:

  • wordform: Surface form (input as it is given)
  • source: The source of the word: e.g. Lexicon if it is a word known ahead of time, Guesser|Any for unknown words and Lexicon|Pfx for words analyzed as compounds of known words.
  • lemma: The lemma form of the word. Notably, this can contain pipe symbols that delimit compound parts: ilma|luukku. In some Finnish compounds the parts are inflected separately, in which case each part is lemmatized (e.g. uudenvuoden -> uusi|vuosi).
  • pos: The part of speech of the word.
  • homonym: The homonym number of the word (can be empty). E.g. the word viini has two senses with slightly different inflection: wine (viini -> viinin) and quiver (viini -> viinen). Where such homonyms exist but it is impossible to tell which one is meant (here the nominative form viini), we get both interpretations:
[PykkoAnalysis(wordform='viini', source='Lexicon', lemma='viini', pos='noun', homonym='1', info='', morphtags='+sg+nom', weight=0.0),
 PykkoAnalysis(wordform='viini', source='Lexicon', lemma='viini', pos='noun', homonym='2', info='', morphtags='+sg+nom', weight=0.0)]

In cases where the form is unambiguous (e.g. viinen), we get only the relevant homonym number:

[PykkoAnalysis(wordform='viinen', source='Lexicon', lemma='viini', pos='noun', homonym='2', info='', morphtags='+sg+gen', weight=0.0)]

In cases where different interpretations belong to different homonyms, each interpretation is annotated with its own homonym number:

[PykkoAnalysis(wordform='viinin', source='Lexicon', lemma='viini', pos='noun', homonym='2', info='', morphtags='+pl+ins', weight=0.0),
 PykkoAnalysis(wordform='viinin', source='Lexicon', lemma='viini', pos='noun', homonym='1', info='', morphtags='+sg+gen', weight=0.0)]
  • info: Either a register annotation or information on a derivation, e.g.:
>>> analyze("höpsöillä")
[PykkoAnalysis(wordform='höpsöillä', source='Lexicon', lemma='höpsö', pos='noun', homonym='', info='⟨coll⟩', morphtags='+pl+ade', weight=0.0), PykkoAnalysis(wordform='höpsöillä', source='Lexicon', lemma='höpsö', pos='adjective', homonym='', info='⟨coll⟩', morphtags='+pl+ade', weight=0.0)]
>>> analyze("kulkenut")
[PykkoAnalysis(wordform='kulkenut', source='Lexicon', lemma='kulkea', pos='verb', homonym='', info='', morphtags='+past+conneg+sg', weight=0.0), PykkoAnalysis(wordform='kulkenut', source='Lexicon', lemma='kulkea', pos='verb', homonym='', info='', morphtags='+part_past+sg+nom', weight=0.0), PykkoAnalysis(wordform='kulkenut', source='Lexicon', lemma='kulkenut', pos='participle', homonym='', info=' ← verb:kulkea:+part_past', morphtags='+sg+nom', weight=0.0)]
  • morphtags: Morphological tags that name the inflectional form.
  • weight: The weight of this analysis per the FST. Generally, lower weights are more probable.
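The fields above lend themselves to simple post-processing. The following is a minimal sketch, not part of PyPykko: Analysis is a hypothetical stand-in NamedTuple with the same field names as PykkoAnalysis, so that the example runs without the package installed, and the helper names (split_morphtags, lemma_parts, lightest) are made up for illustration. The data is copied from the viinin analyses listed above.

```python
from typing import NamedTuple


class Analysis(NamedTuple):
    """Hypothetical stand-in with the same fields as PykkoAnalysis."""
    wordform: str
    source: str
    lemma: str
    pos: str
    homonym: str
    info: str
    morphtags: str
    weight: float


def split_morphtags(morphtags):
    """'+pl+ade' -> ['pl', 'ade']; an empty string gives an empty list."""
    return [tag for tag in morphtags.split("+") if tag]


def lemma_parts(lemma):
    """'ilma|luukku' -> ['ilma', 'luukku']; a simple lemma gives one part."""
    return lemma.split("|")


def lightest(analyses):
    """Keep only the analyses that share the lowest weight."""
    if not analyses:
        return []
    best = min(a.weight for a in analyses)
    return [a for a in analyses if a.weight == best]


# Values copied from the analyses of "viinin" shown above.
viinin = [
    Analysis("viinin", "Lexicon", "viini", "noun", "2", "", "+pl+ins", 0.0),
    Analysis("viinin", "Lexicon", "viini", "noun", "1", "", "+sg+gen", 0.0),
]

# Map each homonym number to the tags of its interpretation.
by_homonym = {a.homonym: split_morphtags(a.morphtags) for a in lightest(viinin)}
print(by_homonym)  # {'2': ['pl', 'ins'], '1': ['sg', 'gen']}
```

Because the real result is also a NamedTuple, the same attribute access (a.lemma, a.weight and so on) should carry over to the output of analyze.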

extras.analyze_with_compound_parts is useful when the exact inflected forms of the compound parts of a word are needed. E.g. when looking at "isonvarpaan", one might want to know not only that it is a compound of "iso" and "varvas" but also that they appear in the forms "ison" and "varpaan". extras.analyze_with_compound_parts returns the character ranges matching the compound parts.

>>> from pypykko.extras import analyze_with_compound_parts
>>> analyze_with_compound_parts("isonvarpaan")
[RangedPykkoAnalysis(wordform='isonvarpaan', source='Lexicon', lemma='iso|varvas', pos='noun', homonym='', info='', morphtags='+sg+gen', weight=0.0, ranges=(range(0, 4), range(4, 11)))]
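The ranges can be turned back into surface strings with ordinary slicing. The sketch below uses only the values from the output above and does not import PyPykko. The helper name surface_parts is made up for illustration, and pairing lemma parts with ranges assumes there is one range per pipe-delimited lemma part, as is the case in this example.

```python
def surface_parts(wordform, ranges):
    """Slice the surface form of each compound part out of the wordform."""
    return [wordform[r.start:r.stop] for r in ranges]


# Values copied from the RangedPykkoAnalysis shown above.
wordform = "isonvarpaan"
lemma = "iso|varvas"
ranges = (range(0, 4), range(4, 11))

# Pair each lemma part with the inflected form it has in the compound
# (assumes one range per lemma part).
pairs = list(zip(lemma.split("|"), surface_parts(wordform, ranges)))
print(pairs)  # [('iso', 'ison'), ('varvas', 'varpaan')]
```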

generate.generate_wordform

generate_wordform is a simple-to-use API for inflecting in-lexicon words.

>>> from pypykko.generate import generate_wordform
>>> generate_wordform("höpönassu", "noun", '+pl+abe+ko')
{'höpönassuittako'}
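To inflect one word into several forms, the call can be wrapped in a small loop. The following is a sketch under stated assumptions: paradigm is a hypothetical helper, and the generator is passed in as a callable taking (lemma, pos, tags) and returning a set, matching the call shown above. fake_generate is a stand-in that only knows that one call, so that the example runs without PyPykko installed; what the real function returns for forms it cannot produce is not covered here.

```python
def paradigm(lemma, pos, tag_strings, generate):
    """Map each tag string to the sorted forms that `generate` returns for it."""
    return {tags: sorted(generate(lemma, pos, tags)) for tags in tag_strings}


def fake_generate(lemma, pos, tags):
    # Stand-in for generate_wordform: it only knows the single call shown
    # above and returns an empty set for anything else (an assumption).
    known = {("höpönassu", "noun", "+pl+abe+ko"): {"höpönassuittako"}}
    return known.get((lemma, pos, tags), set())


table = paradigm("höpönassu", "noun", ["+pl+abe+ko", "+sg+nom"], fake_generate)
print(table)  # {'+pl+abe+ko': ['höpönassuittako'], '+sg+nom': []}
```

In real use, one would pass pypykko.generate.generate_wordform in place of fake_generate.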

License

PyPykko is licensed under the MIT license, like Pykko itself, as it consists mostly of Pykko's files with minor modifications. See the LICENSE file for details. Note that kfst (and kfst-rs) have less permissive licenses.

Files from Pykko itself are modified from the version in commit 9bf1f02a3b03046955a82643e273b6fc3b28174f. The compiled transducers are from the same commit.
