Description

LemonTizer is a class that wraps the spaCy library to build a lemmatizer for language learning applications. It automatically manages the installation and loading of models for all languages supported by spaCy and provides various lemmatization options.

It is designed so that lemmatization can be enabled for multiple languages with the same amount of effort as enabling it for one, thus making community-made scripts more widely accessible.

(For those curious, the name is a pun on the Scottish soft drink, which used to come in various fruit flavours.)

Quickstart

First, install lemon-tizer using pip:

pip install lemon-tizer

Example of lemmatizing a single sentence:

# Import class
from lemon_tizer import LemonTizer

# Initialise class
# Language should be a lower-case two-letter code; see the "Supported languages" table for the list of abbreviations
# Model size depends on the availability of models; see https://spacy.io/models
# Normally, these are "sm", "md", "lg"
# Larger models are more accurate and support more features but require more storage space and may take longer to run
lemma = LemonTizer(language="en", model_size="lg")

# Lemmatize a test string and print the result
test_string = "I am going to the shops to buy a can of Tizer."
output = lemma.lemmatize_sentence(test_string)
print(output)

This would produce the following output:

"""
Output:
[{'I': 'I'},
 {'am': 'be'},
 {'going': 'go'},
 {'to': 'to'},
 {'the': 'the'},
 {'shops': 'shop'},
 {'to': 'to'},
 {'buy': 'buy'},
 {'a': 'a'},
 {'can': 'can'},
 {'of': 'of'},
 {'Tizer': 'Tizer'},
 {'.': '.'}]
"""

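The result is a list of single-entry dictionaries, one per token, in sentence order. As a minimal sketch of working with that shape, the snippet below flattens it into (token, lemma) pairs and keeps the tokens whose lemma differs from the original word. The helper name to_pairs is hypothetical and not part of lemon-tizer; the sample data is a shortened copy of the output above.

```python
# Shortened copy of the Quickstart output shown above.
output = [{'I': 'I'}, {'am': 'be'}, {'going': 'go'},
          {'to': 'to'}, {'the': 'the'}, {'shops': 'shop'}]


def to_pairs(lemmatized):
    """Flatten [{token: lemma}, ...] into [(token, lemma), ...], keeping order."""
    return [(token, lemma)
            for entry in lemmatized
            for token, lemma in entry.items()]


pairs = to_pairs(output)

# Keep only the tokens that the lemmatizer actually changed.
changed = [(token, lemma) for token, lemma in pairs if token != lemma]
print(changed)
# [('am', 'be'), ('going', 'go'), ('shops', 'shop')]
```

Pairs are often easier to iterate over than single-entry dictionaries, and unlike a merged dictionary they preserve repeated tokens such as the two occurrences of "to".
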
Script settings

You can also enable settings that change the lemmatizer's behaviour, such as excluding punctuation, excluding common words, or forcing the input to lower case. One use case is a frequency analysis of the words in a text.

Example:

# Import class
from lemon_tizer import LemonTizer

# Initialise class
lemma = LemonTizer(language="en", model_size="lg")

# Configure settings
lemma.set_lemma_settings(filter_out_non_alpha=True,
    filter_out_common=True,
    convert_input_to_lower=True,
    convert_output_to_lower=True,
    return_just_first_word_of_lemma=True
)

# Lemmatize a test string and print the result
test_string = "I am going to the shops to buy a can of Tizer."
output = lemma.lemmatize_sentence(test_string)
print(output)

This would produce the following output:

"""
Output:
[{'going': 'go'}, {'shops': 'shop'}, {'buy': 'buy'}, {'tizer': 'tizer'}]
"""

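For the frequency-analysis use case, results in this shape can be fed into collections.Counter from the standard library. The following is a minimal sketch, not part of lemon-tizer: lemma_frequencies is a hypothetical helper, the first sample is the output shown above, and the second sample is invented for illustration in the same shape.

```python
from collections import Counter


def lemma_frequencies(lemmatized_sentences):
    """Count lemma occurrences across several lemmatize_sentence()-style results."""
    counts = Counter()
    for sentence in lemmatized_sentences:
        for entry in sentence:
            # Each entry is a single-item dict of {token: lemma}; count the lemma.
            counts.update(entry.values())
    return counts


# Output shown above for the first sentence.
first = [{'going': 'go'}, {'shops': 'shop'}, {'buy': 'buy'}, {'tizer': 'tizer'}]
# Hypothetical result for a second sentence, invented for this example.
second = [{'went': 'go'}]

counts = lemma_frequencies([first, second])
print(counts.most_common(1))
# [('go', 2)]
```

Counting lemmas rather than raw tokens means that "going" and "went" contribute to the same entry.
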
The options are:

| Boolean Variable | Explanation |
| --- | --- |
| filter_out_non_alpha | Filters out lemmatizations that contain non-alphabetic characters. Useful for removing punctuation, etc. Note: lemmatizations containing an apostrophe will also be filtered out if this is set! |
| filter_out_common | Filters out common words such as "the", "and", "she". Useful when doing frequency analysis. |
| convert_input_to_lower | Forces the input string to lower case before lemmatization. This changes the behaviour of the algorithm, particularly in relation to the identification of proper nouns, and may increase accuracy in some languages. |
| convert_output_to_lower | Forces the lemmatization to be lower case. |
| return_just_first_word_of_lemma | Some lemmatizations return multiple words for a given input token. Setting this to True returns just the first word. |
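
To make the option descriptions concrete, here is a rough pure-Python approximation of how four of them could act on (token, lemma) pairs. This is a sketch of the documented semantics only, not the library's actual code: apply_settings and COMMON_WORDS are hypothetical names, the order in which the options are applied is an assumption, and the sample pairs are invented. convert_input_to_lower is left out because it acts on the text before the model sees it.

```python
# Hypothetical stand-in for a stop-word list; not the list lemon-tizer uses.
COMMON_WORDS = {"i", "am", "to", "the", "a", "of", "can"}


def apply_settings(pairs,
                   filter_out_non_alpha=False,
                   filter_out_common=False,
                   convert_output_to_lower=False,
                   return_just_first_word_of_lemma=False):
    """Approximate the documented option semantics on (token, lemma) pairs."""
    result = []
    for token, lemma in pairs:
        if return_just_first_word_of_lemma:
            words = lemma.split()
            lemma = words[0] if words else lemma
        if filter_out_non_alpha and not lemma.isalpha():
            # Drops punctuation, digits, and lemmas containing an apostrophe.
            continue
        if filter_out_common and token.lower() in COMMON_WORDS:
            continue
        if convert_output_to_lower:
            lemma = lemma.lower()
        result.append((token, lemma))
    return result


# Invented sample pairs, chosen to exercise each filter.
pairs = [("Shops", "shop"), ("o'clock", "o'clock"), ("the", "the"), (".", ".")]
print(apply_settings(pairs, filter_out_non_alpha=True, filter_out_common=True))
# [('Shops', 'shop')]
```

The "o'clock" pair shows why the apostrophe warning matters: str.isalpha() returns False for any string containing an apostrophe, so such lemmas are dropped along with punctuation.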

Advanced Functions

You can use LemonTizer.get_spacy_object to get the underlying spaCy Language object, initialised with the currently loaded model, should you wish to use functions not exposed by the wrapper. It is listed as a property in the reference below, so it is accessed without parentheses.
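
As a hedged illustration of what the underlying object can be used for, the sketch below reads part-of-speech tags, which the wrapper's output does not include. describe_tokens is a hypothetical helper, not part of lemon-tizer; it relies only on the standard spaCy token attributes text, lemma_ and pos_. The reference section lists get_spacy_object as a property, so the commented usage assumes attribute access; if it is a method in your installed version, add parentheses.

```python
def describe_tokens(nlp, text):
    """Return (text, lemma, part-of-speech) for each token of a spaCy pipeline run.

    `nlp` is any callable that behaves like a spaCy Language object, i.e.
    calling it on a string yields tokens with .text, .lemma_ and .pos_.
    """
    doc = nlp(text)
    return [(token.text, token.lemma_, token.pos_) for token in doc]


# Assumed usage with lemon-tizer installed and a model loaded:
#
#     from lemon_tizer import LemonTizer
#     lemma = LemonTizer(language="en", model_size="sm")
#     nlp = lemma.get_spacy_object
#     print(describe_tokens(nlp, "I am going to the shops."))
```

Passing the pipeline in as an argument keeps the helper independent of how the model was loaded.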

Public Functions and Properties

def init_model(language: str, model_size: str) -> None:
    """Loads model based upon specified language and model size.
    If model hasn't been downloaded, it will download it prior to the loading step.
    Also loads default settings for lemmatization.

    Args:
        language: Lower case two letter code matching language codes in https://spacy.io/models
        model_size: Lower case two letter code matching sm, md, lg, etc.
            in https://spacy.io/models
    """

def set_lemma_settings(filter_out_non_alpha: bool = False,
    filter_out_common: bool = False,
    convert_input_to_lower: bool = False,
    convert_output_to_lower: bool = False,
    return_just_first_word_of_lemma: bool = False) -> None:
    """Sets various settings for lemmatization.

    Args:
        filter_out_non_alpha: (bool) Will filter out lemmatizations that contain non-alpha
            characters. Useful for removing punctuation, etc. Note: lemmatizations with an
            apostrophe will also be filtered if this is set!
        filter_out_common: (bool) Will filter out common words such as "the, and, she". Useful
            when doing frequency analysis.
        convert_input_to_lower: (bool) Forces the input string to lowercase. May be useful to
            increase accuracy in some languages.
        convert_output_to_lower: (bool) Optionally force the lemmatization to be lower case.
        return_just_first_word_of_lemma: (bool) Some lemmatizations will return multiple words
            for a given input token. Setting this to True will return just the first word.
    """

def lemmatize_sentence(input_str: str) -> list[dict[str, str]]:
    """Lemmatizes a sentence (can also be a word, paragraph, etc.)

    Args:
        input_str: String containing the data to be lemmatized

    Returns:
        List of dictionaries, each with the original token as the key (str) and the
        lemmatized token as the value (str)
    """

def find_model_name(language: str, model_size: str) -> str:
    """Looks up models compatible with the installed version of spacy, based upon language code
    and model size.

    Args:
        language: Lower case two letter code matching language codes in https://spacy.io/models
        model_size: Lower case two letter code matching sm, md, lg, etc.
            in https://spacy.io/models

    Returns:
        spacy model name (str)
    """

def download_model(model_name: str) -> None:
    """Downloads spacy model ("trained pipeline") to local storage.

    Use the method is_model_installed() if you need to check if the model has already been
    downloaded.

    Use the method find_model_name() to get available models based upon language and model size.

    Args:
        model_name: should match a model in the spacy documentation,
            see https://spacy.io/models
    """

def get_available_models() -> list[str]:
    """Gets the list of available pre-trained models for the installed version of spacy.

    Returns:
        List of strings with the names of spacy trained models
    """

def is_model_installed(model_name: str) -> bool:
    """Checks whether a model has already been downloaded.

    Returns:
        True if model is found in local storage, otherwise False
    """

@property
def get_current_model_name() -> str:
    """
    Returns:
        Name of currently loaded model as a str
    """

@property
def get_spacy_object() -> spacy.language.Language:
    """
    Returns:
        Returns the spacy Language object aka "model" for external processing
    """

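The model-management functions above can be combined to fetch a model ahead of time. init_model already downloads missing models automatically, so the following is only a sketch of how the lower-level pieces fit together, for example to pre-fetch models before going offline. ensure_model is a hypothetical helper, and it assumes the three functions are callable as methods on a LemonTizer instance with the signatures listed above.

```python
def ensure_model(lemma, language, model_size):
    """Resolve a model name, download the model if it is missing, and return the name.

    `lemma` is assumed to be a LemonTizer instance (or any object exposing
    find_model_name, is_model_installed and download_model as listed above).
    """
    model_name = lemma.find_model_name(language, model_size)
    if not lemma.is_model_installed(model_name):
        lemma.download_model(model_name)
    return model_name


# Assumed usage with lemon-tizer installed:
#
#     from lemon_tizer import LemonTizer
#     lemma = LemonTizer(language="en", model_size="sm")
#     for code in ("de", "fr"):
#         print(ensure_model(lemma, code, "sm"))
```

Checking is_model_installed first avoids repeating a download for a model that is already in local storage.
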
Supported languages

The supported languages are determined by the installed version of spaCy; see https://spacy.io/models for the current list.

At the time of writing, the following languages are supported:

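| Abbreviation | Language Name |
| --- | --- |
| ca | Catalan |
| zh | Chinese |
| hr | Croatian |
| da | Danish |
| nl | Dutch |
| en | English |
| fi | Finnish |
| fr | French |
| de | German |
| el | Greek |
| it | Italian |
| ja | Japanese |
| ko | Korean |
| lt | Lithuanian |
| mk | Macedonian |
| xx | Multi-language |
| nb | Norwegian Bokmål |
| pl | Polish |
| pt | Portuguese |
| ro | Romanian |
| ru | Russian |
| sl | Slovenian |
| es | Spanish |
| sv | Swedish |
| uk | Ukrainian |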
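
Because every language in the table is selected with the same two constructor arguments, one loop can cover several languages. The sketch below is an illustration under assumptions: lemmatize_samples is a hypothetical helper, the class is passed in as a parameter so the function does not depend on which models are installed, and "sm" is assumed to be available for each language used.

```python
def lemmatize_samples(lemmatizer_cls, samples, model_size="sm"):
    """Lemmatize one sample text per language.

    `lemmatizer_cls` is assumed to be LemonTizer (or a compatible class taking
    `language` and `model_size` keyword arguments). `samples` maps a two-letter
    language code to the text to lemmatize.
    """
    results = {}
    for language, text in samples.items():
        lemma = lemmatizer_cls(language=language, model_size=model_size)
        results[language] = lemma.lemmatize_sentence(text)
    return results


# Assumed usage with lemon-tizer installed:
#
#     from lemon_tizer import LemonTizer
#     samples = {"en": "I am going to the shops.",
#                "de": "Ich gehe in die Geschäfte."}
#     print(lemmatize_samples(LemonTizer, samples))
```

Only the language code changes between iterations, which is what lets a script written for one language be reused for others.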

Acknowledgements

Unless otherwise noted, all materials within this repository are Copyright (C) 2024 Jonathan Fox.
