A lightweight, fast English lemmatizer

LightLemma

LightLemma is a lightweight, fast English lemmatizer and stemmer. It focuses on high-performance normalization of English text while keeping a minimal footprint.

Introduction to Lemmatization

Lemmatization is the process of reducing words to their base or dictionary form (lemma). This process uses morphological analysis and dictionary lookups to transform words into their canonical forms. For example:

  • "running" → "run"
  • "better" → "good"
  • "studies" → "study"
  • "am", "are", "is" → "be"

Unlike stemming, lemmatization takes a word's part of speech and context into account, so its results are linguistically valid. Because it is dictionary-based, the output is a real word rather than a truncated stem.
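
To make the idea of "dictionary lookup plus morphological analysis" concrete, here is a minimal toy sketch. It is not LightLemma's implementation: the names `IRREGULAR`, `VOCABULARY`, `SUFFIX_RULES`, and `toy_lemmatize` are hypothetical, and the tables are tiny hand-written stand-ins for a real dictionary.

```python
# Toy illustration only: this is NOT how LightLemma is implemented.
# The tables below are tiny, hand-written stand-ins for a real dictionary.
IRREGULAR = {
    "am": "be", "are": "be", "is": "be",
    "better": "good",
}
VOCABULARY = {"run", "study", "cat", "be", "good"}

# (suffix, replacement) pairs, tried in order.
SUFFIX_RULES = [
    ("ies", "y"),   # studies -> study
    ("ing", ""),    # running -> runn (doubling is undone below)
    ("s", ""),      # cats -> cat
]


def toy_lemmatize(word: str) -> str:
    """Look up irregular forms first, then try suffix rules checked against VOCABULARY."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    if word in VOCABULARY:
        return word
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            # Undo consonant doubling when the candidate is not a known word.
            if (
                len(candidate) >= 2
                and candidate[-1] == candidate[-2]
                and candidate not in VOCABULARY
            ):
                candidate = candidate[:-1]
            # Accept a candidate only if the dictionary knows it.
            if candidate in VOCABULARY:
                return candidate
    return word  # unknown words pass through unchanged


print([toy_lemmatize(w) for w in ["running", "better", "studies", "is"]])
# ['run', 'good', 'study', 'be']
```

The dictionary check is what keeps the output a real word: a rule's result is discarded unless it appears in the vocabulary.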

The Difference Between Lemmatization and Stemming

While both lemmatization and stemming aim to reduce words to their base form, they work differently:

Lemmatization:

  • Produces linguistically valid words
  • Uses dictionary lookup and morphological analysis
  • Considers word context and part of speech
  • More accurate but typically slower
  • Example: "studies" → "study"

Stemming:

  • Uses rule-based algorithms to strip affixes
  • Faster but can produce non-words
  • Doesn't consider word context
  • More aggressive reduction
  • Example: "studies" → "studi"

Choose lemmatization when you need linguistically accurate results, and stemming when you need fast, approximate word normalization.
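
The "studies" → "studi" example comes from plain suffix stripping. As a rough sketch, the function below implements only step 1a (plural handling) of the published Porter algorithm. The name `porter_step_1a` is hypothetical, and this README does not confirm how LightLemma's own `stem` is structured internally.

```python
def porter_step_1a(word: str) -> str:
    """Apply step 1a of the Porter algorithm (plural suffixes) to a lowercase word."""
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # studies -> studi
    if word.endswith("ss"):
        return word        # caress -> caress (unchanged)
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word


print(porter_step_1a("studies"))  # studi
```

No dictionary is consulted at any point, which is why the rule is fast and also why it can return a non-word such as "studi".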

Features

  • Fast and lightweight English lemmatization
  • Porter Stemmer implementation
  • Simple, easy-to-use API
  • No external dependencies
  • Optimized for performance
  • Planned integration with contraction_fix and emoticon_fix (see Future Features)

Installation

pip install lightlemma

Usage

from lightlemma import lemmatize, stem

# Simple word lemmatization
word = "running"
lemma = lemmatize(word)
print(lemma)  # Output: run

# Process multiple words with lemmatization
words = ["cats", "running", "better", "studies"]
lemmas = [lemmatize(word) for word in words]
print(lemmas)  # Output: ['cat', 'run', 'good', 'study']

# Using the Porter Stemmer
word = "running"
stemmed = stem(word)
print(stemmed)  # Output: run

# Compare lemmatization vs stemming
words = ["studies", "universal", "maximum"]
lemmas = [lemmatize(word) for word in words]
stems = [stem(word) for word in words]
print(lemmas)  # Output: ['study', 'universal', 'maximum']
print(stems)   # Output: ['studi', 'univers', 'maxim']
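
The examples above work on single words. For whole sentences, one possible approach is a small wrapper that tokenizes the text and applies a word-level function to each token. The helper below is a sketch, not part of LightLemma: `normalize_text` and `_WORD_RE` are hypothetical names, and lowercasing before normalization is an assumption, since this README does not say how `lemmatize` treats capitalized input.

```python
import re
from functools import lru_cache
from typing import Callable, List

# Simple tokenizer: runs of ASCII letters, optionally with one internal apostrophe.
_WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")


def normalize_text(text: str, normalizer: Callable[[str], str]) -> List[str]:
    """Lowercase and tokenize text, then apply normalizer to each token."""
    # Cache results so a word that repeats in the text is normalized only once.
    cached = lru_cache(maxsize=None)(normalizer)
    return [cached(token) for token in _WORD_RE.findall(text.lower())]
```

With LightLemma installed, the word-level functions shown above can be passed in directly, for example `normalize_text("The cats are running", lemmatize)` or `normalize_text("The cats are running", stem)`.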

Performance

LightLemma is designed to be faster and more memory-efficient than existing solutions while maintaining high accuracy for English text.
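
This README gives no benchmark figures, so the claim is best checked on your own data. The helper below is a minimal timing sketch using only the standard library. `words_per_second` is a hypothetical name, and the function accepts any word-level callable, so it does not depend on LightLemma.

```python
import time
from typing import Callable, Iterable


def words_per_second(
    normalizer: Callable[[str], str],
    words: Iterable[str],
    repeats: int = 5,
) -> float:
    """Return the best observed throughput (words per second) over several runs."""
    word_list = list(words)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for word in word_list:
            normalizer(word)
        # Keep the fastest run to reduce noise from other processes.
        best = min(best, time.perf_counter() - start)
    return len(word_list) / best if best > 0 else float("inf")
```

To compare tools, call it with the same word list for each candidate, for example `words_per_second(lemmatize, corpus)` and `words_per_second(stem, corpus)`, where `corpus` is your own list of words. Results depend on hardware, Python version, and vocabulary.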

Future Features

  • Integration with contraction_fix for handling contractions
  • Integration with emoticon_fix for emoticon normalization
  • Support for additional text normalization features

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.
