A lightweight, fast English lemmatizer

LightLemma

A lightweight, fast English lemmatizer and stemmer. LightLemma focuses on providing high-performance text normalization for English text while maintaining a minimal footprint.

Introduction to Lemmatization

Lemmatization is the process of reducing words to their base or dictionary form (lemma). This process uses morphological analysis and dictionary lookups to transform words into their canonical forms. For example:

  • "running" → "run"
  • "better" → "good"
  • "studies" → "study"
  • "am", "are", "is" → "be"

Unlike stemming, lemmatization takes a word's part of speech and, where available, its context into account to produce linguistically valid results. Because it relies on a dictionary, the output is a real word rather than a truncated stem.
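The dictionary-plus-rules idea can be sketched in a few lines. This is a toy illustration only and not LightLemma's implementation: the names IRREGULAR, VOCABULARY, SUFFIX_RULES and toy_lemmatize are hypothetical, and the word lists are deliberately tiny.

```python
# Toy illustration only: NOT LightLemma's implementation or data.
# All names and word lists here are hypothetical.

# Irregular forms are resolved by direct lookup.
IRREGULAR = {"am": "be", "are": "be", "is": "be", "better": "good"}

# A candidate lemma is accepted only if it is a known word.
VOCABULARY = {"run", "study", "cat", "good", "be"}

# (suffix, replacement) pairs, tried in order.
SUFFIX_RULES = [("ies", "y"), ("ing", ""), ("s", "")]


def toy_lemmatize(word: str) -> str:
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    if word in VOCABULARY:
        return word
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)] + replacement
            # Undo consonant doubling, e.g. "runn" -> "run".
            if (
                candidate not in VOCABULARY
                and len(candidate) > 2
                and candidate[-1] == candidate[-2]
            ):
                candidate = candidate[:-1]
            if candidate in VOCABULARY:
                return candidate
    # Unknown word: return it unchanged rather than invent a form.
    return word


print(toy_lemmatize("running"))  # run
print(toy_lemmatize("studies"))  # study
print(toy_lemmatize("better"))   # good
```

The vocabulary check is what separates this from plain suffix stripping: a rule only fires when its result is a known word.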

The Difference Between Simple Lemmatization and Stemming

While both lemmatization and stemming aim to reduce words to their base form, they work differently:

Lemmatization:

  • Produces linguistically valid words
  • Uses dictionary lookup and morphological analysis
  • Considers word context and part of speech
  • More accurate but typically slower
  • Example: "studies" → "study"

Stemming:

  • Uses rule-based algorithms to strip affixes
  • Faster but can produce non-words
  • Doesn't consider word context
  • More aggressive reduction
  • Example: "studies" → "studi"

Choose lemmatization when you need linguistically accurate results, and stemming when you need fast, approximate word normalization.
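To see why a stemmer yields "studi", it helps to look at a single rule step. The sketch below implements only step 1a of the classic Porter algorithm (plural endings); the function name porter_step_1a is chosen for illustration, and this is not LightLemma's code.

```python
def porter_step_1a(word: str) -> str:
    """Apply only step 1a of the Porter algorithm (plural endings)."""
    if word.endswith("sses"):
        return word[:-2]  # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]  # studies -> studi
    if word.endswith("ss"):
        return word       # caress -> caress
    if word.endswith("s"):
        return word[:-1]  # cats -> cat
    return word


print(porter_step_1a("studies"))   # studi
print(porter_step_1a("caresses"))  # caress
print(porter_step_1a("cats"))      # cat
```

The rule never checks whether its result is a word, which is why it is fast and why it can produce non-words.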

Features

  • Fast and lightweight English lemmatization
  • Porter Stemmer implementation
  • Simple, easy-to-use API
  • No external dependencies
  • Optimized for performance
  • Planned integration with contraction_fix and emoticon_fix

Installation

pip install lightlemma

Usage

from lightlemma import lemmatize, stem

# Simple word lemmatization
word = "running"
lemma = lemmatize(word)
print(lemma)  # Output: run

# Process multiple words with lemmatization
words = ["cats", "running", "better", "studies"]
lemmas = [lemmatize(word) for word in words]
print(lemmas)  # Output: ['cat', 'run', 'good', 'study']

# Using the Porter Stemmer
word = "running"
stemmed = stem(word)
print(stemmed)  # Output: run

# Compare lemmatization vs stemming
words = ["studies", "universal", "maximum"]
lemmas = [lemmatize(word) for word in words]
stems = [stem(word) for word in words]
print(lemmas)  # Output: ['study', 'universal', 'maximum']
print(stems)   # Output: ['studi', 'univers', 'maxim']
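
The functions above work on single words. To normalize a whole sentence, one option is a small wrapper that tokenizes the text and applies a word-level function to each token. The helper normalize_text below is hypothetical and not part of LightLemma; its regex tokenizer is a simplification that keeps only ASCII letters and apostrophes.

```python
import re
from typing import Callable, List


def normalize_text(text: str, normalizer: Callable[[str], str]) -> List[str]:
    """Lowercase the text, split it into word tokens, and normalize each one."""
    # Simplified tokenizer (assumption): runs of a-z and apostrophes only.
    tokens = re.findall(r"[a-z']+", text.lower())
    return [normalizer(token) for token in tokens]


# Demonstrated with a stand-in normalizer so the snippet runs on its own.
print(normalize_text("The cats are running!", str.upper))
# ['THE', 'CATS', 'ARE', 'RUNNING']
```

With LightLemma installed, lemmatize or stem can be passed as the normalizer argument, for example normalize_text("The cats are running", lemmatize).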

Performance

LightLemma is designed to be faster and more memory-efficient than existing solutions while maintaining high accuracy for English text.
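
No benchmark figures are given here, so throughput is best measured on your own data. The sketch below uses the standard library's timeit module; the helper words_per_second is hypothetical, and str.lower stands in for the function under test so the snippet runs without LightLemma installed.

```python
import timeit
from typing import Callable, Iterable


def words_per_second(
    func: Callable[[str], str], words: Iterable[str], repeat: int = 5
) -> float:
    """Return the best observed throughput of func over the given words."""
    word_list = list(words)

    def run_once() -> None:
        for w in word_list:
            func(w)

    # Take the fastest of several runs to reduce noise from other processes.
    best = min(timeit.repeat(run_once, number=1, repeat=repeat))
    if best == 0:
        return float("inf")
    return len(word_list) / best


sample = ["running", "studies", "better"] * 10_000
rate = words_per_second(str.lower, sample)  # stand-in for lemmatize or stem
print(f"{rate:,.0f} words/second")
```

Passing lemmatize or stem instead of str.lower, and running the same helper against another library's function, gives a like-for-like comparison on the same word list.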

Future Features

  • Integration with contraction_fix for handling contractions
  • Integration with emoticon_fix for emoticon normalization
  • Support for additional text normalization features

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.


