LightLemma
A lightweight, fast English lemmatizer and stemmer. LightLemma focuses on providing high-performance text normalization for English text while maintaining a minimal footprint.
Introduction to Lemmatization
Lemmatization is the process of reducing words to their base or dictionary form (lemma). This process uses morphological analysis and dictionary lookups to transform words into their canonical forms. For example:
- "running" → "run"
- "better" → "good"
- "studies" → "study"
- "am", "are", "is" → "be"
Unlike stemming, lemmatization takes a word's part of speech (and, where available, its context) into account to produce linguistically valid results. Because it relies on a dictionary, the output is a real word rather than a truncated stem.
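To make the "dictionary lookup plus morphological analysis" idea concrete, here is a minimal sketch of a toy lemmatizer. It is not LightLemma's implementation: the names `IRREGULAR`, `VOCABULARY`, `candidate_lemmas`, and `toy_lemmatize` are hypothetical, and the word lists are tiny illustrative samples.

```python
from typing import Iterator

# Illustrative data only; a real lemmatizer ships far larger tables.
IRREGULAR = {"better": "good", "am": "be", "are": "be", "is": "be"}
VOCABULARY = {"run", "study", "cat", "good", "be"}


def candidate_lemmas(word: str) -> Iterator[str]:
    """Yield possible base forms produced by a few simple suffix rules."""
    if word.endswith("ies"):
        yield word[:-3] + "y"          # studies -> study
    if word.endswith("ing"):
        base = word[:-3]
        yield base                     # running -> runn
        if len(base) >= 2 and base[-1] == base[-2]:
            yield base[:-1]            # runn -> run (undo consonant doubling)
    if word.endswith("s"):
        yield word[:-1]                # cats -> cat


def toy_lemmatize(word: str) -> str:
    """Check irregular forms first, then accept a rule result only if it is a known word."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    for candidate in candidate_lemmas(word):
        if candidate in VOCABULARY:
            return candidate
    return word  # unknown words are returned unchanged


print([toy_lemmatize(w) for w in ["running", "better", "studies", "is"]])
```

The vocabulary check is what keeps the output a real word: a rule result such as "runn" is discarded because it is not in the dictionary.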
The Difference Between Simple Lemmatization and Stemming
While both lemmatization and stemming aim to reduce words to their base form, they work differently:
Lemmatization:
- Produces linguistically valid words
- Uses dictionary lookup and morphological analysis
- Considers word context and part of speech
- More accurate but typically slower
- Example: "studies" → "study"
Stemming:
- Uses rule-based algorithms to strip affixes
- Faster but can produce non-words
- Doesn't consider word context
- More aggressive reduction
- Example: "studies" → "studi"
Choose lemmatization when you need linguistically accurate results, and stemming when you need fast, approximate word normalization.
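The contrast can be sketched with a deliberately naive suffix stripper. This is not the Porter algorithm and not LightLemma's code; `SUFFIXES` and `naive_stem` are hypothetical names used only to show why rule-only stripping can yield non-words.

```python
# Suffixes are tried in order; the first match is removed.
SUFFIXES = ("ing", "es", "s")


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ["studies", "running", "cats"]:
    print(w, "->", naive_stem(w))
```

With no dictionary to validate against, "studies" becomes "studi" and "running" becomes "runn", which is the trade-off described above: fast and simple, but the results are only approximate.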
Features
- Fast and lightweight English lemmatization
- Porter Stemmer implementation
- Simple, easy-to-use API
- No external dependencies
- Optimized for performance
- Planned integration with contraction_fix and emoticon_fix (see Future Features)
Installation
```bash
pip install lightlemma
```
Usage
```python
from lightlemma import lemmatize, stem

# Simple word lemmatization
word = "running"
lemma = lemmatize(word)
print(lemma)  # Output: run

# Process multiple words with lemmatization
words = ["cats", "running", "better", "studies"]
lemmas = [lemmatize(word) for word in words]
print(lemmas)  # Output: ['cat', 'run', 'good', 'study']

# Using the Porter Stemmer
word = "running"
stemmed = stem(word)
print(stemmed)  # Output: run

# Compare lemmatization vs stemming
words = ["studies", "universal", "maximum"]
lemmas = [lemmatize(word) for word in words]
stems = [stem(word) for word in words]
print(lemmas)  # Output: ['study', 'universal', 'maximum']
print(stems)  # Output: ['studi', 'univers', 'maxim']
```
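Since `lemmatize` and `stem` operate on single words, running text has to be split into tokens first. The sketch below shows one possible way to do that; `TOKEN_RE` and `normalize_text` are hypothetical helpers, not part of LightLemma, and the regex tokenizer and lowercasing step are assumptions rather than documented requirements.

```python
import re
from typing import Callable, List

# Assumption: a token is a run of ASCII letters; punctuation and digits are dropped.
TOKEN_RE = re.compile(r"[A-Za-z]+")


def normalize_text(text: str, normalizer: Callable[[str], str]) -> List[str]:
    """Tokenize text, lowercase each token, and apply a word-level normalizer."""
    return [normalizer(token.lower()) for token in TOKEN_RE.findall(text)]


# An identity function stands in for a real normalizer so the sketch runs anywhere.
print(normalize_text("The cats are running!", lambda w: w))
```

With LightLemma installed, the stand-in can be swapped for the real functions, for example `normalize_text(text, lemmatize)` or `normalize_text(text, stem)`.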
Performance
LightLemma is designed to be faster and more memory-efficient than existing solutions while maintaining high accuracy for English text.
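No benchmark figures are given here, so one way to check the speed claim on your own data is a small timing harness. This is a rough sketch: `words_per_second` is a hypothetical helper, and the result depends heavily on the machine and the word list.

```python
import timeit
from typing import Callable, Sequence


def words_per_second(func: Callable[[str], str], words: Sequence[str], repeat: int = 5) -> float:
    """Return the best observed throughput of func over the given words."""

    def run() -> None:
        for word in words:
            func(word)

    # Take the fastest of several runs to reduce noise from other processes.
    best = min(timeit.repeat(run, number=1, repeat=repeat))
    return len(words) / best if best > 0 else float("inf")


# str.lower is only a stand-in; pass lemmatize or stem to measure LightLemma itself.
sample = ["running", "studies", "better", "cats"] * 1000
print(f"{words_per_second(str.lower, sample):,.0f} words/s")
```

Running the same harness against another lemmatizer on the same word list gives a like-for-like comparison.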
Future Features
- Integration with contraction_fix for handling contractions
- Integration with emoticon_fix for emoticon normalization
- Support for additional text normalization features
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
This project is licensed under the MIT License - see the LICENSE file for details.
File details
Details for the file lightlemma-0.1.1.tar.gz.
File metadata
- Download URL: lightlemma-0.1.1.tar.gz
- Upload date:
- Size: 12.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 70d6cf20698f14e98d85ca53d307dbb2172f5c23b007b1cba4d04960347f0d68 |
| MD5 | e5962b84c56cee046185f9d15fa4a2e3 |
| BLAKE2b-256 | d1c8cb322be5658f4a8a3e4e53d9c95131a79c400b6a7559f092b3951c052063 |
File details
Details for the file lightlemma-0.1.1-py3-none-any.whl.
File metadata
- Download URL: lightlemma-0.1.1-py3-none-any.whl
- Upload date:
- Size: 10.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ba6374973339a4878bb177b7054a66385991d530827abd5969e318f85fb1d61a |
| MD5 | ae624f9142f332526148f305ddcfd687 |
| BLAKE2b-256 | 3229e3a7bd4fa39307d7cefe3735289e46e625c21aa1810b240b54db27b6c1e2 |