A lightweight, fast English lemmatizer

LightLemma

A lightweight, fast English lemmatizer and stemmer. LightLemma focuses on providing high-performance text normalization for English text while maintaining a minimal footprint.

Introduction to Lemmatization

Lemmatization is the process of reducing words to their base or dictionary form (lemma). This process uses morphological analysis and dictionary lookups to transform words into their canonical forms. For example:

"running" → "run"
"better" → "good"
"studies" → "study"
"am", "are", "is" → "be"

Unlike stemming, lemmatization considers the context and part of speech of words to produce linguistically valid results. It uses a dictionary-based approach to ensure the output is always a real word.
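
The dictionary-plus-morphology idea described above can be sketched in a few lines. This is only an illustration, not LightLemma's implementation: `toy_lemmatize`, `IRREGULAR`, `VOCAB`, and `SUFFIX_RULES` are hypothetical names with hand-picked data, chosen to reproduce the examples listed earlier.

```python
# Illustrative only: a tiny dictionary-plus-rules lemmatizer.
# All tables below are hypothetical, hand-picked data, not LightLemma's.
IRREGULAR = {"better": "good", "am": "be", "are": "be", "is": "be"}

# Small vocabulary used to check that a rule produced a real word.
VOCAB = {"run", "study", "cat", "good", "be"}

# (suffix, replacement) pairs, tried in order.
SUFFIX_RULES = [("ies", "y"), ("ing", ""), ("s", "")]


def toy_lemmatize(word: str) -> str:
    """Return a lemma for `word`, or the lowercased word if nothing applies."""
    w = word.lower()
    # 1. Irregular forms are resolved by direct dictionary lookup.
    if w in IRREGULAR:
        return IRREGULAR[w]
    # 2. Otherwise try suffix rules, accepting only results found in VOCAB.
    for suffix, replacement in SUFFIX_RULES:
        if not w.endswith(suffix):
            continue
        candidate = w[: len(w) - len(suffix)] + replacement
        options = [candidate]
        # Undo consonant doubling: "runn" -> "run".
        if len(candidate) >= 2 and candidate[-1] == candidate[-2]:
            options.append(candidate[:-1])
        for option in options:
            if option in VOCAB:
                return option
    # 3. No rule produced a known word: keep the input unchanged.
    return w


print([toy_lemmatize(w) for w in ["running", "better", "studies", "is"]])
# ['run', 'good', 'study', 'be']
```

Because every rule result is checked against the vocabulary, a word such as "glass" is returned unchanged instead of being cut to the non-word "glas".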

The Difference Between Lemmatization and Stemming

While both lemmatization and stemming aim to reduce words to their base form, they work differently:

Lemmatization:

Produces linguistically valid words
Uses dictionary lookup and morphological analysis
Considers word context and part of speech
More accurate but typically slower
Example: "studies" → "study"

Stemming:

Uses rule-based algorithms to strip affixes
Faster but can produce non-words
Doesn't consider word context
More aggressive reduction
Example: "studies" → "studi"

Choose lemmatization when you need linguistically accurate results, and stemming when you need fast, approximate word normalization.
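
To make the contrast concrete, here is a minimal sketch of step 1a of the original Porter algorithm, which handles plural suffixes. It covers only that single step, and `porter_step_1a` is a name chosen for this example, not a LightLemma function. It shows why a stemmer can return a non-word such as "studi".

```python
def porter_step_1a(word: str) -> str:
    """Apply step 1a of the Porter algorithm to a lowercase word.

    Rules are tried in order and only the first match is applied:
    SSES -> SS, IES -> I, SS -> SS, S -> (removed).
    """
    if word.endswith("sses"):
        return word[:-2]  # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]  # studies -> studi
    if word.endswith("ss"):
        return word       # caress -> caress
    if word.endswith("s"):
        return word[:-1]  # cats -> cat
    return word


print([porter_step_1a(w) for w in ["studies", "cats", "caresses", "caress"]])
# ['studi', 'cat', 'caress', 'caress']
```

The rules look only at the characters of the word, with no dictionary and no part of speech. That is what makes stemming fast and also what lets it produce forms that are not real words.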

Features

Fast and lightweight English lemmatization
Porter Stemmer implementation
Flexible tokenization functionality
Simple, easy-to-use API
No external dependencies
Optimized for performance
Planned integration with contraction_fix and emoticon_fix (not yet available)

Installation

pip install lightlemma

Usage

from lightlemma import lemmatize, stem, tokenize, Tokenizer
from lightlemma import text_to_lemmas, text_to_stems

# Simple word lemmatization
word = "running"
lemma = lemmatize(word)
print(lemma)  # Output: "run"

# Process multiple words with lemmatization
words = ["cats", "running", "better", "studies"]
lemmas = [lemmatize(word) for word in words]
print(lemmas)  # Output: ["cat", "run", "good", "study"]

# Using the Porter Stemmer
word = "running"
stemmed = stem(word)
print(stemmed)  # Output: "run"

# Compare lemmatization vs stemming
words = ["studies", "universal", "maximum"]
lemmas = [lemmatize(word) for word in words]
stems = [stem(word) for word in words]
print(lemmas)  # Output: ["study", "universal", "maximum"]
print(stems)   # Output: ["studi", "univers", "maxim"]

# Using the tokenizer
text = "This is a simple example of tokenization!"
tokens = tokenize(text)
print(tokens)  # Output: ["this", "is", "a", "simple", "example", "of", "tokenization"]

# Advanced tokenization with custom options
tokenizer = Tokenizer(preserve_case=True, preserve_punctuation=True)
custom_tokens = tokenizer.tokenize(text)
print(custom_tokens)  # Output: ["This", "is", "a", "simple", "example", "of", "tokenization", "!"]

# Complete text processing pipeline - manual approach
text = "The cats are running faster than dogs."
tokens = tokenize(text)
lemmas = [lemmatize(token) for token in tokens]
print(lemmas)  # Output: ["the", "cat", "be", "run", "fast", "than", "dog"]

# Using direct text-to-normalized-tokens functions
text = "The cats are running faster than dogs."
# Convert text directly to lemmatized tokens in one step
lemmatized_tokens = text_to_lemmas(text)
print(lemmatized_tokens)  # Output: ["the", "cat", "be", "run", "fast", "than", "dog"]

# Convert text directly to stemmed tokens in one step
stemmed_tokens = text_to_stems(text)
print(stemmed_tokens)  # Output: ["the", "cat", "are", "run", "faster", "than", "dog"]

# Direct conversion with case preservation after lemmatization
lemmatized_tokens = text_to_lemmas(text, preserve_original_case=True)
print(lemmatized_tokens)  # Output: ["The", "cat", "be", "run", "fast", "than", "dog"]

# Direct conversion with case preservation after stemming
stemmed_tokens = text_to_stems("The RUNNING cats", preserve_original_case=True)
print(stemmed_tokens)  # Output: ["The", "RUN", "cat"]

Tokenization Options

The tokenizer provides several options for customizing the tokenization process:

pattern: Custom regex pattern for tokenization
preserve_case: Whether to preserve case of tokens
preserve_urls: Whether to keep URLs as single tokens
preserve_emails: Whether to keep email addresses as single tokens
preserve_numbers: Whether to keep numbers as tokens
preserve_punctuation: Whether to include punctuation as separate tokens
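
As a rough illustration of how options like these can interact, the sketch below implements a small regex tokenizer with `pattern`, `preserve_case`, and `preserve_numbers` parameters. It is a stand-in written for this example: `sketch_tokenize` and its default pattern are assumptions and may differ from what LightLemma's `Tokenizer` does internally.

```python
import re

# Hypothetical default: words (with an optional apostrophe part) or numbers.
DEFAULT_PATTERN = r"[A-Za-z]+(?:'[A-Za-z]+)?|\d+(?:\.\d+)?"


def sketch_tokenize(text, pattern=DEFAULT_PATTERN,
                    preserve_case=False, preserve_numbers=True):
    """Split `text` into tokens matching `pattern`, then apply the flags."""
    # finditer + group(0) returns whole matches even if the pattern has groups.
    tokens = [m.group(0) for m in re.finditer(pattern, text)]
    if not preserve_numbers:
        # Drop tokens that start with a digit.
        tokens = [t for t in tokens if not t[:1].isdigit()]
    if not preserve_case:
        tokens = [t.lower() for t in tokens]
    return tokens


print(sketch_tokenize("This is a simple example of tokenization!"))
# ['this', 'is', 'a', 'simple', 'example', 'of', 'tokenization']
print(sketch_tokenize("Version 2 costs 3.50", preserve_case=True))
# ['Version', '2', 'costs', '3.50']
print(sketch_tokenize("snake_case name", pattern=r"\w+"))
# ['snake_case', 'name']
```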

Token Positioning Behavior

The tokenizer processes different token types in a specific order, which affects their position in the final token list:

Option	Default	Description	Position in Result
`preserve_urls`	`False`	Keeps URLs as single tokens instead of breaking them into components	Beginning of token list
`preserve_emails`	`False`	Keeps email addresses as single tokens	After URLs, before words
`preserve_numbers`	`True`	Includes numeric tokens in the output	In their original position among words
`preserve_punctuation`	`False`	Includes punctuation marks as separate tokens	End of token list

This ordering suits NLP tasks where grouping tokens by type matters more than preserving their original sequence. For example, "Hello, world!" with preserve_punctuation=True tokenizes to ["hello", "world", ",", "!"]: both punctuation marks move to the end of the list instead of staying next to the words they followed.
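
The grouping behavior can be mimicked with a short, self-contained sketch that collects each token type separately and concatenates the groups in the documented order. `grouped_tokenize` and the regular expressions below are assumptions made for illustration, not LightLemma's actual patterns.

```python
import re

# Simplified, hypothetical patterns for each token type.
URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
WORD_RE = re.compile(r"[A-Za-z0-9]+")
PUNCT_RE = re.compile(r"[^\w\s]")


def grouped_tokenize(text, preserve_urls=False, preserve_emails=False,
                     preserve_punctuation=False):
    """Return tokens grouped as URLs, emails, words/numbers, punctuation."""
    urls, emails = [], []
    if preserve_urls:
        urls = URL_RE.findall(text)
        text = URL_RE.sub(" ", text)      # remove URLs before word matching
    if preserve_emails:
        emails = EMAIL_RE.findall(text)
        text = EMAIL_RE.sub(" ", text)    # remove emails before word matching
    words = [w.lower() for w in WORD_RE.findall(text)]
    punct = PUNCT_RE.findall(text) if preserve_punctuation else []
    return urls + emails + words + punct


print(grouped_tokenize("Hello, world!", preserve_punctuation=True))
# ['hello', 'world', ',', '!']
print(grouped_tokenize("Mail bob@example.com or see https://example.com now",
                       preserve_urls=True, preserve_emails=True))
# ['https://example.com', 'bob@example.com', 'mail', 'or', 'see', 'now']
```

In the second call the URL comes first even though it appears late in the input, which mirrors the "Position in Result" column of the table.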

Text Processing Pipeline Functions

LightLemma provides convenient functions that process text directly to normalized tokens in a single step:

text_to_lemmas(text, tokenizer_options=None, preserve_original_case=False): Converts raw text directly to lemmatized tokens
text_to_stems(text, tokenizer_options=None, preserve_original_case=False): Converts raw text directly to stemmed tokens

These functions accept the following parameters:

text: The input text to process
tokenizer_options: Optional dictionary of tokenizer settings for customizing tokenization
preserve_original_case: If True, maintains the original case pattern of tokens after lemmatization/stemming

The case preservation feature allows you to maintain the original capitalization of words even after they've been lemmatized or stemmed. This is particularly useful for proper nouns, title case text, or when you need to preserve the original formatting of the text.
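
One plausible way to carry a case pattern over to a normalized token is sketched below. `match_case` is a hypothetical helper written for this example; the rules LightLemma applies internally may differ, but this version reproduces the outputs shown in the Usage section.

```python
def match_case(original: str, normalized: str) -> str:
    """Re-apply the case pattern of `original` to `normalized`."""
    # All-caps words (longer than one letter) stay all caps.
    if len(original) > 1 and original.isupper():
        return normalized.upper()
    # Capitalized words keep an initial capital.
    if original[:1].isupper():
        return normalized[:1].upper() + normalized[1:]
    # Everything else is left as produced by the stemmer or lemmatizer.
    return normalized


originals = ["The", "RUNNING", "cats"]
stems = ["the", "run", "cat"]
print([match_case(o, s) for o, s in zip(originals, stems)])
# ['The', 'RUN', 'cat']
```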

Performance

LightLemma is designed to be faster and more memory-efficient than existing solutions while maintaining high accuracy for English text.
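
The speed claim can be checked on your own data with a small timing harness like the sketch below. `words_per_second` is a hypothetical helper, and `str.lower` stands in for `lemmatize` so that the snippet runs without the package installed. Swap in `lightlemma.lemmatize` or another library's function to compare them.

```python
import timeit


def words_per_second(func, words, repeat=3, number=10):
    """Return the best observed throughput of `func` over `words`."""
    timer = timeit.Timer(lambda: [func(w) for w in words])
    # Each measurement runs the whole list `number` times; keep the fastest.
    best = min(timer.repeat(repeat=repeat, number=number))
    return len(words) * number / best


sample = ["cats", "running", "better", "studies"] * 1000
rate = words_per_second(str.lower, sample)  # stand-in for lemmatize
print(f"{rate:,.0f} words/second")
```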

Future Features

Integration with contraction_fix for handling contractions
Integration with emoticon_fix for emoticon normalization
Support for additional text normalization features

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Release history

0.1.6: Aug 7, 2025
0.1.5: Jul 17, 2025 (this version)
0.1.4: Jul 17, 2025
0.1.3: Jul 17, 2025
0.1.2: Apr 14, 2025
0.1.1: Apr 13, 2025

Download files

Source Distribution

lightlemma-0.1.5.tar.gz (22.6 kB)

Uploaded Jul 17, 2025 Source

Built Distribution

lightlemma-0.1.5-py3-none-any.whl (16.9 kB)

Uploaded Jul 17, 2025 Python 3

File details

Details for the file lightlemma-0.1.5.tar.gz.

File metadata

Download URL: lightlemma-0.1.5.tar.gz
Upload date: Jul 17, 2025
Size: 22.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.11.13

File hashes

Hashes for lightlemma-0.1.5.tar.gz
Algorithm	Hash digest
SHA256	`7924aaa1b7338cae3bda851686b38bc9100fccd4034d5520997dbfd47bf95e6a`
MD5	`9e9a46ac2689092d14fda5c174080bb6`
BLAKE2b-256	`e7d6c84cbf2189e566dd2770e303ef1969691ace59e60f98134fd5f3d60a23e0`

File details

Details for the file lightlemma-0.1.5-py3-none-any.whl.

File metadata

Download URL: lightlemma-0.1.5-py3-none-any.whl
Upload date: Jul 17, 2025
Size: 16.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.11.13

File hashes

Hashes for lightlemma-0.1.5-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b145754e3a3dc3e14a1eaddcc97146b8c438d05f0f45f5f7262f1ef9a42081b5`
MD5	`a27febb786dfbe17091d0399a3ed679c`
BLAKE2b-256	`5c8f3827e37504298ab22f6a1a7bf5cb8f5ad6df62dc7c8f11389ad1e9c2fa87`
