
Get list of common stop words in various languages in Python

Project description


Overview

A Python library providing curated lists of stop words across 34+ languages. Stop words are common words (like “the”, “is”, “at”) that are typically filtered out in natural language processing and text analysis tasks.

Key Features:

  • 34+ Languages - Extensive language support.

  • Performance - Built-in caching for fast repeated access.

  • Flexible - Custom filtering system for advanced use cases.

  • Zero Dependencies - Lightweight with no external requirements.

Available Languages

This library bundles every language available in the upstream stop word collection at https://github.com/Alir3z4/stop-words

Each language is identified by both its ISO 639-1 language code (e.g., en) and full name (e.g., english).

Installation

Via pip (Recommended):

$ pip install stop-words

Via Git:

$ git clone --recursive https://github.com/Alir3z4/python-stop-words.git
$ cd python-stop-words
$ pip install -e .

Requirements:

  • Python 3 - any release that has not reached end-of-life.

Quick Start

Basic Usage

from stop_words import get_stop_words

# Get English stop words using language code
stop_words = get_stop_words('en')

# Or use the full language name
stop_words = get_stop_words('english')

# Use in text processing
text = "The quick brown fox jumps over the lazy dog"
words = text.lower().split()
filtered_words = [word for word in words if word not in stop_words]
print(filtered_words)  # ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']

Safe Loading

Use safe_get_stop_words() when you’re not sure if a language is supported:

from stop_words import safe_get_stop_words

# Returns empty list instead of raising an exception
stop_words = safe_get_stop_words('klingon')  # Returns []

# Works normally with supported languages
stop_words = safe_get_stop_words('fr')  # Returns French stop words

Advanced Usage

Checking Available Languages

from stop_words import AVAILABLE_LANGUAGES, LANGUAGE_MAPPING

# List all available languages
print(AVAILABLE_LANGUAGES)
# ['arabic', 'bulgarian', 'catalan', ...]

# View language code mappings
print(LANGUAGE_MAPPING)
# {'en': 'english', 'fr': 'french', ...}

Caching Control

By default, stop words are cached for performance. You can control this behavior:

from stop_words import get_stop_words, STOP_WORDS_CACHE

# Disable caching for this call
stop_words = get_stop_words('en', cache=False)

# Clear the cache manually
STOP_WORDS_CACHE.clear()

# Check what's cached
print(STOP_WORDS_CACHE.keys())  # ['english', 'french', ...]

Custom Filters

Apply custom transformations to stop words using the filter system:

from stop_words import get_stop_words, add_filter, remove_filter

# Add a global filter (applies to all languages)
def remove_short_words(words, language):
    """Remove words shorter than 3 characters."""
    return [w for w in words if len(w) >= 3]

add_filter(remove_short_words)
stop_words = get_stop_words('en', cache=False)

# Add a language-specific filter
def uppercase_words(words):
    """Convert all words to uppercase."""
    return [w.upper() for w in words]

add_filter(uppercase_words, language='english')
stop_words = get_stop_words('en', cache=False)

# Remove a filter when done
remove_filter(uppercase_words, language='english')

Note: Filters only apply to newly loaded stop words, not cached ones. Use cache=False or clear the cache to apply new filters.

Practical Examples

Text Preprocessing

from stop_words import get_stop_words
import re

def preprocess_text(text, language='en'):
    """Clean and filter text for NLP tasks."""
    stop_words = set(get_stop_words(language))

    # Convert to lowercase and extract words
    words = re.findall(r'\b\w+\b', text.lower())

    # Remove stop words
    filtered_words = [w for w in words if w not in stop_words]

    return filtered_words

text = "The quick brown fox jumps over the lazy dog"
print(preprocess_text(text))
# ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']

Multilingual Processing

from stop_words import get_stop_words

def filter_multilingual_text(texts_dict):
    """Process texts in multiple languages.

    Args:
        texts_dict: Dictionary mapping language codes to text strings

    Returns:
        Dictionary with filtered words for each language
    """
    results = {}

    for lang_code, text in texts_dict.items():
        stop_words = set(get_stop_words(lang_code))
        words = text.lower().split()
        filtered = [w for w in words if w not in stop_words]
        results[lang_code] = filtered

    return results

texts = {
    'en': 'The cat is on the table',
    'fr': 'Le chat est sur la table',
    'es': 'El gato está en la mesa'
}

print(filter_multilingual_text(texts))

Keyword Extraction

from stop_words import get_stop_words
from collections import Counter
import re

def extract_keywords(text, language='en', top_n=10):
    """Extract the most common meaningful words from text."""
    stop_words = set(get_stop_words(language))

    # Extract words and filter
    words = re.findall(r'\b\w+\b', text.lower())
    meaningful_words = [w for w in words if w not in stop_words and len(w) > 2]

    # Count and return top keywords
    word_counts = Counter(meaningful_words)
    return word_counts.most_common(top_n)

article = """
Python is a high-level programming language. Python is known for its
simplicity and readability. Many developers choose Python for data science.
"""

keywords = extract_keywords(article)
print(keywords)
# [('python', 3), ('high', 1), ('level', 1), ...]
# (note: the \w+ regex splits "high-level" into two words)

API Reference

Functions

get_stop_words(language, *, cache=True)

Load stop words for a specified language.

Parameters:

  • language (str): Language code (e.g., ‘en’) or full name (e.g., ‘english’)

  • cache (bool, optional): Enable caching. Defaults to True.

Returns:

  • list[str]: List of stop words

Raises:

  • StopWordError: If language is unavailable or files are unreadable

Example:

stop_words = get_stop_words('en')
stop_words = get_stop_words('french', cache=False)

safe_get_stop_words(language)

Safely load stop words, returning empty list on error.

Parameters:

  • language (str): Language code or full name

Returns:

  • list[str]: Stop words, or empty list if unavailable

Example:

stop_words = safe_get_stop_words('unknown')  # Returns []

add_filter(func, language=None)

Register a filter function for stop word post-processing.

Parameters:

  • func (Callable): Filter function

  • language (str | None, optional): Language code or None for global filter

Filter Signatures:

  • Language-specific: func(stopwords: list[str]) -> list[str]

  • Global: func(stopwords: list[str], language: str) -> list[str]

Example:

def remove_short(words, lang):
    return [w for w in words if len(w) > 3]

add_filter(remove_short)  # Global filter

remove_filter(func, language=None)

Remove a previously registered filter.

Parameters:

  • func (Callable): The filter function to remove

  • language (str | None, optional): Language code or None

Returns:

  • bool: True if removed, False if not found

Example:

success = remove_filter(my_filter, language='english')

Constants

AVAILABLE_LANGUAGES

List of all supported language names.

['arabic', 'bulgarian', 'catalan', ...]

LANGUAGE_MAPPING

Dictionary mapping language codes to full names.

{'en': 'english', 'fr': 'french', 'de': 'german', ...}

STOP_WORDS_CACHE

Dictionary storing cached stop words. Can be manually cleared.

STOP_WORDS_CACHE.clear()  # Clear all cached data

Exceptions

StopWordError

Raised when a language is unavailable or files cannot be read.

try:
    stop_words = get_stop_words('invalid')
except StopWordError as e:
    print(f"Error: {e}")

Performance Tips

  1. Use caching - Keep cache=True (default) for repeated access to the same language

  2. Reuse stop word sets - Convert to set() once for O(1) lookup performance:

    stop_words_set = set(get_stop_words('en'))
    # Fast membership testing
    is_stop_word = 'the' in stop_words_set
  3. Preload languages - Load stop words during initialization, not in tight loops

  4. Use safe_get_stop_words - Avoid try/except overhead when language availability is uncertain

Troubleshooting

“Language unavailable” error

  • Check spelling and use either the language code or full name

  • Verify the language is in AVAILABLE_LANGUAGES

  • See the Available Languages table above

“File is unreadable” error

  • Ensure the package installed correctly: pip install --force-reinstall stop-words

  • Check file permissions in the installation directory

  • Verify the stop-words subdirectory exists in the package

Filters not applying

  • Filters only affect newly loaded stop words

  • Clear the cache: STOP_WORDS_CACHE.clear()

  • Use cache=False when testing filters

Performance issues

  • Ensure caching is enabled (default behavior)

  • Convert stop word lists to sets for faster lookups

  • Preload stop words outside of loops

Contributing

Contributions are welcome! Here’s how you can help:

  1. Add new languages - Submit stop word lists for unsupported languages via https://github.com/Alir3z4/stop-words

  2. Improve existing lists - Suggest additions or removals for existing languages via https://github.com/Alir3z4/stop-words

  3. Report bugs - Open issues on GitHub

  4. Submit PRs - Fix bugs or add features

Repository: https://github.com/Alir3z4/python-stop-words

License

This project is licensed under the BSD 3-Clause License. See LICENSE file for details.

Changelog

See ChangeLog.rst for version history.

Project details


Download files

Source distribution: stop_words-2025.11.4.tar.gz (68.6 kB)

  • Uploaded via: twine/6.1.0 on CPython/3.13.7 (Trusted Publishing)
  • SHA256: 0459072b54b11e43a6fb4c5b05bda87d2accfc4f14c1697974f3739af0f7b43d
  • MD5: 0f8bbd9b602626c4c1268bbb01f781e9
  • BLAKE2b-256: b7cb27ee3d3e0b7b1169269e83331c075b2dd3c4bcc1a005821174c32a273dc4
  • Provenance: attested by pypi.yml on Alir3z4/python-stop-words (values reflect the state when the release was signed)

Built distribution: stop_words-2025.11.4-py3-none-any.whl (59.5 kB, Python 3)

  • Uploaded via: twine/6.1.0 on CPython/3.13.7 (Trusted Publishing)
  • SHA256: b3fc0722e42b722a9350aad59a8ba5850085a5b45a4ba9de390b4f5c4b86df25
  • MD5: 5e13fc1507a7286b246df694ab2ace83
  • BLAKE2b-256: fcf5992d668d21590ed39c6a9d1c62220e9b4b086a165e15fcb7580764cc7ceb
  • Provenance: attested by pypi.yml on Alir3z4/python-stop-words (values reflect the state when the release was signed)
