Package that simplifies the use of stop words lists with Python NLP projects in 36 languages.

Got Stop Words

Python package that makes it easy to use stop words lists in Python projects. The lists in the package were collected from sources across the Internet and organized into banks. Lists are available for 36 unique languages, with multiple lists available for several languages, including English, Spanish and Hindi. As expected, different lists for the same language contain different, albeit overlapping, sets of words. Lists are divided into two banks:

  1. nltk: These stop words lists are sourced from the Natural Language Toolkit website.
  2. other: This is a collection of stop words lists gathered from various sources.

Bank     # of Lists    # of Unique Languages in Bank
nltk     29            29
other    27            25

As mentioned, there are lists for 36 unique languages across both banks.

nltk Bank Available Languages

29 stop words lists for 29 unique languages are available in the nltk bank.

  • Arabic
  • Azerbaijani
  • Basque
  • Bengali
  • Catalan
  • Chinese
  • Danish
  • Dutch
  • English
  • Finnish
  • French
  • German
  • Greek
  • Hebrew
  • Hinglish
  • Hungarian
  • Indonesian
  • Italian
  • Kazakh
  • Nepali
  • Norwegian
  • Portuguese
  • Romanian
  • Russian
  • Slovene
  • Spanish
  • Swedish
  • Tajik
  • Turkish

other Bank Available Languages

27 stop words lists for 25 unique languages are available in the other bank.

  • Arabic
  • Armenian
  • Bulgarian
  • Chinese
  • Danish
  • Dutch
  • English
  • Finnish
  • French
  • German
  • Greek
  • Hindi 1
  • Hindi 2
  • Indonesian
  • Italian
  • Japanese
  • Latvian
  • Norwegian
  • Persian
  • Polish 1
  • Polish 2
  • Portuguese
  • Romanian
  • Russian
  • Spanish
  • Swedish
  • Turkish
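
The count of 36 unique languages follows from taking the union of the two banks. The sketch below only reproduces that arithmetic from the language names listed above; the set names are illustrative and are not part of the package.

```python
# Language names copied from the two lists above (Hindi and Polish counted once each).
nltk_bank = {
    "arabic", "azerbaijani", "basque", "bengali", "catalan", "chinese",
    "danish", "dutch", "english", "finnish", "french", "german", "greek",
    "hebrew", "hinglish", "hungarian", "indonesian", "italian", "kazakh",
    "nepali", "norwegian", "portuguese", "romanian", "russian", "slovene",
    "spanish", "swedish", "tajik", "turkish",
}
other_bank = {
    "arabic", "armenian", "bulgarian", "chinese", "danish", "dutch",
    "english", "finnish", "french", "german", "greek", "hindi",
    "indonesian", "italian", "japanese", "latvian", "norwegian", "persian",
    "polish", "portuguese", "romanian", "russian", "spanish", "swedish",
    "turkish",
}

all_languages = nltk_bank | other_bank   # languages in either bank
shared = nltk_bank & other_bank          # languages present in both banks
only_other = other_bank - nltk_bank      # languages found only in the other bank

print(len(nltk_bank), len(other_bank), len(shared), len(all_languages))
# prints: 29 25 18 36
```

Seven languages (Armenian, Bulgarian, Hindi, Japanese, Latvian, Persian and Polish) appear only in the other bank, which is how 29 + 7 gives 36.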

Installation

pip install gotstopwords

Usage

Importing the Package

from gotstopwords import gotstopwords

load Method

The load method loads a stop words list. It takes the following parameters:

  • bank: The name of the list's bank, nltk or other.
  • lang: The name of the language as spelled in English, e.g. norwegian, or the language's two-letter ISO 639-1 code. See below for a table of ISO 639-1 codes.
  • list_num: The number of the desired list for those languages with more than 1 list in a bank, such as Hindi and Polish in the other bank. The list_num parameter can be omitted for those languages with only a single list.

Examples

  • Loading the stop words list for Finnish, ISO 639-1 code fi, from the nltk bank.
_finnish = gotstopwords.load("nltk", "fi")
# or
_finnish = gotstopwords.load("nltk", "finnish")

  • Loading the stop words list for Spanish, ISO 639-1 code es, from the nltk bank.
_spanish = gotstopwords.load("nltk", "es")
# or
_spanish = gotstopwords.load("nltk", "spanish")

  • Loading the stop words list for English, ISO 639-1 code en, from the other bank.
_english = gotstopwords.load("other", "en")
# or
_english = gotstopwords.load("other", "english")

  • Loading the first stop words list for Hindi, ISO 639-1 code hi, from the other bank.
_hindi1 = gotstopwords.load("other", "hi", "1")
# or
_hindi1 = gotstopwords.load("other", "hindi", "1")
# or
_hindi1 = gotstopwords.load("other", "hi", 1)
# or
_hindi1 = gotstopwords.load("other", "hindi", 1)

Stop words lists are returned as a Python list. If there is no stop words list associated with the values that are input, an empty list will be returned.
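
Because a mistyped bank or language yields an empty list rather than an error, callers may want to fail loudly instead. The following is a minimal sketch of such a guard. The helper load_or_raise and the stand-in fake_load are hypothetical and not part of the package; in real use you would pass gotstopwords.load as the loader.

```python
def load_or_raise(loader, bank, lang, list_num=None):
    """Call loader(bank, lang[, list_num]) and raise if the result is empty.

    loader is any callable with the same calling convention as the load
    method described above (hypothetical wrapper, not part of the package).
    """
    args = (bank, lang) if list_num is None else (bank, lang, list_num)
    words = loader(*args)
    if not words:
        raise ValueError(
            f"no stop words list for bank={bank!r}, lang={lang!r}, list_num={list_num!r}"
        )
    return words


def fake_load(bank, lang, list_num=None):
    """Stand-in loader with made-up sample data, used only to exercise the guard."""
    known = {("nltk", "fi"): ["ja", "on", "ei"]}
    return known.get((bank.lower(), lang.lower()), [])


finnish = load_or_raise(fake_load, "nltk", "fi")  # returns the sample list
# load_or_raise(fake_load, "nltk", "xx") would raise ValueError
```

With the package installed, the equivalent call would be load_or_raise(gotstopwords.load, "nltk", "fi").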

Note: Bank and language names can also be entered with capital letters if desired.
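
Since load returns a plain Python list, a typical next step is to filter tokens against it. The sketch below shows one way to do that. The function remove_stop_words and the sample list are illustrative only, and the lowercase comparison assumes the stop words are stored in lowercase, which this description does not state.

```python
def remove_stop_words(tokens, stop_words):
    """Return tokens that are not in stop_words, comparing case-insensitively."""
    # A set gives constant-time membership checks; a list would be scanned each time.
    stop_set = {word.lower() for word in stop_words}
    return [token for token in tokens if token.lower() not in stop_set]


# Stand-in for a list returned by gotstopwords.load (made-up sample, not package data).
sample_stop_words = ["the", "is", "a", "of"]

tokens = "The cat is a friend of the dog".split()
filtered = remove_stop_words(tokens, sample_stop_words)
print(filtered)
# prints: ['cat', 'friend', 'dog']
```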

ISO 639-1 Language Codes

Note: There is no ISO 639-1 code for Hinglish. However, the package permits specification of the Hinglish stop words list using the 2-character code hn.

ISO 639-1 Code Language
ar arabic
az azerbaijani
bg bulgarian
bn bengali
ca catalan
da danish
de german
el greek
en english
es spanish
eu basque
fa persian
fi finnish
fr french
he hebrew
hi hindi
hu hungarian
hy armenian
id indonesian
it italian
ja japanese
kk kazakh
lv latvian
ne nepali
nl dutch
no norwegian
pl polish
pt portuguese
ro romanian
ru russian
sl slovene
sv swedish
tg tajik
tr turkish
zh chinese
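
The table above, together with the hn code for Hinglish, can be expressed as a lookup that accepts either a code or a language name in any letter case. This is only a sketch of how such a lookup might work; CODE_TO_LANGUAGE and resolve_language are hypothetical names and do not describe the package's internals.

```python
# Codes and names taken from the table above, plus the package-specific "hn".
CODE_TO_LANGUAGE = {
    "ar": "arabic", "az": "azerbaijani", "bg": "bulgarian", "bn": "bengali",
    "ca": "catalan", "da": "danish", "de": "german", "el": "greek",
    "en": "english", "es": "spanish", "eu": "basque", "fa": "persian",
    "fi": "finnish", "fr": "french", "he": "hebrew", "hi": "hindi",
    "hn": "hinglish", "hu": "hungarian", "hy": "armenian", "id": "indonesian",
    "it": "italian", "ja": "japanese", "kk": "kazakh", "lv": "latvian",
    "ne": "nepali", "nl": "dutch", "no": "norwegian", "pl": "polish",
    "pt": "portuguese", "ro": "romanian", "ru": "russian", "sl": "slovene",
    "sv": "swedish", "tg": "tajik", "tr": "turkish", "zh": "chinese",
}
LANGUAGE_NAMES = set(CODE_TO_LANGUAGE.values())


def resolve_language(value):
    """Map a code or language name (any letter case) to the lowercase name.

    Returns None when the value matches neither a code nor a name.
    """
    key = value.strip().lower()
    if key in CODE_TO_LANGUAGE:
        return CODE_TO_LANGUAGE[key]
    if key in LANGUAGE_NAMES:
        return key
    return None


print(resolve_language("FI"), resolve_language("Norwegian"), resolve_language("hn"))
# prints: finnish norwegian hinglish
```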

Sources

NLTK word lists are obtained from: http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/snowball/stopwords/

License

This project is licensed under the terms of the MIT license.
