Package that simplifies the use of stop words lists in Python NLP projects, with lists for 36 languages.
Got Stop Words
Got Stop Words is a Python package that makes it easy to use stop words lists in Python projects. The lists in the package are an organized collection of lists gathered from across the Internet. Lists are available for 36 unique languages, and several languages, including English, Spanish and Hindi, have more than one list. As expected, different lists for the same language contain different, albeit overlapping, sets of words. Lists are divided into two banks:
- `nltk`: These stop words lists are sourced from the Natural Language Toolkit website.
- `other`: This is a collection of stop words lists gathered from various sources.
Bank | # of Lists | # of Unique Languages in Bank |
---|---|---|
nltk | 29 | 29 |
other | 27 | 25 |
As mentioned, there are lists for 36 unique languages across both banks.
`nltk` Bank Available Languages
29 stop words lists for 29 unique languages are available in the `nltk` bank.
- Arabic
- Azerbaijani
- Basque
- Bengali
- Catalan
- Chinese
- Danish
- Dutch
- English
- Finnish
- French
- German
- Greek
- Hebrew
- Hinglish
- Hungarian
- Indonesian
- Italian
- Kazakh
- Nepali
- Norwegian
- Portuguese
- Romanian
- Russian
- Slovene
- Spanish
- Swedish
- Tajik
- Turkish
`other` Bank Available Languages
27 stop words lists for 25 unique languages are available in the `other` bank.
- Arabic
- Armenian
- Bulgarian
- Chinese
- Danish
- Dutch
- English
- Finnish
- French
- German
- Greek
- Hindi 1
- Hindi 2
- Indonesian
- Italian
- Japanese
- Latvian
- Norwegian
- Persian
- Polish 1
- Polish 2
- Portuguese
- Romanian
- Russian
- Spanish
- Swedish
- Turkish
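The 36-language total can be checked directly from the two lists above. The following is a small sketch that does the arithmetic with Python sets. The variable names are illustrative and not part of the package, and the Hindi and Polish variants are each counted once because they are alternative lists for a single language.

```python
# Languages in the nltk bank, as listed above (29 lists, 29 languages).
nltk_languages = {
    "arabic", "azerbaijani", "basque", "bengali", "catalan", "chinese",
    "danish", "dutch", "english", "finnish", "french", "german", "greek",
    "hebrew", "hinglish", "hungarian", "indonesian", "italian", "kazakh",
    "nepali", "norwegian", "portuguese", "romanian", "russian", "slovene",
    "spanish", "swedish", "tajik", "turkish",
}

# Languages in the other bank, as listed above (27 lists, 25 languages:
# Hindi and Polish each have two lists).
other_languages = {
    "arabic", "armenian", "bulgarian", "chinese", "danish", "dutch",
    "english", "finnish", "french", "german", "greek", "hindi",
    "indonesian", "italian", "japanese", "latvian", "norwegian", "persian",
    "polish", "portuguese", "romanian", "russian", "spanish", "swedish",
    "turkish",
}

all_languages = nltk_languages | other_languages   # union of both banks
only_in_other = other_languages - nltk_languages   # not covered by nltk

print(len(nltk_languages), len(other_languages), len(all_languages))
# 29 25 36
print(sorted(only_in_other))
# ['armenian', 'bulgarian', 'hindi', 'japanese', 'latvian', 'persian', 'polish']
```

The 18 languages that appear in both banks are the ones for which at least two lists can be compared.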
Installation
pip install gotstopwords
Usage
Importing the Package
from gotstopwords import gotstopwords
`load` Method
The `load` method is used to load a stop words list with the following parameters:
- `bank`: The name of the list's bank, `nltk` or `other`.
- `lang`: The name of the language as spelled in English, e.g. `norwegian`, or the language's two-letter ISO 639-1 code. See below for a table of ISO 639-1 codes.
- `list_num`: The number of the desired list for those languages with more than one list in a bank, such as Hindi and Polish in the `other` bank. The `list_num` parameter can be omitted for those languages with only a single list.
Examples
- Loading the stop words list for Finnish, ISO 639-1 code `fi`, from the `nltk` bank.
_finnish = gotstopwords.load("nltk", "fi")
# or
_finnish = gotstopwords.load("nltk", "finnish")
- Loading the stop words list for Spanish, ISO 639-1 code `es`, from the `nltk` bank.
_spanish = gotstopwords.load("nltk", "es")
# or
_spanish = gotstopwords.load("nltk", "spanish")
- Loading the stop words list for English, ISO 639-1 code `en`, from the `other` bank.
_english = gotstopwords.load("other", "en")
# or
_english = gotstopwords.load("other", "english")
- Loading the first stop words list for Hindi, ISO 639-1 code `hi`, from the `other` bank.
_hindi1 = gotstopwords.load("other", "hi", "1")
# or
_hindi1 = gotstopwords.load("other", "hindi", "1")
# or
_hindi1 = gotstopwords.load("other", "hi", 1)
# or
_hindi1 = gotstopwords.load("other", "hindi", 1)
Stop words lists are returned as a Python list. If there is no stop words list associated with the values that are input, an empty list will be returned.
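Because an unmatched bank, language, or list number yields an empty list rather than an error, a typo such as `"fn"` for `"fi"` can pass unnoticed. One possible guard is sketched below. `load_or_raise` and `fake_load` are hypothetical names introduced for illustration, not part of the package, and `fake_load` with its placeholder words only stands in for `gotstopwords.load` so that the sketch runs without the package installed.

```python
def load_or_raise(loader, bank, lang, list_num=None):
    """Call a load-style function and fail loudly on an empty result."""
    # Pass list_num only when the caller supplied one, mirroring the
    # two- and three-argument call forms shown in the examples above.
    if list_num is None:
        words = loader(bank, lang)
    else:
        words = loader(bank, lang, list_num)
    if not words:
        raise ValueError(
            f"no stop words list for bank={bank!r}, lang={lang!r}, "
            f"list_num={list_num!r}"
        )
    return words


def fake_load(bank, lang, list_num=None):
    """Stand-in loader with placeholder data (not real stop words)."""
    lists = {("nltk", "fi"): ["word1", "word2"]}
    return lists.get((bank.lower(), lang.lower()), [])


print(load_or_raise(fake_load, "nltk", "fi"))  # ['word1', 'word2']

try:
    load_or_raise(fake_load, "nltk", "fn")  # misspelled language code
except ValueError as err:
    print("caught:", err)
```

With the package installed, `gotstopwords.load` could be passed as the `loader` argument in place of `fake_load`.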
Note: Bank and language names can also be entered with capital letters if desired.
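Once a list has been loaded, a common next step is to remove its words from tokenized text. The sketch below shows one way to do that. `remove_stop_words` is a hypothetical helper, and the short hand-written `stop_words` list stands in for the result of a `gotstopwords.load(...)` call. Lowercasing both sides is a design choice made here, since the casing of the packaged lists is not stated.

```python
def remove_stop_words(tokens, stop_words):
    """Return the tokens that do not appear in stop_words."""
    # A set gives constant-time membership checks; lowercasing both
    # sides makes the comparison case-insensitive.
    stop_set = {word.lower() for word in stop_words}
    return [token for token in tokens if token.lower() not in stop_set]


# Stand-in for a loaded list (illustrative, not the packaged contents).
stop_words = ["the", "is", "on"]

tokens = "The cat is on the mat".split()
print(remove_stop_words(tokens, stop_words))  # ['cat', 'mat']
```

Because `load` returns a plain Python list, converting it to a set once and reusing that set avoids a linear scan for every token in larger texts.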
ISO 639-1 Language Codes
Note: There is no ISO 639-1 code for Hinglish. However, the package permits specification of the Hinglish stop words list using the 2-character code `hn`.
ISO 639-1 Code | Language |
---|---|
ar | arabic |
az | azerbaijani |
bg | bulgarian |
bn | bengali |
ca | catalan |
da | danish |
de | german |
el | greek |
en | english |
es | spanish |
eu | basque |
fa | persian |
fi | finnish |
fr | french |
he | hebrew |
hi | hindi |
hu | hungarian |
hy | armenian |
id | indonesian |
it | italian |
ja | japanese |
kk | kazakh |
lv | latvian |
ne | nepali |
nl | dutch |
no | norwegian |
pl | polish |
pt | portuguese |
ro | romanian |
ru | russian |
sl | slovene |
sv | swedish |
tg | tajik |
tr | turkish |
zh | chinese |
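Since `load` accepts either spelling, callers that store or log the language may want a single canonical form. The sketch below shows one way to map both forms to the English name using a few rows of the table above plus the `hn` code for Hinglish. `to_language_name` and both dictionaries are hypothetical and say nothing about how the package resolves codes internally.

```python
# A few rows from the table above, plus the package-specific "hn" code.
ISO_TO_LANGUAGE = {
    "en": "english",
    "es": "spanish",
    "fi": "finnish",
    "hi": "hindi",
    "hn": "hinglish",
}

# Reverse lookup: language name -> code.
LANGUAGE_TO_ISO = {name: code for code, name in ISO_TO_LANGUAGE.items()}


def to_language_name(lang):
    """Map a code or a language name to the lowercase language name."""
    key = lang.strip().lower()
    if key in ISO_TO_LANGUAGE:
        return ISO_TO_LANGUAGE[key]
    if key in LANGUAGE_TO_ISO:
        return key
    raise ValueError(f"unknown language or code: {lang!r}")


print(to_language_name("FI"))       # finnish
print(to_language_name("Spanish"))  # spanish
```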
Sources
NLTK word lists are obtained from: http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/snowball/stopwords/
License
This project is licensed under the terms of the MIT license.