Python library for managing common stop words in 39 languages.
Usage
Simple
Rather than a long explanation, here is a direct introduction:
>>> from mots_vides import stop_words
>>> english_stop_words = stop_words('en')
>>> text = """
... Even though using "lorem ipsum" often arouses curiosity
... due to its resemblance to classical Latin,
... it is not intended to have meaning.
... """
>>> print(english_stop_words.rebase(text))
XXXX XXXXXX XXXXX "lorem ipsum" XXXXX arouses curiosity
XXX XX XXX resemblance XX classical Latin,
XX XX XXX intended XX XXXX meaning.
>>> print(english_stop_words.rebase(text, '').split())
['"lorem', 'ipsum"', 'arouses', 'curiosity', 'resemblance', 'classical', 'Latin,', 'intended', 'meaning.']
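The masking behaviour shown in the output above can be approximated in a few lines of standard-library Python. This is only a sketch of the idea, not the library's implementation: the function name `mask_stop_words`, its `char` parameter, and the tiny word list are assumptions made for illustration.

```python
import re


def mask_stop_words(text, stop_words, char="X"):
    """Replace each whole-word stop word with `char` repeated to its length."""
    if not stop_words:
        return text
    # Escape every word and match only on word boundaries, ignoring case.
    pattern = re.compile(
        r"\b(%s)\b" % "|".join(re.escape(word) for word in stop_words),
        re.IGNORECASE | re.UNICODE,
    )
    return pattern.sub(lambda match: char * len(match.group(1)), text)


sample = "it is not intended to have meaning."
tiny_list = {"it", "is", "not", "to", "have"}  # hypothetical, not the library's list

masked = mask_stop_words(sample, tiny_list)
kept = mask_stop_words(sample, tiny_list, "").split()
print(masked)
print(kept)
```

Passing an empty string as the replacement removes the stop words entirely, so splitting the result leaves only the remaining words.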
Advanced
Mots vides also provides two classes for managing the stop words in your language.
StopWord is a container for a collection of stop words. It is language-agnostic by default, and a collection can be built up easily:
>>> from mots_vides import StopWord
>>> french_stop_words = StopWord('french', ['le', 'la', 'les'])
>>> french_stop_words += StopWord('french', ['un', 'une', 'des'])
>>> french_stop_words += ['or', 'ni', 'car']
>>> french_stop_words += 'assez'
>>> french_stop_words += u'aussitôt'
>>> print(sorted(french_stop_words))
['assez', u'aussitôt', 'car', 'des', 'la', 'le', 'les', 'ni', 'or', 'un', 'une']
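To illustrate how a container might accept another container, a list, or a single string through `+=`, here is a minimal Python 3 sketch. The class `WordBag` is hypothetical and is not the library's StopWord class; it only shows one plausible way to get the behaviour seen in the session above.

```python
class WordBag(object):
    """Hypothetical stand-in for a stop word container."""

    def __init__(self, language, words=()):
        self.language = language
        self.words = set(words)

    def __iadd__(self, other):
        if isinstance(other, str):
            # A bare string is one word, not an iterable of characters.
            self.words.add(other)
        else:
            # Another WordBag or any iterable of words.
            self.words.update(other)
        return self

    def __iter__(self):
        return iter(self.words)

    def __len__(self):
        return len(self.words)


bag = WordBag("french", ["le", "la", "les"])
bag += WordBag("french", ["un", "une", "des"])
bag += ["or", "ni", "car"]
bag += "assez"
print(sorted(bag))
```

The string check comes first because a string is itself iterable; without it, `'assez'` would be added one character at a time.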
StopWordFactory is a factory that initializes StopWord objects by language, each with the appropriate collection of stop words:
>>> from mots_vides import StopWordFactory
>>> factory = StopWordFactory()
>>> french_stop_words = factory.get_stop_words('french')
>>> print(len(french_stop_words))
577
You can also use an international language code to query a collection:
>>> french_stop_words = factory.get_stop_words('fr')
>>> print(len(french_stop_words))
577
If the requested language does not exist, a StopWordError is raised, unless the fail_safe parameter is set to True:
>>> klingon_stop_words = factory.get_stop_words('klingon')
StopWordError: Stop words are not available in "klingon".
>>> klingon_stop_words = factory.get_stop_words('klingon', fail_safe=True)
>>> print(len(klingon_stop_words))
0
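The lookup rules described above (language name or code, an error for unknown languages, an empty result when fail_safe is set) can be sketched without the library. Everything in this example is an assumption for illustration: `UnknownLanguageError` stands in for StopWordError, and `CODES`, `WORDS`, and `get_words` are hypothetical names with toy data.

```python
class UnknownLanguageError(Exception):
    """Hypothetical error type standing in for the library's StopWordError."""


# Hypothetical data; the real collections are much larger.
CODES = {"fr": "french", "en": "english"}
WORDS = {"french": ["le", "la", "les"], "english": ["the", "a", "an"]}


def get_words(language, fail_safe=False):
    # Accept either a short code or a full language name.
    name = CODES.get(language, language)
    if name in WORDS:
        return list(WORDS[name])
    if fail_safe:
        # Unknown language: return an empty collection instead of raising.
        return []
    raise UnknownLanguageError(
        'Stop words are not available in "%s".' % language
    )


print(get_words("fr") == get_words("french"))
print(len(get_words("klingon", fail_safe=True)))
```

A fail-safe lookup of this kind is convenient in pipelines that process text in arbitrary languages, where a missing collection should mean "filter nothing" instead of stopping the run.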
Supported languages
Arabic
Armenian
Basque
Bengali
Bulgarian
Catalan
Chinese
Czech
Danish
Dutch
English
Finnish
French
Galician
German
Greek
Hindi
Hungarian
Indonesian
Irish
Italian
Japanese
Korean
Latvian
Lithuanian
Marathi
Norwegian
Persian
Polish
Portuguese
Romanian
Russian
Slovak
Spanish
Swedish
Thai
Turkish
Ukrainian
Urdu
Compatibility
Tested with Python 2.6, 2.7, 3.2, 3.3, 3.4.
Notes
Mots vides means stop words in French.
Inspired by https://github.com/Alir3z4/python-stop-words
Changelog
2015.5.11
Fix cache system for Python 3
2015.2.6
Fix potential issue in factory.get_available_languages
2015.2.5
Fix packaging
Add a rebaser command script
2015.2.4
Initial release
2015.1.21.dev0
Development release
Hashes for mots_vides-2015.5.11-py2.py3-none-any.whl
Algorithm | Hash digest |
---|---|
SHA256 | 5c00af05234f4021396c6d888c8e34142cfe880fe732ff063f6cfad2d6342dc8 |
MD5 | 609dbfa50fbd094feefcfd2964faaa87 |
BLAKE2b-256 | 9534f5a4ec9cfad0e484b087de46e381efc991d5fde07412de51b85f59853ed7 |