Underthesea - Vietnamese NLP Toolkit
underthesea is a suite of open source Python modules, data sets and tutorials supporting research and development in Vietnamese Natural Language Processing.
Free software: GNU General Public License v3
Documentation: https://underthesea.readthedocs.io
Live demo: undertheseanlp.com
Facebook Page: https://www.facebook.com/undertheseanlp/
YouTube: Underthesea NLP Channel
Installation
To install underthesea, run:
$ pip install underthesea==1.1.9
✨🍰✨
Satisfaction, guaranteed.
Usage
1. Word Segmentation
Usage
>>> # -*- coding: utf-8 -*-
>>> from underthesea import word_tokenize
>>> sentence = 'Chàng trai 9X Quảng Trị khởi nghiệp từ nấm sò'
>>> word_tokenize(sentence)
['Chàng trai', '9X', 'Quảng Trị', 'khởi nghiệp', 'từ', 'nấm', 'sò']
>>> word_tokenize(sentence, format="text")
'Chàng_trai 9X Quảng_Trị khởi_nghiệp từ nấm sò'
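The two output formats above carry the same information: in the text format, the syllables of a multi-syllable word are joined with underscores and words are separated by spaces. The following is a minimal sketch of converting between the two, working on the sample output shown above rather than calling the library. The helper names `to_text_format` and `from_text_format` are hypothetical and are not part of underthesea.

```python
def to_text_format(tokens):
    """Join the syllables of each word with '_' and separate words with spaces."""
    return " ".join(token.replace(" ", "_") for token in tokens)


def from_text_format(text):
    """Split on whitespace and turn '_' back into spaces inside each word.

    Assumption: the original sentence contains no literal underscores,
    otherwise they would be converted to spaces as well.
    """
    return [word.replace("_", " ") for word in text.split()]


# Sample output copied from the word_tokenize example above.
tokens = ['Chàng trai', '9X', 'Quảng Trị', 'khởi nghiệp', 'từ', 'nấm', 'sò']
text = to_text_format(tokens)
print(text)
print(from_text_format(text))
```

The list form is usually easier to process further, while the text form is convenient for storing one segmented sentence per line.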
2. POS Tagging
Usage
>>> # -*- coding: utf-8 -*-
>>> from underthesea import pos_tag
>>> pos_tag('Chợ thịt chó nổi tiếng ở Sài Gòn bị truy quét')
[('Chợ', 'N'),
('thịt', 'N'),
('chó', 'N'),
('nổi tiếng', 'A'),
('ở', 'E'),
('Sài Gòn', 'Np'),
('bị', 'V'),
('truy quét', 'V')]
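Since the result is a plain list of `(word, tag)` tuples, it can be post-processed with ordinary Python. As an illustrative sketch, the code below groups the sample output above by tag. The helper `words_by_tag` is a hypothetical name, not a library function.

```python
from collections import defaultdict


def words_by_tag(tagged_words):
    """Group words by their POS tag, preserving sentence order within each tag."""
    groups = defaultdict(list)
    for word, tag in tagged_words:
        groups[tag].append(word)
    return dict(groups)


# Sample output copied from the pos_tag example above.
tagged = [('Chợ', 'N'), ('thịt', 'N'), ('chó', 'N'), ('nổi tiếng', 'A'),
          ('ở', 'E'), ('Sài Gòn', 'Np'), ('bị', 'V'), ('truy quét', 'V')]

groups = words_by_tag(tagged)
print(groups['N'])   # words tagged 'N' in the sample
print(groups['Np'])  # words tagged 'Np' in the sample
```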
3. Chunking
Usage
>>> # -*- coding: utf-8 -*-
>>> from underthesea import chunk
>>> text = 'Bác sĩ bây giờ có thể thản nhiên báo tin bệnh nhân bị ung thư?'
>>> chunk(text)
[('Bác sĩ', 'N', 'B-NP'),
('bây giờ', 'P', 'I-NP'),
('có thể', 'R', 'B-VP'),
('thản nhiên', 'V', 'I-VP'),
('báo tin', 'N', 'B-NP'),
('bệnh nhân', 'N', 'I-NP'),
('bị', 'V', 'B-VP'),
('ung thư', 'N', 'I-VP'),
('?', 'CH', 'O')]
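The third field of each tuple appears to follow the common B-/I-/O convention: `B-` opens a chunk, `I-` continues it, and `O` marks a token outside any chunk. Under that assumption, adjacent tokens can be merged into phrases. This is a minimal sketch on the sample output above, and `chunks_to_phrases` is a hypothetical helper rather than part of underthesea.

```python
def chunks_to_phrases(rows):
    """Merge (word, pos, chunk_tag) rows into (label, phrase) pairs using B-/I-/O tags."""
    phrases = []
    label, words = None, []
    for word, _pos, tag in rows:
        # An I- tag with the same label extends the chunk that is currently open.
        if tag.startswith("I-") and label == tag[2:]:
            words.append(word)
            continue
        # Anything else closes the open chunk, if there is one.
        if words:
            phrases.append((label, " ".join(words)))
        if tag == "O":
            label, words = None, []
        else:
            # B- tags (and stray I- tags) start a new chunk.
            label, words = tag[2:], [word]
    if words:
        phrases.append((label, " ".join(words)))
    return phrases


# Sample output copied from the chunk example above.
rows = [('Bác sĩ', 'N', 'B-NP'), ('bây giờ', 'P', 'I-NP'),
        ('có thể', 'R', 'B-VP'), ('thản nhiên', 'V', 'I-VP'),
        ('báo tin', 'N', 'B-NP'), ('bệnh nhân', 'N', 'I-NP'),
        ('bị', 'V', 'B-VP'), ('ung thư', 'N', 'I-VP'), ('?', 'CH', 'O')]

for phrase_label, phrase in chunks_to_phrases(rows):
    print(phrase_label, phrase)
```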
4. Named Entity Recognition
Usage
>>> # -*- coding: utf-8 -*-
>>> from underthesea import ner
>>> text = 'Chưa tiết lộ lịch trình tới Việt Nam của Tổng thống Mỹ Donald Trump'
>>> ner(text)
[('Chưa', 'R', 'O', 'O'),
('tiết lộ', 'V', 'B-VP', 'O'),
('lịch trình', 'V', 'B-VP', 'O'),
('tới', 'E', 'B-PP', 'O'),
('Việt Nam', 'Np', 'B-NP', 'B-LOC'),
('của', 'E', 'B-PP', 'O'),
('Tổng thống', 'N', 'B-NP', 'O'),
('Mỹ', 'Np', 'B-NP', 'B-LOC'),
('Donald', 'Np', 'B-NP', 'B-PER'),
('Trump', 'Np', 'B-NP', 'I-PER')]
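Each row here holds the word, its POS tag, its chunk tag, and an entity tag in the last position. Assuming the entity tags use the same B-/I-/O scheme, multi-token entities such as a person's name can be reassembled as sketched below. `extract_entities` is a hypothetical helper written for this example, not a library function.

```python
def extract_entities(rows):
    """Collect (entity_type, text) pairs from rows whose last field is a B-/I-/O entity tag."""
    entities = []
    kind, words = None, []
    for row in rows:
        word, tag = row[0], row[-1]
        # Continue the open entity only when the I- tag has the same type.
        if tag.startswith("I-") and kind == tag[2:]:
            words.append(word)
            continue
        if words:
            entities.append((kind, " ".join(words)))
        if tag == "O":
            kind, words = None, []
        else:
            kind, words = tag[2:], [word]
    if words:
        entities.append((kind, " ".join(words)))
    return entities


# Sample output copied from the ner example above.
rows = [('Chưa', 'R', 'O', 'O'), ('tiết lộ', 'V', 'B-VP', 'O'),
        ('lịch trình', 'V', 'B-VP', 'O'), ('tới', 'E', 'B-PP', 'O'),
        ('Việt Nam', 'Np', 'B-NP', 'B-LOC'), ('của', 'E', 'B-PP', 'O'),
        ('Tổng thống', 'N', 'B-NP', 'O'), ('Mỹ', 'Np', 'B-NP', 'B-LOC'),
        ('Donald', 'Np', 'B-NP', 'B-PER'), ('Trump', 'Np', 'B-NP', 'I-PER')]

entities = extract_entities(rows)
print(entities)
```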
5. Text Classification
Install the dependencies and download the default model
$ pip install Cython
$ pip install joblib future scipy numpy scikit-learn
$ pip install -U fasttext --no-cache-dir --no-deps --force-reinstall
$ underthesea data
Usage
>>> # -*- coding: utf-8 -*-
>>> from underthesea import classify
>>> classify('HLV đầu tiên ở Premier League bị sa thải sau 4 vòng đấu')
['The thao']
>>> classify('Hội đồng tư vấn kinh doanh Asean vinh danh giải thưởng quốc tế')
['Kinh doanh']
>>> classify('Đánh giá “rạp hát tại gia” Samsung Soundbar Sound+ MS750')
['Vi tinh']
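In the examples above, `classify` returns a list of labels for one text. To summarize a batch of texts, the labels can be tallied with a `Counter`. The sketch below takes the classifier as a parameter so that it runs without the downloaded model. Both `count_labels` and `fake_classify` are hypothetical names, and `fake_classify` is only a keyword-based stand-in, not the real model.

```python
from collections import Counter


def count_labels(texts, classify_fn):
    """Run classify_fn on every text and count how often each label occurs."""
    counts = Counter()
    for text in texts:
        counts.update(classify_fn(text))
    return counts


def fake_classify(text):
    """Hypothetical stand-in for underthesea.classify, used only for this demo."""
    return ['The thao'] if 'Premier League' in text else ['Kinh doanh']


texts = [
    'HLV đầu tiên ở Premier League bị sa thải sau 4 vòng đấu',
    'Hội đồng tư vấn kinh doanh Asean vinh danh giải thưởng quốc tế',
]
counts = count_labels(texts, fake_classify)
print(counts)
```

With the library and its model installed, passing `classify` from underthesea in place of `fake_classify` should work the same way, assuming it keeps returning a list of labels.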
6. Sentiment Analysis
Install dependencies
$ pip install future scipy numpy scikit-learn==0.19.2 joblib
Usage
>>> # -*- coding: utf-8 -*-
>>> from underthesea import sentiment
>>> sentiment('Gọi mấy lần mà lúc nào cũng là các chuyên viên đang bận hết ạ', domain='bank')
('CUSTOMER SUPPORT#NEGATIVE',)
>>> sentiment('bidv cho vay hay ko phu thuoc y thich cua thang tham dinh, ko co quy dinh ro rang', domain='bank')
('LOAN#NEGATIVE',)
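The labels in the output above combine an aspect and a polarity, separated by `#`. Assuming every label follows that `ASPECT#POLARITY` shape, they can be split into pairs for easier filtering. `split_labels` is a hypothetical helper for illustration only.

```python
def split_labels(labels):
    """Split 'ASPECT#POLARITY' labels into (aspect, polarity) pairs.

    Assumption: each label contains at least one '#'; the text after the
    last '#' is treated as the polarity.
    """
    pairs = []
    for label in labels:
        aspect, polarity = label.rsplit("#", 1)
        pairs.append((aspect, polarity.lower()))
    return pairs


# Sample outputs copied from the sentiment examples above.
print(split_labels(('CUSTOMER SUPPORT#NEGATIVE',)))
print(split_labels(('LOAN#NEGATIVE',)))
```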
Upcoming Features
Text to Speech
Automatic Speech Recognition
Machine Translation
Dependency Parsing
Contributing
Do you want to contribute to underthesea development? Great! Please read more details at CONTRIBUTING.rst.
History
1.1.9 (2019-01-01)
Improve speed of word_tokenize function
Support Python 3.6+ only
Use flake8 for style guide enforcement
1.1.8 (2018-06-20)
Fix word_tokenize error when text contains a tab (\t) character
Fix regex_tokenize with url
1.1.7 (2018-04-12)
Rename word_sent function to word_tokenize
Refactor version control in setup.py file and __init__.py file
Update documentation badge url
1.1.6 (2017-12-26)
New feature: aspect sentiment analysis
Integrate with languageflow 1.1.6
Fix bug when tokenizing a string containing ‘=’ (#159)
1.1.5 (2017-10-12)
New feature: named entity recognition
Refactor and update model for word_sent, pos_tag, chunking
1.1.4 (2017-09-12)
New feature: text classification
[bug] Fix Text error
[doc] Add facebook link
1.1.3 (2017-08-30)
Add live demo: https://underthesea.herokuapp.com/
1.1.2 (2017-08-22)
Add dictionary
1.1.1 (2017-07-05)
Support Python 3
Refactor feature_engineering code
1.1.0 (2017-05-30)
Add chunking feature
Add pos_tag feature
Add word_sent feature, fix performance
Add Corpus class
Add Transformer classes
Integrate with the dictionary of Ho Ngoc Duc
Add travis-CI, auto build with PyPI
1.0.0 (2017-03-01)
First release on PyPI.
First release on Readthedocs