Vietnamese NLP Toolkit
Underthesea - Vietnamese NLP Toolkit
underthesea is a suite of open-source Python modules, datasets, and tutorials supporting research and development in Vietnamese Natural Language Processing.
Free software: GNU General Public License v3
Documentation: https://underthesea.readthedocs.io
Live demo: https://underthesea.herokuapp.com/
Facebook Page: https://www.facebook.com/undertheseanlp/
Installation
To install underthesea, simply:
$ pip install underthesea==1.1.8a0
✨🍰✨
Satisfaction, guaranteed.
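To confirm which release ended up in the current environment, the installed version can be read from the package metadata. This is a minimal sketch using only the standard library; it assumes Python 3.8 or newer, and the helper name `installed_version` is made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the pinned version (for example 1.1.8a0) when the install succeeded,
# or None when the package is not available in this environment.
print(installed_version("underthesea"))
```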
Usage
1. Word Segmentation
Usage
>>> # -*- coding: utf-8 -*-
>>> from underthesea import word_tokenize
>>> sentence = 'Chàng trai 9X Quảng Trị khởi nghiệp từ nấm sò'
>>> word_tokenize(sentence)
['Chàng trai', '9X', 'Quảng Trị', 'khởi nghiệp', 'từ', 'nấm', 'sò']
>>> word_tokenize(sentence, format="text")
'Chàng_trai 9X Quảng_Trị khởi_nghiệp từ nấm sò'
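The two outputs above appear to carry the same segmentation: in the text format, spaces inside a multi-syllable word are replaced by underscores. The sketch below converts between the two shapes in plain Python. The helper names are hypothetical and not part of underthesea, and the reverse conversion assumes that no token contains a literal underscore.

```python
def to_text_format(tokens):
    """Join a token list into one string, marking multi-syllable words with '_'."""
    return " ".join(token.replace(" ", "_") for token in tokens)


def from_text_format(text):
    """Split on spaces, then restore the spaces inside each word.

    Assumes that no original token contained a literal underscore.
    """
    return [word.replace("_", " ") for word in text.split(" ")]


# Token list copied from the word_tokenize example above.
tokens = ['Chàng trai', '9X', 'Quảng Trị', 'khởi nghiệp', 'từ', 'nấm', 'sò']

as_text = to_text_format(tokens)
print(as_text)  # Chàng_trai 9X Quảng_Trị khởi_nghiệp từ nấm sò
print(from_text_format(as_text) == tokens)  # True
```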
2. POS Tagging
Usage
>>> # -*- coding: utf-8 -*-
>>> from underthesea import pos_tag
>>> pos_tag('Chợ thịt chó nổi tiếng ở Sài Gòn bị truy quét')
[('Chợ', 'N'),
 ('thịt', 'N'),
 ('chó', 'N'),
 ('nổi tiếng', 'A'),
 ('ở', 'E'),
 ('Sài Gòn', 'Np'),
 ('bị', 'V'),
 ('truy quét', 'V')]
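Because the result is a plain list of (word, tag) pairs, it can be post-processed with ordinary Python. The sketch below groups the words from the example output by tag. The output is hardcoded so the snippet runs without underthesea, and `words_by_tag` is a hypothetical helper, not a library function.

```python
from collections import defaultdict

# Output copied from the pos_tag example above.
tagged = [('Chợ', 'N'), ('thịt', 'N'), ('chó', 'N'), ('nổi tiếng', 'A'),
          ('ở', 'E'), ('Sài Gòn', 'Np'), ('bị', 'V'), ('truy quét', 'V')]


def words_by_tag(tagged_words):
    """Group words by their POS tag, keeping the original word order."""
    groups = defaultdict(list)
    for word, tag in tagged_words:
        groups[tag].append(word)
    return dict(groups)


groups = words_by_tag(tagged)
print(groups['N'])   # words tagged 'N'
print(groups['Np'])  # words tagged 'Np'
```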
3. Chunking
Usage
>>> # -*- coding: utf-8 -*-
>>> from underthesea import chunk
>>> text = 'Bác sĩ bây giờ có thể thản nhiên báo tin bệnh nhân bị ung thư?'
>>> chunk(text)
[('Bác sĩ', 'N', 'B-NP'),
 ('bây giờ', 'P', 'I-NP'),
 ('có thể', 'R', 'B-VP'),
 ('thản nhiên', 'V', 'I-VP'),
 ('báo tin', 'N', 'B-NP'),
 ('bệnh nhân', 'N', 'I-NP'),
 ('bị', 'V', 'B-VP'),
 ('ung thư', 'N', 'I-VP'),
 ('?', 'CH', 'O')]
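The third element of each tuple looks like a standard BIO tag: `B-` opens a phrase, `I-` continues it, and `O` marks a token outside any phrase. Under that assumption, the rows can be merged into whole phrases as sketched below. `group_chunks` is a hypothetical helper, and the rows are copied from the example output so the snippet runs without underthesea.

```python
def group_chunks(rows):
    """Merge (word, pos, bio_tag) rows into (phrase_type, phrase_text) pairs."""
    phrases = []
    current_type, current_words = None, []
    for word, _pos, bio in rows:
        prefix, _, label = bio.partition("-")
        if prefix == "I" and label == current_type:
            # Continuation of the phrase that is currently open.
            current_words.append(word)
            continue
        # Anything else closes the open phrase, if there is one.
        if current_words:
            phrases.append((current_type, " ".join(current_words)))
        current_type, current_words = None, []
        if prefix in ("B", "I"):
            # 'B-' starts a phrase; a stray 'I-' is treated as a start too.
            current_type, current_words = label, [word]
    if current_words:
        phrases.append((current_type, " ".join(current_words)))
    return phrases


# Output copied from the chunk example above.
rows = [('Bác sĩ', 'N', 'B-NP'), ('bây giờ', 'P', 'I-NP'),
        ('có thể', 'R', 'B-VP'), ('thản nhiên', 'V', 'I-VP'),
        ('báo tin', 'N', 'B-NP'), ('bệnh nhân', 'N', 'I-NP'),
        ('bị', 'V', 'B-VP'), ('ung thư', 'N', 'I-VP'), ('?', 'CH', 'O')]

phrases = group_chunks(rows)
for phrase_type, text in phrases:
    print(phrase_type, text)
```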
4. Named Entity Recognition
Usage
>>> # -*- coding: utf-8 -*-
>>> from underthesea import ner
>>> text = 'Chưa tiết lộ lịch trình tới Việt Nam của Tổng thống Mỹ Donald Trump'
>>> ner(text)
[('Chưa', 'R', 'O', 'O'),
 ('tiết lộ', 'V', 'B-VP', 'O'),
 ('lịch trình', 'V', 'B-VP', 'O'),
 ('tới', 'E', 'B-PP', 'O'),
 ('Việt Nam', 'Np', 'B-NP', 'B-LOC'),
 ('của', 'E', 'B-PP', 'O'),
 ('Tổng thống', 'N', 'B-NP', 'O'),
 ('Mỹ', 'Np', 'B-NP', 'B-LOC'),
 ('Donald', 'Np', 'B-NP', 'B-PER'),
 ('Trump', 'Np', 'B-NP', 'I-PER')]
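Each row holds the word, its POS tag, its chunk tag, and, in the last position, what appears to be a BIO-style entity tag. Assuming that layout, the named entities can be collected per entity type as in the sketch below. `entities_by_type` is a hypothetical helper, and the rows are copied from the example output so the snippet runs without underthesea.

```python
from collections import defaultdict


def entities_by_type(rows):
    """Collect entity strings from (word, pos, chunk, ner) rows, keyed by type."""
    found = defaultdict(list)
    open_type = None  # type of the entity currently being extended, if any
    for word, _pos, _chunk, ner_tag in rows:
        if ner_tag.startswith("I-") and ner_tag[2:] == open_type:
            # Extend the most recent entity of this type.
            found[open_type][-1] += " " + word
        elif ner_tag == "O":
            open_type = None
        else:
            # 'B-X' (or a stray 'I-X') starts a new entity of type X.
            open_type = ner_tag[2:]
            found[open_type].append(word)
    return dict(found)


# Output copied from the ner example above.
rows = [('Chưa', 'R', 'O', 'O'), ('tiết lộ', 'V', 'B-VP', 'O'),
        ('lịch trình', 'V', 'B-VP', 'O'), ('tới', 'E', 'B-PP', 'O'),
        ('Việt Nam', 'Np', 'B-NP', 'B-LOC'), ('của', 'E', 'B-PP', 'O'),
        ('Tổng thống', 'N', 'B-NP', 'O'), ('Mỹ', 'Np', 'B-NP', 'B-LOC'),
        ('Donald', 'Np', 'B-NP', 'B-PER'), ('Trump', 'Np', 'B-NP', 'I-PER')]

entities = entities_by_type(rows)
print(entities)
```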
5. Text Classification
Install dependencies and download default model
$ pip install Cython
$ pip install joblib future scipy numpy scikit-learn
$ pip install -U fasttext --no-cache-dir --no-deps --force-reinstall
$ underthesea data
Usage
>>> # -*- coding: utf-8 -*-
>>> from underthesea import classify
>>> classify('HLV đầu tiên ở Premier League bị sa thải sau 4 vòng đấu')
['The thao']
>>> classify('Hội đồng tư vấn kinh doanh Asean vinh danh giải thưởng quốc tế')
['Kinh doanh']
>>> classify('Đánh giá “rạp hát tại gia” Samsung Soundbar Sound+ MS750')
['Vi tinh']
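In the examples above, classify returns a list of labels for a single text. To label several texts at once, one option is a small wrapper that takes the classifier as an argument. The sketch below does that; `classify_many` and `fake_classify` are hypothetical names, and the stand-in classifier exists only so the snippet runs without the downloaded model.

```python
def classify_many(texts, classify_fn):
    """Map each text to its first predicted label, or None if no label comes back."""
    results = {}
    for text in texts:
        labels = classify_fn(text) or []
        results[text] = labels[0] if labels else None
    return results


def fake_classify(text):
    """Stand-in for underthesea's classify; NOT the real model."""
    return ['The thao'] if 'Premier League' in text else []


headlines = [
    'HLV đầu tiên ở Premier League bị sa thải sau 4 vòng đấu',
    'Một câu không có nhãn',
]
predictions = classify_many(headlines, fake_classify)
print(predictions)
```

With underthesea and its default model installed, passing the real function (`from underthesea import classify`) in place of `fake_classify` should work the same way, provided it keeps returning a list of labels.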
6. Sentiment Analysis
Install dependencies
$ pip install future scipy numpy scikit-learn==0.19.0 joblib
Usage
>>> # -*- coding: utf-8 -*-
>>> from underthesea import sentiment
>>> sentiment('Gọi mấy lần mà lúc nào cũng là các chuyên viên đang bận hết ạ', domain='bank')
('CUSTOMER SUPPORT#NEGATIVE',)
>>> sentiment('bidv cho vay hay ko phu thuoc y thich cua thang tham dinh, ko co quy dinh ro rang', domain='bank')
('LOAN#NEGATIVE',)
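The labels in these examples follow an ASPECT#POLARITY pattern. Assuming that pattern holds for every label, each one can be split into an (aspect, polarity) pair as sketched below. `split_labels` is a hypothetical helper, and the input tuples are copied from the example output so the snippet runs without underthesea.

```python
def split_labels(labels):
    """Split 'ASPECT#POLARITY' strings into (aspect, polarity) pairs.

    A label without '#' is kept whole, with polarity set to None.
    """
    pairs = []
    for label in labels:
        aspect, separator, polarity = label.rpartition("#")
        if not separator:
            # rpartition puts the whole string last when '#' is missing.
            aspect, polarity = label, None
        pairs.append((aspect, polarity))
    return pairs


# Outputs copied from the sentiment examples above.
print(split_labels(('CUSTOMER SUPPORT#NEGATIVE',)))
print(split_labels(('LOAN#NEGATIVE',)))
```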
Upcoming Features
Text to Speech
Automatic Speech Recognition
Machine Translation
Dependency Parsing
Contributing
Do you want to contribute to underthesea development? Great! Please read more details at CONTRIBUTING.rst.
History
1.1.8-alpha (2018-05-06)
Fix word_tokenize error when text contains tab (\t) character
Fix regex_tokenize with url
1.1.7 (2018-04-12)
Rename word_sent function to word_tokenize
Refactor version control in setup.py file and __init__.py file
Update documentation badge url
1.1.6 (2017-12-26)
New feature: aspect sentiment analysis
Integrate with languageflow 1.1.6
Fix bug when tokenizing a string containing '=' (#159)
1.1.5 (2017-10-12)
New feature: named entity recognition
Refactor and update model for word_sent, pos_tag, chunking
1.1.4 (2017-09-12)
New feature: text classification
[bug] Fix Text error
[doc] Add facebook link
1.1.3 (2017-08-30)
Add live demo: https://underthesea.herokuapp.com/
1.1.2 (2017-08-22)
Add dictionary
1.1.1 (2017-07-05)
Support Python 3
Refactor feature_engineering code
1.1.0 (2017-05-30)
Add chunking feature
Add pos_tag feature
Add word_sent feature, fix performance
Add Corpus class
Add Transformer classes
Integrated with dictionary of Ho Ngoc Duc
Add travis-CI, auto build with PyPI
1.0.0 (2017-03-01)
First release on PyPI.
First release on Readthedocs
Hashes for underthesea-1.1.8a0-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 7604264c1a45f9c70da25b15d3536969053baff4a9893bdc2068bb66f47758b8
MD5 | fcb40e3ed4f6a36e6c9789b3f33aacee
BLAKE2b-256 | 7e81a148ea9d68b2a480513714b485c21e00d876486dcfc1a2dc96733ecff8c3