Underthesea - Vietnamese NLP Toolkit

underthesea is a suite of open source Python modules, data sets and tutorials supporting research and development in Vietnamese Natural Language Processing.

Installation

To install underthesea, simply:

$ pip install underthesea==1.1.7a0
✨🍰✨

Satisfaction, guaranteed.

Usage

1. Word Segmentation

F1: 94% · custom models · API

Usage

>>> # -*- coding: utf-8 -*-
>>> from underthesea import word_tokenize
>>> sentence = u"Chúng ta thường nói đến Rau sạch, Rau an toàn để phân biệt với các rau bình thường bán ngoài chợ."

>>> word_tokenize(sentence)
[u"Chúng ta", u"thường", u"nói", u"đến", u"Rau sạch", u",", u"Rau", u"an toàn", u"để", u"phân biệt", u"với",
u"các", u"rau", u"bình thường", u"bán", u"ngoài", u"chợ", u"."]

>>> word_tokenize(sentence, format="text")
u'Chúng_ta thường nói đến Rau_sạch , Rau an_toàn để phân_biệt với các rau bình_thường bán ngoài chợ .'
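The two output formats above carry the same information. As a minimal sketch (the helper name `tokens_to_text` is hypothetical and not part of underthesea), the text format can be rebuilt from the token list by replacing the spaces inside each multi-syllable word with underscores. It is written for Python 3, so plain strings stand in for the `u"..."` literals.

```python
def tokens_to_text(tokens):
    """Join tokens with spaces; spaces inside a token become underscores."""
    return " ".join(token.replace(" ", "_") for token in tokens)


# Token list copied from the word_tokenize(sentence) output above.
tokens = [
    "Chúng ta", "thường", "nói", "đến", "Rau sạch", ",", "Rau", "an toàn",
    "để", "phân biệt", "với", "các", "rau", "bình thường", "bán", "ngoài",
    "chợ", ".",
]

text = tokens_to_text(tokens)
print(text)
```

Splitting the result on whitespace gives back one item per word, which is handy when a downstream tool expects whitespace-delimited input.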

2. POS Tagging

Accuracy: 92.3% · custom models · API

Usage

>>> # -*- coding: utf-8 -*-
>>> from underthesea import pos_tag
>>> text = u"Chợ thịt chó nổi tiếng ở TP HCM bị truy quét"
>>> pos_tag(text)
[(u'Chợ', 'N'),
 (u'thịt', 'N'),
 (u'chó', 'N'),
 (u'nổi tiếng', 'A'),
 (u'ở', 'E'),
 (u'TP HCM', 'Np'),
 (u'bị', 'V'),
 (u'truy quét', 'V')]
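Because `pos_tag` returns a list of `(word, tag)` pairs, selecting words by part of speech is a one-line filter. The sketch below works on a copy of the output shown above; `words_with_tag` is a hypothetical helper, not an underthesea function.

```python
def words_with_tag(tagged, wanted):
    """Return the words whose POS tag is in the `wanted` set, in order."""
    return [word for word, tag in tagged if tag in wanted]


# (word, tag) pairs copied from the pos_tag output above.
tagged = [
    ("Chợ", "N"), ("thịt", "N"), ("chó", "N"), ("nổi tiếng", "A"),
    ("ở", "E"), ("TP HCM", "Np"), ("bị", "V"), ("truy quét", "V"),
]

nouns = words_with_tag(tagged, {"N", "Np"})
verbs = words_with_tag(tagged, {"V"})
print(nouns)
print(verbs)
```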

3. Chunking

F1: 77% · custom models · API

Usage

>>> # -*- coding: utf-8 -*-
>>> from underthesea import chunk
>>> text = u"Bác sĩ bây giờ có thể thản nhiên báo tin bệnh nhân bị ung thư?"
>>> chunk(text)
[(u'Bác sĩ', 'N', 'B-NP'),
 (u'bây giờ', 'P', 'I-NP'),
 (u'có thể', 'R', 'B-VP'),
 (u'thản nhiên', 'V', 'I-VP'),
 (u'báo tin', 'N', 'B-NP'),
 (u'bệnh nhân', 'N', 'I-NP'),
 (u'bị', 'V', 'B-VP'),
 (u'ung thư', 'N', 'I-VP'),
 (u'?', 'CH', 'O')]
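The third field of each tuple looks like a BIO tag: `B-` opens a phrase, `I-` continues it, and `O` marks a token outside any phrase. Assuming that reading, the sketch below merges the tagged words into whole phrases. `group_chunks` is a hypothetical helper, not part of underthesea.

```python
def group_chunks(chunked):
    """Merge (word, pos, bio_tag) triples into (label, phrase) pairs."""
    phrases = []
    current = None  # [label, words] for the phrase being built, if any
    for word, _pos, tag in chunked:
        if tag == "O":
            current = None  # an outside token closes any open phrase
            continue
        prefix, label = tag.split("-", 1)
        if prefix == "I" and current is not None and current[0] == label:
            current[1].append(word)
        else:
            # "B-" tag, or an "I-" tag with no matching open phrase
            current = [label, [word]]
            phrases.append(current)
    return [(label, " ".join(words)) for label, words in phrases]


# Triples copied from the chunk(text) output above.
chunked = [
    ("Bác sĩ", "N", "B-NP"), ("bây giờ", "P", "I-NP"),
    ("có thể", "R", "B-VP"), ("thản nhiên", "V", "I-VP"),
    ("báo tin", "N", "B-NP"), ("bệnh nhân", "N", "I-NP"),
    ("bị", "V", "B-VP"), ("ung thư", "N", "I-VP"),
    ("?", "CH", "O"),
]

phrases = group_chunks(chunked)
print(phrases)
```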

4. Named Entity Recognition

F1: 86.6% · custom models · API

Usage

>>> # -*- coding: utf-8 -*-
>>> from underthesea import ner
>>> text = u"Chưa tiết lộ lịch trình tới Việt Nam của Tổng thống Mỹ Donald Trump"
>>> ner(text)
[('Chưa', 'R', 'O', 'O'),
 ('tiết lộ', 'V', 'B-VP', 'O'),
 ('lịch trình', 'V', 'B-VP', 'O'),
 ('tới', 'E', 'B-PP', 'O'),
 ('Việt Nam', 'Np', 'B-NP', 'B-LOC'),
 ('của', 'E', 'B-PP', 'O'),
 ('Tổng thống', 'N', 'B-NP', 'O'),
 ('Mỹ', 'Np', 'B-NP', 'B-LOC'),
 ('Donald', 'Np', 'B-NP', 'B-PER'),
 ('Trump', 'Np', 'B-NP', 'I-PER')]
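Each tuple here carries the word, its POS tag, its chunk tag, and an entity tag in last position. As a sketch under the assumption that the entity tags follow the usual BIO convention, the code below collects the entities by type. `entities_by_type` is a hypothetical helper, not an underthesea function.

```python
def entities_by_type(tagged):
    """Collect entity strings per type from (word, pos, chunk, ner) tuples."""
    found = {}
    current = None  # (label, words) for the entity being built, if any
    for word, _pos, _chunk, tag in tagged:
        if tag == "O":
            current = None
            continue
        prefix, label = tag.split("-", 1)
        if prefix == "I" and current is not None and current[0] == label:
            current[1].append(word)
        else:
            current = (label, [word])
            found.setdefault(label, []).append(current[1])
    return {label: [" ".join(ws) for ws in groups] for label, groups in found.items()}


# Tuples copied from the ner(text) output above.
tagged = [
    ("Chưa", "R", "O", "O"), ("tiết lộ", "V", "B-VP", "O"),
    ("lịch trình", "V", "B-VP", "O"), ("tới", "E", "B-PP", "O"),
    ("Việt Nam", "Np", "B-NP", "B-LOC"), ("của", "E", "B-PP", "O"),
    ("Tổng thống", "N", "B-NP", "O"), ("Mỹ", "Np", "B-NP", "B-LOC"),
    ("Donald", "Np", "B-NP", "B-PER"), ("Trump", "Np", "B-NP", "I-PER"),
]

entities = entities_by_type(tagged)
print(entities)
```

Note that "Donald" and "Trump" arrive as separate tokens and are joined only because the second one is tagged `I-PER`.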

5. Text Classification

Accuracy: 86.7% · custom models · API

Install dependencies and download default model

$ pip install Cython
$ pip install future scipy numpy scikit-learn
$ pip install -U fasttext --no-cache-dir --no-deps --force-reinstall
$ underthesea data

Usage

>>> # -*- coding: utf-8 -*-
>>> from underthesea import classify
>>> classify("HLV đầu tiên ở Premier League bị sa thải sau 4 vòng đấu")
['The thao']
>>> classify("Hội đồng tư vấn kinh doanh Asean vinh danh giải thưởng quốc tế")
['Kinh doanh']
>>> classify("Đánh giá “rạp hát tại gia” Samsung Soundbar Sound+ MS750")
['Vi tinh']
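Since `classify` returns a list of labels per input, tallying labels over many texts is straightforward. The sketch below takes the classifier as an argument so that it runs without the model installed. Both `count_labels` and the stand-in `fake_classify` are hypothetical. With underthesea set up as described above, `underthesea.classify` could be passed in instead.

```python
from collections import Counter


def count_labels(texts, classify_fn):
    """Tally the labels that classify_fn assigns across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(classify_fn(text))  # classify_fn returns a list of labels
    return counts


def fake_classify(text):
    """Stand-in for a real classifier; NOT the underthesea model."""
    return ["The thao"] if "Premier League" in text else ["Kinh doanh"]


texts = [
    "HLV đầu tiên ở Premier League bị sa thải sau 4 vòng đấu",
    "Hội đồng tư vấn kinh doanh Asean vinh danh giải thưởng quốc tế",
]

counts = count_labels(texts, fake_classify)
print(counts.most_common())
```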

6. Sentiment Analysis

F1: 59.5% · custom models · API

Install dependencies

$ pip install future scipy numpy scikit-learn==0.19.0 joblib

Usage

>>> # -*- coding: utf-8 -*-
>>> from underthesea import sentiment
>>> sentiment("Gọi mấy lần mà lúc nào cũng là các chuyên viên đang bận hết ạ")
('CUSTOMER SUPPORT#NEGATIVE',)
>>> sentiment("bidv cho vay hay ko phu thuoc y thich cua thang tham dinh, ko co quy dinh ro rang")
('LOAN#NEGATIVE',)
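The returned labels appear to pack an aspect and a polarity into one string separated by `#`. Assuming that format holds for every label, a small helper (hypothetical, not part of underthesea) can split them into pairs for further processing.

```python
def split_labels(labels):
    """Split 'ASPECT#POLARITY' strings into (aspect, polarity) pairs."""
    pairs = []
    for label in labels:
        # Split on the last '#' so an aspect containing '#' stays intact.
        aspect, polarity = label.rsplit("#", 1)
        pairs.append((aspect, polarity))
    return pairs


# Label tuples copied from the sentiment outputs above.
print(split_labels(("CUSTOMER SUPPORT#NEGATIVE",)))
print(split_labels(("LOAN#NEGATIVE",)))
```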

Upcoming Features

  • Text to Speech

  • Automatic Speech Recognition

  • Machine Translation

  • Dependency Parsing

Contributing

Do you want to contribute to underthesea development? Great! Please read the details in CONTRIBUTING.rst.

History

1.1.7.alpha (2018-03-29)

  • Rename word_sent function to word_tokenize

  • Refactor version control in setup.py file and __init__.py file

  • Update documentation badge url

1.1.6 (2017-12-26)

  • New feature: aspect sentiment analysis

  • Integrate with languageflow 1.1.6

  • Fix bug when tokenizing strings containing '=' (#159)

1.1.5 (2017-10-12)

  • New feature: named entity recognition

  • Refactor and update model for word_sent, pos_tag, chunking

1.1.4 (2017-09-12)

  • New feature: text classification

  • [bug] Fix Text error

  • [doc] Add facebook link

1.1.3 (2017-08-30)

1.1.2 (2017-08-22)

  • Add dictionary

1.1.1 (2017-07-05)

  • Support Python 3

  • Refactor feature_engineering code

1.1.0 (2017-05-30)

  • Add chunking feature

  • Add pos_tag feature

  • Add word_sent feature, fix performance

  • Add Corpus class

  • Add Transformer classes

  • Integrated with dictionary of Ho Ngoc Duc

  • Add travis-CI, auto build with PyPI

1.0.0 (2017-03-01)

  • First release on PyPI.

  • First release on Readthedocs
