Underthesea - Vietnamese NLP Toolkit

License: GNU General Public License v3

underthesea is a suite of open source Python modules, data sets and tutorials supporting research and development in Vietnamese Natural Language Processing.

Installation

To install underthesea, simply:

$ pip install underthesea==1.1.8
✨🍰✨

Satisfaction, guaranteed.

Usage

1. Word Segmentation

F1 score: 94%

Usage

>>> # -*- coding: utf-8 -*-
>>> from underthesea import word_tokenize
>>> sentence = 'Chàng trai 9X Quảng Trị khởi nghiệp từ nấm sò'

>>> word_tokenize(sentence)
['Chàng trai', '9X', 'Quảng Trị', 'khởi nghiệp', 'từ', 'nấm', 'sò']

>>> word_tokenize(sentence, format="text")
'Chàng_trai 9X Quảng_Trị khởi_nghiệp từ nấm sò'
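
The two output formats above appear to differ only in how multi-syllable words are marked. The following is a minimal sketch of converting between them, assuming the text format simply replaces the spaces inside each word with underscores, as in the sample. The helper names tokens_to_text and text_to_tokens are hypothetical and not part of underthesea.

```python
def tokens_to_text(tokens):
    """Join segmented words, marking multi-syllable words with underscores."""
    return " ".join(token.replace(" ", "_") for token in tokens)


def text_to_tokens(text):
    """Split underscore-marked text back into a list of words."""
    return [word.replace("_", " ") for word in text.split()]


# Sample output of word_tokenize(sentence), copied from the example above.
tokens = ['Chàng trai', '9X', 'Quảng Trị', 'khởi nghiệp', 'từ', 'nấm', 'sò']

text = tokens_to_text(tokens)
round_trip = text_to_tokens(text)
print(text)
```

The round trip only works when the original text contains no underscores of its own, so this is a convenience for display, not a lossless encoding.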

2. POS Tagging

Accuracy: 92.3%

Usage

>>> # -*- coding: utf-8 -*-
>>> from underthesea import pos_tag
>>> pos_tag('Chợ thịt chó nổi tiếng ở Sài Gòn bị truy quét')
[('Chợ', 'N'),
 ('thịt', 'N'),
 ('chó', 'N'),
 ('nổi tiếng', 'A'),
 ('ở', 'E'),
 ('Sài Gòn', 'Np'),
 ('bị', 'V'),
 ('truy quét', 'V')]
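
Since pos_tag returns a plain list of (word, tag) pairs, ordinary Python tools are enough to work with it. The sketch below counts tags and picks out nouns from the sample output above. It assumes that N and Np denote common and proper nouns, which matches the sample but is not stated here.

```python
from collections import Counter

# Sample output of pos_tag(...), copied from the example above.
tagged = [('Chợ', 'N'), ('thịt', 'N'), ('chó', 'N'), ('nổi tiếng', 'A'),
          ('ở', 'E'), ('Sài Gòn', 'Np'), ('bị', 'V'), ('truy quét', 'V')]

# How often each tag occurs in the sentence.
tag_counts = Counter(tag for _word, tag in tagged)

# Words whose tag is assumed to mark a common or proper noun.
NOUN_TAGS = {'N', 'Np'}
nouns = [word for word, tag in tagged if tag in NOUN_TAGS]

print(tag_counts.most_common())
print(nouns)
```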

3. Chunking

F1 score: 77%

Usage

>>> # -*- coding: utf-8 -*-
>>> from underthesea import chunk
>>> text = 'Bác sĩ bây giờ có thể thản nhiên báo tin bệnh nhân bị ung thư?'
>>> chunk(text)
[('Bác sĩ', 'N', 'B-NP'),
 ('bây giờ', 'P', 'I-NP'),
 ('có thể', 'R', 'B-VP'),
 ('thản nhiên', 'V', 'I-VP'),
 ('báo tin', 'N', 'B-NP'),
 ('bệnh nhân', 'N', 'I-NP'),
 ('bị', 'V', 'B-VP'),
 ('ung thư', 'N', 'I-VP'),
 ('?', 'CH', 'O')]
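
The third element of each tuple looks like a standard B/I/O chunk tag: B- starts a phrase, I- continues it, and O is outside any phrase. As a rough sketch under that assumption, the tagged words can be grouped into phrases as follows. The function group_chunks is hypothetical and not part of underthesea.

```python
def group_chunks(rows):
    """Group (word, pos, chunk_tag) rows into (label, phrase) pairs."""
    phrases = []
    label, words = None, []
    for word, _pos, tag in rows:
        prefix, _, tag_label = tag.partition("-")
        if prefix == "I" and tag_label == label:
            # Continuation of the phrase currently being built.
            words.append(word)
            continue
        if words:
            phrases.append((label, " ".join(words)))
        if prefix in ("B", "I"):
            # Start a new phrase (a stray I- tag is treated like B-).
            label, words = tag_label, [word]
        else:
            label, words = None, []
    if words:
        phrases.append((label, " ".join(words)))
    return phrases


# Sample output of chunk(text), copied from the example above.
rows = [('Bác sĩ', 'N', 'B-NP'), ('bây giờ', 'P', 'I-NP'),
        ('có thể', 'R', 'B-VP'), ('thản nhiên', 'V', 'I-VP'),
        ('báo tin', 'N', 'B-NP'), ('bệnh nhân', 'N', 'I-NP'),
        ('bị', 'V', 'B-VP'), ('ung thư', 'N', 'I-VP'),
        ('?', 'CH', 'O')]

phrases = group_chunks(rows)
for phrase_label, phrase in phrases:
    print(phrase_label, phrase)
```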

4. Named Entity Recognition

F1 score: 86.6%

Usage

>>> # -*- coding: utf-8 -*-
>>> from underthesea import ner
>>> text = 'Chưa tiết lộ lịch trình tới Việt Nam của Tổng thống Mỹ Donald Trump'
>>> ner(text)
[('Chưa', 'R', 'O', 'O'),
 ('tiết lộ', 'V', 'B-VP', 'O'),
 ('lịch trình', 'V', 'B-VP', 'O'),
 ('tới', 'E', 'B-PP', 'O'),
 ('Việt Nam', 'Np', 'B-NP', 'B-LOC'),
 ('của', 'E', 'B-PP', 'O'),
 ('Tổng thống', 'N', 'B-NP', 'O'),
 ('Mỹ', 'Np', 'B-NP', 'B-LOC'),
 ('Donald', 'Np', 'B-NP', 'B-PER'),
 ('Trump', 'Np', 'B-NP', 'I-PER')]
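
Each row above carries the word, its POS tag, its chunk tag, and an entity tag in the last position. A minimal sketch for collecting the entities by type, assuming the last column follows the usual B-/I-/O convention seen in the sample; entities_by_type is a hypothetical helper, not an underthesea function.

```python
from collections import defaultdict


def entities_by_type(rows):
    """Collect multi-word entities from (word, pos, chunk, ner_tag) rows."""
    found = defaultdict(list)
    current = None  # (label, words) of the entity being built, if any
    for word, _pos, _chunk, tag in rows:
        if tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(word)
            continue
        if current:
            found[current[0]].append(" ".join(current[1]))
        # Only a B- tag opens a new entity; O and stray I- tags are skipped.
        current = (tag[2:], [word]) if tag.startswith("B-") else None
    if current:
        found[current[0]].append(" ".join(current[1]))
    return dict(found)


# Sample output of ner(text), copied from the example above.
rows = [('Chưa', 'R', 'O', 'O'), ('tiết lộ', 'V', 'B-VP', 'O'),
        ('lịch trình', 'V', 'B-VP', 'O'), ('tới', 'E', 'B-PP', 'O'),
        ('Việt Nam', 'Np', 'B-NP', 'B-LOC'), ('của', 'E', 'B-PP', 'O'),
        ('Tổng thống', 'N', 'B-NP', 'O'), ('Mỹ', 'Np', 'B-NP', 'B-LOC'),
        ('Donald', 'Np', 'B-NP', 'B-PER'), ('Trump', 'Np', 'B-NP', 'I-PER')]

entities = entities_by_type(rows)
print(entities)
```

For the sample sentence this yields two LOC entities and one PER entity, with "Donald" and "Trump" merged into a single name.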

5. Text Classification

Accuracy: 86.7%

Install dependencies and download default model

$ pip install Cython
$ pip install joblib future scipy numpy scikit-learn
$ pip install -U fasttext --no-cache-dir --no-deps --force-reinstall
$ underthesea data

Usage

>>> # -*- coding: utf-8 -*-
>>> from underthesea import classify
>>> classify('HLV đầu tiên ở Premier League bị sa thải sau 4 vòng đấu')
['The thao']
>>> classify('Hội đồng tư vấn kinh doanh Asean vinh danh giải thưởng quốc tế')
['Kinh doanh']
>>> classify('Đánh giá “rạp hát tại gia” Samsung Soundbar Sound+ MS750')
['Vi tinh']

6. Sentiment Analysis

F1 score: 59.5%

Install dependencies

$ pip install future scipy numpy scikit-learn==0.19.2 joblib

Usage

>>> # -*- coding: utf-8 -*-
>>> from underthesea import sentiment
>>> sentiment('Gọi mấy lần mà lúc nào cũng là các chuyên viên đang bận hết ạ', domain='bank')
('CUSTOMER SUPPORT#NEGATIVE',)
>>> sentiment('bidv cho vay hay ko phu thuoc y thich cua thang tham dinh, ko co quy dinh ro rang', domain='bank')
('LOAN#NEGATIVE',)
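
The labels returned above seem to combine an aspect and a polarity, separated by '#'. The sketch below splits them into pairs for easier filtering. It assumes every label has the form ASPECT#POLARITY, as in the two samples; split_labels is a hypothetical helper.

```python
def split_labels(labels):
    """Turn 'ASPECT#POLARITY' strings into (aspect, polarity) tuples."""
    pairs = []
    for label in labels:
        # Split on the last '#' so an aspect name may itself contain '#'.
        aspect, _, polarity = label.rpartition("#")
        pairs.append((aspect, polarity.lower()))
    return pairs


# Sample outputs of sentiment(..., domain='bank'), copied from above.
first = split_labels(('CUSTOMER SUPPORT#NEGATIVE',))
second = split_labels(('LOAN#NEGATIVE',))

print(first)   # [('CUSTOMER SUPPORT', 'negative')]
print(second)  # [('LOAN', 'negative')]
```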

Upcoming Features

  • Text to Speech

  • Automatic Speech Recognition

  • Machine Translation

  • Dependency Parsing

Contributing

Do you want to contribute to underthesea development? Great! Please read the details in CONTRIBUTING.rst.

History

1.1.8 (2018-06-20)

  • Fix word_tokenize error when text contains a tab (\t) character

  • Fix regex_tokenize with url

1.1.7 (2018-04-12)

  • Rename word_sent function to word_tokenize

  • Refactor version control in setup.py file and __init__.py file

  • Update documentation badge url

1.1.6 (2017-12-26)

  • New feature: aspect sentiment analysis

  • Integrate with languageflow 1.1.6

  • Fix bug when tokenizing a string containing '=' (#159)

1.1.5 (2017-10-12)

  • New feature: named entity recognition

  • Refactor and update model for word_sent, pos_tag, chunking

1.1.4 (2017-09-12)

  • New feature: text classification

  • [bug] Fix Text error

  • [doc] Add facebook link

1.1.3 (2017-08-30)

1.1.2 (2017-08-22)

  • Add dictionary

1.1.1 (2017-07-05)

  • Support Python 3

  • Refactor feature_engineering code

1.1.0 (2017-05-30)

  • Add chunking feature

  • Add pos_tag feature

  • Add word_sent feature, improve performance

  • Add Corpus class

  • Add Transformer classes

  • Integrated with dictionary of Ho Ngoc Duc

  • Add travis-CI, auto build with PyPI

1.0.0 (2017-03-01)

  • First release on PyPI.

  • First release on Readthedocs
