Skip to main content

Vietnamese NLP Toolkit

Project description

========================================
Under The Sea - Vietnamese NLP Toolkit
========================================


.. image:: https://img.shields.io/pypi/v/underthesea.svg
:target: https://pypi.python.org/pypi/underthesea

.. image:: https://img.shields.io/pypi/pyversions/underthesea.svg
:target: https://pypi.python.org/pypi/underthesea

.. image:: https://img.shields.io/pypi/l/underthesea.svg
:target: https://pypi.python.org/pypi/underthesea

.. image:: https://img.shields.io/travis/magizbox/underthesea.svg
:target: https://travis-ci.org/magizbox/underthesea


.. image:: https://readthedocs.com/projects/magizbox-underthesea/badge/?version=latest
:target: https://magizbox-underthesea.readthedocs-hosted.com/en/latest/?badge=latest
:alt: Documentation Status

.. image:: https://pyup.io/repos/github/magizbox/underthesea/shield.svg
:target: https://pyup.io/repos/github/magizbox/underthesea/
:alt: Updates
|
.. image:: https://raw.githubusercontent.com/magizbox/underthesea/master/logo.jpg
:target: https://raw.githubusercontent.com/magizbox/underthesea/master/logo.jpg

**underthesea** is a suite of open source Python modules, data sets and tutorials supporting research and development in Vietnamese Natural Language Processing.

* Free software: GNU General Public License v3
* Documentation: `https://underthesea.readthedocs.io <https://magizbox-underthesea.readthedocs-hosted.com/en/latest/>`_


Installation
----------------------------------------

To install underthesea, simply:

.. code-block:: bash

$ pip install underthesea
✨🍰✨

Satisfaction, guaranteed.

Usage
----------------------------------------

* `1. Corpus <#1-corpus>`_
* `2. Word Segmentation <#2-word-segmentation>`_
* `3. POS Tagging <#3-pos-tagging>`_
* `4. Chunking <#4-chunking>`_

****************************************
1. Corpus
****************************************

.. image:: https://img.shields.io/badge/documents-18k-red.svg
:target: #

.. image:: https://img.shields.io/badge/words-74k-red.svg
:target: #

Collection of Vietnamese corpus

* `Vietnamese Dictionary (74k words) <https://github.com/magizbox/underthesea/tree/master/underthesea/corpus/data>`_

* `Vietnamese News Corpus (10k documents) <https://github.com/magizbox/corpus.vinews>`_
* `Vietnamese Wikipedia Corpus (8k documents) <https://github.com/magizbox/corpus.viwiki>`_

****************************************
2. Word Segmentation
****************************************

.. image:: https://img.shields.io/badge/F1-97%25-red.svg
:target: https://github.com/magizbox/underthesea.word_sent

.. image:: https://img.shields.io/badge/%E2%98%85-can%20beat%20it%3F-blue.svg
:target: https://github.com/magizbox/underthesea.word_sent

Vietnamese Word Segmentation using Conditional Random Fields

* `Word Segmentation API <https://magizbox-underthesea.readthedocs-hosted.com/en/latest/api.html#word-sent-package>`_
* `Word Segmentation Experiences <https://github.com/magizbox/underthesea.word_sent>`_

.. code-block:: python

>>> # -*- coding: utf-8 -*-
>>> from underthesea import word_sent
>>> sentence = u"Chúng ta thường nói đến Rau sạch, Rau an toàn để phân biệt với các rau bình thường bán ngoài chợ."

>>> word_sent(sentence)
[u"Chúng ta", u"thường", u"nói", u"đến", u"Rau sạch", u",", u"Rau", u"an toàn", u"để", u"phân biệt", u"với",
u"các", u"rau", u"bình thường", u"bán", u"ngoài", u"chợ", u"."]

>>> word_sent(sentence, format="text")
u'Chúng_ta thường nói đến Rau_sạch , Rau an_toàn để phân_biệt với các rau bình_thường bán ngoài chợ .'

****************************************
3. POS Tagging
****************************************

.. image:: https://img.shields.io/badge/accuracy-92.3%25-red.svg
:target: https://github.com/magizbox/underthesea.pos_tag

.. image:: https://img.shields.io/badge/%E2%98%85-can%20beat%20it%3F-blue.svg
:target: https://github.com/magizbox/underthesea.pos_tag

Vietnamese Part of Speech Tagging using Conditional Random Fields

* `POS Tagging API <https://magizbox-underthesea.readthedocs-hosted.com/en/latest/api.html#pos-tag-package>`_
* `Pos Tagging Experiences <https://github.com/magizbox/underthesea.pos_tag>`_

.. code-block:: python

>>> # -*- coding: utf-8 -*-
>>> from underthesea import pos_tag
>>> text = u"Chợ thịt chó nổi tiếng ở TP Hồ Chí Minh bị truy quét"
>>> pos_tag(text)
[(u'Chợ', 'N'),
(u'thịt', 'N'),
(u'chó', 'N'),
(u'nổi tiếng', 'A'),
(u'ở', 'E'),
(u'TP HCM', 'Np'),
(u'bị', 'V'),
(u'truy quét', 'V')]

****************************************
4. Chunking
****************************************

.. image:: https://img.shields.io/badge/F1-85.1%25-red.svg
:target: https://github.com/magizbox/underthesea.chunking

.. image:: https://img.shields.io/badge/%E2%98%85-can%20beat%20it%3F-blue.svg
:target: https://github.com/magizbox/underthesea.chunking

Vietnamese Chunking using Conditional Random Fields

* `Chunking API <https://magizbox-underthesea.readthedocs-hosted.com/en/latest/api.html#chunking-package>`_
* `Chunking Experiences <https://github.com/magizbox/underthesea.chunking>`_

.. code-block:: python

>>> # -*- coding: utf-8 -*-
>>> from underthesea import chunk
>>> text = u"Bác sĩ bây giờ có thể thản nhiên báo tin bệnh nhân bị ung thư?"
>>> chunk(text)
[(u'Bác sĩ', 'N', 'B-NP'),
(u'bây giờ', 'P', 'I-NP'),
(u'có thể', 'R', 'B-VP'),
(u'thản nhiên', 'V', 'I-VP'),
(u'báo tin', 'N', 'B-NP'),
(u'bệnh nhân', 'N', 'I-NP'),
(u'bị', 'V', 'B-VP'),
(u'ung thư', 'N', 'I-VP'),
(u'?', 'CH', 'O')]

Up Coming Features
----------------------------------------

* Word Representation (`Word Representation Experiences <https://github.com/magizbox/underthesea.word_representation>`_)
* Dependency Parsing (Experiences)
* Named Entity Recognition
* Sentiment Analysis

Contributing
----------------------------------------

Do you want to contribute with underthesea development? Great! Please read more details at `CONTRIBUTING.rst. <https://github.com/magizbox/underthesea/blob/master/CONTRIBUTING.rst>`_


========================================
History
========================================

1.1.0(2017-05-30)
----------------------------------------

* Add chunking feature
* Add pos_tag feature
* Add word_sent feature, fix performance
* Add Corpus class
* Add Transformer classes
* Integrated with dictionary of Ho Ngoc Duc
* Add travis-CI, auto build with PyPI

1.0.0 (2017-03-01)
----------------------------------------

* First release on PyPI.
* First release on Readthedocs


Project details


Release history Release notifications | RSS feed

This version

1.1.1

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

underthesea-1.1.1.tar.gz (6.1 MB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

underthesea-1.1.1-py2.py3-none-any.whl (5.8 MB view details)

Uploaded Python 2Python 3

File details

Details for the file underthesea-1.1.1.tar.gz.

File metadata

  • Download URL: underthesea-1.1.1.tar.gz
  • Upload date:
  • Size: 6.1 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for underthesea-1.1.1.tar.gz
Algorithm Hash digest
SHA256 d5c28c66fcdf306bb34fdf64173e281055b6af7ed38e3272e33e6a21650e53bb
MD5 d3390cf61682b1c8e09a820de6a77c3b
BLAKE2b-256 ce23ea4a67213e2a3f7c2ea329e68b34fdfc1115626f063811b21410f0d41143

See more details on using hashes here.

File details

Details for the file underthesea-1.1.1-py2.py3-none-any.whl.

File metadata

File hashes

Hashes for underthesea-1.1.1-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 120d080dd3b3dcd95566852219eb795c1f6a2ae1299274359cd22893c2ef5a64
MD5 2d67423e7480a5e80cc2f11f5d9554d8
BLAKE2b-256 0e2401145c346136fe7462029c9db771ba3281cccbc7158158a3fcd7483d3c69

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page