========================================
Under The Sea - Vietnamese NLP Toolkit
========================================
.. image:: https://img.shields.io/pypi/v/underthesea.svg
    :target: https://pypi.python.org/pypi/underthesea

.. image:: https://img.shields.io/pypi/pyversions/underthesea.svg
    :target: https://pypi.python.org/pypi/underthesea

.. image:: https://img.shields.io/pypi/l/underthesea.svg
    :target: https://pypi.python.org/pypi/underthesea

.. image:: https://img.shields.io/travis/magizbox/underthesea.svg
    :target: https://travis-ci.org/magizbox/underthesea

.. image:: https://readthedocs.com/projects/magizbox-underthesea/badge/?version=latest
    :target: http://underthesea.readthedocs.io/en/latest/
    :alt: Documentation Status

.. image:: https://pyup.io/repos/github/magizbox/underthesea/shield.svg
    :target: https://pyup.io/repos/github/magizbox/underthesea/
    :alt: Updates

.. image:: https://img.shields.io/badge/chat-on%20facebook-green.svg
    :target: https://www.facebook.com/undertheseanlp/

|

.. image:: https://raw.githubusercontent.com/magizbox/underthesea/master/logo.jpg
    :target: https://raw.githubusercontent.com/magizbox/underthesea/master/logo.jpg

**underthesea** is a suite of open source Python modules, data sets and tutorials supporting research and development in Vietnamese Natural Language Processing.

* Free software: GNU General Public License v3
* Documentation: `https://underthesea.readthedocs.io <http://underthesea.readthedocs.io/en/latest/>`_
* Live demo: `underthesea app <http://magizbox.com:9386/#/>`_
* Facebook Page: `https://www.facebook.com/undertheseanlp/ <https://www.facebook.com/undertheseanlp/>`_

Installation
----------------------------------------
Install the dependencies:

.. code-block:: bash

    $ pip install Cython numpy scipy scikit-learn fasttext python-crfsuite

To install underthesea, simply:

.. code-block:: bash

    $ pip install underthesea
    $ underthesea data

✨🍰✨
Satisfaction, guaranteed.
Usage
----------------------------------------
* `1. Corpus <#1-corpus>`_
* `2. Word Segmentation <#2-word-segmentation>`_
* `3. POS Tagging <#3-pos-tagging>`_
* `4. Chunking <#4-chunking>`_
* `5. Text Classification <#5-text-classification>`_
****************************************
1. Corpus
****************************************
.. image:: https://img.shields.io/badge/documents-18k-red.svg
    :target: #

.. image:: https://img.shields.io/badge/words-74k-red.svg
    :target: #

A collection of Vietnamese corpora:

* `Vietnamese Dictionary (74k words) <https://github.com/magizbox/underthesea/tree/master/underthesea/corpus/data>`_
* `Vietnamese News Corpus (10k documents) <https://github.com/magizbox/corpus.vinews>`_
* `Vietnamese Wikipedia Corpus (8k documents) <https://github.com/magizbox/corpus.viwiki>`_
****************************************
2. Word Segmentation
****************************************
.. image:: https://img.shields.io/badge/F1-97%25-red.svg
    :target: https://github.com/magizbox/underthesea.word_sent

.. image:: https://img.shields.io/badge/%E2%98%85-experiments-blue.svg
    :target: https://github.com/magizbox/underthesea.word_sent

Vietnamese Word Segmentation using Conditional Random Fields

* `Word Segmentation API <https://magizbox-underthesea.readthedocs-hosted.com/en/latest/api.html#word-sent-package>`_
* `Word Segmentation Experiments <https://github.com/magizbox/underthesea.word_sent>`_
.. code-block:: python

    >>> # -*- coding: utf-8 -*-
    >>> from underthesea import word_sent
    >>> sentence = u"Chúng ta thường nói đến Rau sạch, Rau an toàn để phân biệt với các rau bình thường bán ngoài chợ."
    >>> word_sent(sentence)
    [u"Chúng ta", u"thường", u"nói", u"đến", u"Rau sạch", u",", u"Rau", u"an toàn", u"để", u"phân biệt", u"với",
     u"các", u"rau", u"bình thường", u"bán", u"ngoài", u"chợ", u"."]
    >>> word_sent(sentence, format="text")
    u'Chúng_ta thường nói đến Rau_sạch , Rau an_toàn để phân_biệt với các rau bình_thường bán ngoài chợ .'

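
Judging from the example above, the two output formats carry the same segmentation: in the text format the syllables of each word are joined with underscores and the words are separated by spaces. A minimal sketch of that relationship in plain Python follows. ``tokens_to_text`` is a hypothetical helper written for illustration, not part of underthesea, and the token list is copied from the start of the example output.

.. code-block:: python

    def tokens_to_text(tokens):
        """Join the syllables of each token with "_" and the tokens with spaces."""
        return " ".join(token.replace(" ", "_") for token in tokens)


    # First tokens of the segmentation shown in the example above.
    tokens = ["Chúng ta", "thường", "nói", "đến", "Rau sạch", ",", "Rau", "an toàn"]
    text = tokens_to_text(tokens)
    print(text)
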
****************************************
3. POS Tagging
****************************************
.. image:: https://img.shields.io/badge/accuracy-92.3%25-red.svg
    :target: https://github.com/magizbox/underthesea.pos_tag

.. image:: https://img.shields.io/badge/%E2%98%85-experiments-blue.svg
    :target: https://github.com/magizbox/underthesea.pos_tag

Vietnamese Part of Speech Tagging using Conditional Random Fields

* `POS Tagging API <https://magizbox-underthesea.readthedocs-hosted.com/en/latest/api.html#pos-tag-package>`_
* `POS Tagging Experiments <https://github.com/magizbox/underthesea.pos_tag>`_
.. code-block:: python

    >>> # -*- coding: utf-8 -*-
    >>> from underthesea import pos_tag
    >>> text = u"Chợ thịt chó nổi tiếng ở TP Hồ Chí Minh bị truy quét"
    >>> pos_tag(text)
    [(u'Chợ', 'N'),
     (u'thịt', 'N'),
     (u'chó', 'N'),
     (u'nổi tiếng', 'A'),
     (u'ở', 'E'),
     (u'TP Hồ Chí Minh', 'Np'),
     (u'bị', 'V'),
     (u'truy quét', 'V')]

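
Because the result is a plain list of ``(word, tag)`` pairs, it can be filtered with ordinary Python. A small sketch, assuming that output shape: ``words_with_tag`` is a hypothetical helper, not part of underthesea, and the input is a subset of the pairs from the example above.

.. code-block:: python

    def words_with_tag(tagged, tag):
        """Return the words whose POS tag equals `tag`, in sentence order."""
        return [word for word, word_tag in tagged if word_tag == tag]


    # A few (word, tag) pairs copied from the example output above.
    tagged = [
        ("Chợ", "N"),
        ("thịt", "N"),
        ("chó", "N"),
        ("nổi tiếng", "A"),
        ("ở", "E"),
        ("bị", "V"),
        ("truy quét", "V"),
    ]
    nouns = words_with_tag(tagged, "N")
    verbs = words_with_tag(tagged, "V")
    print(nouns, verbs)
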
****************************************
4. Chunking
****************************************
.. image:: https://img.shields.io/badge/F1-85.1%25-red.svg
    :target: https://github.com/magizbox/underthesea.chunking

.. image:: https://img.shields.io/badge/%E2%98%85-experiments-blue.svg
    :target: https://github.com/magizbox/underthesea.chunking

Vietnamese Chunking using Conditional Random Fields

* `Chunking API <https://magizbox-underthesea.readthedocs-hosted.com/en/latest/api.html#chunking-package>`_
* `Chunking Experiments <https://github.com/magizbox/underthesea.chunking>`_
.. code-block:: python

    >>> # -*- coding: utf-8 -*-
    >>> from underthesea import chunk
    >>> text = u"Bác sĩ bây giờ có thể thản nhiên báo tin bệnh nhân bị ung thư?"
    >>> chunk(text)
    [(u'Bác sĩ', 'N', 'B-NP'),
     (u'bây giờ', 'P', 'I-NP'),
     (u'có thể', 'R', 'B-VP'),
     (u'thản nhiên', 'V', 'I-VP'),
     (u'báo tin', 'N', 'B-NP'),
     (u'bệnh nhân', 'N', 'I-NP'),
     (u'bị', 'V', 'B-VP'),
     (u'ung thư', 'N', 'I-VP'),
     (u'?', 'CH', 'O')]

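
The third element of each tuple appears to follow the common BIO convention: ``B-`` opens a phrase, ``I-`` continues it, and ``O`` marks a token outside any phrase. Under that assumption, a minimal sketch of grouping the tagged words into phrases could look like this. ``group_chunks`` is a hypothetical helper, not part of underthesea, and the input is copied from the example output above.

.. code-block:: python

    def group_chunks(tagged):
        """Group (word, pos, bio_tag) triples into (phrase_type, words) pairs."""
        chunks = []
        for word, _pos, bio in tagged:
            if bio == "O":
                # Tokens outside any phrase become their own single-word group.
                chunks.append(("O", [word]))
            elif bio.startswith("B-") or not chunks or chunks[-1][0] != bio[2:]:
                # "B-" opens a phrase; a stray "I-" is treated as an opener too.
                chunks.append((bio[2:], [word]))
            else:
                # "I-" of the same type continues the current phrase.
                chunks[-1][1].append(word)
        return chunks


    tagged = [
        ("Bác sĩ", "N", "B-NP"),
        ("bây giờ", "P", "I-NP"),
        ("có thể", "R", "B-VP"),
        ("thản nhiên", "V", "I-VP"),
        ("báo tin", "N", "B-NP"),
        ("bệnh nhân", "N", "I-NP"),
        ("bị", "V", "B-VP"),
        ("ung thư", "N", "I-VP"),
        ("?", "CH", "O"),
    ]
    phrases = group_chunks(tagged)
    for phrase_type, words in phrases:
        print(phrase_type, words)
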
****************************************
5. Text Classification
****************************************
.. image:: https://img.shields.io/badge/accuracy-86.7%25-red.svg
    :target: https://github.com/magizbox/underthesea.classification

.. image:: https://img.shields.io/badge/%E2%98%85-experiments-blue.svg
    :target: https://github.com/magizbox/underthesea.classification

Vietnamese Text Classification using fastText

* `Classification API <http://underthesea.readthedocs.io/en/latest/api.html#classify-package>`_
* `Classification Experiments <https://github.com/magizbox/underthesea.classification>`_
.. code-block:: python

    >>> # -*- coding: utf-8 -*-
    >>> from underthesea import classify
    >>> classify("HLV đầu tiên ở Premier League bị sa thải sau 4 vòng đấu")
    ['The thao']
    >>> classify("Hội đồng tư vấn kinh doanh Asean vinh danh giải thưởng quốc tế")
    ['Kinh doanh']
    >>> classify("Đánh giá “rạp hát tại gia” Samsung Soundbar Sound+ MS750")
    ['Vi tinh']

Upcoming Features
----------------------------------------
* Sentiment Analysis
* Word Representation (`Word Representation Experiments <https://github.com/magizbox/underthesea.word_representation>`_)
* Dependency Parsing
* Named Entity Recognition
Contributing
----------------------------------------
Want to contribute to underthesea development? Great! Please read the details in `CONTRIBUTING.rst <https://github.com/magizbox/underthesea/blob/master/CONTRIBUTING.rst>`_.
================================================================================
History
================================================================================
1.1.4 (2017-09-12)
--------------------------------------------------------------------------------
* New feature: text classification
* [bug] Fix Text error
* [doc] Add Facebook link
1.1.3 (2017-08-30)
--------------------------------------------------------------------------------
* Add live demo: https://underthesea.herokuapp.com/
1.1.2 (2017-08-22)
--------------------------------------------------------------------------------
* Add dictionary
1.1.1 (2017-07-05)
--------------------------------------------------------------------------------
* Support Python 3
* Refactor feature_engineering code
1.1.0 (2017-05-30)
--------------------------------------------------------------------------------
* Add chunking feature
* Add pos_tag feature
* Add word_sent feature, fix performance
* Add Corpus class
* Add Transformer classes
* Integrated with dictionary of Ho Ngoc Duc
* Add Travis CI, auto build with PyPI
1.0.0 (2017-03-01)
--------------------------------------------------------------------------------
* First release on PyPI.
* First release on Read the Docs