
Python Vietnamese Toolkit


What’s New (0.1)

  • Retrained the tokenization model on a much larger dataset (F1 score = 0.985)

  • Added training data and training code

  • Better integration: redundant spaces between tokens are removed after tokenization, e.g. Việt Nam , 12 / 22 / 2020 => Việt Nam, 12/22/2020
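The spacing cleanup described in the last item can be pictured with a small regular-expression sketch. This is an illustration only, not pyvi's actual code: the helper name `tidy_spacing` and both patterns are assumptions chosen to reproduce the example above.

```python
import re


def tidy_spacing(text):
    """Remove redundant spaces around punctuation in tokenized text.

    Hypothetical helper for illustration; not part of pyvi.
    """
    # Drop whitespace before closing punctuation: "Nam ," -> "Nam,"
    text = re.sub(r"\s+([,.;:!?])", r"\1", text)
    # Drop whitespace around a slash that sits between digits: "12 / 22" -> "12/22"
    text = re.sub(r"(?<=\d)\s*/\s*(?=\d)", "/", text)
    return text


print(tidy_spacing("Việt Nam , 12 / 22 / 2020"))  # Việt Nam, 12/22/2020
```

The two rules are deliberately narrow (closing punctuation and digit/slash runs), so ordinary spaces between words are left alone.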


Features:

  • Tokenization

  • POS tagging

  • Accents removal

  • Accents adding

Algorithm: Conditional Random Field

Vietnamese tokenizer f1_score = 0.985

Vietnamese POS tagging f1_score = 0.925
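For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall. The sketch below shows the standard formula only; the page does not say how the reported scores were measured (per token, per tag, averaged or not), and the counts in the example are made up for illustration.

```python
def f1_score(true_pos, false_pos, false_neg):
    """Return the F1 score computed from raw prediction counts."""
    if true_pos == 0:
        # No correct predictions: precision and recall are both 0 (or undefined)
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, not taken from pyvi's evaluation
print(round(f1_score(true_pos=90, false_pos=10, false_neg=10), 3))  # 0.9
```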


POS tags:

  • A - Adjective

  • C - Coordinating conjunction

  • E - Preposition

  • I - Interjection

  • L - Determiner

  • M - Numeral

  • N - Common noun

  • Nc - Noun classifier

  • Ny - Noun abbreviation

  • Np - Proper noun

  • Nu - Unit noun

  • P - Pronoun

  • R - Adverb

  • S - Subordinating conjunction

  • T - Auxiliary, modal words

  • V - Verb

  • X - Unknown

  • F - Filtered out (punctuation)
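When reading tagger output it can help to keep the tag list above in a lookup table. The sketch below is a convenience helper, not part of pyvi: the names `POS_TAGS` and `describe` are hypothetical, and it assumes the tagger result can be unpacked into a list of tokens and a parallel list of tags. The sample tokens and tags are written by hand, not real tagger output.

```python
# Tag descriptions copied from the list above
POS_TAGS = {
    "A": "Adjective",
    "C": "Coordinating conjunction",
    "E": "Preposition",
    "I": "Interjection",
    "L": "Determiner",
    "M": "Numeral",
    "N": "Common noun",
    "Nc": "Noun classifier",
    "Ny": "Noun abbreviation",
    "Np": "Proper noun",
    "Nu": "Unit noun",
    "P": "Pronoun",
    "R": "Adverb",
    "S": "Subordinating conjunction",
    "T": "Auxiliary, modal words",
    "V": "Verb",
    "X": "Unknown",
    "F": "Filtered out (punctuation)",
}


def describe(tokens, tags):
    """Pair each token with its tag and the tag's description."""
    return [
        (token, tag, POS_TAGS.get(tag, "(not in tag list)"))
        for token, tag in zip(tokens, tags)
    ]


# Hand-written sample input, for illustration only
for row in describe(["Trường", "Hà_Nội", "."], ["N", "Np", "F"]):
    print(row)
```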


To install, at the command line with pip:

$ pip install pyvi

To uninstall:

$ pip uninstall pyvi


from pyvi import ViTokenizer, ViPosTagger, ViUtils

# Tokenize (word-segment) a sentence
ViTokenizer.tokenize(u"Trường đại học bách khoa hà nội")

# Tokenize, then POS-tag the tokenized text
ViPosTagger.postagging(ViTokenizer.tokenize(u"Trường đại học Bách Khoa Hà Nội"))

# Remove accents (diacritics)
ViUtils.remove_accents(u"Trường đại học bách khoa hà nội")

# Add accents back to unaccented text
ViUtils.add_accents(u'truong dai hoc bach khoa ha noi')
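To make the idea of accent removal concrete, here is a standard-library sketch of the same kind of transformation. It is not pyvi's implementation and `strip_accents` is a hypothetical name; it only shows one common approach using Unicode decomposition. Adding accents is a different matter: it needs a trained model and cannot be done with a simple mapping like this.

```python
import unicodedata


def strip_accents(text):
    """Return text with Vietnamese diacritics removed (illustrative sketch)."""
    # đ/Đ have no Unicode decomposition, so map them by hand
    text = text.replace("đ", "d").replace("Đ", "D")
    # NFD splits each accented letter into a base letter plus combining marks
    decomposed = unicodedata.normalize("NFD", text)
    # Keep everything except the combining marks (category "Mn")
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(strip_accents("Trường đại học bách khoa hà nội"))
# Truong dai hoc bach khoa ha noi
```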

