SEACoreNLP: A Python NLP Toolkit for Southeast Asian Languages

SEACoreNLP is an initiative by NLPHub of AI Singapore that aims to provide a one-stop solution for Natural Language Processing (NLP) in Southeast Asia.

It brings together available open-source resources (datasets, models and libraries) and unifies them under a single framework. Where suitable data is available, we also train our own models and provide them through the package alongside the third-party libraries and models.

Demo

Please refer to our demo to see our models in action.

Languages Supported

We currently support the following languages:

  • Indonesian
  • Thai
  • Vietnamese

Core NLP Tasks

The core NLP tasks that we cover are as follows:

  • Word Tokenization
  • Sentence Segmentation
  • Part-of-speech Tagging (POS Tagging)
  • Named Entity Recognition (NER)
  • Constituency Parsing
  • Dependency Parsing

Installation

pip install seacorenlp

If you wish to make use of models from Stanza (one of the third-party libraries that we use), ensure that you also install the relevant models after installing seacorenlp.

import stanza

stanza.download('id') # Download Indonesian models
stanza.download('vi') # Download Vietnamese models

# Stanza does not have models for Thai
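If you work with several languages, the download step can be wrapped in a small helper. The following is only a sketch: the names STANZA_CODES, codes_to_download and download_models are hypothetical and not part of SEACoreNLP or Stanza, and the Thai entry simply mirrors the note above that Stanza has no Thai models. The only Stanza call used is stanza.download.

```python
# Stanza language codes for the languages SEACoreNLP supports.
# None marks a language for which, per the note above, Stanza has no models.
# (Hypothetical mapping for illustration; not part of either library.)
STANZA_CODES = {"indonesian": "id", "vietnamese": "vi", "thai": None}


def codes_to_download(languages):
    """Return the Stanza codes to download for the given language names."""
    codes = []
    for language in languages:
        code = STANZA_CODES[language.lower()]
        if code is not None:  # skip languages without Stanza models
            codes.append(code)
    return codes


def download_models(languages):
    """Download the Stanza models for every language that has them."""
    import stanza  # imported lazily so codes_to_download works without Stanza

    for code in codes_to_download(languages):
        stanza.download(code)
```

Calling download_models(["Indonesian", "Thai", "Vietnamese"]) would then fetch the Indonesian and Vietnamese models and skip Thai.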

The latest version of underthesea (a Vietnamese NLP package that SEACoreNLP depends on) has dependency conflicts with other packages used in SEACoreNLP, so SEACoreNLP installs an earlier version (1.2.3) that does not. This earlier version does not include the Vietnamese dependency parser; if you need it, manually upgrade underthesea to the latest version.
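To check which version of underthesea is installed before deciding whether to upgrade, something like the sketch below may help. It assumes Python 3.8 or later (for importlib.metadata); the helper names are hypothetical, and the pinned version 1.2.3 is taken from the paragraph above. The upgrade itself is the standard pip install --upgrade, which may reintroduce the dependency conflicts mentioned above.

```python
import sys
from importlib import metadata

# Version that SEACoreNLP installs, as stated above.
PINNED_VERSION = "1.2.3"


def installed_version(package):
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def needs_manual_upgrade(version):
    """True if `version` is the pinned release without the dependency parser."""
    return version == PINNED_VERSION


def upgrade_command(package="underthesea"):
    """Build the pip command that upgrades `package` in this interpreter."""
    return [sys.executable, "-m", "pip", "install", "--upgrade", package]


if needs_manual_upgrade(installed_version("underthesea")):
    # Print the command instead of running it, so the upgrade stays a manual step.
    print("To upgrade, run:", " ".join(upgrade_command()))
```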

You may consider using our natively trained Vietnamese dependency parsers if you do not wish to perform this manual upgrade:

from seacorenlp.parsing import DependencyParser

# Load best Vietnamese dependency parser trained on Universal Dependencies data
parser = DependencyParser.from_pretrained("dp-vi-ud-xlmr-best")
parser.predict("Tôi muốn ăn cơm.")

Usage

We provide a command-line interface for training, evaluation and prediction. We also provide classes (such as Tokenizer, POSTagger and NERTagger) and models that can be used in a manner reminiscent of Hugging Face Transformers.

from seacorenlp.tagging import POSTagger

th_text = 'ผมอยากกินข้าว'

# Native Models
native_tagger = POSTagger.from_pretrained('pos-th-ud-xlmr')
native_tagger.predict(th_text)
# Output: [('ผม', 'PRON'), ('อยาก', 'VERB'), ('กิน', 'VERB'), ('ข้าว', 'NOUN')]

# External Models
# Include keyword arguments as necessary (see respective class documentation)
external_tagger = POSTagger.from_library('pythainlp', corpus='orchid')
external_tagger.predict(th_text)
# Output: [('ผม', 'PPRS'), ('อยาก', 'XVMM'), ('กิน', 'VACT'), ('ข้าว', 'NCMN')]
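Both taggers return a list of (token, tag) pairs, but with different tagsets (Universal Dependencies tags from the native model, ORCHID tags from PyThaiNLP). To compare them side by side, the two outputs can be aligned token by token. The sketch below works on the outputs shown above, copied in as literals so that it runs without any models; align_tags is a hypothetical helper, not part of SEACoreNLP.

```python
# Outputs copied from the example above.
native_tags = [('ผม', 'PRON'), ('อยาก', 'VERB'), ('กิน', 'VERB'), ('ข้าว', 'NOUN')]
external_tags = [('ผม', 'PPRS'), ('อยาก', 'XVMM'), ('กิน', 'VACT'), ('ข้าว', 'NCMN')]


def align_tags(first, second):
    """Merge two (token, tag) lists into (token, first_tag, second_tag) triples.

    Raises ValueError if the two taggers tokenized the text differently,
    since the tags can then no longer be compared position by position.
    """
    first_tokens = [token for token, _ in first]
    second_tokens = [token for token, _ in second]
    if first_tokens != second_tokens:
        raise ValueError("Tokenizations differ; tags cannot be aligned.")
    return [(token, tag_a, tag_b)
            for (token, tag_a), (_, tag_b) in zip(first, second)]


for token, ud_tag, orchid_tag in align_tags(native_tags, external_tags):
    print(f"{token}\t{ud_tag}\t{orchid_tag}")
```

The explicit token check matters because different models may segment Thai text differently, in which case a position-wise comparison would be misleading.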

Please refer to our documentation for details.


