An open-source Python package for existing NCHLT core technologies for ten South African languages.

About The Project

This project is an open-source Python package for the existing NCHLT core technologies for ten South African languages (Afrikaans, English, isiNdebele, isiXhosa, isiZulu, Sesotho sa Leboa, Sesotho, Setswana, Siswati, Tshivenḓa, Xitsonga). The package covers the following technologies, 19 in total: Tokenisers, Sentence Separators, Part of Speech Taggers, Named Entity Recognisers, Phrase Chunkers, Optical Character Recognisers, and a Language Identifier.

Getting Started

To get a local copy up and running, follow these steps.

Prerequisites

Installation

pip

pip install ctextcore

GitHub

# Download the source code from GitHub
git clone https://github.com/ctextdev/ctextcore.git

# Install from source
cd ctextcore
py -m pip install .

# Install from source in Development Mode
cd ctextcore
py -m pip install -e .

Usage

Importing the CTexT Core library

from ctextcore.core import CCore as core
server = core()

The CCore constructor (imported above as core) accepts the following configuration arguments:

port: 8079              # Set the port the server should use
timeout: 60000          # Set the timeout of HTTP requests
threads: 5              # Set the total number of threads to use
memory: "4G"            # Set the maximum memory allowed to be used by the server
be_quiet: False         # Set the logging output from the server
max_char_length: 10000  # Set the maximum character length

server = core(port=8081,memory="16G",...)
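The configuration arguments listed above can be collected in a dictionary and checked before the server is created. The following is a minimal sketch, not part of ctextcore: the helper `build_core_config` and the constant `DOCUMENTED_VALUES` are hypothetical names, and the sketch assumes the values shown in the list are the ones the constructor falls back on.

```python
# Values copied from the argument list in this README. Whether the installed
# version uses exactly these as defaults is an assumption.
DOCUMENTED_VALUES = {
    "port": 8079,
    "timeout": 60000,
    "threads": 5,
    "memory": "4G",
    "be_quiet": False,
    "max_char_length": 10000,
}


def build_core_config(**overrides):
    """Merge the documented values with overrides, rejecting unknown names.

    A typo such as `prot=8081` raises TypeError here instead of being
    passed on to the server.
    """
    unknown = sorted(set(overrides) - set(DOCUMENTED_VALUES))
    if unknown:
        raise TypeError(f"Unknown configuration argument(s): {', '.join(unknown)}")
    return {**DOCUMENTED_VALUES, **overrides}


config = build_core_config(port=8081, memory="16G")
print(config)

# With ctextcore installed, the dictionary can be unpacked into the constructor:
# server = core(**config)
```

Keeping the settings in one dictionary makes it easier to log or reuse them when several servers are started with different ports.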

Downloading models

Download all language models for a specific technology

# This call will download all the language models for POS.
server.download_model(tech='pos', language='all')

Download all technologies for a specific language

# This call will download all the technology models for isiZulu.
server.download_model(tech='all', language='zu')

Download a specific language model for a specific technology

# This call will download the POS technology model for Sesotho sa Leboa.
server.download_model(tech='pos', language='nso')
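When only a few technologies and languages are needed, the calls above can be repeated in a loop rather than downloading everything with 'all'. The sketch below assumes nothing beyond the `download_model(tech=..., language=...)` call shown above; `download_models` and `RecordingServer` are hypothetical names, and the stand-in class exists only so the example runs without downloading anything.

```python
def download_models(server, techs, languages):
    """Call server.download_model once per (tech, language) pair.

    `server` is any object exposing download_model(tech=..., language=...),
    such as the CCore instance created earlier.
    """
    requested = []
    for tech in techs:
        for language in languages:
            server.download_model(tech=tech, language=language)
            requested.append((tech, language))
    return requested


class RecordingServer:
    """Hypothetical stand-in that records the calls instead of downloading."""

    def __init__(self):
        self.calls = []

    def download_model(self, tech, language):
        self.calls.append((tech, language))


demo = RecordingServer()
pairs = download_models(demo, techs=["pos", "ocr"], languages=["zu", "nso"])
print(pairs)
```

With a real server, `download_models(server, ["pos", "ocr"], ["zu", "nso"])` would issue the same four calls against the package.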

Using a model

# This call will run the isiZulu POS tagger on the input text 'E uma lungekho usuku olufakiwe, usuku lwakho lokubhalisa luyofakwa nge-othomathikhi kube usuku lokuqala lwenyanga elandelayo ukuze kungadaleki izikweletu.'.
output_process = server.process_text(text_input='E uma lungekho usuku olufakiwe, usuku lwakho lokubhalisa luyofakwa nge-othomathikhi kube usuku lokuqala lwenyanga elandelayo ukuze kungadaleki izikweletu.', language='zu', tech='pos')
print(output_process)

from pathlib import Path # Path needs to be imported to be able to use OCR

# This call will run the Sesotho sa Leboa OCR on the image or pdf path provided in the text_input argument.
output_process = server.process_text(text_input=Path('<path-to-image-or-pdf>'), language='nso', tech='ocr')
print(output_process)
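The OCR call above takes one path at a time, so a folder of scans needs a small loop. This is a rough sketch under stated assumptions: `ocr_folder` and `EchoServer` are hypothetical names, the list of file suffixes is a guess at what counts as "image or pdf", and the stand-in server only echoes its arguments so the example runs without ctextcore.

```python
import tempfile
from pathlib import Path

# Assumed set of suffixes; adjust to the file types the OCR models accept.
OCR_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".pdf"}


def ocr_folder(server, folder, language):
    """Run OCR on every matching file in `folder`, keyed by file name."""
    results = {}
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.suffix.lower() in OCR_SUFFIXES:
            results[path.name] = server.process_text(
                text_input=path, language=language, tech="ocr"
            )
    return results


class EchoServer:
    """Hypothetical stand-in for the CCore instance; it performs no OCR."""

    def process_text(self, text_input, language, tech):
        return f"<{tech}:{language}:{text_input.name}>"


with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.png", "notes.txt", "c.pdf"):
        (Path(tmp) / name).write_bytes(b"")
    results = ocr_folder(EchoServer(), tmp, language="nso")

print(sorted(results))
```

Passing `Path` objects through unchanged matches the call shown above, where the OCR input is given as a `Path` rather than a string.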

# This call will run LID on the input text 'Sizoqhubeka ukwenza ngcono ukusebenza kukagesi wethu kanye nokuthembela kugesi ophinde uvuseleleke.' with a confidence threshold of 0.5 (50%).
output_process = server.process_text(text_input='Sizoqhubeka ukwenza ngcono ukusebenza kukagesi wethu kanye nokuthembela kugesi ophinde uvuseleleke.', tech='lid', confidence=0.5)
print(output_process)

Output formats

The ctextcore package offers three different output formats (JSON, Delimited, Array). The default output format is JSON; it can be changed by providing the output_format argument to the process_text method. An extra argument, delimiter, can be used together with the delimited output format to change the delimiter used in the output. The default delimiter is _.

# This call will run the Afrikaans POS tagger on the input text 'Hierdie is 'n voorbeeldsin om die funksionaliteit te toets.' and will return a delimited output.
output_process = server.process_text(text_input='Hierdie is \'n voorbeeldsin om die funksionaliteit te toets.', language='af', tech='pos', output_format="delimited", delimiter="|")
print(output_process)

Output examples:

# JSON
[{'doc': {'p': {'lid': 'NONE', 'sent': {'tokens': [{'start_char': 0, 'pos': 'PA', 'end_char': 7, 'id': 1, 'text': 'Hierdie'}, {'start_char': 8, 'pos': 'VTHOK', 'end_char': 10, 'id': 2, 'text': 'is'}, {'start_char': 11, 'pos': 'LO', 'end_char': 13, 'id': 3, 'text': "'n"}, {'start_char': 14, 'pos': 'NSE', 'end_char': 26, 'id': 4, 'text': 'voorbeeldsin'}, {'start_char': 27, 'pos': 'SVS', 'end_char': 29, 'id': 5, 'text': 'om'}, {'start_char': 30, 'pos': 'LB', 'end_char': 33, 'id': 6, 'text': 'die'}, {'start_char': 34, 'pos': 'NSE', 'end_char': 49, 'id': 7, 'text': 'funksionaliteit'}, {'start_char': 50, 'pos': 'UPI', 'end_char': 52, 'id': 8, 'text': 'te'}, {'start_char': 53, 'pos': 'VTHSG', 'end_char': 58, 'id': 9, 'text': 'toets'}, {'start_char': 58, 'pos': 'ZE', 'end_char': 59, 'id': 10, 'text': '.'}]}}}}]

# List
[('Hierdie', 'PA'), ('is', 'VTHOK'), ("'n", 'LO'), ('voorbeeldsin', 'NSE'), ('om', 'SVS'), ('die', 'LB'), ('funksionaliteit', 'NSE'), ('te', 'UPI'), ('toets', 'VTHSG'), ('.', 'ZE')]

# Delimited
['Hierdie|PA', 'is|VTHOK', "'n|LO", 'voorbeeldsin|NSE', 'om|SVS', 'die|LB', 'funksionaliteit|NSE', 'te|UPI', 'toets|VTHSG', '.|ZE']
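The JSON and delimited outputs above can be reduced to the same (token, tag) pairs with a few lines of Python. The sketch below works on a shortened copy of the example output; `iter_tokens`, `to_pairs` and `split_delimited` are hypothetical helpers, not part of ctextcore. Because the nesting for multi-sentence input is not shown here, the walk looks for any 'tokens' list at any depth rather than relying on the exact 'doc' / 'p' / 'sent' path.

```python
# Shortened copy of the JSON example above (first three tokens only).
sample_json = [{'doc': {'p': {'lid': 'NONE', 'sent': {'tokens': [
    {'start_char': 0, 'pos': 'PA', 'end_char': 7, 'id': 1, 'text': 'Hierdie'},
    {'start_char': 8, 'pos': 'VTHOK', 'end_char': 10, 'id': 2, 'text': 'is'},
    {'start_char': 11, 'pos': 'LO', 'end_char': 13, 'id': 3, 'text': "'n"},
]}}}}]


def iter_tokens(node):
    """Yield every token dict found under any 'tokens' key, at any depth."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "tokens" and isinstance(value, list):
                yield from value
            else:
                yield from iter_tokens(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_tokens(item)


def to_pairs(json_output, label="pos"):
    """Return (text, label) pairs from a JSON-format result."""
    return [(tok["text"], tok[label]) for tok in iter_tokens(json_output)]


def split_delimited(items, delimiter="_"):
    """Split 'token<delimiter>tag' strings on the last delimiter only."""
    pairs = []
    for item in items:
        token, _, tag = item.rpartition(delimiter)
        pairs.append((token, tag))
    return pairs


print(to_pairs(sample_json))
print(split_delimited(['Hierdie|PA', "'n|LO", '.|ZE'], delimiter="|"))
```

Splitting on the last delimiter keeps tokens intact when they contain the delimiter themselves, which matters with the default `_`; it assumes the tags never contain it.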

Testing

The ctextcore package uses pytest as its testing framework. pytest version 8.0.0 or above is required to run the package's unit tests.

Running all the unit tests of the ctextcore package

python -m pytest --pyargs ctextcore.tests

Running individual unit tests of the ctextcore package

The ctextcore package contains the following unit tests:

  • lid
  • ner
  • ocr
  • pc
  • pos
  • sent
  • tok

Running an individual unit test

python -m pytest --pyargs ctextcore.tests.test_name

Example

python -m pytest --pyargs ctextcore.tests.test_lid
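The same commands can be assembled from Python, for example to run a chosen subset from a script. This is a sketch only: `pytest_command` is a hypothetical helper, and it assumes every test in the list above follows the `test_<name>` module pattern, which the examples confirm only for `test_lid`.

```python
import sys

# Test names as listed above.
TEST_NAMES = ("lid", "ner", "ocr", "pc", "pos", "sent", "tok")


def pytest_command(name=None):
    """Build the pytest invocation for one test module, or for all of them."""
    if name is None:
        target = "ctextcore.tests"
    elif name in TEST_NAMES:
        # Assumed naming pattern: ctextcore.tests.test_<name>
        target = f"ctextcore.tests.test_{name}"
    else:
        raise ValueError(f"Unknown test name: {name!r}")
    return [sys.executable, "-m", "pytest", "--pyargs", target]


print(pytest_command("lid")[1:])

# With ctextcore and pytest installed, the command can be executed with:
# import subprocess
# subprocess.run(pytest_command("lid"), check=False)
```

Using `sys.executable` runs pytest with the same interpreter that has ctextcore installed, which avoids picking up a different Python on the PATH.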

License

Licensed under the Apache License, Version 2.0. See LICENSE.txt for more information.

Contact

Centre for Text Technology (CTexT) - ctextdev@gmail.com - https://humanities.nwu.ac.za/ctext
