An open-source Python package for existing NCHLT core technologies for eleven South African languages.
About The Project
This project is an open-source Python package for existing NCHLT core technologies for eleven South African languages (Afrikaans, English, isiNdebele, isiXhosa, isiZulu, Sesotho sa Leboa, Sesotho, Setswana, Siswati, Tshivenḓa, Xitsonga). The technologies are Tokenisers, Sentence Separators, Part of Speech Taggers, Named Entity Recognisers, Phrase Chunkers, Optical Character Recognisers, and a Language Identifier, totalling 19 technologies.
Getting Started
To get a local copy up and running, follow these steps.
Prerequisites
- Python 3.8+ (https://www.python.org/downloads/)
- Java OpenJDK 11+ (https://openjdk.org)
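Before installing, it can help to confirm that both prerequisites are present. The following is a minimal sketch and not part of ctextcore: the helper names parse_java_major and check_prerequisites are hypothetical, and the sketch assumes the java executable is on the PATH and reports its version in the usual `version "11.0.2"` form.

```python
import re
import shutil
import subprocess
import sys


def parse_java_major(version_output):
    """Extract the major Java version from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Java 8 and earlier report themselves as 1.x (for example "1.8.0_292").
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def check_prerequisites():
    """Return a dict describing whether Python 3.8+ and Java 11+ were found."""
    result = {"python_ok": sys.version_info >= (3, 8), "java_ok": False}
    java = shutil.which("java")
    if java is not None:
        # `java -version` usually writes its report to stderr, so read both streams.
        proc = subprocess.run([java, "-version"], capture_output=True, text=True)
        major = parse_java_major(proc.stderr + proc.stdout)
        result["java_ok"] = major is not None and major >= 11
    return result


if __name__ == "__main__":
    print(check_prerequisites())
```

If java_ok is False, either Java is missing from the PATH or the detected version is older than 11.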
Installation
pip
pip install ctextcore
GitHub
# Download the source code from GitHub
git clone https://github.com/ctextdev/ctextcore.git
# Install from source
cd ctextcore
py -m pip install .
# Install from source in Development Mode
cd ctextcore
py -m pip install -e .
Usage
Importing the CTexT Core library
from ctextcore.core import CCore as core
server = core()
The core constructor accepts the following configuration arguments:
port: 8079 # Set the port the server should use
timeout: 60000 # Set the timeout of HTTP requests
threads: 5 # Set the total number of threads to use
memory: "4G" # Set the maximum memory allowed to be used by the server
be_quiet: False # Set to True to suppress logging output from the server
max_char_length: 10000 # Set the maximum character length
server = core(port=8081, memory="16G")  # any of the other arguments can be passed the same way
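When several of these arguments are overridden, it may be convenient to keep them in one place and check them before starting the server. The sketch below is one possible way to do that; the ServerSettings class and its as_kwargs method are hypothetical helpers, not part of ctextcore, and the range checks are assumptions about what counts as a sensible value.

```python
from dataclasses import asdict, dataclass


@dataclass
class ServerSettings:
    # Values copied from the argument list above.
    port: int = 8079
    timeout: int = 60000
    threads: int = 5
    memory: str = "4G"
    be_quiet: bool = False
    max_char_length: int = 10000

    def as_kwargs(self):
        """Return the settings as keyword arguments after basic sanity checks."""
        if not 1 <= self.port <= 65535:
            raise ValueError(f"port out of range: {self.port}")
        for name in ("timeout", "threads", "max_char_length"):
            if getattr(self, name) <= 0:
                raise ValueError(f"{name} must be positive")
        return asdict(self)


settings = ServerSettings(port=8081, memory="16G")
print(settings.as_kwargs())
# With ctextcore installed, the server could then be started with:
# server = core(**settings.as_kwargs())
```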
Downloading models
Download all language models for a specific technology
# This call will download all the language models for POS.
server.download_model(tech='pos', language='all')
Download all technologies for a specific language
# This call will download all the technology models for isiZulu.
server.download_model(tech='all', language='zu')
Download a specific language model for a specific technology
# This call will download the POS technology model for Sesotho sa Leboa.
server.download_model(tech='pos', language='nso')
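To fetch a chosen subset of models rather than everything, the download_model call can be wrapped in a loop. This is a minimal sketch: download_models is a hypothetical helper, and it assumes only the download_model(tech=..., language=...) call shown above.

```python
def download_models(server, techs, languages):
    """Call server.download_model once per (tech, language) pair.

    Returns the list of pairs that were requested, in order.
    """
    requested = []
    for tech in techs:
        for language in languages:
            server.download_model(tech=tech, language=language)
            requested.append((tech, language))
    return requested


# Usage, assuming `server = core()` as shown earlier:
# download_models(server, techs=["pos"], languages=["zu", "nso"])
```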
Using a model
# This call will run the isiZulu POS tagger on the input text 'E uma lungekho usuku olufakiwe, usuku lwakho lokubhalisa luyofakwa nge-othomathikhi kube usuku lokuqala lwenyanga elandelayo ukuze kungadaleki izikweletu.'.
output_process = server.process_text(text_input='E uma lungekho usuku olufakiwe, usuku lwakho lokubhalisa luyofakwa nge-othomathikhi kube usuku lokuqala lwenyanga elandelayo ukuze kungadaleki izikweletu.', language='zu', tech='pos')
print(output_process)
from pathlib import Path # Path needs to be imported to be able to use OCR
# This call will run the Sesotho sa Leboa OCR on the image or pdf path provided in the text_input argument.
output_process = server.process_text(text_input=Path('<path-to-image-or-pdf>'), language='nso', tech='ocr')
print(output_process)
# This call will run LID on the input text 'Sizoqhubeka ukwenza ngcono ukusebenza kukagesi wethu kanye nokuthembela kugesi ophinde uvuseleleke.' and the confidence level should be above 50%.
output_process = server.process_text(text_input='Sizoqhubeka ukwenza ngcono ukusebenza kukagesi wethu kanye nokuthembela kugesi ophinde uvuseleleke.', tech='lid', confidence=0.5)
print(output_process)
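The calls above each process a single input. For several inputs, one option is a small wrapper that keeps each input next to its result. This is a sketch only: process_texts is a hypothetical helper, and it assumes nothing beyond the process_text keyword arguments used in the examples above.

```python
def process_texts(server, texts, language, tech, **options):
    """Run server.process_text on each input and pair it with its output."""
    results = []
    for text in texts:
        output = server.process_text(
            text_input=text, language=language, tech=tech, **options
        )
        results.append((text, output))
    return results


# Usage, assuming `server = core()`, a downloaded isiZulu POS model,
# and a list of strings named `sentences`:
# for text, output in process_texts(server, sentences, language="zu", tech="pos"):
#     print(output)
```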
Output formats
The ctextcore package offers three output formats: JSON, Delimited, and Array. The default is JSON; to change it, pass the output_format argument to the process_text method. An additional argument, delimiter, can be used with the delimited output format to change the delimiter used in the output. The default delimiter is _.
# This call will run the Afrikaans POS tagger on the input text "Hierdie is 'n voorbeeldsin om die funksionaliteit te toets." and will return a delimited output.
output_process = server.process_text(text_input='Hierdie is \'n voorbeeldsin om die funksionaliteit te toets.', language='af', tech='pos', output_format="delimited", delimiter="|")
print(output_process)
Output examples:
# JSON
[{'doc': {'p': {'lid': 'NONE', 'sent': {'tokens': [{'start_char': 0, 'pos': 'PA', 'end_char': 7, 'id': 1, 'text': 'Hierdie'}, {'start_char': 8, 'pos': 'VTHOK', 'end_char': 10, 'id': 2, 'text': 'is'}, {'start_char': 11, 'pos': 'LO', 'end_char': 13, 'id': 3, 'text': "'n"}, {'start_char': 14, 'pos': 'NSE', 'end_char': 26, 'id': 4, 'text': 'voorbeeldsin'}, {'start_char': 27, 'pos': 'SVS', 'end_char': 29, 'id': 5, 'text': 'om'}, {'start_char': 30, 'pos': 'LB', 'end_char': 33, 'id': 6, 'text': 'die'}, {'start_char': 34, 'pos': 'NSE', 'end_char': 49, 'id': 7, 'text': 'funksionaliteit'}, {'start_char': 50, 'pos': 'UPI', 'end_char': 52, 'id': 8, 'text': 'te'}, {'start_char': 53, 'pos': 'VTHSG', 'end_char': 58, 'id': 9, 'text': 'toets'}, {'start_char': 58, 'pos': 'ZE', 'end_char': 59, 'id': 10, 'text': '.'}]}}}}]
# Array
[('Hierdie', 'PA'), ('is', 'VTHOK'), ("'n", 'LO'), ('voorbeeldsin', 'NSE'), ('om', 'SVS'), ('die', 'LB'), ('funksionaliteit', 'NSE'), ('te', 'UPI'), ('toets', 'VTHSG'), ('.', 'ZE')]
# Delimited
['Hierdie|PA', 'is|VTHOK', "'n|LO", 'voorbeeldsin|NSE', 'om|SVS', 'die|LB', 'funksionaliteit|NSE', 'te|UPI', 'toets|VTHSG', '.|ZE']
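The three examples above carry the same token and tag information in different shapes. The sketch below shows how the JSON-style output could be flattened into the other two shapes by hand. The function names are hypothetical, and the code assumes the single-paragraph, single-sentence structure shown in the JSON example; longer inputs may be nested differently.

```python
def tokens_to_pairs(json_output, tag_key="pos"):
    """Flatten the JSON-style output shown above into (text, tag) tuples."""
    pairs = []
    for entry in json_output:
        # Only the single-paragraph, single-sentence shape from the example is handled.
        tokens = entry["doc"]["p"]["sent"]["tokens"]
        pairs.extend((token["text"], token[tag_key]) for token in tokens)
    return pairs


def pairs_to_delimited(pairs, delimiter="_"):
    """Join each (text, tag) tuple into a single 'text<delimiter>tag' string."""
    return [f"{text}{delimiter}{tag}" for text, tag in pairs]


# First three tokens of the JSON example above.
sample = [{"doc": {"p": {"lid": "NONE", "sent": {"tokens": [
    {"start_char": 0, "pos": "PA", "end_char": 7, "id": 1, "text": "Hierdie"},
    {"start_char": 8, "pos": "VTHOK", "end_char": 10, "id": 2, "text": "is"},
    {"start_char": 11, "pos": "LO", "end_char": 13, "id": 3, "text": "'n"},
]}}}}]

pairs = tokens_to_pairs(sample)
print(pairs)  # [('Hierdie', 'PA'), ('is', 'VTHOK'), ("'n", 'LO')]
print(pairs_to_delimited(pairs, delimiter="|"))  # ['Hierdie|PA', 'is|VTHOK', "'n|LO"]
```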
Testing
The ctextcore package uses pytest (version 8.0.0 or above) as its testing framework. pytest must be installed before the package's unit tests can be run.
Running all the unit tests of the ctextcore package
python -m pytest --pyargs ctextcore.tests
Running individual unit tests of the ctextcore package
The ctextcore package contains the following unit tests:
- lid
- ner
- ocr
- pc
- pos
- sent
- tok
Running an individual unit test (replace test_name with test_ followed by one of the names listed above)
python -m pytest --pyargs ctextcore.tests.test_name
Example
python -m pytest --pyargs ctextcore.tests.test_lid
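To run a chosen subset of tests one after another, the command lines can be generated from the list of test names. This is a sketch under one assumption: that every name in the list maps to a module called ctextcore.tests.test_<name>, as the test_lid example suggests. The pytest_command helper is hypothetical.

```python
import sys

TEST_NAMES = ["lid", "ner", "ocr", "pc", "pos", "sent", "tok"]


def pytest_command(name):
    """Build the pytest command line for one named unit test."""
    if name not in TEST_NAMES:
        raise ValueError(f"unknown test name: {name}")
    return [sys.executable, "-m", "pytest", "--pyargs", f"ctextcore.tests.test_{name}"]


for name in TEST_NAMES:
    print(" ".join(pytest_command(name)))
# Each list can be passed to subprocess.run(...) to execute that test.
```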
License
Licensed under the Apache License, Version 2.0. See LICENSE.txt for more information.
Contact
Centre for Text Technology (CTexT) - ctextdev@gmail.com - https://humanities.nwu.ac.za/ctext