A minimal package that detect Cantonese sentences in Traditional Chinese text.
Project description
CantoneseDetect 粵語特徵分類器
本項目為 canto-filter 之後續。canto-filter 得 4 個分類標籤且判斷邏輯更加快速簡單,適合在線快速篩選判別文本或者其他要求低延遲、速度快嘅應用場合。本項目採用更精細嘅判斷邏輯,有 6 個分類標籤,準確度更高,但速度亦會相對 canto-filter 更慢。
This is an extension of the canto-filter project. canto-filter has only 4 output labels. It has a simipler classification logic and is faster, more suitable for use cases which require low-latency or high classification speed. This package has 6 output and uses a more sophisticated classification logic for more fine-grained classification. It has higher classification accuracy but slower performance.
引用 Citation
分類器採用嘅分類標籤及基準,參考咗對使用者嘅語言意識形態嘅研究。討論分類準則時,請引用:
The definitions and boundaries of the labels depend on the user's language ideology. When discussing the criteria adopted by this tool, please cite:
Chaak-ming Lau, Mingfei Lau, and Ann Wai Huen To. 2024. The Extraction and Fine-grained Classification of Written Cantonese Materials through Linguistic Feature Detection. In Proceedings of the 2nd Workshop on Resources and Technologies for Indigenous, Endangered and Lesser-resourced Languages in Eurasia (EURALI) @ LREC-COLING 2024, pages 24–29, Torino, Italia. ELRA and ICCL.
簡介 Introduction
分類方法係利用粵語同書面中文嘅特徵字詞,用 Regex 方式加以識別。分類器主要有兩個主要參數,--split
同埋--quotes
,兩個默認都係False
。
The filter is based on Regex rules and detects lexical features specific to Cantonese or Written-Chiense.
分句參數--split
呢個參數默認關閉,如果打開,分類器會用句號、問號、感歎號等標點符號將輸入文本切成單句,對每個單句分類判斷,然後再按照下面判別標準整合嚟得到最終分類。所以呢個參數喺輸入都係單句嘅情況下唔會有區別,只會降低運行速度。喺官粵混雜比較多而且比較長嘅文本輸入下會有更多唔同。
目前因為整合分句判斷嘅邏輯比較嚴,所以如果打開,會相比於關閉更加容易將其他類別判斷為mixed
。所以對於篩選純粵文嘅用途嚟講,打開呢個參數會提高 precision 但降低 recall。
分類標籤參數--quotes
呢個參數默認關閉,分類器淨係會將輸入分為 4 類。如果打開,就會再增加兩類總共有 6 個標籤。打開後分類器會將引號內嘅文本抽出嚟,將佢哋同引號外文本分開判斷。下面一段就係介紹呢四個同六個標籤。
標籤 Labels
分類器會將輸入文本分成四類(粗疏)或六類(精細),分類如下:
The classifiers output four (coarse) or six (fine-grained) categories. The labels are:
Cantonese
: 純粵文,僅含有粵語特徵字詞,例如“你喺邊度” | Pure Cantonese text, contains Cantonese-featured words. E.g. 你喺邊度SWC
: 書面中文,係一個僅含有書面語特徵字詞,例如“你在哪裏” | Pure Standard Written Chinese (SWC) text, contains Mandarin-feature words. E.g. 你在哪裏Mixed
:書粵混雜文,同時含有書面語同粵語特徵嘅字詞,例如“是咁的” | Mixed Cantonese-Mandarin text, contains both Cantonese and Mandarin-featured words. E.g. 是咁的Neutral
:無特徵中文,唔含有官話同粵語特徵,既可以當成粵文亦可以當成官話文,例如“去學校讀書” | No feature Chinese text, contains neither Cantonese nor Mandarin feature words. Such sentences can be used for both Cantonese and Mandarin text corpus. E.g. 去學校讀書MixedQuotesInSWC
: 書面中文,引文入面係Mixed
|Mixed
contents quoted within SWC textCantoneseQuotesInSWC
: 書面中文,引文入面係純粵文cantonese
|Cantonese
contents quoted within SWC text
系統要求 Requirement
Python >= 3.11
安裝 Installation
pip install cantonesedetect
用法 Usage
可以通過 Python 函數嚟引用,亦可以直接 CLI 調用。
You can call the Python API or this library, or run it directly in CLI.
Python
用下面嘅方法創建一個 Detector
,然後直接調用 judge()
就可以得到分類結果:
Initialize a Detector
and call the judge()
function on inputs, and you will get the classification outputs.
from cantonesedetect import CantoneseDetector
# 默認情況下 use_quotes=False, split_seg=False, get_analysis=False
detector = CantoneseDetector()
detector.judge('你喺邊度') # cantonese
detector.judge('你在哪裏') # swc
detector.judge('是咁的') # mixed
detector.judge('去學校讀書') # neutral
detector.judge('他説:“係噉嘅。”') # cantonese_quotes_in_swc
detector.judge('那就「是咁的」') # mixed_quotes_in_swc
如果想要用引號抽取判別、分句判別同埋獲得分析結果,可以:
If you want to judge inputs based on matrix-quote-splitting, or spliting into segments, you can:
from cantonesedetect import Detector
detector = Detector(use_quotes=True, split_seg=True, get_analysis=True)
judgement, document_features = detector.judge("他説:「我哋今晚食飯。你想去邊度食?」")
# 打印分析結果
# Print analysis results
print(document_features.get_analysis())
# `document_features` 入面有每個分句嘅 `document_segments_features` 同 `document_segments_judgements`
# `document_features` object contains `document_segments_features` which is a list of segment features
print(document_features.document_segments_features[0].canto_feature)
print(document_features.document_segments_features[0].canto_exclude)
print(document_features.document_segments_features[0].swc_feature)
print(document_features.document_segments_features[0].swc_exclude)
# Also contains `document_segments_judgements` which is a list of judgements of the segments
print([j.value for j in document_features.document_segments_judgements])
CLI
如果直接喺 CLI 調用嘅話,只需要指明--input
就得。 --quotes
、--split
、--print_analysis
三個參數都默認關閉,如果標明就會打開:
If you run directly in CLI, simply specify the --input
. The optional arguments --quotes
、--split
、--print_analysis
are all False
by default, and you can turn them on by specifying them.
cantonesedetect --input input.txt
# 開啓引號抽取判別、分句判別並且打印分析結果
# Enable matrix-quotes-splitting, segment-splitting and printing the analysis.
cantonesedetect --input input.txt --quotes --split --print_analysis
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file cantonesedetect-1.1.1.tar.gz
.
File metadata
- Download URL: cantonesedetect-1.1.1.tar.gz
- Upload date:
- Size: 16.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/5.1.0 CPython/3.12.4
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 91e2d69f5a5bbcd59a4c42a7e3bfbb66ca9fddd4d86e1924a38272051422b3ab |
|
MD5 | 22ebcf6cb3c5ef6cbfde5fce5caeb416 |
|
BLAKE2b-256 | d1071f5d321b9448016e283ad48d68b5456562e2dbac2e7c2c342c155e1b5cec |
File details
Details for the file cantonesedetect-1.1.1-py3-none-any.whl
.
File metadata
- Download URL: cantonesedetect-1.1.1-py3-none-any.whl
- Upload date:
- Size: 13.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/5.1.0 CPython/3.12.4
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | bd55787c2bf58bd5777133b66de30f19a19eb4b7afc88b8da893839999a10af8 |
|
MD5 | 4625fdbb2d91a42bca4a7b1d95ed5c9a |
|
BLAKE2b-256 | d3e9e300ce8bbcbeaf552893bdcfc1c3b8f72b3cf9a438a725386c80d9b0911e |