粵文分類篩選器 Cantonese text filter
粵文分類篩選器
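Introduction
This is a Cantonese text filter that distinguishes Cantonese text from Mandarin text, which is useful for filtering Cantonese corpora. The classifier assigns each input text to one of four classes:
cantonese: Pure Cantonese text that contains Cantonese feature words and no Mandarin ones, e.g. “你喺邊度”
mandarin: Pure Mandarin text that contains Mandarin feature words and no Cantonese ones, e.g. “你在哪裏”
mixed: Mixed Mandarin-Cantonese text that contains both Mandarin and Cantonese feature words, e.g. “是咁的”
neutral: Featureless Chinese text that contains neither Mandarin nor Cantonese features and can be treated as either Cantonese or Mandarin, e.g. “去學校讀書”
Classification works by identifying Mandarin and Cantonese feature words. A text that contains both Mandarin and Cantonese feature words is labelled mixed, and a text that contains no feature words at all is labelled neutral.
The main design goal of this filter is to "select high-quality Cantonese text that can be used as training data", not to "classify input text accurately". The Cantonese/Mandarin decision therefore uses fairly strict criteria, trading recall for high precision: the filter would rather miss some Cantonese sentences than misjudge Mandarin text as Cantonese.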
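The precision-versus-recall trade-off described above can be made concrete with a little arithmetic. The sketch below is illustrative only: the function name and all counts are made up for this example and are not measurements of this filter.

```python
def precision_recall(true_positive, false_positive, false_negative):
    """Compute (precision, recall) for the 'cantonese' label from raw counts."""
    precision = true_positive / (true_positive + false_positive)
    recall = true_positive / (true_positive + false_negative)
    return precision, recall


# Hypothetical counts, NOT real results for this package.
# A strict rule set keeps fewer sentences but almost never keeps Mandarin.
strict = precision_recall(true_positive=60, false_positive=1, false_negative=40)
# A lenient rule set keeps more Cantonese but lets more Mandarin through.
lenient = precision_recall(true_positive=90, false_positive=15, false_negative=10)

print("strict : precision=%.3f recall=%.3f" % strict)
print("lenient: precision=%.3f recall=%.3f" % lenient)
```

For building training data, the strict profile is the preferable one: a missed Cantonese sentence only shrinks the corpus, whereas a Mandarin sentence mislabelled as Cantonese contaminates it.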
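Note: This classifier assumes that all input text is written in Traditional Chinese characters. To classify text written in Simplified characters, convert it to Traditional characters first. OpenCC is recommended for the conversion.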
Usage
First, install the package with pip:
pip install canto-filter
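You can use it from Python code or directly from the command line.
Python function usage
The filter exposes a single function, judge(), which takes a sentence and returns its language label: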
from cantofilter import judge
print(judge('你喺邊度')) # cantonese
print(judge('你在哪裏')) # mandarin
print(judge('是咁的')) # mixed
print(judge('去學校讀書')) # neutral
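CLI usage
First prepare an input file, e.g. input.txt, with one sentence per line.
Output labels and original text
Then run the following command:
cantofilter --input input.txt > output.txt
This produces an output.txt with two columns separated by \t: the first column is the predicted label and the second is the original sentence.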
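Output only one class
To select only one class of text directly, append a --type <LABEL> argument, for example:
cantofilter --input input.txt --type cantonese > output.txt
The resulting output.txt then contains only the pure Cantonese sentences. To keep only Mandarin, mixed, or neutral text instead, set --type to mandarin, mixed, or neutral.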
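Output labels only
You can also output only the classification result of each sentence by using --type label:
cantofilter --input input.txt --type label > output.txt
The resulting output.txt has a single column that contains only the classification labels.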
Requirements
Python >= 3.6
Cantonese text filter
This is a text filter for Cantonese, designed for filtering Cantonese text corpora. It classifies each input sentence with one of four output labels:
cantonese: Pure Cantonese text that contains Cantonese feature words only. E.g. 你喺邊度
mandarin: Pure Mandarin text that contains Mandarin feature words only. E.g. 你在哪裏
mixed: Mixed Cantonese-Mandarin text that contains both Cantonese and Mandarin feature words. E.g. 是咁的
neutral: Featureless Chinese text that contains neither Cantonese nor Mandarin feature words. Such sentences can be used in both Cantonese and Mandarin corpora. E.g. 去學校讀書
The filter is rule-based: it uses regular expressions to detect Mandarin and Cantonese feature characters and words. A sentence that contains both Cantonese and Mandarin feature words is labelled mixed, and a sentence that contains neither is labelled neutral.
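The four-way decision described above can be sketched in a few lines. This is a minimal illustration, not the package's implementation: the function name classify and the two tiny feature lists are assumptions chosen only so that the README's four example sentences come out as documented. The real rule set is presumably much larger and more careful.

```python
import re

# Hypothetical, deliberately tiny feature lists for illustration only.
CANTONESE_FEATURES = re.compile(r"喺|咁|嘅|唔|冇|邊度")
MANDARIN_FEATURES = re.compile(r"哪裏|這|那|的|是")


def classify(sentence):
    """Label a sentence by which kinds of feature words it contains."""
    has_cantonese = bool(CANTONESE_FEATURES.search(sentence))
    has_mandarin = bool(MANDARIN_FEATURES.search(sentence))
    if has_cantonese and has_mandarin:
        return "mixed"
    if has_cantonese:
        return "cantonese"
    if has_mandarin:
        return "mandarin"
    return "neutral"


for example in ["你喺邊度", "你在哪裏", "是咁的", "去學校讀書"]:
    print(classify(example), example, sep="\t")
```

The structure matters more than the word lists here: two independent feature detectors, combined into four labels by checking which of them fired.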
Note: This filter assumes that all input text is written in Traditional Chinese characters. If you want to filter text written in Simplified characters, please convert it into Traditional characters first. We recommend using OpenCC to do the conversion.
How to use
First, install the package with pip:
pip install canto-filter
This package can be used in Python code or as a CLI tool.
Python function usage
There is only one function in this package, judge()
, which accepts a string input and outputs one of the labels:
from cantofilter import judge
print(judge('你喺邊度')) # cantonese
print(judge('你在哪裏')) # mandarin
print(judge('是咁的')) # mixed
print(judge('去學校讀書')) # neutral
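Since judge() works on one sentence at a time, filtering a whole corpus in Python is a matter of applying it in a loop. The helper below is a hypothetical convenience wrapper, not part of the package. It takes the judging function as a parameter so that the sketch runs even without the package installed.

```python
from typing import Callable, Iterable, List


def keep_label(sentences: Iterable[str], label: str,
               judge: Callable[[str], str]) -> List[str]:
    """Return the sentences that `judge` assigns to `label`, in input order."""
    return [sentence for sentence in sentences if judge(sentence) == label]


# Intended use, assuming the package is installed:
#   from cantofilter import judge
#   cantonese_only = keep_label(corpus, "cantonese", judge)
```

Passing judge in explicitly also makes it easy to swap in a stub when testing the surrounding pipeline.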
CLI usage
Prepare an input text file, e.g. input.txt, in which each line is one sentence.
Output both labels and original texts
Then run
cantofilter --input input.txt > output.txt
This produces an output.txt with two tab-separated columns. The first column is the language label, and the second column is the original input text.
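The two-column output is easy to post-process. The following sketch groups the sentences by label. It assumes the columns are separated by a tab character, as the first half of this README states, and group_by_label is a hypothetical helper name.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_label(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group 'label<TAB>sentence' lines into {label: [sentences]}."""
    groups = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the first tab only, in case the sentence itself contains tabs.
        label, _, text = line.partition("\t")
        groups[label].append(text)
    return dict(groups)


# Intended use on the CLI output:
#   with open("output.txt", encoding="utf-8") as f:
#       groups = group_by_label(f)
```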
Output only text of one class
If you want only one type of text, use the --type <LABEL> argument. For example, to keep pure Cantonese text only:
cantofilter --input input.txt --type cantonese > output.txt
The output.txt will contain only Cantonese text. To keep Mandarin, mixed, or neutral text instead, set --type to mandarin, mixed, or neutral.
Output label only
If you want only the classification labels, use --type label like this:
cantofilter --input input.txt --type label > output.txt
Then your output.txt will contain only the classification results of the input sentences, one label per line.
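A label-only file is handy for checking the composition of a corpus at a glance. The sketch below tallies the labels with collections.Counter. It assumes one label per line, and label_counts is a hypothetical helper name.

```python
from collections import Counter
from typing import Iterable


def label_counts(lines: Iterable[str]) -> Counter:
    """Count how often each label occurs, ignoring blank lines."""
    return Counter(line.strip() for line in lines if line.strip())


# Intended use on the CLI output:
#   with open("output.txt", encoding="utf-8") as f:
#       for label, count in label_counts(f).most_common():
#           print(label, count)
```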
Requirements
Python >= 3.6
Hashes for canto_filter-1.0.2-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 33011e20f40d801fb53b89fa517622629dabb5ac4c5a6abcc87e730095dc8e4c
MD5 | ce90a824b94ea013ff844537a6768e9a
BLAKE2b-256 | 4ccc96f769341cecf2c56cf32b411a8de672be2c5446ca6a276f9bea28e71479