thai backchannel classifier - detect backchannels vs real responses in thai asr output
distinguishes thai backchannel responses (fillers like ครับ, ค่ะ, อืม) from real user input in voice ai systems.
why
thai voice bots using asr → llm → tts pipelines need to distinguish between backchannels (acknowledgment sounds that should be ignored) and real responses that need processing. simple exact matching fails on asr variants and misses edge cases.
approach
gradient boosting classifier with 23 handcrafted thai-specific features. the table shows four of them with their importances:
| feature | importance |
|---|---|
| remaining_ratio | 0.9098 |
| has_request | 0.0406 |
| has_negation | 0.0274 |
| particle_ratio | 0.0108 |
key idea: strip known backchannel components from the text, measure what's left (remaining_ratio). if nothing remains, it's a backchannel.
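a minimal sketch of the stripping idea, not the project's actual code. the component list and the function name `remaining_ratio` below are assumptions for illustration; the real list lives in the package and is likely longer. plain substring removal can also hit components inside longer words, which a real implementation would need to guard against.

```python
# Hypothetical component list for illustration only; the package's own list
# is not shown in this README and is probably larger.
BACKCHANNEL_COMPONENTS = ["ครับ", "ค่ะ", "จ้ะ", "อืม", "อ๋อ", "เออ", "ใช่", "ได้"]


def remaining_ratio(text: str) -> float:
    """Return the fraction of non-space characters left after stripping
    every known backchannel component from the text."""
    compact = "".join(text.split())  # drop all whitespace
    if not compact:
        return 0.0
    residual = compact
    # Strip longer components first so a short one cannot break up a longer one.
    for component in sorted(BACKCHANNEL_COMPONENTS, key=len, reverse=True):
        residual = residual.replace(component, "")
    return len(residual) / len(compact)


print(remaining_ratio("ได้ครับ"))            # nothing is left after stripping
print(remaining_ratio("ราคาเท่าไหร่ครับ"))   # most of the text survives
```

a ratio of zero means the utterance consisted only of known components; anything well above zero suggests real content.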
features
- remaining_ratio: strips known backchannel components, measures residual text
- polite particle detection (ครับ/ค่ะ/จ้ะ variants)
- backchannel sound patterns (อืม/อ๋อ/เออ with tone variants)
- question/negation/request/continuation markers
- handles asr misspellings (ค่า→ค่ะ, คับ→ครับ, อื้ม→อืม)
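the misspelling handling and marker features could look roughly like the sketch below. the variant map uses the three examples above; the marker lists and the names `normalize` and `marker_features` are assumptions drawn from the test cases in this readme, not the package's real feature code.

```python
# ASR variant -> canonical spelling, taken from the examples in the list above.
ASR_VARIANTS = {"ค่า": "ค่ะ", "คับ": "ครับ", "อื้ม": "อืม"}

# Hypothetical marker lists, guessed from the test cases in this README.
CONTINUATION_MARKERS = ("แต่", "แล้ว")
NEGATION_MARKERS = ("ไม่", "ยัง")


def normalize(text: str) -> str:
    """Rewrite a known ASR variant at the end of each whitespace-separated
    token to its canonical spelling."""
    tokens = []
    for token in text.split():
        for variant, canonical in ASR_VARIANTS.items():
            if token.endswith(variant):
                token = token[: -len(variant)] + canonical
                break
        tokens.append(token)
    return " ".join(tokens)


def marker_features(text: str) -> dict:
    """Return boolean marker features computed on the normalized text."""
    text = normalize(text)
    return {
        "has_continuation": any(m in text for m in CONTINUATION_MARKERS),
        "has_negation": any(m in text for m in NEGATION_MARKERS),
    }


print(normalize("คับ"))
print(marker_features("ใช่ แต่ว่า"))
```

matching only at the end of a token is a design choice in this sketch: it keeps a word that merely starts with a variant from being rewritten.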
results
cross-validation
- 99.49% f1 (5-fold cv, gradient boosting)
- logistic regression baseline: 98.97% f1
full training set
| class | precision | recall | f1-score | support |
|---|---|---|---|---|
| real_response | 1.00 | 1.00 | 1.00 | 96 |
| backchannel | 1.00 | 1.00 | 1.00 | 194 |
| accuracy | | | 1.00 | 290 |
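for reference, the per-class numbers in a report like this follow from the confusion counts. the helper `precision_recall_f1` below is a hypothetical name, and the counts assume no misclassifications on the backchannel class, which is what the rounded 1.00 values suggest but do not prove.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall and F1 from true positives, false positives
    and false negatives, returning 0.0 where a ratio is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Backchannel row, assuming all 194 examples were classified correctly.
print(precision_recall_f1(tp=194, fp=0, fn=0))  # (1.0, 1.0, 1.0)

# A single hypothetical miss, to show how sensitive the score is.
print(precision_recall_f1(tp=193, fp=0, fn=1))
```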
test suite: 94/94 (100%)
the test suite (tests/test_classifier.py) covers:
backchannels (49 cases):
- basic polite particles: ครับ, ค่ะ, คับ, คะ, จ้ะ, จ้า
- dai + particle: ได้ครับ, ได้ค่ะ, ได้จ้ะ
- filler sounds: อืม, อือ, อื้อ, เออ, เอ่อ, อ่า
- oh: อ๋อ, อ๋อครับ, อ๋อค่ะ
- agreement: ใช่, จริง, ถูก, แน่นอน
- ok variants: โอเคครับ, โอเคค่ะ
- question-like: เหรอ, หรอ, งั้นเหรอ
- compound: ครับ ฮัลโหล, อ่ะ ใช่ๆๆ, อ่าฮะ ครับ
- asr tone variants: อื้ม, อ๊าา, อ้า, อึม
real responses (45 cases):
- greetings: สวัสดีครับ, ขอบคุณครับ
- negations: ไม่ครับ, ไม่ใช่ค่ะ, ยังครับ
- questions: ราคาเท่าไหร่ครับ, ทำไมครับ, กี่โมงครับ
- requests: ผมต้องการจองตั๋วครับ, ช่วยเช็คให้หน่อยได้ไหมครับ
- tricky edge cases (backchannel + continuation):
  - "ใช่ แต่ว่า" → real (has continuation marker)
  - "ครับ แล้วก็" → real
  - "ได้ครับ แต่ขอเปลี่ยนวัน" → real
  - "อ๋อ แล้วเรื่องที่สอง" → real
  - "อืม แต่ว่าผมไม่แน่ใจ" → real
- short real: ไม่, ยัง, เอา, ได้เลย
usage
from detect import is_backchannel
is_bc, confidence = is_backchannel("ครับ") # (True, 1.0)
is_bc, confidence = is_backchannel("ไม่ครับ") # (False, 0.0)
is_bc, confidence = is_backchannel("ใช่ แต่ว่า") # (False, 0.0)
cli:
python detect.py ครับ
# 'ครับ' -> BACKCHANNEL (confidence: 1.0000)
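in an asr → llm → tts pipeline, the result could gate what reaches the llm. the sketch below is an assumption about how a caller might wire it up: `filter_transcript`, the `threshold` parameter, and the stand-in `fake_is_backchannel` are hypothetical, and it assumes the returned confidence is the model's score for the backchannel class, as the examples above suggest.

```python
from typing import Callable, Optional, Tuple

# Any callable with the same shape as is_backchannel: text -> (is_bc, confidence).
Classifier = Callable[[str], Tuple[bool, float]]


def filter_transcript(
    text: str, classify: Classifier, threshold: float = 0.5
) -> Optional[str]:
    """Return the text to forward to the LLM, or None if the utterance
    should be ignored as a backchannel."""
    is_bc, confidence = classify(text)
    if is_bc and confidence >= threshold:
        return None
    return text


def fake_is_backchannel(text: str) -> Tuple[bool, float]:
    """Stand-in so the sketch runs without the trained model.
    In real use, pass detect.is_backchannel instead."""
    is_bc = text in {"ครับ", "ค่ะ", "อืม"}
    return is_bc, 1.0 if is_bc else 0.0


print(filter_transcript("ครับ", fake_is_backchannel))      # None
print(filter_transcript("ไม่ครับ", fake_is_backchannel))   # ไม่ครับ
```

passing the classifier in as an argument keeps the gating logic testable without loading the model file.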
testing
python -m pytest tests/ -v
or without pytest:
python tests/test_classifier.py
files
- train.py - training script with all data + features
- detect.py - inference module (import or cli)
- backchannel_model.pkl - trained model (~50kb)
- tests/test_classifier.py - comprehensive test suite (94 cases)
requirements
- python 3.8+
- scikit-learn
- numpy