package for generating Chinese disyllabic nonwords
Package description
This package is built on the strokes and pinyin packages. Use it to generate Chinese disyllabic nonwords.
Installation
from chinese_nonwords import ChineseNonwords
The ChineseNonwords constructor takes the following arguments, in this order:
- stroke_min, stroke_max (min: 1; max: 25):
- the minimum and maximum number of strokes in a character
- num_nei_min, num_nei_max (min: 8; max: 307):
- the minimum and maximum number of phonological neighbors a character has
- logfreq_min, logfreq_max (min: 0; max: 6.31):
- the minimum and maximum log frequency of a character
- N:
- the number of disyllabic nonwords to generate (default=10)
- random_state:
- the random state used for sampling (default=42)
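The documented bounds can be checked before the arguments are passed on. The sketch below is only an illustration: `BOUNDS` and `check_ranges` are hypothetical names that are not part of the package, and the package itself may or may not validate its inputs.

```python
# Documented bounds taken from the argument list above.
BOUNDS = {
    "stroke": (1, 25),
    "num_nei": (8, 307),
    "logfreq": (0, 6.31),
}


def check_ranges(**kwargs):
    """Hypothetical helper (not part of chinese_nonwords).

    Raises ValueError if a *_min/*_max pair falls outside the documented
    bounds or if the minimum exceeds the maximum. Returns the keyword
    arguments unchanged so they can be forwarded with ** unpacking.
    """
    for name, (lowest, highest) in BOUNDS.items():
        low = kwargs[f"{name}_min"]
        high = kwargs[f"{name}_max"]
        if not (lowest <= low <= high <= highest):
            raise ValueError(
                f"{name}: expected {lowest} <= {name}_min <= {name}_max "
                f"<= {highest}, got {low} and {high}"
            )
    return kwargs


args = check_ranges(
    stroke_min=2, stroke_max=18,
    num_nei_min=20, num_nei_max=300,
    logfreq_min=4, logfreq_max=6,
)
```

The returned dictionary could then be forwarded as `ChineseNonwords(**args)`, assuming the constructor accepts these names as keyword arguments, as the usage example suggests.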
Usage
cnw = ChineseNonwords(stroke_min=2,
stroke_max=18,
num_nei_min=20,
num_nei_max=300,
logfreq_min=4,
logfreq_max=6,
N=10,
random_state=42)
Once the arguments are specified, run the generate() method to get a tabulated list of nonwords. The pinyin of each nonword was cross-checked against [SUBTLEX-CH][1] to make sure it does not appear in that list of known disyllabic words. Frequency information is taken from [SUBTLEX-CH][1], stroke counts from the strokes package, and the remaining lexical properties from Mandarin-Neighborhood-Statistics.
my_cnw = cnw.generate()
print(my_cnw)
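The cross-check against known words can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the package's actual implementation: `sample_nonwords`, the four-character pool, and the one-entry word list are all made up for illustration, and the real check runs against SUBTLEX-CH.

```python
import random


def sample_nonwords(chars, known_words, n, random_state=42):
    """Hypothetical sketch of the pinyin cross-check.

    chars: dict mapping a character to its pinyin (toy data).
    known_words: set of (pinyin1, pinyin2) tuples for real disyllabic words.
    Draws n character pairs and drops any pair whose pinyin matches a
    known word, so the result can be shorter than n.
    """
    rng = random.Random(random_state)
    pool = sorted(chars)
    result = []
    for _ in range(n):
        first, second = rng.sample(pool, 2)
        if (chars[first], chars[second]) in known_words:
            continue  # sounds like a real word, so it is excluded
        result.append((first, second))
    return result


# Toy data: the digit marks the tone.
chars = {"名": "ming2", "天": "tian1", "求": "qiu2", "查": "cha2"}
# 明天 ("tomorrow") is pronounced ming2 tian1, so 名天 would be a homophone.
known = {("ming2", "tian1")}

pairs = sample_nonwords(chars, known, n=50, random_state=1)
```

Dropping rejected pairs rather than redrawing them is one possible reason a run returns fewer rows than requested.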
Note that the output may contain fewer rows than the requested N, because nonwords that are phonologically similar to real disyllabic words are excluded. To generate another list with the same arguments, change random_state to a different value.
+----+----------+----------+-------------+-------------+------------+------------+--------+--------+-----------------+-----------------+
| | Char-1 | Char-2 | Logfreq-1 | Logfreq-2 | Stroke-1 | Stroke-2 | HD-1 | HD-2 | NumNeighbor-1 | NumNeighbor-2 |
+====+==========+==========+=============+=============+============+============+========+========+=================+=================+
| 0 | 求 | 查 | 4.36 | 4.49 | 7 | 9 | 9 | 6 | 219 | 219 |
+----+----------+----------+-------------+-------------+------------+------------+--------+--------+-----------------+-----------------+
| 1 | 名 | 空 | 4.64 | 4.19 | 6 | 8 | 7 | 1 | 219 | 277 |
+----+----------+----------+-------------+-------------+------------+------------+--------+--------+-----------------+-----------------+
| 2 | 妈 | 何 | 4.9 | 4.58 | 6 | 7 | 3 | 12 | 278 | 219 |
+----+----------+----------+-------------+-------------+------------+------------+--------+--------+-----------------+-----------------+
| 3 | 法 | 比 | 4.78 | 4.71 | 8 | 4 | 2 | 5 | 261 | 262 |
+----+----------+----------+-------------+-------------+------------+------------+--------+--------+-----------------+-----------------+
| 4 | 乐 | 到 | 4.42 | 5.46 | 5 | 8 | 9 | 8 | 304 | 305 |
+----+----------+----------+-------------+-------------+------------+------------+--------+--------+-----------------+-----------------+
| 5 | 表 | 但 | 4.5 | 5.09 | 8 | 7 | 4 | 10 | 260 | 305 |
+----+----------+----------+-------------+-------------+------------+------------+--------+--------+-----------------+-----------------+
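Because a run can return fewer rows than N, one way to reach a target count is to repeat the run with different random_state values and merge the results. The sketch below only illustrates that idea: `fake_generate` is a stand-in that returns plain tuples and is not the package's generate(), and the exclusion rule inside it is arbitrary.

```python
import random


def fake_generate(n, random_state):
    """Stand-in for one generator run (not the real generate()).

    Returns up to n (char1, char2) pairs; some draws are dropped to
    mimic the exclusion of word-like candidates.
    """
    rng = random.Random(random_state)
    pool = ["求", "查", "名", "空", "妈", "何"]
    drawn = [tuple(rng.sample(pool, 2)) for _ in range(n)]
    # Arbitrary rule for the simulation: drop pairs that start with 名.
    return [pair for pair in drawn if pair[0] != "名"]


def collect(target, first_seed=42, max_runs=20):
    """Merge runs with successive seeds until target unique pairs exist."""
    seen = []
    for seed in range(first_seed, first_seed + max_runs):
        for pair in fake_generate(target, seed):
            if pair not in seen:
                seen.append(pair)
            if len(seen) == target:
                return seen
    return seen  # may still be short if max_runs was not enough


pairs = collect(10)
```

Capping the number of runs keeps the loop from spinning forever when the filters are so strict that the target cannot be reached.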
References
[1] SUBTLEX-CH: Cai, Q., & Brysbaert, M. (2010). SUBTLEX-CH: Chinese word and character frequencies based on film subtitles. PLoS ONE, 5(6), e10729.