package for generating Chinese disyllabic nonwords

Package description

This package generates Chinese disyllabic nonwords. It is developed based on strokes and pinyin.

Installation

pip install chinese_nonwords

After installing, import the class:

from chinese_nonwords import ChineseNonwords

The generate_nonwords() and generate_words() functions of ChineseNonwords take the following arguments, in the following order:

  • stroke_min, stroke_max (min: 1; max: 25):
    • the minimum and maximum number of strokes a character has
  • num_nei_min, num_nei_max (min: 8; max: 307):
    • the minimum and maximum number of phonological neighbors a character has
  • logfreq_min, logfreq_max (min: 0; max: 6.31):
    • the minimum and maximum log frequency of a character
  • N:
    • the number of disyllabic words to be generated (default=10)
  • random_state:
    • random state for sampling (default=42)
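Since each pair of arguments has a documented range, it can help to check values before calling the generator. The following is a minimal sketch of such a check; check_ranges and BOUNDS are hypothetical names introduced here for illustration and are not part of the package, and whether the package performs a similar check itself is not stated.

```python
# Documented bounds from the argument list above: prefix -> (lowest, highest).
BOUNDS = {
    "stroke": (1, 25),
    "num_nei": (8, 307),
    "logfreq": (0, 6.31),
}


def check_ranges(**kwargs):
    """Return kwargs unchanged, or raise ValueError if a range is invalid.

    Hypothetical helper: it only checks the *_min/*_max pairs against the
    bounds documented above and does not call the package.
    """
    for name, (low, high) in BOUNDS.items():
        lower = kwargs[f"{name}_min"]
        upper = kwargs[f"{name}_max"]
        # Each pair must be ordered and lie inside the documented bounds.
        if not (low <= lower <= upper <= high):
            raise ValueError(
                f"{name}: expected {low} <= {name}_min <= {name}_max <= {high}, "
                f"got {lower} and {upper}"
            )
    return kwargs


args = check_ranges(
    stroke_min=2, stroke_max=18,
    num_nei_min=20, num_nei_max=300,
    logfreq_min=4, logfreq_max=6,
)
```

The checked dictionary could then be passed on together with N and random_state, for example as ChineseNonwords.generate_nonwords(**args, N=10, random_state=42).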

Usage

Generate disyllabic nonwords

from chinese_nonwords import ChineseNonwords
cnw = ChineseNonwords.generate_nonwords(stroke_min=2, 
                                        stroke_max=18, 
                                        num_nei_min=20, 
                                        num_nei_max=300, 
                                        logfreq_min=4, 
                                        logfreq_max=6, 
                                        N=10, 
                                        random_state=42)

Once the arguments are specified, run the generate_nonwords() function to get a tabulated list of nonwords. The pinyin of each nonword is cross-checked against SUBTLEX-CH [1] to make sure it does not appear in the list of known disyllabic words. The frequency information is extracted from SUBTLEX-CH [1], the stroke counts from the strokes package, and the remaining lexical properties from Mandarin-Neighborhood-Statistics.

Note that the output may contain fewer than N items, because nonwords that are phonologically similar to real disyllabic words are excluded. To generate another list with the same arguments, change random_state to a different value.

print(cnw)
+----+----------+----------+-------------+-------------+------------+------------+--------+--------+-----------------+-----------------+
|    | Char-1   | Char-2   |   Logfreq-1 |   Logfreq-2 |   Stroke-1 |   Stroke-2 |   HD-1 |   HD-2 |   NumNeighbor-1 |   NumNeighbor-2 |
+====+==========+==========+=============+=============+============+============+========+========+=================+=================+
|  0 | 求       | 查       |        4.36 |        4.49 |          7 |          9 |      9 |      6 |             219 |             219 |
+----+----------+----------+-------------+-------------+------------+------------+--------+--------+-----------------+-----------------+
|  1 | 名       | 空       |        4.64 |        4.19 |          6 |          8 |      7 |      1 |             219 |             277 |
+----+----------+----------+-------------+-------------+------------+------------+--------+--------+-----------------+-----------------+
|  2 | 妈       | 何       |        4.9  |        4.58 |          6 |          7 |      3 |     12 |             278 |             219 |
+----+----------+----------+-------------+-------------+------------+------------+--------+--------+-----------------+-----------------+
|  3 | 法       | 比       |        4.78 |        4.71 |          8 |          4 |      2 |      5 |             261 |             262 |
+----+----------+----------+-------------+-------------+------------+------------+--------+--------+-----------------+-----------------+
|  4 | 乐       | 到       |        4.42 |        5.46 |          5 |          8 |      9 |      8 |             304 |             305 |
+----+----------+----------+-------------+-------------+------------+------------+--------+--------+-----------------+-----------------+
|  5 | 表       | 但       |        4.5  |        5.09 |          8 |          7 |      4 |     10 |             260 |             305 |
+----+----------+----------+-------------+-------------+------------+------------+--------+--------+-----------------+-----------------+
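The table above has fewer rows than the requested N=10. The sketch below illustrates why a sample-then-exclude procedure behaves this way. It is an illustration under stated assumptions, not the package's actual implementation: the character inventory, the known-word set, and the sample_nonwords helper are all toy stand-ins introduced here.

```python
import random

# Toy character inventory: character -> toneless pinyin (illustrative only).
CHAR_PINYIN = {
    "求": "qiu", "查": "cha", "名": "ming",
    "字": "zi", "方": "fang", "法": "fa",
}

# Toy stand-in for the list of known disyllabic words, stored as pinyin pairs
# (名字 "mingzi" and 方法 "fangfa").
KNOWN_WORD_PINYIN = {("ming", "zi"), ("fang", "fa")}


def sample_nonwords(n=10, random_state=42):
    """Draw n character pairs, then drop pairs whose pinyin matches a known word."""
    rng = random.Random(random_state)
    pool = sorted(CHAR_PINYIN)
    # Draw n candidate pairs of two distinct characters each.
    drawn = [tuple(rng.sample(pool, 2)) for _ in range(n)]
    # Exclusion happens after sampling, so the result can be shorter than n.
    return [
        (first, second)
        for first, second in drawn
        if (CHAR_PINYIN[first], CHAR_PINYIN[second]) not in KNOWN_WORD_PINYIN
    ]


nonwords = sample_nonwords(n=10, random_state=42)
print(len(nonwords), nonwords)
```

Because the seed fixes the draw, the same random_state always gives the same list, and a different value gives a different draw.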

Generate disyllabic words

Similarly, you can use the generate_words() function to generate a list of disyllabic words that meet the specification.

from chinese_nonwords import ChineseNonwords
cw = ChineseNonwords.generate_words(stroke_min=2, 
                                    stroke_max=18, 
                                    num_nei_min=100, 
                                    num_nei_max=300, 
                                    logfreq_min=3, 
                                    logfreq_max=6, 
                                    N=10, 
                                    random_state=42)

The output is tabulated in the same way as the nonword list, but with one row per word and one value per property rather than one value per character.

print(cw)
+----+--------+-----------+---------------------+----------+----------------+
|    | word   |   logfreq |   homophone_density |   stroke |   num_neighbor |
+====+========+===========+=====================+==========+================+
|  0 | 思考   |      3.08 |                 8   |      7.5 |          268.5 |
+----+--------+-----------+---------------------+----------+----------------+
|  1 | 家里   |      3.66 |                11   |      8.5 |          270   |
+----+--------+-----------+---------------------+----------+----------------+
|  2 | 迷人   |      3.03 |                 6.5 |      5.5 |          219.5 |
+----+--------+-----------+---------------------+----------+----------------+
|  3 | 交给   |      3.25 |                 7.5 |      7.5 |          268   |
+----+--------+-----------+---------------------+----------+----------------+
|  4 | 工具   |      3.02 |                13.5 |      5.5 |          291.5 |
+----+--------+-----------+---------------------+----------+----------------+
|  5 | 多久   |      3.64 |                 5   |      4.5 |          240   |
+----+--------+-----------+---------------------+----------+----------------+
|  6 | 理解   |      3.71 |                 6.5 |     12   |          262   |
+----+--------+-----------+---------------------+----------+----------------+
|  7 | 情绪   |      3.16 |                 7   |     11   |          262.5 |
+----+--------+-----------+---------------------+----------+----------------+
|  8 | 天气   |      3.1  |                 7.5 |      4   |          291.5 |
+----+--------+-----------+---------------------+----------+----------------+
|  9 | 酒吧   |      3.47 |                 3.5 |      8.5 |          136.5 |
+----+--------+-----------+---------------------+----------+----------------+
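
The fractional values in the table above (for example, a stroke count of 7.5 for 思考) are consistent with each word-level property being the mean of the two character-level values, although the package description does not state this. A minimal sketch of that aggregation, under that assumption; word_level is a hypothetical helper and not part of the package.

```python
from statistics import mean


def word_level(char_props):
    """Average each property over the characters of a word.

    char_props maps a property name to the per-character values,
    e.g. {"stroke": (9, 6)} for a two-character word.
    """
    return {name: mean(values) for name, values in char_props.items()}


# 思 has 9 strokes and 考 has 6, so the mean matches the 7.5 shown for 思考.
summary = word_level({"stroke": (9, 6)})
print(summary)  # {'stroke': 7.5}
```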
References

[1] SUBTLEX-CH: Cai, Q., & Brysbaert, M. (2010). SUBTLEX-CH: Chinese word and character frequencies based on film subtitles. PLoS ONE, 5(6), e10729.
