A lightweight gender classifier for Chinese given names
Project description
namesex_light
Namesex_light is a lighweight package that predicts the gender tendency of Chinese given names. This module comes with a L2 regularized logistic regression trained on 10,730 Chinese given names (in traditional Chinese) with reliable gender lables collected from public data. The predict() function takes a list of names and output predicted gender tendency (1 for male and 0 for female) or probability of being a male name. Namesex_light has a sister project, namesex, that performs similar tasks with higher accuracy.
Additional information about namesex and namesex_light can be found in another document (in Chinese).
The prediction performance evaluated by ten-fold cross validation is:
Metric |
Performance |
Performance Std. Dev. |
Accuracy |
0.8957 |
0.007327 |
F1 |
0.8920 |
0.007873 |
Precision |
0.8852 |
0.012238 |
Recall |
0.8991 |
0.008936 |
Logloss |
114.35 |
6.413972 |
Use pip/pip3 to install namesex_light.:
pip install namesex_light
To use namesex_light, pass in an array or list of given names to predict(). For each element in the input list, predict() returns 1 or 0 for male or female prediction. Set “predprob = True” to return probability of being a male name. The following is a simple sample code.:
>>> import namesex_light >>> nsl = namesex_light.namesex_light() >>> nsl.predict(['民豪', '愛麗', '志明']) array([1, 0, 1]) >>> nsl.predict(['民豪', '愛麗', '志明'], predprob=True) array([0.99968932, 0.00530066, 0.9938986 ])
Note that namesex_light was trained using Chinese given names only. However, it may be used to classifier translated names as well:
>>> nsl.predict(['阿波羅', '阿波羅', '雷', '艾美', '布蘭妮', '阿曼達']) array([1, 1, 1, 0, 0, 1])
This module is intended for a quick plug-and-play. The original training dataset is not included.
Testing Dataset
This package comes with a small testing dataset that was not used for model training. The following sample code illustrate a simple usage.:
>>> testdata = namesex_light.testdata() >>> nsl = namesex_light.namesex_light() >>> pred = nsl.predict(testdata.gname) >>> print("The first 5 given names are: {}".format(testdata.gname[0:5])) The first 5 given names are: ['翊如', '妤庭', '諆璋', '大閎', '和維'] >>> print(" and their sex: {}".format(testdata.sex[0:5])) and their sex: [0, 0, 1, 1, 1] >>> print(" and their predicted sex:{}".format(pred[0:5])) and their predicted sex:[0 0 1 1 1] >>> accuracy = np.sum(pred == testdata.sex) / len(pred) >>> print(" Prediction accuracy = {}".format(accuracy)) Prediction accuracy = 0.8627450980392157
Note that the accuracy is slightly lower compared to the accuracy of ten-fold cross valudation. I guess this is normal since this testset is collected from a source that is different from the training dataset.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for namesex_light-0.2.1-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | e426353bb0ed48cbf990a72ebaf16b69403fa7e4dbb6528d5912a77cd9c2f49c |
|
MD5 | 53a701708d79da924731d968fbf96e37 |
|
BLAKE2b-256 | 22da3a02cc2b5c629fff8c18f23922168c5e7c0385a58e31f28b8f9abb78f15a |