A gender classifier for Chinese given names
namesex
Namesex is a package that predicts the gender tendency of a Chinese given name. The module comes with two prediction models trained on 10,730 Chinese given names (in traditional Chinese) with reliable gender labels collected from public data. The first model is a random forest classifier invoked by predict(). It takes three types of features: the given name, the unigrams of the given name, and a 100-dimensional vector extracted from a skip-gram word2vec model trained separately on a news corpus collected from tw.yahoo.com. This news corpus contains 87,848,812 Chinese characters.
The second model is an L2 regularized logistic regression invoked by predict_logic(). It uses only the given names and their unigrams. Both prediction methods take a list of names and output either the predicted gender tendency (1 for male and 0 for female) or the probability of being a male name.
Although gensim was used to train the skip-gram word2vec model, this project does not depend on gensim, because the trained model was extracted into a dictionary structure for convenient use here. The project does depend on numpy, scipy, and sklearn. Windows users may want to install numpy, scipy, and sklearn from pre-compiled binary packages before installing namesex via pip. If you want something that "just works" and do not want to install sklearn, consider the sister project, namesex_light, which depends only on numpy. Namesex_light provides the same prediction function using a regularized logistic regression trained on the same dataset. Namesex_light should be faster than predict() in namesex, but its prediction accuracy is lower.
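As a rough illustration of the first two feature types (the given name itself and its character unigrams), the sketch below turns a name into a list of tokens. The helper `name_tokens` and the `name:` / `char:` prefixes are hypothetical and exist only for this example; the package's actual feature encoding may differ, and the word-vector features are left out entirely.

```python
def name_tokens(given_name):
    """Return the whole given name plus each distinct character (unigram).

    Hypothetical helper for illustration; not part of the namesex API.
    """
    tokens = ["name:" + given_name]
    seen = set()
    for ch in given_name:
        # Keep each character once, in order of first appearance.
        if ch not in seen:
            seen.add(ch)
            tokens.append("char:" + ch)
    return tokens


print(name_tokens("志明"))  # ['name:志明', 'char:志', 'char:明']
```

Tokens of this kind could then be one-hot encoded before being passed to a classifier; that step is also an assumption rather than a description of the shipped models.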
Additional information about namesex and namesex_light can be found in another document (in Chinese).
The prediction performance of the random forest and logistic regression models evaluated by ten-fold cross validation is listed below.
Random Forest

Metric | Performance | Performance Std. Dev. |
---|---|---|
Accuracy | 0.9486 | 0.007072 |
F1 | 0.9470 | 0.007963 |
Precision | 0.9525 | 0.008399 |
Recall | 0.9417 | 0.012985 |
Logloss | 161.54 | 4.101283 |

L2 Regularized Logistic Regression

Metric | Performance | Performance Std. Dev. |
---|---|---|
Accuracy | 0.8957 | 0.007327 |
F1 | 0.8920 | 0.007873 |
Precision | 0.8852 | 0.012238 |
Recall | 0.8991 | 0.008936 |
Logloss | 114.35 | 6.413972 |
The random forest model clearly has higher accuracy and a higher F1 score. We also tested a k-nearest-neighbor (KNN) model (not reported here). KNN performed at a level similar to logistic regression and was therefore excluded.
Use pip (or pip3) to install namesex:
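For readers unfamiliar with the metrics in the tables, the following minimal sketch computes accuracy, precision, recall, and F1 for a single set of predictions. It assumes that male (label 1) is treated as the positive class; the source does not state which class was used as positive, and the function name `binary_metrics` is hypothetical.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for binary labels.

    Assumption: `positive` (male = 1 here) is the class of interest.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels, made up for this example only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
print(binary_metrics(y_true, y_pred))
```

In a ten-fold cross validation, a function like this would be evaluated once per fold, and the mean and standard deviation over the ten results would be reported.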
pip install namesex
To use namesex, pass an array or list of given names to predict(). For each element of the input, predict() returns 1 for a male prediction or 0 for a female prediction. Set predprob=True to return the probability of being a male name instead. A simple example:
>>> import namesex
>>> ns = namesex.namesex()
>>> ns.predict(['民豪', '愛麗', '志明'])
array([1, 0, 1])
>>> ns.predict(['民豪', '愛麗', '志明'], predprob=True)
array([0.8245    , 0.25695238, 0.85      ])
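The probabilities returned with predprob=True can also be turned into labels by hand, which makes it possible to flag names whose gender tendency is unclear. The sketch below assumes a cut-off of 0.5, which is consistent with the sample output above but is not confirmed by the source; `to_labels` and its `margin` parameter are hypothetical.

```python
def to_labels(probs, threshold=0.5, margin=0.0):
    """Map male-name probabilities to 1 (male), 0 (female), or None (unclear).

    Assumption: 0.5 is the decision threshold. Probabilities closer to the
    threshold than `margin` are reported as None instead of a label.
    """
    labels = []
    for p in probs:
        if abs(p - threshold) < margin:
            labels.append(None)
        else:
            labels.append(1 if p >= threshold else 0)
    return labels


probs = [0.8245, 0.25695238, 0.85]
print(to_labels(probs))  # [1, 0, 1]
```

A non-zero margin, for example `to_labels(probs, margin=0.2)`, leaves confident predictions unchanged and marks borderline ones as None.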
Note that namesex was trained using Chinese given names only. However, it may be used to classify translated names as well:
>>> ns.predict(['莎拉波娃', '阿波羅', '雷', '艾美', '布蘭妮', '瑪麗亞'])
array([0, 1, 1, 0, 0, 0])
Because the model was trained using given names only, the input data should be preprocessed to keep given names only for the best performance:
>>> ns.predict(['黃志明春嬌', '黃志明', '志明', '黃春嬌', '春嬌'], predprob=True)
array([0.61825   , 0.79039286, 0.85      , 0.3646    , 0.3716    ])
In the above example, the family name has a minor effect on the prediction. Concatenating a male name and a female name partly neutralizes the gender tendency, moving the probability toward 0.5.
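One simple way to do this preprocessing is to strip a known family name from the front of each full name. The sketch below is only an illustration under stated assumptions: the `SURNAMES` set is a tiny hypothetical sample, not a list shipped with namesex, and `strip_family_name` is not part of the package.

```python
# Hypothetical, deliberately tiny surname list; a real one would be much longer.
SURNAMES = {"黃", "陳", "林", "張", "歐陽", "司馬"}


def strip_family_name(full_name, surnames=SURNAMES):
    """Remove a leading family name if it appears in `surnames`."""
    # Try longer surnames first so two-character surnames are matched whole.
    for length in sorted({len(s) for s in surnames}, reverse=True):
        prefix = full_name[:length]
        # Only strip when something remains to serve as the given name.
        if prefix in surnames and len(full_name) > length:
            return full_name[length:]
    return full_name


print(strip_family_name("黃志明"))    # 志明
print(strip_family_name("歐陽春嬌"))  # 春嬌
```

This approach should only be applied to full names: a given name that happens to start with a surname character would be shortened by mistake.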
Testing Dataset
This package comes with a small testing dataset that was not used for model training. The following sample code illustrates a simple usage:
>>> testdata = namesex.testdata()
>>> ns = namesex.namesex()
>>> pred = ns.predict(testdata.gname)
>>> pred2 = ns.predict_logic(testdata.gname)
>>> import numpy as np
>>> accuracy = np.mean(pred == testdata.sex)
>>> print(" Prediction accuracy (random forest) = {}".format(accuracy))
 Prediction accuracy (random forest) = 0.8921568627450981
>>> accuracy2 = np.mean(pred2 == testdata.sex)
>>> print(" Prediction accuracy (logistic reg) = {}".format(accuracy2))
 Prediction accuracy (logistic reg) = 0.8627450980392157
For both methods, the accuracy is slightly lower than the ten-fold cross-validation accuracy. Random forest still outperforms logistic regression.
Model Training
The module comes with the training data, so it is possible to train the model yourself.