namesex

A gender classifier for Chinese given names

These details have not been verified by PyPI

Project links

Homepage

Project description

namesex

Namesex is a package that predicts the gender tendency of a Chinese given name. This module comes with two prediction models trained on 10,730 Chinese given names (in traditional Chinese) with reliable gender lables collected from public data. The first prediction model is a random forest classifier that can be invoked by predict(). This model takes three types of features: the given name, the unigram of given name, and a vector of size one hundred extracted from a skip-gram word-to-vector model trained separately using a news corpus collected from tw.yahoo.com. This news corpus contains 87,848,812 Chinese characters.

The second prediction model is a L2 regularized logistic regression that can be invoked by predict_logic(). This model uses the given names and the unigrams of given names only. Both prediction methods take a list of names and output predicted gender tendency (1 for male and 0 for female) or probability of being a male name.

While gensim was used to train the skip-gram word2vec model, this project does not depend on gensim because the trained model was extracted to a dictionary structure for the convenient use of this project. This project, nonetheless, depends on numpy, scipy, and sklearn. Windows users may want to install numpy, scipy, and sklearn using pre-compiled binary packages before installing namesex via pip. If you just want something that “just work” and does not want to install sklearn, consider using the sister project, namesex_light, that depends only on numpy. Namesex_light provides the same preduction function using a regularized logistic regression trained on the same dataset. Namesex_light should be faster than the predict() here. The prediction accuracy of namesex_light, however, is lower than the predict() function in namesex.

Additional information about namesex and namesex_light can be found in another document (in Chinese).

The prediction performance of the random forest and logistic regression models evaluated by ten-fold cross validation is listed below.

Random Forest

Metric	Performance	Performance Std. Dev.
Accuracy	0.9486	0.007072
F1	0.9470	0.007963
Precision	0.9525	0.008399
Recall	0.9417	0.012985
Logloss	161.54	4.101283

L2 Regularized Logistic Regression

Metric	Performance	Performance Std. Dev.
Accuracy	0.8957	0.007327
F1	0.8920	0.007873
Precision	0.8852	0.012238
Recall	0.8991	0.008936
Logloss	114.35	6.413972

The random forest model clearly has a higher accuracy and F1 score. We have also tested the k-nearest-neighbor (KNN) model (not reported here). KNN and logistic regression have a similar level of performance, and was excluded for obvious reasons.

Use pip/pip3 to install namesex.:

pip install namesex

To use namesex, pass in an array or list of given names to predict(). For each element in the input list, predict() returns 1 or 0 for male or female prediction. Set “predprob = True” to return probability of being a male name. The following is a simple sample code.:

>>> import namesex
>>> ns = namesex.namesex()
>>> ns.predict(['民豪', '愛麗', '志明'])
array([1, 0, 1])
>>> ns.predict(['民豪', '愛麗', '志明'], predprob=True)
array([0.8245    , 0.25695238, 0.85      ])

Note that namesex was trained using Chinese given names only. However, it may be used to classifier translated names as well:

>>> ns.predict(['莎拉波娃', '阿波羅', '雷', '艾美', '布蘭妮', '瑪麗亞'])
array([0, 1, 1, 0, 0, 0])

The model was trained using given names only. As a result, for the best performance, the input data should be preprocessed to keep given names only.:

>>> ns.predict(['黃志明春嬌', '黃志明', '志明', '黃春嬌', '春嬌'], predprob = True)
array([0.61825   , 0.79039286, 0.85      , 0.3646    , 0.3716    ])

In the above example, the family name has a minor effect on the prediction. Concatenating a male and female name somehow neutralize (toward 0.5) the gender tendency.

Testing Dataset

This package comes with a small testing dataset that was not used for model training. The following sample code illustrate a simple usage.:

>>> testdata = namesex.testdata()
>>> ns = namesex.namesex()
>>> pred = ns.predict(testdata.gname)
>>> pred2 = ns.predict_logic(testdata.gname)
>>> import numpy as np
>>> accuracy = np.mean(pred == testdata.sex)
>>> print(" Prediction accuracy (random forest) = {}".format(accuracy))
 Prediction accuracy (random forest) = 0.8921568627450981
>>> accuracy2 = np.mean(pred2 == testdata.sex)
>>> print(" Prediction accuracy (logistic reg) = {}".format(accuracy2))
 Prediction accuracy (logistic reg) = 0.8627450980392157

For both methods, the accuracy is slightly lower compared to the accuracy of ten-fold cross valudation. Random forest is still better compared to logistic regression.

Model Training

The module come with the training data. It is possible to train the model by yourself.

Project details

These details have not been verified by PyPI

Project links

Homepage

Release history Release notifications | RSS feed

This version

0.2.5

May 17, 2019

0.2.4

May 14, 2019

0.2.3

May 14, 2019

0.2.2

Jul 15, 2018

0.2.1

Jul 11, 2018

0.1.22

Jul 11, 2018

0.1.21

Jul 11, 2018

0.1.20

Jul 11, 2018

0.1.19

Jul 11, 2018

0.1.18

Jul 11, 2018

0.1.17

Jul 11, 2018

0.1.15

Jul 11, 2018

0.1.14

Jul 10, 2018

0.1.13

Jul 10, 2018

0.1.12

Jul 10, 2018

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

namesex-0.2.5.tar.gz (8.1 kB view details)

Uploaded May 17, 2019 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

namesex-0.2.5-py3-none-any.whl (22.8 MB view details)

Uploaded May 17, 2019 Python 3

File details

Details for the file namesex-0.2.5.tar.gz.

File metadata

Download URL: namesex-0.2.5.tar.gz
Upload date: May 17, 2019
Size: 8.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.11.0 pkginfo/1.4.2 requests/2.9.1 setuptools/38.4.0 requests-toolbelt/0.8.0 tqdm/4.23.4 CPython/3.5.2

File hashes

Hashes for namesex-0.2.5.tar.gz
Algorithm	Hash digest
SHA256	`e3f87580529ef6488d8a5e8bee352da16a9ba464497e99a5cb7f837a1a2bd78f`
MD5	`d1c7153d31f137635218aee8e4d1365b`
BLAKE2b-256	`7c1824cabcb66a0d37649e1218ec9c7bb636efd0c6522f489ee52a472e1820b7`

See more details on using hashes here.

File details

Details for the file namesex-0.2.5-py3-none-any.whl.

File metadata

Download URL: namesex-0.2.5-py3-none-any.whl
Upload date: May 17, 2019
Size: 22.8 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.11.0 pkginfo/1.4.2 requests/2.9.1 setuptools/38.4.0 requests-toolbelt/0.8.0 tqdm/4.23.4 CPython/3.5.2

File hashes

Hashes for namesex-0.2.5-py3-none-any.whl
Algorithm	Hash digest
SHA256	`3851a8ac19592fa9896dd5c39c0dfb1cdd8dd178ecb839b61b3a33bffa88bc63`
MD5	`bcfd5c4b52fdac2e8ed9d9c696c61c76`
BLAKE2b-256	`a4fe4a1d033e2c1be4001bef58bf779bfd2a0168449ba575995001c6730b452e`

See more details on using hashes here.

namesex 0.2.5

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

namesex

Random Forest

L2 Regularized Logistic Regression

Testing Dataset

Model Training

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes