Skip to main content

Gender Detection Tool by David Arroyo MEnéndez

Project description

Logo

img

Name

damegender is a gender detection tool from the name coded by David Arroyo MEnéndez (DAME)

Why?

  • If you want determine gender gap in free software projects or mailing lists.
  • If you don't know the gender about a name
  • If you want research with statistics about why a name is related with males or females.
  • If you want use a free gender detection tool from a name from a command with open data.
  • If you want use the main solutions in gender detection (genderize, genderapi, namsor, nameapi and gender guesser) from a command.

DAMe Gender is for you!

Tell me about DAMe Gender on Youtube

img

Install

Docker Image

# Build the container image
$ docker build . -t damegender/damegender:latest

# Run the container
$ docker run -ti damegender/damegender:latest main.py David

Installing Software

Posible Debian/Ubuntu dependencies

$ sudo apt-get install python3-nose-exclude python3-dev dict dict-freedict-eng-spa dict-freedict-spa-eng dictd

From sources

$ git clone https://github.com/davidam/damegender
$ cd damegender
$ pip3 install -r requirements.txt

With python package

$ python3 -m venv /tmp/d
$ cd /tmp/d
$ source bin/activate
$ pip install --upgrade pip
$ pip3 install damegender
$ cd lib/python3.5/site-packages/damegender
$ python3 main.py David

To install apis extra dependencies:

$ pip3 install damegender[apis]

To install mailing lists and repositories extra dependencies:

$ pip3 install damegender[mails_and_repositories]

To install all posible dependencies

$ pip3 install damegender[all]

Obtaining an api key

Currently you can need an api key from:

You can execute:

$ python3 apikeyadd.py

To configure your api key

Configuring nltk

$ python3
>>> import nltk
>>> nltk.download('names')

Check test

All unit tests

$ nosetest3 test

Using Docker image

$ docker run -ti --entrypoint nosetests damegender/damegender:latest test

Single unit test

$ nosetests3 test/test_dame_sexmachine.py:TddInPythonExample.test_string2array_method_returns_correct_result

Using Docker image

$ docker run -ti --entrypoint nosetests damegender/damegender:latest test/test_dame_sexmachine.py:TddInPythonExample.test_string2array_method_returns_correct_result

Tests from commands

$ cd src/damegender
$ ./testsbycommands.sh         # It must run for you
$ ./testsbycommandsextralocal.sh    # You will need all dependencies with: $ pip3 install damegender[all]
$ ./testsbycommandsextranet.sh    # You will need api keys

Execute program

# Detect gender from a name (INE is the dataset used by default)
$ python3 main.py David
David gender is male
 363559  males for David from INE.es
0 females for David from INE.es

# Detect gender from a name only using machine learning (experimental way)
$ python3 main.py Mesa --ml=nltk
Mesa gender is female
0 males for Mesa from INE.es
0 females for Mesa from INE.es

# Find your name in different countries
$ python3 nameincountries.py David
grep -i " David " files/names/nam_dict.txt > files/grep.tmp
males: ['Albania', 'Armenia', 'Austria', 'Azerbaijan', 'Belgium', 'Bosnia and Herzegovina', 'Czech Republic', 'Denmark', 'East Frisia', 'France', 'Georgia', 'Germany', 'Great Britain', 'Iceland', 'Ireland', 'Israel', 'Italy', 'Kazakhstan/Uzbekistan', 'Luxembourg', 'Malta', 'Norway', 'Portugal', 'Romania', 'Slovenia', 'Spain', 'Sweden', 'Swiss', 'The Netherlands', 'USA', 'Ukraine']
females: []
both: []

# Count gender from a git repository
$ python3 git2gender.py https://github.com/chaoss/grimoirelab-perceval.git --directory="/tmp/clonedir"
The number of males sending commits is 15
The number of females sending commits is 7

# Count gender from a mailing list
$ cd files/mbox
$ wget -c http://mail-archives.apache.org/mod_mbox/httpd-announce/201706.mbox
$ cd ..
$ python3 mail2gender.py http://mail-archives.apache.org/mod_mbox/httpd-announce/

# Use an api to detect the gender
$ python3 api2gender.py Leticia --surname="Martin" --api=namsor
female
scale: 0.99

# Google popularity for a name
$ python3 gendergoogle.py Leticia
Google results of Leticia as male: 42300
Google results of Leticia as female: 63400

# Give me informative features
$ python3 infofeatures.py
Females with last letter a: 0.4705246078961601
Males with last letter a: 0.048672566371681415
Females with last letter consonant: 0.2735841767750908
Males with last letter consonant: 0.6355328972681801
Females with last letter vocal: 0.7262612995441552
Males with last letter vocal: 0.3640823393612928


# Download results from an api and save in a file
$ python3 downloadjson --csv=files/names/min.csv --api=genderize
$ cat files/names/genderizefiles_names_min.csv.json

# To measure success
$ python3 accuracy.py --csv=files/names/min.csv
################### NLTK!!
Gender list: [1, 1, 1, 1, 2, 1, 0, 0]
Guess list:  [1, 1, 1, 1, 0, 1, 0, 0]
Dame Gender accuracy: 0.875

$ python3 accuracy.py --api="genderize" --csv=files/names/min.csv
################### Genderize!!
Gender list: [1, 1, 1, 1, 2, 1, 0, 0]
Guess list:  [1, 1, 1, 1, 2, 1, 0, 0]
Genderize accuracy: 1

$ python3 confusion.py --csv="files/names/partial.csv" --api=nameapi --jsondownloaded="files/names/nameapifiles_names_partial.csv.json"
A confusion matrix C is such that Ci,j is equal to the number of observations known to be in group i but predicted to be in group j.
If the classifier is nice, the diagonal is high because there are true positives
Nameapi confusion matrix:

[[ 3, 0, 0]
 [ 0, 15, 1]]


# To analyze errors guessing names from a csv
$ python3 errors.py --csv="files/names/all.csv" --api="genderguesser"
Gender Guesser with files/names/all.csv has:
+ The error code: 0.22564457518601835
+ The error code without na: 0.026539047204698716
+ The na coded: 0.20453365634192766
+ The error gender bias: 0.0026103980857080703

# To deploy a graph about correlation between variables
$ python3 corr.py
$ python3 corr.py --csv="categorical"
$ python3 corr.py --csv="nocategorical"
# To create files from scripts. Example: the pickle models, or csv processed from original files.
$ python3 postinstall.py
# Experiments to determine features with weight (not finished)
$ python3 pca-components.py --csv="files/features_list.csv" # To determine number of components
$ python3 pca-features.py                                   # To understand the weight between variables for a target

Benchmarking

Market Study

  Gender API gender-guesser genderize.io NameAPI NamSor damegender
Database size 431322102 45376 114541298 1428345 4407502834 57282
Regular data updates yes no no yes yes yes, developing
Handles unstructured full name strings yes no no yes no yes
Handles surnames yes no no yes yes yes
Handles non-Latin alphabets partially no partially yes yes no
Implicit geo-localization yes no no yes yes no
Exists locale yes yes yes yes yes yes
Assingment type probilistic binary probabilistic probabilistic probabilistic probabilistic
Free parameters totalnames, probability gender probability, count confidence scale totalnames, count
Prediction no no no no no yes
Free license no yes no no no yes
API yes no yes yes yes future
free requests limited yes (200) unlimited yes yes yes unlimited

(Checked: 2019/06/27)

Accuracy

Name Accuracy Precision F1score Recall
Genderapi 0.9687686966482124 0.9717050018254838 0.9637877964874163 1.0
Genderize 0.926775 0.9761303240374678 0.9655113956503119 1.0
Namsor 0.8672551055728626 0.9730097087378641 0.9236866359447006 1.0
Gender Guesser 0.7743554248139817 0.9848151408450704 0.8715900233826968 1.0
Damegender 0.7452405676704742 0.8789548887528067 0.8789548887528067 1.0

(Checked: 2019/10 until 2019/12)

In Damegender we are using nltk and INE.es dataset in test. We hope better results with more languages.

Machine Learning Algorithms in DameGender These results are experimental, we are improving the choosing of features.

  • Stochastic Gradient Descendent accuracy: 0.5873374788015828
  • Support Vector Machines accuracy: 0.7049180327868853
  • Gaussian Naive Bayes accuracy: 0.5960994912379876
  • Multinomial Naive Bayes accuracy: 0.5876201243640475
  • Bernoulli Naive Bayes accuracy: 0.5962408140192199
  • Dame Gender (nltk bayes) accuracy: 0.6677501413227812
  • Random Forest accuracy: 0.3364895421141888

Confusion Matrix

  1. Genderguesser

    [[ 1686, 78, 204]
    [ 139, 3326, 346]]
    
  2. Genderize

    [[ 1742, 75, 151]
     [ 242, 3157, 412]]
    
  3. Namsor

    [[ 1686, 78, 204]
     [ 139, 3326, 346]]
    
  4. Nameapi

    [[ 3126, 93, 592]
     [75, 1616, 277]]
    
  5. Dame Gender

    [[ 1692, 276, 0]
    [ 778, 3033, 0]]
    

    In this version of Dame Gender, we are not considering decide names as undefined.

Errors with files/names/all.csv has:

API error code error code without na na coded error gender bias
Damegender 0.2547594323295258 0.2547594323295258 0.0 -0.04949809622706819
GenderApi 0.16666666666666666 0.16666666666666666 0.0 -0.16666666666666666
Gender Guesser 0.2255105572862582 0.026962383126766687 0.20404984423676012 0.0030441400304414
Namsor 0.16666666666666666 0.16666666666666666 0.0 0.16666666666666666

Performance

  • GenderGuesser accuracy: 0.6902204635387225

real 160m58.742s user 44m47.532s sys 0m56.024s

  • Dame Gender accuracy: 0.6677501413227812

real 129m23.082s user 53m12.640s sys 0m32.040s

Statistics for damegender

Some theory could be useful to understand some commands

Measuring success and fails

To guess the sex, we have an true idea (example: female) and we obtain a result with a method (example: using an api, querying a dataset or with a machine learning model). The guessed result could be male, female or perhaps unknown. Remember some definitions about results about this matter:

True positive is find a value guessed as true if the value in the data source is positive.

True negative is find a value guessed as true if the the value in the data source is negative.

False positive is find a value guessed as false if the the value in the data source is positive.

False negative is find a value guessed as false if the the value in the data source is negative.

So, we can find a vocabulary for measure true, false, success and errors. We can make a summary in the gender name context about mathematical concepts:

Precision is about true positives between true positives plus false positives

(femalefemale + malemale ) /
(femalefemale + malemale + femalemale)

Recall is about true positives between true positives plus false negatives.

(femalefemale + malemale ) /
(femalefemale + malemale + malefemale + femaleundefined + maleundefined)

Accuray is about true positives between all.

(femalefemale + malemale ) /
(femalefemale + malemale + malefemale + femalemale + femaleundefined + maleundefinedxs)

The F1 score is the harmonic mean of precision and recall taking both metrics into account in the following equation:

2 * (
(precision * recall) /
(precision + recall))

In Damengender, we are using accuracy.py to apply these concepts. Take a look to practice:

$ python3 accuracy.py --api="damegender" --measure="f1score" --csv="files/names/partialnoundefined.csv"
$ python3 accuracy.py --api="damegender" --measure="recall" --csv="files/names/partialnoundefined.csv"
$ python3 accuracy.py --api="damegender" --measure="precision" --csv="files/names/partialnoundefined.csv"
$ python3 accuracy.py --api="damegender" --measure="accuracy" --csv="files/names/partialnoundefined.csv"

$ python3 accuracy.py --api="genderguesser" --measure="f1score" --csv="files/names/partialnoundefined.csv"
$ python3 accuracy.py --api="genderguesser" --measure="recall" --csv="files/names/partialnoundefined.csv"
$ python3 accuracy.py --api="genderguesser" --measure="precision" --csv="files/names/partialnoundefined.csv"
$ python3 accuracy.py --api="genderguesser" --measure="accuracy" --csv="files/names/partialnoundefined.csv"

Error coded is about the true is different than the guessed:

(femalemale + malefemale + maleundefined + femaleundefined) /
(malemale + femalemale + malefemale +
femalefemale + maleundefined + femaleundefined)

Error coded without na is about the true is different than the guessed, but without undefined results.

(maleundefined + femaleundefined) /
(malemale + femalemale + malefemale +
femalefemale + maleundefined + femaleundefined)

Error gender bias is to understand if the error is bigger guessing males than females or viceversa.

(malefemale - femalemale) /
(malemale + femalemale + malefemale + femalefemale)

The weighted error is about the true is different than the guessed, but giving a weight to the guessed as undefined.

(femalemale + malefemale +
+ w * (maleundefined + femaleundefined)) /
(malemale + femalemale + malefemale + femalefemale +
+ w * (maleundefined + femaleundefined))

In Damengeder, we have coded errors.py to implement the different definitions in diffrent apis.

The confusion matrix creates a matrix between the true and the guess. If you have this confusion matrix:

[[ 2, 0, 0]
 [ 0, 5, 0]]

It means, I have 2 females true and I've guessed 2 females and I've 5 males true and I've guessed 5 males. I don't have errors in my classifier.

[[ 2  1  0]
[ 2 14  0]

It means, I have 2 females true and I've guessed 2 females and I've 14 males true and I've guessed 14 males. 1 female was considered male, 2 males was considered female.

In Damegender, we have coded confusion.py to implement this concept with the different apis.

PCA

Concepts

The dispersion measures between 1 variables are: variance, standard deviation, …

img

If you have 2 variables, you can write a formula so similar to variance.

img

If you have 3 variables or more, you can write a covariance matrix.

img

In essence, an eigenvector v of a linear transformation T is a non-zero vector that, when T is applied to it, does not change direction. Applying T to the eigenvector only scales the eigenvector by the scalar value λ, called an eigenvalue.

img

A feature vector is constructed taking the eigenvectors that you want to keep from the list of eigenvectors.

The new dataset take the transpose of the vector and multiply it on the left of the original data set, transposed.

FinalData = RowFeatureVector x RowDataAdjust

We can choose PCA using the covariance method as opposed to the correlation method.

The covariance method has the next steps:

  1. Organize the data set
  2. Calculate the empirical mean
  3. Calculate the deviations from the mean
  4. Find the covariance matrix
  5. Find the eigenvectors and eigenvalues of the covariance matrix
  6. Rearrange the eigenvectors and eigenvalues
  7. Compute the cumulative energy content for each eigenvector
  8. Select a subset of the eigenvectors as basis vectors
  9. Project the z-scores of the data onto the new basis

The correlation method has the next steps:

  1. Compute the correlation matrix
  2. Solve for the correlation roots of R (product of eigenvalues)
  3. Compute the first column of the V matrix
  4. Compute the remaining columns of the V matrix
  5. Compute the L(1/2) matrix
  6. Compute the communality
  7. Diagonal elements report how much of the variability is explained
  8. Compute the coefficient matrix
  9. Compute the principal factors

Choosing components

We can choose components with:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt
import argparse
parser = argparse.ArgumentParser()
parser.add_argument('--csv')
args = parser.parse_args()

#filepath = 'files/features_list.csv' #your path here
data = np.genfromtxt(args.csv, delimiter=',', dtype='float64')

scaler = MinMaxScaler(feature_range=[0, 1])
data_rescaled = scaler.fit_transform(data[1:, 0:8])

#Fitting the PCA algorithm with our Data
pca = PCA().fit(data_rescaled)
#Plotting the Cumulative Summation of the Explained Variance
plt.figure()
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of Components')
plt.ylabel('Variance (%)') #for each component
plt.title('Dataset Explained Variance')
plt.show()

img

Taking a look to the image. We can choose 6 components.

Load Dataset

We choose the file all.csv to generate features and a list to determine gender (male or female)

from pprint import pprint
import pandas as pd
import matplotlib.pyplot as plt
from app.dame_sexmachine import DameSexmachine
from app.dame_gender import Gender

## LOAD DATASET
g = Gender()
g.features_list2csv(categorical="both", path="files/names/all.csv")
features = "files/features_list.csv"

print("STEP1: N COMPONENTS + 1 TARGET")

x = pd.read_csv(features)
print(x.columns)

y = g.dataset2genderlist(dataset="files/names/all.csv")
print(y)

Standarize the data

print("STEP2: STANDARIZE THE DATA")
from sklearn.preprocessing import StandardScaler
# Standardizing the features
x = StandardScaler().fit_transform(x)

Pca Projection to N Dimensions

Finally, we create the pca transform with 6 dimensions and we add the target component.

from sklearn.decomposition import PCA
pca = PCA(n_components=6)
principalComponents = pca.fit_transform(x)
print("STEP3: PCA PROJECTION")
pprint(principalComponents)
principalDf = pd.DataFrame(data = principalComponents, columns = ['principal component 1', 'principal component 2', 'principal component 3', 'principal component 4', 'principal component 5', 'principal component 6'])

target = pd.DataFrame(data = y, columns = ['target component'])

print(principalDf.join(target))

Analize components to determine gender in names

first\\letter last\\letter last\\letter\\a first\\letter\\vocal last\\letter\\vocal last\\letter\\consonant target component
-0.2080025204 -0.3208958517 0.2352509625 0.2113242731 **0.6095269139** **-0.6095269139** -0.1035071139
**-0.6037951881** **0.5174873789** -0.4252467151 0.4278794455 0.0388287435 -0.0388287435 -0.0265942125
0.1049343046 0.1158117877 -0.2867605971 -0.3473950734 0.0901034539 -0.0901034539 -0.8697264971
0.2026467275 0.3142402839 **0.630802294** **0.5325769702** -0.1291229841 0.1291229841 -0.3811720011

In this analysis, we can observe 4 components.

The first component is about if the last letter is vocal or consonant. If the last letter is vocal we can find a male and if the last letter is a consonant we can find a male.

The second component is about the first letter. The last letter is determing females and the first letter is determing males.

The third component is not giving relevant information.

The fourth component is giving tha lastlettera and the firstlettervocal is for females.

Speeches, Seminars, Expressions of Support

Beautiful Snakes

img

License

Copyright (C) 2019 David Arroyo Menendez Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.3 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in GNU Free Documentation License.

img

Project details


Release history Release notifications | RSS feed

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

damegender-0.2.5.tar.gz (32.2 MB view hashes)

Uploaded Source

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page