
Gender Detection Tool by David Arroyo MEnéndez


Table of Contents

  1. Logo
  2. Name
  3. Why?
  4. Tell me about DAMe Gender on Youtube
  5. Install
    1. Docker Image
    2. Installing Software
      1. Possible Debian/Ubuntu dependencies
      2. From sources
      3. With python package
    3. Obtaining an api key
    4. Configuring nltk
  6. Check test
    1. All unit tests
      1. Using Docker image
    2. Single unit test
      1. Using Docker image
    3. Tests from commands
  7. Execute program
  8. Benchmarking
    1. Market Study
    2. Accuracy
    3. Accuracy (Damegender ML)
    4. Confusion Matrix
    5. Errors with files/names/all.csv has:
    6. Performance
  9. Statistics for damegender
    1. Measuring success and fails
    2. PCA
      1. Concepts
      2. Choosing components
      3. Load Dataset
      4. Standardize the data
      5. Pca Projection to N Dimensions
      6. Analyze components to determine gender in names
  10. Speeches, Seminars, Expressions of Support
  11. Beautiful Snakes
  12. Dame Music
  13. License

Logo

img

Name

damegender is a tool that detects gender from a first name, coded by David Arroyo MEnéndez (DAME).

Why?

  • If you want to determine the gender gap in free software projects or mailing lists.
  • If you don't know the gender of a name.
  • If you want to research, with statistics, why a name is related to males or females.
  • If you want to use a free gender detection tool that works from a name, from the command line, with open data.
  • If you want to use the main gender detection solutions (genderize, genderapi, namsor, nameapi and gender guesser) from the command line.

DAMe Gender is for you!

Tell me about DAMe Gender on Youtube

img

Install

Docker Image

# Build the container image
$ docker build . -t damegender/damegender:latest

# Run the container
$ docker run -ti damegender/damegender:latest main.py David

Installing Software

Possible Debian/Ubuntu dependencies

$ sudo apt-get install python3-nose-exclude python3-dev dict dict-freedict-eng-spa dict-freedict-spa-eng dictd

From sources

$ git clone https://github.com/davidam/damegender
$ cd damegender
$ pip3 install -r requirements.txt

With python package

$ python3 -m venv /tmp/d
$ cd /tmp/d
$ source bin/activate
$ pip install --upgrade pip
$ pip3 install damegender
$ cd lib/python3.5/site-packages/damegender
$ python3 main.py David

To install the extra dependencies for the APIs:

$ pip3 install damegender[apis]

To install the extra dependencies for mailing lists and repositories:

$ pip3 install damegender[mails_and_repositories]

To install all possible dependencies:

$ pip3 install damegender[all]

Obtaining an api key

Some of the supported APIs may require an API key. To configure yours, execute:

$ python3 apikeyadd.py

Configuring nltk

$ python3
>>> import nltk
>>> nltk.download('names')

Check test

All unit tests

$ nosetests3 test

Using Docker image

$ docker run -ti --entrypoint nosetests damegender/damegender:latest test

Single unit test

$ nosetests3 test/test_dame_sexmachine.py:TddInPythonExample.test_string2array_method_returns_correct_result

Using Docker image

$ docker run -ti --entrypoint nosetests damegender/damegender:latest test/test_dame_sexmachine.py:TddInPythonExample.test_string2array_method_returns_correct_result

Tests from commands

$ cd src/damegender
$ ./testsbycommands.sh         # It must run for you
$ ./testsbycommandsextralocal.sh    # You will need all dependencies with: $ pip3 install damegender[all]
$ ./testsbycommandsextranet.sh    # You will need api keys

Execute program

# Detect gender from a name (INE is the dataset used by default)
$ python3 main.py David
David gender is male
363559 males for David from INE.es
0 females for David from INE.es

# Detect gender from a name only using machine learning (experimental way)
$ python3 main.py Mesa --ml=nltk
Mesa gender is female
0 males for Mesa from INE.es
0 females for Mesa from INE.es

# Detect gender from a name (all census and machine learning)
$ python3 main.py David --verbose
365196 males for David from INE.es
0 females for David from INE.es
1193 males for David from Uruguay census
5 females for David from Uruguay census
26645 males for David from United Kingdom census
0 females for David from United Kingdom census
3552580 males for David from United States of America census
12826 females for David from United States of America census
David gender predicted with nltk is male
David gender predicted with sgd is male
David gender predicted with svc is male
David gender predicted with gaussianNB is male
David gender predicted with multinomialNB is male
David gender predicted with bernoulliNB is male
David gender predicted with forest is male
David gender predicted with tree is male
David gender predicted with mlp is male

# Find your name in different countries
$ python3 nameincountries.py David
grep -i " David " files/names/nam_dict.txt > files/grep.tmp
males: ['Albania', 'Armenia', 'Austria', 'Azerbaijan', 'Belgium', 'Bosnia and Herzegovina', 'Czech Republic', 'Denmark', 'East Frisia', 'France', 'Georgia', 'Germany', 'Great Britain', 'Iceland', 'Ireland', 'Israel', 'Italy', 'Kazakhstan/Uzbekistan', 'Luxembourg', 'Malta', 'Norway', 'Portugal', 'Romania', 'Slovenia', 'Spain', 'Sweden', 'Swiss', 'The Netherlands', 'USA', 'Ukraine']
females: []
both: []

# Count gender from a git repository
$ python3 git2gender.py https://github.com/chaoss/grimoirelab-perceval.git --directory="/tmp/clonedir"
The number of males sending commits is 15
The number of females sending commits is 7

# Count gender from a mailing list
$ cd files/mbox
$ wget -c http://mail-archives.apache.org/mod_mbox/httpd-announce/201706.mbox
$ cd ..
$ python3 mail2gender.py http://mail-archives.apache.org/mod_mbox/httpd-announce/

# Use an api to detect the gender
$ python3 api2gender.py Leticia --surname="Martin" --api=namsor
female
scale: 0.99

# Google popularity for a name
$ python3 gendergoogle.py Leticia
Google results of Leticia as male: 42300
Google results of Leticia as female: 63400

# Give me informative features
$ python3 infofeatures.py
Females with last letter a: 0.4705246078961601
Males with last letter a: 0.048672566371681415
Females with last letter consonant: 0.2735841767750908
Males with last letter consonant: 0.6355328972681801
Females with last letter vocal: 0.7262612995441552
Males with last letter vocal: 0.3640823393612928

# Download results from an api and save in a file
$ python3 downloadjson.py --csv=files/names/min.csv --api=genderize
$ cat files/names/genderizefiles_names_min.csv.json

# To measure success
$ python3 accuracy.py --csv=files/names/min.csv
################### NLTK!!
Gender list: [1, 1, 1, 1, 2, 1, 0, 0]
Guess list:  [1, 1, 1, 1, 0, 1, 0, 0]
Dame Gender accuracy: 0.875

$ python3 accuracy.py --api="genderize" --csv=files/names/min.csv
################### Genderize!!
Gender list: [1, 1, 1, 1, 2, 1, 0, 0]
Guess list:  [1, 1, 1, 1, 2, 1, 0, 0]
Genderize accuracy: 1

$ python3 confusion.py --csv="files/names/partial.csv" --api=nameapi --jsondownloaded="files/names/nameapifiles_names_partial.csv.json"
A confusion matrix C is such that Ci,j is equal to the number of observations known to be in group i but predicted to be in group j.
If the classifier is nice, the diagonal is high because there are true positives
Nameapi confusion matrix:

[[ 3, 0, 0]
 [ 0, 15, 1]]


# To analyze errors guessing names from a csv
$ python3 errors.py --csv="files/names/all.csv" --api="genderguesser"
Gender Guesser with files/names/all.csv has:
+ The error code: 0.22564457518601835
+ The error code without na: 0.026539047204698716
+ The na coded: 0.20453365634192766
+ The error gender bias: 0.0026103980857080703

# To deploy a graph about correlation between variables
$ python3 corr.py
$ python3 corr.py --csv="categorical"
$ python3 corr.py --csv="nocategorical"
# To create files from scripts. Example: the pickle models, or csv processed from original files.
$ python3 postinstall.py
# Experiments to determine features with weight (not finished)
$ python3 pca-components.py --csv="files/features_list.csv" # To determine number of components
$ python3 pca-features.py                                   # To understand the weight between variables for a target

# Counting surnames
$ python3 surname.py Gil --total=es
There are 140004 people using Gil in Spain

$ python3 surname.py Lenon --total=us
There are 837 people using Lenon in United States of America

# Measuring Ethnicity of surnames
$ python3 ethnicity.py Smith
In United States of America the percentages about the race of Smith surname is:
White: 73.35
Black: 22.22
Hispanic: 1.56
Asian Pacific Indian American: 0.40
American Indian and Alaska Native: 0.85
Various races: 1.63

Benchmarking

Market Study

| | Gender API | gender-guesser | genderize.io | NameAPI | NamSor | damegender |
|---|---|---|---|---|---|---|
| Database size | 431322102 | 45376 | 114541298 | 1428345 | 4407502834 | 57282 |
| Regular data updates | yes | no | no | yes | yes | yes, developing |
| Handles unstructured full name strings | yes | no | no | yes | no | yes |
| Handles surnames | yes | no | no | yes | yes | yes |
| Handles non-Latin alphabets | partially | no | partially | yes | yes | no |
| Implicit geo-localization | yes | no | no | yes | yes | no |
| Exists locale | yes | yes | yes | yes | yes | yes |
| Assignment type | probabilistic | binary | probabilistic | probabilistic | probabilistic | probabilistic |
| Free parameters | totalnames, probability | gender | probability, count | confidence | scale | totalnames, count |
| Prediction | no | no | no | no | no | yes |
| Free license | no | yes | no | no | no | yes |
| API | yes | no | yes | yes | yes | future |
| Free requests limited | yes (200) | unlimited | yes | yes | yes | unlimited |

(Checked: 2019/06/27)

Accuracy

| Name | Accuracy | Precision | F1score | Recall |
|---|---|---|---|---|
| Genderapi | 0.9687686966482124 | 0.9717050018254838 | 0.9637877964874163 | 1.0 |
| Genderize | 0.926775 | 0.9761303240374678 | 0.9655113956503119 | 1.0 |
| Namsor | 0.8672551055728626 | 0.9730097087378641 | 0.9236866359447006 | 1.0 |
| Nameapi | 0.8301886792452831 | 0.97420272191753 | 0.9054181612233341 | 1.0 |
| Gender Guesser | 0.7743554248139817 | 0.9848151408450704 | 0.8715900233826968 | 1.0 |

(Checked: 2019/10 until 2019/12)

These accuracies were measured using the dataset by Lucía Santamaría and Helena Mihaljevic as ground truth.

Accuracy (Damegender ML)

| Name | Accuracy | Precision | F1score | Recall |
|---|---|---|---|---|
| SVC | 0.879 | 0.972 | 0.972 | 1.0 |
| Random Forest | 0.862 | 0.902 | 0.902 | 1.0 |
| NLTK (Bayes) | 0.862 | 0.902 | 0.902 | 1.0 |
| MultinomialNB | 0.782 | 0.791 | 0.791 | 1.0 |
| Tree | 0.764 | 0.821 | 0.796 | 1.0 |
| SGD | 0.709 | 0.943 | 0.815 | 1.0 |
| GaussianNB | 0.709 | 0.968 | 0.887 | 1.0 |
| BernoulliNB | 0.699 | 0.965 | 0.816 | 1.0 |
| MLP | 0.677 | 0.819 | 0.755 | 1.0 |

Damegender uses the following datasets:

  • INE.es (Spain)
  • USA
  • United Kingdom
  • Uruguay

We expect better results as more languages are added.

Machine Learning Algorithms in DameGender: these results are experimental, and we are still improving the choice of features.

Confusion Matrix

  1. GenderApi

    male female undefined
    male 3589 155 67
    female 211 1734 23
  2. Genderguesser

    male female undefined
    male 3326 139 346
    female 78 1686 204
  3. Genderize

    male female undefined
    male 3157 242 412
    female 75 1742 151
  4. Namsor

    male female undefined
    male 3325 139 346
    female 78 1686 204
  5. Nameapi

    male female undefined
    male 2627 674 507
    female 667 1061 240
  6. Dame Gender

    male female undefined
    male 3033 778 0
    female 276 1692 0

    In this version of Dame Gender, names are never classified as undefined.

Errors with files/names/all.csv has:

| API | error code | error code without na | na coded | error gender bias |
|---|---|---|---|---|
| Genderize | 0.0727 | 0.053 | 0.02 | -0.008 |
| Damegender | 0.2547594323295258 | 0.2547594323295258 | 0.0 | -0.04949809622706819 |
| GenderApi | 0.16666666666666666 | 0.16666666666666666 | 0.0 | -0.16666666666666666 |
| Gender Guesser | 0.2255105572862582 | 0.026962383126766687 | 0.20404984423676012 | 0.0030441400304414 |
| Namsor | 0.16666666666666666 | 0.16666666666666666 | 0.0 | 0.16666666666666666 |
| Nameapi | 0.361 | 0.267 | 0.129 | 0.001 |

Performance

These performance metrics require a csv file and the downloaded json files.

################### Damegender!!
Gender list: [1, 1, 1, 1, 1, 0]
Guess list: [1, 1, 1, 1, 1, 0]
Damegender accuracy: 1.0

real 0m1.270s
user 0m0.876s
sys 0m0.416s

################### Genderize!!
Gender list: [1, 1, 1, 1, 1, 0]
Guess list: [1, 1, 1, 1, 1, 0]
Genderize accuracy: 1.0

real 0m0.811s
user 0m0.776s
sys 0m0.312s

################### Genderapi!!
Gender list: [1, 1, 1, 1, 1, 0]
Guess list: [1, 1, 1, 1, 1, 0]
Genderapi accuracy: 1.0

real 0m0.763s
user 0m0.744s
sys 0m0.232s

################### Namsor!!
Gender list: [1, 1, 1, 1, 1, 0]
Guess list: [1, 1, 1, 1, 1, 0]
Namsor accuracy: 1.0

real 0m0.811s
user 0m0.776s
sys 0m0.356s

################### Nameapi!!
Gender list: [1, 1, 1, 1, 1, 0]
Guess list: [1, 1, 1, 1, 1, 0]
Nameapi accuracy: 1.0

real 0m0.832s
user 0m0.816s
sys 0m0.336s

A confusion matrix C is such that Ci,j is equal to the number of observations known to be in group i but predicted to be in group j.
If the classifier is nice, the diagonal is high because there are true positives
Damegender confusion matrix:

[[ 5, 0, 0 ]
 [ 0, 1, 0 ]]

real 0m0.812s
user 0m0.784s
sys 0m0.300s

Damegender with files/names/partial.csv has:

  • The error code: 0.10526315789473684
  • The error code without na: 0.10526315789473684
  • The na coded: 0.0
  • The error gender bias: 0.0

real 0m9.099s
user 0m9.008s
sys 0m0.412s

Statistics for damegender

Some theory could be useful to understand some commands

Measuring success and fails

To guess the sex, we start from a true value (for example, female) and obtain a result with a method (for example, an API, a dataset query or a machine learning model). The guessed result can be male, female or unknown. Recall the usual definitions, taking one gender as the positive class:

A true positive is a value guessed as positive when the value in the data source is positive.

A true negative is a value guessed as negative when the value in the data source is negative.

A false positive is a value guessed as positive when the value in the data source is negative.

A false negative is a value guessed as negative when the value in the data source is positive.

This vocabulary lets us measure successes and errors. In the context of gender from names, the mathematical concepts can be summarized as follows:

Precision is about true positives divided by true positives plus false positives

(femalefemale + malemale ) /
(femalefemale + malemale + femalemale)

Recall is about true positives divided by true positives plus false negatives.

(femalefemale + malemale ) /
(femalefemale + malemale + malefemale + femaleundefined + maleundefined)

Accuracy is the number of correct guesses divided by all the observations.

(femalefemale + malemale ) /
(femalefemale + malemale + malefemale + femalemale + femaleundefined + maleundefined)

The F1 score is the harmonic mean of precision and recall taking both metrics into account in the following equation:

2 * (
(precision * recall) /
(precision + recall))
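
As a minimal sketch, the four measures above can be computed from the six confusion counts. The function and argument names are hypothetical and not part of damegender, and the sketch assumes that a count such as femalemale means "true female guessed as male"; accuracy.py may compute these values differently.

```python
def name_metrics(ff, fm, fu, mf, mm, mu):
    """Measures as defined in this section.

    ff, fm, fu: true females guessed as female, male, undefined.
    mf, mm, mu: true males guessed as female, male, undefined.
    (This reading of the count names is an assumption.)
    """
    hits = ff + mm
    precision = hits / (hits + fm)
    recall = hits / (hits + mf + fu + mu)
    accuracy = hits / (hits + mf + fm + fu + mu)
    f1score = 2 * (precision * recall) / (precision + recall)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1score": f1score}


# Made-up counts, only to exercise the formulas
metrics = name_metrics(ff=2, fm=1, fu=0, mf=2, mm=14, mu=0)
for key, value in metrics.items():
    print(key, round(value, 3))
```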

In Damegender, accuracy.py applies these concepts. Try it in practice:

$ python3 accuracy.py --api="damegender" --measure="f1score" --csv="files/names/partialnoundefined.csv"
$ python3 accuracy.py --api="damegender" --measure="recall" --csv="files/names/partialnoundefined.csv"
$ python3 accuracy.py --api="damegender" --measure="precision" --csv="files/names/partialnoundefined.csv"
$ python3 accuracy.py --api="damegender" --measure="accuracy" --csv="files/names/partialnoundefined.csv"

$ python3 accuracy.py --api="genderguesser" --measure="f1score" --csv="files/names/partialnoundefined.csv"
$ python3 accuracy.py --api="genderguesser" --measure="recall" --csv="files/names/partialnoundefined.csv"
$ python3 accuracy.py --api="genderguesser" --measure="precision" --csv="files/names/partialnoundefined.csv"
$ python3 accuracy.py --api="genderguesser" --measure="accuracy" --csv="files/names/partialnoundefined.csv"

Error coded is the proportion of names whose guessed gender differs from the true one, counting undefined guesses as errors:

(femalemale + malefemale + maleundefined + femaleundefined) /
(malemale + femalemale + malefemale +
femalefemale + maleundefined + femaleundefined)

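Error coded without na is the same error computed only over the names that received a defined guess, that is, leaving the undefined results out:

(femalemale + malefemale) /
(malemale + femalemale + malefemale + femalefemale)

Na coded is the proportion of names guessed as undefined:

(maleundefined + femaleundefined) /
(malemale + femalemale + malefemale +
femalefemale + maleundefined + femaleundefined)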

Error gender bias shows whether the error is bigger when guessing males than when guessing females, or vice versa.

(malefemale - femalemale) /
(malemale + femalemale + malefemale + femalefemale)

The weighted error also compares the true and the guessed values, but gives a weight w to the results guessed as undefined.

(femalemale + malefemale +
w * (maleundefined + femaleundefined)) /
(malemale + femalemale + malefemale + femalefemale +
w * (maleundefined + femaleundefined))

In Damegender, errors.py implements these definitions for the different APIs.
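
The error measures of this section can be sketched in a few lines. This is an illustration and not the code of errors.py: the function name, the argument names and the default weight w=0.2 are assumptions, and "error coded without na" is taken here as the error over the names that received a defined guess.

```python
def error_measures(ff, fm, fu, mf, mm, mu, w=0.2):
    """Error measures from the six confusion counts.

    ff, fm, fu: true females guessed as female, male, undefined.
    mf, mm, mu: true males guessed as female, male, undefined.
    w: weight given to undefined guesses (0.2 is an arbitrary example).
    """
    defined = mm + fm + mf + ff      # names with a male/female guess
    undefined = mu + fu              # names guessed as undefined
    wrong = fm + mf                  # defined guesses that are wrong
    return {
        "error_coded": (wrong + undefined) / (defined + undefined),
        "error_coded_without_na": wrong / defined,
        "na_coded": undefined / (defined + undefined),
        "error_gender_bias": (mf - fm) / defined,
        "weighted_error": (wrong + w * undefined) / (defined + w * undefined),
    }


# Made-up counts, only to exercise the formulas
errors = error_measures(ff=40, fm=5, fu=5, mf=10, mm=35, mu=5)
for key, value in errors.items():
    print(key, round(value, 4))
```

A positive error_gender_bias means more males were guessed as female than females as male.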

The confusion matrix creates a matrix about the true and the guess. If you have this confusion matrix:

[[ 2, 0, 0]
 [ 0, 5, 0]]

This means there are 2 true females, all guessed as female, and 5 true males, all guessed as male. The classifier makes no errors.

[[ 2, 1, 0]
 [ 2, 14, 0]]

This means there are 3 true females, of which 2 were guessed as female and 1 as male, and 16 true males, of which 14 were guessed as male and 2 as female.

In Damegender, we have coded confusion.py to implement this concept with the different apis.
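
A matrix of this shape can be built from a list of true genders and a list of guesses. The sketch below is not the code of confusion.py; it assumes the coding 0 = female, 1 = male and 2 = undefined, and the function name is hypothetical.

```python
def confusion_matrix(gender_list, guess_list):
    """Rows: true female (0), true male (1).
    Columns: guessed female (0), guessed male (1), guessed undefined (2).
    The 0/1/2 coding is an assumption made for this sketch.
    """
    matrix = [[0, 0, 0], [0, 0, 0]]
    for truth, guess in zip(gender_list, guess_list):
        if truth in (0, 1):  # only names with a known true gender get a row
            matrix[truth][guess] += 1
    return matrix


# 3 true females followed by 16 true males
truth = [0, 0, 0] + [1] * 16
guess = [0, 0, 1] + [0, 0] + [1] * 14
print(confusion_matrix(truth, guess))  # [[2, 1, 0], [2, 14, 0]]
```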

PCA

Concepts

Dispersion measures for a single variable include, for instance, the variance and the standard deviation.

img

If you have 2 variables, you can write a formula very similar to the variance: the covariance.

img

If you have 3 variables or more, you can write a covariance matrix.

img
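
The formulas shown in the three images above are not reproduced in this text. Assuming the usual sample definitions (with an n − 1 denominator, which is an assumption about what the images contained), they can be written as:

```latex
\operatorname{var}(X) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2

\operatorname{cov}(X, Y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})

C =
\begin{pmatrix}
\operatorname{cov}(X, X) & \operatorname{cov}(X, Y) & \operatorname{cov}(X, Z) \\
\operatorname{cov}(Y, X) & \operatorname{cov}(Y, Y) & \operatorname{cov}(Y, Z) \\
\operatorname{cov}(Z, X) & \operatorname{cov}(Z, Y) & \operatorname{cov}(Z, Z)
\end{pmatrix}
```

Note that cov(X, X) = var(X), so the diagonal of C holds the variances and the matrix is symmetric.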

In essence, an eigenvector v of a linear transformation T is a non-zero vector that, when T is applied to it, does not change direction. Applying T to the eigenvector only scales the eigenvector by the scalar value λ, called an eigenvalue.

img
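
The eigenvector property can be checked numerically. The following sketch uses a small made-up symmetric matrix as a stand-in for a covariance matrix and verifies that applying it to each eigenvector only rescales the vector.

```python
import numpy as np

# Made-up symmetric matrix standing in for a covariance matrix
T = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh is meant for symmetric matrices; the columns of `eigenvectors`
# are the eigenvectors, in ascending order of eigenvalue
eigenvalues, eigenvectors = np.linalg.eigh(T)

checks = []
for lam, v in zip(eigenvalues, eigenvectors.T):
    # Applying T to v gives the same direction, scaled by lam
    checks.append(bool(np.allclose(T @ v, lam * v)))
    print(lam, checks[-1])
```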

A feature vector is constructed taking the eigenvectors that you want to keep from the list of eigenvectors.

The new data set is obtained by taking the transpose of the feature vector and multiplying it on the left of the transposed original data set.

FinalData = RowFeatureVector x RowDataAdjust

We can choose PCA using the covariance method as opposed to the correlation method.

The covariance method has the following steps:

  1. Organize the data set
  2. Calculate the empirical mean
  3. Calculate the deviations from the mean
  4. Find the covariance matrix
  5. Find the eigenvectors and eigenvalues of the covariance matrix
  6. Rearrange the eigenvectors and eigenvalues
  7. Compute the cumulative energy content for each eigenvector
  8. Select a subset of the eigenvectors as basis vectors
  9. Project the z-scores of the data onto the new basis
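
The steps of the covariance method can be sketched with NumPy. This is a simplified illustration with made-up data and hypothetical names, not damegender's code; for brevity it projects the mean-centered data in the last step rather than z-scores.

```python
import numpy as np


def pca_covariance(data, n_components):
    data = np.asarray(data, dtype=float)              # 1. organize the data set
    mean = data.mean(axis=0)                          # 2. empirical mean
    deviations = data - mean                          # 3. deviations from the mean
    cov = np.cov(deviations, rowvar=False)            # 4. covariance matrix
    eigenvalues, eigenvectors = np.linalg.eigh(cov)   # 5. eigen decomposition
    order = np.argsort(eigenvalues)[::-1]             # 6. sort by decreasing eigenvalue
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    energy = np.cumsum(eigenvalues) / eigenvalues.sum()  # 7. cumulative energy
    basis = eigenvectors[:, :n_components]            # 8. keep a subset as basis
    projected = deviations @ basis                    # 9. project (centered data here)
    return projected, eigenvalues, energy


# Made-up data: 50 samples with 3 correlated features
rng = np.random.default_rng(0)
mixing = np.array([[3.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.0, 0.5, 0.2]])
samples = rng.normal(size=(50, 3)) @ mixing

projected, eigenvalues, energy = pca_covariance(samples, n_components=2)
print(projected.shape)
print(energy)
```

The variance of each projected column equals the corresponding eigenvalue, which is why the cumulative energy tells how much of the dispersion the kept components retain.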

The correlation method has the following steps:

  1. Compute the correlation matrix
  2. Solve for the correlation roots of R (product of eigenvalues)
  3. Compute the first column of the V matrix
  4. Compute the remaining columns of the V matrix
  5. Compute the L(1/2) matrix
  6. Compute the communality
  7. Diagonal elements report how much of the variability is explained
  8. Compute the coefficient matrix
  9. Compute the principal factors

Choosing components

We can choose components with:

import argparse

import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

parser = argparse.ArgumentParser()
parser.add_argument('--csv', required=True)  # for example files/features_list.csv
args = parser.parse_args()

# Load the features; the header row is read as NaN and skipped below
data = np.genfromtxt(args.csv, delimiter=',', dtype='float64')

# Rescale the first 8 feature columns to the [0, 1] range
scaler = MinMaxScaler(feature_range=(0, 1))
data_rescaled = scaler.fit_transform(data[1:, 0:8])

# Fit the PCA algorithm with our data, keeping all components
pca = PCA().fit(data_rescaled)
# Plot the cumulative sum of the explained variance ratio
plt.figure()
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of Components')
plt.ylabel('Cumulative explained variance (ratio)')
plt.title('Dataset Explained Variance')
plt.show()

img

Looking at the image, we can choose 6 components.
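
The same choice can be made without reading the plot, by taking the smallest number of components whose cumulative explained variance reaches a threshold. The function name, the threshold and the ratios below are made up for illustration; in practice the ratios would come from pca.explained_variance_ratio_.

```python
from itertools import accumulate


def components_for_variance(ratios, threshold=0.95):
    """Smallest number of leading components whose cumulative
    explained variance ratio reaches `threshold`."""
    for count, cumulative in enumerate(accumulate(ratios), start=1):
        if cumulative >= threshold:
            return count
    return len(ratios)  # threshold never reached: keep everything


# Made-up explained variance ratios for 8 components
example_ratios = [0.40, 0.25, 0.15, 0.10, 0.05, 0.03, 0.01, 0.01]
print(components_for_variance(example_ratios, threshold=0.97))
```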

Load Dataset

We choose the file all.csv to generate features and a list to determine gender (male or female)

from pprint import pprint
import pandas as pd
import matplotlib.pyplot as plt
from app.dame_sexmachine import DameSexmachine
from app.dame_gender import Gender

## LOAD DATASET
g = Gender()
g.features_list2csv(categorical="both", path="files/names/all.csv")
features = "files/features_list.csv"

print("STEP1: N COMPONENTS + 1 TARGET")

x = pd.read_csv(features)
print(x.columns)

y = g.dataset2genderlist(dataset="files/names/all.csv")
print(y)

Standardize the data

print("STEP2: STANDARDIZE THE DATA")
from sklearn.preprocessing import StandardScaler
# Standardizing the features
x = StandardScaler().fit_transform(x)
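
With its default settings, StandardScaler subtracts the mean of each column and divides by its population standard deviation. A NumPy-only sketch of that computation, on a small made-up array:

```python
import numpy as np

# Made-up data: 3 samples, 2 features on very different scales
x_demo = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 60.0]])

mean = x_demo.mean(axis=0)
std = x_demo.std(axis=0)           # population standard deviation (ddof=0)
x_standardized = (x_demo - mean) / std

print(x_standardized.mean(axis=0))  # close to 0 for every column
print(x_standardized.std(axis=0))   # close to 1 for every column
```

Standardizing first keeps a feature with large values from dominating the principal components.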

Pca Projection to N Dimensions

Finally, we create the pca transform with 6 dimensions and we add the target component.

from sklearn.decomposition import PCA
pca = PCA(n_components=6)
principalComponents = pca.fit_transform(x)
print("STEP3: PCA PROJECTION")
pprint(principalComponents)
principalDf = pd.DataFrame(data = principalComponents, columns = ['principal component 1', 'principal component 2', 'principal component 3', 'principal component 4', 'principal component 5', 'principal component 6'])

target = pd.DataFrame(data = y, columns = ['target component'])

print(principalDf.join(target))

Analyze components to determine gender in names

| first_letter | last_letter | last_letter_a | first_letter_vocal | last_letter_vocal | last_letter_consonant | target component |
|---|---|---|---|---|---|---|
| -0.2080025204 | -0.3208958517 | 0.2352509625 | 0.2113242731 | **0.6095269139** | **-0.6095269139** | -0.1035071139 |
| **-0.6037951881** | **0.5174873789** | -0.4252467151 | 0.4278794455 | 0.0388287435 | -0.0388287435 | -0.0265942125 |
| 0.1049343046 | 0.1158117877 | -0.2867605971 | -0.3473950734 | 0.0901034539 | -0.0901034539 | -0.8697264971 |
| 0.2026467275 | 0.3142402839 | **0.630802294** | **0.5325769702** | -0.1291229841 | 0.1291229841 | -0.3811720011 |

In this analysis, we can observe 4 components.

The first component is about whether the last letter is a vowel or a consonant. If the last letter is a vowel we tend to find a female, and if the last letter is a consonant we tend to find a male.

The second component is about the first letter and the last letter. The last letter points to females and the first letter points to males.

The third component does not give relevant information.

The fourth component shows that a last letter "a" and a first letter that is a vowel point to females.
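
To make the feature names used in this analysis concrete, here is a minimal sketch of how such features could be derived from a name. It is a hypothetical illustration, not damegender's own feature code: the function name is invented, and only the unaccented vowels a, e, i, o, u are considered.

```python
VOWELS = set("aeiou")  # simplification: accented vowels are not handled


def name_features(name):
    """Letter-based features of a first name (hypothetical sketch)."""
    lowered = name.strip().lower()
    first, last = lowered[0], lowered[-1]
    return {
        "first_letter": first,
        "last_letter": last,
        "last_letter_a": int(last == "a"),
        "first_letter_vocal": int(first in VOWELS),
        "last_letter_vocal": int(last in VOWELS),
        "last_letter_consonant": int(last.isalpha() and last not in VOWELS),
    }


print(name_features("Leticia"))
print(name_features("David"))
```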

Speeches, Seminars, Expressions of Support

Beautiful Snakes

img

Dame Music

Listen music …

License

Copyright (C) 2019 David Arroyo Menendez.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.3 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in GNU Free Documentation License.

img
