Genderizer tries to infer gender from a first name and/or by analyzing sample text.

Genderizer

Genderizer is a language-independent module that tries to detect gender by looking at a given first name and/or analyzing sample text.

Always remember:

Data-driven predictions can succeed --and they can fail. It is when we deny our role in the process that the odds of failure rise. Before we demand more of our data, we need to demand more of ourselves. -- The Signal and The Noise by Nate Silver

## Installation

You can install this package using the following pip command:

$ sudo pip install genderizer3

## Examples

from genderizer3.genderizer3 import Genderizer

print(Genderizer.detect(firstName='John'))
# >>> male

print(Genderizer.detect(firstName='Marry'))
# >>> female

print(Genderizer.detect(firstName='mustafa'))
# >>> male

print(Genderizer.detect(firstName='fatma'))
# >>> female

print(Genderizer.detect(firstName='fikret'))
# >>> None

print(Genderizer.detect(firstName='fikret', text='galatasary maçı kaçmaz'))
# >>> male

print(Genderizer.detect(firstName='fikret', text='annemlerle yemek yedik'))
# >>> female

print(Genderizer.detect(text='askerlik yoklamasını kaçırdım mk'))
# >>> male

print(Genderizer.detect(text='bana çiçek alan erkek için canım feda'))
# >>> female

print(Genderizer.detect(firstName='fatma', text='askerlik yoklamasını kaçırdım mk'))
# >>> female

print(Genderizer.detect(text='futbol sevgi'))
# >>> None

print(Genderizer.detect(text='lan bi siktir git'))
# >>> male

Note: Of course, women can also say 'lan bi siktir git'. According to the classifier's training data, however, the probability that the author is female is lower than the probability that the author is male.

The accuracy of detection therefore depends on the training data.

By the way, saying 'lan bi siktir git' in Turkish is quite rude.

How It Works

Genderizer is a module that tries to detect gender by looking at a person's first name and/or analyzing sample text written by that person.

If a first name is used for only one gender, the system accepts that gender and does no further analysis. For example, in Turkish 'Mustafa', 'Osman', and 'Hasan' are used only for males, while 'Fatma', 'Ceyda', and 'Elif' are used only for females.

When the first name does not determine the gender with certainty, the system analyzes the text, if one is given. For example, 'Ekim', 'Meric', and 'Tümay' are used for both males and females.
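The name-first decision flow described above can be sketched roughly as follows. This is an illustration only, not Genderizer's actual code: the `NAME_GENDERS` table, the `detect` function, and the `classify_text` callback are hypothetical names invented for this sketch.

```python
from typing import Callable, Optional

# Hypothetical lookup table. None marks a name used for both genders.
NAME_GENDERS = {
    "mustafa": "male",
    "fatma": "female",
    "ekim": None,
}


def detect(
    first_name: Optional[str] = None,
    text: Optional[str] = None,
    classify_text: Optional[Callable[[str], Optional[str]]] = None,
) -> Optional[str]:
    """Return 'male', 'female', or None when nothing can be inferred."""
    if first_name:
        gender = NAME_GENDERS.get(first_name.lower())
        if gender is not None:
            # Unambiguous name: accept it and skip the text analysis.
            return gender
    if text and classify_text is not None:
        # Ambiguous or unknown name: fall back to the text classifier.
        return classify_text(text)
    return None
```

Under these assumptions, an unambiguous name wins even when the text points the other way, which matches the 'fatma' example in the Examples section.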

The text analysis is a classification of the sample text. It simply tries to compute the probability that the author is male or female by mining the sample text. The system uses naive Bayes classification, provided by the naiveBayesClassifier package.
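To make the idea of naive Bayes text classification concrete, here is a minimal, self-contained sketch. It is not the naiveBayesClassifier package and does not reproduce its API: the `TinyNaiveBayes` class, its whitespace tokenizer, and the add-one smoothing are assumptions chosen for brevity.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Toy multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter()              # label -> number of samples
        self.vocabulary = set()

    def train(self, text, label):
        tokens = text.lower().split()
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocabulary.update(tokens)

    def log_scores(self, text):
        """Return log P(label) + sum of log P(word | label) per label."""
        tokens = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, n_docs in self.doc_counts.items():
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log(
                    (count + 1) / (total_words + len(self.vocabulary))
                )
            scores[label] = score
        return scores

    def classify(self, text):
        """Return the most probable label, or None if nothing was trained."""
        scores = self.log_scores(text)
        return max(scores, key=scores.get) if scores else None
```

The result depends entirely on the samples passed to `train`: a word seen more often with one label pulls the score toward that label, which is why the quality of the training set matters so much.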

How To Improve It

TODO: write a step by step guide

## Preparing Language Dependent Training Sets

TODO: give a few examples

Customization and Optimization

Using Memcached For Speed

"""
Under heavy usage, for example tens of thousands of detection
requests within a few seconds, the default configuration may not meet
the demand. By default, Genderizer loads the necessary data from
files, which is known to be slow. Loading the data into memory once,
rather than every time it is needed, is a better approach, and
memcached is one of the best ways to do so. For more information,
have a look at the memcached documentation.

Genderizer provides a memcached interface to store first names in
memory. To activate this interface, configure the
MemcachedNamesCollection interface and pass it to Genderizer when
initializing it.
"""

from genderizer3.genderizer3 import Genderizer
from genderizer3.memcachedNamesCollection import MemcachedNamesCollection

# For memcached, do not forget to set up the memcached server.
MemcachedNamesCollection.memcacheHost = '127.0.0.1:11211'
Genderizer.init(
    namesCollection=MemcachedNamesCollection
)
print(Genderizer.detect(firstName='John'))
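The "load once, reuse many times" idea behind this section can also be illustrated without memcached. The sketch below caches a parsed names file inside a single process with `functools.lru_cache`. It is not how Genderizer or MemcachedNamesCollection works; the `name,gender` CSV layout and the `load_names` and `lookup` helpers are hypothetical.

```python
import csv
import functools


@functools.lru_cache(maxsize=None)
def load_names(path):
    """Parse a hypothetical 'name,gender' CSV file once per process."""
    with open(path, newline="", encoding="utf-8") as handle:
        return {
            row[0].strip().lower(): row[1].strip()
            for row in csv.reader(handle)
            if len(row) == 2
        }


def lookup(path, first_name):
    """Return the stored gender for first_name, or None if it is unknown."""
    # Only the first call for a given path reads the file.
    # Later calls reuse the dictionary kept by lru_cache.
    return load_names(path).get(first_name.lower())
```

An in-process cache like this is not shared between worker processes or machines. A shared store such as memcached is, which is one reason to prefer it under heavy load.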

Using Mongodb

"""
If you want to use Genderizer with MongoDB, the MongoNamesCollection
first names collection interface will do most of the necessary work
for you.
"""
from genderizer3.genderizer3 import Genderizer
from genderizer3.mongoNamesCollection import MongoNamesCollection

MongoNamesCollection.mongodbURL = 'mongodb://192.168.1.170'
Genderizer.init(
    namesCollection=MongoNamesCollection
)
print(Genderizer.detect(firstName='Marry'))

Custom Text Classifier

"""
NaiveBayesClassifier is the default classifier, but you can use an
entirely different one, as long as it has a 'classify' method that
takes the text as a parameter.

For more information please have a look at the naiveBayesClassifier
project's documentation.
https://github.com/muatik/naive-bayes-classifier
"""

from naiveBayesClassifier import tokenizer
from naiveBayesClassifier.classifier import Classifier
from cachedModel import CachedModel
from genderizer3.genderizer3 import Genderizer

Genderizer.init(
    lang='en',
    classifier=Classifier(CachedModel.get('en'), tokenizer)
)

print(Genderizer.detect(firstName='fikret', text='annemle kahve keyfi'))
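As a rough illustration of the "any object with a `classify` method" requirement, here is a toy keyword-based classifier. The `KeywordClassifier` class and its cue words are made up for this sketch. The return format, a list of `(label, score)` pairs with the best match first, is an assumption; check what Genderizer expects from `classify` before relying on it.

```python
class KeywordClassifier:
    """Toy classifier exposing a classify(text) method."""

    def __init__(self, keywords):
        # keywords maps a label to an iterable of lowercase cue words.
        self.keywords = {
            label: set(words) for label, words in keywords.items()
        }

    def classify(self, text):
        tokens = text.lower().split()
        scores = []
        for label, words in self.keywords.items():
            hits = sum(1 for token in tokens if token in words)
            scores.append((label, hits))
        # ASSUMPTION: (label, score) pairs sorted with the best match first.
        return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

If the return format matches what Genderizer expects, such an object could presumably be passed as `classifier=KeywordClassifier(...)` to `Genderizer.init`, in the same way as the `Classifier` instance above.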

TODO

  • inline docs
  • unit-tests

Original AUTHORS

  • Mustafa Atik @muatik
  • Nejdet Yucesoy @nejdetckenobi
