A spaCy component for identifying grammatical gender in English texts.

About

Gender spaCy is a heuristic and machine learning pipeline that identifies gender in an ethical way using gender-specific context. It is designed to sit alongside a standard spaCy pipeline (currently only English is supported). The majority of the pipeline is rules-based, relying on titles and pronouns to identify gender as presented in the text. It is important to note that this pipeline does not seek to assign gender to an individual; rather, it identifies an entity's gender as expressed within the context of a text.

There are Python libraries, such as gender-resolver, that assign gender based on the statistical usage of first names in a given region. This, however, gets into problematic territory and is not as reliable as gender-based context (such as titles and pronouns). As a result, this pipeline does not use these libraries. Instead, entities identified as PERSON by the spaCy NER model are relabeled with the span label PERSON_UNKNOWN. Next, the pipeline applies the experimental coreference resolution model from ExplosionAI and looks at all clusters of linked tokens. If a cluster contains a PERSON_UNKNOWN span and gender-specific pronouns, the entity's label is changed to a gender-specific label, e.g. PERSON_FEMALE, PERSON_MALE, or PERSON_NEUTRAL. In addition, nouns that are linked to a specific person receive the tag "REL_MALE/FEMALE_COREF".
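The relabeling step described above can be sketched in plain Python. This is only an illustration of the idea, not the library's implementation: the Mention class, the pronoun sets, and the rule that a cluster is relabeled only when its pronouns agree are all assumptions made for this example.

```python
from dataclasses import dataclass

# Assumed pronoun sets for illustration; the library's own lists may differ.
FEMALE = {"she", "her", "hers", "herself"}
MALE = {"he", "him", "his", "himself"}
NEUTRAL = {"they", "them", "their", "theirs", "themself", "themselves"}


@dataclass
class Mention:
    """Hypothetical stand-in for one mention inside a coreference cluster."""
    text: str
    label: str = ""  # "PERSON_UNKNOWN" for named people, "" for other mentions


def label_from_pronouns(cluster):
    """Return a gender-specific label if the cluster's pronouns agree, else None."""
    found = set()
    for mention in cluster:
        word = mention.text.lower()
        if word in FEMALE:
            found.add("PERSON_FEMALE")
        elif word in MALE:
            found.add("PERSON_MALE")
        elif word in NEUTRAL:
            found.add("PERSON_NEUTRAL")
    # Assumption: only relabel when the pronoun evidence is unambiguous.
    return found.pop() if len(found) == 1 else None


def resolve_clusters(clusters):
    """Upgrade PERSON_UNKNOWN mentions in place using pronouns from the same cluster."""
    for cluster in clusters:
        label = label_from_pronouns(cluster)
        if label is None:
            continue
        for mention in cluster:
            if mention.label == "PERSON_UNKNOWN":
                mention.label = label
    return clusters


clusters = [
    [Mention("Maya Angelou", "PERSON_UNKNOWN"), Mention("She")],
    [Mention("Jerome Allen Seinfeld", "PERSON_UNKNOWN"), Mention("He"), Mention("himself")],
    [Mention("Larry David", "PERSON_UNKNOWN")],  # no pronoun evidence
]
resolve_clusters(clusters)
labels = [cluster[0].label for cluster in clusters]
print(labels)  # ['PERSON_FEMALE', 'PERSON_MALE', 'PERSON_UNKNOWN']
```

An entity with no linked pronouns keeps the PERSON_UNKNOWN label, which matches the goal of never guessing gender from a name alone.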

In addition, all pronouns are identified and labeled as spans. This includes male, female, and gender-neutral pronouns. Even transformer models have difficulty correctly parsing certain gender-neutral pronouns because they are homographs of other words. For example, "per" can function in English as a preposition (Per our discussion yesterday, I want to go to the store.) or as a gender-neutral pronoun (Per went to the store yesterday). With a few extra rules, Gender spaCy corrects the POS tags for these homographs in addition to placing all pronouns in the span ruler.
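To make the "per" case concrete, here is a minimal sketch of what one such correction rule might look like. The rule itself (treat "per" as a pronoun when the next token is a verb or auxiliary) and the retag_per function are hypothetical and are not taken from the library; the tags are Universal POS tags as used by spaCy.

```python
# Hypothetical rule: "per" directly before a verb or auxiliary is a pronoun.
VERB_LIKE = {"VERB", "AUX"}


def retag_per(tagged):
    """Take a list of (token, pos) pairs and return a corrected copy."""
    fixed = []
    for i, (token, pos) in enumerate(tagged):
        next_pos = tagged[i + 1][1] if i + 1 < len(tagged) else None
        if token.lower() == "per" and next_pos in VERB_LIKE:
            pos = "PRON"
        fixed.append((token, pos))
    return fixed


# "Per went to the store": a tagger may wrongly label "Per" as a preposition (ADP).
pronoun_case = [("Per", "ADP"), ("went", "VERB"), ("to", "ADP"),
                ("the", "DET"), ("store", "NOUN")]
# "Per our discussion": here ADP is correct and must be left alone.
preposition_case = [("Per", "ADP"), ("our", "PRON"), ("discussion", "NOUN")]

print(retag_per(pronoun_case)[0])      # ('Per', 'PRON')
print(retag_per(preposition_case)[0])  # ('Per', 'ADP')
```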

Users can access all gender span data under doc.spans["ruler"].

Installation

Because this pipeline leverages spaCy's new experimental coreference resolution model, it is best to install Gender spaCy in a fresh environment.

First, create a new environment:

conda create --name="gender-spacy" python=3.9

Now, activate the environment:

conda activate gender-spacy

Next, install Gender spaCy:

pip install gender-spacy

Finally, for the pipeline to perform coreference resolution, install the spaCy experimental coreference resolution model:

pip install https://github.com/explosion/spacy-experimental/releases/download/v0.6.0/en_coreference_web_trf-3.4.0a0-py3-none-any.whl
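After installing, it can help to confirm that everything is importable from the new environment. The check below is a small sketch that uses only the standard library; the module names are assumptions inferred from the package names above (gender_spacy from the import in Usage, en_coreference_web_trf from the wheel filename), and missing_packages is a helper written for this example.

```python
from importlib.util import find_spec

# Assumed top-level module names, inferred from the package names.
REQUIRED = ("spacy", "spacy_experimental", "gender_spacy", "en_coreference_web_trf")


def missing_packages(names=REQUIRED):
    """Return the names that cannot be located in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages()
if missing:
    print("Not importable:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

The check only locates the modules and does not load them, so it runs quickly even though the transformer model is large.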

Usage

# import the library
from gender_spacy import gender_spacy as gs

# create the GenderParser nlp class.
# This will take one argument: the spaCy model you wish to use
nlp = gs.GenderParser("en_core_web_sm")

# create a text and pass it to the nlp object via the process_doc() method.
text = """
Maya Angelou was an American memoirist, popular poet, and civil rights activist. She published seven autobiographies, three books of essays, several books of poetry, and is credited with a list of plays, movies, and television shows spanning over 50 years.

Jerome Allen Seinfeld is an American stand-up comedian, actor, writer, and producer. He is best known for playing a semi-fictionalized version of himself in the sitcom Seinfeld (1989–1998), which he created and wrote with Larry David.
"""
doc = nlp.process_doc(text)

# perform coreference resolution on the doc container
# This part of the library comes from spacy-experimental
doc = nlp.coref_resolution()

# Visualize the result:
nlp.visualize()

Expected Result

[Figure: result demo]
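Besides the visualization, the labeled spans can be read programmatically from doc.spans["ruler"]. The helper below is a sketch, not part of the library: group_spans_by_label is a hypothetical function that relies only on the text and label_ attributes that spaCy Span objects provide. The demo values are illustrative stand-ins so the example runs without any model installed; they are not actual pipeline output.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_spans_by_label(spans):
    """Collect span texts under their labels, preserving document order."""
    grouped = defaultdict(list)
    for span in spans:
        grouped[span.label_].append(span.text)
    return dict(grouped)


# Illustrative stand-ins for spaCy Span objects (not real pipeline output).
demo_spans = [
    SimpleNamespace(text="Maya Angelou", label_="PERSON_FEMALE"),
    SimpleNamespace(text="Jerome Allen Seinfeld", label_="PERSON_MALE"),
    SimpleNamespace(text="Larry David", label_="PERSON_UNKNOWN"),
]

for label, texts in group_spans_by_label(demo_spans).items():
    print(label, texts)
```

With a doc produced as in the Usage section, the same helper would be called as group_spans_by_label(doc.spans["ruler"]).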

Citations

Source for gender pronouns: https://uwm.edu/lgbtrc/support/gender-pronouns/

Source for Coreference Resolution: https://explosion.ai/blog/coref

Discussion for Coref Code: https://github.com/explosion/spaCy/discussions/11585

