⚡ Ultra-fast name-to-gender prediction engine. Uses mmap and binary search for <1ms lookups. Features a pre-compiled 4.6MB binary database covering 700k+ global entries. Built for scale.

gender-detect

A high-performance gender detection library and CLI tool built on binary search. It predicts the likely gender and country of origin for a first name from a pre-compiled binary database, which it reads through memory mapping (mmap) for sub-millisecond lookups.

Features

  • Extreme Speed: Uses binary search (O(log n)) on a packed binary database with mmap for zero-copy lookups.
  • Zero-Dependency: Built entirely on the Python standard library.
  • CLI Ready: Includes a built-in table-formatted command line interface.
  • Privacy Focused: 100% local; no external API calls or data tracking.

Installation

pip install gender-detect

CLI Usage

After installation, you can use the gender-detect command directly from your terminal:

gender-detect John

For automation, you can output the result in raw JSON:

gender-detect John --json

Library Usage

Simple Prediction

Pass a first name to predict() to get the likely gender, its probability, and the country where the name is most often reported.

from gender_detect import GenderDetector

gd = GenderDetector()
result = gd.predict("John")

print(result)

Response Format

The gender_probability field is the share of all samples, summed across every country, that match likely_gender. In the example below, 5 of the 6 samples are male, which gives 5/6 ≈ 0.83.

{
  "name": "john",
  "likely_gender": "male",
  "gender_probability": 0.83,
  "top_reported_country": "US",
  "data_breakdown": [
    {
      "country": "US",
      "male_samples": 4,
      "female_samples": 1
    },
    {
      "country": "GB",
      "male_samples": 1,
      "female_samples": 0
    }
  ]
}
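
As a rough illustration of how the summary fields relate to data_breakdown, the sketch below recomputes them from the per-country counts. The helper name summarize and the tie-breaking rule (male wins a tie) are assumptions for this example, not part of the library's documented API.

```python
def summarize(breakdown):
    """Recompute (likely_gender, gender_probability) from per-country counts."""
    male = sum(row["male_samples"] for row in breakdown)
    female = sum(row["female_samples"] for row in breakdown)
    total = male + female
    if total == 0:
        # No samples at all: nothing to predict.
        return None, 0.0
    # Assumption: ties resolve to "male"; the library may behave differently.
    likely = "male" if male >= female else "female"
    return likely, round(max(male, female) / total, 2)


breakdown = [
    {"country": "US", "male_samples": 4, "female_samples": 1},
    {"country": "GB", "male_samples": 1, "female_samples": 0},
]
print(summarize(breakdown))  # ('male', 0.83)
```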

How it Works

The library uses a custom packed binary format, described by the struct format string 4sHBB:

  • 4 bytes: BLAKE2b hash prefix of the name.
  • 2 bytes: ISO 3166-1 numeric country code.
  • 1 byte: Male sample count.
  • 1 byte: Female sample count.
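
The layout above maps directly onto Python's struct module. The following is a minimal sketch of packing and unpacking one entry. The little-endian byte order, the lower-casing of names before hashing, taking the first 4 bytes of the default BLAKE2b digest, and clamping counts to 255 are all assumptions; the source only specifies the 4sHBB field layout. The helper names name_key and pack_entry are hypothetical.

```python
import hashlib
import struct

# Assumption: little-endian, no padding. Only the "4sHBB" layout is documented.
RECORD = struct.Struct("<4sHBB")


def name_key(name):
    # Assumption: names are trimmed, lower-cased and UTF-8 encoded before
    # hashing, and the "prefix" is the first 4 bytes of the 64-byte digest.
    normalized = name.strip().lower().encode("utf-8")
    return hashlib.blake2b(normalized).digest()[:4]


def pack_entry(name, country_numeric, male, female):
    # Each count has to fit in one byte, so this sketch clamps at 255.
    return RECORD.pack(name_key(name), country_numeric,
                       min(male, 255), min(female, 255))


raw = pack_entry("John", 840, 4, 1)  # 840 is the ISO 3166-1 numeric code for the US
print(RECORD.size)             # 8
print(RECORD.unpack(raw)[1:])  # (840, 4, 1)
```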

Because these 8-byte entries are sorted by hash, the library can run a binary search directly on the memory-mapped file. Only the pages touched by the search are read, so the memory footprint stays small regardless of database size.
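
The lookup described above can be sketched as follows. This is an illustration under stated assumptions, not the library's actual implementation: the byte order, the name normalization, and the helper names build and lookup are hypothetical. It builds a tiny database in a temporary file, maps it read-only, and uses a lower-bound binary search to find the first matching hash before collecting the per-country entries that follow it.

```python
import hashlib
import mmap
import os
import struct
import tempfile

RECORD = struct.Struct("<4sHBB")  # assumed byte order; 8 bytes per entry


def name_key(name):
    # Assumed normalization: lower-case UTF-8, first 4 bytes of the digest.
    return hashlib.blake2b(name.lower().encode("utf-8")).digest()[:4]


def build(path, rows):
    # rows: (name, numeric country code, male count, female count)
    # Sorting the packed bytes orders entries by their leading 4-byte hash.
    packed = sorted(RECORD.pack(name_key(n), c, m, f) for n, c, m, f in rows)
    with open(path, "wb") as fh:
        fh.write(b"".join(packed))


def lookup(mm, name):
    key = name_key(name)
    count = len(mm) // RECORD.size
    lo, hi = 0, count
    # Lower-bound binary search: first entry whose hash is >= key.
    while lo < hi:
        mid = (lo + hi) // 2
        offset = mid * RECORD.size
        if mm[offset:offset + 4] < key:
            lo = mid + 1
        else:
            hi = mid
    # A name may have one entry per country, so read the whole run.
    found = []
    while lo < count:
        entry_hash, country, male, female = RECORD.unpack_from(mm, lo * RECORD.size)
        if entry_hash != key:
            break
        found.append((country, male, female))
        lo += 1
    return found


path = os.path.join(tempfile.mkdtemp(), "names.bin")
build(path, [("john", 840, 4, 1), ("john", 826, 1, 0), ("maria", 724, 0, 3)])
with open(path, "rb") as fh:
    with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        john_rows = sorted(lookup(mm, "John"))
        maria_rows = lookup(mm, "maria")
print(john_rows)   # [(826, 1, 0), (840, 4, 1)]
print(maria_rows)  # [(724, 0, 3)]
```

Because the key is only a 4-byte hash prefix, two different names can in principle share a key; a production lookup would need a policy for such collisions, which this sketch does not address.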

Contribution

Data contributions are managed through contribute.json in the main repository.

  1. Add your name data to the JSON list.
  2. Ensure country_code is the numeric ISO 3166-1 value.
  3. Submit a Pull Request.

The CI/CD pipeline automatically validates the JSON and recompiles the names.bin database upon merging.
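
Before opening a Pull Request, a contributor might run a local check along these lines. This is only a sketch: the schema of contribute.json is not documented here, so apart from country_code the field names (name, male, female) are hypothetical, and the real CI validation may apply different rules. The 0 to 255 limit on counts follows from the one-byte count fields in the binary format.

```python
import json


def validate_entries(text):
    """Return a list of error strings for a contribute.json-style payload."""
    entries = json.loads(text)
    if not isinstance(entries, list):
        return ["top level must be a JSON list"]
    errors = []
    for i, entry in enumerate(entries):
        if not isinstance(entry, dict):
            errors.append(f"entry {i}: must be an object")
            continue
        name = entry.get("name")  # hypothetical field name
        if not isinstance(name, str) or not name.strip():
            errors.append(f"entry {i}: name must be a non-empty string")
        code = entry.get("country_code")
        # ISO 3166-1 numeric codes are three-digit numbers, so 1..999.
        if type(code) is not int or not 1 <= code <= 999:
            errors.append(f"entry {i}: country_code must be a numeric ISO 3166-1 code")
        for field in ("male", "female"):  # hypothetical field names
            count = entry.get(field, 0)
            if type(count) is not int or not 0 <= count <= 255:
                errors.append(f"entry {i}: {field} must be an integer from 0 to 255")
    return errors


sample = """[
  {"name": "John", "country_code": 840, "male": 4, "female": 1},
  {"name": "Ana", "country_code": "ES", "male": 0, "female": 2}
]"""
print(validate_entries(sample))
# ['entry 1: country_code must be a numeric ISO 3166-1 code']
```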

License

MIT - See LICENSE file for details.
