Skip to main content

Clean, functional data processing for human-centric applications. Normalize and standardize names, emails, phones, departments, and job titles with a single unified API.

Project description

HumanMint

Clean, functional data processing for human-centric applications. Normalize and standardize names, emails, phones, addresses, departments, job titles, and organizations with a single unified API.

Installation

pip install humanmint

Optional Dependencies

Install with optional features for advanced use cases:

# Development tools (testing, linting, type checking)
pip install humanmint[dev]

# US address parsing support
pip install humanmint[address]

# Pandas integration for DataFrames
pip install humanmint[pandas]

# All optional features
pip install humanmint[dev,address,pandas]

Feature Details:

  • dev: Includes pytest, black, flake8, mypy, and faker for development and testing
  • address: Enables advanced US postal address parsing with usaddress (optional—falls back to basic regex parsing)
  • pandas: Adds DataFrame integration via df.humanmint.clean(...) accessor (optional—gracefully skipped if not installed)

Note on Optional Dependencies:

  • Address without usaddress: The normalize_address() function still works! It falls back to basic regex-based parsing (street, city, state, ZIP extraction). For advanced address parsing, install the address extra.
  • Pandas not installed: The base library works fine without pandas. The DataFrame accessor (df.humanmint.clean()) is silently skipped if pandas isn't available, so your code won't break.

Quick Start

from humanmint import mint

result = mint(
    name="Dr. John Smith",
    email="JOHN.SMITH@CITY.GOV",
    phone="(202) 555-0173",  # Use real area codes; 555 is reserved for fictional use
    address="123 Main St, Springfield, IL 62701",
    department="Public Works 850-123-1234 ext 200",
    title="Chief of Police"
)

print(result.name)              # {'full': 'John Smith', 'first': 'John', 'last': 'Smith', 'gender': 'Male'}
print(result.email)             # 'john.smith@city.gov'
print(result.phone)             # '+1 202-555-0173'
print(result.address)           # {'street': '123 Main St', 'city': 'Springfield', 'state': 'IL', 'zip': '62701'}
print(result.department)        # 'Public Works'
print(result.department_category) # 'Infrastructure'
print(result.title)             # {'canonical': 'police chief', 'is_valid': True, ...}

Why HumanMint?

Problem

Real-world contact data is messy:

  • Names with titles: "Dr. Jane Smith, PhD"
  • Phone numbers in various formats: "(202) 555-0123" vs "202.555.0123" (use real area codes; 555 is reserved for fiction)
  • Departments with codes and noise: "000171 - Public Works 202-555-0150 ext 200"
  • Job titles that need standardization: "Chief of Police" -> canonical form
  • Emails with inconsistent casing: "JOHN.SMITH@EXAMPLE.COM"
  • Single names that need gender inference: "Jo", "Al", "Ty"

Solution

HumanMint cleans and standardizes everything in one call.

Features

  • Unified API: One mint() call to normalize names, emails, phones, addresses, departments, titles, and organizations.
  • Rich names: Normalize, enrich, infer gender, detect nicknames; handles single, hyphenated, and titled names.
  • Emails: Lowercasing, validation, generic inbox detection, domain extraction, free provider detection.
  • Phones: E164/pretty formats, extension extraction, validation, type detection (mobile/landline/fax), VoIP and impossible number detection.
  • Addresses: US postal address parsing (street, city, state, ZIP).
  • Departments: 23,452 mappings to 64 canonicals with fuzzy matching, categorization, and custom overrides.
  • Titles: Canonicalization and fuzzy matching against curated heuristics with confidence scores.
  • Organizations: Normalize agency/organization names by removing civic prefixes and suffixes.
  • Pandas: df.humanmint.clean(...) accessor with heuristic column guessing or explicit name_col/email_col/... mapping.
  • CLI: humanmint clean input.csv output.csv with auto-guessing or explicit column flags.
  • Batch processing: Parallel processing with bulk() for handling large datasets.
  • Gzip-backed data: Reference data ships as .json.gz caches for fast loads; raw sources live under src/humanmint/data/original/.
  • Ethics: Gender inference is probabilistic from historical name data and not a determination of identity; downstream use should respect that.

API Reference

Core Functions

mint(name, email, phone, address, department, title, organization, ...)

One function. All your data cleaned.

from humanmint import mint

result = mint(
    name="Jane Doe",                    # optional
    email="jane@example.com",           # optional
    phone="(202) 555-0172",             # optional (use real area codes)
    address="123 Main St, City, ST ZIP", # optional
    department="Water Utilities",       # optional
    title="Chief of Water",             # optional
    organization="City of Springfield"  # optional
)

# Access cleaned data
result.name                 # dict: {full, first, last, middle, suffix, gender}
result.email                # str
result.phone                # str (E.164 format)
result.address              # dict: {street, city, state, zip}
result.department           # str (canonical)
result.department_category  # str
result.title                # dict: {raw, cleaned, canonical, is_valid, confidence}
result.organization         # dict: {raw, normalized, canonical, confidence}

# Convert to dict for JSON
result.model_dump()

bulk(records, workers=4, progress=False)

Process multiple records in parallel.

from humanmint import bulk

records = [
    {"name": "Alice", "email": "alice@example.com"},
    {"name": "Bob", "email": "bob@example.com"},
]

results = bulk(records, workers=4, progress=True)
# Returns: list[MintResult]

compare(result_a, result_b)

Compare two normalized records and return similarity score (0-100).

from humanmint import mint, compare

r1 = mint(name="John Smith", email="john@example.com")
r2 = mint(name="Jon Smith", email="john.smith@example.com")

similarity = compare(r1, r2)  # Returns float 0-100

Names Module

Import: from humanmint.names import ...

Function Purpose Returns
normalize_name(raw) Parse and normalize a full name Dict: {first, last, middle, suffix, full, canonical, is_valid}
infer_gender(first_name, confidence=False) Infer gender from first name Dict: {gender, confidence} or str
enrich_name(normalized_dict, include_gender=True) Add gender/enrichment data to normalized name Dict with enriched fields
detect_nickname(first_name) Detect if a name is a nickname, return canonical form Optional[str]
get_nickname_variants(canonical_name) Get all known nicknames for a name set[str]
get_name_equivalents(name) Get all equivalent forms (nicknames + canonicals) set[str]
compare_first_names(name1, name2, use_nicknames=True) Compare two first names with fuzzy matching float (0-1)
compare_last_names(last1, last2) Compare two last names float (0-1)
match_names(raw1, raw2, strict=False) Full name matching with detailed scoring Dict: {score, is_match, reasons, ...}

Examples:

from humanmint.names import normalize_name, infer_gender, detect_nickname

# Normalize
result = normalize_name("Dr. Jane Smith, PhD")
# {'first': 'Jane', 'last': 'Smith', 'full': 'Jane Smith', 'gender': None, ...}

# Infer gender
gender = infer_gender("Jane")  # {'gender': 'Female', 'confidence': 0.98}

# Detect nickname
canonical = detect_nickname("Bobby")  # 'Robert'
variants = get_name_equivalents("Robert")  # {'Robert', 'Bob', 'Bobby', 'Rob', ...}

Emails Module

Import: from humanmint.emails import ...

Function Purpose Returns
normalize_email(raw, generic_inboxes=None) Normalize email and extract metadata Dict: {email, local, domain, is_generic, is_free_provider, is_valid}
is_free_provider(domain) Check if domain is a free email provider bool
guess_email(name, domain, known=[]) Guess likely email pattern from known examples str (email or empty)
get_pattern_scores(known) Analyze known emails and return detected patterns list[(pattern_id, confidence)]
describe_pattern(pattern_id) Get documentation for an email pattern Optional[Dict]

Examples:

from humanmint.emails import normalize_email, guess_email, is_free_provider

# Normalize
result = normalize_email("JOHN.SMITH@GMAIL.COM")
# {'email': 'john.smith@gmail.com', 'domain': 'gmail.com', 'is_free_provider': True, ...}

# Check if free
is_free = is_free_provider("gmail.com")  # True

# Guess email pattern
known = [("John Smith", "jsmith@company.com"), ("Jane Doe", "jdoe@company.com")]
guess = guess_email("Bob Jones", "company.com", known)  # 'bjones@company.com'

Phones Module

Import: from humanmint.phones import ...

Function Purpose Returns
normalize_phone(raw, country="US") Normalize phone to E.164 format Dict: {e164, pretty, extension, country, type, is_valid}
detect_impossible(phone_dict) Detect if phone appears impossible/test/fake bool
detect_fax_pattern(phone_dict) Detect if phone matches known fax patterns bool
detect_voip_pattern(phone_dict) Detect if phone matches VoIP provider patterns bool

Examples:

from humanmint.phones import normalize_phone, detect_fax_pattern

# Normalize
result = normalize_phone("(202) 555-0123 ext 201")
# {'e164': '+12025550123', 'pretty': '+1 202-555-0123', 'extension': '201', 'type': 'mobile', ...}

# Detect fax
fax = detect_fax_pattern(result)  # False

Departments Module

Import: from humanmint.departments import ...

Function Purpose Returns
normalize_department(raw_dept) Normalize dept name (remove noise, standardize format) str
find_best_match(dept_name, threshold=0.6, normalize=True) Find best canonical match Optional[str]
find_all_matches(dept_name, threshold=0.6, top_n=3, normalize=True) Find all matches ranked by similarity list[str]
match_departments(dept_names, threshold=0.6, normalize=True) Match multiple depts at once dict[str, Optional[str]]
get_similarity_score(dept1, dept2) Calculate similarity between two depts float (0-1)
get_department_category(dept) Get category for canonical dept Optional[str]
get_all_categories() Get all available categories set[str]
get_departments_by_category(category) Get depts in a specific category list[str]
categorize_departments(depts) Categorize multiple depts dict[str, Optional[str]]
get_canonical_departments() Get all canonical dept names list[str]
is_canonical(dept) Check if dept is canonical bool
load_mappings() Load all dept mappings dict[str, list[str]]
get_mapping_for_original(original) Get canonical for original name Optional[str]
get_originals_for_canonical(canonical) Get all original names for canonical list[str]

Examples:

from humanmint.departments import normalize_department, find_best_match, get_department_category

# Normalize
clean = normalize_department("000171 - Police Department")  # 'Police'

# Find match
match = find_best_match("Police Dept")  # 'Police'
category = get_department_category("Police")  # 'Public Safety'

# Find all matches
all_matches = find_all_matches("Finance", threshold=0.7)  # ['Finance', 'Budget']

Titles Module

Import: from humanmint.titles import ...

Function Purpose Returns
normalize_title_full(raw_title, threshold=0.6, dept_canonical=None, overrides=None) Full normalization with confidence TitleResult Dict: {raw, cleaned, canonical, is_valid, confidence}
normalize_title(raw_title) Core title cleaning str
find_best_match(title, threshold=0.6, normalize=True) Find best canonical match tuple[Optional[str], float]
find_all_matches(title, threshold=0.6, top_n=3) Find all matches list[str]
get_similarity_score(title1, title2) Calculate similarity float (0-1)
get_canonical_titles() Get all canonical titles list[str]
is_canonical(title) Check if title is canonical bool
get_mapping_for_variant(variant) Get canonical for variant Optional[str]
get_all_mappings() Get all title mappings dict[str, str]

Examples:

from humanmint.titles import normalize_title_full, find_best_match, get_canonical_titles

# Full normalization
result = normalize_title_full("0001 - Chief of Police (Downtown)")
# {'raw': '0001 - Chief of Police (Downtown)', 'cleaned': 'Chief of Police', 'canonical': 'police chief', ...}

# Find match with confidence
title, confidence = find_best_match("Police Chief", threshold=0.7)
# ('police chief', 0.95)

Addresses Module

Import: from humanmint.addresses import ...

Function Purpose Returns
normalize_address(raw) Parse US postal address Optional[Dict]: {street, city, state, zip, canonical}

Examples:

from humanmint.addresses import normalize_address

result = normalize_address("123 Main St, Springfield, IL 62701")
# {'street': '123 Main St', 'city': 'Springfield', 'state': 'IL', 'zip': '62701', ...}

Organizations Module

Import: from humanmint.organizations import ...

Function Purpose Returns
normalize_organization(raw) Normalize agency/org name (remove civic suffixes) Optional[Dict]: {raw, normalized, canonical, confidence}

Examples:

from humanmint.organizations import normalize_organization

result = normalize_organization("City of Springfield")
# {'raw': 'City of Springfield', 'normalized': 'Springfield', 'canonical': 'Springfield', ...}

Customization

You can steer canonicals without forking data files:

  • Department overrides: Map normalized departments to your preferred canonical. Example: mint(department="RevOps", dept_overrides={"revenue operations": "Sales"}) or pass the same dict into bulk()/pandas/CLI.
  • Title overrides / ignores: Map cleaned titles to a canonical string with title_overrides. To ignore a title, set it to None and drop records where result.title is None or result.title["is_valid"] is False.

Examples

Government Contact

result = mint(
    name="Chief Robert Patterson",
    email="robert.patterson@police.gov",
    phone="(202) 555-0178",  # Real area code; 555 is reserved
    department="000171 - Police",
    title="Chief of Police"
)

# Cleaned:
# name: {'full': 'Robert Patterson', 'first': 'Robert', 'last': 'Patterson', 'gender': 'Male'}
# email: 'robert.patterson@police.gov'
# phone: '+1 202-555-0178'
# department: 'Police' (category: 'Public Safety')
# title: {'canonical': 'police chief', 'is_valid': True}

Messy Data

result = mint(
    name="Dr. Jane Smith, PhD",
    email="JANE@EXAMPLE.COM",
    phone="415 555 0123",  # Real area code; 555 is reserved
    department="Planning and Development 415-555-0123 ext 200"
)

# Cleaned:
# name: {'full': 'Jane Smith', 'first': 'Jane', 'last': 'Smith', 'gender': 'Female'}
# email: 'jane@example.com'
# phone: '+1 415-555-0123'
# department: 'Planning' (category: 'Planning & Development')

Single Names

result = mint(name="Madonna")
# name: {'full': 'Madonna', 'first': 'Madonna', 'last': '', 'gender': 'Female'}

result = mint(name="Jo")
# name: {'full': 'Jo', 'first': 'Jo', 'last': '', 'gender': 'Female'}

Advanced Usage

Use individual modules for specific needs. See API Reference above for the complete list of functions across all modules:

# Names: normalize, enrich, infer gender, detect nicknames
from humanmint.names import normalize_name, infer_gender, detect_nickname

# Emails: validate, detect free providers, guess patterns
from humanmint.emails import normalize_email, is_free_provider, guess_email

# Phones: normalize, detect types, identify fax/VoIP/test numbers
from humanmint.phones import normalize_phone, detect_fax_pattern, detect_voip_pattern

# Departments: normalize, match, categorize
from humanmint.departments import normalize_department, find_best_match, get_department_category

# Titles: normalize, match, get mappings
from humanmint.titles import normalize_title_full, find_best_match as match_title

# Addresses: parse and normalize
from humanmint.addresses import normalize_address

# Organizations: normalize and extract canonical names
from humanmint.organizations import normalize_organization

# Batch processing and comparison
from humanmint import bulk, compare

Testing

pytest -q unittests

CLI

Clean a CSV (auto-detecting columns, or override with flags):

humanmint clean input.csv output.csv --name-col name --email-col email

Regenerating data caches

If you edit the raw datasets in src/humanmint/data/original/, rebuild the packaged .json.gz caches with:

python scripts/build_caches.py

Performance

  • Batch processing: ~0.1-0.2ms per contact
  • 10 contacts: ~1-2ms
  • 100 contacts: ~10-20ms

What Gets Cleaned

Input Output
name="Dr. Jane Smith, PhD" first='Jane', last='Smith'
email="JOHN@EXAMPLE.COM" john@example.com
phone="(202) 555-0101 x101" +1 202-555-0101, extension: 101
address="123 Main St, Springfield, IL 62701" street='123 Main St', city='Springfield', state='IL', zip='62701'
department="000171 - Public Works 202-555-0150 ext 200" Public Works (category: Infrastructure)
title="0001 - Chief of Police (Downtown)" police chief
organization="City of Springfield Police" normalized='Springfield', canonical='Springfield'
name="Jo" first='Jo', gender: Female

Version

0.1.0

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

humanmint-0.1.8.tar.gz (1.1 MB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

humanmint-0.1.8-py3-none-any.whl (1.1 MB view details)

Uploaded Python 3

File details

Details for the file humanmint-0.1.8.tar.gz.

File metadata

  • Download URL: humanmint-0.1.8.tar.gz
  • Upload date:
  • Size: 1.1 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.9

File hashes

Hashes for humanmint-0.1.8.tar.gz
Algorithm Hash digest
SHA256 9f115087d7e4e7f6c11f921e326930d3684b2aa38ae09bb14c3378c4b15cd737
MD5 e901e0bdba29d7b8ea0942d16e1f9ed9
BLAKE2b-256 3b4a001b6001ccdb79446a16a1352f99eb27872cfc79d0918e98dbad6eec0b0a

See more details on using hashes here.

File details

Details for the file humanmint-0.1.8-py3-none-any.whl.

File metadata

  • Download URL: humanmint-0.1.8-py3-none-any.whl
  • Upload date:
  • Size: 1.1 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.9

File hashes

Hashes for humanmint-0.1.8-py3-none-any.whl
Algorithm Hash digest
SHA256 37f6bc224295330867071b83c23f0d5c2a231de3cabb9d5a0adbe5769afeb030
MD5 d0198579751d121b2d6d6d421daac9f7
BLAKE2b-256 0a8fae907bcb055f3db90f6514c571672d51aecf1e3a9153aceedf3581c504ac

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page