
Clean, functional data processing for human-centric applications. Normalize and standardize names, emails, phones, departments, and job titles with a single unified API.

HumanMint v2

HumanMint cleans and normalizes messy contact data with one line of code. It standardizes names, emails, phones, addresses, departments, titles, and organizations. It is a general-purpose cleaner for B2B and public-sector data, and ships with curated public-sector mappings you won’t find anywhere else.

from humanmint import mint

result = mint(
    name="Dr. John Q. Smith, PhD",
    email="JOHN.SMITH@CITY.GOV",
    phone="(202) 555-0173 ext 456",
    department="001 - Public Works Dept",
    title="Chief of Police",
    address="123 N. Main St Apt 4B, Madison, WI 53703",
    organization="City of Madison Police Department",
)

result.name_standardized          # "John Q Smith"
result.email_standardized         # "john.smith@city.gov"
result.phone_pretty               # "+1 202-555-0173"
result.department_canonical       # "Public Works"
result.title_canonical            # "police chief"
result.address_canonical          # "123 N. Main St Apt 4B Madison WI 53703 US"

# Split multi-person names when needed
results = mint(name="John and Jane Smith", split_multi=True)
# returns [MintResult(John Smith), MintResult(Jane Smith)]

Why HumanMint

  • General-purpose: works for government data and B2B (execs, VPs, directors, managers) without switching libraries.
  • Real-world chaos: titles inside names, departments with numbers/phone extensions, strange-casing emails, smashed-together addresses.
  • Unique data: 23K+ department variants → 64 categories; 73K+ titles with curated canonicals plus BLS (U.S. Bureau of Labor Statistics) data; context-aware (department-informed) title mapping that is not available off the shelf.
  • Safe defaults: length guards, optional aggressive cleaning, semantic conflict checks, bulk dedupe, and optional multi-person name splitting.

Department & Title mapping you can’t get elsewhere

Curated public-sector mappings that solve the “impossible to Google” parts of contact normalization. Works for governments and B2B roles (CEOs, VPs, Directors, Managers) alike.

"City Administration"    -> "Administration"       [administration]
"Finance Department"     -> "Finance"              [finance]
"Public Works"           -> "Public Works"         [infrastructure]
"Police Department"      -> "Police"               [public safety]

Titles get similar treatment across 73K standardized forms with optional department context to boost accuracy.
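To make the variant → canonical → category idea concrete, here is a minimal sketch of a lookup table built only from the four examples above. The names CANONICAL_DEPARTMENTS and lookup_department are hypothetical; HumanMint's real dataset and matching logic are far larger and are not shown in this README.

```python
import re

# Hypothetical miniature mapping table, copied from the four examples above.
# This is an illustration only, not HumanMint's internal data or lookup code.
CANONICAL_DEPARTMENTS = {
    "city administration": ("Administration", "administration"),
    "finance department": ("Finance", "finance"),
    "public works": ("Public Works", "infrastructure"),
    "police department": ("Police", "public safety"),
}


def lookup_department(raw: str):
    """Return (canonical, category) for a raw department string, or None."""
    # Collapse runs of whitespace and ignore case before the dictionary lookup.
    key = re.sub(r"\s+", " ", raw).strip().lower()
    return CANONICAL_DEPARTMENTS.get(key)


print(lookup_department("  POLICE   Department "))  # ('Police', 'public safety')
print(lookup_department("Parks"))                   # None
```

An exact-match table like this only handles spelling variants it has already seen, which is presumably why a large curated variant list matters.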

All fields in one library

Names, emails, phones, addresses, departments, titles, organizations—one pipeline. Most libraries clean only one field (just names or just phones); HumanMint normalizes the entire record with canonicalization, categorization, and confidence.

Fast

Typical workloads run in under a millisecond per record (0.28 to 0.56 ms in the benchmark below), with multithreading and built-in dedupe.
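As a rough picture of what "multithreading plus dedupe" can mean, the sketch below cleans records on a thread pool and then keeps the first record per cleaned email. It uses only the standard library; clean_record and dedupe_and_clean are hypothetical stand-ins, not HumanMint functions, and the dedupe key (email) is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def clean_record(record: dict) -> dict:
    """Hypothetical stand-in for per-record cleaning: trim and lowercase the email."""
    return {**record, "email": record["email"].strip().lower()}


def dedupe_and_clean(records: list[dict], workers: int = 4) -> list[dict]:
    """Clean records in parallel, then keep the first record seen per email."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input records.
        cleaned = list(pool.map(clean_record, records))
    unique: dict[str, dict] = {}
    for rec in cleaned:
        unique.setdefault(rec["email"], rec)  # first occurrence wins
    return list(unique.values())


records = [
    {"name": "Alice", "email": "alice@example.com"},
    {"name": "Alice B.", "email": " ALICE@EXAMPLE.COM "},
    {"name": "Bob", "email": "bob@example.com"},
]
print([r["name"] for r in dedupe_and_clean(records)])  # ['Alice', 'Bob']
```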

AI extraction (optional)

Install the ML extra (pip install humanmint[ml]) and pass text= with use_gliner=True to extract from unstructured text, then normalize. Structured fields you pass always win. You can also pass a GlinerConfig (gliner_cfg) to control schema, threshold, and GPU usage. GLiNER extraction is experimental and may be inaccurate; prefer structured inputs when available.

Example (signature block → canonicalized):

text = """
John A. Miller
Deputy Director of Public Works
City of Springfield, Missouri
305 E McDaniel St, Springfield, MO 65806
Phone: (417) 864-1234
Email: jmiller@springfieldmo.gov
"""

result = mint(text=text, use_gliner=True)

# Result:
# MintResult(
#   name: John A Miller
#   email: jmiller@springfieldmo.gov
#   phone: +1 417-864-1234
#   department: Public Works
#   title:
#     raw: Deputy Director
#     normalized: Deputy Director
#     canonical: deputy director
#   address: None
#   organization: Springfield Missouri
# )

You can also batch texts: mint(texts=[...], use_gliner=True) returns a list of MintResult objects.

Advanced GLiNER configuration:

from humanmint.gliner import GlinerConfig

cfg = GlinerConfig(
    threshold=0.85,    # optional confidence threshold
    use_gpu=True,      # move model to GPU if available
    schema=None,       # custom schema dict if desired
    extractor=None,    # reuse a preloaded GLiNER2 instance
)

result = mint(text=text, use_gliner=True, gliner_cfg=cfg)
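The rule that structured fields always win over extracted ones can be pictured as a simple merge. This is a sketch of the precedence rule only: merge_fields is a hypothetical helper, and treating None as "not supplied" is an assumption rather than documented HumanMint behavior.

```python
def merge_fields(extracted: dict, structured: dict) -> dict:
    """Combine extracted and structured fields; structured values win when present."""
    merged = dict(extracted)
    for key, value in structured.items():
        # Assumption: a structured field left as None does not erase an extracted value.
        if value is not None:
            merged[key] = value
    return merged


extracted = {
    "name": "John A. Miller",
    "email": "jmiller@springfieldmo.gov",
    "title": "Deputy Director",
}
structured = {"title": "Deputy Director of Public Works", "phone": None}

print(merge_fields(extracted, structured)["title"])  # Deputy Director of Public Works
```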

What’s new in v2 (vs v1)

  • Clear, canonical property names: name_standardized, email_standardized, phone_standardized, title_canonical, department_canonical (legacy aliases removed).
  • Explainable comparisons: compare(..., explain=True) shows component scores/penalties.
  • Multi-person name splitting: split_multi=True handles “John and Jane Smith”.
  • Name enrichment: detects nicknames and generational suffixes without polluting the main name fields.
  • Optional GLiNER extraction for unstructured text via use_gliner=True and GlinerConfig; multi-person GLiNER input raises a clear error.
  • Structured-field pipeline remains deterministic and fast; GLiNER is opt-in and experimental.

Installation

pip install humanmint
# Optional extras:
#   pip install humanmint[address]  # usaddress parsing
#   pip install humanmint[pandas]   # DataFrame helpers
#   pip install humanmint[ml]       # GLiNER2 extraction

Quickstart

from humanmint import mint, compare, bulk

r1 = mint(name="Jane Doe", email="jane.doe@city.gov", department="Public Works", title="Engineer")
r2 = mint(name="J. Doe",  email="JANE.DOE@CITY.GOV", department="PW Dept",       title="Public Works Engineer")

score = compare(r1, r2)  # similarity 0–100
# Or with explanation:
score, why = compare(r1, r2, explain=True)
print("\n".join(why))

records = [
    {"name": "Alice", "email": "alice@example.com"},
    {"name": "Bob",   "email": "bob@example.com"},
]
results = bulk(records, workers=4)

Access Patterns

Quick reference (full field guide in docs/FIELDS.md):

  • Dict access: result.title["canonical"], result.department["canonical"], result.department["category"]
  • Properties (preferred): name_standardized, title_canonical, department_canonical, email_standardized, phone_standardized, address_canonical, organization_canonical
  • Full dicts: result.title, result.department, result.email, etc.

Recommended Properties (quick reference)

Names: name_standardized, name_first, name_last, name_middle, name_suffix, name_suffix_type, name_gender, name_nickname

Emails: email_standardized, email_domain, email_is_valid, email_is_generic_inbox, email_is_free_provider

Phones: phone_standardized, phone_e164, phone_pretty, phone_extension, phone_is_valid, phone_type

Departments: department_canonical, department_category, department_normalized, department_override

Titles: title_canonical, title_raw, title_normalized, title_is_valid, title_confidence, title_seniority

Addresses: address_canonical, address_raw, address_street, address_unit, address_city, address_state, address_zip, address_country

Organizations: organization_raw, organization_normalized, organization_canonical, organization_confidence

Use result.get("email.is_valid") or other dot paths to fetch nested dict values.
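For readers unfamiliar with dot paths, the following sketch shows how a path such as "email.is_valid" can be resolved against nested dicts. get_path is a hypothetical helper written for illustration; it is not HumanMint's result.get, whose handling of missing keys is not described here.

```python
def get_path(data: dict, path: str, default=None):
    """Walk a nested dict along a dot-separated path such as "email.is_valid"."""
    current = data
    for part in path.split("."):
        # Stop early if the path runs past a leaf value or hits a missing key.
        if not isinstance(current, dict) or part not in current:
            return default
        current = current[part]
    return current


record = {"email": {"standardized": "john.smith@city.gov", "is_valid": True}}

print(get_path(record, "email.is_valid"))             # True
print(get_path(record, "phone.e164", default="n/a"))  # n/a
```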

Comparing Records

from humanmint import compare
score = compare(r1, r2)  # 0–100
# >85 likely duplicate, >70 similar, <50 different
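The thresholds in the comment above can be turned into a small helper. This is a sketch, not part of HumanMint: classify_score is a hypothetical name, and the "unclear" label for scores from 50 to 70 is an assumption, since the thresholds above leave that range undefined.

```python
def classify_score(score: float) -> str:
    """Map a 0-100 similarity score to the bands quoted above."""
    if score > 85:
        return "likely duplicate"
    if score > 70:
        return "similar"
    if score < 50:
        return "different"
    # Assumption: 50 through 70 is not covered by the quoted thresholds.
    return "unclear"


for s in (92, 78, 60, 31):
    print(s, classify_score(s))
```

In a review workflow, the "unclear" band is the natural place to route records to a human rather than merging or separating them automatically.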

Batch & Export

from humanmint import bulk, export_json, export_csv, export_parquet, export_sql

results = bulk(records, workers=4, progress=True)
export_json(results, "out.json")
export_csv(results, "out.csv", flatten=True)

CLI

humanmint clean input.csv output.csv --name-col name --email-col email --phone-col phone --dept-col department --title-col title
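The command above expects an input CSV whose headers match the --*-col arguments. A minimal sketch for producing such a file with the standard library follows; the sample row reuses values from the first example, write_input_csv is a hypothetical helper, and the file is written to a temporary directory rather than the working directory.

```python
import csv
import tempfile
from pathlib import Path

# Column names taken from the CLI invocation above.
COLUMNS = ["name", "email", "phone", "department", "title"]


def write_input_csv(path, rows):
    """Write rows (dicts keyed by COLUMNS) to a CSV file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)


rows = [
    {
        "name": "Dr. John Q. Smith, PhD",
        "email": "JOHN.SMITH@CITY.GOV",
        "phone": "(202) 555-0173 ext 456",
        "department": "001 - Public Works Dept",
        "title": "Chief of Police",
    }
]

out_path = Path(tempfile.mkdtemp()) / "input.csv"
write_input_csv(out_path, rows)
print(out_path.read_text(encoding="utf-8").splitlines()[0])  # name,email,phone,department,title
```

Using csv.DictWriter matters here because the name contains a comma, which must be quoted to keep the columns aligned.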

Performance (benchmark)

Dataset    Time      Per Record    Throughput
1,000      561 ms    0.56 ms       1,783 rec/sec
10,000     3.1 s     0.31 ms       3,178 rec/sec
50,000     14.0 s    0.28 ms       3,576 rec/sec
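The per-record and throughput columns follow from the dataset size and total time. The sketch below redoes that arithmetic from the published figures; because the published times are rounded, the derived throughput matches the published values only to within about 2%.

```python
# (records, total seconds, published throughput in rec/sec), from the table above.
BENCHMARK = [
    (1_000, 0.561, 1_783),
    (10_000, 3.1, 3_178),
    (50_000, 14.0, 3_576),
]


def per_record_ms(records: int, seconds: float) -> float:
    """Average time per record in milliseconds."""
    return seconds * 1000 / records


def throughput(records: int, seconds: float) -> float:
    """Records processed per second."""
    return records / seconds


for records, seconds, published in BENCHMARK:
    print(
        f"{records:>6,} records: {per_record_ms(records, seconds):.2f} ms/record, "
        f"{throughput(records, seconds):,.0f} rec/sec (published: {published:,})"
    )
```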

Notes

  • US-focused address parsing; usaddress is used when available, otherwise heuristics.
  • Optional deps (pandas, pyarrow, sqlalchemy, rich, tqdm) enhance exports and progress bars.
  • Department and title datasets are curated and updated regularly for best accuracy.
