A package for Fuzzy Lookup

FuzzyLookup Documentation

Overview

fuzzylookup (version 0.1.0) is a Python package for fuzzy matching and lookups against CSV or Excel datasets. It supports both Arabic and English text. It uses pandas for data handling and rapidfuzz for fast string matching.

Installation

The package requires Python 3.8 or higher. Dependencies:

  • pandas>=1.3
  • openpyxl>=3.0 (for Excel support)
  • rapidfuzz>=3.0
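
As a quick sanity check, the requirements listed above can be inspected from Python before importing the package. The sketch below uses only the standard library. The helper name installed_versions is illustrative and not part of fuzzylookup, and it only reports what is installed; it does not compare against the minimum versions.

```python
import sys
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # Python 3.8 is the minimum stated in the Installation section.
    print("Python 3.8+:", sys.version_info >= (3, 8))
    for name, ver in installed_versions(["pandas", "openpyxl", "rapidfuzz"]).items():
        print(name, ver or "not installed")
```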

Core Features

1. Arabic Text Normalization

By default, fuzzylookup applies normalization to Arabic text (normalize_arabic=True). This process improves match quality by:

  • Removing tashkeel (diacritics).
  • Normalizing Alef variants (أ, إ, آ, ٱ) to a standard Alef (ا).
  • Normalizing Teh marbuta (ة) to Heh (ه).
  • Normalizing Alef maqsura (ى) to Yeh (ي).
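
The four normalization steps above can be sketched in a few lines of standard-library Python. This is an illustration of the described behavior, not the package's actual code: the function name normalize_arabic_text and the exact diacritic range (U+064B to U+0652, plus the superscript Alef U+0670) are assumptions.

```python
import re

# Assumed tashkeel range: fathatan (U+064B) through sukun (U+0652),
# plus the superscript alef (U+0670).
_TASHKEEL = re.compile("[\u064B-\u0652\u0670]")

# Character folding described in the list above.
_CHAR_MAP = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> alef
    "\u0625": "\u0627",  # alef with hamza below -> alef
    "\u0622": "\u0627",  # alef with madda -> alef
    "\u0671": "\u0627",  # alef wasla -> alef
    "\u0629": "\u0647",  # teh marbuta -> heh
    "\u0649": "\u064A",  # alef maqsura -> yeh
})


def normalize_arabic_text(text: str) -> str:
    """Strip diacritics, then fold character variants to a single form."""
    without_marks = _TASHKEEL.sub("", text)
    return without_marks.translate(_CHAR_MAP)
```

Applying the same function to both the query and the dataset column before scoring means that spellings differing only in diacritics or in these letter variants compare as equal.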

2. Positional Name-Aware Scoring

A standout feature is the name_aware scoring algorithm. Standard fuzzy scorers may rate "محمد كمال" and "كمال محمد" as near-identical even though the two names are in swapped order; with name_aware=True, fuzzylookup penalizes mismatched token order.

  • It compares names token-by-token in order.
  • The first token accounts for 60% of the score, and the remaining tokens account for 40%.
  • It blends this positional score with WRatio so that typos are tolerated while correct word order is still rewarded.
  • If WRatio exceeds the positional score by more than 15 points, the package treats this as a sign that token reordering is inflating the score and applies a penalty.
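
The scoring idea above can be sketched as follows. To keep the example free of third-party dependencies it swaps in difflib.SequenceMatcher for rapidfuzz, and a sorted-token comparison for WRatio, so the numbers will differ from the package's. The 60/40 split and the 15-point gap come from the description; the equal-weight blend, the size of the penalty (falling back to the positional score), and all function names are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def _ratio(a: str, b: str) -> float:
    """Similarity on a 0-100 scale (stdlib stand-in for rapidfuzz.fuzz.ratio)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def positional_score(query: str, candidate: str) -> float:
    """Compare tokens in order: 60% for the first token, 40% for the rest."""
    q, c = query.split(), candidate.split()
    if not q or not c:
        return 0.0
    first = _ratio(q[0], c[0])
    longest = max(len(q), len(c))
    if longest == 1:
        return first
    # Assumption: a token with no counterpart at the same position scores 0.
    rest = [
        _ratio(q[i], c[i]) if i < len(q) and i < len(c) else 0.0
        for i in range(1, longest)
    ]
    return 0.6 * first + 0.4 * sum(rest) / len(rest)


def order_insensitive_score(query: str, candidate: str) -> float:
    """Stand-in for WRatio: compare the strings after sorting their tokens."""
    return _ratio(" ".join(sorted(query.split())), " ".join(sorted(candidate.split())))


def name_aware_score(query: str, candidate: str, gap: float = 15.0) -> float:
    """Blend both scores, penalizing matches that only look good when reordered."""
    pos = positional_score(query, candidate)
    loose = order_insensitive_score(query, candidate)
    if loose - pos > gap:
        # Assumed penalty: ignore the order-insensitive score entirely.
        return pos
    # Assumed blend: equal weight to both scores.
    return (pos + loose) / 2
```

With this sketch, a query scored against the same name in the same order receives the maximum score, while the swapped-order pair from the text falls well below it because the order-insensitive score exceeds the positional score by more than the gap.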

API Reference

FuzzyLookup Class

The primary entry point is the FuzzyLookup class.

Parameters:

  • source (str, Path, or pd.DataFrame): Path to the CSV/Excel file or an existing DataFrame.
  • column (str): The column name to match against.
  • scorer (str): Matching algorithm to use (ratio, partial, token_sort, token_set, wratio). Defaults to wratio.
  • normalize_arabic (bool): Whether to strip diacritics and normalize characters. Defaults to True.
  • name_aware (bool): Enables positional name scoring. Recommended for full person names. Defaults to False.
  • encoding (str): File encoding for CSVs. Defaults to utf-8.

Methods

  • lookup(query: str, top_n: int = 5, min_score: float = 0.0, columns: list = None): Returns the top N matches for the search string as a list of dictionaries, each containing the row data, the calculated score, and _index.
  • lookup_best(query: str, min_score: float = 0.0, columns: list = None): Returns only the single best match, or None if no match meets the minimum score.
  • lookup_many(queries: list, top_n: int = 1, min_score: float = 0.0, columns: list = None): Batch lookup for multiple queries. Returns a dictionary mapping each query to its list of results.
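
To illustrate the return shape described above, the sketch below splits a lookup_many-style result into matched and unmatched queries. It runs on a hand-written dictionary rather than live package output. The "score" key name, the helper split_matches, and the sample values are assumptions; the description only states that each result carries row data, a score, and _index.

```python
from typing import Dict, List, Tuple


def split_matches(
    batch: Dict[str, List[dict]],
    threshold: float = 85.0,
    score_key: str = "score",  # assumed key name for the calculated score
) -> Tuple[Dict[str, dict], List[str]]:
    """Keep the best result per query if it clears the threshold."""
    matched: Dict[str, dict] = {}
    unmatched: List[str] = []
    for query, results in batch.items():
        best = max(results, key=lambda r: r[score_key], default=None)
        if best is not None and best[score_key] >= threshold:
            matched[query] = best
        else:
            unmatched.append(query)
    return matched, unmatched


# Hand-written sample in the documented shape (values made up for illustration).
sample = {
    "أحمد علي": [{"name": "احمد علي", "score": 96.0, "_index": 4}],
    "محمود حسن": [],
}
matched, unmatched = split_matches(sample)
```

Keeping _index in the retained result makes it possible to join a match back to the original row of the dataset.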

Properties

  • columns: Returns a list of the columns available in the loaded dataframe.
  • shape: Returns a tuple representing the shape of the underlying dataframe.

Workflow Example

Below is a typical workflow demonstrating how to instantiate the lookup object and perform queries:

from fuzzylookup import FuzzyLookup

# 1. Initialize the lookup instance with a dataset
fl = FuzzyLookup("names.csv", column="name", name_aware=True)

# 2. Perform a standard lookup for the top 3 matches
results = fl.lookup("محمد كمال", top_n=3)

# 3. Perform a strict lookup requiring a high match score
best_match = fl.lookup_best("كمال محمد", min_score=85.0)

# 4. Batch processing multiple names
batch_results = fl.lookup_many(["أحمد علي", "محمود حسن"], top_n=1)

Download files

Source Distribution

fuzzylookup-0.0.1.tar.gz (6.7 kB view details)

Uploaded Source

Built Distribution

fuzzylookup-0.0.1-py3-none-any.whl (7.6 kB view details)

Uploaded Python 3

File details

Details for the file fuzzylookup-0.0.1.tar.gz.

File metadata

  • Download URL: fuzzylookup-0.0.1.tar.gz
  • Size: 6.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for fuzzylookup-0.0.1.tar.gz
Algorithm Hash digest
SHA256 e9007a3c0fe477cc73ae5b4d0f2efe82b6736fc253805e291c7daa8e160474ff
MD5 34a2a28363aab6512241d93840a63d4d
BLAKE2b-256 f121078d38617263083454ee2b9ad8780e11684bcb3ee791b838f1521f1c3359

Provenance

The following attestation bundles were made for fuzzylookup-0.0.1.tar.gz:

Publisher: python-publish.yml on Moda141/Fuzzylookup

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file fuzzylookup-0.0.1-py3-none-any.whl.

File metadata

  • Download URL: fuzzylookup-0.0.1-py3-none-any.whl
  • Size: 7.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for fuzzylookup-0.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 154175edc99403f277a21376afc1c251dc74bc2b0817dd72bdc9b7360aaaace2
MD5 2667043db24123c3775481406042d2e9
BLAKE2b-256 63fb51774656b020e6de9ffcf0ff65e3858a2d1b6101f8a02db89a342bc75360

Provenance

The following attestation bundles were made for fuzzylookup-0.0.1-py3-none-any.whl:

Publisher: python-publish.yml on Moda141/Fuzzylookup

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
