A package for Fuzzy Lookup
FuzzyLookup Documentation
Overview
fuzzylookup (version 0.1.0) is a Python package designed for fuzzy matching and lookups against CSV or Excel datasets [cite: 1, 2, 3]. It explicitly supports both Arabic and English text [cite: 2, 3]. Under the hood, it leverages pandas for data handling and rapidfuzz for high-performance string matching [cite: 2, 3].
Installation
The package requires Python 3.8 or higher [cite: 3]. Dependencies include:
- pandas>=1.3 [cite: 3]
- openpyxl>=3.0 (for Excel support) [cite: 3]
- rapidfuzz>=3.0 [cite: 3]
Core Features
1. Arabic Text Normalization
By default, fuzzylookup applies normalization to Arabic text (normalize_arabic=True) [cite: 2]. This process improves match quality by:
- Removing tashkeel (diacritics) [cite: 2].
- Normalizing Alef variants (أ, إ, آ, ٱ) to a standard Alef (ا) [cite: 2].
- Normalizing Teh marbuta (ة) to Heh (ه) [cite: 2].
- Normalizing Alef maqsura (ى) to Yeh (ي) [cite: 2].
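The four normalization steps above can be sketched in a few lines of standard-library Python. This is an illustration only, not the package's own code: the helper name normalize_arabic_text and the exact diacritic range (U+064B to U+0652 plus the superscript Alef U+0670) are assumptions made for this example.

```python
import re

# Assumed tashkeel range: fathatan (U+064B) through sukun (U+0652),
# plus the superscript alef (U+0670).
_TASHKEEL = re.compile("[\u064B-\u0652\u0670]")

# Character replacements taken from the list above.
_CHAR_MAP = str.maketrans({
    "\u0623": "\u0627",  # Alef with hamza above -> Alef
    "\u0625": "\u0627",  # Alef with hamza below -> Alef
    "\u0622": "\u0627",  # Alef with madda -> Alef
    "\u0671": "\u0627",  # Alef wasla -> Alef
    "\u0629": "\u0647",  # Teh marbuta -> Heh
    "\u0649": "\u064A",  # Alef maqsura -> Yeh
})


def normalize_arabic_text(text: str) -> str:
    """Hypothetical helper: strip diacritics, then unify letter variants."""
    without_marks = _TASHKEEL.sub("", text)
    return without_marks.translate(_CHAR_MAP)
```

Applying the same function to both the query and the dataset column before scoring means spelling variants of one name compare as equal strings.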
2. Positional Name-Aware Scoring
A standout feature is the name_aware scoring algorithm [cite: 2]. Standard fuzzy matching algorithms may score "محمد كمال" and "كمال محمد" almost identically, but with name_aware=True fuzzylookup penalizes tokens that appear in the wrong order [cite: 2].
- It compares names token-by-token in order [cite: 2].
- The first token accounts for 60% of the score, and the remaining tokens account for 40% [cite: 2].
- It blends this positional score with WRatio to handle both typos and correct word order gracefully [cite: 2].
- If WRatio exceeds the positional score by more than 15 points, it indicates token reordering is inflating the score, triggering a penalty [cite: 2].
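The scoring rules above can be approximated with the sketch below. It is not the package's implementation. The 60/40 split and the 15-point threshold come from the description; everything else is an assumption made for illustration: difflib stands in for rapidfuzz, a sorted-token comparison stands in for WRatio, the blend is an equal average, and the "penalty" is modeled as falling back to the positional score alone.

```python
from difflib import SequenceMatcher

FIRST_TOKEN_WEIGHT = 0.6  # from the description above
REST_WEIGHT = 0.4         # from the description above
REORDER_GAP = 15.0        # from the description above
BLEND = 0.5               # assumption: equal blend of both scores


def similarity(a: str, b: str) -> float:
    """Stand-in 0-100 similarity (the package uses rapidfuzz)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def order_insensitive(a: str, b: str) -> float:
    """Stand-in for WRatio: compare after sorting the tokens."""
    return similarity(" ".join(sorted(a.split())), " ".join(sorted(b.split())))


def positional_score(query: str, candidate: str) -> float:
    """Compare tokens pairwise in order; the first token weighs 60%."""
    q, c = query.split(), candidate.split()
    if not q or not c:
        return 0.0
    first = similarity(q[0], c[0])
    n_rest = max(len(q), len(c)) - 1
    if n_rest == 0:
        return first
    # zip stops at the shorter name, so unmatched tokens contribute 0.
    rest = sum(similarity(x, y) for x, y in zip(q[1:], c[1:])) / n_rest
    return FIRST_TOKEN_WEIGHT * first + REST_WEIGHT * rest


def name_aware_score(query: str, candidate: str) -> float:
    pos = positional_score(query, candidate)
    loose = order_insensitive(query, candidate)
    if loose - pos > REORDER_GAP:
        # Reordering is inflating the loose score; assumed penalty:
        # ignore it and keep only the positional score.
        return pos
    return BLEND * pos + (1 - BLEND) * loose
```

With these assumptions, a one-letter typo in the right order keeps a high score, while the same tokens in swapped order drop to the positional score.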
API Reference
FuzzyLookup Class
The primary entry point is the FuzzyLookup class [cite: 1, 2].
Parameters:
- source (str, Path, or pd.DataFrame): Path to the CSV/Excel file or an existing DataFrame [cite: 2].
- column (str): The column name to match against [cite: 2].
- scorer (str): Matching algorithm to use (ratio, partial, token_sort, token_set, wratio). Defaults to wratio [cite: 2].
- normalize_arabic (bool): Whether to strip diacritics and normalize characters. Defaults to True [cite: 2].
- name_aware (bool): Enables positional name scoring. Recommended for full person names. Defaults to False [cite: 2].
- encoding (str): File encoding for CSVs. Defaults to utf-8 [cite: 2].
Methods
- lookup(query: str, top_n: int = 5, min_score: float = 0.0, columns: list = None): Returns the top N best matches for the search string [cite: 2]. Returns a list of dictionaries with row data, the calculated score, and _index [cite: 2].
- lookup_best(query: str, min_score: float = 0.0, columns: list = None): Returns only the single best match, or None if no match meets the minimum score [cite: 2].
- lookup_many(queries: list, top_n: int = 1, min_score: float = 0.0, columns: list = None): Batch lookup for multiple queries. Returns a dictionary mapping each query to its list of results [cite: 2].
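Because lookup_many is described as returning a dictionary that maps each query to a list of result dictionaries, a small helper can flatten that structure into rows for reporting or export. The sketch below assumes only the score and _index keys mentioned above; the helper name flatten_batch and the sample data are hypothetical and do not come from the package.

```python
from typing import Any, Dict, List


def flatten_batch(batch: Dict[str, List[Dict[str, Any]]]) -> List[Dict[str, Any]]:
    """Hypothetical helper: one output row per (query, match) pair."""
    rows = []
    for query, matches in batch.items():
        for rank, match in enumerate(matches, start=1):
            rows.append({
                "query": query,
                "rank": rank,                  # 1 = best match for this query
                "score": match["score"],
                "row_index": match["_index"],  # position in the source data
            })
    return rows


# Made-up data shaped like the documented return value of lookup_many.
sample_batch = {
    "query a": [{"name": "candidate 1", "score": 97.5, "_index": 4}],
    "query b": [],  # a query with no match above min_score
}
rows = flatten_batch(sample_batch)
```

Queries with an empty result list simply produce no rows, so unmatched queries can be found by comparing the input keys against the query values in the output.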
Properties
- columns: Returns a list of the columns available in the loaded dataframe [cite: 2].
- shape: Returns a tuple representing the shape of the underlying dataframe [cite: 2].
Workflow Example
Below is a typical workflow demonstrating how to instantiate the lookup object and perform queries [cite: 2]:
from fuzzylookup import FuzzyLookup
# 1. Initialize the lookup instance with a dataset
fl = FuzzyLookup("names.csv", column="name", name_aware=True)
# 2. Perform a standard lookup for the top 3 matches
results = fl.lookup("محمد كمال", top_n=3)
# 3. Perform a strict lookup requiring a high match score
best_match = fl.lookup_best("كمال محمد", min_score=85.0)
# 4. Batch processing multiple names
batch_results = fl.lookup_many(["أحمد علي", "محمود حسن"], top_n=1)