Advanced multi-algorithm string similarity and matching engine
StringMatcher
Advanced multi-algorithm string similarity and matching engine for Python.
Compare strings, detect duplicates, find fuzzy matches, and link records with high accuracy.
Installation
pip install string-matcher
Requirements: Python 3.8+
Quick Start
Python API
from string_matcher import compare_strings
# Compare two strings
score = compare_strings("hello", "hallo")
print(f"Similarity: {score}%")  # Output: Similarity: 83%
Command Line
string-matcher "hello" "hallo"
File Processing
string-matcher input.json output.json name company
Features
- ✅ 6+ complementary matching algorithms
- ✅ CLI interface
- ✅ Python API (function & class-based)
- ✅ Batch processing (fast)
- ✅ JSON file processing
- ✅ Unicode/UTF-8 support
- ✅ Case-insensitive matching
- ✅ Whitespace normalization
- ✅ Production-ready
- ✅ Type hints included
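The case-insensitive matching and whitespace normalization listed above can be pictured as a pre-processing step applied to both strings before scoring. The following is a minimal sketch using only the standard library; `normalize` is a hypothetical helper for illustration, and nothing here is taken from the package's actual implementation.

```python
import unicodedata


def normalize(text: str) -> str:
    """Hypothetical pre-processing step; not string_matcher's own code."""
    # NFKC folds compatibility forms (for example full-width letters) together.
    text = unicodedata.normalize("NFKC", text)
    # casefold() is a stronger lower() designed for caseless comparison.
    text = text.casefold()
    # split() with no arguments trims the ends and collapses whitespace runs,
    # so joining on a single space yields uniform spacing.
    return " ".join(text.split())


print(normalize("  Hello   WORLD "))
```

Normalizing once up front, rather than inside every comparison, may also save work when the same string takes part in many pairs.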
Usage Guide
1. Basic String Comparison
from string_matcher import compare_strings
# Identical strings
score = compare_strings("hello", "hello")
print(score) # 100%
# Similar strings
score = compare_strings("hello", "hallo")
print(score) # 83%
# Different strings
score = compare_strings("hello", "world")
print(score) # 0%
# Case insensitive (automatic)
score = compare_strings("Hello", "HELLO")
print(score) # 100%
2. Batch Comparison (Multiple Pairs)
Compare multiple string pairs faster than looping:
from string_matcher import compare_batch
pairs = [
    ("hello", "hallo"),
    ("world", "world"),
    ("test", "testing"),
    ("foo", "bar"),
]
results = compare_batch(pairs)
for result in results:
    print(f"{result['string1']} vs {result['string2']}: {result['score']}%")
Output:
hello vs hallo: 83%
world vs world: 100%
test vs testing: 92%
foo vs bar: 0%
3. Object-Oriented Interface
Use the StringMatcher class for multiple comparisons:
from string_matcher import StringMatcher
matcher = StringMatcher()
# Single comparison
score = matcher.compare("apple", "apple")
print(score) # 100%
# Batch comparison
pairs = [("cat", "cat"), ("dog", "dog"), ("bird", "tree")]
results = matcher.compare_batch(pairs)
for r in results:
    print(f"{r['string1']} vs {r['string2']}: {r['score']}%")
4. File Processing
Compare fields in JSON files:
from string_matcher import process_json_file
process_json_file("input.json", "output.json", "name", "company")
Input:
[
  {"name": "John Smith", "company": "Apple Inc"},
  {"name": "Jane Doe", "company": "Microsoft"}
]
Output:
[
  {"name": "John Smith", "company": "Apple Inc", "similarity_score": 25},
  {"name": "Jane Doe", "company": "Microsoft", "similarity_score": 0}
]
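To make the input/output shape above concrete, here is a rough sketch of the same transformation written with the standard library only. The function `annotate_json` and the scorer `ratio_percent` are hypothetical names; the scorer is a `difflib` stand-in, not the package's algorithm, so its scores will differ from those shown above. The sketch also assumes every record contains both fields.

```python
import json
from difflib import SequenceMatcher
from typing import Callable


def ratio_percent(a: str, b: str) -> int:
    """Stand-in scorer based on difflib; NOT string_matcher's algorithm."""
    return round(SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100)


def annotate_json(
    input_file: str,
    output_file: str,
    field1: str,
    field2: str,
    scorer: Callable[[str, str], int] = ratio_percent,
) -> None:
    """Read a JSON array of objects, add a similarity_score key, write it back."""
    with open(input_file, encoding="utf-8") as fh:
        records = json.load(fh)
    for record in records:
        # Raises KeyError if a record lacks either field (assumed not to happen).
        record["similarity_score"] = scorer(str(record[field1]), str(record[field2]))
    with open(output_file, "w", encoding="utf-8") as fh:
        # ensure_ascii=False keeps non-ASCII text readable in the output file.
        json.dump(records, fh, ensure_ascii=False, indent=2)
```

Passing `compare_strings` as the `scorer` argument should reproduce the package's scores, assuming it accepts two strings and returns an integer as described in the API Reference.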
Examples
Find Best Match from List
from string_matcher import compare_strings
target = "python programming"
options = ["python coding", "java programming", "python scripts"]
best_match = max(
    [(opt, compare_strings(target, opt)) for opt in options],
    key=lambda x: x[1]
)
print(f"Best match: {best_match[0]} ({best_match[1]}%)")
# Best match: python scripts (95%)
Duplicate Detection
from string_matcher import compare_batch
records = [
    ("John Smith", "Jon Smith"),
    ("Apple Inc", "Apple Inc"),
    ("Microsoft", "Microsft"),
]
results = compare_batch(records)
for r in results:
    if r['score'] >= 85:
        print(f"Duplicate: {r['string1']} ≈ {r['string2']}")
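The example above starts from pairs that are already matched up. When the input is a flat list of records, the pairs have to be generated first. A small sketch using `itertools.combinations` follows; `candidate_pairs` is a hypothetical helper, not part of the package.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def candidate_pairs(records: Sequence[str]) -> List[Tuple[str, str]]:
    """Return every unordered pair of records, each pair exactly once."""
    # combinations() never pairs an item with itself and never yields
    # both (a, b) and (b, a), so no comparison is done twice.
    return list(combinations(records, 2))


names = ["John Smith", "Jon Smith", "Apple Inc"]
pairs = candidate_pairs(names)
print(pairs)
```

The resulting list has the shape `compare_batch` expects. Note that n records produce n(n-1)/2 pairs, so pre-filtering becomes important for large inputs.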
Deduplication
from string_matcher import compare_strings
data = ["Apple Inc", "Apple Inc.", "APPLE INC", "Microsoft"]
threshold = 90
groups = {}
for item in data:
    matched = False
    for group_key in groups:
        if compare_strings(item, group_key) >= threshold:
            groups[group_key].append(item)
            matched = True
            break
    if not matched:
        groups[item] = [item]
for key, items in groups.items():
    print(f"{key}: {items}")
Fuzzy Search
from string_matcher import compare_strings
def search(query, database, threshold=70):
    results = []
    for item in database:
        score = compare_strings(query, item)
        if score >= threshold:
            results.append((item, score))
    return sorted(results, key=lambda x: x[1], reverse=True)

database = ["Python Guide", "Java Tutorial", "Python Tips", "Web Dev"]
results = search("python", database)
for item, score in results:
    print(f"{item} ({score}%)")
Data Cleaning
from string_matcher import compare_strings
# Normalize messy data
companies = [
    "Apple Inc",
    "apple inc.",
    "APPLE INC",
    "Microsoft Corp",
    "microsoft",
]
canonical = {}
threshold = 85
for company in companies:
    matched = False
    for key in canonical:
        if compare_strings(company, key) >= threshold:
            canonical[key].append(company)
            matched = True
            break
    if not matched:
        canonical[company] = [company]
for canonical_name, variations in canonical.items():
    print(f"Canonical: {canonical_name}")
    for var in variations:
        print(f" └─ {var}")
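The Deduplication and Data Cleaning examples share the same greedy grouping loop: each item joins the first existing group whose key it matches, or starts a new group. That loop could be factored into one reusable function, sketched below. `group_similar` is a hypothetical helper, and the scorer is passed in as a parameter so the sketch does not depend on any particular scoring function.

```python
from typing import Callable, Dict, Iterable, List


def group_similar(
    items: Iterable[str],
    scorer: Callable[[str, str], int],
    threshold: int = 85,
) -> Dict[str, List[str]]:
    """Greedy first-match grouping; the first item seen becomes the group key."""
    groups: Dict[str, List[str]] = {}
    for item in items:
        for key in groups:
            if scorer(item, key) >= threshold:
                groups[key].append(item)
                break
        else:
            # The for-else branch runs only when no group matched.
            groups[item] = [item]
    return groups
```

With the package installed, `group_similar(companies, compare_strings)` should behave like the loop above. Because grouping is greedy, results can depend on input order: an item is only ever compared against group keys, not against other members of a group.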
CLI Usage
Compare Two Strings
string-matcher "hello" "hallo"
Process JSON File
string-matcher input.json output.json field1 field2
Scoring Guide
| Score | Meaning | Example |
|---|---|---|
| 100% | Perfect match | "hello" vs "hello" |
| 80-99% | Very similar | "hello" vs "hallo" |
| 60-79% | Similar | "test" vs "testing" |
| 40-59% | Somewhat similar | "python" vs "java" |
| 0-39% | Different | "hello" vs "world" |
Recommended Thresholds:
- Duplicate detection: 85-95%
- Fuzzy matching: 70-85%
- Search relevance: 60-75%
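The scoring bands and thresholds above can be encoded as plain data so that application code does not scatter magic numbers. This is only an illustrative sketch: `SCORE_BANDS`, `MIN_THRESHOLD`, and `describe` are hypothetical names, and the threshold values are simply the lower ends of the recommended ranges.

```python
# Lower bound of each band from the Scoring Guide table, highest first.
SCORE_BANDS = [
    (100, "Perfect match"),
    (80, "Very similar"),
    (60, "Similar"),
    (40, "Somewhat similar"),
    (0, "Different"),
]

# Lower end of each recommended range; tune these for your own data.
MIN_THRESHOLD = {
    "duplicate_detection": 85,
    "fuzzy_matching": 70,
    "search_relevance": 60,
}


def describe(score: int) -> str:
    """Map a 0-100 score to the label of the band it falls in."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    for lower_bound, label in SCORE_BANDS:
        if score >= lower_bound:
            return label
    raise AssertionError("unreachable: the last band starts at 0")


print(describe(83))
```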
Use Cases
✅ Duplicate Detection - Find and remove duplicate records
✅ Fuzzy Matching - Match similar but not identical strings
✅ Data Deduplication - Clean up messy data
✅ Record Linking - Link records across databases
✅ Search Engine - Find best matches for queries
✅ Typo Detection - Find and correct spelling errors
✅ Address Matching - Match addresses with variations
✅ Company Name Matching - Handle company name variations
API Reference
Functions
compare_strings(str1, str2) -> int
Compare two strings and return similarity score (0-100).
score = compare_strings("hello", "hallo") # 83
compare_batch(pairs) -> List[Dict]
Compare multiple string pairs at once.
results = compare_batch([("a", "b"), ("c", "d")])
# Returns: [{'string1': 'a', 'string2': 'b', 'score': 50}, ...]
process_json_file(input_file, output_file, field1, field2)
Compare two fields in a JSON file.
process_json_file("in.json", "out.json", "name", "company")
Classes
StringMatcher
Object-oriented interface for string matching.
matcher = StringMatcher()
score = matcher.compare("hello", "hallo")
results = matcher.compare_batch([...])
Tips & Best Practices
- Use batch processing - Faster than looping:
    # Fast
    results = compare_batch(pairs)
    # Slow (avoid)
    results = [compare_strings(p[0], p[1]) for p in pairs]
- Set appropriate thresholds - Different use cases need different thresholds:
    if compare_strings(str1, str2) >= 80:
        print("Match found!")
- Pre-filter data - Process only relevant pairs:
    pairs = [(a, b) for a, b in data if len(a) > 3]
    results = compare_batch(pairs)
- Handle edge cases - Always validate input:
    if str1 and str2:
        score = compare_strings(str1, str2)
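The input-validation tip can be wrapped in a small guard so that missing or blank values never reach the scorer. The sketch below is one possible approach; `safe_compare` is a hypothetical helper, and returning `None` for unusable input is a design choice made here, not documented package behavior.

```python
from typing import Callable, Optional


def safe_compare(
    a: Optional[str],
    b: Optional[str],
    scorer: Callable[[str, str], int],
) -> Optional[int]:
    """Return scorer(a, b), or None when either input is unusable."""
    # Reject None and non-string values such as numbers read from JSON.
    if not isinstance(a, str) or not isinstance(b, str):
        return None
    # Reject empty and whitespace-only strings.
    if not a.strip() or not b.strip():
        return None
    return scorer(a, b)
```

Returning `None` instead of 0 keeps "could not compare" distinct from "compared and found different", which matters when scores are later filtered by a threshold.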
Supported Python Versions
- Python 3.8
- Python 3.9
- Python 3.10
- Python 3.11
- Python 3.12
- Python 3.13+
Dependencies
- fuzzywuzzy - Fuzzy string matching
- python-Levenshtein - Levenshtein distance
- jellyfish - Additional string metrics
- nltk - Natural Language Toolkit
All installed automatically with pip install string-matcher.
Performance
- Single comparison: ~1ms
- Batch processing: ~0.1ms per pair (when using compare_batch)
- Memory efficient: Optimized for large datasets
- Supports Unicode and special characters
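The figures above will vary with hardware and string length, so it can be worth measuring on your own data. Below is a minimal timing sketch using `time.perf_counter`. `ms_per_call` is a hypothetical helper, and `stand_in_scorer` is a `difflib` placeholder used only so the sketch runs without the package; swap in `compare_strings` to measure the real function.

```python
import time
from difflib import SequenceMatcher
from typing import Callable, Sequence, Tuple


def stand_in_scorer(a: str, b: str) -> int:
    """Placeholder scorer (difflib); NOT string_matcher's algorithm."""
    return round(SequenceMatcher(None, a, b).ratio() * 100)


def ms_per_call(
    func: Callable[[str, str], int],
    pairs: Sequence[Tuple[str, str]],
    repeats: int = 5,
) -> float:
    """Best-of-N average time per comparison, in milliseconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for a, b in pairs:
            func(a, b)
        # Keep the fastest run to reduce noise from other processes.
        best = min(best, time.perf_counter() - start)
    return best * 1000 / len(pairs)


sample = [("hello", "hallo"), ("test", "testing")] * 50
print(f"{ms_per_call(stand_in_scorer, sample):.4f} ms per comparison")
```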
License
MIT License - See LICENSE file for details.
Version History
- v1.0.4 - Added comprehensive usage guide to README
- v1.0.3 - Fixed UTF-8 encoding compatibility
- v1.0.2 - Removed broken links, cleaned configuration
- v1.0.1 - Fixed PyPI project links
- v1.0.0 - Initial release
Support
- PyPI: https://pypi.org/project/string-matcher/
- Installation:
pip install string-matcher
Start using StringMatcher today! 🚀