Advanced multi-algorithm string similarity and matching engine

These details have not been verified by PyPI

Project description

StringMatcher

Advanced multi-algorithm string similarity and matching engine for Python.

Compare strings, detect duplicates, find fuzzy matches, and link records with high accuracy.

Installation

pip install string-matcher

Requirements: Python 3.8+

Quick Start

Python API

from string_matcher import compare_strings

# Compare two strings
score = compare_strings("hello", "hallo")
print(f"Similarity: {score}%")  # Similarity: 83%

Command Line

string-matcher "hello" "hallo"

File Processing

string-matcher input.json output.json name company

Features

✅ 6+ complementary matching algorithms
✅ CLI interface
✅ Python API (function & class-based)
✅ Batch processing (fast)
✅ JSON file processing
✅ Unicode/UTF-8 support
✅ Case-insensitive matching
✅ Whitespace normalization
✅ Production-ready
✅ Type hints included
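
The feature list mentions case-insensitive matching and whitespace normalization without saying how they are applied. As a rough illustration only, preprocessing of this kind is often done along the following lines. The `normalize` helper is hypothetical and not part of the string-matcher API; the package's own rules may differ.

```python
import unicodedata


def normalize(text: str) -> str:
    """Hypothetical preprocessing step; string-matcher's own rules may differ."""
    # NFKC folds compatibility forms (e.g. full-width letters) into canonical ones.
    text = unicodedata.normalize("NFKC", text)
    # casefold() is a more aggressive lower() intended for caseless comparison.
    text = text.casefold()
    # split() with no arguments splits on any whitespace run and drops empties.
    return " ".join(text.split())


print(normalize("  Apple   INC "))  # apple inc
print(normalize("Ｈｅｌｌｏ"))         # hello
```

Normalizing once up front, rather than on every comparison, can be worthwhile when the same strings are compared many times.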

Usage Guide

1. Basic String Comparison

from string_matcher import compare_strings

# Identical strings
score = compare_strings("hello", "hello")
print(score)  # 100

# Similar strings
score = compare_strings("hello", "hallo")
print(score)  # 83

# Different strings
score = compare_strings("hello", "world")
print(score)  # 0

# Case insensitive (automatic)
score = compare_strings("Hello", "HELLO")
print(score)  # 100

2. Batch Comparison (Multiple Pairs)

Score a list of string pairs in a single call, which the package reports as faster than calling compare_strings in a loop:

from string_matcher import compare_batch

pairs = [
    ("hello", "hallo"),
    ("world", "world"),
    ("test", "testing"),
    ("foo", "bar"),
]

results = compare_batch(pairs)

for result in results:
    print(f"{result['string1']} vs {result['string2']}: {result['score']}%")

Output:

hello vs hallo: 83%
world vs world: 100%
test vs testing: 92%
foo vs bar: 0%

3. Object-Oriented Interface

Use the StringMatcher class for multiple comparisons:

from string_matcher import StringMatcher

matcher = StringMatcher()

# Single comparison
score = matcher.compare("apple", "apple")
print(score)  # 100

# Batch comparison
pairs = [("cat", "cat"), ("dog", "dog"), ("bird", "tree")]
results = matcher.compare_batch(pairs)

for r in results:
    print(f"{r['string1']} vs {r['string2']}: {r['score']}%")

4. File Processing

Compare fields in JSON files:

from string_matcher import process_json_file

process_json_file("input.json", "output.json", "name", "company")

Input:

[
  {"name": "John Smith", "company": "Apple Inc"},
  {"name": "Jane Doe", "company": "Microsoft"}
]

Output:

[
  {"name": "John Smith", "company": "Apple Inc", "similarity_score": 25},
  {"name": "Jane Doe", "company": "Microsoft", "similarity_score": 0}
]
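
To make the input and output shapes above concrete, here is a minimal sketch that produces the same structure using only the standard library. It is not the package's implementation: `score_json_records` and `standin_score` are hypothetical names, and `difflib` stands in for the real scorer, so the numbers it produces will differ from string-matcher's.

```python
import json
from difflib import SequenceMatcher
from typing import Callable


def standin_score(a: str, b: str) -> int:
    """Stand-in scorer based on difflib, NOT string-matcher's algorithm."""
    return round(SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100)


def score_json_records(
    input_file: str,
    output_file: str,
    field1: str,
    field2: str,
    scorer: Callable[[str, str], int] = standin_score,
) -> None:
    """Hypothetical re-creation of the input/output shape shown above."""
    with open(input_file, encoding="utf-8") as fh:
        records = json.load(fh)  # expects a JSON array of objects

    for record in records:
        # Missing fields are treated as empty strings (an assumption).
        a = str(record.get(field1, ""))
        b = str(record.get(field2, ""))
        record["similarity_score"] = scorer(a, b)

    with open(output_file, "w", encoding="utf-8") as fh:
        # ensure_ascii=False keeps non-ASCII names readable in the output file.
        json.dump(records, fh, indent=2, ensure_ascii=False)
```

Passing `compare_strings` as the `scorer` argument would swap the stand-in for the real scorer, assuming it accepts two strings and returns an integer as documented in the API reference.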

Examples

Find Best Match from List

from string_matcher import compare_strings

target = "python programming"
options = ["python coding", "java programming", "python scripts"]

best_match = max(
    [(opt, compare_strings(target, opt)) for opt in options],
    key=lambda x: x[1]
)

print(f"Best match: {best_match[0]} ({best_match[1]}%)")
# Best match: python scripts (95%)
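
The `max()` approach above returns a single winner and raises `ValueError` on an empty list. When several candidates are wanted, a top-N variant is a small change. The sketch below is an illustration under stated assumptions: `top_matches` is a hypothetical helper, and `standin_score` uses `difflib` so the block runs without the package installed (its scores will not match string-matcher's).

```python
import heapq
from difflib import SequenceMatcher
from typing import Callable, Iterable, List, Tuple


def standin_score(a: str, b: str) -> int:
    """Stand-in scorer based on difflib, NOT string-matcher's algorithm."""
    return round(SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100)


def top_matches(
    target: str,
    options: Iterable[str],
    n: int = 3,
    scorer: Callable[[str, str], int] = standin_score,
) -> List[Tuple[str, int]]:
    """Return up to n (option, score) pairs, highest score first."""
    scored = ((opt, scorer(target, opt)) for opt in options)
    # nlargest keeps only n items in memory and returns [] for empty input.
    return heapq.nlargest(n, scored, key=lambda pair: pair[1])


options = ["python coding", "java programming", "python scripts"]
for option, score in top_matches("python programming", options, n=2):
    print(f"{option} ({score}%)")
```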

Duplicate Detection

from string_matcher import compare_batch

records = [
    ("John Smith", "Jon Smith"),
    ("Apple Inc", "Apple Inc"),
    ("Microsoft", "Microsft"),
]

results = compare_batch(records)

for r in results:
    if r['score'] >= 85:
        print(f"Duplicate: {r['string1']} ≈ {r['string2']}")
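
The example above starts from records that are already paired. When the starting point is a flat list, candidate pairs have to be generated first. One way to do that, using only the standard library, is sketched here; the `names` list is made up for illustration.

```python
from itertools import combinations

names = ["John Smith", "Jon Smith", "Apple Inc", "Microsoft"]

# Every unordered pair exactly once: n * (n - 1) / 2 pairs for n names.
pairs = list(combinations(names, 2))

print(len(pairs))  # 6
print(pairs[0])    # ('John Smith', 'Jon Smith')

# The resulting list of 2-tuples has the shape compare_batch expects:
# results = compare_batch(pairs)
```

The number of pairs grows quadratically (10,000 records yield about 50 million pairs), so pre-filtering candidates matters for larger datasets.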

Deduplication

from string_matcher import compare_strings

data = ["Apple Inc", "Apple Inc.", "APPLE INC", "Microsoft"]
threshold = 90
groups = {}

for item in data:
    matched = False
    for group_key in groups:
        if compare_strings(item, group_key) >= threshold:
            groups[group_key].append(item)
            matched = True
            break
    if not matched:
        groups[item] = [item]

for key, items in groups.items():
    print(f"{key}: {items}")

Fuzzy Search

from string_matcher import compare_strings

def search(query, database, threshold=70):
    results = []
    for item in database:
        score = compare_strings(query, item)
        if score >= threshold:
            results.append((item, score))
    return sorted(results, key=lambda x: x[1], reverse=True)

database = ["Python Guide", "Java Tutorial", "Python Tips", "Web Dev"]
results = search("python", database)

for item, score in results:
    print(f"{item} ({score}%)")

Data Cleaning

from string_matcher import compare_strings

# Normalize messy data
companies = [
    "Apple Inc",
    "apple inc.",
    "APPLE INC",
    "Microsoft Corp",
    "microsoft",
]

canonical = {}
threshold = 85

for company in companies:
    matched = False
    for key in canonical:
        if compare_strings(company, key) >= threshold:
            canonical[key].append(company)
            matched = True
            break
    if not matched:
        canonical[company] = [company]

for canonical_name, variations in canonical.items():
    print(f"Canonical: {canonical_name}")
    for var in variations:
        print(f"  └─ {var}")
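
The Deduplication and Data Cleaning examples share the same greedy grouping loop, so it can be factored into one reusable function. This is a sketch, not part of the package: `group_similar` is a hypothetical helper, and `standin_score` uses `difflib` in place of `compare_strings` so the block runs on its own.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, Iterable, List


def standin_score(a: str, b: str) -> int:
    """Stand-in scorer based on difflib, NOT string-matcher's algorithm."""
    return round(SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100)


def group_similar(
    items: Iterable[str],
    threshold: int = 85,
    scorer: Callable[[str, str], int] = standin_score,
) -> Dict[str, List[str]]:
    """Greedy grouping: each item joins the first group whose key it matches."""
    groups: Dict[str, List[str]] = {}
    for item in items:
        for key in groups:
            if scorer(item, key) >= threshold:
                groups[key].append(item)
                break
        else:
            # No existing group matched, so this item starts a new one.
            groups[item] = [item]
    return groups


print(group_similar(["Apple Inc", "APPLE INC", "Microsoft"]))
```

Because the first item seen becomes the group key, the result depends on input order. Sorting the input so that the preferred canonical form comes first is one way to make the output predictable.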

CLI Usage

Compare Two Strings

string-matcher "hello" "hallo"

Process JSON File

string-matcher input.json output.json field1 field2

Scoring Guide

Score	Meaning	Example
100%	Perfect match	"hello" vs "hello"
80-99%	Very similar	"hello" vs "hallo"
60-79%	Similar	"test" vs "testing"
40-59%	Somewhat similar	"python" vs "java"
0-39%	Different	"hello" vs "world"

Recommended Thresholds:

Duplicate detection: 85-95%
Fuzzy matching: 70-85%
Search relevance: 60-75%
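
The bands in the scoring table can be turned into a small lookup for reports or logs. The `score_band` helper below is hypothetical; only the band boundaries and labels come from the table above.

```python
from bisect import bisect_right

# Lower bounds of every band except the lowest, taken from the scoring table.
_LOWER_BOUNDS = [40, 60, 80, 100]
_LABELS = ["Different", "Somewhat similar", "Similar", "Very similar", "Perfect match"]


def score_band(score: int) -> str:
    """Map a 0-100 similarity score to its label in the scoring table."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    # bisect_right counts how many lower bounds the score has reached.
    return _LABELS[bisect_right(_LOWER_BOUNDS, score)]


print(score_band(83))   # Very similar
print(score_band(100))  # Perfect match
print(score_band(39))   # Different
```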

Use Cases

✅ Duplicate Detection - Find and remove duplicate records
✅ Fuzzy Matching - Match similar but not identical strings
✅ Data Deduplication - Clean up messy data
✅ Record Linking - Link records across databases
✅ Search Engine - Find best matches for queries
✅ Typo Detection - Find and correct spelling errors
✅ Address Matching - Match addresses with variations
✅ Company Name Matching - Handle company name variations

API Reference

Functions

`compare_strings(str1, str2) -> int`

Compare two strings and return an integer similarity score from 0 to 100.

score = compare_strings("hello", "hallo")  # 83

`compare_batch(pairs) -> List[Dict]`

Compare multiple string pairs at once.

results = compare_batch([("a", "b"), ("c", "d")])
# Returns: [{'string1': 'a', 'string2': 'b', 'score': 50}, ...]
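
For code that consumes these results, the dictionary shape shown above can be described with a `TypedDict` so that type checkers catch misspelled keys. This is a sketch on the caller's side: `PairResult` and `high_scoring` are hypothetical names, not types exported by the package, and the sample data reuses scores from the batch output earlier in this document.

```python
from typing import List, Tuple, TypedDict  # TypedDict is in typing since Python 3.8


class PairResult(TypedDict):
    """Shape of one compare_batch result, as shown in the example above."""
    string1: str
    string2: str
    score: int


def high_scoring(results: List[PairResult], threshold: int) -> List[Tuple[str, str]]:
    """Return the string pairs whose score meets the threshold."""
    return [(r["string1"], r["string2"]) for r in results if r["score"] >= threshold]


sample: List[PairResult] = [
    {"string1": "hello", "string2": "hallo", "score": 83},
    {"string1": "foo", "string2": "bar", "score": 0},
]
print(high_scoring(sample, 80))  # [('hello', 'hallo')]
```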

`process_json_file(input_file, output_file, field1, field2)`

Compare two fields in a JSON file.

process_json_file("in.json", "out.json", "name", "company")

Classes

`StringMatcher`

Object-oriented interface for string matching.

from string_matcher import StringMatcher

matcher = StringMatcher()
score = matcher.compare("hello", "hallo")
results = matcher.compare_batch([("hello", "hallo"), ("world", "world")])

Tips & Best Practices

Use batch processing - Faster than looping:

# Fast: one call, returns a list of result dicts
results = compare_batch(pairs)

# Slow (avoid): one call per pair, returns a list of bare scores
results = [compare_strings(a, b) for a, b in pairs]

Set appropriate thresholds - Different use cases need different thresholds:

if compare_strings(str1, str2) >= 80:
    print("Match found!")

Pre-filter data - Process only relevant pairs:

pairs = [(a, b) for a, b in data if len(a) > 3]
results = compare_batch(pairs)

Handle edge cases - Always validate input:

if str1 and str2:
    score = compare_strings(str1, str2)
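
The guard above leaves `score` undefined when either input is empty. Wrapping the check in a function gives every call a defined result. The sketch below is one possible convention, not the package's behavior: `safe_compare` is a hypothetical helper, returning 0 for unusable input is an assumption, and `standin_score` uses `difflib` in place of `compare_strings`.

```python
from difflib import SequenceMatcher
from typing import Callable, Optional


def standin_score(a: str, b: str) -> int:
    """Stand-in scorer based on difflib, NOT string-matcher's algorithm."""
    return round(SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100)


def safe_compare(
    a: Optional[str],
    b: Optional[str],
    scorer: Callable[[str, str], int] = standin_score,
) -> int:
    """Score two values, returning 0 when either is missing or blank."""
    # Non-string values (None, numbers from a JSON file, ...) are rejected.
    if not isinstance(a, str) or not isinstance(b, str):
        return 0
    # Whitespace-only strings carry no content to compare.
    if not a.strip() or not b.strip():
        return 0
    return scorer(a, b)


print(safe_compare(None, "hello"))     # 0
print(safe_compare("   ", "hello"))    # 0
print(safe_compare("hello", "hello"))  # 100
```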

Supported Python Versions

Python 3.8
Python 3.9
Python 3.10
Python 3.11
Python 3.12
Python 3.13+

Dependencies

fuzzywuzzy - Fuzzy string matching
python-Levenshtein - Levenshtein distance
jellyfish - Additional string metrics
nltk - Natural Language Toolkit

All are installed automatically by pip install string-matcher.

Performance

Single comparison: ~1ms
Batch processing: ~0.1ms per pair (when using compare_batch)
Memory efficient: Optimized for large datasets
Supports Unicode and special characters
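
The timings above will vary with hardware and string length, so it can be worth measuring on your own data. The sketch below shows one way to do that. `ms_per_pair` is a hypothetical helper, and the loop uses a `difflib` stand-in so the block runs without the package; to measure the real functions, pass `compare_batch` or a loop over `compare_strings` instead.

```python
import time
from difflib import SequenceMatcher
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]


def standin_score(a: str, b: str) -> int:
    """Stand-in scorer based on difflib, NOT string-matcher's algorithm."""
    return round(SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100)


def ms_per_pair(fn: Callable[[Sequence[Pair]], object],
                pairs: Sequence[Pair], repeats: int = 5) -> float:
    """Best-of-N wall-clock time for fn(pairs), in milliseconds per pair."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(pairs)
        best = min(best, time.perf_counter() - start)
    return best * 1000 / len(pairs)


def loop_scores(pairs: Sequence[Pair]) -> List[int]:
    """Score each pair with one scorer call, as in the 'slow' pattern."""
    return [standin_score(a, b) for a, b in pairs]


pairs = [("hello", "hallo"), ("test", "testing")] * 500
print(f"{ms_per_pair(loop_scores, pairs):.4f} ms per pair")
```

Taking the best of several runs reduces noise from other processes on the machine.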

License

MIT License - See LICENSE file for details.

Version History

v1.0.4 - Added comprehensive usage guide to README
v1.0.3 - Fixed UTF-8 encoding compatibility
v1.0.2 - Removed broken links, cleaned configuration
v1.0.1 - Fixed PyPI project links
v1.0.0 - Initial release

Support

PyPI: https://pypi.org/project/string-matcher/
Installation: pip install string-matcher

Start using StringMatcher today! 🚀

Release history

1.0.6 - Jun 23, 2026
1.0.5 - Jun 23, 2026 (this version)
1.0.4 - Jun 23, 2026
1.0.3 - Jun 23, 2026
1.0.2 - Jun 23, 2026
1.0.1 - Jun 23, 2026
1.0.0 - Jun 23, 2026

Download files

Source Distribution

string_matcher-1.0.5.tar.gz (18.9 kB)

Uploaded Jun 23, 2026 Source

Built Distribution

string_matcher-1.0.5-py3-none-any.whl (16.4 kB)

Uploaded Jun 23, 2026 Python 3

File details

Details for the file string_matcher-1.0.5.tar.gz.

File metadata

Download URL: string_matcher-1.0.5.tar.gz
Upload date: Jun 23, 2026
Size: 18.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for string_matcher-1.0.5.tar.gz
Algorithm	Hash digest
SHA256	`df2b2dfc24f09c39007a28e93b98aeb255455206d3ae3681f5496b8a25f2f84f`
MD5	`4023cca2a5ffd9eab2e580dca79c50d9`
BLAKE2b-256	`c7b6790b60942f3309270d7eabb9615269307294dd3272d2ece1880dbebfc1f1`

File details

Details for the file string_matcher-1.0.5-py3-none-any.whl.

File metadata

Download URL: string_matcher-1.0.5-py3-none-any.whl
Upload date: Jun 23, 2026
Size: 16.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for string_matcher-1.0.5-py3-none-any.whl
Algorithm	Hash digest
SHA256	`291db68366ed228cfe92ab4536af8d607abbe8d9cb1174fcbb24defa74ed35d0`
MD5	`7ed675aa0f09f4b3fd853d93e7eb932d`
BLAKE2b-256	`ddf0bf36398e46bb0c74d33d067534eed5b124b473bed08f5fa8b99f42e7607c`
