A library for matching and comparing company names using a fine-tuned sentence transformer model

These details have not been verified by PyPI

Intended Audience
- Developers
License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3.9

Project description

Company Name Matcher

Company Name Matcher is a library for efficient matching of company names using vector search. It leverages a language model to generate embeddings specifically tailored for company names.

Advantages over Traditional Methods

While traditional string matching algorithms like those used in RapidFuzz are fast for small datasets, Company Name Matcher offers several advantages, especially when dealing with larger datasets:

- Scalability: The embedding-based approach scales better to large datasets through efficient vector search. Traditional methods compare every name in list A with every name in list B, which costs O(n * m) comparisons. Optimized vector search reduces this to O(n log m), where n and m are the lengths of the two name lists.
- Contextual Understanding: Embeddings capture the context and semantics of company names, allowing for matching that goes beyond simple string similarity.
- Customizability: The underlying model can be fine-tuned on domain-specific data to improve performance in specialized use cases.
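
To make the O(n * m) cost of all-pairs matching concrete, here is a minimal sketch that scores every name in one list against every name in another. It uses Python's standard-library difflib as a stand-in for a string matcher such as RapidFuzz; it does not use RapidFuzz's API, and best_matches_bruteforce is a hypothetical helper written for this illustration.

```python
from difflib import SequenceMatcher


def best_matches_bruteforce(names_a, names_b):
    """Return (results, comparisons): the best match in names_b for each name in names_a."""
    comparisons = 0
    results = {}
    for a in names_a:
        best_name, best_score = None, -1.0
        for b in names_b:
            comparisons += 1  # one string comparison per (a, b) pair
            score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
            if score > best_score:
                best_name, best_score = b, score
        results[a] = (best_name, best_score)
    return results, comparisons


names_a = ["Apple Inc", "Microsoft Corp"]
names_b = ["Apple Incorporated", "Microsoft Corporation", "Google LLC"]
results, comparisons = best_matches_bruteforce(names_a, names_b)

# The comparison count is always len(names_a) * len(names_b).
print(comparisons)
print(results["Apple Inc"][0])
```

The number of comparisons grows with the product of the two list lengths, which is the cost a vector index is meant to avoid.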

🚀 Installation

pip install company-name-matcher

Optionally, if you run into an OpenMP compatibility issue with scikit-learn, installing with pip install . --no-binary scikit-learn is recommended.

📣 Features

- K-means approximate matching: Run vector search in either exact or approximate mode
- Expandable index: Add new companies to an existing index without rebuilding it from scratch
- Efficient batch processing: Process multiple companies in parallel, with caching for faster matching
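
One general way a clustered index can grow without a rebuild is to keep the existing centroids fixed and drop each new vector into the bucket of its nearest centroid. The sketch below illustrates that idea only; it is an assumption about the technique, not a description of this library's internals, and ToyClusterIndex and the 2-D vectors are hypothetical.

```python
import math


class ToyClusterIndex:
    """Toy clustered index: fixed centroids, each with a bucket of (name, vector) entries."""

    def __init__(self, centroids):
        self.centroids = [list(c) for c in centroids]
        self.buckets = [[] for _ in centroids]

    def _nearest(self, vector):
        # Index of the centroid closest to the vector (Euclidean distance).
        return min(range(len(self.centroids)),
                   key=lambda c: math.dist(vector, self.centroids[c]))

    def add(self, name, vector):
        """Append to the nearest centroid's bucket; existing entries are left untouched."""
        cluster = self._nearest(vector)
        self.buckets[cluster].append((name, list(vector)))
        return cluster


index = ToyClusterIndex([[1.0, 0.0], [0.0, 1.0]])
index.add("Apple Inc", [0.9, 0.1])
index.add("Microsoft Corp", [0.1, 0.9])

# A later addition only touches one bucket; nothing is re-clustered.
new_cluster = index.add("Apple Incorporated", [0.8, 0.2])
print(new_cluster, [len(bucket) for bucket in index.buckets])
```

The trade-off of this design is that centroids can drift out of date as many new items arrive, so an occasional full rebuild may still be worthwhile.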

📚 Quick Start

1. Basic Usage

from company_name_matcher import CompanyNameMatcher

# Initialize with default model
matcher = CompanyNameMatcher("paraphrase-multilingual-MiniLM-L12-v2")

# Or initialize with custom preprocessing
def preprocess_name(name):
    return name.lower().strip()

matcher = CompanyNameMatcher(
    "paraphrase-multilingual-MiniLM-L12-v2",
    preprocess_fn=preprocess_name
)

# Compare two company names
similarity = matcher.compare_companies("Apple Inc", "Apple Incorporated")
print(f"Similarity: {similarity}")
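
The preprocess_name function above only lowercases and trims. As a hedged illustration, a slightly richer normalizer might also drop punctuation and trailing legal suffixes. The function name normalize_company_name and the suffix list are assumptions made for this sketch and are not part of the library.

```python
import re

# Hypothetical suffix list for illustration; extend it for your own data.
LEGAL_SUFFIXES = {"inc", "incorporated", "corp", "corporation",
                  "llc", "ltd", "limited", "co"}


def normalize_company_name(name):
    """Lowercase, replace punctuation with spaces, and remove trailing legal suffixes."""
    cleaned = re.sub(r"[^\w\s]", " ", name.lower())
    tokens = cleaned.split()
    # Keep at least one token so a name like "Inc" does not become empty.
    while len(tokens) > 1 and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


print(normalize_company_name("  Apple, Inc. "))
print(normalize_company_name("Acme Co., Ltd."))
```

A function like this could be passed as preprocess_fn=normalize_company_name. Whether stripping suffixes helps or hurts depends on the model and data, so it is worth checking on a labeled sample first.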

2. Bulk Matching with Vector Search

For large datasets, you can use vector search with either exact or approximate matching:

# Your list of companies to match against
companies_to_match = ["Microsoft Corporation", "Apple Inc", "Google LLC", ...]

# Build and save index (only needed once)
matcher.build_index(
    companies_to_match,
    n_clusters=20,  # Adjust based on dataset size
    save_dir="index_files"  # Optional: save index to disk
)

# Or load a previously saved index
matcher.load_index(load_dir="index_files")

# 1. Exact Search (more accurate but slower)
exact_matches = matcher.find_matches(
    "Apple",
    threshold=0.7,
    k=5,
    use_approx=False
)
print("Exact matches:", exact_matches)

# 2. Approximate Search (faster but may miss some matches)
approx_matches = matcher.find_matches(
    "Apple",
    threshold=0.7,
    k=5,
    use_approx=True
)
print("Approximate matches:", approx_matches)
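
To show the idea behind the two search modes, here is a minimal, standard-library sketch of "cluster first, then search only the nearest cluster". It is a generic illustration under stated assumptions, not the library's implementation: the helpers kmeans, assign, and search are hypothetical, and the toy 2-D vectors stand in for real embeddings.

```python
import math
import random


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def assign(vectors, centroids):
    """Label each vector with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors]


def kmeans(vectors, n_clusters, n_iter=10, seed=0):
    """Plain Lloyd's k-means with Euclidean distance; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, n_clusters)]
    for _ in range(n_iter):
        labels = assign(vectors, centroids)
        for c in range(n_clusters):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign(vectors, centroids)


def search(query, vectors, names, centroids=None, labels=None, k=5, threshold=0.0):
    """Exact mode scans every vector; approximate mode scans only the nearest cluster."""
    candidates = list(range(len(vectors)))
    if centroids is not None:
        nearest = min(range(len(centroids)), key=lambda c: math.dist(query, centroids[c]))
        candidates = [i for i in candidates if labels[i] == nearest]
    scored = [(names[i], cosine(query, vectors[i])) for i in candidates]
    scored = [pair for pair in scored if pair[1] >= threshold]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


names = ["Apple Inc", "Apple Incorporated", "Microsoft Corp", "Microsoft Corporation"]
vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]  # toy stand-ins for embeddings
query = [1.0, 0.0]

centroids, labels = kmeans(vectors, n_clusters=2)
exact = search(query, vectors, names, k=5, threshold=0.7)
approx = search(query, vectors, names, centroids, labels, k=5, threshold=0.7)
print("Exact:", [name for name, _ in exact])
print("Approximate:", [name for name, _ in approx])
```

On this toy data both modes agree. In general, approximate search can miss a match that sits just across a cluster boundary, which is the accuracy cost mentioned above.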

3. Working with Embeddings

You can also work directly with the embeddings:

# Get embedding for a single company
embedding = matcher.get_embedding("Apple Inc")
print(f"Embedding shape: {embedding.shape}")

# Get embeddings for multiple companies
embeddings = matcher.get_embeddings(["Microsoft", "Google"])
print(f"Embeddings shape: {embeddings.shape}")

📊 Performance Considerations

- For small datasets (<10,000 companies), use exact matching (use_approx=False)
- For large datasets, use approximate matching (use_approx=True) with an appropriate n_clusters
- When using approximate matching:
  - Build the index once and save it to disk
  - Load the index for subsequent uses
  - Adjust n_clusters based on your dataset size and speed/accuracy requirements
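
The guidance above can be turned into a small helper that suggests settings from the dataset size. This is only a sketch: suggest_search_settings is a hypothetical function, and the square-root rule for n_clusters is a common rule of thumb assumed here, not a documented default of this library. In the Quick Start, n_clusters is passed to build_index and use_approx to find_matches.

```python
import math

EXACT_SEARCH_LIMIT = 10_000  # threshold taken from the guidance above


def suggest_search_settings(n_companies):
    """Suggest exact search for small lists and approximate search otherwise."""
    if n_companies < EXACT_SEARCH_LIMIT:
        return {"use_approx": False}
    # Assumption: sqrt(n) clusters as a starting point; tune for speed vs. accuracy.
    return {"use_approx": True, "n_clusters": max(1, round(math.sqrt(n_companies)))}


print(suggest_search_settings(5_000))
print(suggest_search_settings(1_000_000))
```

Treat the suggested n_clusters as a starting value and adjust it after measuring recall and latency on your own data.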

🤖 (Complementary) fine-tuned model

While you can load your own model into CompanyNameMatcher, we also provide a complementary fine-tuned model, available for download here on Google Drive. See demo here.

- Fine-tuned Embeddings: We use a lightweight multilingual sentence transformer model (sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2) fine-tuned specifically for company names. This model was trained using contrastive learning, minimizing the cosine distance between similar company names.
- Special Tokens: During the training process, we added special tokens # to the training data. These tokens guide the model's understanding, explicitly informing it that it's embedding company names. This results in more accurate and context-aware embeddings.
- Cosine Similarity: We use cosine similarity to compare the resulting embeddings, providing a robust measure of similarity that works well with high-dimensional data.
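
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it with the standard library on short toy vectors; the function name cosine_similarity is chosen for this example, and real embeddings would simply be longer vectors.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention used here for zero vectors
    return dot / (norm_u * norm_v)


same_direction = cosine_similarity([3.0, 4.0], [6.0, 8.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

Because the measure depends only on direction, scaling a vector does not change its similarity to another vector.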

Performance Comparison

Here's a comparison of different matching approaches on our test dataset:

Metric	Fine-tuned Matcher	Default Matcher	RapidFuzz
Accuracy	0.910	0.780	0.690
Precision	0.918	0.719	0.807
Recall	0.900	0.920	0.500
F1 Score	0.909	0.807	0.617

While RapidFuzz is faster, Company Name Matcher provides better accuracy and scales better: as the lists to match grow in size, you can switch to K-means approximate matching.
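
As a quick sanity check on the table, F1 is the harmonic mean of precision and recall, so it can be recomputed from the other two rows. The sketch below does that with the reported values; f1_score is a small helper written for this example.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above
reported = {
    "Fine-tuned Matcher": (0.918, 0.900, 0.909),
    "Default Matcher": (0.719, 0.920, 0.807),
    "RapidFuzz": (0.807, 0.500, 0.617),
}

for name, (precision, recall, f1) in reported.items():
    print(f"{name}: computed F1 = {f1_score(precision, recall):.3f}, reported = {f1}")
```

The recomputed values agree with the reported F1 scores to three decimal places.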

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

Release history

0.1.2 (this version): Apr 21, 2025
0.1.1: Dec 29, 2024

Download files

Source Distribution

company_name_matcher-0.1.2.tar.gz (16.8 kB)

Uploaded Apr 21, 2025 Source

Built Distribution

company_name_matcher-0.1.2-py3-none-any.whl (11.3 kB)

Uploaded Apr 21, 2025 Python 3

File details

Details for the file company_name_matcher-0.1.2.tar.gz.

File metadata

Download URL: company_name_matcher-0.1.2.tar.gz
Upload date: Apr 21, 2025
Size: 16.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.3

File hashes

Hashes for company_name_matcher-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`3aac281739284aa2958cf755640079f84387f7c46b8927e64a6bc1ef04c0f364`
MD5	`e26d0c5390eeecd2f3ff6344c0120105`
BLAKE2b-256	`52a12de4b7a6f8e0c708787381c2a0e91d91cc6a748a823b7c108ba90a4b35d2`

File details

Details for the file company_name_matcher-0.1.2-py3-none-any.whl.

File metadata

Download URL: company_name_matcher-0.1.2-py3-none-any.whl
Upload date: Apr 21, 2025
Size: 11.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.3

File hashes

Hashes for company_name_matcher-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a75ade2c033c7aaafd2001c199f65fa57935df5a451469a8cc7afd229371a788`
MD5	`28ac5f97114a596ae7a7ae1f163f2dd6`
BLAKE2b-256	`da530f013823a57682d7481bd6946b56120e76211e8e13f4558f25e756122aac`
