A library for matching and comparing company names using a fine-tuned sentence transformer model
Company Name Matcher
Company Name Matcher is a library for efficient matching of company names using vector search. It leverages a language model to generate embeddings specifically tailored for company names.
Advantages over Traditional Methods
While traditional string matching algorithms like those used in RapidFuzz are fast for small datasets, Company Name Matcher offers several advantages, especially when dealing with larger datasets:
- Scalability: The embedding-based approach scales better to large datasets through efficient vector search. Traditional methods compare every name in list A with every name in list B, which is O(n * m), where n and m are the lengths of the two lists. Optimized vector search avoids the full scan and reduces the cost to roughly O(n log m).
- Contextual Understanding: Embeddings capture the context and semantics of company names, allowing for more intelligent matching that goes beyond simple string similarity.
- Customizability: The underlying model can be fine-tuned on domain-specific data, allowing for improved performance in specialized use cases.
🚀 Installation
pip install company-name-matcher
Installing from a source checkout with `pip install . --no-binary scikit-learn` is optional but recommended, as it fixes an OpenMP compatibility issue with scikit-learn.
📣 Features
- K-Means approximated matching: Use vector search with either exact or approximate matching
- Expandable index: Add new companies to the existing index without rebuilding it from scratch
- Efficient batch processing: Process multiple companies in parallel, with caching, for faster matching
📚 Quick Start
1. Basic Usage
from company_name_matcher import CompanyNameMatcher

# Initialize with default model
matcher = CompanyNameMatcher("paraphrase-multilingual-MiniLM-L12-v2")

# Or initialize with custom preprocessing
def preprocess_name(name):
    return name.lower().strip()

matcher = CompanyNameMatcher(
    "paraphrase-multilingual-MiniLM-L12-v2",
    preprocess_fn=preprocess_name
)

# Compare two company names
similarity = matcher.compare_companies("Apple Inc", "Apple Incorporated")
print(f"Similarity: {similarity}")
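The preprocessing hook shown above takes any function that maps a string to a string. As a minimal sketch, assuming you want common legal suffixes ignored before embedding, a richer function might look like the following. The name `normalize_company_name` and the suffix list are illustrative assumptions, not part of the library.

```python
import re

# Hypothetical suffix list for illustration; extend it for your own data.
LEGAL_SUFFIXES = (
    "incorporated", "inc", "corporation", "corp",
    "llc", "ltd", "limited", "gmbh",
)

# Match any suffix as a whole word, optionally followed by a period.
_SUFFIX_RE = re.compile(
    r"\b(?:" + "|".join(LEGAL_SUFFIXES) + r")\b\.?", re.IGNORECASE
)


def normalize_company_name(name: str) -> str:
    """Lowercase, drop legal suffixes and punctuation, collapse whitespace."""
    name = name.lower()
    name = _SUFFIX_RE.sub(" ", name)
    name = re.sub(r"[^\w\s&]", " ", name)  # drop punctuation but keep "&"
    return re.sub(r"\s+", " ", name).strip()
```

It can then be passed as `preprocess_fn=normalize_company_name`. Whether stripping suffixes helps depends on the model: a model fine-tuned on raw company names may already handle them, so it is worth comparing results with and without this step.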
2. Bulk Matching with Vector Search
For large datasets, you can use vector search with either exact or approximate matching:
# Your list of companies to match against
companies_to_match = ["Microsoft Corporation", "Apple Inc", "Google LLC"]  # ... and so on

# Build and save index (only needed once)
matcher.build_index(
    companies_to_match,
    n_clusters=20,  # Adjust based on dataset size
    save_dir="index_files"  # Optional: save index to disk
)

# Or load a previously saved index
matcher.load_index(load_dir="index_files")

# 1. Exact Search (more accurate but slower)
exact_matches = matcher.find_matches(
    "Apple",
    threshold=0.7,
    k=5,
    use_approx=False
)
print("Exact matches:", exact_matches)

# 2. Approximate Search (faster but may miss some matches)
approx_matches = matcher.find_matches(
    "Apple",
    threshold=0.7,
    k=5,
    use_approx=True
)
print("Approximate matches:", approx_matches)
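To illustrate why approximate search is faster but can miss matches, here is a small pure-Python sketch of cluster-pruned search. It is not the library's implementation: the 2-D vectors are toy stand-ins for embeddings, and the centroids are hand-picked where a real index would obtain them from k-means. All function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(scored, k, threshold):
    """Keep scores at or above the threshold, best first, at most k items."""
    kept = [s for s in scored if s[1] >= threshold]
    return sorted(kept, key=lambda s: s[1], reverse=True)[:k]


def exact_search(query, vectors, k=5, threshold=0.0):
    """Score the query against every indexed vector (full scan)."""
    return top_k([(i, cosine(query, v)) for i, v in enumerate(vectors)], k, threshold)


def build_clusters(vectors, centroids):
    """Assign each vector index to its most similar centroid."""
    clusters = {c: [] for c in range(len(centroids))}
    for i, v in enumerate(vectors):
        best = max(range(len(centroids)), key=lambda c: cosine(v, centroids[c]))
        clusters[best].append(i)
    return clusters


def approx_search(query, vectors, centroids, clusters, k=5, threshold=0.0):
    """Scan only the cluster whose centroid is most similar to the query."""
    best = max(range(len(centroids)), key=lambda c: cosine(query, centroids[c]))
    return top_k([(i, cosine(query, vectors[i])) for i in clusters[best]], k, threshold)


# Toy 2-D "embeddings"; real embeddings have hundreds of dimensions.
vectors = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.9), (0.7, 0.6)]
centroids = [(1.0, 0.0), (0.0, 1.0)]  # assumption: would come from k-means
clusters = build_clusters(vectors, centroids)

query = (0.6, 0.7)
exact = exact_search(query, vectors, k=2)
approx = approx_search(query, vectors, centroids, clusters, k=2)
```

In this toy data the best match (index 4) sits just across the cluster boundary from the query, so the exact search returns it while the single-cluster approximate search does not. That is the accuracy cost of the speedup.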
3. Working with Embeddings
You can also work directly with the embeddings:
# Get embedding for a single company
embedding = matcher.get_embedding("Apple Inc")
print(f"Embedding shape: {embedding.shape}")
# Get embeddings for multiple companies
embeddings = matcher.get_embeddings(["Microsoft", "Google"])
print(f"Embeddings shape: {embeddings.shape}")
📊 Performance Considerations
- For small datasets (<10,000 companies), use exact matching (`use_approx=False`)
- For large datasets, use approximate matching (`use_approx=True`) with an appropriate `n_clusters`
- When using approximate matching:
  - Build the index once and save it to disk
  - Load the index for subsequent uses
  - Adjust `n_clusters` based on your dataset size and speed/accuracy requirements
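As a rough aid for picking a cluster count, the sketch below estimates comparisons per query under a deliberately simplified model: clusters are assumed to be balanced, and only one cluster is scanned per query. Neither assumption is confirmed for this library, and the square-root starting point is a general rule of thumb rather than a documented recommendation. The function names are hypothetical.

```python
import math


def exact_comparisons(m: int) -> int:
    """A full scan compares the query with every indexed name."""
    return m


def approx_comparisons(m: int, n_clusters: int) -> float:
    """Compare with every centroid, then with one cluster of average size."""
    return n_clusters + m / n_clusters


def sqrt_heuristic(m: int) -> int:
    """Rule-of-thumb starting point: about sqrt(m) clusters."""
    return max(1, round(math.sqrt(m)))
```

Under this model, an index of 10,000 names needs 10,000 comparisons per exact query, 520 with 20 clusters, and 200 with 100 clusters. More clusters also mean smaller candidate sets and a higher chance of missing a match near a cluster boundary, so treat the result as a starting point and validate it on your own data.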
🤖 (Complementary) fine-tuned model
While you can load your own model into CompanyNameMatcher, we also provide a complementary fine-tuned model, available for download here on Google Drive. See the demo here.
- Fine-tuned Embeddings: We use a lightweight multilingual sentence transformer model (sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2) fine-tuned specifically for company names. This model was trained using contrastive learning, minimizing the cosine distance between similar company names.
- Special Tokens: During the training process, we added special tokens # to the training data. These tokens guide the model's understanding, explicitly informing it that it's embedding company names. This results in more accurate and context-aware embeddings.
- Cosine Similarity: We use cosine similarity to compare the resulting embeddings, providing a robust measure of similarity that works well with high-dimensional data.
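For readers unfamiliar with the measure, here is a minimal pure-Python sketch of cosine similarity. It only illustrates the formula; the library presumably computes it on model embeddings with optimized array code. The function name and the zero-vector convention are choices made for this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # scaled copy
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])
```

Because the score depends only on direction, a vector and a scaled copy of it score approximately 1, orthogonal vectors score 0, and opposite vectors score approximately -1. This is why a single threshold such as 0.7 can be applied regardless of embedding magnitude.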
Performance Comparison
Here's a comparison of different matching approaches on our test dataset:
| Metric | Fine-tuned Matcher | Default Matcher | RapidFuzz |
|---|---|---|---|
| Accuracy | 0.910 | 0.780 | 0.690 |
| Precision | 0.918 | 0.719 | 0.807 |
| Recall | 0.900 | 0.920 | 0.500 |
| F1 Score | 0.909 | 0.807 | 0.617 |
While RapidFuzz is faster, Company Name Matcher provides better accuracy and scalability (as the lists for matching increase in size, we can use k-means approximated matching).
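As a quick consistency check on the table, F1 is the harmonic mean of precision and recall, so the reported F1 values can be recomputed from the other two rows. The sketch below does that with the precision and recall figures copied from the table; the helper name is illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the comparison table.
reported = {
    "Fine-tuned Matcher": (0.918, 0.900),
    "Default Matcher": (0.719, 0.920),
    "RapidFuzz": (0.807, 0.500),
}

for name, (p, r) in reported.items():
    print(f"{name}: F1 = {f1_score(p, r):.3f}")
```

The recomputed values (0.909, 0.807, 0.617) agree with the F1 row. The table also shows the trade-off behind the numbers: the default matcher has the highest recall but lower precision, while RapidFuzz is fairly precise but finds only half of the true matches.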
📝 License
This project is licensed under the MIT License - see the LICENSE file for details.