A library for matching and comparing company names using a fine-tuned sentence transformer model

Company Name Matcher

Python 3.9+ | License: MIT

Company Name Matcher is a library for efficient matching of company names using vector search. It leverages a language model to generate embeddings specifically tailored for company names.

Advantages over Traditional Methods

While traditional string matching algorithms like those used in RapidFuzz are fast for small datasets, Company Name Matcher offers several advantages, especially when dealing with larger datasets:

  1. Scalability: The embedding-based approach scales better to larger datasets through efficient vector search. Traditional methods compare every name in list A with every name in list B, which is O(n * m) for lists of length n and m. Optimized vector search narrows each query to a subset of candidates, reducing the cost to roughly O(n log m).

  2. Contextual Understanding: Embeddings capture the context and semantics of company names, allowing for more intelligent matching that goes beyond simple string similarity.

  3. Customizability: The underlying model can be fine-tuned on domain-specific data, allowing for improved performance in specialized use cases.
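The scalability point can be illustrated with a toy example. The sketch below is not the library's implementation: the 3-D vectors and centroids are hand-made stand-ins for real embeddings and k-means output, and the function names are hypothetical. It only shows why restricting a query to its nearest cluster reduces the number of comparisons.

```python
import math

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hand-made toy "embeddings" (assumption: real ones come from the model).
vectors = {
    "Apple Inc": (1.0, 0.1, 0.0),
    "Apple Incorporated": (0.9, 0.2, 0.0),
    "Microsoft Corporation": (0.1, 1.0, 0.0),
    "Microsoft Corp": (0.2, 0.9, 0.1),
    "Google LLC": (0.0, 0.1, 1.0),
    "Google": (0.1, 0.0, 0.9),
}
# Hypothetical cluster centres, standing in for what k-means would learn.
centroids = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]

def nearest_centroid(vec):
    return max(range(len(centroids)), key=lambda i: cosine(vec, centroids[i]))

# Index step: assign every name to its closest centroid once.
clusters = {}
for name, vec in vectors.items():
    clusters.setdefault(nearest_centroid(vec), []).append(name)

def best_match(query, names):
    return max(names, key=lambda name: cosine(query, vectors[name]))

def exact_search(query):
    """Compare the query with every indexed name."""
    names = list(vectors)
    return best_match(query, names), len(names)

def approx_search(query):
    """Compare the query only with names in its nearest cluster."""
    names = clusters[nearest_centroid(query)]
    return best_match(query, names), len(names)

query = (1.0, 0.1, 0.0)
print(exact_search(query))   # best match and number of names compared
print(approx_search(query))  # same best match here, fewer comparisons
```

The approximate search also spends one comparison per centroid to pick the cluster, and it can miss a match that happens to sit in a neighbouring cluster.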

🚀 Installation

pip install company-name-matcher

If you run into an OpenMP compatibility issue with scikit-learn, installing from a local checkout with pip install . --no-binary scikit-learn is recommended.

📣 Features

  • K-Means approximated matching: Use vector search with either exact matching or faster k-means approximated matching
  • Easily expand index: Add new companies to the existing index without rebuilding it from scratch
  • Efficient batch processing: Process multiple companies in parallel, with caching, for faster matching
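To give a feel for why caching helps in batch processing, here is a minimal sketch using only the standard library. It does not show the library's internal cache; `embed` is a deliberately trivial, hypothetical stand-in for an expensive model call.

```python
from functools import lru_cache

calls = 0  # how many times the "expensive" step actually ran

def embed(name):
    """Hypothetical stand-in for a model call (not a real embedding)."""
    global calls
    calls += 1
    return (len(name), sum(map(ord, name)) % 97)

@lru_cache(maxsize=10_000)
def cached_embed(name):
    # Repeated names are served from the cache instead of being re-embedded.
    return embed(name)

names = ["Apple Inc", "Google LLC", "Apple Inc", "Apple Inc"]
embeddings = [cached_embed(n) for n in names]
print(calls)  # 2: one call per distinct name
```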

📚 Quick Start

1. Basic Usage

from company_name_matcher import CompanyNameMatcher

# Initialize with default model
matcher = CompanyNameMatcher("paraphrase-multilingual-MiniLM-L12-v2")

# Or initialize with custom preprocessing
def preprocess_name(name):
    return name.lower().strip()

matcher = CompanyNameMatcher(
    "paraphrase-multilingual-MiniLM-L12-v2",
    preprocess_fn=preprocess_name
)

# Compare two company names
similarity = matcher.compare_companies("Apple Inc", "Apple Incorporated")
print(f"Similarity: {similarity}")
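The preprocess_fn hook can do more than lowercasing. The sketch below is one possible preprocessing function, written under assumptions: the suffix list is hypothetical and incomplete, and whether stripping legal forms improves matching depends on your data and model, so it is worth evaluating before adopting.

```python
import re

# Hypothetical list of legal-form suffixes; extend it for your own data.
LEGAL_SUFFIXES = ("incorporated", "inc", "corporation", "corp", "llc", "ltd", "gmbh")
_SUFFIX_RE = re.compile(r"\b(?:" + "|".join(LEGAL_SUFFIXES) + r")\b\.?")

def preprocess_name(name: str) -> str:
    """Lowercase, drop legal-form suffixes and punctuation, collapse spaces."""
    name = name.lower().strip()
    name = _SUFFIX_RE.sub(" ", name)        # remove "inc", "llc", ...
    name = re.sub(r"[^\w\s]", " ", name)    # remove punctuation
    return re.sub(r"\s+", " ", name).strip()

print(preprocess_name("  Apple Inc. "))          # apple
print(preprocess_name("Microsoft Corporation"))  # microsoft
```

A function like this would be passed as preprocess_fn=preprocess_name when constructing the matcher, as in the snippet above. Note that it also strips a suffix word when it is a genuine part of the name.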

2. Bulk Matching with Vector Search

For large datasets, you can use vector search with either exact or approximate matching:

# Your list of companies to match against
companies_to_match = ["Microsoft Corporation", "Apple Inc", "Google LLC", ...]

# Build and save index (only needed once)
matcher.build_index(
    companies_to_match,
    n_clusters=20,  # Adjust based on dataset size
    save_dir="index_files"  # Optional: save index to disk
)

# Or load a previously saved index
matcher.load_index(load_dir="index_files")

# 1. Exact Search (more accurate but slower)
exact_matches = matcher.find_matches(
    "Apple",
    threshold=0.7,
    k=5,
    use_approx=False
)
print("Exact matches:", exact_matches)

# 2. Approximate Search (faster but may miss some matches)
approx_matches = matcher.find_matches(
    "Apple",
    threshold=0.7,
    k=5,
    use_approx=True
)
print("Approximate matches:", approx_matches)
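The threshold and k arguments can be read as a two-step filter: discard candidates below the similarity threshold, then keep the k best. The sketch below illustrates that reading with made-up scores; it is an assumption about the semantics, not the library's code, and `top_matches` is a hypothetical helper.

```python
import heapq

def top_matches(scores, threshold=0.7, k=5):
    """Return at most k (name, score) pairs with score >= threshold, best first."""
    above = [(name, s) for name, s in scores.items() if s >= threshold]
    return heapq.nlargest(k, above, key=lambda pair: pair[1])

# Made-up similarity scores for the query "Apple" (not real model output).
scores = {
    "Apple Inc": 0.93,
    "Apple Incorporated": 0.91,
    "Applied Materials": 0.72,
    "Google LLC": 0.31,
}
print(top_matches(scores, threshold=0.7, k=2))
```

Raising the threshold trades recall for precision, while k only caps how many of the surviving candidates are returned.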

3. Working with Embeddings

You can also work directly with the embeddings:

# Get embedding for a single company
embedding = matcher.get_embedding("Apple Inc")
print(f"Embedding shape: {embedding.shape}")

# Get embeddings for multiple companies
embeddings = matcher.get_embeddings(["Microsoft", "Google"])
print(f"Embeddings shape: {embeddings.shape}")

📊 Performance Considerations

  1. For small datasets (<10,000 companies), use exact matching (use_approx=False)
  2. For large datasets, use approximate matching (use_approx=True) with appropriate n_clusters
  3. When using approximate matching:
    • Build the index once and save it to disk
    • Load the index for subsequent uses
    • Adjust n_clusters based on your dataset size and speed/accuracy requirements
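A rough cost model can help when picking n_clusters. The sketch below assumes balanced clusters and that only the single nearest cluster is searched per query; neither is guaranteed by the library, so treat the numbers as an order-of-magnitude guide. Under those assumptions a query costs about n_clusters + m / n_clusters comparisons, which is smallest near the square root of the index size m.

```python
import math

def comparisons_per_query(m: int, n_clusters: int) -> float:
    """Centroid comparisons plus comparisons inside one average-sized cluster."""
    return n_clusters + m / n_clusters

m = 1_000_000  # hypothetical index size
for k in (20, 100, 1_000, 10_000):
    print(k, comparisons_per_query(m, k))

best_k = round(math.sqrt(m))
print(best_k)  # 1000
```

More clusters also means smaller clusters, which makes it more likely that a true match falls outside the one being searched, so speed is not the only consideration.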

🤖 (Complementary) fine-tuned model

While you can load your own model into CompanyNameMatcher, we also provide a complementary fine-tuned model, available for download on Google Drive, along with a demo.

  1. Fine-tuned Embeddings: We use a lightweight multilingual sentence transformer model (sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2) fine-tuned specifically for company names. This model was trained using contrastive learning, minimizing the cosine distance between similar company names.

  2. Special Tokens: During the training process, we added special tokens # to the training data. These tokens guide the model's understanding, explicitly informing it that it's embedding company names. This results in more accurate and context-aware embeddings.

  3. Cosine Similarity: We use cosine similarity to compare the resulting embeddings, providing a robust measure of similarity that works well with high-dimensional data.
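For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, cos(a, b) = (a · b) / (|a| |b|). The following is a minimal pure-Python sketch of that formula on toy vectors; the library itself operates on model embeddings and may compute it differently.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

print(cosine_similarity([1, 0], [2, 0]))   # same direction
print(cosine_similarity([1, 0], [0, 1]))   # orthogonal
print(cosine_similarity([1, 0], [-1, 0]))  # opposite direction
```

Because the result depends only on direction, scaling a vector does not change its similarity to another vector.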

Performance Comparison

Here's a comparison of different matching approaches on our test dataset:

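| Metric    | Fine-tuned Matcher | Default Matcher | RapidFuzz |
|-----------|--------------------|-----------------|-----------|
| Accuracy  | 0.910              | 0.780           | 0.690     |
| Precision | 0.918              | 0.719           | 0.807     |
| Recall    | 0.900              | 0.920           | 0.500     |
| F1 Score  | 0.909              | 0.807           | 0.617     |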

While RapidFuzz is faster, Company Name Matcher provides better accuracy and scalability (as the lists for matching increase in size, we can use k-means approximated matching).
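The reported F1 scores are consistent with the reported precision and recall, since F1 is their harmonic mean, F1 = 2PR / (P + R). A short check, using only the figures from the comparison above:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) as given in the comparison.
reported = {
    "Fine-tuned Matcher": (0.918, 0.900, 0.909),
    "Default Matcher": (0.719, 0.920, 0.807),
    "RapidFuzz": (0.807, 0.500, 0.617),
}
for name, (p, r, f1) in reported.items():
    print(name, round(f1_score(p, r), 3), f1)
```

The figures also show that the default matcher has the highest recall (0.920) but lower precision, which is why its F1 score trails the fine-tuned model.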

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.
