CanonMap - A Python library for entity canonicalization and mapping
CanonMap
A Python library for entity matching and data canonicalization. CanonMap identifies, matches, and standardizes entities across your datasets by combining semantic, fuzzy, initial, keyword, and phonetic matching.
Key Features
- Multi-strategy Entity Matching: Combines multiple matching strategies for robust entity identification:
  - Semantic matching (45%): Uses transformer embeddings for understanding meaning
  - Fuzzy matching (35%): Handles typos and variations
  - Initial matching (10%): Matches abbreviations and initials
  - Keyword matching (5%): Matches individual words
  - Phonetic matching (5%): Sound-based matching using Double Metaphone
- Smart Scoring System: Sophisticated scoring with bonus points for:
  - High semantic + fuzzy score combinations (+10 points)
  - Perfect initial matches (+10 points)
  - Perfect phonetic matches (+5 points)
  - Penalties for mismatched high fuzzy/low semantic scores (-15 points)
- Intelligent Entity Extraction:
  - Automatic entity detection using spaCy NER
  - Smart handling of name fields and patterns
  - Configurable uniqueness ratios and length thresholds
  - Support for both manual field selection and automatic extraction
- Data Processing:
  - CSV file processing with schema inference
  - Metadata generation and management
  - Entity normalization and standardization
  - Support for custom field mapping
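The percentages listed under Multi-strategy Entity Matching read as weights in a weighted sum. The sketch below illustrates that reading only. It assumes each strategy reports a score on a 0-100 scale, which the source does not confirm, and `DEFAULT_WEIGHTS` and `combine_scores` are hypothetical names rather than part of the CanonMap API.

```python
from typing import Dict

# Weights as listed in Key Features, written as fractions of 1.0 (assumption).
DEFAULT_WEIGHTS: Dict[str, float] = {
    "semantic": 0.45,
    "fuzzy": 0.35,
    "initial": 0.10,
    "keyword": 0.05,
    "phonetic": 0.05,
}


def combine_scores(scores: Dict[str, float],
                   weights: Dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """Return the weighted sum of per-strategy scores (each assumed 0-100).

    Strategies missing from `scores` contribute 0.
    """
    return sum(weight * scores.get(name, 0.0) for name, weight in weights.items())


# Made-up inputs: strong semantic and fuzzy scores, a weaker keyword score.
example = {"semantic": 90.0, "fuzzy": 80.0, "initial": 100.0,
           "keyword": 50.0, "phonetic": 100.0}
print(round(combine_scores(example), 2))  # 86.0
```

Because semantic and fuzzy matching carry 80% of the weight between them, the other three strategies mostly act as tie-breakers under this reading.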
Installation
pip install canonmap
Dependencies
- Python 3.8 or higher
- spaCy and its English language model (automatically downloaded on first use)
Quick Start
from canonmap import CanonMap

# Initialize the library
canon = CanonMap()

# Generate artifacts from a CSV file
artifacts = canon.generate_artifacts(
    csv_path="path/to/your/data.csv",
    output_path="output",  # Optional: directory to save artifacts
    name="my_data",  # Base name for output files
    entity_fields=["name", "email"],  # Optional: specify entity fields
    use_other_fields_as_metadata=True,  # Include other columns as metadata
    num_rows=None,  # Optional: limit number of rows to process
    embed=True  # Whether to compute and save embeddings
)

# The artifacts dictionary contains:
# - metadata: List of entity objects with their metadata
# - schema: Nested dictionary of data types and formats
# - paths: Dictionary of paths to saved artifacts
# - embeddings: Optional numpy array of entity embeddings

# Match entities against your data
matches = canon.match_entity(
    entity_term="John Smith",
    metadata_path=artifacts["paths"]["metadata"],
    schema_path=artifacts["paths"]["schema"],
    embedding_path=artifacts["paths"]["embeddings"],  # Required for semantic search
    top_k=5,  # Maximum number of results to return
    threshold=80.0,  # Minimum score threshold (default: 0)
    field_filter=["name", "contact_name"],  # Optional: restrict matching to specific fields
    use_semantic_search=True,  # Enable semantic search (default: False)
    weights=None  # Optional: customize matching strategy weights
)

# Process results
for match in matches:
    print(f"Entity: {match['entity']}")
    print(f"Score: {match['score']}")
    print(f"Passes: {match['passes']}")  # Number of matching strategies that passed
    print(f"Metadata: {match['metadata']}")
    print("---")
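Once results come back, a common next step is to pick one match to act on. The sketch below uses only the result keys printed in the Quick Start (`entity`, `score`, `passes`, `metadata`). The helper `best_match`, its `min_passes` default, and the sample data are hypothetical and chosen so the snippet runs without CanonMap installed.

```python
from typing import Any, Dict, List, Optional


def best_match(matches: List[Dict[str, Any]],
               min_passes: int = 2) -> Optional[Dict[str, Any]]:
    """Return the highest-scoring match with at least `min_passes` passes, or None."""
    eligible = [m for m in matches if m["passes"] >= min_passes]
    return max(eligible, key=lambda m: m["score"], default=None)


# Stand-in data shaped like the results printed in the Quick Start (values are made up).
sample_matches = [
    {"entity": "John Smith", "score": 96.5, "passes": 4, "metadata": {"email": "js@example.com"}},
    {"entity": "Jon Smyth", "score": 84.0, "passes": 2, "metadata": {}},
    {"entity": "Joan Smithe", "score": 99.0, "passes": 1, "metadata": {}},
]

top = best_match(sample_matches)
print(top["entity"] if top else "no confident match")  # John Smith
```

Requiring more than one pass guards against a result that scores highly on a single strategy only, as the third sample entry does.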
Advanced Usage
Custom Matching Weights
# Customize the matching strategy weights
custom_weights = {
    'semantic': 0.50,  # Increase semantic matching importance
    'fuzzy': 0.30,  # Decrease fuzzy matching
    'initial': 0.10,  # Keep initial matching
    'keyword': 0.05,  # Keep keyword matching
    'phonetic': 0.05  # Keep phonetic matching
}

matches = canon.match_entity(
    entity_term="John Smith",
    metadata_path="metadata.pkl",
    schema_path="schema.pkl",
    embedding_path="embeddings.npz",  # Required for semantic search
    weights=custom_weights
)
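The source does not state whether custom weights must sum to 1.0. If you want to guarantee that before passing a dictionary as `weights`, a helper along these lines could validate and rescale it. `STRATEGIES` and `normalize_weights` are hypothetical names and not part of CanonMap.

```python
from typing import Dict

# Strategy names taken from the custom weights example.
STRATEGIES = ("semantic", "fuzzy", "initial", "keyword", "phonetic")


def normalize_weights(weights: Dict[str, float]) -> Dict[str, float]:
    """Validate strategy names and rescale the weights so they sum to 1.0."""
    unknown = set(weights) - set(STRATEGIES)
    if unknown:
        raise ValueError(f"Unknown strategies: {sorted(unknown)}")
    if any(value < 0 for value in weights.values()):
        raise ValueError("Weights must be non-negative")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("At least one weight must be positive")
    # Strategies that were left out get a weight of 0.
    return {name: weights.get(name, 0.0) / total for name in STRATEGIES}


# Relative importances in any unit; the helper rescales them to fractions.
custom_weights = normalize_weights(
    {"semantic": 5, "fuzzy": 3, "initial": 1, "keyword": 0.5, "phonetic": 0.5}
)
```

Catching a misspelled strategy name up front is the main benefit here, since a silently ignored key would be hard to notice in the final scores.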
Field-Specific Matching
# Restrict matching to specific fields
matches = canon.match_entity(
    entity_term="John Smith",
    metadata_path="metadata.pkl",
    schema_path="schema.pkl",
    embedding_path="embeddings.npz",
    field_filter=["customer_name", "contact_name"],
    use_semantic_search=True
)
Features in Detail
Entity Extraction
- Automatic detection of entity fields
- Support for custom entity field selection
- Intelligent handling of name patterns
- Configurable uniqueness thresholds
- Length-based filtering
- spaCy NER integration for complex text
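To make the uniqueness and length criteria concrete, here is a minimal sketch of how a column could be screened as an entity field. It is not CanonMap's implementation: `candidate_entity_fields`, its parameter names, and the default thresholds are all assumptions chosen for illustration.

```python
from typing import Dict, List


def candidate_entity_fields(rows: List[Dict[str, str]],
                            min_uniqueness: float = 0.8,
                            min_length: int = 3,
                            max_length: int = 60) -> List[str]:
    """Pick columns whose values look like entity names.

    A column qualifies when the share of distinct non-empty values is at
    least `min_uniqueness` and the average value length lies within
    [min_length, max_length].
    """
    if not rows:
        return []
    selected = []
    for field in rows[0]:
        values = [row.get(field, "").strip() for row in rows]
        values = [v for v in values if v]
        if not values:
            continue
        uniqueness = len(set(values)) / len(values)
        average_length = sum(len(v) for v in values) / len(values)
        if uniqueness >= min_uniqueness and min_length <= average_length <= max_length:
            selected.append(field)
    return selected


rows = [
    {"name": "John Smith", "state": "NY", "status": "active"},
    {"name": "Jane Doe", "state": "CA", "status": "active"},
    {"name": "Raj Patel", "state": "WA", "status": "active"},
    {"name": "Ana Lima", "state": "TX", "status": "inactive"},
]
print(candidate_entity_fields(rows))  # ['name']
```

In this sample, `state` is rejected by the length filter (two-character codes) and `status` by the uniqueness ratio (two distinct values in four rows).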
Matching Process
- Semantic pruning (if enabled)
- Multi-strategy scoring
- Weighted combination of scores
- Bonus/penalty application
- Result ranking and filtering
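The last two steps, bonus/penalty application and result ranking, can be sketched as follows. The point values come from the Smart Scoring System list, but the cutoffs for "high", "low", and "perfect" are not given in the source, so `HIGH`, `LOW`, and `PERFECT` below are assumed values, and both function names are hypothetical.

```python
from typing import Dict, List, Tuple

# Assumed cutoffs: the source does not define "high", "low", or "perfect".
HIGH = 80.0
LOW = 50.0
PERFECT = 100.0


def apply_adjustments(base_score: float, scores: Dict[str, float]) -> float:
    """Add the bonuses and the penalty listed under Smart Scoring System."""
    adjusted = base_score
    if scores["semantic"] >= HIGH and scores["fuzzy"] >= HIGH:
        adjusted += 10  # high semantic + fuzzy combination
    if scores["initial"] >= PERFECT:
        adjusted += 10  # perfect initial match
    if scores["phonetic"] >= PERFECT:
        adjusted += 5  # perfect phonetic match
    if scores["fuzzy"] >= HIGH and scores["semantic"] < LOW:
        adjusted -= 15  # high fuzzy but low semantic
    return adjusted


def rank_matches(scored: List[Tuple[str, float]], top_k: int = 5,
                 threshold: float = 0.0) -> List[Tuple[str, float]]:
    """Drop entries below `threshold`, sort by score descending, keep `top_k`."""
    kept = [item for item in scored if item[1] >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)[:top_k]


# Made-up per-strategy scores and a weighted base score of 75.
strategy_scores = {"semantic": 90.0, "fuzzy": 85.0, "initial": 100.0,
                   "keyword": 60.0, "phonetic": 70.0}
print(apply_adjustments(75.0, strategy_scores))  # 95.0
```

The sketch does not clamp the adjusted score to a maximum, so a strong match can exceed 100. Whether CanonMap caps scores is not stated in the source.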
Data Processing
- Schema inference
- Data type detection
- Date format recognition
- Metadata generation
- Entity normalization
- Custom field mapping
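As a rough illustration of schema inference with data type detection and date format recognition, the sketch below classifies a column of strings using only the standard library. It is an assumption about the general technique, not CanonMap's code: `DATE_FORMATS`, `detect_date_format`, and `infer_column_type` are hypothetical, and the returned dictionary shape is a guess at what "data types and formats" might look like.

```python
from datetime import datetime
from typing import Callable, Dict, List, Optional

# Formats to try, in order. The first one that parses every value wins,
# so ambiguous dates such as 01/02/2024 resolve to month-first here.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d/%m/%Y")


def detect_date_format(values: List[str]) -> Optional[str]:
    """Return the first format that parses every value, or None."""
    for fmt in DATE_FORMATS:
        try:
            for value in values:
                datetime.strptime(value, fmt)
        except ValueError:
            continue
        else:
            return fmt
    return None


def _all_parse(values: List[str], cast: Callable[[str], object]) -> bool:
    """Return True if `cast` accepts every value without raising ValueError."""
    try:
        for value in values:
            cast(value)
    except ValueError:
        return False
    return True


def infer_column_type(values: List[str]) -> Dict[str, str]:
    """Classify a column of strings as integer, float, date, or string."""
    cleaned = [v.strip() for v in values if v and v.strip()]
    if not cleaned:
        return {"type": "string"}
    if _all_parse(cleaned, int):
        return {"type": "integer"}
    if _all_parse(cleaned, float):
        return {"type": "float"}
    fmt = detect_date_format(cleaned)
    if fmt:
        return {"type": "date", "format": fmt}
    return {"type": "string"}


print(infer_column_type(["2024-01-15", "2023-12-31"]))
# {'type': 'date', 'format': '%Y-%m-%d'}
```

Checking integers before floats and numbers before dates keeps the most specific type; a column falls back to `string` only when nothing else parses every value.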
Requirements
- Python 3.8+
- See setup.py for full list of dependencies
License
MIT License
File details
Details for the file canonmap-0.1.53.tar.gz.
File metadata
- Download URL: canonmap-0.1.53.tar.gz
- Upload date:
- Size: 24.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | bc7d48a918dfec39877184fbf71ac9f02a77e46191b3ef54f8eb892a53099b45 |
| MD5 | 04b144a3da6c25794d035c0b3b68b737 |
| BLAKE2b-256 | 3939a3c28cc8d99fe2d25654b422276c359425bb72b6171cdd9ece8e733d56b7 |
File details
Details for the file canonmap-0.1.53-py3-none-any.whl.
File metadata
- Download URL: canonmap-0.1.53-py3-none-any.whl
- Upload date:
- Size: 28.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7ecb34832c4b681bf4adb15a9a06934a8b2ae32fd5d7d6e4e154c706b06ee8ec |
| MD5 | 5680f40431313330c1c8192afc8bfea8 |
| BLAKE2b-256 | ed6e46adb3d459875847b685893be2fb5193723796b3f8d75802c4c4b9230a62 |