
CanonMap - A Python library for entity canonicalization and mapping

CanonMap

A Python library for entity matching and data canonicalization. CanonMap combines semantic, fuzzy, initial, keyword, and phonetic matching to identify, match, and standardize entities across your datasets.

Key Features

  • Multi-strategy Entity Matching: Combines multiple matching strategies for robust entity identification:

    • Semantic matching (45%): Uses transformer embeddings for understanding meaning
    • Fuzzy matching (35%): Handles typos and variations
    • Initial matching (10%): Matches abbreviations and initials
    • Keyword matching (5%): Matches individual words
    • Phonetic matching (5%): Sound-based matching using Double Metaphone
  • Smart Scoring System: Adjusts the weighted score with bonuses and a penalty:

    • High semantic + fuzzy score combinations (+10 points)
    • Perfect initial matches (+10 points)
    • Perfect phonetic matches (+5 points)
    • High fuzzy score paired with a low semantic score (-15 points)
  • Intelligent Entity Extraction:

    • Automatic entity detection using spaCy NER
    • Smart handling of name fields and patterns
    • Configurable uniqueness ratios and length thresholds
    • Support for both manual field selection and automatic extraction
  • Data Processing:

    • CSV file processing with schema inference
    • Metadata generation and management
    • Entity normalization and standardization
    • Support for custom field mapping
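
To make the strategy weights and the bonus/penalty rules above concrete, here is a minimal sketch of how such a score could be assembled. It is not CanonMap's actual implementation: the function name `combine_scores`, the `HIGH`/`LOW` cut-offs, and the final clamp to the 0-100 range are assumptions for illustration, since the README does not define what counts as a "high" or "low" score.

```python
from typing import Dict

# Default weights as listed in the feature overview (they sum to 1.0).
DEFAULT_WEIGHTS: Dict[str, float] = {
    "semantic": 0.45,
    "fuzzy": 0.35,
    "initial": 0.10,
    "keyword": 0.05,
    "phonetic": 0.05,
}

# Assumed cut-offs: the README does not state what "high" and "low" mean.
HIGH = 80.0
LOW = 40.0


def combine_scores(scores: Dict[str, float],
                   weights: Dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """Combine per-strategy scores (each 0-100) into one adjusted score."""
    # Weighted sum of the individual strategy scores.
    total = sum(weights[name] * scores.get(name, 0.0) for name in weights)

    semantic = scores.get("semantic", 0.0)
    fuzzy = scores.get("fuzzy", 0.0)

    # Bonuses described in the feature list.
    if semantic >= HIGH and fuzzy >= HIGH:
        total += 10
    if scores.get("initial", 0.0) == 100.0:
        total += 10
    if scores.get("phonetic", 0.0) == 100.0:
        total += 5

    # Penalty for a high fuzzy score that the semantic score does not support.
    if fuzzy >= HIGH and semantic <= LOW:
        total -= 15

    # Assumption: keep the result on the same 0-100 scale as the inputs.
    return max(0.0, min(100.0, total))
```

Under these assumed cut-offs, a candidate with semantic 90, fuzzy 85, and keyword 50 (other strategies 0) gets a weighted sum of 72.75, and the semantic + fuzzy bonus raises it to 82.75.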

Installation

pip install canonmap

Dependencies

  • Python 3.8 or higher
  • spaCy and its English language model (automatically downloaded on first use)

Quick Start

from canonmap import CanonMap

# Initialize the library
canon = CanonMap()

# Generate artifacts from a CSV file
artifacts = canon.generate_artifacts(
    csv_path="path/to/your/data.csv",
    entity_fields=["name", "email"],
    use_other_fields_as_metadata=True
)

# Save artifacts to files
zip_path = canon.save_artifacts(
    artifacts=artifacts,
    output_path="output",
    name="my_data"
)

# Match entities against your data
matches = canon.match_entity(
    query="John Smith",
    metadata_path="output/metadata.pkl",
    schema_path="output/schema.pkl",
    embedding_path="output/embeddings.npz",
    top_k=5,
    threshold=80.0,
    use_semantic_search=True
)

# Process results
for match in matches:
    print(f"Entity: {match['entity']}")
    print(f"Score: {match['score']}")
    print(f"Metadata: {match['metadata']}")
    print("---")

Advanced Usage

Custom Matching Weights

# Customize the matching strategy weights
custom_weights = {
    'semantic': 0.50,  # Increase semantic matching importance
    'fuzzy': 0.30,     # Decrease fuzzy matching
    'initial': 0.10,   # Keep initial matching
    'keyword': 0.05,   # Keep keyword matching
    'phonetic': 0.05   # Keep phonetic matching
}

matches = canon.match_entity(
    query="John Smith",
    metadata_path="metadata.pkl",
    schema_path="schema.pkl",
    weights=custom_weights
)
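
The README does not say whether `match_entity` validates or rescales custom weights, so it may be worth checking them before passing them in. The helper below is a hypothetical convenience, not part of CanonMap: it verifies that all five strategy names are present and rescales the values so they sum to 1.0.

```python
from typing import Dict

# Strategy names taken from the weights example above.
EXPECTED_KEYS = {"semantic", "fuzzy", "initial", "keyword", "phonetic"}


def normalize_weights(weights: Dict[str, float]) -> Dict[str, float]:
    """Return a copy of `weights` rescaled to sum to 1.0 (hypothetical helper)."""
    missing = EXPECTED_KEYS - weights.keys()
    if missing:
        raise ValueError(f"missing weights for: {sorted(missing)}")
    if any(value < 0 for value in weights.values()):
        raise ValueError("weights must be non-negative")

    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")

    # Divide each weight by the total so the result sums to 1.0.
    return {name: value / total for name, value in weights.items()}
```

The returned dictionary can then be passed as `weights=` in the call shown above.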

Field-Specific Matching

# Restrict matching to specific fields
matches = canon.match_entity(
    query="John Smith",
    metadata_path="metadata.pkl",
    schema_path="schema.pkl",
    field_filter=["customer_name", "contact_name"]
)

Features in Detail

Entity Extraction

  • Automatic detection of entity fields
  • Support for custom entity field selection
  • Intelligent handling of name patterns
  • Configurable uniqueness thresholds
  • Length-based filtering
  • spaCy NER integration for complex text
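
As an illustration of how uniqueness and length thresholds can pick out likely entity columns, here is a small standard-library sketch. The function name `candidate_entity_fields`, its parameters, and the default values are assumptions for this example and are not CanonMap's API; the library's own heuristics (including the spaCy NER step) are not reproduced here.

```python
import csv
import io
from typing import Dict, List


def candidate_entity_fields(rows: List[Dict[str, str]],
                            min_uniqueness: float = 0.5,
                            max_avg_length: float = 30.0) -> List[str]:
    """Return columns whose values are mostly unique and reasonably short."""
    if not rows:
        return []

    fields = []
    for column in rows[0]:
        # Collect the non-empty, stripped values of this column.
        values = [v for v in ((row.get(column) or "").strip() for row in rows) if v]
        if not values:
            continue
        uniqueness = len(set(values)) / len(values)
        avg_length = sum(len(v) for v in values) / len(values)
        if uniqueness >= min_uniqueness and avg_length <= max_avg_length:
            fields.append(column)
    return fields


# Made-up sample data for the example.
SAMPLE_CSV = """name,country,notes
John Smith,US,Prefers to be contacted by email on weekdays only
Jane Doe,US,Account opened after a referral from a partner firm
Acme Corp,US,Billing address differs from the shipping address
J. Smith,US,Duplicate candidate flagged during the last import
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(candidate_entity_fields(rows))  # ['name']
```

Here `country` is rejected for low uniqueness and `notes` for length. A heuristic this simple would also accept ID-like columns, which is one reason manual field selection remains useful.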

Matching Process

  1. Semantic pruning (if enabled)
  2. Multi-strategy scoring
  3. Weighted combination of scores
  4. Bonus/penalty application
  5. Result ranking and filtering
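
The steps above can be sketched roughly as follows. This is a simplified stand-in, not CanonMap's code: toy vectors replace transformer embeddings, a single `difflib`-based fuzzy score stands in for the multi-strategy scoring, weighting, and bonus steps, and the function names `prune` and `rank` are hypothetical.

```python
import math
from difflib import SequenceMatcher
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def prune(query_vec: Sequence[float],
          candidates: Dict[str, Sequence[float]],
          keep: int) -> List[str]:
    """Step 1: keep only the `keep` candidates closest to the query vector."""
    ranked = sorted(candidates,
                    key=lambda name: cosine(query_vec, candidates[name]),
                    reverse=True)
    return ranked[:keep]


def fuzzy_score(a: str, b: str) -> float:
    """Stand-in for steps 2-4: a single 0-100 similarity score."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def rank(query: str, names: List[str],
         top_k: int = 5, threshold: float = 80.0) -> List[Tuple[str, float]]:
    """Step 5: drop scores below `threshold`, sort descending, keep `top_k`."""
    scored = [(name, fuzzy_score(query, name)) for name in names]
    kept = [pair for pair in scored if pair[1] >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]
```

Pruning first keeps the more expensive per-candidate scoring limited to a short list, which is presumably why the semantic step comes first when it is enabled.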

Data Processing

  • Schema inference
  • Data type detection
  • Date format recognition
  • Metadata generation
  • Entity normalization
  • Custom field mapping
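
For a sense of what data type detection and date format recognition involve, here is a minimal standard-library sketch. The candidate formats in `DATE_FORMATS`, the type labels, and the function names are assumptions for illustration; CanonMap's actual inference rules are not documented in this README.

```python
from datetime import datetime
from typing import Callable, List, Optional

# Assumed candidate formats, tried in order; the first that fits every value wins.
DATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d/%m/%Y"]


def detect_date_format(values: List[str]) -> Optional[str]:
    """Return the first format that parses every value, or None."""
    if not values:
        return None
    for fmt in DATE_FORMATS:
        try:
            for value in values:
                datetime.strptime(value, fmt)
        except ValueError:
            continue  # This format failed on some value; try the next one.
        return fmt
    return None


def _all_parse(cast: Callable[[str], object], values: List[str]) -> bool:
    """True if `cast` accepts every value without raising ValueError."""
    try:
        for value in values:
            cast(value)
    except ValueError:
        return False
    return True


def infer_type(values: List[str]) -> str:
    """Classify a column as 'integer', 'float', 'date', 'string', or 'empty'."""
    non_empty = [v.strip() for v in values if v and v.strip()]
    if not non_empty:
        return "empty"
    if _all_parse(int, non_empty):
        return "integer"
    if _all_parse(float, non_empty):
        return "float"
    if detect_date_format(non_empty) is not None:
        return "date"
    return "string"
```

Values such as `03/04/2024` fit both slash-separated formats, so the order of `DATE_FORMATS` decides the result; a real implementation would need a rule for that ambiguity.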

Requirements

  • Python 3.8+
  • See setup.py for full list of dependencies

License

MIT License
