A data matching and canonicalization library with multiple database connector support

CanonMap

A powerful data matching and canonicalization library with MySQL connector support.

Features

  • Data Matching: Advanced algorithms for fuzzy string matching and record linkage
  • MySQL Integration: Seamless connection and management of MySQL databases
  • Canonicalization: Standardize and normalize data across different formats
  • Rich Logging: Beautiful console output with structured logging
  • FastAPI Support: Optional FastAPI integration for web services
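
To make the canonicalization idea concrete, here is a minimal standard-library sketch of turning differently formatted strings into one comparable form. It does not use CanonMap's API; the function name `canonicalize` and the specific rules (Unicode NFKC normalization, case folding, punctuation removal, whitespace collapsing) are assumptions chosen for illustration.

```python
import re
import unicodedata


def canonicalize(value: str) -> str:
    """Return a simplified, comparable form of a string (illustrative rules only)."""
    # Unify equivalent Unicode representations (e.g. composed vs. decomposed accents).
    text = unicodedata.normalize("NFKC", value)
    # Case-insensitive comparison.
    text = text.casefold()
    # Replace punctuation with spaces, keeping letters, digits, and underscores.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace and trim the ends.
    return " ".join(text.split())


print(canonicalize("  ACME,  Inc. "))  # prints "acme inc"
print(canonicalize("Acme Inc"))        # prints "acme inc"
```

Records that differ only in case, punctuation, or spacing map to the same string, which is what makes later matching steps simpler.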

Installation

pip install canonmap

For development dependencies:

pip install canonmap[dev]

For FastAPI support:

pip install canonmap[fastapi]

Quick Start

Command Line Interface

CanonMap provides a CLI tool for quick project setup:

# Create a new API project (default name: app)
cm create-api

# Create a new API project with custom name
cm create-api --name my-api

# Create a new API project with spaces (will be normalized)
cm create-api --name "My API"

The CLI will automatically:

  • Normalize directory names to follow Python conventions
  • Auto-increment names if the directory already exists (app, app-2, app-3, etc.)
  • Copy and customize the example API template
  • Replace all references to "app" with your chosen name
  • Install required dependencies (fastapi, uvicorn, python-dotenv)
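
The name handling described above can be sketched roughly as follows. This is not the CLI's actual code: the helper names `normalize_name` and `next_available_name` are hypothetical, and the normalization rule (lowercase, with runs of other characters replaced by a hyphen) is an assumption that the real tool may not share.

```python
import re
from pathlib import Path


def normalize_name(raw: str) -> str:
    """Lowercase the name and replace runs of non-alphanumerics with a hyphen (assumed rule)."""
    slug = re.sub(r"[^a-z0-9]+", "-", raw.strip().lower()).strip("-")
    # Fall back to the default project name if nothing usable remains.
    return slug or "app"


def next_available_name(base: str, parent: Path) -> str:
    """Return base, or base-2, base-3, ... if the directory already exists under parent."""
    if not (parent / base).exists():
        return base
    n = 2
    while (parent / f"{base}-{n}").exists():
        n += 1
    return f"{base}-{n}"


print(normalize_name("My API"))  # prints "my-api"
print(next_available_name(normalize_name("My API"), Path.cwd()))
```

Separating normalization from the existence check keeps each step easy to test on its own.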

Basic Usage

from canonmap import make_console_handler
from canonmap.connectors.mysql_connector import MySQLConnector

# Set up logging
make_console_handler(set_root=True)

# Create a MySQL connector
connector = MySQLConnector(
    host="localhost",
    port=3306,
    user="your_user",
    password="your_password",
    database="your_database"
)

# Use the connector for data operations
# ... your data matching and canonicalization code

Data Matching Example

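from canonmap.connectors.mysql_connector.matching import Matcher

# Illustrative input records; the list-of-dicts shape is an assumption
source_records = [{"name": "Alice Smith", "address": "12 Main St"}]
target_records = [{"name": "Alise Smith", "address": "12 Main Street"}]

# Initialize matcher
matcher = Matcher()

# Perform fuzzy matching on the selected fields
matches = matcher.find_matches(
    source_data=source_records,
    target_data=target_records,
    fields_to_match=["name", "address"],
    threshold=0.8
)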
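
To show what a similarity threshold such as 0.8 means in practice, here is a small standalone sketch built on the standard library's `difflib.SequenceMatcher`. It is not CanonMap's matching algorithm; the functions `similarity` and `find_matches` below are illustrative stand-ins, and averaging the per-field scores is an assumption.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_matches(source, target, fields, threshold=0.8):
    """Return (source_index, target_index, score) for pairs scoring at or above threshold."""
    matches = []
    for i, src in enumerate(source):
        for j, tgt in enumerate(target):
            # Average the similarity over every field being compared.
            score = sum(similarity(str(src[f]), str(tgt[f])) for f in fields) / len(fields)
            if score >= threshold:
                matches.append((i, j, score))
    return matches


source = [{"name": "Alice Smith", "address": "1 Main St"}]
target = [
    {"name": "alice smith", "address": "1 main st"},
    {"name": "Zed Quux", "address": "99 Oak Ave"},
]
print(find_matches(source, target, ["name", "address"]))
```

Raising the threshold returns fewer, stricter matches; lowering it admits more candidates at the cost of more false positives. The nested loop compares every pair, so this form only suits small inputs.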

Table Management with Primary Key Support

CanonMap supports creating tables from data with intelligent primary key handling:

import pandas as pd
from canonmap.connectors.mysql_connector import MySQLConnector, TableManager
from canonmap.connectors.mysql_connector.managers.table_manager.validators.requests import CreateTableRequest
from canonmap.connectors.mysql_connector.managers.database_manager.validators.models import Database
from canonmap.connectors.mysql_connector.validators.models import IfExists

# Create connector and table manager
# `config` stands for your connection settings (see Basic Usage); define it before this line
connector = MySQLConnector(config)
table_manager = TableManager(connector)

# Example 1: Create table with user-specified primary key (if valid)
data = pd.DataFrame({
    'user_id': [1001, 1002, 1003, 1004, 1005],  # Unique values
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'email': ['alice@test.com', 'bob@test.com', 'charlie@test.com', 'david@test.com', 'eve@test.com']
})

request = CreateTableRequest(
    database=Database(name="my_database"),
    name="users",
    data=data,
    primary_key_field="user_id",  # Will be validated for uniqueness
    if_exists=IfExists.REPLACE
)

result = table_manager.create_table(request)
# Result: Table created with 'user_id' as PRIMARY KEY

Primary Key Validation Features:

  • Automatic Validation: The system validates that the specified field is unique and contains no null values
  • Smart Fallback: If validation fails, the table automatically falls back to an auto-increment id column
  • Logging: Clear log messages inform you about validation results and fallback decisions
  • Multiple Data Sources: Works with DataFrames, CSV files, lists of dictionaries, and more

Validation Rules:

  • Field must exist in the data
  • Field must contain no null/empty values
  • Field must have unique values across all rows
  • If any rule is violated, the table falls back to an auto-increment id PRIMARY KEY
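
The validation rules above can be sketched in plain Python for data given as a list of dictionaries. This is an illustration of the rules, not CanonMap's implementation: the function name `choose_primary_key` and the log wording are assumptions.

```python
import logging

logger = logging.getLogger(__name__)


def choose_primary_key(rows, field, fallback="id"):
    """Return field if it is a valid primary key for rows, else the fallback column name."""
    # Rule 1: the field must exist in the data (empty data cannot satisfy this).
    if not rows or any(field not in row for row in rows):
        logger.warning("Field %r is missing; falling back to %r", field, fallback)
        return fallback
    values = [row[field] for row in rows]
    # Rule 2: no null or empty values.
    if any(v is None or v == "" for v in values):
        logger.warning("Field %r has null/empty values; falling back to %r", field, fallback)
        return fallback
    # Rule 3: values must be unique across all rows.
    if len(set(values)) != len(values):
        logger.warning("Field %r has duplicate values; falling back to %r", field, fallback)
        return fallback
    return field


rows = [{"user_id": 1001, "name": "Alice"}, {"user_id": 1002, "name": "Bob"}]
print(choose_primary_key(rows, "user_id"))  # prints "user_id"
print(choose_primary_key(rows, "email"))    # prints "id"
```

Checking the rules in this order means the log message always names the first rule that failed.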

Documentation

For detailed documentation, visit the project homepage.

Development

Setup

  1. Clone the repository:
git clone https://github.com/yourusername/canonmap.git
cd canonmap
  2. Install development dependencies:
pip install -e ".[dev]"
  3. Run tests:
pytest

Code Quality

This project uses several tools to maintain code quality:

  • Black: Code formatting
  • isort: Import sorting
  • flake8: Linting
  • mypy: Type checking
  • pytest: Testing

Run all quality checks:

black src/ tests/
isort src/ tests/
flake8 src/ tests/
mypy src/
pytest

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests for new functionality
  5. Ensure all tests pass
  6. Submit a pull request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Changelog

See CHANGELOG.md for a list of changes and version history.
