CanonMap - A Python library for entity canonicalization and mapping

CanonMap

CanonMap is a Python library for generating and managing canonical entity artifacts from data files, directories, Pandas DataFrames, and Python dictionaries. From a single request object it produces standardized artifacts (canonical entity lists, database schemas, and semantic embeddings) that can be used for entity matching and data integration.

Features

  • Flexible Input Support: Process data from:

    • CSV/JSON files
    • Directories of data files
    • Pandas DataFrames
    • Python dictionaries
  • Artifact Generation:

    • Generate canonical entity lists
    • Create database schemas (supports multiple database types)
    • Generate semantic embeddings for entities
    • Clean and standardize field names
    • Process metadata fields
  • Database Support:

    • DuckDB (default)
    • SQLite
    • BigQuery
    • MariaDB
    • MySQL
    • PostgreSQL
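
To make the "Clean and standardize field names" feature concrete, here is a minimal sketch of what such a normalization step can look like. It is not CanonMap's implementation: the function name clean_field_name and the snake_case rules are assumptions chosen for illustration.

```python
import re


def clean_field_name(name: str) -> str:
    """Normalize a column name to lowercase snake_case (illustrative only)."""
    # Collapse every run of non-alphanumeric characters into one underscore.
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    # Insert an underscore at camelCase boundaries: "rushYards" -> "rush_Yards".
    cleaned = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", cleaned)
    # Drop leading/trailing underscores and lowercase the result.
    return cleaned.strip("_").lower()


# Hypothetical raw column headers.
columns = ["Player Name", "rushYards", " Team-ID "]
print([clean_field_name(c) for c in columns])
```

Normalizing names before schema generation keeps column identifiers valid across the supported database types, which differ in how they treat spaces and mixed case.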

Installation

pip install canonmap

Quick Start

from canonmap import CanonMap
from canonmap.models.artifact_generation_request import ArtifactGenerationRequest, EntityField

# Initialize CanonMap
cm = CanonMap()

# Configure artifact generation
config = ArtifactGenerationRequest(
    input_path="path/to/your/data.csv",
    output_path="path/to/output",
    source_name="my_source",
    table_name="my_table",
    # Fields whose values are treated as entities
    entity_fields=[
        EntityField(table_name="my_table", field_name="name"),
        EntityField(table_name="my_table", field_name="id"),
    ],
    # Fields used for semantic embeddings
    semantic_fields=["description", "notes"],
    generate_schema=True,
    generate_embeddings=True,
)

# Generate artifacts
results = cm.generate(config)

Artifact Generation Example

from canonmap import (
    CanonMap,
    ArtifactGenerationRequest,
    EntityField,
    ArtifactGenerationResponse,
)

cm = CanonMap()

# Read from "input" and write artifacts to "output";
# the entity fields span two tables (passing and rushing)
gen_req = ArtifactGenerationRequest(
    input_path="input",
    output_path="output",
    source_name="football_data",
    entity_fields=[
        EntityField(table_name="passing", field_name="player"),
        EntityField(table_name="rushing", field_name="rusher_name"),
    ],
    generate_schema=True,
    save_processed_data=True,  # also save the cleaned data
)

resp: ArtifactGenerationResponse = cm.generate(gen_req)

Entity Mapping Example

from canonmap import (
    CanonMap,
    EntityMappingRequest,
    TableFieldFilter,
    EntityMappingResponse,
)

# Point CanonMap at the artifacts written by the generation step
cm = CanonMap(artifacts_path="output")

mapping_request = EntityMappingRequest(
    # Raw entity strings to map
    entities=["tim brady", "jake alan"],
    # Restrict matching to the player field of the passing table
    filters=[
        TableFieldFilter(table_name="passing", table_fields=["player"])
    ],
    num_results=3,
)

resp: EntityMappingResponse = cm.map_entities(mapping_request)
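
The idea behind entity mapping, resolving a noisy string such as "tim brady" to its closest canonical entries, can be sketched with the standard library alone. This is a stand-in using difflib string similarity, not the matching method CanonMap uses. The canonical list, the top_matches helper, and its scoring are all hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical canonical entities; in practice these would come from
# the artifacts produced by the generation step.
canonical_players = ["Tom Brady", "Josh Allen", "Tim Boyle", "Jake Browning"]


def top_matches(query, candidates, num_results=3):
    """Rank candidates by case-insensitive similarity to the query string."""
    scored = [
        (SequenceMatcher(None, query.lower(), c.lower()).ratio(), c)
        for c in candidates
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(name, round(score, 2)) for score, name in scored[:num_results]]


for entity in ["tim brady", "jake alan"]:
    print(entity, "->", top_matches(entity, canonical_players))
```

Pure string similarity misses nicknames and reordered names, which is one reason a library in this space may also rely on semantic embeddings.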

Configuration Options

The ArtifactGenerationRequest model provides extensive configuration options:

  • Input/Output:

    • input_path: Path to data file/directory or DataFrame/dict
    • output_path: Directory for generated artifacts
    • source_name: Logical source name
    • table_name: Logical table name
  • Directory Processing:

    • recursive: Process subdirectories
    • file_pattern: File matching pattern (e.g., "*.csv")
    • table_name_from_file: Use filename as table name
  • Entity Processing:

    • entity_fields: List of fields to treat as entities
    • semantic_fields: Fields for semantic embeddings
    • use_other_fields_as_metadata: Include non-entity fields as metadata
  • Generation Options:

    • generate_canonical_entities: Generate entity list
    • generate_schema: Generate database schema
    • generate_embeddings: Generate semantic embeddings
    • save_processed_data: Save cleaned data
    • schema_database_type: Target database type
    • clean_field_names: Standardize field names
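
The directory-processing options above (recursive, file_pattern, table_name_from_file) describe a file discovery step. The sketch below shows one plausible reading of those semantics using pathlib. The helpers find_input_files and table_name_for are hypothetical and are not part of CanonMap.

```python
import tempfile
from pathlib import Path


def find_input_files(root, file_pattern="*.csv", recursive=False):
    """Return files under root that match file_pattern, in sorted order."""
    root = Path(root)
    # rglob descends into subdirectories; glob stays at the top level.
    matches = root.rglob(file_pattern) if recursive else root.glob(file_pattern)
    return sorted(p for p in matches if p.is_file())


def table_name_for(path):
    """Derive a table name from a filename, e.g. passing.csv gives passing."""
    return Path(path).stem


# Demonstrate on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "passing.csv").write_text("player\n")
    (Path(tmp) / "nested").mkdir()
    (Path(tmp) / "nested" / "rushing.csv").write_text("rusher_name\n")
    for path in find_input_files(tmp, "*.csv", recursive=True):
        print(table_name_for(path), "<-", path.name)
```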

Output

The generate() method returns a dictionary-like response (annotated as ArtifactGenerationResponse in the Artifact Generation Example) containing:

  • Generated artifacts
  • Paths to saved files
  • Schema information (if requested)
  • Embeddings (if requested)
  • Processed data (if requested)

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Download files

Source Distribution

canonmap-0.1.146.tar.gz (36.6 kB)

Uploaded Source

Built Distribution

canonmap-0.1.146-py3-none-any.whl (41.9 kB)

Uploaded Python 3

File details

Details for the file canonmap-0.1.146.tar.gz.

File metadata

  • Download URL: canonmap-0.1.146.tar.gz
  • Size: 36.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.1

File hashes

Hashes for canonmap-0.1.146.tar.gz
Algorithm Hash digest
SHA256 dd0640f1b52bb90bc156868084c63c16a8761cdf117bde4f61ee55bbf9306e73
MD5 b7f5b037ca327d167443cbfeb81d8a37
BLAKE2b-256 d37b423097f73110a4700170820f67546b13bb71e0108a1f49df82a830a1adfd

File details

Details for the file canonmap-0.1.146-py3-none-any.whl.

File metadata

  • Download URL: canonmap-0.1.146-py3-none-any.whl
  • Size: 41.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.1

File hashes

Hashes for canonmap-0.1.146-py3-none-any.whl
Algorithm Hash digest
SHA256 f7bebf7473ff0300034d6396f73c7a7304d180a2e2d7d1047e444508ed6c78d8
MD5 8c0bcccba07cf845daa034e33ba6cac8
BLAKE2b-256 d7ec3aca1da4e172c0d75f1fce6b49687c62525c5b6287b073ba8764497b9ed9
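
A downloaded file can be checked against the SHA256 digests listed above with Python's hashlib. This is a minimal sketch: the helper names sha256_of and matches_published_hash are illustrative, and the local file path in the usage note is an assumption about where the download was saved.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Return True if the file's SHA256 equals the published digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, matches_published_hash("canonmap-0.1.146-py3-none-any.whl", "f7bebf7473ff0300034d6396f73c7a7304d180a2e2d7d1047e444508ed6c78d8") should return True for an unmodified copy of the wheel saved in the current directory.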
