CanonMap - A Python library for entity canonicalization and mapping

CanonMap

CanonMap is a Python library for generating and managing canonical entity artifacts from various data sources. It provides a streamlined interface for processing data files and generating standardized artifacts that can be used for entity matching and data integration.

Features

  • Flexible Input Support: Process data from:

    • CSV/JSON files
    • Directories of data files
    • Pandas DataFrames
    • Python dictionaries
  • Artifact Generation:

    • Generate canonical entity lists
    • Create database schemas (supports multiple database types)
    • Generate semantic embeddings for entities
    • Clean and standardize field names
    • Process metadata fields
  • Database Support:

    • DuckDB (default)
    • SQLite
    • BigQuery
    • MariaDB
    • MySQL
    • PostgreSQL
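
Of the features above, field-name cleaning is the easiest to illustrate in isolation. The sketch below is not CanonMap's implementation: the function name clean_field_name and the snake_case rules are assumptions, chosen only to show the kind of normalization that makes column names safe to use as database identifiers.

```python
import re


def clean_field_name(name: str) -> str:
    """Hypothetical normalizer: lowercase snake_case, usable as a SQL identifier."""
    # Insert an underscore at camelCase boundaries: "rushYards" -> "rush_Yards"
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name.strip())
    # Collapse every run of non-alphanumeric characters into one underscore
    name = re.sub(r"[^0-9a-zA-Z]+", "_", name)
    name = name.strip("_").lower()
    # Prefix names that start with a digit, which many databases reject
    if name and name[0].isdigit():
        name = f"_{name}"
    # Fall back to a placeholder when nothing usable is left
    return name or "unnamed"


cleaned = [clean_field_name(n) for n in ["Player Name", "rushYards", " 2023-Total % "]]
print(cleaned)  # ['player_name', 'rush_yards', '_2023_total']
```

The rules CanonMap actually applies may differ, so check the generated schema before relying on a particular column name.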

Installation

Lightweight Installation (Core Features Only)

pip install canonmap

Full Installation (Including Embedding Support)

pip install "canonmap[embedding]"

Note: The lightweight installation includes all core features (GCP integration, file processing, schema generation) but excludes embedding functionality. For semantic embeddings, install with the [embedding] extra.

Quick Start

from canonmap import (
    CanonMap,
    ArtifactGenerationRequest,
    EntityField,
    SemanticField,
)

# Initialize CanonMap
cm = CanonMap()

# Configure artifact generation
config = ArtifactGenerationRequest(
    input_path="path/to/your/data.csv",
    output_path="path/to/output",
    source_name="my_source",
    table_name="my_table",
    entity_fields=[
        EntityField(table_name="my_table", field_name="name"),
        EntityField(table_name="my_table", field_name="id"),
    ],
    semantic_fields=[
        SemanticField(table_name="my_table", field_name="description"),
        SemanticField(table_name="my_table", field_name="notes"),
    ],
    generate_schema=True,
    generate_embeddings=True,  # needs the full installation with the [embedding] extra
    generate_semantic_texts=True,
)

# Generate artifacts
results = cm.generate(config)

Artifact Generation Example

from canonmap import (
    CanonMap,
    ArtifactGenerationRequest,
    EntityField,
    SemanticField,
    ArtifactGenerationResponse
)

cm = CanonMap()

gen_req = ArtifactGenerationRequest(
    input_path="input",
    output_path="output",
    source_name="football_data",
    entity_fields=[
        EntityField(table_name="passing", field_name="player"),
        EntityField(table_name="rushing", field_name="rusher_name"),
    ],
    semantic_fields=[
        SemanticField(table_name="passing", field_name="description"),
        SemanticField(table_name="rushing", field_name="notes"),
    ],
    generate_schema=True,
    save_processed_data=True,
    generate_semantic_texts=True
)

resp: ArtifactGenerationResponse = cm.generate(gen_req)

Entity Mapping Example

from canonmap import (
    CanonMap,
    EntityMappingRequest,
    TableFieldFilter,
    EntityMappingResponse
)

cm = CanonMap(artifacts_path="output")

mapping_request = EntityMappingRequest(
    entities=["tim brady", "jake alan"],
    filters=[
        TableFieldFilter(table_name="passing", table_fields=["player"])
    ],
    num_results=3,
)

resp: EntityMappingResponse = cm.map_entities(mapping_request)
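
To give a feel for what mapping a noisy name onto a canonical entity involves, here is a minimal stand-in built on the standard library's difflib. It does not reproduce CanonMap's matching logic: the helper top_matches and the canonical_players list are hypothetical, and plain string similarity is used in place of whatever scoring the library applies.

```python
from difflib import SequenceMatcher


def top_matches(query, canonical, num_results=3):
    """Rank canonical names by string similarity to the query (stand-in scorer)."""
    scored = [
        (name, SequenceMatcher(None, query.lower(), name.lower()).ratio())
        for name in canonical
    ]
    # Highest similarity first, then keep only the requested number of candidates
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:num_results]


# Made-up canonical entity list, standing in for generated artifacts
canonical_players = ["Tom Brady", "Josh Allen", "Jake Elliott", "Tim Boyle"]

for query in ["tim brady", "jake alan"]:
    print(query, "->", top_matches(query, canonical_players))
```

As with num_results=3 in the request above, each query yields a short ranked list of candidates rather than a single answer, which leaves the final choice to the caller.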

Configuration Options

The ArtifactGenerationRequest model provides extensive configuration options:

  • Input/Output:

    • input_path: Path to a data file or directory, or an in-memory DataFrame or dict
    • output_path: Directory for generated artifacts
    • source_name: Logical source name
    • table_name: Logical table name
  • Directory Processing:

    • recursive: Process subdirectories
    • file_pattern: File matching pattern (e.g., "*.csv")
    • table_name_from_file: Use filename as table name
  • Entity Processing:

    • entity_fields: List of fields to treat as entities
    • semantic_fields: List of fields to extract as individual semantic text files
    • use_other_fields_as_metadata: Include non-entity fields as metadata
  • Generation Options:

    • generate_canonical_entities: Generate entity list
    • generate_schema: Generate database schema
    • generate_embeddings: Generate semantic embeddings
    • generate_semantic_texts: Generate semantic text files from semantic_fields
    • save_processed_data: Save cleaned data
    • schema_database_type: Target database type
    • clean_field_names: Standardize field names
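
The directory-processing options can be pictured with a small pathlib sketch. This is an illustration under assumptions, not the library's code: discover_tables is a hypothetical helper whose file_pattern and recursive arguments mirror the options of the same name, and it always derives the table name from the file stem, as table_name_from_file suggests.

```python
import tempfile
from pathlib import Path


def discover_tables(input_dir, file_pattern="*.csv", recursive=False):
    """Map a table name (the file stem) to each matching file under input_dir."""
    root = Path(input_dir)
    # rglob descends into subdirectories; glob only looks at the top level
    paths = root.rglob(file_pattern) if recursive else root.glob(file_pattern)
    # Files sharing a stem would overwrite each other here; a real tool must handle that
    return {p.stem: p for p in sorted(paths) if p.is_file()}


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "passing.csv").write_text("player,description\n")
    (root / "archive").mkdir()
    (root / "archive" / "rushing.csv").write_text("rusher_name,notes\n")
    (root / "readme.txt").write_text("ignored\n")

    top_level = sorted(discover_tables(root))
    with_subdirs = sorted(discover_tables(root, recursive=True))

print(top_level)     # ['passing']
print(with_subdirs)  # ['passing', 'rushing']
```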

Output

The generate() method returns a result (annotated as ArtifactGenerationResponse in the Artifact Generation Example) containing:

  • Generated artifacts
  • Paths to saved files
  • Schema information (if requested)
  • Embeddings (if requested)
  • Processed data (if requested)
  • Semantic text files (if semantic_fields specified)

Semantic Text Files

When semantic_fields is specified, CanonMap creates zip files containing individual text files for each non-null semantic field value:

  • Single table: {source}_{table}_semantic_texts.zip
  • Multiple tables: {source}_semantic_texts.zip (combined)
  • File naming: {table_name}_row_{row_index}_{field_name}.txt
  • Content: Raw text content from the specified semantic fields
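
The naming scheme above can be reproduced with the standard library alone. The sketch below only illustrates the layout: build_semantic_zip is a hypothetical helper, the sample rows are made up, and the 0-based row index is an assumption, since the starting index is not stated here.

```python
import io
import zipfile


def build_semantic_zip(table_name, rows, semantic_fields):
    """Return zip bytes holding one .txt file per non-null semantic field value."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for row_index, row in enumerate(rows):  # assumption: rows are numbered from 0
            for field_name in semantic_fields:
                value = row.get(field_name)
                if value is None:
                    continue  # null values produce no file
                member = f"{table_name}_row_{row_index}_{field_name}.txt"
                archive.writestr(member, str(value))
    return buffer.getvalue()


# Made-up sample data
rows = [
    {"player": "Tom Brady", "description": "Veteran quarterback"},
    {"player": "Josh Allen", "description": None},
]
payload = build_semantic_zip("passing", rows, ["description"])

with zipfile.ZipFile(io.BytesIO(payload)) as archive:
    names = archive.namelist()

print(names)  # ['passing_row_0_description.txt']
```

The second row has a null description, so only one file is written, matching the rule that each non-null value gets its own file.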

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.
