CanonMap - A Python library for entity canonicalization and mapping with enhanced configuration and response models
CanonMap
CanonMap is a Python library for generating and managing canonical entity artifacts from various data sources. It provides a streamlined interface for processing data files and generating standardized artifacts that can be used for entity matching and data integration.
Features
- Flexible Input Support: Process data from:
  - CSV/JSON files
  - Directories of data files
  - Pandas DataFrames
  - Python dictionaries
- Artifact Generation:
  - Generate canonical entity lists
  - Create database schemas (supports multiple database types)
  - Generate semantic embeddings for entities
  - Clean and standardize field names
  - Process metadata fields
- Database Support:
  - DuckDB (default)
  - SQLite
  - BigQuery
  - MariaDB
  - MySQL
  - PostgreSQL
- Enhanced Configuration:
  - Separate configuration for artifacts and embeddings
  - Optional GCP integration with bucket management
  - Flexible sync strategies for cloud storage
  - Comprehensive error handling and logging
  - Local-only mode for development and testing
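As a rough illustration of the "Clean and standardize field names" feature, the sketch below normalizes raw column headers to lowercase snake_case. The helper name `clean_field_name` and its exact rules are assumptions made for this example; CanonMap's own cleaning rules may differ.

```python
import re


def clean_field_name(name: str) -> str:
    """Hypothetical helper: normalize a raw column header to snake_case."""
    # Collapse every run of non-alphanumeric characters into one underscore.
    cleaned = re.sub(r"[^0-9A-Za-z]+", "_", name.strip())
    cleaned = cleaned.strip("_").lower()
    if not cleaned:
        # Assumption: fall back to a placeholder when nothing usable remains.
        return "field"
    if cleaned[0].isdigit():
        # Many SQL dialects reject identifiers that start with a digit.
        cleaned = f"_{cleaned}"
    return cleaned


if __name__ == "__main__":
    headers = ["Player Name", " Rush Yds (Total) ", "2023 Yards"]
    print([clean_field_name(h) for h in headers])
```

Keeping the rule set this small makes the mapping from raw header to cleaned name easy to predict, which matters when the cleaned names end up in a generated schema.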
Installation
Lightweight Installation (Core Features Only)
pip install canonmap
Full Installation (Including Embedding Support)
pip install "canonmap[embedding]"
Note: The lightweight installation includes all core features (GCP integration, file processing, schema generation) but excludes embedding functionality. If you need semantic embeddings, use the full installation with [embedding] extras.
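If code has to work with either installation, one option is to check for the optional dependency before requesting embeddings. The sketch below uses only the standard library. The module name `sentence_transformers` is an assumption inferred from the model names used in this README; the package list of the [embedding] extra is not documented here.

```python
import importlib.util


def has_module(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        return False


# Assumption: the [embedding] extra installs sentence-transformers.
EMBEDDINGS_AVAILABLE = has_module("sentence_transformers")

if __name__ == "__main__":
    print(f"Embedding support available: {EMBEDDINGS_AVAILABLE}")
```

A flag like `EMBEDDINGS_AVAILABLE` could then decide whether a request sets generate_embeddings=True.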
Quick Start
Local-Only Mode (Recommended for Development)
from canonmap import (
    CanonMap,
    CanonMapArtifactsConfig,
    CanonMapEmbeddingConfig,
    ArtifactGenerationRequest,
    EntityField,
    SemanticField
)

# Simple local-only configuration
artifacts_config = CanonMapArtifactsConfig(
    artifacts_local_path="./artifacts",
    gcs_config=None  # No GCS integration
)
embedding_config = CanonMapEmbeddingConfig(
    embedding_model_hf_name="sentence-transformers/all-MiniLM-L12-v2",
    embedding_model_local_path="./models/sentence-transformers/all-MiniLM-L12-v2",
    gcs_config=None  # No GCS integration
)

# Initialize CanonMap
canonmap = CanonMap(
    artifacts_config=artifacts_config,
    embedding_config=embedding_config,
    verbose=True
)

# Configure artifact generation
request = ArtifactGenerationRequest(
    input_path="path/to/your/data.csv",
    source_name="my_source",
    table_name="my_table",
    entity_fields=[
        EntityField(table_name="my_table", field_name="name"),
        EntityField(table_name="my_table", field_name="id")
    ],
    semantic_fields=[
        SemanticField(table_name="my_table", field_name="description"),
        SemanticField(table_name="my_table", field_name="notes")
    ],
    generate_schema=True,
    generate_embeddings=True,
    generate_semantic_texts=True
)

# Generate artifacts
response = canonmap.generate_artifacts(request)
print(f"Generated {len(response.generated_artifacts)} artifacts")
With GCP Integration
from canonmap import (
    CanonMap,
    CanonMapGCPConfig,
    CanonMapCustomGCSConfig,
    CanonMapArtifactsConfig,
    CanonMapEmbeddingConfig,
    ArtifactGenerationRequest,
    EntityField,
    SemanticField
)

# 1. Set up base GCP configuration
base_gcp = CanonMapGCPConfig(
    gcp_service_account_json_path="path/to/service_account.json",
    troubleshooting=False
)

# 2. Configure GCS for artifacts and embeddings
artifacts_gcs = CanonMapCustomGCSConfig(
    gcp_config=base_gcp,
    bucket_name="your-artifacts-bucket",
    bucket_prefix="artifacts/",
    auto_create_bucket=True,
    sync_strategy="refresh"
)
embedding_gcs = CanonMapCustomGCSConfig(
    gcp_config=base_gcp,
    bucket_name="your-models-bucket",
    bucket_prefix="models/",
    auto_create_bucket=True,
    sync_strategy="refresh"
)

# 3. Create application-specific configs
artifacts_config = CanonMapArtifactsConfig(
    artifacts_local_path="./artifacts",
    gcs_config=artifacts_gcs
)
embedding_config = CanonMapEmbeddingConfig(
    embedding_model_hf_name="sentence-transformers/all-MiniLM-L12-v2",
    embedding_model_local_path="./models/sentence-transformers/all-MiniLM-L12-v2",
    gcs_config=embedding_gcs
)

# 4. Initialize CanonMap
canonmap = CanonMap(
    artifacts_config=artifacts_config,
    embedding_config=embedding_config,
    verbose=True,
    api_mode=False
)

# 5. Configure artifact generation
request = ArtifactGenerationRequest(
    input_path="path/to/your/data.csv",
    source_name="my_source",
    table_name="my_table",
    entity_fields=[
        EntityField(table_name="my_table", field_name="name"),
        EntityField(table_name="my_table", field_name="id")
    ],
    semantic_fields=[
        SemanticField(table_name="my_table", field_name="description"),
        SemanticField(table_name="my_table", field_name="notes")
    ],
    generate_schema=True,
    generate_embeddings=True,
    generate_semantic_texts=True
)

# 6. Generate artifacts
response = canonmap.generate_artifacts(request)
print(f"Generated {len(response.generated_artifacts)} artifacts")
Artifact Generation Example
from canonmap import (
    CanonMap,
    CanonMapArtifactsConfig,
    CanonMapEmbeddingConfig,
    ArtifactGenerationRequest,
    EntityField,
    SemanticField,
    ArtifactGenerationResponse
)

# Set up configurations (local-only for this example)
artifacts_config = CanonMapArtifactsConfig(
    artifacts_local_path="./artifacts",
    gcs_config=None  # Local-only mode
)
embedding_config = CanonMapEmbeddingConfig(
    embedding_model_hf_name="sentence-transformers/all-MiniLM-L12-v2",
    embedding_model_local_path="./models",
    gcs_config=None  # Local-only mode
)

# Initialize CanonMap
cm = CanonMap(
    artifacts_config=artifacts_config,
    embedding_config=embedding_config,
    verbose=True
)

# Create generation request
gen_req = ArtifactGenerationRequest(
    input_path="input",
    source_name="football_data",
    entity_fields=[
        EntityField(table_name="passing", field_name="player"),
        EntityField(table_name="rushing", field_name="rusher_name"),
    ],
    semantic_fields=[
        SemanticField(table_name="passing", field_name="description"),
        SemanticField(table_name="rushing", field_name="notes"),
    ],
    generate_schema=True,
    save_processed_data=True,
    generate_semantic_texts=True
)

# Generate artifacts
resp: ArtifactGenerationResponse = cm.generate_artifacts(gen_req)

# Access response details
print(f"Status: {resp.status}")
print(f"Generated {len(resp.generated_artifacts)} artifacts")
print(f"Processing time: {resp.processing_stats.processing_time_seconds:.2f} seconds")
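The example above reads from an "input" directory and refers to tables named passing and rushing. To try it without real data, a small helper can write matching CSV files. This is a sketch under stated assumptions: the helper `write_sample_inputs`, the file names, and the placeholder rows are made up for illustration, and it assumes table names are taken from file names (compare the table_name_from_file option).

```python
import csv
from pathlib import Path

# Placeholder rows; column names mirror the entity and semantic fields above.
SAMPLE_TABLES = {
    "passing": [
        {"player": "Example Passer", "description": "Sample passing note."},
    ],
    "rushing": [
        {"rusher_name": "Example Rusher", "notes": "Sample rushing note."},
    ],
}


def write_sample_inputs(input_dir: str = "input") -> list[Path]:
    """Hypothetical helper: write one CSV file per sample table."""
    out_dir = Path(input_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for table_name, rows in SAMPLE_TABLES.items():
        path = out_dir / f"{table_name}.csv"
        with path.open("w", newline="", encoding="utf-8") as fh:
            writer = csv.DictWriter(fh, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)
        written.append(path)
    return written


if __name__ == "__main__":
    for path in write_sample_inputs("input"):
        print(f"Wrote {path}")
```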
Entity Mapping Example
from canonmap import (
    CanonMap,
    EntityMappingRequest,
    TableFieldFilter,
    EntityMappingResponse
)

# Initialize CanonMap (reusing configs from above)
cm = CanonMap(
    artifacts_config=artifacts_config,
    embedding_config=embedding_config
)

# Create mapping request
mapping_request = EntityMappingRequest(
    entities=["tim brady", "jake alan"],
    filters=[
        TableFieldFilter(table_name="passing", table_fields=["player"])
    ],
    num_results=3,
)

# Map entities
resp: EntityMappingResponse = cm.map_entities(mapping_request)

# Access mapping results
print(f"Processed {resp.total_entities_processed} entities")
print(f"Found {resp.total_matches_found} matches")
for mapping in resp.mappings:
    print(f"\nEntity: {mapping.entity}")
    for match in mapping.matches:
        print(f"  Match: {match.matched_entity} (Score: {match.score:.3f})")
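To build intuition for what a ranked list of scored matches looks like, the standalone sketch below ranks candidate names against a noisy query using `difflib.SequenceMatcher` from the standard library. This is a swapped-in technique for illustration only: this README does not describe how CanonMap computes its scores, and the function `rank_candidates` and the candidate list are hypothetical.

```python
from difflib import SequenceMatcher


def rank_candidates(query, candidates, num_results=3):
    """Return up to num_results (candidate, score) pairs, best match first."""
    q = query.casefold()
    scored = [
        (candidate, SequenceMatcher(None, q, candidate.casefold()).ratio())
        for candidate in candidates
    ]
    # Higher ratio means more similar; ratio() is always between 0 and 1.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:num_results]


# Made-up canonical list, used only to exercise the function.
CANONICAL_PLAYERS = ["Tom Brady", "Jake Allen", "Sam Carter", "Lee Jordan"]

if __name__ == "__main__":
    for entity in ["tim brady", "jake alan"]:
        print(f"\nEntity: {entity}")
        for name, score in rank_candidates(entity, CANONICAL_PLAYERS):
            print(f"  Match: {name} (Score: {score:.3f})")
```

Character-level similarity like this handles typos but not synonyms or nicknames, which is one reason a library might combine it with semantic embeddings.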
Configuration Options
CanonMapGCPConfig
Base GCP configuration with service account and troubleshooting settings:
- gcp_service_account_json_path: Path to GCP service account JSON file
- troubleshooting: Enable detailed logging and validation
CanonMapCustomGCSConfig
Bucket-specific configuration extending the base GCP config:
- gcp_config: Base GCP configuration
- bucket_name: GCS bucket name
- bucket_prefix: Optional prefix for bucket operations
- auto_create_bucket: Automatically create bucket if it doesn't exist
- auto_create_bucket_prefix: Automatically create prefix directory
- sync_strategy: Sync strategy ("none", "missing", "overwrite", "refresh")
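Because sync_strategy is a plain string, a typo in application config can be caught early by validating against the four documented values. The helper below is a hypothetical sketch, not part of CanonMap. It only checks spelling and makes no claim about what each strategy does.

```python
# The four values documented for sync_strategy.
VALID_SYNC_STRATEGIES = ("none", "missing", "overwrite", "refresh")


def validate_sync_strategy(value: str) -> str:
    """Hypothetical helper: normalize and validate a sync_strategy string."""
    normalized = value.strip().lower()
    if normalized not in VALID_SYNC_STRATEGIES:
        allowed = ", ".join(VALID_SYNC_STRATEGIES)
        raise ValueError(
            f"Unknown sync_strategy {value!r}; expected one of: {allowed}"
        )
    return normalized


if __name__ == "__main__":
    print(validate_sync_strategy(" Refresh "))
```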
CanonMapArtifactsConfig
Configuration for artifact storage and management:
- artifacts_local_path: Local directory for artifacts
- gcs_config: Optional GCS configuration for artifact storage
- troubleshooting: Enable troubleshooting mode
CanonMapEmbeddingConfig
Configuration for embedding model management:
- embedding_model_hf_name: HuggingFace model name
- embedding_model_local_path: Local path for model storage
- gcs_config: Optional GCS configuration for model storage
- troubleshooting: Enable troubleshooting mode
ArtifactGenerationRequest
Comprehensive configuration for artifact generation:
- Input/Output:
  - input_path: Path to data file/directory or DataFrame/dict
  - source_name: Logical source name
  - table_name: Logical table name
- Directory Processing:
  - recursive: Process subdirectories
  - file_pattern: File matching pattern (e.g., "*.csv")
  - table_name_from_file: Use filename as table name
- Entity Processing:
  - entity_fields: List of fields to treat as entities
  - semantic_fields: List of fields to extract as individual semantic text files
  - use_other_fields_as_metadata: Include non-entity fields as metadata
- Generation Options:
  - generate_canonical_entities: Generate entity list
  - generate_schema: Generate database schema
  - generate_embeddings: Generate semantic embeddings
  - generate_semantic_texts: Generate semantic text files from semantic_fields
  - save_processed_data: Save cleaned data
  - schema_database_type: Target database type
  - clean_field_names: Standardize field names
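The directory-processing options (recursive, file_pattern, table_name_from_file) can be pictured with a short `pathlib` sketch that finds matching files and derives a table name from each file stem. The function `discover_tables` is hypothetical and only approximates the documented behaviour; it is not CanonMap's implementation.

```python
from pathlib import Path


def discover_tables(input_dir, file_pattern="*.csv", recursive=False):
    """Map table name (file stem) to file path for every matching file."""
    root = Path(input_dir)
    # rglob descends into subdirectories; glob stays at the top level.
    matches = root.rglob(file_pattern) if recursive else root.glob(file_pattern)
    # Assumption: if two files share a stem, the later path wins.
    return {path.stem: path for path in sorted(matches) if path.is_file()}


if __name__ == "__main__":
    for table_name, path in discover_tables("input", recursive=True).items():
        print(f"{table_name}: {path}")
```

Sorting the matches keeps the result deterministic across platforms, since directory iteration order is not guaranteed.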
Response Models
ArtifactGenerationResponse
Comprehensive response containing:
- status: Success/failure status
- message: Human-readable message
- generated_artifacts: List of generated artifacts with metadata
- processing_stats: Detailed processing statistics
- errors: List of errors encountered
- warnings: List of warnings
- gcp_upload_info: GCP upload details
- Convenience paths for common artifacts
EntityMappingResponse
Detailed mapping results including:
- status: Success/failure status
- mappings: List of entity mappings with matches
- total_entities_processed: Number of entities processed
- total_matches_found: Total number of matches found
- processing_stats: Performance metrics
- configuration_summary: Request configuration summary
- errors: List of errors encountered
- warnings: List of warnings
API Mode
For API deployments, initialize CanonMap with api_mode=True:
canonmap = CanonMap(
    artifacts_config=artifacts_config,
    embedding_config=embedding_config,
    verbose=True,
    api_mode=True  # Enables API-specific optimizations
)
Output
The generate_artifacts() method returns an ArtifactGenerationResponse containing:
- Generated artifacts with metadata
- Processing statistics and timing information
- Error and warning information
- GCP upload details (if applicable)
- Convenience paths to common artifacts
Semantic Text Files
When semantic_fields is specified, CanonMap creates zip files containing individual text files for each non-null semantic field value:
- Single table: {source}_{table}_semantic_texts.zip
- Multiple tables: {source}_semantic_texts.zip (combined)
- File naming: {table_name}_row_{row_index}_{field_name}.txt
- Content: Raw text content from the specified semantic fields
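The naming scheme above can be reproduced with the standard `zipfile` module. The sketch below handles the single-table case and skips null values, as described. The function `write_semantic_texts` is hypothetical, and a 0-based row index is an assumption, since the README does not say where numbering starts.

```python
import zipfile
from pathlib import Path


def write_semantic_texts(rows, source, table_name, semantic_fields, out_dir="."):
    """Write one text file per non-null semantic field value into a zip."""
    zip_path = Path(out_dir) / f"{source}_{table_name}_semantic_texts.zip"
    with zipfile.ZipFile(zip_path, "w") as zf:
        # Assumption: row_index is 0-based.
        for row_index, row in enumerate(rows):
            for field_name in semantic_fields:
                value = row.get(field_name)
                if value is None:
                    continue  # only non-null values produce a file
                member = f"{table_name}_row_{row_index}_{field_name}.txt"
                zf.writestr(member, str(value))
    return zip_path


if __name__ == "__main__":
    sample_rows = [
        {"description": "First sample text.", "notes": None},
        {"description": None, "notes": "Second sample text."},
    ]
    path = write_semantic_texts(
        sample_rows, "football_data", "passing", ["description", "notes"]
    )
    print(f"Wrote {path}")
```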
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
This project is licensed under the MIT License - see the LICENSE file for details.