CanonMap - A Python library for entity canonicalization and mapping
CanonMap
CanonMap is a Python library for generating and managing canonical entity artifacts from various data sources. It provides a streamlined interface for processing data files and generating standardized artifacts that can be used for entity matching and data integration.
Features
- Flexible Input Support: Process data from:
  - CSV/JSON files
  - Directories of data files
  - Pandas DataFrames
  - Python dictionaries
- Artifact Generation:
  - Generate canonical entity lists
  - Create database schemas (supports multiple database types)
  - Generate semantic embeddings for entities
  - Clean and standardize field names
  - Process metadata fields
- Database Support:
  - DuckDB (default)
  - SQLite
  - BigQuery
  - MariaDB
  - MySQL
  - PostgreSQL
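To make "clean and standardize field names" concrete, here is a minimal sketch of one common normalization (lowercase snake_case). The exact rules CanonMap applies are not documented here, so the function `clean_field_name` and its rules are assumptions for illustration, not the library's implementation.

```python
import re


def clean_field_name(name: str) -> str:
    """Normalize a column header to lowercase snake_case (illustrative rules only)."""
    # Insert an underscore at camelCase boundaries: "rusherName" -> "rusher_Name"
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name.strip())
    # Collapse every run of non-alphanumeric characters into one underscore
    name = re.sub(r"[^0-9A-Za-z]+", "_", name)
    # Drop leading/trailing underscores and lowercase the result
    return name.strip("_").lower()


headers = ["Player Name", "rusherName", " Yards/Game ", "TD%"]
cleaned = [clean_field_name(h) for h in headers]
print(cleaned)
```

Names cleaned this way are safe to use as column identifiers in most of the databases listed above, which is the usual reason for standardizing them before schema generation.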
Installation
Lightweight Installation (Core Features Only)
pip install canonmap
Full Installation (Including Embedding Support)
pip install canonmap[embedding]
Note: The lightweight installation includes all core features (GCP integration, file processing, schema generation) but excludes embedding functionality. If you need semantic embeddings, use the full installation with [embedding] extras.
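If you are unsure which installation you have, the standard library can list the requirements a distribution declares for a given extra. This is a generic sketch using `importlib.metadata`; the helper names are hypothetical, and it assumes the usual `extra == "..."` marker format in the package metadata. It reports what the extra declares, not whether those packages are actually installed.

```python
from importlib import metadata


def requirements_for_extra(requirements, extra):
    """Return the requirement strings whose marker mentions the given extra."""
    wanted = (f'extra == "{extra}"', f"extra == '{extra}'")
    return [r for r in (requirements or []) if any(w in r for w in wanted)]


def declared_extra_requirements(dist_name, extra):
    """Look up an installed distribution's declared requirements for one extra."""
    try:
        reqs = metadata.requires(dist_name)  # None when nothing is declared
    except metadata.PackageNotFoundError:
        return None  # the distribution is not installed at all
    return requirements_for_extra(reqs, extra)


# Prints None if canonmap is not installed, otherwise a list of requirement strings
print(declared_extra_requirements("canonmap", "embedding"))
```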
Quick Start
from canonmap import (
    CanonMap,
    ArtifactGenerationRequest,
    EntityField,
    SemanticField,
)

# Initialize CanonMap
canonmap = CanonMap()

# Configure artifact generation
config = ArtifactGenerationRequest(
    input_path="path/to/your/data.csv",
    output_path="path/to/output",
    source_name="my_source",
    table_name="my_table",
    entity_fields=[
        EntityField(table_name="my_table", field_name="name"),
        EntityField(table_name="my_table", field_name="id"),
    ],
    semantic_fields=[
        SemanticField(table_name="my_table", field_name="description"),
        SemanticField(table_name="my_table", field_name="notes"),
    ],
    generate_schema=True,
    generate_embeddings=True,  # requires the canonmap[embedding] installation
    generate_semantic_texts=True,
)

# Generate artifacts
results = canonmap.generate(config)
Artifact Generation Example
from canonmap import (
    CanonMap,
    ArtifactGenerationRequest,
    EntityField,
    SemanticField,
    ArtifactGenerationResponse,
)

cm = CanonMap()

gen_req = ArtifactGenerationRequest(
    input_path="input",
    output_path="output",
    source_name="football_data",
    entity_fields=[
        EntityField(table_name="passing", field_name="player"),
        EntityField(table_name="rushing", field_name="rusher_name"),
    ],
    semantic_fields=[
        SemanticField(table_name="passing", field_name="description"),
        SemanticField(table_name="rushing", field_name="notes"),
    ],
    generate_schema=True,
    save_processed_data=True,
    generate_semantic_texts=True,
)

resp: ArtifactGenerationResponse = cm.generate(gen_req)
Entity Mapping Example
from canonmap import (
    CanonMap,
    EntityMappingRequest,
    TableFieldFilter,
    EntityMappingResponse,
)

cm = CanonMap(artifacts_path="output")

mapping_request = EntityMappingRequest(
    entities=["tim brady", "jake alan"],
    filters=[
        TableFieldFilter(table_name="passing", table_fields=["player"])
    ],
    num_results=3,
)

resp: EntityMappingResponse = cm.map_entities(mapping_request)
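For intuition about what mapping a noisy name such as "tim brady" onto a canonical entity involves, the sketch below uses plain string similarity from the standard library's `difflib`. It is a stand-in technique, not CanonMap's matching logic (which this page does not describe); the canonical list and the standalone `map_entities` function are hypothetical.

```python
from difflib import get_close_matches

# Hypothetical canonical names; in practice these would come from generated artifacts
canonical_players = ["tom brady", "jake allen", "josh allen", "tim tebow"]


def map_entities(queries, candidates, num_results=3, cutoff=0.5):
    """Return up to num_results closest candidates for each query, best first."""
    return {
        q: get_close_matches(q.lower(), candidates, n=num_results, cutoff=cutoff)
        for q in queries
    }


matches = map_entities(["tim brady", "jake alan"], canonical_players)
for query, ranked in matches.items():
    print(query, "->", ranked)
```

Character-level similarity handles typos well but knows nothing about meaning, which is one reason a library might combine it with semantic embeddings.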
Configuration Options
The ArtifactGenerationRequest model provides extensive configuration options:
- Input/Output:
  - input_path: Path to data file/directory or DataFrame/dict
  - output_path: Directory for generated artifacts
  - source_name: Logical source name
  - table_name: Logical table name
- Directory Processing:
  - recursive: Process subdirectories
  - file_pattern: File matching pattern (e.g., "*.csv")
  - table_name_from_file: Use filename as table name
- Entity Processing:
  - entity_fields: List of fields to treat as entities
  - semantic_fields: List of fields to extract as individual semantic text files
  - use_other_fields_as_metadata: Include non-entity fields as metadata
- Generation Options:
  - generate_canonical_entities: Generate entity list
  - generate_schema: Generate database schema
  - generate_embeddings: Generate semantic embeddings
  - generate_semantic_texts: Generate semantic text files from semantic_fields
  - save_processed_data: Save cleaned data
  - schema_database_type: Target database type
  - clean_field_names: Standardize field names
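The directory-processing options can be pictured with a short `pathlib` sketch: a glob pattern selects files, a flag decides whether subdirectories are searched, and each file's stem becomes its table name. The function `discover_tables` is hypothetical and only illustrates the idea; it is not how CanonMap is known to implement these options.

```python
import tempfile
from pathlib import Path


def discover_tables(input_dir, file_pattern="*.csv", recursive=False):
    """Map a table name (the file stem) to each data file matching the pattern."""
    root = Path(input_dir)
    # rglob descends into subdirectories; glob stays at the top level
    files = root.rglob(file_pattern) if recursive else root.glob(file_pattern)
    return {path.stem: path for path in sorted(files) if path.is_file()}


# Build a throwaway directory tree to demonstrate both modes
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "passing.csv").write_text("player,description\n")
    (root / "archive").mkdir()
    (root / "archive" / "rushing.csv").write_text("rusher_name,notes\n")

    top_level = sorted(discover_tables(root))
    everything = sorted(discover_tables(root, recursive=True))

print(top_level)
print(everything)
```

In this sketch, two files with the same stem in different subdirectories would collide on one table name, so a real implementation needs a rule for that case.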
Output
The generate() method returns a dictionary containing:
- Generated artifacts
- Paths to saved files
- Schema information (if requested)
- Embeddings (if requested)
- Processed data (if requested)
- Semantic text files (if semantic_fields is specified)
Semantic Text Files
When semantic_fields is specified, CanonMap creates zip files containing individual text files for each non-null semantic field value:
- Single table: {source}_{table}_semantic_texts.zip
- Multiple tables: {source}_semantic_texts.zip (combined)
- File naming: {table_name}_row_{row_index}_{field_name}.txt
- Content: Raw text content from the specified semantic fields
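The naming scheme above can be reproduced with the standard library to show what such an archive would contain. This is a sketch under assumptions: `build_semantic_zip` is a hypothetical helper, rows are plain dictionaries, and row indices are taken to start at 0, which the description does not specify.

```python
import io
import zipfile


def build_semantic_zip(table_name, rows, semantic_fields):
    """Write one .txt member per non-null semantic value into an in-memory zip."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for row_index, row in enumerate(rows):  # assumption: zero-based row index
            for field_name in semantic_fields:
                value = row.get(field_name)
                if value is None:
                    continue  # null values produce no file
                member = f"{table_name}_row_{row_index}_{field_name}.txt"
                archive.writestr(member, str(value))
    return buffer.getvalue()


rows = [
    {"player": "tom brady", "description": "Veteran quarterback", "notes": None},
    {"player": "jake allen", "description": None, "notes": "Rookie season"},
]
payload = build_semantic_zip("passing", rows, ["description", "notes"])
names = zipfile.ZipFile(io.BytesIO(payload)).namelist()
print(names)
```

Only two members are written for the four field slots, because the two null values are skipped.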
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
This project is licensed under the MIT License - see the LICENSE file for details.
File details
Details for the file canonmap-0.1.193.tar.gz.
File metadata
- Download URL: canonmap-0.1.193.tar.gz
- Upload date:
- Size: 59.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 25cb866972b5b65106a402442c77546dfc7181cdf01bcfbceeabd5c242d1f566 |
| MD5 | 430aa31e5c9a4efe5632874861ebd2af |
| BLAKE2b-256 | eebcc94a234ac2ed024f9a1f7815e4b1a01cd7b3076bef322e6a848116ce0dcf |
File details
Details for the file canonmap-0.1.193-py3-none-any.whl.
File metadata
- Download URL: canonmap-0.1.193-py3-none-any.whl
- Upload date:
- Size: 72.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 94e3fccfa86474ac440e5af859611b8698a00db636af321946a3d1dcf6a38435 |
| MD5 | a256f6424ec87563385ab6631ccec648 |
| BLAKE2b-256 | 4930102db7e4b22baa225678c4aaa4bb4084225b8df647240ba19a2c60208852 |