CanonMap - A Python library for entity canonicalization and mapping
CanonMap
CanonMap is a Python library for generating and managing canonical entity artifacts from files, directories, and in-memory data. It processes the input through a single request object and produces standardized artifacts for entity matching and data integration.
Features
- Flexible Input Support: Process data from:
  - CSV/JSON files
  - Directories of data files
  - Pandas DataFrames
  - Python dictionaries
- Artifact Generation:
  - Generate canonical entity lists
  - Create database schemas (supports multiple database types)
  - Generate semantic embeddings for entities
  - Clean and standardize field names
  - Process metadata fields
- Database Support:
  - DuckDB (default)
  - SQLite
  - BigQuery
  - MariaDB
  - MySQL
  - PostgreSQL
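To make "clean and standardize field names" concrete, here is a minimal sketch of what such a normalization step could look like. It is not CanonMap's implementation: the `clean_field_name` function and its rules (snake_case, no leading digit) are assumptions chosen for illustration.

```python
import re

def clean_field_name(name: str) -> str:
    """Hypothetical normalizer: lowercase snake_case, usable as a column name."""
    # Split camelCase boundaries: "lastName" -> "last_Name"
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name.strip())
    # Collapse every run of non-alphanumeric characters into one underscore
    name = re.sub(r"[^0-9a-zA-Z]+", "_", name)
    name = name.strip("_").lower()
    # Many databases reject identifiers that start with a digit
    if name and name[0].isdigit():
        name = f"col_{name}"
    return name or "unnamed"

raw_columns = ["First Name", "lastName", " Phone # ", "2024 Revenue"]
cleaned = [clean_field_name(c) for c in raw_columns]
print(cleaned)
```

The library's actual rules may differ, so compare against the field names in the generated artifacts before relying on a particular spelling.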
Installation
pip install canonmap
Quick Start
from canonmap import CanonMap
from canonmap.models.artifact_generation_request import ArtifactGenerationRequest, EntityField

# Initialize CanonMap
canonmap = CanonMap()

# Configure artifact generation
config = ArtifactGenerationRequest(
    input_path="path/to/your/data.csv",
    output_path="path/to/output",
    source_name="my_source",
    table_name="my_table",
    entity_fields=[
        EntityField(table_name="my_table", field_name="name"),
        EntityField(table_name="my_table", field_name="id"),
    ],
    semantic_fields=["description", "notes"],
    generate_schema=True,
    generate_embeddings=True,
)

# Generate artifacts
results = canonmap.generate(config)
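The Quick Start refers to the columns `name`, `id`, `description`, and `notes`. To try it without real data, a small CSV with those columns can be written using only the standard library. The rows and the temporary directory below are made up for illustration; only the column names come from the example above.

```python
import csv
import tempfile
from pathlib import Path

# Invented sample rows; only the column names match the Quick Start config
rows = [
    {"id": "1", "name": "Acme Corp", "description": "Industrial supplier", "notes": "Key account"},
    {"id": "2", "name": "Globex", "description": "Energy company", "notes": ""},
]

data_dir = Path(tempfile.mkdtemp())
csv_path = data_dir / "my_table.csv"

with csv_path.open("w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "name", "description", "notes"])
    writer.writeheader()
    writer.writerows(rows)

# Read the file back to confirm it has the expected shape
with csv_path.open(newline="", encoding="utf-8") as f:
    loaded = list(csv.DictReader(f))

print(csv_path, len(loaded))
```

The resulting `csv_path` can then be passed as `input_path` in place of `"path/to/your/data.csv"`.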
Configuration Options
The ArtifactGenerationRequest model provides extensive configuration options:
- Input/Output:
  - input_path: Path to data file/directory or DataFrame/dict
  - output_path: Directory for generated artifacts
  - source_name: Logical source name
  - table_name: Logical table name
- Directory Processing:
  - recursive: Process subdirectories
  - file_pattern: File matching pattern (e.g., "*.csv")
  - table_name_from_file: Use filename as table name
- Entity Processing:
  - entity_fields: List of fields to treat as entities
  - semantic_fields: Fields for semantic embeddings
  - use_other_fields_as_metadata: Include non-entity fields as metadata
- Generation Options:
  - generate_canonical_entities: Generate entity list
  - generate_schema: Generate database schema
  - generate_embeddings: Generate semantic embeddings
  - save_processed_data: Save cleaned data
  - schema_database_type: Target database type
  - clean_field_names: Standardize field names
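The directory-processing options above (recursive, file_pattern, table_name_from_file) can be pictured with a short standard-library sketch. The `discover_files` helper is hypothetical and only illustrates the likely semantics of those options; it does not show how CanonMap walks directories internally.

```python
import tempfile
from pathlib import Path

def discover_files(root, file_pattern="*.csv", recursive=False):
    """Hypothetical helper: return sorted (table_name, path) pairs for matching files."""
    root = Path(root)
    # rglob descends into subdirectories; glob only looks at the top level
    matches = root.rglob(file_pattern) if recursive else root.glob(file_pattern)
    # Path.stem drops the extension, so "customers.csv" -> "customers"
    return sorted((p.stem, p) for p in matches if p.is_file())

# Build a throwaway directory tree to demonstrate the behavior
root = Path(tempfile.mkdtemp())
(root / "archive").mkdir()
for rel in ["customers.csv", "orders.csv", "readme.txt", "archive/old_orders.csv"]:
    (root / rel).write_text("id,name\n", encoding="utf-8")

top_level = [name for name, _ in discover_files(root)]
with_subdirs = [name for name, _ in discover_files(root, recursive=True)]
print(top_level)
print(with_subdirs)
```

Here the file stem stands in for the table name, which is one plausible reading of table_name_from_file.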
Output
The generate() method returns a dictionary containing:
- Generated artifacts
- Paths to saved files
- Schema information (if requested)
- Embeddings (if requested)
- Processed data (if requested)
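Because the exact keys of the returned dictionary are not listed here, a generic way to explore it is to print each key with the type and size of its value. The sketch below uses a stand-in dictionary with invented keys; replace it with the value returned by generate().

```python
def summarize(results: dict) -> list:
    """Describe each entry of a results dictionary without assuming its keys."""
    lines = []
    for key, value in results.items():
        size = len(value) if hasattr(value, "__len__") else "n/a"
        lines.append(f"{key}: {type(value).__name__} (len={size})")
    return lines

# Stand-in for the dictionary returned by generate(); these keys are invented
results = {
    "example_paths": {"schema": "path/to/output/schema.sql"},
    "example_flag": True,
}

summary = summarize(results)
for line in summary:
    print(line)
```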
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
This project is licensed under the MIT License - see the LICENSE file for details.
File details
Details for the file canonmap-0.1.91.tar.gz.
File metadata
- Download URL: canonmap-0.1.91.tar.gz
- Upload date:
- Size: 36.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8dbf0c2c5f7fc872d21a6e60f5b82d8956be1317963862259e7635445ea866fe |
| MD5 | c36b670d03ee20813b0d5e11f153adf1 |
| BLAKE2b-256 | c5c8fd6f35eeeace7398124aafdf4c080782e1a7eb1b29cb9609da214ee3dd84 |
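A published SHA256 digest can be used to check a downloaded file before installing it. The following is a minimal sketch using Python's standard `hashlib`; the helper names `sha256_of` and `matches_published` are illustrative, and the demo runs on a throwaway file rather than the real archive.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_published(path, expected_hex):
    """Compare a file's digest with a published hex string (case-insensitive)."""
    return sha256_of(path) == expected_hex.strip().lower()

# Demo on a temporary file so the example runs without a download
demo = Path(tempfile.mkdtemp()) / "demo.bin"
demo.write_bytes(b"hello")
demo_ok = matches_published(demo, hashlib.sha256(b"hello").hexdigest())
print(demo_ok)
```

For a real check, pass the path of the downloaded canonmap-0.1.91.tar.gz and the SHA256 value listed for that file.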
File details
Details for the file canonmap-0.1.91-py3-none-any.whl.
File metadata
- Download URL: canonmap-0.1.91-py3-none-any.whl
- Upload date:
- Size: 42.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 848c1202cc4cef2288f689f2cf29d9aece11d5a0c9593fd924a3c774092b22c5 |
| MD5 | 0e825c24cee22d29ba90f966e282c945 |
| BLAKE2b-256 | a0ad146c9ed9d672c2d61b68f728bde99ec2939399733c538d01f9f856e38e63 |