CanonMap - A Python library for entity canonicalization and mapping
CanonMap
A Python library for data mapping and canonicalization.
Installation
pip install canonmap
Dependencies
CanonMap requires the following key dependencies:
- Python 3.8 or higher
- spaCy and its English language model
The spaCy English language model (en_core_web_sm) is downloaded automatically the first time you use the library, so no manual installation is required.
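If you want to know in advance whether the model is already present (for example in an offline or CI environment), a check along these lines may help. It is only a sketch: it relies on spaCy pipeline packages being installed as ordinary Python packages, and the helper name `model_is_installed` is hypothetical, not part of CanonMap.

```python
from importlib.util import find_spec


def model_is_installed(model_name: str = "en_core_web_sm") -> bool:
    """Return True if the named spaCy pipeline package can be imported."""
    # spaCy models are distributed as regular Python packages, so a
    # successful spec lookup means the model is available locally.
    return find_spec(model_name) is not None


if __name__ == "__main__":
    status = "installed" if model_is_installed() else "not installed yet"
    print(f"en_core_web_sm is {status}")
```

If the model is missing and you prefer to fetch it yourself, spaCy's own command `python -m spacy download en_core_web_sm` installs it.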
Quick Start
from canonmap import CanonMap

# Initialize the library
canon = CanonMap()

# Generate artifacts from a CSV file
artifacts = canon.generate_artifacts(
    csv_path="path/to/your/data.csv",
    entity_fields=["name", "email"],
    use_other_fields_as_metadata=True
)

# Save artifacts to files
zip_path = canon.save_artifacts(
    artifacts=artifacts,
    output_path="output",
    name="my_data"
)
print(f"Artifacts saved to: {zip_path}")
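To try the Quick Start without real data, you could first create a small CSV that has the `name` and `email` columns used above. The sketch below uses only the standard library; the helper `write_sample_csv`, the extra `city` column, and the sample rows are made up for illustration.

```python
import csv
from pathlib import Path


def write_sample_csv(path: str) -> Path:
    """Write a tiny CSV with 'name' and 'email' columns plus one extra column."""
    # Hypothetical sample rows; 'city' stands in for a non-entity column
    # that could be picked up as metadata.
    rows = [
        {"name": "Ada Lovelace", "email": "ada@example.com", "city": "London"},
        {"name": "Alan Turing", "email": "alan@example.com", "city": "Manchester"},
    ]
    target = Path(path)
    # newline="" lets the csv module control line endings itself.
    with target.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["name", "email", "city"])
        writer.writeheader()
        writer.writerows(rows)
    return target
```

The returned path can then be passed as `csv_path` to `generate_artifacts`.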
Detailed Example
The following example wraps the full workflow in a function that processes a customer CSV file and saves the resulting artifacts:
from pathlib import Path

from canonmap import CanonMap


def process_customer_data(input_csv: str, output_dir: str):
    """Generate artifacts from a customer CSV and save them under output_dir."""
    # Initialize CanonMap
    canon = CanonMap()

    # Define the entity fields we want to extract
    entity_fields = [
        "customer_name",
        "email",
        "phone_number",
        "company"
    ]

    # Generate artifacts from the CSV
    artifacts = canon.generate_artifacts(
        csv_path=input_csv,
        entity_fields=entity_fields,
        use_other_fields_as_metadata=True,  # Include other columns as metadata
        num_rows=None  # Process all rows
    )

    # Create output directory if it doesn't exist
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)

    # Save the artifacts
    zip_path = canon.save_artifacts(
        artifacts=artifacts,
        output_path=str(output_path),
        name="customer_data",
        save_metadata=True,
        save_schema=True
    )

    # You can also work with the artifacts directly
    metadata = artifacts["metadata"]
    schema = artifacts["schema"]

    # Example: Print some statistics
    print(f"Processed {metadata.get('row_count', 0)} rows")
    print(f"Found {len(schema.get('entities', []))} entities")

    return zip_path


# Usage
if __name__ == "__main__":
    zip_file = process_customer_data(
        input_csv="customers.csv",
        output_dir="processed_data"
    )
    print(f"Processing complete. Results saved to: {zip_file}")
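Before calling `process_customer_data`, it can be useful to confirm that the CSV actually contains the columns named in `entity_fields`. The following is a minimal sketch using only the standard library; `missing_entity_fields` is a hypothetical helper, and how CanonMap itself reacts to a missing column is not described here.

```python
import csv
from typing import List


def missing_entity_fields(csv_path: str, entity_fields: List[str]) -> List[str]:
    """Return the entity fields that do not appear in the CSV header row."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        # The first row is treated as the header; an empty file yields [].
        header = next(csv.reader(handle), [])
    present = {column.strip() for column in header}
    # Preserve the caller's ordering so the report is easy to read.
    return [field for field in entity_fields if field not in present]
```

A caller might raise a `ValueError` when the returned list is non-empty, so that a misspelled column name surfaces before any processing starts.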
Features
- Process CSV files and generate metadata and schema
- Extract and canonicalize entity fields
- Map data to standardized formats
- Save artifacts as JSON files or ZIP archives
- Configurable processing options
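Since artifacts can be saved as JSON files inside a ZIP archive, one way to inspect a saved archive is to parse every JSON member it contains. This is a sketch under assumptions: the member names and layout of the archive are not specified above, so the hypothetical helper `load_json_members` simply loads whatever `.json` files it finds.

```python
import json
import zipfile
from typing import Any, Dict


def load_json_members(zip_path: str) -> Dict[str, Any]:
    """Parse every .json member of a ZIP archive, keyed by member name."""
    loaded: Dict[str, Any] = {}
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            # Skip anything that is not a JSON file (assumed naming convention).
            if member.endswith(".json"):
                with archive.open(member) as handle:
                    loaded[member] = json.load(handle)
    return loaded
```

For example, `load_json_members(zip_path).keys()` lists the JSON files in an archive returned by `save_artifacts`.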
Requirements
- Python 3.8+
- See setup.py for the full list of dependencies
License
MIT License
File details
Details for the file canonmap-0.1.26.tar.gz.
File metadata
- Download URL: canonmap-0.1.26.tar.gz
- Upload date:
- Size: 18.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5512bbe6406d5c5666ea918170bc4e297c455eaf1dceb367546517f12ba28761 |
| MD5 | d5cf7ff61fca50615ab2146fc08dfb7a |
| BLAKE2b-256 | db7467158fdf91c9ea795a3d29e7b14ac39684cf828b0653da0a549b92dc09b3 |
File details
Details for the file canonmap-0.1.26-py3-none-any.whl.
File metadata
- Download URL: canonmap-0.1.26-py3-none-any.whl
- Upload date:
- Size: 21.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 01868bfd17acafe07d77610b85bf314bcef2067d921ebe9d5c6fb06ab0e53054 |
| MD5 | 4083ad73a83442abd340c494def9d2f8 |
| BLAKE2b-256 | e41c7175c2842b426ea3a966474d619866b1fff916126a7e2f092f4e3ac0519b |
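The SHA256 digests listed above can be used to check a downloaded file. The sketch below computes a file's SHA256 with the standard library and compares it with the published value; the helper names are hypothetical, and the two constants are copied from the File hashes tables for release 0.1.26.

```python
import hashlib

# Digests copied from the File hashes tables above (release 0.1.26).
SDIST_SHA256 = "5512bbe6406d5c5666ea918170bc4e297c455eaf1dceb367546517f12ba28761"
WHEEL_SHA256 = "01868bfd17acafe07d77610b85bf314bcef2067d921ebe9d5c6fb06ab0e53054"


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 with an expected hex digest, ignoring case."""
    return sha256_of_file(path) == expected_hex.lower()
```

For instance, `matches_published_digest("canonmap-0.1.26.tar.gz", SDIST_SHA256)` should return `True` for an intact copy of the source distribution.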