CanonMap - A Python library for entity canonicalization and mapping

CanonMap

A Python library for data mapping and canonicalization.

Installation

pip install canonmap

Dependencies

CanonMap requires the following key dependencies:

  • Python 3.8 or higher
  • spaCy and its English language model

The spaCy English language model (en_core_web_sm) is downloaded automatically the first time you use the library, so no manual installation is required.
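Because the model is fetched on first use, it can help to check in advance whether it is already present, for example on a machine with restricted network access. The following is a minimal sketch using only the standard library; `is_installed` is a hypothetical helper, not part of CanonMap, and it relies on spaCy models being installed as ordinary importable packages.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module or package can be found for import."""
    return importlib.util.find_spec(module_name) is not None


# Report whether spaCy and its small English model are already available.
for name in ("spacy", "en_core_web_sm"):
    status = "installed" if is_installed(name) else "not installed yet"
    print(f"{name}: {status}")
```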

Quick Start

from canonmap import CanonMap

# Initialize the library
canon = CanonMap()

# Generate artifacts from a CSV file
artifacts = canon.generate_artifacts(
    csv_path="path/to/your/data.csv",
    entity_fields=["name", "email"],
    use_other_fields_as_metadata=True
)

# Save artifacts to files
zip_path = canon.save_artifacts(
    artifacts=artifacts,
    output_path="output",
    name="my_data"
)

print(f"Artifacts saved to: {zip_path}")
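To try the Quick Start without real data, a small CSV with the two entity columns used above can be generated first. This is a sketch only: `write_sample_csv`, the sample rows, and the extra `city` column are made up for illustration.

```python
import csv
from pathlib import Path


def write_sample_csv(path: str) -> Path:
    """Write a tiny CSV with 'name' and 'email' columns plus one extra column."""
    rows = [
        {"name": "Ada Lovelace", "email": "ada@example.com", "city": "London"},
        {"name": "Alan Turing", "email": "alan@example.com", "city": "Manchester"},
    ]
    out = Path(path)
    # newline="" lets the csv module control line endings on every platform.
    with out.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "email", "city"])
        writer.writeheader()
        writer.writerows(rows)
    return out
```

The returned path can then be passed as `csv_path`. With `use_other_fields_as_metadata=True`, the `city` column would presumably be carried along as metadata rather than treated as an entity field.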

Detailed Example

The following example wraps the full workflow in a single function: it generates artifacts from a customer CSV, saves them to an output directory, and prints summary statistics.

from canonmap import CanonMap
from pathlib import Path

def process_customer_data(input_csv: str, output_dir: str):
    # Initialize CanonMap
    canon = CanonMap()
    
    # Define the entity fields we want to extract
    entity_fields = [
        "customer_name",
        "email",
        "phone_number",
        "company"
    ]
    
    # Generate artifacts from the CSV
    artifacts = canon.generate_artifacts(
        csv_path=input_csv,
        entity_fields=entity_fields,
        use_other_fields_as_metadata=True,  # Include other columns as metadata
        num_rows=None  # Process all rows
    )
    
    # Create output directory if it doesn't exist
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)
    
    # Save the artifacts
    zip_path = canon.save_artifacts(
        artifacts=artifacts,
        output_path=str(output_path),
        name="customer_data",
        save_metadata=True,
        save_schema=True
    )
    
    # You can also work with the artifacts directly
    metadata = artifacts["metadata"]
    schema = artifacts["schema"]
    
    # Example: Print some statistics
    print(f"Processed {metadata.get('row_count', 0)} rows")
    print(f"Found {len(schema.get('entities', []))} entities")
    
    return zip_path

# Usage
if __name__ == "__main__":
    zip_file = process_customer_data(
        input_csv="customers.csv",
        output_dir="processed_data"
    )
    print(f"Processing complete. Results saved to: {zip_file}")
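Before calling a function like `process_customer_data`, it may be worth confirming that the CSV actually contains the columns named in `entity_fields`. The sketch below reads only the header row with the standard library; `missing_columns` is a hypothetical helper, and how CanonMap itself reacts to a missing column is not documented here.

```python
import csv
from typing import List


def missing_columns(csv_path: str, required: List[str]) -> List[str]:
    """Return the required column names that are absent from the CSV header."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        # Read only the first row; an empty file yields an empty header.
        header = next(csv.reader(f), [])
    present = {col.strip() for col in header}
    return [col for col in required if col not in present]
```

An empty result means every requested field is present; otherwise the returned names can be reported to the user before any processing starts.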

Features

  • Process CSV files and generate metadata and schema
  • Extract and canonicalize entity fields
  • Map data to standardized formats
  • Save artifacts as JSON files or ZIP archives
  • Configurable processing options
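Since artifacts can be saved as a ZIP archive, the standard `zipfile` module offers a quick way to see what was written. This is a sketch that assumes the path returned by `save_artifacts` points to a regular ZIP file; `list_archive` is a hypothetical helper, and the file names inside the archive are not specified by this page.

```python
import zipfile
from typing import List


def list_archive(zip_path: str) -> List[str]:
    """Return the sorted names of all members stored in a ZIP archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return sorted(archive.namelist())
```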

Requirements

  • Python 3.8+
  • See setup.py for the full list of dependencies

License

MIT License

Download files

Source Distribution

canonmap-0.1.43.tar.gz (21.9 kB)

Uploaded Source

Built Distribution

canonmap-0.1.43-py3-none-any.whl (26.4 kB)

Uploaded Python 3

File details

Details for the file canonmap-0.1.43.tar.gz.

File metadata

  • Download URL: canonmap-0.1.43.tar.gz
  • Size: 21.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.1

File hashes

Hashes for canonmap-0.1.43.tar.gz
Algorithm Hash digest
SHA256 95d0773d7757c7c842c8819258a78052fe6d1e47b4a25134795e213475c78018
MD5 1a6d4f29e56d923c5d3193b5593140c0
BLAKE2b-256 f39b5311bc951f0c7e0aadae75edbef8da9e810cf7749dcbb1f54d3482f7d7a8
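A downloaded file can be checked against the SHA256 digest listed above with Python's `hashlib`. The sketch below is one way to do it; `sha256_of` and `matches_sha256` are hypothetical helper names, and the local file path in the usage comment is an assumption about where the download was saved.

```python
import hashlib

# SHA256 digest listed above for canonmap-0.1.43.tar.gz.
EXPECTED_SHA256 = "95d0773d7757c7c842c8819258a78052fe6d1e47b4a25134795e213475c78018"


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha256(path: str, expected_hex: str) -> bool:
    """Compare a file's digest with an expected value, ignoring case and whitespace."""
    return sha256_of(path) == expected_hex.strip().lower()


# Usage, assuming the archive was saved in the current directory:
# matches_sha256("canonmap-0.1.43.tar.gz", EXPECTED_SHA256)
```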

File details

Details for the file canonmap-0.1.43-py3-none-any.whl.

File metadata

  • Download URL: canonmap-0.1.43-py3-none-any.whl
  • Size: 26.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.1

File hashes

Hashes for canonmap-0.1.43-py3-none-any.whl
Algorithm Hash digest
SHA256 dce1a2e147897acd752cac96b550ab7a26d60e5ee8d4302c634a4dc4d98b0203
MD5 c29dd776c1111445c10d456aa5dec6ae
BLAKE2b-256 d2f2fec835e36ddfe2c3e8cc795152ebd303b46504144c26eaecd5d556e474f0
