DataHub Custom Transformers

A collection of custom DataHub transformers for various metadata enhancement tasks.

Features

  • 🏗️ Modular Design: Easy to add new transformers
  • 🔧 Production Ready: Tested and documented transformers
  • 🔌 Auto-Discovery: Transformers are automatically registered with DataHub

Installation

uv add datahub-custom-transformers

Available Transformers

Domain Structured Properties Transformer

Adds structured properties whose values are domain URNs to every dataset produced by an ingestion run.

Use Case: Organizational data classification where all datasets from a source belong to the same environment, team, or department.

transformers:
  - type: "simple_add_dataset_domain_structured_properties"
    config:
      properties:
        environment: "production_environment"
        team: "data_engineering_team"
        department: "engineering_department"

Quick Start

1. Prerequisites

Create structured properties in DataHub:

# structured_properties.yaml
# Define one entry like this for each property key used in the recipe (e.g. "team").
- id: environment
  type: urn
  description: "Data environment assignment"
  display_name: "Environment"
  entity_types: [dataset]
  cardinality: SINGLE
  type_qualifier:
    allowed_types: ["urn:li:entityType:datahub.domain"]

Create the domain entities that the recipe references as property values:

  • production_environment
  • data_engineering_team
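The prerequisites can be checked before running an ingestion. The following is a minimal sketch, not part of this package: `missing_prerequisites` is a hypothetical helper that compares the `properties` mapping from a recipe against a list of structured property definitions (already parsed into dictionaries) and reports which property ids lack a definition and which domain URNs need to exist. The `urn:li:domain:<id>` form follows the Result section of this README.

```python
def missing_prerequisites(
    recipe_properties: dict[str, str],
    property_definitions: list[dict],
) -> tuple[list[str], list[str]]:
    """Return (property ids without a definition, domain URNs the recipe needs)."""
    defined_ids = {definition["id"] for definition in property_definitions}
    missing_properties = sorted(
        key for key in recipe_properties if key not in defined_ids
    )
    required_domains = sorted(
        {f"urn:li:domain:{value}" for value in recipe_properties.values()}
    )
    return missing_properties, required_domains


# Illustrative data: the recipe uses two properties, but only one is defined.
recipe_properties = {
    "environment": "production_environment",
    "team": "data_engineering_team",
}
definitions = [{"id": "environment"}]

missing, domains = missing_prerequisites(recipe_properties, definitions)
print(missing)  # ['team']
print(domains)
```

A non-empty first list means a structured property still has to be created before the transformer's output can be accepted by DataHub.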

2. Use in Ingestion Recipe

source:
  type: postgres
  config:
    host_port: "localhost:5432"
    database: "analytics_db"

transformers:
  - type: "simple_add_dataset_domain_structured_properties"
    config:
      properties:
        environment: "production_environment"
        team: "data_engineering_team"

sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"

3. Run Ingestion

datahub ingest -c config.yaml
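The same recipe can also be run from Python instead of the CLI. The sketch below assumes `acryl-datahub` is installed and uses its `Pipeline` class; `build_recipe` and `run_ingestion` are hypothetical helper names, and the connection values are copied from the recipe above.

```python
def build_recipe(server: str = "http://localhost:8080") -> dict:
    """Build the Quick Start recipe as a plain dictionary."""
    return {
        "source": {
            "type": "postgres",
            "config": {"host_port": "localhost:5432", "database": "analytics_db"},
        },
        "transformers": [
            {
                "type": "simple_add_dataset_domain_structured_properties",
                "config": {
                    "properties": {
                        "environment": "production_environment",
                        "team": "data_engineering_team",
                    }
                },
            }
        ],
        "sink": {"type": "datahub-rest", "config": {"server": server}},
    }


def run_ingestion(recipe: dict) -> None:
    """Run the recipe and raise if the pipeline reports failures."""
    # Imported lazily so build_recipe works without acryl-datahub installed.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(recipe)
    pipeline.run()
    pipeline.raise_from_status()
```

Calling `run_ingestion(build_recipe())` should be roughly equivalent to the CLI command, provided the source database and the DataHub server are reachable.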

Result

Every dataset emitted by the ingestion run carries the configured structured properties:

{
  "structuredProperties": {
    "properties": [
      {
        "propertyUrn": "urn:li:structuredProperty:environment",
        "values": ["urn:li:domain:production_environment"]
      },
      {
        "propertyUrn": "urn:li:structuredProperty:team",
        "values": ["urn:li:domain:data_engineering_team"]
      }
    ]
  }
}
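The mapping from recipe configuration to the result above can be sketched in a few lines. This is an illustration of the URN pattern visible in the output, not the transformer's actual implementation; `expected_structured_properties` is a hypothetical name.

```python
import json


def expected_structured_properties(properties: dict[str, str]) -> dict:
    """Build the structuredProperties payload expected for a properties mapping."""
    return {
        "structuredProperties": {
            "properties": [
                {
                    # Each config key becomes a structured property URN.
                    "propertyUrn": f"urn:li:structuredProperty:{property_id}",
                    # Each config value becomes a single domain URN.
                    "values": [f"urn:li:domain:{domain_id}"],
                }
                for property_id, domain_id in properties.items()
            ]
        }
    }


print(
    json.dumps(
        expected_structured_properties(
            {
                "environment": "production_environment",
                "team": "data_engineering_team",
            }
        ),
        indent=2,
    )
)
```

A helper like this can be handy in tests that compare what DataHub returns for a dataset with what the recipe was meant to produce.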

Supported DataHub Sources

Works with any DataHub source that emits dataset entities, including:

  • BigQuery, Snowflake, PostgreSQL, MySQL, Redshift
  • dbt, Airflow, Kafka, S3
  • And many more...

Requirements

  • Python 3.11+
  • acryl-datahub >= 0.12.0
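The requirements can be verified in the target environment with a short script. This is a simplified sketch using only the standard library: `numeric_prefix` and `check_requirements` are hypothetical helpers, and the version comparison looks only at the leading numeric components, ignoring pre-release suffixes.

```python
import re
import sys
from importlib.metadata import PackageNotFoundError, version


def numeric_prefix(version_string: str) -> tuple[int, ...]:
    """Leading dotted integers of a version string, padded to three parts."""
    match = re.match(r"\d+(?:\.\d+)*", version_string)
    parts = tuple(int(p) for p in match.group().split(".")) if match else ()
    return parts + (0,) * (3 - len(parts))


def check_requirements() -> list[str]:
    """Return a list of human-readable problems; empty means requirements are met."""
    problems = []
    if sys.version_info < (3, 11):
        problems.append("Python 3.11 or newer is required")
    try:
        installed = version("acryl-datahub")
    except PackageNotFoundError:
        problems.append("acryl-datahub is not installed")
    else:
        if numeric_prefix(installed) < (0, 12, 0):
            problems.append(f"acryl-datahub {installed} is older than 0.12.0")
    return problems


if __name__ == "__main__":
    for problem in check_requirements():
        print(problem)
```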

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Add your transformer with tests
  4. Submit a pull request

License

MIT License - see LICENSE file.
