Framework to convert Informatica PowerCenter XML exports to PySpark code for Databricks. Auto-detects sources (SQL, CSV, Parquet, XML, JSON, text, DAT, files without extensions) and generates complete deployment packages.

Project description

informatica-sparker

A Python framework that converts Informatica PowerCenter workflow/mapping XML exports into PySpark code deployable to Databricks.

Features

  • Multi-Mapping Support: Handles any number of mappings per XML file, generating separate .py files for each mapping
  • Auto Source Detection: Automatically identifies source types and connection details:
    • SQL databases (SQL Server, Oracle, MySQL, PostgreSQL, DB2, Teradata, Netezza, Sybase, Informix)
    • File formats: CSV, Parquet, DAT, XML, JSON, Text, Fixed-Width, Avro, ORC, Excel
    • Files without extensions
    • JDBC/ODBC connections with driver JAR detection
  • Complete Output Package: Generates a deployment-ready package:
    • mapping_name.py - PySpark script for each mapping
    • workflow.py - Workflow orchestration with dependency management
    • config.yml - Unified YAML configuration with environment variable support
    • all_sql_queries.sql - All extracted SQL queries organized by mapping
    • error_log.txt - Detailed conversion log with warnings, errors, and source detection results
  • Transformation Coverage: Supports Source Qualifier, Expression, Filter, Lookup, Joiner, Aggregator, Sorter, Union, Router, Sequence Generator, Update Strategy, Stored Procedure, Mapplet
  • Python 3.10+ Compatible

Installation

pip install informatica-sparker

Quick Start

Command Line

# Convert XML to PySpark
informatica-sparker convert mapping_export.xml -o output_dir

# Analyze XML without converting
informatica-sparker analyze mapping_export.xml

# Analyze with JSON output
informatica-sparker analyze mapping_export.xml --json

# Use custom config
informatica-sparker convert mapping_export.xml -o output_dir -c my_config.yml

Python API

from informatica_sparker import ConversionService, UserConfig

# Basic conversion
service = ConversionService()
result = service.convert_file("mapping_export.xml", output_dir="output")

print(f"Mappings converted: {result.mappings_processed}/{result.mapping_count}")
print(f"Files generated: {len(result.files)}")
print(f"SQL queries found: {len(result.sql_queries)}")

# Check source detections
for detection in result.source_detections:
    print(f"  {detection.source_name}: {detection.detected_type.value}")
    if detection.file_format:
        print(f"    Format: {detection.file_format.value}")
    for note in detection.detection_notes:
        print(f"    {note}")

# Inspect extracted SQL queries
for query in result.sql_queries:
    print(f"  [{query.query_type}] {query.step_name}: {query.query[:80]}...")

With Custom Configuration

from informatica_sparker import ConversionService, UserConfig

config = UserConfig(
    db_connections={
        "source_db": {
            "host": "myserver.database.windows.net",
            "database": "mydb",
            "user": "admin",
            "password": "secret",
        }
    }
)

service = ConversionService(user_config=config)
result = service.convert_file("export.xml", output_dir="spark_output")

Output Structure

output/
  mapping_1.py          # PySpark code for mapping 1
  mapping_2.py          # PySpark code for mapping 2
  mapping_N.py          # PySpark code for mapping N
  workflow.py           # Workflow orchestration (runs all mappings in order)
  config.yml            # Unified YAML config (connections, sources, targets)
  all_sql_queries.sql   # All SQL queries extracted from all mappings
  error_log.txt         # Conversion log with warnings, errors, detections
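
The generated workflow.py executes the mappings in dependency order. The sketch below illustrates the general idea only (it is not the generated file; run_mapping_1 and run_mapping_2 are hypothetical stand-ins for the generated scripts), using networkx, one of the package's declared dependencies, to derive the run order:

import networkx as nx

def run_mapping_1():
    # Stand-in for executing the generated mapping_1.py logic
    print("running mapping_1")

def run_mapping_2():
    # Stand-in for executing the generated mapping_2.py logic
    print("running mapping_2")

steps = {"mapping_1": run_mapping_1, "mapping_2": run_mapping_2}

# mapping_2 consumes mapping_1's output, so it must run second
dag = nx.DiGraph()
dag.add_edge("mapping_1", "mapping_2")

for name in nx.topological_sort(dag):
    steps[name]()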

Source Type Detection

The framework automatically identifies the type of each source defined in the XML:

Source Type        Detection Method
-----------        ----------------
SQL Server         DATABASETYPE attribute, connection properties
Oracle             DATABASETYPE attribute, JDBC driver class
Flat File (CSV)    File extension, DATABASETYPE=Flat File, delimiter attributes
Parquet            .parquet file extension in source attributes
DAT                .dat file extension
XML                .xml extension or DATABASETYPE=XML
JSON               .json extension or DATABASETYPE=JSON
Text               .txt/.text/.log extension
No Extension       File source with no recognizable extension
Fixed Width        DATABASETYPE=Fixed-Width or file type attribute

Connection details (JDBC URLs, driver JARs, host/port) are automatically extracted and included in the generated config.yml.
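
For context, detection keys off the export itself. A heavily simplified source element (real PowerCenter exports carry many more attributes and fields) looks roughly like:

<SOURCE NAME="CUSTOMERS" DATABASETYPE="Microsoft SQL Server">
  <SOURCEFIELD NAME="CUSTOMER_ID" DATATYPE="number"/>
</SOURCE>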

Supported Transformations

Informatica Transform    PySpark Equivalent
---------------------    ------------------
Source Qualifier         spark.read.format("jdbc") / spark.read.csv(), etc.
Expression               .withColumn() / .select() with expressions
Filter                   .filter() / .where()
Lookup                   .join() with broadcast hint
Joiner                   .join() (inner, left, right, full)
Aggregator               .groupBy().agg()
Sorter                   .orderBy()
Union                    df1.unionByName(df2)
Router                   Multiple .filter() branches
Sequence Generator       monotonically_increasing_id()
Update Strategy          Insert/Update/Delete flags
Target                   .write.format("jdbc") / .write.format("delta")
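
As a rough illustration of the shape such generated code takes, a Source Qualifier → Expression → Filter → Aggregator → Target chain could translate along these lines (a hand-written sketch with made-up paths and column names, not actual generator output):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("example_mapping").getOrCreate()

# Source Qualifier -> spark.read (a CSV source in this sketch)
df = spark.read.csv("/data/orders.csv", header=True, inferSchema=True)

# Expression -> withColumn
df = df.withColumn("net_amount", F.col("amount") - F.col("discount"))

# Filter -> filter/where
df = df.filter(F.col("status") == "SHIPPED")

# Aggregator -> groupBy().agg()
result = df.groupBy("region").agg(F.sum("net_amount").alias("total_net"))

# Target -> write (Delta, available on Databricks)
result.write.format("delta").mode("overwrite").save("/delta/order_totals")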

Configuration File (config.yml)

The generated config.yml supports environment variable substitution:

spark:
  app_name: "my_workflow"
  master: "${SPARK_MASTER:local[*]}"

connections:
  CDM_PRE_LANDING:
    db_type: "sqlserver"
    host: "${MSSQL_HOST}"
    database: "msscdm_dev"
    user: "${MSSQL_USER}"
    password: "${MSSQL_PASSWORD}"
    driver: "com.microsoft.sqlserver.jdbc.SQLServerDriver"
    driver_jar: "${MSSQL_DRIVER_JAR:/opt/drivers/mssql-jdbc.jar}"
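
${VAR} is resolved from the environment, and ${VAR:default} falls back to the default when the variable is unset. A minimal resolver for that syntax might look like the following (an illustrative sketch, not the package's actual implementation):

import os
import re

_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::([^}]*))?\}")

def resolve_env(value: str) -> str:
    """Replace ${VAR} and ${VAR:default} with environment values."""
    def sub(match: re.Match) -> str:
        name, default = match.group(1), match.group(2)
        if name in os.environ:
            return os.environ[name]
        if default is not None:
            return default
        raise KeyError(f"environment variable {name} is not set and has no default")
    return _VAR.sub(sub, value)

print(resolve_env("${SPARK_MASTER:local[*]}"))  # -> local[*] unless SPARK_MASTER is set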

Requirements

  • Python >= 3.10
  • lxml >= 4.9.0
  • pydantic >= 2.0.0
  • jinja2 >= 3.1.0
  • networkx >= 3.0
  • pyyaml >= 6.0

License

MIT
