Framework to convert Informatica PowerCenter XML exports to PySpark code for Databricks. Auto-detects sources (SQL, CSV, Parquet, XML, JSON, text, DAT, files without extensions) and generates complete deployment packages.
Project description
informatica-sparker
A Python framework that converts Informatica PowerCenter workflow/mapping XML exports into PySpark code deployable to Databricks.
Features
- Multi-Mapping Support: Handles any number of mappings per XML file, generating a separate `.py` file for each mapping
- Auto Source Detection: Automatically identifies source types and connection details:
- SQL databases (SQL Server, Oracle, MySQL, PostgreSQL, DB2, Teradata, Netezza, Sybase, Informix)
- File formats: CSV, Parquet, DAT, XML, JSON, Text, Fixed-Width, Avro, ORC, Excel
- Files without extensions
- JDBC/ODBC connections with driver JAR detection
- Complete Output Package: Generates a full deployment-ready package:
  - `mapping_name.py` - PySpark script for each mapping
  - `workflow.py` - Workflow orchestration with dependency management
  - `config.yml` - Unified YAML configuration with environment variable support
  - `all_sql_queries.sql` - All extracted SQL queries organized by mapping
  - `error_log.txt` - Detailed conversion log with warnings, errors, and source detection results
- Transformation Coverage: Supports Source Qualifier, Expression, Filter, Lookup, Joiner, Aggregator, Sorter, Union, Router, Sequence Generator, Update Strategy, Stored Procedure, Mapplet
- Python 3.10+ Compatible
Installation
```bash
pip install informatica-sparker
```
Quick Start
Command Line
```bash
# Convert XML to PySpark
informatica-sparker convert mapping_export.xml -o output_dir

# Analyze XML without converting
informatica-sparker analyze mapping_export.xml

# Analyze with JSON output
informatica-sparker analyze mapping_export.xml --json

# Use custom config
informatica-sparker convert mapping_export.xml -o output_dir -c my_config.yml
```
Python API
```python
from informatica_sparker import ConversionService, UserConfig

# Basic conversion
service = ConversionService()
result = service.convert_file("mapping_export.xml", output_dir="output")

print(f"Mappings converted: {result.mappings_processed}/{result.mapping_count}")
print(f"Files generated: {len(result.files)}")
print(f"SQL queries found: {len(result.sql_queries)}")

# Check source detections
for detection in result.source_detections:
    print(f"  {detection.source_name}: {detection.detected_type.value}")
    if detection.file_format:
        print(f"    Format: {detection.file_format.value}")
    for note in detection.detection_notes:
        print(f"      {note}")

# Inspect extracted SQL queries
for query in result.sql_queries:
    print(f"  [{query.query_type}] {query.step_name}: {query.query[:80]}...")
```
With Custom Configuration
```python
from informatica_sparker import ConversionService, UserConfig

config = UserConfig(
    db_connections={
        "source_db": {
            "host": "myserver.database.windows.net",
            "database": "mydb",
            "user": "admin",
            "password": "secret",
        }
    }
)

service = ConversionService(user_config=config)
result = service.convert_file("export.xml", output_dir="spark_output")
```
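The hardcoded password above is for brevity only. Since `UserConfig` takes plain dictionaries, credentials can just as well come from the environment; this is standard Python rather than a library feature, and the variable names are illustrative:

```python
import os

from informatica_sparker import UserConfig

# Pull connection details from environment variables (names are hypothetical)
config = UserConfig(
    db_connections={
        "source_db": {
            "host": os.environ["SOURCE_DB_HOST"],
            "database": os.environ["SOURCE_DB_NAME"],
            "user": os.environ["SOURCE_DB_USER"],
            "password": os.environ["SOURCE_DB_PASSWORD"],
        }
    }
)
```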
Output Structure
```
output/
    mapping_1.py          # PySpark code for mapping 1
    mapping_2.py          # PySpark code for mapping 2
    mapping_N.py          # PySpark code for mapping N
    workflow.py           # Workflow orchestration (runs all mappings in order)
    config.yml            # Unified YAML config (connections, sources, targets)
    all_sql_queries.sql   # All SQL queries extracted from all mappings
    error_log.txt         # Conversion log with warnings, errors, detections
```
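What a generated `mapping_N.py` contains depends on the transformations in the mapping, but each script follows the same read-transform-write shape. The sketch below is hand-written and representative, not the generator's verbatim output; the table name, filter condition, and credentials are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mapping_1").getOrCreate()

# Source Qualifier -> JDBC read; generated code resolves these values
# from config.yml (hardcoded here for illustration)
src_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://myserver:1433;databaseName=msscdm_dev")
    .option("dbtable", "dbo.CUSTOMERS")
    .option("user", "etl_user")
    .option("password", "***")
    .load()
)

# Filter transformation -> .filter()
active_df = src_df.filter("IS_ACTIVE = 1")

# Target -> Delta write
active_df.write.format("delta").mode("overwrite").saveAsTable("landing.customers")
```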
Source Type Detection
The framework automatically identifies the type of each source declared in the XML:
| Source Type | Detection Method |
|---|---|
| SQL Server | DATABASETYPE attribute, connection properties |
| Oracle | DATABASETYPE attribute, JDBC driver class |
| Flat File (CSV) | File extension, DATABASETYPE=Flat File, delimiter attributes |
| Parquet | .parquet file extension in source attributes |
| DAT | .dat file extension |
| XML | .xml extension or DATABASETYPE=XML |
| JSON | .json extension or DATABASETYPE=JSON |
| Text | .txt/.text/.log extension |
| No Extension | File source with no recognizable extension |
| Fixed Width | DATABASETYPE=Fixed-Width or file type attribute |
Connection details (JDBC URLs, driver JARs, host/port) are automatically extracted and included in the generated config.yml.
Supported Transformations
| Informatica Transform | PySpark Equivalent |
|---|---|
| Source Qualifier | spark.read.format("jdbc") / spark.read.csv() etc. |
| Expression | .withColumn() / .select() with expressions |
| Filter | .filter() / .where() |
| Lookup | .join() with broadcast hint |
| Joiner | .join() (inner, left, right, full) |
| Aggregator | .groupBy().agg() |
| Sorter | .orderBy() |
| Union | df1.unionByName(df2) |
| Router | Multiple .filter() branches |
| Sequence Generator | monotonically_increasing_id() |
| Update Strategy | Insert/Update/Delete flags |
| Target | .write.format("jdbc") / .write.format("delta") |
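To make a few of those rows concrete: a Lookup becomes a left join with a broadcast hint on the small side, a Router becomes one `.filter()` branch per output group, and a Sequence Generator becomes a surrogate-key column. The DataFrames and column names below are hypothetical:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

main_df = spark.createDataFrame([(1, 1200.0), (2, 300.0)], ["customer_id", "order_total"])
lookup_df = spark.createDataFrame([(1, "EMEA"), (2, "APAC")], ["customer_id", "region"])

# Lookup -> left join, broadcasting the small lookup table
enriched_df = main_df.join(F.broadcast(lookup_df), on="customer_id", how="left")

# Router -> one .filter() branch per router group
high_value_df = enriched_df.filter(F.col("order_total") >= 1000)
default_df = enriched_df.filter(F.col("order_total") < 1000)

# Sequence Generator -> monotonically increasing surrogate key
keyed_df = enriched_df.withColumn("seq_id", F.monotonically_increasing_id())
```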
Configuration File (config.yml)
The generated config.yml supports environment variable substitution:
```yaml
spark:
  app_name: "my_workflow"
  master: "${SPARK_MASTER:local[*]}"

connections:
  CDM_PRE_LANDING:
    db_type: "sqlserver"
    host: "${MSSQL_HOST}"
    database: "msscdm_dev"
    user: "${MSSQL_USER}"
    password: "${MSSQL_PASSWORD}"
    driver: "com.microsoft.sqlserver.jdbc.SQLServerDriver"
    driver_jar: "${MSSQL_DRIVER_JAR:/opt/drivers/mssql-jdbc.jar}"
```
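The `${VAR}` / `${VAR:default}` syntax reads a value from the environment, falling back to the text after the colon when a default is given. How informatica-sparker resolves these internally is not documented here; a minimal stand-alone sketch of the convention:

```python
import os
import re

_PLACEHOLDER = re.compile(r"\$\{(\w+)(?::([^}]*))?\}")

def resolve(value: str) -> str:
    """Expand ${VAR} and ${VAR:default} placeholders from the environment."""
    def replace(match: re.Match) -> str:
        name, default = match.group(1), match.group(2)
        resolved = os.environ.get(name, default)
        if resolved is None:
            raise KeyError(f"{name} is not set and has no default")
        return resolved
    return _PLACEHOLDER.sub(replace, value)

# Value of SPARK_MASTER if set, otherwise "local[*]"
print(resolve("${SPARK_MASTER:local[*]}"))
```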
Requirements
- Python >= 3.10
- lxml >= 4.9.0
- pydantic >= 2.0.0
- jinja2 >= 3.1.0
- networkx >= 3.0
- pyyaml >= 6.0
License
MIT
Download files
File details

Details for the file informatica_sparker-2.0.0.tar.gz.

File metadata

- Download URL: informatica_sparker-2.0.0.tar.gz
- Size: 78.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.19

File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | b9f68065ccc01e6c982fae6e25853725043a4fcd77911e9c0b999217c633b0ff |
| MD5 | 9f7f474c3cde09bc268d70aaa60606bc |
| BLAKE2b-256 | 03b27714d5b0259ed7631fd4380968b5558ae9c9f6a399e3534df6a25dab31a3 |
File details

Details for the file informatica_sparker-2.0.0-py3-none-any.whl.

File metadata

- Download URL: informatica_sparker-2.0.0-py3-none-any.whl
- Size: 87.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.19

File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 03bd5973c0df5ff27bfa9bfef93cbd3cce63e0dda00ad9195fc9b5dff7138251 |
| MD5 | e34735fcd43a9c1aba308241a7bedaaa |
| BLAKE2b-256 | d73697b78e976490d30c36e01c4b2c146c96529f95d04c36da5adee86489401a |