Phani's Generic Data Reconciliation and Deduplication Utility
SAP ↔ Salesforce Data Reconciliation Utility
Reconcile SAP and Salesforce master data at bulk scale (300K–400K records). Produces a 10-tab Excel workbook and an HTML dashboard with KPIs, field-level diffs, fuzzy match candidates, and a prioritised action plan.
Supports multiple entity pairs through config, including:
- Accounts
- Orders
- Order Items
- Any custom SAP ↔ SF dataset with mapped keys and fields
Quick Start
Installation
# Install from PyPI
py -m pip install phani-data-recon
# Upgrade to the latest version
py -m pip install --upgrade phani-data-recon
# Verify installed version
py -m pip show phani-data-recon
Run the Application
Using the CLI command:
# Verify the CLI is available
reconcile-accounts --help
# Run with explicit SAP and Salesforce input files
reconcile-accounts --sap input/sap_accounts.csv --sf input/sf_accounts.csv
Using the module form (alternative if the CLI is not on PATH):
py -m phani_data_recon.cli --sap input/sap_accounts.csv --sf input/sf_accounts.csv
Configuration & Input/Output
Configure input and output folders via config file:
Edit config/rules.yaml to set default paths:
input:
  sap:
    directory: "input"
    file_name: "sap_accounts.csv"
  sf:
    directory: "input"
    file_name: "sf_accounts.csv"
output:
  formats: ["excel", "html"]
  report:
    directory: "output"
    file_name: "reconciliation_report"
Then run with config:
reconcile-accounts --config config/rules.yaml
py -m phani_data_recon.cli --config config/rules.yaml
Override output directory at runtime:
reconcile-accounts --config config/rules.yaml --output-dir output/custom_run
py -m phani_data_recon.cli --config config/rules.yaml --output-dir output/custom_run
Validation & Advanced Options
Validate headers and config only (dry-run):
reconcile-accounts --sap input/sap_accounts.csv --sf input/sf_accounts.csv --dry-run
py -m phani_data_recon.cli --sap input/sap_accounts.csv --sf input/sf_accounts.csv --dry-run
Generate only HTML output:
reconcile-accounts --sap input/sap_accounts.csv --sf input/sf_accounts.csv --formats html
py -m phani_data_recon.cli --sap input/sap_accounts.csv --sf input/sf_accounts.csv --formats html
Skip fuzzy matching (faster for large files):
reconcile-accounts --sap input/sap_accounts.csv --sf input/sf_accounts.csv --no-fuzzy
py -m phani_data_recon.cli --sap input/sap_accounts.csv --sf input/sf_accounts.csv --no-fuzzy
Enable verbose logging:
reconcile-accounts --sap input/sap_accounts.csv --sf input/sf_accounts.csv --verbose
py -m phani_data_recon.cli --sap input/sap_accounts.csv --sf input/sf_accounts.csv --verbose
Standalone Dedup Command (Separate from Reconciliation)
Use this command when you only want deduplication and do not want to run the full reconciliation flow. It works for Accounts, Orders, Order Items, and any other entity pair configured in YAML.
You can choose dedup logic per run:
- primary: dedup by primary key column
- fuzzy: dedup by similarity on one or more fields passed as CLI params
How fields are resolved for each mode:
- --dedup-mode primary uses --sap-primary-key / --sf-primary-key when passed; otherwise uses join.primary.sap_col / join.primary.sf_col from config.
- --dedup-mode fuzzy uses --sap-fuzzy-field / --sf-fuzzy-field when passed; otherwise uses defaults from config: fuzzy_match.fields[].sap_col for SAP and sf_fuzzy_dedup.fields[].col for SF.
- If fuzzy mode is selected and no fuzzy fields resolve for the selected entity, the run fails fast with a clear error.
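The fuzzy-field resolution rules above can be sketched in a few lines of Python. This is only an illustration: the function name resolve_fuzzy_fields and the sample config dict are hypothetical, and the package's internal implementation may differ. Only the config keys (fuzzy_match.fields[].sap_col and sf_fuzzy_dedup.fields[].col) come from the description above.

```python
from typing import Optional, Sequence


def resolve_fuzzy_fields(system: str,
                         cli_fields: Optional[Sequence[str]],
                         config: dict) -> list[str]:
    """Return the fuzzy dedup fields for one system ('sap' or 'sf').

    CLI values win; otherwise the config defaults are used. A ValueError
    is raised when nothing resolves (the "fail fast" case).
    """
    if cli_fields:
        return list(cli_fields)

    if system == "sap":
        entries = config.get("fuzzy_match", {}).get("fields", [])
        fields = [e["sap_col"] for e in entries if e.get("sap_col")]
    elif system == "sf":
        entries = config.get("sf_fuzzy_dedup", {}).get("fields", [])
        fields = [e["col"] for e in entries if e.get("col")]
    else:
        raise ValueError(f"unknown system: {system!r}")

    if not fields:
        raise ValueError(f"no fuzzy fields resolved for {system}")
    return fields


# Hypothetical config fragment, shaped after the keys named above.
config = {
    "fuzzy_match": {"fields": [{"sap_col": "name1"}, {"sap_col": "smtp_addr"}]},
    "sf_fuzzy_dedup": {"fields": [{"col": "Name"}]},
}

print(resolve_fuzzy_fields("sap", None, config))            # config defaults
print(resolve_fuzzy_fields("sf", ["WC_Email__c"], config))  # CLI override
```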
# Accounts (default rules.yaml)
dedup-records --system both --config config/rules.yaml --output-dir output/dedup
# Orders
dedup-records --system both --config config/rules.orders.example.yaml --output-dir output/dedup
# Order Items
dedup-records --system both --config config/rules.order_items.example.yaml --output-dir output/dedup
# Deduplicate only one side
dedup-records --system sap --sap input/sap_orders.csv --config config/rules.orders.example.yaml --output-dir output/dedup
dedup-records --system sf --sf input/sf_order_items.csv --config config/rules.order_items.example.yaml --output-dir output/dedup
# Fuzzy dedup with explicit entity fields (SF)
dedup-records --system sf --dedup-mode fuzzy --sf-fuzzy-field Name --sf-fuzzy-field WC_Email__c --fuzzy-min-score 85 --config config/rules.yaml --output-dir output/dedup
# Fuzzy dedup with explicit entity fields (SAP)
dedup-records --system sap --dedup-mode fuzzy --sap-fuzzy-field name1 --sap-fuzzy-field smtp_addr --fuzzy-min-score 85 --config config/rules.orders.example.yaml --output-dir output/dedup
# Fuzzy dedup on both systems with entity-specific fields passed from CLI
dedup-records --system both --dedup-mode fuzzy --sap-fuzzy-field Order_Number --sap-fuzzy-field Customer_Name --sf-fuzzy-field OrderNumber --sf-fuzzy-field AccountName --fuzzy-min-score 85 --fuzzy-match-mode weighted --config config/rules.orders.example.yaml --output-dir output/dedup
# Module form (if console script is not on PATH)
py -m phani_data_recon.dedup_cli --system sf --sf input/sf_accounts.csv --config config/rules.yaml --output-dir output/dedup
Output files are generated as:
- output/dedup/<entity>_sap_deduped_<run_id>.csv
- output/dedup/<entity>_sap_duplicates_<run_id>.csv
- output/dedup/<entity>_sf_deduped_<run_id>.csv
- output/dedup/<entity>_sf_duplicates_<run_id>.csv
Where <entity> comes from entities.pair in config (slug format), for example:
- accounts_sap_deduped_<run_id>.csv
- orders_sf_duplicates_<run_id>.csv
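As a rough sketch of the naming pattern, the snippet below builds the two dedup file names for one system. The exact slug rule is an assumption (lowercase, alphanumeric runs joined by underscores), the helper names are hypothetical, and the run_id value is passed in because its format is not documented here.

```python
import re
from pathlib import PurePosixPath


def slugify(label: str) -> str:
    """Assumed slug rule: lowercase, alphanumeric runs joined by underscores."""
    return "_".join(re.findall(r"[a-z0-9]+", label.lower()))


def dedup_output_paths(output_dir: str, pair: str, system: str, run_id: str) -> dict:
    """Build the deduped and duplicates CSV paths for one system ('sap' or 'sf')."""
    entity = slugify(pair)
    base = PurePosixPath(output_dir)
    return {
        kind: str(base / f"{entity}_{system}_{kind}_{run_id}.csv")
        for kind in ("deduped", "duplicates")
    }


# "RUN123" is a placeholder run id used only for illustration.
paths = dedup_output_paths("output/dedup", "Order Items", "sf", "RUN123")
print(paths["deduped"])
print(paths["duplicates"])
```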
Entity Command Cookbook
Use this section as a direct command reference for the two primary commands:
- Reconciliation command: reconcile-accounts
- Dedup-only command: dedup-records
Account Reconciliation + Dedup
# Reconciliation (Accounts)
reconcile-accounts --config config/rules.yaml
# Reconciliation with join-key overrides from CLI (also updates rules.yaml)
reconcile-accounts --config config/rules.yaml --sap-primary-key kunnr --sf-primary-key BP_PowerCerv_Account_Id__c --sap-fallback-key SAP_Unique_ID --sf-fallback-key WC_SAP_Identification__c --enable-fallback-key
# Dedup only (Accounts)
dedup-records --system both --config config/rules.yaml --output-dir output/dedup/accounts
# Dedup (Accounts) using fuzzy mode with explicit SF fields
dedup-records --system sf --dedup-mode fuzzy --sf-fuzzy-field Name --sf-fuzzy-field WC_Email__c --fuzzy-min-score 85 --fuzzy-match-mode any --config config/rules.yaml --output-dir output/dedup/accounts
Order Reconciliation + Dedup
# Reconciliation (Orders)
reconcile-accounts --config config/rules.orders.example.yaml
# Dedup only (Orders)
dedup-records --system both --config config/rules.orders.example.yaml --output-dir output/dedup/orders
# Dedup (Orders) using primary-key override
dedup-records --system both --dedup-mode primary --sap-primary-key SAP_Order_Id --sf-primary-key External_Order_Id__c --config config/rules.orders.example.yaml --output-dir output/dedup/orders
# Dedup (Orders) using fuzzy fields passed per entity
dedup-records --system both --dedup-mode fuzzy --sap-fuzzy-field Order_Number --sap-fuzzy-field Customer_Name --sf-fuzzy-field OrderNumber --sf-fuzzy-field AccountName --fuzzy-min-score 85 --config config/rules.orders.example.yaml --output-dir output/dedup/orders
Order Item Reconciliation + Dedup
# Reconciliation (Order Items)
reconcile-accounts --config config/rules.order_items.example.yaml
# Dedup only (Order Items)
dedup-records --system both --config config/rules.order_items.example.yaml --output-dir output/dedup/order_items
# Dedup (Order Items) using fuzzy fields passed per entity
dedup-records --system both --dedup-mode fuzzy --sap-fuzzy-field Material --sap-fuzzy-field Item_Description --sf-fuzzy-field ProductCode --sf-fuzzy-field Description --fuzzy-min-score 85 --config config/rules.order_items.example.yaml --output-dir output/dedup/order_items
Module Form (No PATH dependency)
# Reconciliation command module form
py -m phani_data_recon.cli --config config/rules.orders.example.yaml
# Dedup command module form
py -m phani_data_recon.dedup_cli --system both --config config/rules.order_items.example.yaml --output-dir output/dedup/order_items
Platform-specific path examples:
# Windows
reconcile-accounts --sap .\input\sap_accounts.csv --sf .\input\sf_accounts.csv
py -m phani_data_recon.cli --config .\config\rules.yaml --dry-run
# macOS
reconcile-accounts --sap ./input/sap_accounts.csv --sf ./input/sf_accounts.csv
python3 -m phani_data_recon.cli --config ./config/rules.yaml --dry-run
If Windows cmd does not recognize reconcile-accounts, add your Python Scripts directory to PATH and reopen cmd. The path below is an example; substitute the Scripts directory of your own Python installation:
setx PATH "%PATH%;C:\Users\SeshaphaniBysani\AppData\Local\Python\pythoncore-3.14-64\Scripts"
Then verify:
where reconcile-accounts
reconcile-accounts --help
For local development in this repository, editable install still works:
py -m pip install -e .
Package Usage
# Explicit input files
reconcile-accounts --sap input/sap_accounts.csv --sf input/sf_accounts.csv
# Config-driven execution
reconcile-accounts --config config/rules.yaml
# Override only the output directory
reconcile-accounts --config config/rules.yaml --output-dir output/run_2026_05_11
# Generate only HTML output
reconcile-accounts --sap input/sap_accounts.csv --sf input/sf_accounts.csv --formats html
If the console script is not available on your PATH, use:
py -m phani_data_recon.cli --dry-run
Production verification on Windows cmd:
py -m pip show phani-data-recon
where reconcile-accounts
where dedup-records
py -m phani_data_recon.cli --sap input/sap_accounts.csv --sf input/sf_accounts.csv --dry-run
Expected state:
- Installed version should be 1.0.4 or later.
- If the where reconcile-accounts command prints nothing but module execution works, only PATH needs to be fixed.
Python API
Run reconciliation from another Python application:
from phani_data_recon.api import run_reconciliation

exit_code = run_reconciliation(
    sap="input/sap_accounts.csv",
    sf="input/sf_accounts.csv",
    config="config/rules.yaml",
    output_dir="output/api_run",
    formats=["excel", "html"],
    dry_run=False,
    no_fuzzy=False,
    verbose=True,
)
print(exit_code)
The API mirrors the CLI behavior and returns a process-style exit code.
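If importing the package into the calling process is not desirable, the same run can be launched through the module form as a subprocess. The helper below is a hypothetical sketch (build_cli_args is not part of the package); it only assembles an argument list from the flags documented in this README and does not start the process itself.

```python
import sys
from typing import Optional, Sequence


def build_cli_args(sap: Optional[str] = None,
                   sf: Optional[str] = None,
                   config: Optional[str] = None,
                   output_dir: Optional[str] = None,
                   formats: Optional[Sequence[str]] = None,
                   dry_run: bool = False,
                   no_fuzzy: bool = False,
                   verbose: bool = False) -> list[str]:
    """Translate keyword arguments into the documented CLI flags."""
    args = [sys.executable, "-m", "phani_data_recon.cli"]
    for flag, value in (("--sap", sap), ("--sf", sf),
                        ("--config", config), ("--output-dir", output_dir)):
        if value:
            args += [flag, value]
    if formats:
        # --formats takes space-separated values, e.g. "--formats excel html".
        args += ["--formats", *formats]
    for flag, enabled in (("--dry-run", dry_run),
                          ("--no-fuzzy", no_fuzzy),
                          ("--verbose", verbose)):
        if enabled:
            args.append(flag)
    return args


args = build_cli_args(config="config/rules.yaml", formats=["html"], dry_run=True)
print(args[1:])
```

Passing the resulting list to subprocess.run(args) and reading .returncode would give the same process-style exit code, assuming the package is installed in the interpreter named by sys.executable.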
Options
--sap Path to SAP accounts CSV (optional if config input.sap is set)
--sf Path to Salesforce accounts CSV (optional if config input.sf is set)
--config Path to rules YAML (default: ./config/rules.yaml, then packaged default)
--output-dir Output directory (default: from config)
--formats excel html (default: both)
--sap-primary-key Override SAP primary join key (join.primary.sap_col)
--sf-primary-key Override SF primary join key (join.primary.sf_col)
--sap-fallback-key Override SAP fallback key (join.fallback.sap_col)
--sf-fallback-key Override SF fallback key (join.fallback.sf_col)
--enable-fallback-key Enable fallback key matching
--disable-fallback-key Disable fallback key matching
--dry-run Validate config + headers only; no report written
--no-fuzzy Skip fuzzy matching (faster for large files)
--verbose Verbose logging
Path resolution precedence:
- If --sap / --sf are passed, CLI values are used.
- If not passed, values are resolved from config/rules.yaml under input.sap and input.sf.
- If join-key override args are passed (--sap-primary-key, --sf-primary-key, fallback key args), CLI values are used at runtime.
- When join-key override args are passed, rules.yaml is also updated with those keys.
- If --config is not passed, the CLI tries local config/rules.yaml first and then the packaged default config.
- If --output-dir is passed, it overrides output.report.directory.
- If neither CLI nor config provides paths, the run exits with an input path error.
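The input-path part of this precedence can be illustrated as follows. This is a minimal sketch under stated assumptions: resolve_input_path is a hypothetical helper, the config is assumed to be already loaded into a dict, and a ValueError stands in for the CLI's input path error.

```python
from pathlib import PurePosixPath
from typing import Optional


def resolve_input_path(system: str, cli_path: Optional[str], config: dict) -> str:
    """Resolve the input CSV for 'sap' or 'sf': CLI value first, then config."""
    if cli_path:
        return cli_path
    entry = config.get("input", {}).get(system) or {}
    file_name = entry.get("file_name")
    if not file_name:
        raise ValueError(
            f"no input path for {system}: pass --{system} or set input.{system} in the config"
        )
    return str(PurePosixPath(entry.get("directory", ".")) / file_name)


# Config dict mirroring the input block shown in this README.
config = {"input": {"sap": {"directory": "input", "file_name": "sap_accounts.csv"}}}

print(resolve_input_path("sap", None, config))              # from config
print(resolve_input_path("sap", "data/override.csv", config))  # CLI wins
```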
Example key override run:
reconcile-accounts --config config/rules.yaml --sap-primary-key kunnr --sf-primary-key BP_PowerCerv_Account_Id__c --sap-fallback-key SAP_Unique_ID --sf-fallback-key WC_SAP_Identification__c --enable-fallback-key
Standalone dedup command options:
dedup-records --help
# Common options
--system sap|sf|both
--dedup-mode primary|fuzzy
--sap-primary-key <column>
--sf-primary-key <column>
--sap-fuzzy-field <column> # repeat for multiple fields
--sf-fuzzy-field <column> # repeat for multiple fields
--fuzzy-min-score <0-100>
--fuzzy-match-mode weighted|any
Dedup mode tips:
- Use primary for deterministic duplicate detection by business key.
- Use fuzzy when entity keys are unreliable and duplicate detection should be based on descriptive fields.
- For fuzzy, pass fields that are meaningful for that entity (for example: Account Name/Email for Accounts, Order Number/Account Name for Orders, Product/Description for Order Items).
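To make the difference between the two --fuzzy-match-mode values concrete, here is a small sketch of one plausible interpretation. Everything in it is an assumption: the similarity scorer (Python's difflib), the reading of weighted as a weighted average of per-field scores, and the reading of any as "at least one field reaches the threshold". The package may score and combine fields differently.

```python
from difflib import SequenceMatcher
from typing import Optional, Sequence


def field_score(a: str, b: str) -> float:
    """Similarity on a 0-100 scale after trimming and lowercasing."""
    return 100.0 * SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def is_fuzzy_duplicate(rec_a: dict, rec_b: dict, fields: Sequence[str],
                       min_score: float = 85.0, mode: str = "weighted",
                       weights: Optional[Sequence[float]] = None) -> bool:
    """Decide whether two records look like duplicates on the given fields."""
    scores = [field_score(str(rec_a.get(f, "")), str(rec_b.get(f, ""))) for f in fields]
    if mode == "any":
        # Assumed meaning: a single strong field is enough.
        return any(s >= min_score for s in scores)
    if mode == "weighted":
        # Assumed meaning: the weighted average must reach the threshold.
        w = list(weights) if weights else [1.0] * len(scores)
        return sum(s * wi for s, wi in zip(scores, w)) / sum(w) >= min_score
    raise ValueError(f"unknown mode: {mode!r}")


a = {"Name": "Acme Corporation", "WC_Email__c": "info@acme.com"}
b = {"Name": "acme corporation ", "WC_Email__c": "sales@other.org"}
fields = ["Name", "WC_Email__c"]

print(is_fuzzy_duplicate(a, b, fields, mode="any"))
print(is_fuzzy_duplicate(a, b, fields, mode="weighted"))
```

Under this reading, any is the more permissive mode: the identical names are enough to flag the pair, while weighted lets the dissimilar email addresses pull the combined score below the threshold.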
Configuration
Edit config/rules.yaml to change:
- Default input files via input.sap and input.sf (directory + file_name)
- Join key columns (SAP ↔ SF linking fields)
- Fallback-key matching toggle via join.fallback.enabled (default: false = primary-key-only matching)
- Field comparison rules, severity levels, and normalize modes
- Deduplication strategy (keep_first / keep_last / flag_all)
- Fuzzy match threshold and fields
- Output formats and directory
- Output report location/name via output.report.directory + output.report.file_name
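The three deduplication strategy names suggest the behaviour sketched below. The semantics are inferred from the names only and are an assumption: keep_first and keep_last retain one row per key and report the rest as duplicates, while flag_all reports every row of a duplicated key. The helper dedup_by_key is hypothetical.

```python
from collections import defaultdict


def dedup_by_key(rows: list, key: str, strategy: str = "keep_first"):
    """Split rows into (kept, duplicates) by a primary-key column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    kept, duplicates = [], []
    for members in groups.values():
        if len(members) == 1:
            kept.extend(members)
        elif strategy == "keep_first":
            kept.append(members[0])
            duplicates.extend(members[1:])
        elif strategy == "keep_last":
            kept.append(members[-1])
            duplicates.extend(members[:-1])
        elif strategy == "flag_all":
            duplicates.extend(members)
        else:
            raise ValueError(f"unknown strategy: {strategy!r}")
    return kept, duplicates


rows = [
    {"kunnr": "100", "name1": "A"},
    {"kunnr": "100", "name1": "B"},
    {"kunnr": "200", "name1": "C"},
]

for strategy in ("keep_first", "keep_last", "flag_all"):
    kept, dups = dedup_by_key(rows, "kunnr", strategy)
    print(strategy, [r["name1"] for r in kept], [r["name1"] for r in dups])
```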
Generic Entity Configuration
This utility is generic. To reconcile other object types (for example SAP Orders ↔ Salesforce Orders or SAP Order Items ↔ Salesforce Order Items), update:
- entities labels (display text in reports)
- input.sap and input.sf file paths
- join.primary.sap_col and join.primary.sf_col
- field_mappings for the entity-specific columns
- fuzzy_match.fields for the entity-specific columns
Example entity labels:
entities:
  pair: "Orders"
  sap: "SAP Orders"
  sf: "Salesforce Orders"
  sap_short: "SAP"
  sf_short: "SF"
Ready-to-use example configs are included:
- config/rules.orders.example.yaml
- config/rules.order_items.example.yaml
Run examples:
# SAP Orders vs Salesforce Orders
reconcile-accounts --config config/rules.orders.example.yaml
# SAP Order Items vs Salesforce Order Items
reconcile-accounts --config config/rules.order_items.example.yaml
# Module form
py -m phani_data_recon.cli --config config/rules.orders.example.yaml
When using the package outside this repository, pass your own config file with --config if you do not want to rely on the packaged defaults.
Output Report Configuration
Use the output.report block in config/rules.yaml to control where reports are written and what base filename is used.
output:
  formats: ["excel", "html"]
  report:
    directory: "output/month_end"
    file_name: "customer_reconciliation"
This writes reports under output/month_end/ using customer_reconciliation as the base name, for example:
- output/month_end/customer_reconciliation_<run_id>.html
- output/month_end/customer_reconciliation_<run_id>.xlsx
Rules:
- --output-dir overrides output.report.directory
- output.report.file_name sets the report filename prefix
- output.formats selects Excel, HTML, or both
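Taken together, these rules determine the report paths. The sketch below shows one way to compute them from an already-loaded config dict; report_paths is a hypothetical helper, and the run_id is supplied by the caller because its format is not specified here.

```python
from pathlib import PurePosixPath
from typing import Optional

# File extension per output format, matching the example file names above.
EXTENSIONS = {"excel": "xlsx", "html": "html"}


def report_paths(config: dict, run_id: str, output_dir: Optional[str] = None) -> list:
    """Return one report path per configured format; output_dir overrides the config."""
    out = config["output"]
    directory = output_dir or out["report"]["directory"]
    prefix = out["report"]["file_name"]
    return [
        str(PurePosixPath(directory) / f"{prefix}_{run_id}.{EXTENSIONS[fmt]}")
        for fmt in out["formats"]
    ]


config = {
    "output": {
        "formats": ["excel", "html"],
        "report": {"directory": "output/month_end",
                   "file_name": "customer_reconciliation"},
    }
}

print(report_paths(config, "RUN123"))
print(report_paths(config, "RUN123", output_dir="output/ad_hoc_run"))
```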
Example commands:
# Use output settings from config
reconcile-accounts --config config/rules.yaml
# Override only the output directory at runtime
reconcile-accounts --config config/rules.yaml --output-dir output/ad_hoc_run
Config Reference (Input + Join)
input:
  sap:
    directory: "input"
    file_name: "sap_accounts.csv"
  sf:
    directory: "input"
    file_name: "sf_accounts.csv"
join:
  primary:
    sap_col: "SAP_Unique_ID"
    sf_col: "BP_PowerCerv_Account_Id__c"
  fallback:
    enabled: false
    sap_col: "SAP_Unique_ID"
    sf_col: "WC_SAP_Identification__c"
output:
  formats: ["excel", "html"]
  report:
    directory: "output"
    file_name: "reconciliation_report"
Notes:
- Set join.fallback.enabled: false for strict primary-key-only matching (default).
- Set join.fallback.enabled: true only when you explicitly want fallback-key matching.
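The effect of the fallback toggle can be sketched as a two-pass lookup: try the primary key first, and consult the fallback key only when it is enabled and the primary lookup misses. This is an illustrative assumption about the matching order, not the package's actual code; match_records and the sample rows are hypothetical, while the column names come from the key-override example in this README.

```python
def match_records(sap_rows: list, sf_rows: list, join: dict):
    """Return (matched, unmatched); matched items are (sap_row, sf_row, how)."""
    primary = join["primary"]
    fallback = join.get("fallback", {})
    use_fallback = bool(fallback.get("enabled", False))

    # Index Salesforce rows by each join column, skipping empty keys.
    sf_by_primary = {r[primary["sf_col"]]: r for r in sf_rows if r.get(primary["sf_col"])}
    sf_by_fallback = {}
    if use_fallback:
        sf_by_fallback = {r[fallback["sf_col"]]: r for r in sf_rows if r.get(fallback["sf_col"])}

    matched, unmatched = [], []
    for sap in sap_rows:
        hit, how = sf_by_primary.get(sap.get(primary["sap_col"])), "primary"
        if hit is None and use_fallback:
            hit, how = sf_by_fallback.get(sap.get(fallback["sap_col"])), "fallback"
        if hit is None:
            unmatched.append(sap)
        else:
            matched.append((sap, hit, how))
    return matched, unmatched


join = {
    "primary": {"sap_col": "kunnr", "sf_col": "BP_PowerCerv_Account_Id__c"},
    "fallback": {"enabled": False, "sap_col": "SAP_Unique_ID",
                 "sf_col": "WC_SAP_Identification__c"},
}
sap_rows = [{"kunnr": "1001", "SAP_Unique_ID": "U1"},
            {"kunnr": "1002", "SAP_Unique_ID": "U2"}]
sf_rows = [{"BP_PowerCerv_Account_Id__c": "1001", "WC_SAP_Identification__c": "U1"},
           {"BP_PowerCerv_Account_Id__c": "", "WC_SAP_Identification__c": "U2"}]

matched, unmatched = match_records(sap_rows, sf_rows, join)
print(len(matched), len(unmatched))  # strict primary-key-only matching
```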
Report Tabs
| Tab | Content |
|---|---|
| Summary | KPI counts, match rate, exception rate |
| Exact_Matches | Records found in both systems |
| Field_Mismatches | Field-level diffs (CRITICAL / HIGH / INFO) |
| SAP_Only | SAP records missing from Salesforce |
| SF_Only | Salesforce records missing from SAP |
| SAP_Duplicates | Duplicate SAP rows before dedup |
| SF_Duplicates | Duplicate SF rows before dedup |
| Fuzzy_Match_Candidates | Likely-same records not linked by ID |
| Data_Quality_Issues | Null IDs, bad formats, validation failures |
| Action_Plan | P1–P4 prioritised remediation table |
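As an illustration of what a Field_Mismatches row might contain, the sketch below compares one matched pair field by field and attaches the configured severity. The mapping structure, the column names ort01 and BillingCity, and the normalization (trim plus case-insensitive comparison) are assumptions for the example; the real field_mappings schema and normalize modes are defined by the package's config.

```python
def field_mismatches(sap_row: dict, sf_row: dict, mappings: list) -> list:
    """Return one diff record per mapped field whose normalized values differ."""
    diffs = []
    for m in mappings:
        sap_val = str(sap_row.get(m["sap_col"]) or "")
        sf_val = str(sf_row.get(m["sf_col"]) or "")
        # Assumed normalization: ignore surrounding whitespace and letter case.
        if sap_val.strip().casefold() != sf_val.strip().casefold():
            diffs.append({
                "sap_col": m["sap_col"],
                "sf_col": m["sf_col"],
                "sap_value": sap_val,
                "sf_value": sf_val,
                "severity": m["severity"],
            })
    return diffs


# Hypothetical mappings using the severity levels listed in the table above.
mappings = [
    {"sap_col": "name1", "sf_col": "Name", "severity": "CRITICAL"},
    {"sap_col": "ort01", "sf_col": "BillingCity", "severity": "INFO"},
]
sap_row = {"name1": "ACME Corp ", "ort01": "Berlin"}
sf_row = {"Name": "acme corp", "BillingCity": "Munich"}

for diff in field_mismatches(sap_row, sf_row, mappings):
    print(diff)
```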
Run Tests
py -m pip install pytest
py -m pytest tests/ -v
Distribution (Business Rollout)
# Build wheel + source distribution
py -m pip install build
py -m build
# Install locally from wheel
py -m pip install dist/phani_data_recon-1.0.4-py3-none-any.whl
If reconcile-accounts is not on PATH, run:
py -m phani_data_recon.cli --dry-run
Published package:
py -m pip install --upgrade phani-data-recon
Legacy script usage inside this repository still works:
python run_reconciliation.py --dry-run
CI Publishing (GitHub Actions)
This repository includes .github/workflows/publish-pypi.yml to publish new releases to PyPI without storing a PyPI API token in GitHub.
One-time PyPI setup:
- In PyPI, open the phani-data-recon project settings.
- Add a Trusted Publisher for this GitHub repository.
- Set the workflow name to publish-pypi.yml.
- Set the environment name to pypi.
Release flow:
- Bump the version in pyproject.toml.
- Create a GitHub release or run the workflow manually from the Actions tab.
- The workflow builds dist/ artifacts and publishes them with PyPI trusted publishing.
Notes:
- This workflow uses GitHub OIDC via id-token: write, so no TWINE_PASSWORD secret is required in GitHub.
- Keep local twine usage only for manual emergency releases.
Project Structure
reconciliation_project/
├── input/ ← Place source CSVs here
├── config/ ← rules.yaml + schema
├── src/ ← All Python modules
├── templates/ ← Jinja2 HTML template
├── tests/ ← pytest test suite
├── output/ ← Reports generated here
└── run_reconciliation.py