A production-grade Python framework for converting heterogeneous candidate information into canonical profiles.
Project description
Candidate Transformer
An enterprise-grade, production-ready Python framework for canonical candidate normalization. Candidate Transformer processes heterogeneous candidate profiles (resumes, ATS exports, recruiter spreadsheets) into a unified, highly structured canonical candidate dataset.
The framework provides an intelligent entity resolution engine, deterministic deduplication, configurable merge strategies, and an advanced projection engine to serve multiple downstream consumers from a single source of truth. It includes both a production-ready batch CLI (candidate-transformer) and a professional interactive REPL workspace (ctsh).
Table of Contents
- Core Features
- Architecture
- Project Structure
- Interactive Shell (ctsh)
- Command-Line Interface (CLI)
- Getting Started
- Input Files
- Runtime Configuration
- JSON Server Export
- Outputs
- Command Reference
Core Features
- Data Ingestion: Natively parses CSV, ATS JSON, and unstructured Resume text from dozens of heterogeneous sources simultaneously.
- Canonical Candidate Model: A strongly typed, immutable intermediate schema for all candidates.
- Deterministic UUIDv5 IDs: Generates stable, highly reproducible candidate IDs across pipeline runs based on intrinsic identity properties.
- Entity Resolution: Deterministically merges candidate profiles using strict contact and identity heuristics.
- Field Normalization: Automatically normalizes fields including ISO-3166 country codes, E.164 phone numbers, and lowercased emails.
- Merge Engine: Employs priority resolution for scalars, union merges for lists, deep dictionary merges, and deduplicates work experiences, education history, and projects.
- Confidence Scoring: Rigorously evaluates profile completeness and cross-source corroboration to assign confidence scores (0.0 - 1.0).
- Provenance Tracking: Every single merged field inherently tracks the contributing connector, merge strategy, assigned confidence, and a precise timestamp. Skill-level tracking ensures you know exactly where every skill was discovered.
- Configurable Projections: Schema adaptability without code changes. Serve different representations to downstream consumers dynamically.
- Plugin Architecture: A scalable connector registry makes adding custom connectors trivial.
- Persistent Workspaces: Local SQLite/JSON workspaces seamlessly preserve session states across restarts.
- JSON Server: Launch a detached, non-blocking background REST API directly from the REPL to serve your canonical dataset.
- Human-readable UI: Rich terminal table visualizations for in-depth candidate inspection.
Architecture
flowchart LR
A[Data Sources] --> B[Connector Registry]
B --> C[Connectors]
C --> D[Raw Candidate Records]
D --> E[Extraction & Normalization]
E --> F[Entity Resolution]
F --> G[Merge Engine]
G --> H[Canonical Candidate Dataset]
H --> I[Provenance]
H --> J[Projection Engine]
J --> K[CLI Output]
J --> L[ctsh Workspace]
L --> M[Runtime Config Updates]
M --> G
L --> N[Load Additional Sources]
N --> D
Pipeline Lifecycle
Load Sources
↓
Connector Parsing
↓
Normalization
↓
Entity Resolution
↓
Conflict Resolution
↓
Canonical Dataset
↓
Projection Engine
↓
CLI / ctsh / JSON Server
Supported Connectors
| Connector | Input Format |
|---|---|
| recruiter_csv | CSV |
| ats_json | JSON |
| resume_text | Plain Text |
Project Structure
src/
candidate_transformer/
api/ # Public Facades
cli/ # Command definitions and shell REPL
config/ # Pipeline configurations
connectors/ # CSV, JSON, Text Connectors and Registry
domain/ # Core Canonical Models
exceptions/ # Custom error types
export/ # JSON Server and export pipeline
interfaces/ # Base classes and protocols
normalizers/ # Field normalization (e.g., E.164 phones)
pipeline/ # Resolution, Normalization, extraction stages
projection/ # Configurable JSON Projection Engine
strategies/ # Conflict Resolution Strategies
utils/ # Utilities
validation/ # Output validation
configs/ # Configuration and projection JSONs
sample_data/ # Example inputs
tests/ # Test suites
Interactive Shell (ctsh)
ctsh (Candidate Transformer SHell) is the interactive REPL environment bundled with Candidate Transformer. It provides a persistent workspace for loading heterogeneous candidate sources, building canonical datasets, exploring candidates, switching projections, exporting data, managing runtime configuration, and serving the transformed dataset through JSON Server.
It provides an isolated runtime environment for developers and data engineers to experiment with candidate data. With ctsh, you can:
- Utilize persistent workspaces to save, reopen, and continue previous transformation sessions seamlessly.
- Inspect candidates using beautifully formatted Rich UI tables or verbose JSON output.
- Dynamically runtime load additional sources and trigger a runtime rebuild without restarting.
- Instantly switch projections for varied output shapes (e.g., swapping to analytics).
- Manage pipeline configuration.
- Perform one-command HTTP export via JSON Server.
- Achieve a pure no restart workflow for enterprise data engineering.
Command-Line Interface (CLI)
The batch CLI (candidate-transformer) is intended for automation, scripting, CI/CD pipelines, scheduled jobs, and headless execution.
It exposes core commands tailored for non-interactive pipelines:
transform: Executes the data ingestion, canonicalization, and mapping workflow. It applies any specified projections or pipeline configuration over the provided sources.validate: Asserts that a generated JSON profile complies perfectly with a configuration schema.
Getting Started
Installation from GitHub (Development & Source)
git clone https://github.com/suryanandanbabbar/candidate-transformer.git
cd candidate-transformer
python -m venv venv
source venv/bin/activate
pip install -e .
Quick Start
Input Files
Users can place their source datasets anywhere on their local filesystem and reference those absolute or relative paths directly in the CLI or ctsh commands. There is no rigid directory structure required for input data.
Example locations:
sample_data/
├── recruiter.csv
├── ats.json
└── resume.txt
- Start the interactive shell:
ctsh
- Create or open a workspace:
ctsh> workspace new recruitment-q3 ctsh> workspace open recruitment-q3 - Load the three sources from
sample_data:ctsh> load recruiter_csv sample_data/recruiter.csv ctsh> load ats_json sample_data/ats.json ctsh> load resume_text sample_data/resume.txt - Build the canonical dataset:
ctsh> build - Show the first candidate:
ctsh> show 0 - Switch to the analytics projection:
ctsh> project analytics
Batch CLI Example
Run a one-shot batch transformation from the command line:
candidate-transformer transform \
--source recruiter_csv=sample_data/recruiter.csv \
--source ats_json=sample_data/ats.json \
--source resume_text=sample_data/resume.txt
--source <connector>=<filepath> defines the parser and file.
Applying a specific projection:
candidate-transformer transform \
--source recruiter_csv=sample_data/recruiter.csv \
--projection configs/projections/recruiter.json
--projection <path> overrides the output shape without changing pipeline logic.
Command Reference
Data Ingestion
| Command | Description |
|---|---|
load <connector> <file> |
Load a data source through a connector. |
sources |
List loaded sources. |
connectors |
List registered connectors. |
Pipeline & Projections
| Command | Description |
|---|---|
build |
Build the canonical dataset. |
project <projection_name> [index] [--json] |
Run or preview a projection. |
projections |
List available projections. |
stats |
Show dataset statistics. |
status |
Show current workspace status. |
Candidate Inspection
| Command | Description |
|---|---|
show <name|id|index> [--verbose] [--json] |
Inspect a candidate. |
Runtime Configuration
| Command | Description |
|---|---|
config show |
Display runtime configuration. |
config begin |
Start a configuration edit session. |
config set <key> <value> |
Modify a configuration value. |
config apply |
Apply pending configuration changes. |
Persistence & Export
| Command | Description |
|---|---|
save canonical <file> |
Persist the canonical dataset. |
loadcanonical <file> |
Load a saved canonical dataset. |
export <projection_name> <file> |
Export a projection to disk. |
server start |
Launch the background JSON Server. |
server status |
Show JSON Server status. |
server stop |
Stop the background JSON Server. |
Workspace Management
| Command | Description |
|---|---|
workspace new <name> |
Create a new workspace. |
workspace open <name> |
Open an existing workspace. |
workspace list |
List available workspaces. |
workspace delete <name> |
Delete a workspace. |
reset |
Clear the current workspace state. |
Utility
| Command | Description |
|---|---|
history |
Show command history. |
clear |
Clear the terminal screen. |
help |
Display command help. |
exit |
Exit ctsh. |
JSON Server Export
You can export the canonical dataset and launch a non-blocking, persistent REST API directly from the workspace.
ctsh> server start
This will automatically:
- Export the current dataset and analytics payload.
- Launch a fully detached background
json-serverinstance. - Expose the data seamlessly as HTTP endpoints.
Manage the server lifecycle natively within ctsh:
server status: View the running PID, port, and endpoint paths.server stop: Gracefully terminate the background process.
Troubleshooting: If json-server fails to start, ensure it is installed globally via npm install -g json-server.
Runtime Configuration
The interactive ctsh workspace supports true runtime pipeline evolution.
Without restarting the application, users can:
- Modify merge priorities, projection selection, and connector behaviors.
- Edit configurations (e.g., toggling
include_confidenceorinclude_provenance) interactively. - Rebuild the canonical dataset using the updated inputs or settings.
- Switch projections instantly to reformat data on the fly.
This mirrors enterprise data engineering workflows where datasets evolve iteratively.
Example
ctsh> config begin
ctsh> config set include_provenance false
ctsh> config set include_confidence false
ctsh> config apply
ctsh> build
Outputs
Candidate Transformer generates a highly structured ecosystem of output formats optimized for different consumption patterns.
Canonical Dataset
The definitive source of truth generated by the pipeline. Includes:
- Unified and deeply merged profiles
- Deterministic UUIDv5 IDs for stable tracing
- Intelligently normalized contacts (E.164 phones, standard emails)
- Granular, field-level provenance tracking and conflict resolution audit trails
Built-in Projections
The canonical dataset can be dynamically shaped into distinct schemas:
- Minimal: Essential identity fields (Name, Contact).
- Recruiter: Tailored with extensive fields for recruitment evaluations.
- ATS: Strict schema outputs compatible with Applicant Tracking Systems.
- Analytics: Flattened records optimized for downstream data warehouses and BI tools.
Interactive Outputs
Within the ctsh REPL, output is designed for human insight:
- Rich Tables: Beautifully formatted terminal UI for pipeline status and statistics.
- JSON: Raw, machine-readable output dumps (
--json). - Verbose Inspection: Detailed provenance table inspection with localized timestamps.
- Statistics: Granular metrics on pipeline performance, dataset quality, and average confidence.
REST API Outputs
After launching server start, data is served securely over HTTP based on available projections. Every projection placed inside configs/projections/ is automatically published as an HTTP endpoint.
For example, if you have:
configs/projections/
analytics.json
minimal.json
crm.json
ats.json
It automatically becomes:
GET /analyticsGET /minimalGET /crmGET /ats
In addition, the following built-in endpoints are always available:
GET /canonical(orGET /candidatesas an alias)GET /metadata
Persisted Outputs
- Exported projection JSON files
- Saved standalone Canonical Dataset snapshot binaries
- Persisted SQLite/JSON workspace states across sessions
License
This project is licensed under the MIT License.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file profilefusion-1.0.0.tar.gz.
File metadata
- Download URL: profilefusion-1.0.0.tar.gz
- Upload date:
- Size: 57.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.4
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
d5e1878cd43e453e92c6d25d27f3920a760dcb6aa826b45da00d36e152a2c7eb
|
|
| MD5 |
e9f241fc10868c01085e308be862cfa3
|
|
| BLAKE2b-256 |
4dd29daf2768e519f86729f49fb9a4bb710c51c55661cd941af0881eafca763e
|
File details
Details for the file profilefusion-1.0.0-py3-none-any.whl.
File metadata
- Download URL: profilefusion-1.0.0-py3-none-any.whl
- Upload date:
- Size: 60.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.4
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
b04990a962b767a5269cb5ecdbf08ca95f96ded7f0d7a9233e9b9b60797cb663
|
|
| MD5 |
5f2f506a5a2a1b1a76538635997d8c77
|
|
| BLAKE2b-256 |
f51fc7f1c8fe1b042019942eba618af5b7919b32bde91553eec67515b74cfc51
|