A package for training and inference of the InterFusion Encoder model

InterFusion Encoder

InterFusion Encoder is a Python package for training and running inference with a cross-encoder model that matches candidates with jobs using text and, optionally, sparse features. It builds on pre-trained transformer models and adds an attention mechanism and interaction layers.

Features

  • Supports candidate and job features of different lengths.
  • Incorporates both bi-encoder and cross-encoder architectures.
  • Utilizes hard negative sampling and random negatives for robust training.
  • Includes attention mechanisms and interaction layers for improved performance.
  • Supports training continuation from saved checkpoints.
  • Integrated with Weights & Biases (W&B) for experiment tracking.
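
To make the negative sampling bullet concrete, here is a minimal sketch of mixing hard and random negatives for a single candidate. It is not the package's implementation: the function name `sample_negatives`, its parameters, and the precomputed `similarity` scores (which a bi-encoder would normally supply) are assumptions made for illustration.

```python
import random


def sample_negatives(similarity, positive_job_ids, num_hard=2, num_random=2, seed=0):
    """Pick negatives for one candidate (illustrative only).

    similarity: dict mapping job_id -> similarity score to the candidate,
        assumed to be precomputed, e.g. by a bi-encoder.
    positive_job_ids: set of job_ids that are true matches for the candidate.
    """
    rng = random.Random(seed)
    # Only jobs that are not known positives can serve as negatives.
    pool = [job_id for job_id in similarity if job_id not in positive_job_ids]
    # Hard negatives: the non-matching jobs with the highest similarity.
    hard = sorted(pool, key=lambda j: similarity[j], reverse=True)[:num_hard]
    # Random negatives: drawn uniformly from the jobs that remain.
    remaining = [j for j in pool if j not in hard]
    rand = rng.sample(remaining, min(num_random, len(remaining)))
    return hard, rand


similarity = {
    "job_001": 0.92,
    "job_002": 0.88,
    "job_003": 0.35,
    "job_004": 0.81,
    "job_005": 0.10,
}
hard, rand = sample_negatives(similarity, {"job_001"}, num_hard=2, num_random=1)
print(hard)  # ['job_002', 'job_004']
```

Hard negatives are the examples the model is most likely to confuse with real matches, while random negatives keep the training set from collapsing onto only near-duplicates.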

Installation

Install the package using pip:

pip install interfusion_encoder

Usage

Training

from interfusion import train_model

# Prepare your data
candidates = [
    {
        "candidate_id": "cand_001",
        "candidate_text": "Experienced software engineer...",
        "candidate_features": [0.8, 0.7, 0.9]
    },
    # Add more candidates
]

jobs = [
    {
        "job_id": "job_001",
        "job_text": "Looking for a software engineer...",
        "job_features": [0.85, 0.75, 0.9, 0.95]
    },
    # Add more jobs
]

positive_matches = [
    {
        "candidate_id": "cand_001",
        "job_id": "job_001"
    },
    # Add more positive matches
]

# Define your configuration (optional)
user_config = {
    'use_sparse': True,
    'num_epochs': 5,
    'learning_rate': 3e-5,
    'cross_encoder_model_name': 'bert-base-uncased',
    'bi_encoder_model_name': 'bert-base-uncased',
    'wandb_project': 'interfusion_project',
    'wandb_run_name': 'experiment_1',
    # Add or override other configurations as needed
}

# Start training
train_model(candidates, jobs, positive_matches, user_config=user_config)

Inference

from interfusion import InterFusionInference

# Initialize inference model
config = {
    'use_sparse': True,
    'cross_encoder_model_name': 'bert-base-uncased',
    'saved_model_path': 'saved_models/interfusion_final.pt',
    'candidate_feature_size': 3,  # Set according to your data
    'job_feature_size': 4         # Set according to your data
}
inference_model = InterFusionInference(config=config)

# Prepare candidate and job texts and features
candidate_texts = [
    "Experienced software engineer...",
    # Add more candidate texts
]

job_texts = [
    "Looking for a software engineer...",
    # Add more job texts
]

candidate_features_list = [
    [0.8, 0.7, 0.9],
    # Add more candidate features
]

job_features_list = [
    [0.85, 0.75, 0.9, 0.95],
    # Add more job features
]

# Predict match scores
scores = inference_model.predict(candidate_texts, job_texts, candidate_features_list, job_features_list)

# Print the results
for candidate, job, score in zip(candidate_texts, job_texts, scores):
    print(f"Candidate: {candidate}")
    print(f"Job: {job}")
    print(f"Match Score: {score:.4f}\n")
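
The example above passes parallel lists, which suggests that `predict` scores the i-th candidate against the i-th job. Under that assumption, ranking several jobs for one candidate means repeating the candidate once per job. The helpers below, `build_pairs` and `rank_by_score`, are hypothetical and not part of the package; the scores shown are placeholders, not real model output.

```python
def build_pairs(candidate_text, candidate_features, job_texts, job_features_list):
    """Repeat one candidate against every job, producing four parallel lists."""
    if len(job_texts) != len(job_features_list):
        raise ValueError("job_texts and job_features_list must have the same length")
    n = len(job_texts)
    return (
        [candidate_text] * n,
        list(job_texts),
        [list(candidate_features) for _ in range(n)],
        [list(features) for features in job_features_list],
    )


def rank_by_score(job_texts, scores):
    """Return (job_text, score) pairs sorted from best to worst match."""
    return sorted(zip(job_texts, scores), key=lambda pair: pair[1], reverse=True)


cand_texts, pair_job_texts, cand_feats, job_feats = build_pairs(
    "Experienced software engineer...",
    [0.8, 0.7, 0.9],
    ["Hiring a data analyst...", "Looking for a software engineer..."],
    [[0.4, 0.6, 0.5, 0.3], [0.85, 0.75, 0.9, 0.95]],
)

# With a loaded model, the scores would come from:
# scores = inference_model.predict(cand_texts, pair_job_texts, cand_feats, job_feats)
scores = [0.42, 0.91]  # placeholder values for illustration

ranked = rank_by_score(pair_job_texts, scores)
for job_text, score in ranked:
    print(f"{score:.4f}  {job_text}")
```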

Data Preparation

Training data consists of three lists of dictionaries with the following structure:

Candidates:

[
    {
        "candidate_id": "cand_001",
        "candidate_text": "Candidate description...",
        "candidate_features": [0.8, 0.7, 0.9]  # Optional feature vector
    },
    # Add more candidates
]

Jobs:

[
    {
        "job_id": "job_001",
        "job_text": "Job description...",
        "job_features": [0.85, 0.75, 0.9, 0.95]  # Optional feature vector
    },
    # Add more jobs
]

Positive Matches:

[
    {
        "candidate_id": "cand_001",
        "job_id": "job_001"
    },
    # Add more matches
]
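
Before training, it can help to check the three lists against the structure above. The following is a sketch of such a check; `check_training_data` is a hypothetical helper, not a function provided by the package. It assumes that all feature vectors on one side have the same length, which the `candidate_feature_size` and `job_feature_size` settings in the inference example suggest.

```python
def check_training_data(candidates, jobs, positive_matches):
    """Validate the documented structure.

    Returns (candidate_feature_size, job_feature_size); a size is None
    when no record on that side carries a feature vector.
    """

    def index(records, id_key, text_key, feat_key):
        ids, sizes = set(), set()
        for rec in records:
            if id_key not in rec or text_key not in rec:
                raise ValueError(f"record is missing {id_key!r} or {text_key!r}: {rec}")
            ids.add(rec[id_key])
            if feat_key in rec:  # feature vectors are optional
                sizes.add(len(rec[feat_key]))
        if len(sizes) > 1:
            raise ValueError(f"inconsistent {feat_key} lengths: {sorted(sizes)}")
        return ids, (sizes.pop() if sizes else None)

    cand_ids, cand_size = index(candidates, "candidate_id", "candidate_text", "candidate_features")
    job_ids, job_size = index(jobs, "job_id", "job_text", "job_features")

    # Every positive match must point at a known candidate and a known job.
    for match in positive_matches:
        if match["candidate_id"] not in cand_ids or match["job_id"] not in job_ids:
            raise ValueError(f"match refers to an unknown id: {match}")
    return cand_size, job_size


candidates = [{"candidate_id": "cand_001", "candidate_text": "Candidate description...",
               "candidate_features": [0.8, 0.7, 0.9]}]
jobs = [{"job_id": "job_001", "job_text": "Job description...",
         "job_features": [0.85, 0.75, 0.9, 0.95]}]
positive_matches = [{"candidate_id": "cand_001", "job_id": "job_001"}]

sizes = check_training_data(candidates, jobs, positive_matches)
print(sizes)  # (3, 4)
```

The returned sizes are the values one would expect to pass as `candidate_feature_size` and `job_feature_size` at inference time.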

Configuration

You can customize the model and training parameters by passing a user_config dictionary to the train_model function. Here are some of the configurable parameters:

  • random_seed: Random seed for reproducibility.
  • max_length: Maximum sequence length for tokenization.
  • use_sparse: Whether to use sparse features.
  • bi_encoder_model_name: Pre-trained model name for the bi-encoder.
  • cross_encoder_model_name: Pre-trained model name for the cross-encoder.
  • learning_rate: Learning rate for the optimizer.
  • num_epochs: Number of training epochs.
  • train_batch_size: Batch size for training.
  • wandb_project: W&B project name for logging.
  • saved_model_path: Path to save or load the trained model.
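
Because `user_config` is a plain dictionary, a misspelled key can easily go unnoticed. The sketch below compares a config against the parameter names listed above. `undocumented_keys` is a hypothetical helper, and since the list above covers only some of the parameters, an unlisted key is not necessarily invalid (for example, `wandb_run_name` appears in the training example but not in the list); it is only a prompt to double-check.

```python
# Parameter names taken from the list above; the package may accept more.
DOCUMENTED_KEYS = {
    "random_seed",
    "max_length",
    "use_sparse",
    "bi_encoder_model_name",
    "cross_encoder_model_name",
    "learning_rate",
    "num_epochs",
    "train_batch_size",
    "wandb_project",
    "saved_model_path",
}


def undocumented_keys(user_config):
    """Return the config keys that do not appear in the documented list."""
    return sorted(set(user_config) - DOCUMENTED_KEYS)


user_config = {
    "use_sparse": True,
    "num_epochs": 5,
    "learing_rate": 3e-5,  # deliberate typo for the demonstration
    "wandb_run_name": "experiment_1",
}
print(undocumented_keys(user_config))  # ['learing_rate', 'wandb_run_name']
```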

Contributing

Contributions are welcome! Please open an issue or submit a pull request on GitHub.

License

This project is licensed under the MIT License - see the LICENSE file for details.
