Skip to main content

This is the source code for paper JOBSKAPE: A Framework for Generating Synthetic Job Postings to Enhance Skill Matching (published at NLP4HR Workshop EACL 2024 and now available on ArXiv).

Project description

SkillSkape

This is the source code for paper JOBSKAPE: A Framework for Generating Synthetic Job Postings to Enhance Skill Matching (published at NLP4HR Workshop EACL 2024 and now available on ArXiv).

Getting Started

Start with creating a python 3.7 venv and installing requirements.txt.

Directory

The directory is composed of multiple components :

.
├── SkillSkape ## SkillSkape Dataset   ├── SKILLSKAPE
│   ├── dev.csv
│   ├── test.csv
│   └── train.csv
├── taxonomy ## Used taxonomy   ├── dev_skillspan_emb.pkl
│   └── test_skillspan_emb.pkl
│
├── annotator ## Used to annotate the samples   ├── feedback.py
│   └── feedback_prompt_template.py
│
│
├── generator.py ## Implementation of Jobskape, dataset generation
├── gen_prompt_template.py ## prompt template for Jobskape and SkillSkape generation
│
│
├── models ## Implementation of both models   ├── skillExtract ## In-Context Learning Pipeline      ├── prompt_template.py
│      └── utils.py
│   └── supervised ## Supervised multiclassification model       ├── multilabel_classifier.py
│       ├── run_classifier.sh
│       └── run_inference.sh
│
├── annotation_refinement.ipynb ## Shows how to use the annotator
├── data_generation.ipynb ## Shows how to use JobSkape for data generation
├── dataset_evaluation.py ## Implementation of experiments and metrics
├── experiment.py ## Script to use the ICL pipeline with SkillSkape
├── static
│   └── ...
├── utils    └── ppl_3_sentences_all_skills.csv ## Popularity distribution
│
│
├── README.md
└── requirements.txt 

JobSkape Framework

JobSkape framework

The framework is made of two components, the Skill Generator that produces, for a given taxonomy, set of associated entities, and a Sentence generator that produces a sentence for each of these combinations.

Skill Generator

We present a dataset generation framework to produce datasets for entity matching tasks. The mechanism is displayed in this notebook.

It works with two components. First we generate faithful combinations of entities :

from generator import SkillsGenerator

gen = SkillsGenerator(taxonomy=emb_tax, ## input taxonomy embedded with a precise model 
            taxonomy_is_embedded=True, ## if the taxonomy is not yet embedded, provide a model
            combination_dist=combination_dist, ## combination size distribution
            emb_model=None, ## if not yet embedded, an encoder Mask Language Model
            emb_tokenizer=None, ## related tokenizer
            popularity=F ## popularity distribution
    )

gen_args = {
    "nb_generation" : split_size, # number of samples
    "threshold" : 0.83, # not considering skills that are less than .8 similar
    "beam_size" : 20, # considering 20 skills
    "temperature_pairing": 1, # popularity to be skewed toward popular skills
    "temperature_sample_size": 1,
    "frequency_select": True, # wether we select within the NN acording to frequency
}


generator = gen.balanced_nbred_iter(**gen_args) ## a generator

Then we generate faithful sentence for each entity combination:

datagen = DatasetGenerator(emb_tax=emb_tax,
                           reference_df=None, ## no references, we work in Zero-Shot, you can input demponstration for kNN demonstration retrieval
                           emb_model=word_emb_model ## encoder Mask Language model to get embeddings,
                           emb_tokenizer=word_emb_tokenizer ## associated tokenizer,
                           additional_info=None ## potential additional information that can be added in custom promopt
            )

    generation_args = {
        "skill_generator": combs[:n_samples], 
        "specific_few_shots": False,
        "model": "gpt-3.5",
        "gen_mode": ["PROTO-GEN-A0" for Dense or "PROTO-GEN-A1" for Sparse],
        "autosave": True,
        "autosave_file": f"generated/SKILLSPAN/{split}.json",
        "checkpoints_freq":10
    }
    res = datagen.generate_ds(**generation_args)

Then you generate negative samples:

With excluded taxonomy:

gen = SkillsGenerator(taxonomy=excluded_tax_sp, 
                taxonomy_is_embedded=True,
                combination_dist=combination_dist,
                popularity=F)
    
gen_args = {
        "nb_generation" : 500, # number of samples
        "threshold" : 0.83, # not considering skills that are less than .8 similar
        "beam_size" : 20, # considering 20 skills
        "temperature_pairing": 1, # popularity to be skewed toward popular skills
        "temperature_sample_size": 1,
        "frequency_select": True, # wether we select within the NN acording to frequency
    }


combinations = list(gen.balanced_nbred_iter(**gen_args))


datagen = DatasetGenerator(excluded_tax_sp,
                           None, ## no references, we work in Zero-Shot
                           word_emb_model,
                           word_emb_tokenizer,
                           additional_info)

generation_args = {
        "skill_generator": combinations, 
        "specific_few_shots": False,
        "model": "gpt-3.5",
        "gen_mode": ["PROTO-GEN-A0" for Dense or "PROTO-GEN-A1" for Sparse],
        "autosave": True,
        "autosave_file": f"generated/SKILLSPAN/excluded_tax_samples.json",
        "checkpoints_freq":10
    }
res = datagen.generate_ds(**generation_args)

With no labels :

## put code

SkillSkape Dataset

Models

We propose two models to evaluate our dataset. The baselines are :

  • Annotated : SkillSpan-M : Zhang, Mike, et al. "Skillspan: Hard and soft skill extraction from english job postings." (2022).
  • Synthetic : Decorte : Decorte, Jens-Joris, et al. Extreme Multi-Label Skill Extraction Training using Large Language Models (2023)

Supervised :

supervised
├── multilabel_classifier.py
├── run_classifier.sh
└── run_inference.sh
  • multilabel_classifier.py : implementation of supervised model based on BERT for multiclass classification
  • run_classifier.sh : training script
  • run_inference.sh : inference script

Unsupervised - In Context Learning :

Alt text

Uses an annotated dataset for training set via kNN demonstration retrieval.

Usage :

python experiment.py --metric_fname metric_file.txt --support SKILLSKAPE --bootstrap 5

with arguments :

  • metric_fname : Name of the target metric file (required)
  • support : support set, by default SKILLSPAN
  • bootstrap : Number of bootstrap iterations for the experiment

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

jobskape-0.0.2.tar.gz (44.9 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

jobskape-0.0.2-py3-none-any.whl (46.2 kB view details)

Uploaded Python 3

File details

Details for the file jobskape-0.0.2.tar.gz.

File metadata

  • Download URL: jobskape-0.0.2.tar.gz
  • Upload date:
  • Size: 44.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.11.5

File hashes

Hashes for jobskape-0.0.2.tar.gz
Algorithm Hash digest
SHA256 6dec46a795ca65c613f52fb5529ca30fee0e0bb3cc2d69ee0fd6e7b1c2fe0168
MD5 41dd34cf7c5d3e555be220b0f80b09b2
BLAKE2b-256 f55048f9b16d4beac43bd4c2f72890079cbf0e7e9bdbd3a77ff3853e76d6f791

See more details on using hashes here.

File details

Details for the file jobskape-0.0.2-py3-none-any.whl.

File metadata

  • Download URL: jobskape-0.0.2-py3-none-any.whl
  • Upload date:
  • Size: 46.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.11.5

File hashes

Hashes for jobskape-0.0.2-py3-none-any.whl
Algorithm Hash digest
SHA256 023d4afe4c2805e5e7726cf64caf4c58128d2c77c8583b594255c6eb0519435c
MD5 137964742db0f01db85d54c8771a59a6
BLAKE2b-256 3bb13ecdec8faa22b7f6ae7bc96c533c70254bd07d0170a12301087029498194

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page