SkillSkape
This is the source code for paper JOBSKAPE: A Framework for Generating Synthetic Job Postings to Enhance Skill Matching (published at NLP4HR Workshop EACL 2024 and now available on ArXiv).
Getting Started
Start by creating a Python 3.7 virtual environment and installing the dependencies from requirements.txt.
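For example (a minimal sketch; the exact `python3.7` binary name and the availability of the `jobskape` package on PyPI are assumptions based on this page's metadata):

```bash
# Create and activate a Python 3.7 virtual environment
python3.7 -m venv .venv
source .venv/bin/activate

# Install the pinned dependencies
pip install -r requirements.txt

# Alternatively, install the packaged release from PyPI
pip install jobskape
```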
Directory
The directory is composed of the following components:

```
.
├── SkillSkape                    ## SkillSkape dataset
│   ├── SKILLSKAPE
│   ├── dev.csv
│   ├── test.csv
│   └── train.csv
├── taxonomy                      ## Used taxonomy
│   ├── dev_skillspan_emb.pkl
│   └── test_skillspan_emb.pkl
│
├── annotator                     ## Used to annotate the samples
│   ├── feedback.py
│   └── feedback_prompt_template.py
│
├── generator.py                  ## Implementation of JobSkape, dataset generation
├── gen_prompt_template.py        ## Prompt templates for JobSkape and SkillSkape generation
│
├── models                        ## Implementation of both models
│   ├── skillExtract              ## In-Context Learning pipeline
│   │   ├── prompt_template.py
│   │   └── utils.py
│   └── supervised                ## Supervised multi-label classification model
│       ├── multilabel_classifier.py
│       ├── run_classifier.sh
│       └── run_inference.sh
│
├── annotation_refinement.ipynb   ## Shows how to use the annotator
├── data_generation.ipynb         ## Shows how to use JobSkape for data generation
├── dataset_evaluation.py         ## Implementation of experiments and metrics
├── experiment.py                 ## Script to use the ICL pipeline with SkillSkape
├── static
│   └── ...
├── utils
│   └── ppl_3_sentences_all_skills.csv  ## Popularity distribution
│
├── README.md
└── requirements.txt
```
JobSkape Framework
The framework is made of two components: a Skill Generator that produces, for a given taxonomy, sets of associated entities (skill combinations), and a Sentence Generator that produces a sentence for each of these combinations.
Skill Generator
We present a dataset generation framework to produce datasets for entity-matching tasks. The mechanism is demonstrated in data_generation.ipynb.
It works in two steps. First, we generate faithful combinations of entities:
```python
from generator import SkillsGenerator

gen = SkillsGenerator(taxonomy=emb_tax,                   ## input taxonomy, embedded with a given model
                      taxonomy_is_embedded=True,          ## if the taxonomy is not yet embedded, provide a model
                      combination_dist=combination_dist,  ## combination-size distribution
                      emb_model=None,                     ## if not yet embedded, an encoder masked language model
                      emb_tokenizer=None,                 ## associated tokenizer
                      popularity=F)                       ## popularity distribution

gen_args = {
    "nb_generation": split_size,   # number of samples
    "threshold": 0.83,             # skills below this similarity are not considered
    "beam_size": 20,               # number of candidate skills considered
    "temperature_pairing": 1,      # skews pairing toward popular skills
    "temperature_sample_size": 1,
    "frequency_select": True,      # whether to select among the nearest neighbors according to frequency
}

generator = gen.balanced_nbred_iter(**gen_args)  ## returns a generator over skill combinations
combs = list(generator)                          ## materialize the combinations
```
Then we generate a faithful sentence for each entity combination:
```python
datagen = DatasetGenerator(emb_tax=emb_tax,
                           reference_df=None,                 ## no references: zero-shot; pass demonstrations here for kNN demonstration retrieval
                           emb_model=word_emb_model,          ## encoder masked language model used to get embeddings
                           emb_tokenizer=word_emb_tokenizer,  ## associated tokenizer
                           additional_info=None)              ## optional additional information to include in a custom prompt

generation_args = {
    "skill_generator": combs[:n_samples],
    "specific_few_shots": False,
    "model": "gpt-3.5",
    "gen_mode": "PROTO-GEN-A0",    # "PROTO-GEN-A0" for Dense, "PROTO-GEN-A1" for Sparse
    "autosave": True,
    "autosave_file": f"generated/SKILLSPAN/{split}.json",
    "checkpoints_freq": 10,
}

res = datagen.generate_ds(**generation_args)
```
Then, generate negative samples.
With an excluded taxonomy:
```python
gen = SkillsGenerator(taxonomy=excluded_tax_sp,
                      taxonomy_is_embedded=True,
                      combination_dist=combination_dist,
                      popularity=F)

gen_args = {
    "nb_generation": 500,          # number of samples
    "threshold": 0.83,             # skills below this similarity are not considered
    "beam_size": 20,               # number of candidate skills considered
    "temperature_pairing": 1,      # skews pairing toward popular skills
    "temperature_sample_size": 1,
    "frequency_select": True,      # whether to select among the nearest neighbors according to frequency
}

combinations = list(gen.balanced_nbred_iter(**gen_args))
```
```python
datagen = DatasetGenerator(emb_tax=excluded_tax_sp,
                           reference_df=None,       ## no references: zero-shot
                           emb_model=word_emb_model,
                           emb_tokenizer=word_emb_tokenizer,
                           additional_info=None)

generation_args = {
    "skill_generator": combinations,
    "specific_few_shots": False,
    "model": "gpt-3.5",
    "gen_mode": "PROTO-GEN-A0",    # "PROTO-GEN-A0" for Dense, "PROTO-GEN-A1" for Sparse
    "autosave": True,
    "autosave_file": "generated/SKILLSPAN/excluded_tax_samples.json",
    "checkpoints_freq": 10,
}

res = datagen.generate_ds(**generation_args)
```
With no labels:
## put code
SkillSkape Dataset
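The SkillSkape dataset ships as train/dev/test CSV files under SkillSkape/. A minimal loading sketch (assuming standard comma-separated files; the column layout is not documented here, so inspect the header first):

```python
import pandas as pd

# Load the three SkillSkape splits (paths per the directory layout above)
train = pd.read_csv("SkillSkape/train.csv")
dev = pd.read_csv("SkillSkape/dev.csv")
test = pd.read_csv("SkillSkape/test.csv")

# Inspect sizes and column names before building a pipeline on top
print(len(train), len(dev), len(test))
print(train.columns.tolist())
```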
Models
We propose two models to evaluate our dataset. The baselines are:
- Annotated: SkillSpan-M (Zhang, Mike, et al. "SkillSpan: Hard and Soft Skill Extraction from English Job Postings." 2022).
- Synthetic: Decorte (Decorte, Jens-Joris, et al. "Extreme Multi-Label Skill Extraction Training using Large Language Models." 2023).
Supervised:

```
supervised
├── multilabel_classifier.py
├── run_classifier.sh
└── run_inference.sh
```

- multilabel_classifier.py: implementation of the supervised model, based on BERT, for multi-label classification
- run_classifier.sh: training script
- run_inference.sh: inference script
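For orientation, here is a minimal sketch of a BERT-based multi-label classifier of this kind (illustrative only, assuming bert-base-cased and the Hugging Face transformers API; the class name and taxonomy size are hypothetical, and multilabel_classifier.py holds the repository's actual implementation):

```python
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

class MultiLabelSkillClassifier(nn.Module):
    """BERT encoder with a multi-label head: one logit per skill in the taxonomy."""

    def __init__(self, num_skills: int, model_name: str = "bert-base-cased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.head = nn.Linear(self.encoder.config.hidden_size, num_skills)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # [CLS] representation of the sentence
        return self.head(cls)              # raw logits; apply a sigmoid at inference

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = MultiLabelSkillClassifier(num_skills=100)  # hypothetical taxonomy size

batch = tokenizer(["Proficiency in Python and teamwork required."],
                  return_tensors="pt", padding=True, truncation=True)
logits = model(batch["input_ids"], batch["attention_mask"])

# Multi-label training uses binary cross-entropy over all skill logits
targets = torch.zeros_like(logits)  # hypothetical gold labels
loss = nn.BCEWithLogitsLoss()(logits, targets)
```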
Unsupervised - In-Context Learning:
Uses an annotated dataset as a support set via kNN demonstration retrieval.
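A minimal sketch of kNN demonstration retrieval (illustrative only; function and variable names are hypothetical, and the actual pipeline lives in models/skillExtract):

```python
import numpy as np

def retrieve_demonstrations(query_emb, support_embs, support_examples, k=5):
    """Return the k support examples whose embeddings are most cosine-similar to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    s = support_embs / np.linalg.norm(support_embs, axis=1, keepdims=True)
    sims = s @ q                   # cosine similarity against every support sentence
    top_k = np.argsort(-sims)[:k]  # indices of the k nearest neighbors
    return [support_examples[i] for i in top_k]

# Hypothetical usage: embed the query sentence with the same encoder as the support
# set, then prepend the retrieved examples as few-shot demonstrations in the prompt.
rng = np.random.default_rng(0)
support_embs = rng.normal(size=(100, 768))  # e.g. 100 annotated sentences, 768-dim embeddings
support_examples = [f"demo_{i}" for i in range(100)]
demos = retrieve_demonstrations(rng.normal(size=768), support_embs, support_examples, k=5)
print(demos)
```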
Usage:

```
python experiment.py --metric_fname metric_file.txt --support SKILLSKAPE --bootstrap 5
```

with arguments:
- metric_fname: name of the target metric file (required)
- support: support set, by default SKILLSPAN
- bootstrap: number of bootstrap iterations for the experiment
File details
Details for the file jobskape-0.0.2.tar.gz.
File metadata
- Download URL: jobskape-0.0.2.tar.gz
- Upload date:
- Size: 44.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.11.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6dec46a795ca65c613f52fb5529ca30fee0e0bb3cc2d69ee0fd6e7b1c2fe0168 |
| MD5 | 41dd34cf7c5d3e555be220b0f80b09b2 |
| BLAKE2b-256 | f55048f9b16d4beac43bd4c2f72890079cbf0e7e9bdbd3a77ff3853e76d6f791 |
File details
Details for the file jobskape-0.0.2-py3-none-any.whl.
File metadata
- Download URL: jobskape-0.0.2-py3-none-any.whl
- Upload date:
- Size: 46.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.11.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 023d4afe4c2805e5e7726cf64caf4c58128d2c77c8583b594255c6eb0519435c |
| MD5 | 137964742db0f01db85d54c8771a59a6 |
| BLAKE2b-256 | 3bb13ecdec8faa22b7f6ae7bc96c533c70254bd07d0170a12301087029498194 |