This is the source code for paper JOBSKAPE: A Framework for Generating Synthetic Job Postings to Enhance Skill Matching (published at NLP4HR Workshop EACL 2024 and now available on ArXiv).

These details have not been verified by PyPI

Project links

Homepage

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

SkillSkape

This is the source code for paper JOBSKAPE: A Framework for Generating Synthetic Job Postings to Enhance Skill Matching (published at NLP4HR Workshop EACL 2024 and now available on ArXiv).

Getting Started

Start with creating a python 3.7 venv and installing requirements.txt.

JobSkape Framework

JobSkape framework

The framework is made of two components, the Skill Generator that produces, for a given taxonomy, set of associated entities, and a Sentence generator that produces a sentence for each of these combinations.

Skill Generator

We present a dataset generation framework to produce datasets for entity matching tasks. The mechanism is displayed in this notebook.

It works with two components. First we generate faithful combinations of entities :

from generator import SkillsGenerator

gen = SkillsGenerator(taxonomy=emb_tax, ## input taxonomy embedded with a precise model 
            taxonomy_is_embedded=True, ## if the taxonomy is not yet embedded, provide a model
            combination_dist=combination_dist, ## combination size distribution
            emb_model=None, ## if not yet embedded, an encoder Mask Language Model
            emb_tokenizer=None, ## related tokenizer
            popularity=F ## popularity distribution
    )

gen_args = {
    "nb_generation" : split_size, # number of samples
    "threshold" : 0.83, # not considering skills that are less than .8 similar
    "beam_size" : 20, # considering 20 skills
    "temperature_pairing": 1, # popularity to be skewed toward popular skills
    "temperature_sample_size": 1,
    "frequency_select": True, # wether we select within the NN acording to frequency
}


generator = gen.balanced_nbred_iter(**gen_args) ## a generator

Then we generate faithful sentence for each entity combination:

datagen = DatasetGenerator(emb_tax=emb_tax,
                           reference_df=None, ## no references, we work in Zero-Shot, you can input demponstration for kNN demonstration retrieval
                           emb_model=word_emb_model ## encoder Mask Language model to get embeddings,
                           emb_tokenizer=word_emb_tokenizer ## associated tokenizer,
                           additional_info=None ## potential additional information that can be added in custom promopt
            )

    generation_args = {
        "skill_generator": combs[:n_samples], 
        "specific_few_shots": False,
        "model": "gpt-3.5",
        "gen_mode": ["PROTO-GEN-A0" for Dense or "PROTO-GEN-A1" for Sparse],
        "autosave": True,
        "autosave_file": f"generated/SKILLSPAN/{split}.json",
        "checkpoints_freq":10
    }
    res = datagen.generate_ds(**generation_args)

Then you generate negative samples:

With excluded taxonomy:

gen = SkillsGenerator(taxonomy=excluded_tax_sp, 
                taxonomy_is_embedded=True,
                combination_dist=combination_dist,
                popularity=F)
    
gen_args = {
        "nb_generation" : 500, # number of samples
        "threshold" : 0.83, # not considering skills that are less than .8 similar
        "beam_size" : 20, # considering 20 skills
        "temperature_pairing": 1, # popularity to be skewed toward popular skills
        "temperature_sample_size": 1,
        "frequency_select": True, # wether we select within the NN acording to frequency
    }


combinations = list(gen.balanced_nbred_iter(**gen_args))


datagen = DatasetGenerator(excluded_tax_sp,
                           None, ## no references, we work in Zero-Shot
                           word_emb_model,
                           word_emb_tokenizer,
                           additional_info)

generation_args = {
        "skill_generator": combinations, 
        "specific_few_shots": False,
        "model": "gpt-3.5",
        "gen_mode": ["PROTO-GEN-A0" for Dense or "PROTO-GEN-A1" for Sparse],
        "autosave": True,
        "autosave_file": f"generated/SKILLSPAN/excluded_tax_samples.json",
        "checkpoints_freq":10
    }
res = datagen.generate_ds(**generation_args)

With no labels :

## put code

SkillSkape Dataset

Models

We propose two models to evaluate our dataset. The baselines are :

Annotated : SkillSpan-M : Zhang, Mike, et al. "Skillspan: Hard and soft skill extraction from english job postings." (2022).
Synthetic : Decorte : Decorte, Jens-Joris, et al. Extreme Multi-Label Skill Extraction Training using Large Language Models (2023)

Supervised :

supervised
├── multilabel_classifier.py
├── run_classifier.sh
└── run_inference.sh

multilabel_classifier.py : implementation of supervised model based on BERT for multiclass classification
run_classifier.sh : training script
run_inference.sh : inference script

Unsupervised - In Context Learning :

Alt text

Uses an annotated dataset for training set via kNN demonstration retrieval.

Usage :

python experiment.py --metric_fname metric_file.txt --support SKILLSKAPE --bootstrap 5

with arguments :

metric_fname : Name of the target metric file (required)
support : support set, by default SKILLSPAN
bootstrap : Number of bootstrap iterations for the experiment

Project details

These details have not been verified by PyPI

Project links

Homepage

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Release history Release notifications | RSS feed

This version

0.0.5

Nov 15, 2024

0.0.4

Nov 15, 2024

0.0.3

Nov 15, 2024

0.0.2

Nov 15, 2024

0.0.1

Nov 15, 2024

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

jobskape-0.0.5.tar.gz (44.0 kB view details)

Uploaded Nov 15, 2024 Source

File details

Details for the file jobskape-0.0.5.tar.gz.

File metadata

Download URL: jobskape-0.0.5.tar.gz
Upload date: Nov 15, 2024
Size: 44.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.11.5

File hashes

Hashes for jobskape-0.0.5.tar.gz
Algorithm	Hash digest
SHA256	`472c58427aa298bf9cf5934034f808e648b85b853225f8b3c7a3badae66b32e2`
MD5	`1e0ed9a766eefc0b03e6ca00b9393a29`
BLAKE2b-256	`8cf1636cd8210ce6dd0c74c7b8bb9ed666d755904e6a8f5e3a70167c0eb1ef1d`

See more details on using hashes here.

jobskape 0.0.5

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

SkillSkape

Getting Started

Directory

JobSkape Framework

Skill Generator

With excluded taxonomy:

With no labels :

SkillSkape Dataset

Models

Supervised :

Unsupervised - In Context Learning :

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

File details

File metadata

File hashes