A tool that uses job text, such as a job description, to assign standard occupational classification codes.

occupationcoder

A tool to assign standard occupational classification codes to job vacancy descriptions

Given a job title, job description, and job sector the algorithm assigns a UK 3-digit standard occupational classification (SOC) code to the job. The algorithm uses the SOC 2010 standard, more details of which can be found on the ONS’ website.
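For orientation, SOC 2010 codes are hierarchical: a 3-digit code names a minor group, its first two digits give the sub-major group, and its first digit gives the major group. A minimal sketch of that relationship follows; the function name soc_levels is hypothetical and is not part of occupationcoder.

```python
def soc_levels(minor_group_code):
    """Split a 3-digit SOC 2010 minor group code into its parent levels.

    Hypothetical helper for illustration; not part of occupationcoder.
    """
    code = str(minor_group_code).strip()
    if len(code) != 3 or not code.isdigit():
        raise ValueError(f"expected a 3-digit SOC code, got {code!r}")
    return {
        "major_group": code[:1],      # first digit
        "sub_major_group": code[:2],  # first two digits
        "minor_group": code,          # the full 3-digit code
    }


print(soc_levels(211))
# {'major_group': '2', 'sub_major_group': '21', 'minor_group': '211'}
```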

This code was originally written by Jyldyz Djumalieva, Arthur Turrell, David Copple, James Thurgood, and Bradley Speigner. If you use this code, please cite:

Turrell, A., Speigner, B., Djumalieva, J., Copple, D., & Thurgood, J. (2019). Transforming Naturally Occurring Text Data Into Economic Statistics: The Case of Online Job Vacancy Postings (No. w25837). National Bureau of Economic Research.

@techreport{turrell2019transforming,
  title={Transforming naturally occurring text data into economic statistics: The case of online job vacancy postings},
  author={Turrell, Arthur and Speigner, Bradley and Djumalieva, Jyldyz and Copple, David and Thurgood, James},
  year={2019},
  institution={National Bureau of Economic Research}
}

Pre-requisites

See setup.py for a full list of Python packages.

occupationcoder is built on top of NLTK and uses ‘Wordnet’ (a corpus, number 82 on their list) and the Punkt Tokenizer Models (number 106 on their list). When the coder is run, it will expect to find these in their usual directories. If you have nltk installed, you can get these resources using nltk.download(), which will install them in the right directories, or you can go to http://www.nltk.org/nltk_data/ to download them manually (and follow the install instructions).
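As a sketch of the nltk.download() route, the helper below checks for each resource with nltk.data.find and downloads only what is missing. The helper name ensure_nltk_data and the mapping REQUIRED_NLTK_DATA are hypothetical, and the identifiers 'wordnet' and 'punkt' are assumed to be the ones occupationcoder expects; newer NLTK releases may need additional resources.

```python
# Download identifier -> resource path understood by nltk.data.find.
# Assumption: these two identifiers match the resources named above.
REQUIRED_NLTK_DATA = {
    "wordnet": "corpora/wordnet",
    "punkt": "tokenizers/punkt",
}


def ensure_nltk_data(resources=REQUIRED_NLTK_DATA):
    """Download any listed NLTK resource that is not already installed.

    Returns the list of identifiers that had to be downloaded.
    """
    import nltk  # imported here so the mapping can be inspected without NLTK

    fetched = []
    for name, path in resources.items():
        try:
            nltk.data.find(path)  # raises LookupError if the resource is absent
        except LookupError:
            nltk.download(name)   # needs internet access
            fetched.append(name)
    return fetched
```

Calling ensure_nltk_data() once before creating the coder should be enough; on later runs it finds the resources locally and downloads nothing.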

A couple of the other packages, such as fuzzywuzzy, do not come with the Anaconda distribution of Python. You can install these via pip (if you have access to the internet) or download the relevant binaries and install them manually.

File and folder description

  • occupationcoder/coder applies SOC codes to job descriptions

  • occupationcoder/createdictionaries turns the ONS’ index of SOC codes into dictionaries used by occupationcoder/coder

  • occupationcoder/dictionaries contains the dictionaries used by occupationcoder/coder

  • occupationcoder/outputs is the default output directory

  • occupationcoder/testvacancies contains ‘test’ vacancies to run the code on

  • occupationcoder/utilities contains helper functions which mostly manipulate strings

Installation via terminal using pip

Download the package and navigate to the download directory. Then use

python setup.py sdist
cd dist
pip install occupationcoder-version.tar.gz

The first line creates the .tar.gz file, the second navigates to the directory containing the packaged code, and the third installs the package. The version number to use will be evident from the name of the .tar.gz file.

Running the code as a Python script

Importing the coder and creating an instance:

import pandas as pd
from occupationcoder.coder import coder
myCoder = coder.Coder()

To run the code on a single job, use the following syntax with the codejobrow(job_title,job_description,job_sector) method:

if __name__ == '__main__':
    myCoder.codejobrow('Physicist','Calculations of the universe','Professional scientific')

The if statement is required because the code is parallelised. Note that you can leave some of the fields blank and the algorithm will still return a SOC code.
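To illustrate why the guard matters, here is a minimal sketch using Python's standard multiprocessing module. It assumes process-based parallelism and is not the package's own code; title_length and count_words are hypothetical stand-ins. On platforms that start workers by spawning a fresh interpreter (Windows, for example), each worker re-imports the main script, so any unguarded top-level code would run again in every worker.

```python
from multiprocessing import Pool


def title_length(job_title):
    """Stand-in for per-job work; defined at module level so workers can import it."""
    return len(job_title.split())


def count_words(job_titles, processes=2):
    """Apply title_length to each job title using a pool of worker processes."""
    with Pool(processes=processes) as pool:
        return pool.map(title_length, job_titles)


if __name__ == '__main__':
    # Only the parent process reaches this line; workers that re-import
    # this module skip it, so they do not try to start pools of their own.
    print(count_words(['Physicist', 'Data scientist']))
```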

To run the code on a file (e.g. a CSV file named ‘job_file.csv’) with the structure

job_title | job_description | job_sector
Physicist | Make calculations about the universe, do research, perform experiments and understand the physical environment. | Professional, scientific & technical activities

use

df = pd.read_csv('path/to/foo.csv')
df = myCoder.codedataframe(df)

This will return a new dataframe with SOC code entries appended in a new column:

job_title | job_description | job_sector | SOC_code
Physicist | Make calculations about the universe, do research, perform experiments and understand the physical environment. | Professional, scientific & technical activities | 211
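If your job data is not yet in a file, an input CSV with the three expected column names can be produced with the standard library. This is a sketch only: write_job_file and REQUIRED_COLUMNS are hypothetical names, and writing missing fields as empty strings is an assumption about how blank fields should be represented.

```python
import csv

# Column names taken from the input structure described above.
REQUIRED_COLUMNS = ['job_title', 'job_description', 'job_sector']


def write_job_file(path, jobs):
    """Write a list of job dicts to a CSV with the three expected columns.

    Fields missing from a job are written as empty strings (an assumption
    about how blanks should look); unexpected keys raise a ValueError.
    """
    with open(path, 'w', newline='', encoding='utf-8') as handle:
        writer = csv.DictWriter(handle, fieldnames=REQUIRED_COLUMNS, restval='')
        writer.writeheader()
        writer.writerows(jobs)
```

For example, write_job_file('job_file.csv', [{'job_title': 'Physicist', 'job_sector': 'Professional, scientific & technical activities'}]) writes a header row followed by one job with an empty description.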

Running the code from the command line

If you have all the relevant packages in requirements.txt, download the code and navigate to the occupationcoder folder (which contains the README). Then run

python -m occupationcoder.coder.coder path/to/foo.csv

This will create a ‘processed_jobs.csv’ file in the outputs/ folder which has the original text and an extra ‘SOC_code’ column with the assigned SOC codes.
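Once the output file exists, a quick way to inspect it is to tally the assigned codes. The sketch below uses only the standard library; count_soc_codes is a hypothetical helper, and it assumes the output has a header row containing ‘SOC_code’ as described above.

```python
import csv
from collections import Counter


def count_soc_codes(path):
    """Count how often each SOC code appears in a processed jobs CSV.

    Assumes a header row with a 'SOC_code' column; codes are kept as
    strings exactly as they appear in the file.
    """
    with open(path, newline='', encoding='utf-8') as handle:
        reader = csv.DictReader(handle)
        return Counter(row['SOC_code'] for row in reader)
```

For example, count_soc_codes('outputs/processed_jobs.csv').most_common(5) lists the five most frequently assigned codes.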

Testing

To run the tests in your virtual environment, use

python -m unittest

in the top level occupationcoder directory. Look in test_occupationcoder.py for what is run and examples of use. The output appears in the ‘processed_jobs.csv’ file in the outputs/ folder.

Acknowledgements

We are very grateful to Emmet Cassidy for testing this algorithm.

Disclaimer

This code is provided ‘as is’. We would love it if you made it better or extended it to work for other countries. All views expressed are our personal views, not those of any employer.

Credits

The development of this package was supported by the Bank of England.

This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.

History

0.2.0 (2021-04-15)

  • First release on PyPI.
