scrape-jobs

CLI jobs scraper targeting multiple sites

Project description

Simple CLI jobs scraper

DISCLAIMER

Made as POC

USE AT YOUR OWN RISK

Workflow

Scrape jobs matching certain criteria

linkedin.com - [ keywords , location ]

seek.com.au - [ what , where ]

Store the scraped data

upload to Google Sheets.

or just store it locally to CSV file.

Installation

Prerequisites:

Google Chrome installed

chromedriver binary executable in PATH

Python 3.6+

Prepared Google Spreadsheet (detailed instructions below)
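
The prerequisites above can be partly checked from Python before installing. This is a minimal sketch, not part of scrape-jobs: the function name `check_prerequisites` is made up for illustration, and it only covers the Python version and the chromedriver lookup in PATH (it does not detect whether Google Chrome itself is installed).

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 6), driver_name="chromedriver"):
    """Return a list of problems; an empty list means both checks passed.

    Hypothetical helper for illustration, not shipped with scrape-jobs.
    """
    problems = []
    # sys.version_info compares element-wise against a (major, minor) tuple.
    if sys.version_info < min_python:
        problems.append("Python %d.%d+ is required" % min_python)
    # shutil.which searches the directories listed in PATH for an executable.
    if shutil.which(driver_name) is None:
        problems.append("'%s' was not found in PATH" % driver_name)
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("Missing prerequisite:", problem)
```

An empty result only means these two checks passed; the Google Spreadsheet still has to be prepared by hand.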

Install command:

pip install --force -U scrape-jobs

Installs the following CLI bindings:

scrape-jobs-init-config

scrape-jobs

Basic Instructions

Open a terminal/CMD and ‘cd’ into the working dir.
Run scrape-jobs-init-config to create a sample config file in the current working dir

  usage: scrape-jobs-init-config [-h] [--version] [-v] [-vv] [-f FILE]

  initialize sample 'scrape-jobs' config file

  optional arguments:
    -h, --help            show this help message and exit
    --version             show program's version number and exit
    -v, --verbose         set loglevel to INFO
    -vv, --very-verbose   set loglevel to DEBUG
    -f FILE               defaults to '/current/work/dir/scrape-jobs.ini'

Edit the config file as per your needs
Run scrape-jobs to trigger execution

  usage: scrape-jobs [-h] [--version] [-v] [-vv] [-c CONFIG_FILE]
                     {linkedin.com,seek.com.au}

  Scrape jobs and store results.

  positional arguments:
    {linkedin.com,seek.com.au}
                          site to scrape

  optional arguments:
    -h, --help            show this help message and exit
    --version             show program's version number and exit
    -v, --verbose         set loglevel to INFO
    -vv, --very-verbose   set loglevel to DEBUG
    -c CONFIG_FILE        defaults to '/current/work/dir/scrape-jobs.ini'
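
If the scraper needs to be triggered from another Python script (for example on a schedule), the command line documented above can be assembled with the standard library. This is only a sketch: `build_command` and `run_scrape` are hypothetical helper names, and the only flags used are the ones shown in the usage text.

```python
import subprocess

# Sites accepted as the positional argument, per the usage text above.
SUPPORTED_SITES = ("linkedin.com", "seek.com.au")


def build_command(site, config_file=None, very_verbose=False):
    """Build the argument list for the 'scrape-jobs' CLI (hypothetical helper)."""
    if site not in SUPPORTED_SITES:
        raise ValueError("unsupported site: %r" % (site,))
    command = ["scrape-jobs"]
    if very_verbose:
        command.append("-vv")  # loglevel DEBUG
    if config_file is not None:
        command.extend(["-c", config_file])
    command.append(site)
    return command


def run_scrape(site, **options):
    """Run the CLI and raise CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_command(site, **options), check=True)
```

Passing the arguments as a list avoids shell quoting issues with paths that contain spaces.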

More Detailed Instructions:

Prepare the spreadsheet and the spreadsheet’s auth.json (Spreadsheet instructions at the bottom)
- For seek.com.au the Worksheet’s columns are:
  
  [“scraped_time”, “posted_time”, “location”, “area”, “classification”, “sub_classification”, “title”, “salary”, “company”, “url”]
- For linkedin.com the Worksheet’s columns are:
  
  [“scraped_time”, “posted_time”, “location”, “title”, “company”, “url”]
Init empty config file by calling scrape-jobs-init-config
Edit the newly created CWD\scrape-jobs.ini with params of your choice
- set the path to the AUTH.JSON
- set the spreadsheet name
- set the worksheet name (it will be automatically created if it doesn’t exist)
- set the search params
Trigger execution:
- run scrape-jobs linkedin.com or scrape-jobs seek.com.au
- you will see output in the console, but a scrape-jobs.log will be created too
- to have more detailed output add -vv execution param
After the scrape is complete you should see the newly discovered jobs in your spreadsheet
Alternatively, you can init a config at a known place and just pass its path:

scrape-jobs-init-config -f /custom/path/to/config.ini

scrape-jobs -c /custom/path/to/config.ini seek.com.au
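
The config file is a .ini file, so it can be inspected with Python's configparser. The sketch below is illustrative only: the section names (`google`, `seek.com.au`) and the option names `auth_json`, `spreadsheet` and `worksheet` are assumptions, not the real layout. Check the file generated by scrape-jobs-init-config for the actual names. Only the `what` and `where` search parameters come from the workflow description above.

```python
import configparser

# Hypothetical layout: section and option names are assumptions for
# illustration; the generated scrape-jobs.ini is the authoritative source.
SAMPLE_CONFIG = """\
[google]
auth_json = /path/to/auth.json
spreadsheet = Jobs
worksheet = Sheet1

[seek.com.au]
what = python developer
where = Sydney
"""


def load_settings(config_text, site):
    """Parse .ini text and return storage settings plus the site's search params."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    settings = dict(parser["google"])
    settings["search"] = dict(parser[site])
    return settings


if __name__ == "__main__":
    print(load_settings(SAMPLE_CONFIG, "seek.com.au"))
```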

Note

You need to prepare the AUTH.JSON file in advance; it is used for authentication with Google Sheets

The term ‘Spreadsheet’ refers to a single document that is shown in the GoogleSpreadsheets landing page

A single ‘Spreadsheet’ can contain one or more ‘Worksheets’

Usually a newly created ‘Spreadsheet’ contains a single ‘Worksheet’ named ‘Sheet1’

If you don’t provide a valid path to AUTH.JSON the collected data will be saved as .csv in the current work dir
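
The CSV fallback described in the note can be pictured as follows. This is a sketch of the idea, not the package's actual code: `choose_storage` and `save_csv` are hypothetical names, and "valid path" is assumed here to mean "the file exists". The column list is the linkedin.com one given above.

```python
import csv
import os

# Worksheet columns for linkedin.com, as listed in the detailed instructions.
LINKEDIN_COLUMNS = ["scraped_time", "posted_time", "location", "title", "company", "url"]


def choose_storage(auth_json_path):
    """Return 'sheets' if the credentials file exists, otherwise 'csv'.

    Assumption: a 'valid path' simply means an existing file.
    """
    if auth_json_path and os.path.isfile(auth_json_path):
        return "sheets"
    return "csv"


def save_csv(rows, csv_path, columns=LINKEDIN_COLUMNS):
    """Write a header line followed by one line per job (rows are dicts)."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=columns)
        writer.writeheader()
        writer.writerows(rows)
```

`csv.DictWriter` keeps the column order fixed, so the file lines up with the worksheet layout if it is imported into a spreadsheet later.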

Instructions for preparing Google Spreadsheet AUTH.JSON:

1. Go to https://console.developers.google.com/
2. Log in with the Google account that is to be the owner of the ‘Spreadsheet’.
3. At the top-left corner, there is a drop-down right next to the “Google APIs” text
4. Click the drop-down and a modal dialog will appear, then click “NEW PROJECT” at its top-right
5. Give the project a name relevant to how the sheet is to be used, don’t select ‘Location*’, just press ‘CREATE’
6. Open the newly created project from the same drop-down as in step 3.
7. There should be an ‘APIs’ area with a “-> Go to APIs overview” link at its bottom - click it
8. A new page will load with a ‘+ ENABLE APIS AND SERVICES’ button at the top, in the middle - click it
9. A new page will load with a ‘Search for APIs & Services’ input - use it to find and open ‘Google Drive API’
10. In the ‘Google Drive API’ page click “ENABLE” - you’ll be redirected back to the project’s page
11. There will be a new ‘CREATE CREDENTIALS’ button at the top - click it
12. Set up the new credentials as follows:
    - Which API are you using? -> ‘Google Drive API’
    - Where will you be calling the API from? -> ‘Web server (e.g. node.js, Tomcat)’
    - What data will you be accessing? -> ‘Application data’
    - Are you planning to use this API with App Engine or Compute Engine? -> No, I’m not using them.
13. Click the blue ‘What credentials do I need’ button; it will take you to the ‘Add credentials to your project’ page
14. Set up the credentials as follows:
    - Service account name: {whatever name you type is OK, as long as the input accepts it}
    - Role: Project->Editor
    - Key type: JSON
15. Press the blue ‘Continue’ button, and a download of the AUTH.JSON file will begin (store it safely)
16. Close the modal and go back to the project ‘Dashboard’ using the left-side navigation panel
17. Repeat step 8.
18. Search for ‘Google Sheets API’, then open the result and click the blue ‘ENABLE’ button
19. Open the downloaded auth.json file and copy the value of ‘client_email’
20. Using the same Google account as in step 2, go to the normal Google Sheets and create & open the ‘Spreadsheet’
    - give the spreadsheet its final name now to avoid issues in the future
21. ‘Share’ the document with the email copied in step 19, giving it ‘Edit’ permissions
    - you might want to un-tick ‘Notify people’ before clicking ‘Send’, as it’s a service email you’re sharing with
    - ‘Send’ will change to ‘OK’ once you un-tick it, which is fine - just click it.
22. You are now ready to use the AUTH.JSON file to retrieve the ‘Spreadsheet’ handle in the code!
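
Instead of opening the downloaded file by hand to copy ‘client_email’, the value can be read with a few lines of Python. This is a small sketch using only the standard library; `read_client_email` is a hypothetical helper name, and the file path is whatever location you saved AUTH.JSON to.

```python
import json


def read_client_email(auth_json_path):
    """Return the 'client_email' value from a service-account JSON key file."""
    with open(auth_json_path, encoding="utf-8") as handle:
        credentials = json.load(handle)
    # A KeyError here means the file is not the expected service-account key.
    return credentials["client_email"]


if __name__ == "__main__":
    import sys

    # Usage: python read_client_email.py /path/to/auth.json
    print(read_client_email(sys.argv[1]))
```

The printed address is the one to share the ‘Spreadsheet’ with. Keep the file itself private, since it also contains the private key.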

Project details

Version 3.0.2, released Jan 26, 2020

Download files

Source Distribution

scrape_jobs-3.0.2.zip (43.2 kB)

Uploaded Jan 26, 2020 Source

Built Distribution

scrape_jobs-3.0.2-py3-none-any.whl (23.1 kB)

Uploaded Jan 26, 2020 Python 3

Hashes for scrape_jobs-3.0.2.zip
Algorithm	Hash digest
SHA256	`a1e570c594d38abadfe282c0cae341d35ec91453a0d13a0445efd811d6bbba72`
MD5	`b5683adbaaa1c6fa3faf45b0630d77ef`
BLAKE2b-256	`6478e9e2ad6f830a6882dc1be97ec608d46e0c3350aecc56ce06c2ddc213cdb7`

Hashes for scrape_jobs-3.0.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0dae267e1046e3036c39f0ce3f00e0c4ceb029df5ff22a58e7bc0c39779d0a9b`
MD5	`0e515f516d9d10d51dca7998d040718a`
BLAKE2b-256	`f61638a804832f87fc1e94110e18e4aadf93aa0ca029092ffd6ea54e7c8e4f22`
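
A downloaded file can be compared against the SHA256 digests listed above with Python's hashlib. This is a generic sketch, not something shipped with scrape-jobs; `sha256_of` and `matches_digest` are hypothetical helper names, and the file path is wherever you saved the download.

```python
import hashlib

# SHA256 digest of scrape_jobs-3.0.2-py3-none-any.whl, copied from the table above.
WHEEL_SHA256 = "0dae267e1046e3036c39f0ce3f00e0c4ceb029df5ff22a58e7bc0c39779d0a9b"


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path, expected_hex):
    """Return True when the file's SHA256 equals the expected hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches_digest("scrape_jobs-3.0.2-py3-none-any.whl", WHEEL_SHA256)` should return True for an intact download of the wheel.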