CLI jobs scraper targeting multiple sites
Simple CLI jobs scraper.

DISCLAIMER: Made as a proof of concept. USE AT YOUR OWN RISK.
Workflow
Scrape jobs matching certain criteria (see the sketch below):
- linkedin.com - [keywords, location]
- seek.com.au - [what, where]
Store the scraped data:
- upload it to Google Sheets, or
- store it locally in a CSV file.
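For a rough idea of what the per-site search params mean, here is a hypothetical illustration of how they could be turned into search URLs. The URL formats below are assumptions made for illustration, not taken from the package:

```python
from urllib.parse import urlencode

# Hypothetical illustration only - the package's real query building may differ.
def linkedin_search_url(keywords, location):
    # Assumed public jobs-search URL taking keywords/location query params.
    return "https://www.linkedin.com/jobs/search/?" + urlencode(
        {"keywords": keywords, "location": location})

def seek_search_url(what, where):
    # Assumed path-style search URL for seek.com.au.
    def slug(s):
        return s.lower().replace(" ", "-")
    return "https://www.seek.com.au/{}-jobs/in-{}".format(slug(what), slug(where))

print(linkedin_search_url("data engineer", "Sydney, Australia"))
print(seek_search_url("data engineer", "All Sydney NSW"))
```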
Installation
Prerequisites:
- Google Chrome installed
- chromedriver binary executable in PATH
- Python 3.6+
- a prepared Google Spreadsheet (detailed instructions below)
Install command:
pip install --force-reinstall -U scrape-jobs
Installs the following CLI bindings:
- scrape-jobs-init-config
- scrape-jobs
Basic Instructions
Open a terminal/CMD and cd into your working dir.
Run scrape-jobs-init-config to create a sample config file in the current work dir:
usage: scrape-jobs-init-config [-h] [--version] [-v] [-vv] [-f FILE]

initialize sample 'scrape-jobs' config file

optional arguments:
- -h, --help: show this help message and exit
- --version: show program's version number and exit
- -v, --verbose: set loglevel to INFO
- -vv, --very-verbose: set loglevel to DEBUG
- -f FILE: defaults to '/current/work/dir/scrape-jobs.ini'
Edit the config file as per your needs
Run scrape-jobs to trigger execution:

usage: scrape-jobs [-h] [--version] [-v] [-vv] [-c CONFIG_FILE] {linkedin.com,seek.com.au}

Scrape jobs and store results.

positional arguments:
- {linkedin.com,seek.com.au}: site to scrape

optional arguments:
- -h, --help: show this help message and exit
- --version: show program's version number and exit
- -v, --verbose: set loglevel to INFO
- -vv, --very-verbose: set loglevel to DEBUG
- -c CONFIG_FILE: defaults to '/current/work/dir/scrape-jobs.ini'
More Detailed Instructions:
Prepare the spreadsheet and the spreadsheet's auth.json (spreadsheet instructions at the bottom).
For seek.com.au the worksheet's columns are:
["scraped_time", "posted_time", "location", "area", "classification", "sub_classification", "title", "salary", "company", "url"]
For linkedin.com the worksheet's columns are:
["scraped_time", "posted_time", "location", "title", "company", "url"]
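To illustrate the column layout, here is a minimal sketch of what the CSV fallback output could look like with these columns. The writer below is illustrative, not the package's actual code, and the sample record is made up:

```python
import csv

# Worksheet column layouts, as documented above.
SEEK_COLUMNS = ["scraped_time", "posted_time", "location", "area", "classification",
                "sub_classification", "title", "salary", "company", "url"]
LINKEDIN_COLUMNS = ["scraped_time", "posted_time", "location", "title", "company", "url"]

def write_jobs_csv(path, columns, rows):
    # Write a header row plus one row per job, mirroring the worksheet layout.
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=columns)
        writer.writeheader()
        writer.writerows(rows)

# Hypothetical record - field values are invented for illustration.
write_jobs_csv("linkedin-jobs.csv", LINKEDIN_COLUMNS, [{
    "scraped_time": "2020-01-01 10:00", "posted_time": "2019-12-31",
    "location": "Sydney", "title": "Data Engineer", "company": "ACME",
    "url": "https://example.com/job/123",
}])
```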
Init an empty config file by calling scrape-jobs-init-config.
Edit the newly created CWD\scrape-jobs.ini with params of your choice (an illustrative sample follows this list):
- set the path to the AUTH.JSON
- set the spreadsheet name
- set the worksheet name (it will be created automatically if it doesn't exist)
- set the search params
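For orientation, a filled-in config could look roughly like this. The section and key names are assumptions; treat the file generated by scrape-jobs-init-config as the authoritative template:

```ini
; Illustrative sketch only - key names may differ from the generated template.
[seek.com.au]
what = data engineer
where = All Sydney NSW

[linkedin.com]
keywords = data engineer
location = Sydney, New South Wales, Australia

[upload]
spreadsheet_name = jobs
worksheet_name = seek.com.au
auth_json_path = /path/to/auth.json
```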
Trigger execution:
- run scrape-jobs linkedin.com or scrape-jobs seek.com.au
- you will see output in the console, and a scrape-jobs.log file will be created too
- for more detailed output, add the -vv execution param
After the scrape is complete, you should see the newly discovered jobs in your spreadsheet.
Alternatively, you can init a config at a known place and just pass its path:
scrape-jobs-init-config -f /custom/path/to/config.ini
scrape-jobs -c /custom/path/to/config.ini seek.com.au
Note
- You need to prepare the AUTH.JSON file in advance; it is used for authentication with Google Sheets.
- The term 'Spreadsheet' refers to a single document shown on the Google Sheets landing page.
- A single 'Spreadsheet' can contain one or more 'Worksheets'.
- A newly created 'Spreadsheet' usually contains a single 'Worksheet' named 'Sheet1'.
- If you don't provide a valid path to AUTH.JSON, the collected data will be saved as a .csv file in the current work dir.
Instructions for preparing the Google Spreadsheet AUTH.JSON:
1. Go to the Google API Console (https://console.developers.google.com/).
2. Log in with the Google account that is to be the owner of the 'Spreadsheet'.
3. At the top-left corner, there is a drop-down right next to the "Google APIs" text.
4. Click the drop-down and a modal dialog will appear, then click "NEW PROJECT" at its top-right.
5. Give the project a name relevant to how the sheet will be used, don't select 'Location*', just press 'CREATE'.
6. Open the newly created project from the same drop-down as in step 3.
7. There should be an 'APIs' area with a "-> Go to APIs overview" link at its bottom - click it.
8. A new page will load with an '+ ENABLE APIS AND SERVICES' button at the top - click it.
9. A new page will load with a 'Search for APIs & Services' input - use it to find and open 'Google Drive API'.
10. On the 'Google Drive API' page click "ENABLE" - you'll be redirected back to the project's page.
11. There will be a new 'CREATE CREDENTIALS' button at the top - click it.
12. Set up the new credentials as follows:
    - Which API are you using? -> 'Google Drive API'
    - Where will you be calling the API from? -> 'Web server (e.g. node.js, Tomcat)'
    - What data will you be accessing? -> 'Application data'
    - Are you planning to use this API with App Engine or Compute Engine? -> No, I'm not using them.
13. Click the blue 'What credentials do I need?' button; it will take you to the 'Add credentials to your project' page.
14. Set up the credentials as follows:
    - Service account name: {whatever name you type is OK, as long as the input accepts it}
    - Role: Project -> Editor
    - Key type: JSON
15. Press the blue 'Continue' button, and a download of the AUTH.JSON file will begin (store it somewhere safe).
16. Close the modal and go back to the project 'Dashboard' using the left-side navigation panel.
17. Repeat step 8.
18. Search for 'Google Sheets API', then open the result and click the blue 'ENABLE' button.
19. Open the downloaded auth.json file and copy the value of 'client_email'.
20. Using the same Google account as in step 2, go to regular Google Sheets and create & open the 'Spreadsheet'.
21. Do a final renaming of the spreadsheet now, to avoid issues in the future.
22. 'Share' the document with the email copied in step 19, giving it 'Edit' permissions.
    - You might want to un-tick 'Notify people' before clicking 'Send', since it's a service email you're sharing with.
    - 'Send' will change to 'OK' upon un-ticking - that's fine, just click it.
You are now ready to retrieve a 'Spreadsheet' handle in code!
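As an illustration of what the prepared AUTH.JSON enables, here is a hedged sketch using the gspread library (recent versions provide gspread.service_account; older setups used oauth2client instead). The names and paths are placeholders, and the package itself may wire this up differently:

```python
import gspread

# Authenticate with the service-account AUTH.JSON prepared above.
# Path and names below are placeholders - substitute your own values.
client = gspread.service_account(filename="/path/to/auth.json")

# Open the 'Spreadsheet' that was shared with the service account's
# client_email (step 22), then grab a 'Worksheet' inside it.
spreadsheet = client.open("my-jobs-spreadsheet")
worksheet = spreadsheet.worksheet("Sheet1")

# Append one row following the documented column layout for linkedin.com.
worksheet.append_row(["2020-01-01 10:00", "2019-12-31", "Sydney",
                      "Data Engineer", "ACME", "https://example.com/job/123"])
```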
Project details
Download files
Download the file for your platform.
- Source Distribution: scrape_jobs-3.0.2.zip
- Built Distribution: scrape_jobs-3.0.2-py3-none-any.whl
File details
Details for the file scrape_jobs-3.0.2.zip.
File metadata
- Download URL: scrape_jobs-3.0.2.zip
- Upload date:
- Size: 43.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/44.0.0 requests-toolbelt/0.9.1 tqdm/4.41.1 CPython/3.6.9
File hashes

Algorithm | Hash digest
---|---
SHA256 | a1e570c594d38abadfe282c0cae341d35ec91453a0d13a0445efd811d6bbba72
MD5 | b5683adbaaa1c6fa3faf45b0630d77ef
BLAKE2b-256 | 6478e9e2ad6f830a6882dc1be97ec608d46e0c3350aecc56ce06c2ddc213cdb7
File details
Details for the file scrape_jobs-3.0.2-py3-none-any.whl.
File metadata
- Download URL: scrape_jobs-3.0.2-py3-none-any.whl
- Upload date:
- Size: 23.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/44.0.0 requests-toolbelt/0.9.1 tqdm/4.41.1 CPython/3.6.9
File hashes

Algorithm | Hash digest
---|---
SHA256 | 0dae267e1046e3036c39f0ce3f00e0c4ceb029df5ff22a58e7bc0c39779d0a9b
MD5 | 0e515f516d9d10d51dca7998d040718a
BLAKE2b-256 | f61638a804832f87fc1e94110e18e4aadf93aa0ca029092ffd6ea54e7c8e4f22