A small app to grab job postings from online job boards
Introduction
Job boards (like LinkedIn) can be a good source for finding job openings. Unfortunately, the search results cannot always be filtered to a usable degree. Exfill (short for extraction) lets users scrape and parse jobs with more flexibility than the default search provides.
Currently only LinkedIn is supported.
Project Structure
Directories:
src/exfill/parsers
- Contains parser(s)
src/exfill/scrapers
- Contains scraper(s)
src/exfill/support
- Contains geckodriver driver for Firefox which is used by Selenium
- Download the latest driver from the Mozilla GeckoDriver repo in GitHub
data/html
- Not in source control
- Contains HTML elements for a specific job posting
- Populated by a scraper
data/csv
- Not in source control
- Contains parsed information in a csv table
- Populated by a parser
- Also contains an error table
logs
- Not in source control
- Contains logs created during execution
creds.json File
The syntax should be as follows:
{
    "linkedin": {
        "username": "jay-law@protonmail.com",
        "password": "password1"
    }
}
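A minimal sketch of how a file with this layout could be read and checked from Python. The helper name `load_creds` and its error message are assumptions made for illustration; they are not taken from the exfill source.

```python
import json
from pathlib import Path


def load_creds(path, site="linkedin"):
    """Return (username, password) for `site` from a creds.json-style file.

    Hypothetical helper for illustration; not part of exfill itself.
    """
    with Path(path).open(encoding="utf-8") as fh:
        creds = json.load(fh)

    try:
        entry = creds[site]
        return entry["username"], entry["password"]
    except KeyError as exc:
        # Report which key is absent instead of failing with a bare KeyError
        raise ValueError(f"creds file is missing key: {exc}") from exc
```

Calling `load_creds("creds.json")` on the example file returns the username and password as a tuple.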
Usage
There are two actions required to generate usable data:
First is the scraping action. When called, a browser will open and perform a job query on the specified site. Each posting will be exported to the data/html directory.
The second action is parsing. Each job posting in data/html will be opened and analyzed. Once all postings have been analyzed a single CSV file will be exported to data/csv.
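The parsing step can be pictured with the following sketch, which uses only the standard library. It is not exfill's parser: the `*.html` file pattern, the `TitleParser` class, the `parse_postings` function, and the choice of the page title as the only extracted field are all assumptions made for illustration.

```python
import csv
from html.parser import HTMLParser
from pathlib import Path


class TitleParser(HTMLParser):
    """Collects the text found inside <title> tags (illustrative field only)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse_postings(html_dir, csv_path):
    """Parse every *.html file in html_dir and write one CSV row per posting."""
    rows = []
    for html_file in sorted(Path(html_dir).glob("*.html")):
        parser = TitleParser()
        parser.feed(html_file.read_text(encoding="utf-8"))
        rows.append({"file": html_file.name, "title": parser.title.strip()})

    csv_path = Path(csv_path)
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    with csv_path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=["file", "title"])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

The function returns the number of postings written, so a caller could log how many files were processed.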
The CSV file provides a high-level overview of all the jobs returned during the query. When imported to a spreadsheet, users can filter on fields not present in the original search options. Examples include sorting by company or excluding certain industries.
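The same kind of filtering can be done without a spreadsheet. The sketch below assumes the exported file has `company` and `industry` columns; those column names and the `filter_jobs` function are hypothetical, since the actual column layout is not documented here.

```python
import csv


def filter_jobs(csv_path, excluded_industries):
    """Return rows whose industry is not excluded, sorted by company name."""
    excluded = {name.lower() for name in excluded_industries}
    with open(csv_path, newline="", encoding="utf-8") as fh:
        rows = [
            row for row in csv.DictReader(fh)
            # "industry" is an assumed column name
            if row["industry"].lower() not in excluded
        ]
    # "company" is an assumed column name; sort case-insensitively
    return sorted(rows, key=lambda row: row["company"].lower())
```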
Use as Code
# Install with git
$ git clone git@github.com:jay-law/job-scraper.git
$ cd job-scraper
# Create and populate creds.json. Bash only:
cat <<EOF > creds.json
{
    "linkedin": {
        "username": "jay-law@protonmail.com",
        "password": "password1"
    }
}
EOF
# Activate virtual env
$ poetry shell
# Install dependencies
$ poetry install # all deps
$ poetry install --no-dev # don't install linters/formatters
# Execute - Scrape linkedin
$ python3 exfill/extractor.py linkedin scrape
# Execute - Parse linkedin
$ python3 exfill/extractor.py linkedin parse
Use as Module
NOTE - This was broken during the implementation of poetry. It will be fixed soon... Hopefully
# Install
$ python3 -m pip install --upgrade exfill
# Execute - Scrape linkedin
$ python3 -m exfill.extractor linkedin scrape
# Execute - Parse linkedin
$ python3 -m exfill.extractor linkedin parse
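Both invocation styles take a site name followed by an action. A sketch of how a command line of that shape could be parsed with `argparse` is shown here; it is an assumption about the interface, not the project's actual argument handling, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    """Build a parser for `<site> <action>` style commands (illustrative)."""
    parser = argparse.ArgumentParser(prog="extractor")
    # Only LinkedIn is listed as supported, so it is the only choice here
    parser.add_argument("site", choices=["linkedin"], help="job board to query")
    parser.add_argument("action", choices=["scrape", "parse"], help="step to run")
    return parser
```

With this sketch, `build_parser().parse_args(["linkedin", "scrape"])` yields a namespace whose `site` is `"linkedin"` and whose `action` is `"scrape"`.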
Roadmap
- Write unit tests
- Improve secret handling
- Add packaging
- Move paths to config file
- Move keyword logic
- Set/include default config.ini for users installing with PIP
- Add CICD
- Automate versioning
- Add formatter (black module)
- Add static type checking (mypy module)
- Add import sorter (isort module)
- Add linter (flake8 module)
- Update string interpolation from %-formatting to f-strings
- Replace sys.exit calls with exceptions
- Update how the config object is accessed
- Migrate to poetry for virtual env, building, and publishing
- Replace os.path usage with pathlib
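Three of the roadmap items (f-strings, exceptions in place of sys.exit, and pathlib in place of os.path) can be illustrated together. The sketch below is a generic example of those techniques; `ConfigError` and `find_creds` are hypothetical names and do not describe exfill's code.

```python
from pathlib import Path


class ConfigError(Exception):
    """Raised instead of calling sys.exit() inside library code."""


def find_creds(base_dir):
    """Return the path to creds.json under base_dir, or raise ConfigError."""
    # pathlib's "/" operator replaces os.path.join(base_dir, "creds.json")
    creds_path = Path(base_dir) / "creds.json"
    if not creds_path.is_file():
        # An f-string replaces "%s"-style interpolation
        raise ConfigError(f"creds file not found: {creds_path}")
    return creds_path
```

Raising an exception leaves the decision about exiting to the caller, which also makes the function easier to cover with unit tests.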