linkedin-jobs-scraper
Project description
Scrape publicly available jobs on LinkedIn using a headless browser. For each job, the following fields are extracted:
job_id, link, apply_link, title, company, place, description, description_html, date, seniority_level, job_function, employment_type, industries.
Table of Contents
- Requirements
- Installation
- Usage
- Anonymous vs authenticated session
- Rate limiting
- Filters
- Company filter
- Logging
- License
Requirements
- Chrome or Chromium
- Chromedriver
- Python >= 3.6
Installation
Install package:
pip install linkedin-jobs-scraper
Usage
from linkedin_jobs_scraper import LinkedinScraper
from linkedin_jobs_scraper.events import Events, EventData
from linkedin_jobs_scraper.query import Query, QueryOptions, QueryFilters
from linkedin_jobs_scraper.filters import RelevanceFilters, TimeFilters, TypeFilters, ExperienceLevelFilters
def on_data(data: EventData):
    print('[ON_DATA]', data.title, data.company, data.date, data.link, len(data.description))

def on_error(error):
    print('[ON_ERROR]', error)

def on_end():
    print('[ON_END]')
scraper = LinkedinScraper(
    chrome_options=None,  # You can pass your custom Chrome options here
    headless=True,  # Overrides headless mode only if chrome_options is None
    max_workers=1,  # How many threads will be spawned to run queries concurrently (one Chrome driver for each thread)
    slow_mo=0.4,  # Slow down the scraper to avoid 'Too many requests (429)' errors
)
# Add event listeners
scraper.on(Events.DATA, on_data)
scraper.on(Events.ERROR, on_error)
scraper.on(Events.END, on_end)
queries = [
    Query(
        options=QueryOptions(
            optimize=True,  # Blocks requests for resources like images and stylesheets
            limit=27,  # Limit the number of jobs to scrape
        )
    ),
    Query(
        query='Engineer',
        options=QueryOptions(
            locations=['United States'],
            optimize=False,
            limit=5,
            filters=QueryFilters(
                company_jobs_url='https://www.linkedin.com/jobs/search/?f_C=1441%2C17876832%2C791962%2C2374003%2C18950635%2C16140%2C10440912&geoId=92000000',  # Filter by companies
                relevance=RelevanceFilters.RECENT,
                time=TimeFilters.MONTH,
                type=[TypeFilters.FULL_TIME, TypeFilters.INTERNSHIP],
                experience=None,
            )
        )
    ),
]
scraper.run(queries)
Anonymous vs authenticated session
By default the scraper runs in anonymous mode (no authentication required). In some environments (e.g. AWS or Heroku) this may not be possible, though. You may encounter the following error message:
Scraper failed to run in anonymous mode, authentication may be necessary for this environment.
In that case the only option available is to run using an authenticated session. These are the steps required:
- Login to LinkedIn using an account of your choice.
- Open Chrome developer tools.
- Go to the Application tab, then from the left panel select Storage -> Cookies -> https://www.linkedin.com. In the main view, locate the row named li_at and copy the content of the Value column.
- Set the environment variable LI_AT_COOKIE to the value obtained in the previous step, then run your application as usual. Example:
LI_AT_COOKIE=<your li_at cookie value here> python your_app.py
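Alternatively, the cookie can be exported from within the launcher script itself, before the scraper is created. This is a minimal sketch, assuming the library reads LI_AT_COOKIE from the environment at startup; the placeholder value is hypothetical:

```python
import os

# Must be set before the scraper is instantiated, so the environment
# variable is visible to the library when it starts up.
os.environ['LI_AT_COOKIE'] = '<your li_at cookie value here>'

print(os.environ['LI_AT_COOKIE'])
```

Keep the cookie out of version control (e.g. load it from a secrets store) since it grants access to the account it was copied from.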
Rate limiting
You may experience the following rate limiting warning during execution:
[429] Too many requests. You should probably increase scraper "slow_mo" value or reduce concurrency.
This means you are exceeding the number of requests per second allowed by the server (this is especially true when using authenticated sessions, where the rate limits are much stricter). You can overcome this by:
- Trying a higher value for the slow_mo parameter (this will slow down scraper execution).
- Reducing the value of max_workers to limit concurrency. I recommend using no more than one worker in authenticated mode.
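To get an intuition for what slow_mo does, here is a library-independent sketch of the underlying idea: a hypothetical helper (not part of the package) that inserts a fixed delay between consecutive requests, capping the request rate:

```python
import time

def throttled(items, slow_mo=0.4):
    """Yield items, sleeping slow_mo seconds between consecutive ones.

    Mimics the effect of the scraper's slow_mo setting: a higher value
    means fewer requests per second, which helps avoid 429 responses.
    """
    for i, item in enumerate(items):
        if i > 0:
            time.sleep(slow_mo)
        yield item

start = time.monotonic()
results = list(throttled(['page1', 'page2', 'page3'], slow_mo=0.1))
elapsed = time.monotonic() - start  # at least 0.2s for the two inter-item delays
```

With max_workers > 1 each worker issues requests independently, so the effective request rate multiplies accordingly; that is why reducing concurrency is the other lever.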
Filters
It is possible to customize queries with the following filters:
- RELEVANCE: RELEVANT, RECENT
- TIME: DAY, WEEK, MONTH, ANY
- TYPE: FULL_TIME, PART_TIME, TEMPORARY, CONTRACT
- EXPERIENCE LEVEL: INTERNSHIP, ENTRY_LEVEL, ASSOCIATE, MID_SENIOR, DIRECTOR
See the following example for more details:
from linkedin_jobs_scraper.query import Query, QueryOptions, QueryFilters
from linkedin_jobs_scraper.filters import RelevanceFilters, TimeFilters, TypeFilters, ExperienceLevelFilters
query = Query(
    query='Engineer',
    options=QueryOptions(
        locations=['United States'],
        optimize=False,
        limit=5,
        filters=QueryFilters(
            relevance=RelevanceFilters.RECENT,
            time=TimeFilters.MONTH,
            type=[TypeFilters.FULL_TIME, TypeFilters.INTERNSHIP],
            experience=[ExperienceLevelFilters.INTERNSHIP, ExperienceLevelFilters.MID_SENIOR],
        )
    )
)
Company Filter
It is also possible to filter by company using the public company jobs URL on LinkedIn. To find this URL you have to:
- Login to LinkedIn using an account of your choice.
- Go to the LinkedIn page of the company you are interested in (e.g. https://www.linkedin.com/company/google).
- Click on Jobs in the left menu.
- Scroll down and locate the See all jobs or See jobs button.
- Right click and copy the link address (or navigate to the link and copy it from the address bar).
- Paste the link address in code as follows:
query = Query(
    options=QueryOptions(
        filters=QueryFilters(
            # Paste link below
            company_jobs_url='https://www.linkedin.com/jobs/search/?f_C=1441%2C17876832%2C791962%2C2374003%2C18950635%2C16140%2C10440912&geoId=92000000',
        )
    )
)
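As a side note, the numeric company ids are carried in the f_C query parameter of that URL, comma-separated (%2C is the URL-encoded comma). They can be inspected with the standard library alone:

```python
from urllib.parse import urlparse, parse_qs

url = ('https://www.linkedin.com/jobs/search/'
       '?f_C=1441%2C17876832%2C791962%2C2374003%2C18950635%2C16140%2C10440912&geoId=92000000')

# parse_qs URL-decodes the values, so %2C becomes a literal comma
params = parse_qs(urlparse(url).query)
company_ids = params['f_C'][0].split(',')
print(company_ids)  # seven numeric company ids, '1441' first
```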
Logging
The package logger can be retrieved using the namespace li:scraper. The default level is INFO.
It is possible to change the logger level using the environment variable LOG_LEVEL or in code:
import logging
logging.getLogger('li:scraper').setLevel(logging.DEBUG)
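Since li:scraper is an ordinary logging namespace, the change can be verified with the standard logging module alone:

```python
import logging

logger = logging.getLogger('li:scraper')
logger.setLevel(logging.DEBUG)

# getLevelName maps the numeric level back to its symbolic name
print(logging.getLevelName(logger.level))  # DEBUG
```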
License
File details
Details for the file linkedin-jobs-scraper-1.2.1.tar.gz.
File metadata
- Download URL: linkedin-jobs-scraper-1.2.1.tar.gz
- Upload date:
- Size: 19.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.24.0 setuptools/50.3.0.post20201006 requests-toolbelt/0.9.1 tqdm/4.51.0 CPython/3.7.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6ec93cc04026d4edb043a9a03c8ef6378b1dd7a186ab2c0b526092b7332ad059 |
| MD5 | c19016793de8ecbeeeec89239b707e32 |
| BLAKE2b-256 | 89bfbbe1921e7986e989971b8e25f199ebae68374d801a67220dbde19c2e285e |
File details
Details for the file linkedin_jobs_scraper-1.2.1-py3-none-any.whl.
File metadata
- Download URL: linkedin_jobs_scraper-1.2.1-py3-none-any.whl
- Upload date:
- Size: 26.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.24.0 setuptools/50.3.0.post20201006 requests-toolbelt/0.9.1 tqdm/4.51.0 CPython/3.7.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4642004911e9f2312123199f0babfb8f01e98e6f2b76f919e6fafce7d69f4dc3 |
| MD5 | 2958eb022cda848aa736555410bf7f0c |
| BLAKE2b-256 | e3ce9fa1282527be3b7a87311de7d79bb33415f0875359a3988bb86f01cf0b23 |