# Pooled Job Scraper

Generic job listings scraper with baseline dedupe and integration with the open job pool.

pooled-job-scraper is published on PyPI and provides a CLI for scraping business careers pages, extracting listings, and generating a unique delta against the April 2026 baseline dataset.

## Platform Links
## How They Fit Together

- jobpool.live is the open data pool and hydration surface. Use this scraper to discover and normalize job listings, then review unique delta rows before promoting data into the pool workflow.
- mewannajob.com is the consumer-facing experience for browsing and using listings data. Data prepared through the pool process ultimately supports downstream job discovery use cases there.
## Documentation Surfaces

- Public pool context and hydration navigation: jobpool.live
- Hydration docs in this repository: pool/hydration/docs/
- Scraper implementation in this repository: scripts/generic_job_listings_scraper.py
## Install

Windows:

    py -m pip install --upgrade pooled-job-scraper

WSL / Linux / macOS:

    python3 -m pip install --upgrade pooled-job-scraper
## Usage

    job-scraper \
      --business-url https://mossyhonda.hireology.careers/ \
      --company-name "Mossy Honda" \
      --output output/mossy-scraped.csv \
      --unique-output output/mossy-unique.csv
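Conceptually, the unique delta is a baseline lookup: every scraped row whose identity already appears in the baseline dataset is dropped. A minimal sketch, assuming dedupe keys of `company_name`, `job_title`, and `job_url` (the CLI's actual key columns are not documented here):

```python
def unique_delta(scraped_rows, baseline_rows,
                 keys=("company_name", "job_title", "job_url")):
    """Return scraped rows whose key tuple is absent from the baseline.

    `keys` is an assumed identity; the real CLI's dedupe columns may differ.
    """
    seen = {tuple(row.get(k, "") for k in keys) for row in baseline_rows}
    return [row for row in scraped_rows
            if tuple(row.get(k, "") for k in keys) not in seen]

baseline = [{"company_name": "Mossy Honda", "job_title": "Technician",
             "job_url": "https://example.com/1"}]
scraped = baseline + [{"company_name": "Mossy Honda", "job_title": "Advisor",
                       "job_url": "https://example.com/2"}]
delta = unique_delta(scraped, baseline)
print([r["job_title"] for r in delta])  # only the row not in the baseline
```

The `--unique-output` file corresponds to this delta; `--output` holds everything scraped.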
## Live Progress + Rate Controls

The CLI now shows a live status bar with:

- percent complete
- ETA
- observed request rate (req/s)
- derived safe request rate from observed site limits (429 responses, `Retry-After`, and `X-RateLimit-*` headers)

During runs, a persistent prompt stays active until completion:

    rate-control>

Supported prompt commands:

- `rate <rps>` (example: `rate 1.2`)
- `delay <seconds>` (example: `delay 0.8`)
- `auto` (return to adaptive pacing)
- `status` (show current derived limits/rate)
- `help`
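The adaptive pacing described above can be sketched as follows. The backoff factor and precedence (honor `Retry-After` on a 429, otherwise derive a rate from `X-RateLimit-*` headers) are illustrative assumptions, not the shipped policy:

```python
def derived_safe_rps(headers, status, current_rps, backoff=0.5):
    """Sketch of deriving a safe request rate from rate-limit signals.

    Heuristic only; the CLI's real adaptive policy is not documented here.
    """
    if status == 429:
        # Server told us to slow down: back off, and never exceed the
        # rate implied by Retry-After.
        retry_after = float(headers.get("Retry-After", 1.0))
        return min(current_rps * backoff, 1.0 / max(retry_after, 0.1))
    limit = headers.get("X-RateLimit-Limit")
    reset = headers.get("X-RateLimit-Reset")  # assumed: seconds until window reset
    if limit and reset:
        # Spread the remaining quota evenly over the window.
        return float(limit) / max(float(reset), 1.0)
    return current_rps

print(derived_safe_rps({"Retry-After": "2"}, 429, 2.0))  # -> 0.5
```

The `rate` and `delay` prompt commands override this derived value; `auto` hands control back to it.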
Disable the interactive prompt when needed:

    job-scraper \
      --business-url https://mossyhonda.hireology.careers/ \
      --company-name "Mossy Honda" \
      --no-control-prompt \
      --output output/mossy-scraped.csv \
      --unique-output output/mossy-unique.csv
## Field Enrichment + Limits

The scraper now enriches records when source pages are sparse:

- Derives `job_summary` when absent.
- Derives `job_posted_date` from available text/URL patterns, with ingest date fallback.
- Derives `job_industries` from curated company+industry hints, baseline company patterns, and keyword hints.
- Applies sensible per-column word caps for listing-style data quality.
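A minimal sketch of the `job_posted_date` fallback chain: page text first, then a date embedded in the URL, then the ingest date. The regex patterns and precedence are assumptions for illustration, not the package's exact rules:

```python
import re
from datetime import date

# Assumed URL shape like .../2026/03/15/... — illustrative only.
DATE_IN_URL = re.compile(r"/(20\d{2})[-/](\d{2})[-/](\d{2})(?:/|$)")

def derive_posted_date(text, url, ingest_date=None):
    """Illustrative fallback chain for job_posted_date."""
    m = re.search(r"(20\d{2})-(\d{2})-(\d{2})", text or "")
    if not m:
        m = DATE_IN_URL.search(url or "")
    if m:
        return "-".join(m.groups())
    # Last resort: the date the row was ingested.
    return (ingest_date or date.today()).isoformat()

print(derive_posted_date("Posted 2026-04-01", "", date(2026, 5, 1)))  # -> 2026-04-01
```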
## Default Cache Behavior

By default, unique rows are posted to:

    https://jobpool.live/api/scrape-cache

    job-scraper \
      --business-url https://mossyhonda.hireology.careers/ \
      --company-name "Mossy Honda" \
      --output output/mossy-scraped.csv \
      --unique-output output/mossy-unique.csv

The scraper infers `user_name` from local environment or git config and sends:

- `user_name`
- `request_timestamp`
- `source_business_urls`
- `listings` (standard listing fields plus any additional discovered fields)
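A sketch of how that submission body could be assembled, using the field names listed above and mirroring the documented `user_name` inference (environment first, then git config). `build_cache_payload` is a hypothetical helper, not part of the package:

```python
import json
import os
import subprocess
from datetime import datetime, timezone

def build_cache_payload(listings, business_urls):
    """Assemble a scrape-cache submission body (field names from the README)."""
    # user_name inference: environment, then git config, as described above.
    user = os.environ.get("USER") or os.environ.get("USERNAME")
    if not user:
        try:
            user = subprocess.check_output(
                ["git", "config", "user.name"], text=True).strip()
        except Exception:
            user = "unknown"
    return {
        "user_name": user,
        "request_timestamp": datetime.now(timezone.utc).isoformat(),
        "source_business_urls": business_urls,
        "listings": listings,
    }

payload = build_cache_payload(
    [{"job_title": "Technician"}], ["https://mossyhonda.hireology.careers/"])
print(json.dumps(payload, indent=2))
```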
## Disable Cache Submission

Pass `--disable-cache`, the only cache-related flag, when you need to skip cache persistence:

    job-scraper \
      --business-url https://mossyhonda.hireology.careers/ \
      --company-name "Mossy Honda" \
      --disable-cache \
      --output output/mossy-scraped.csv \
      --unique-output output/mossy-unique.csv
## Cache API

- `POST /api/scrape-cache` stores a scrape request payload.
- `GET /api/scrape-cache?limit=25&user_name=<name>` returns recent cached submissions.
- `GET /api/scrape-cache?leaderboard=1&leaderboard_limit=20` returns GitHub user leaderboard data with `preprod_records` and `prod_records`.
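The documented query shapes can be built with the standard library; no network call is made here, and the example user name is a placeholder:

```python
from urllib.parse import urlencode

BASE = "https://jobpool.live/api/scrape-cache"

# Recent submissions for one user (query shape from the endpoint list above).
recent = f"{BASE}?{urlencode({'limit': 25, 'user_name': 'lramos0'})}"

# Leaderboard view with preprod_records / prod_records per GitHub user.
board = f"{BASE}?{urlencode({'leaderboard': 1, 'leaderboard_limit': 20})}"

print(recent)
print(board)
```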
## Publishing Flow

Publishing is automated through:

    .github/workflows/publish-pypi.yml

Behavior:

- Triggers on push/merge to `main`.
- Builds distributions from `pyproject.toml`.
- Checks whether the current version already exists on PyPI.
- Publishes only when the version is new.
- Skips cleanly when that version already exists.

To release a new version:

1. Bump `project.version` in `pyproject.toml`.
2. Merge to `main`.
3. Wait for the publish workflow to complete.
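The version gate in the workflow reduces to a membership check against the versions already released. `should_publish` is a hypothetical helper mirroring that behavior, not the workflow's actual code:

```python
def should_publish(version, released_versions):
    """Publish only when `version` is absent from what's already on PyPI;
    otherwise skip cleanly (the workflow behavior described above)."""
    return version not in set(released_versions)

# In the workflow, the released list would come from PyPI's JSON API,
# e.g. the "releases" keys of https://pypi.org/pypi/pooled-job-scraper/json.
print(should_publish("0.1.4", ["0.1.2", "0.1.3"]))  # -> True
print(should_publish("0.1.3", ["0.1.2", "0.1.3"]))  # -> False
```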