Generic job listings scraper with baseline dedupe and integration with the open jobpool

Project description

Pooled Job Scraper

pooled-job-scraper is published on PyPI and provides a CLI for scraping business careers pages, extracting listings, and generating a unique delta against the April 2026 baseline dataset.

Platform Links

How They Fit Together

  • jobpool.live is the open data pool and hydration surface.
    Use this scraper to discover and normalize job listings, then review unique delta rows before promoting data into the pool workflow.
  • mewannajob.com is the consumer-facing experience for browsing and using listings data.
    Data prepared through the pool process ultimately supports downstream job discovery use cases there.

Documentation Surfaces

  • Public pool context and hydration navigation: jobpool.live
  • Hydration docs in this repository: pool/hydration/docs/
  • Scraper implementation in this repository: scripts/generic_job_listings_scraper.py

Install

Windows

py -m pip install --upgrade pooled-job-scraper

WSL / Linux / macOS

python3 -m pip install --upgrade pooled-job-scraper

Usage

job-scraper \
  --business-url https://mossyhonda.hireology.careers/ \
  --company-name "Mossy Honda" \
  --output output/mossy-scraped.csv \
  --unique-output output/mossy-unique.csv
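The `--unique-output` file holds the delta against the baseline dataset. A minimal sketch of that dedupe step, assuming listings are keyed by a `job_url` column (a hypothetical key; the real scraper may key on a different field or a combination of fields):

```python
def unique_delta(scraped_rows, baseline_rows, key="job_url"):
    """Return scraped rows whose key does not appear in the baseline.

    `key` is an illustrative column name, not necessarily the
    scraper's actual dedupe key.
    """
    seen = {row[key] for row in baseline_rows}
    return [row for row in scraped_rows if row[key] not in seen]

baseline = [{"job_url": "https://example.com/jobs/1", "title": "Technician"}]
scraped = [
    {"job_url": "https://example.com/jobs/1", "title": "Technician"},
    {"job_url": "https://example.com/jobs/2", "title": "Service Advisor"},
]
print(unique_delta(scraped, baseline))
```

Only the second scraped row survives the delta, since the first already exists in the baseline.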

Live Progress + Rate Controls

The CLI now shows a live status bar with:

  • percent complete
  • ETA
  • observed request rate (req/s)
  • derived safe request rate from observed site limits (429, Retry-After, and X-RateLimit-* headers)

During runs, a persistent prompt stays active until completion:

rate-control>

Supported prompt commands:

  • rate <rps> (example: rate 1.2)
  • delay <seconds> (example: delay 0.8)
  • auto (return to adaptive pacing)
  • status (show current derived limits/rate)
  • help
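The derived safe rate shown in the status bar comes from observed limit signals. A hedged sketch of how `Retry-After` and `X-RateLimit-*` headers might be turned into a requests-per-second cap; header semantics vary by site (e.g. `X-RateLimit-Reset` may be an epoch timestamp rather than a delta), and this is not the CLI's exact algorithm:

```python
def derive_safe_rps(headers, observed_rps, backoff=0.5):
    """Derive a conservative requests-per-second cap from rate-limit
    response headers. Header names follow common conventions; the
    backoff factor is an illustrative safety margin."""
    retry_after = headers.get("Retry-After")
    if retry_after is not None:
        # The site asked us to wait: cap at one request per Retry-After window.
        return min(observed_rps, 1.0 / max(float(retry_after), 1.0))
    limit = headers.get("X-RateLimit-Limit")
    reset = headers.get("X-RateLimit-Reset")  # assumed: seconds until window reset
    if limit and reset:
        window_rps = float(limit) / max(float(reset), 1.0)
        return min(observed_rps, window_rps * backoff)
    # No limit signals observed: keep the current pace.
    return observed_rps

print(derive_safe_rps({"Retry-After": "2"}, observed_rps=5.0))  # → 0.5
```

A 429 response would typically carry one of these headers, which is why the status bar lists all three signals together.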

Disable the interactive prompt when needed:

job-scraper \
  --business-url https://mossyhonda.hireology.careers/ \
  --company-name "Mossy Honda" \
  --no-control-prompt \
  --output output/mossy-scraped.csv \
  --unique-output output/mossy-unique.csv

Field Enrichment + Limits

The scraper now enriches records when source pages are sparse:

  • Derives job_summary when absent.
  • Derives job_posted_date from available text/URL patterns, with ingest date fallback.
  • Derives job_industries from curated company+industry hints, baseline company patterns, and keyword hints.
  • Applies sensible per-column word caps for listing-style data quality.
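The posted-date derivation above can be sketched as pattern matching over page text and URL, with the ingest date as a last resort. The patterns here are illustrative, not the scraper's real rule set:

```python
import re
from datetime import date

# Illustrative date patterns; the scraper's actual set may be broader.
DATE_PATTERNS = [
    r"(\d{4})-(\d{2})-(\d{2})",   # e.g. 2026-04-01 in page text
    r"(\d{4})/(\d{2})/(\d{2})",   # e.g. /2026/04/01/ in a URL path
]

def derive_posted_date(text, url="", ingest_date=None):
    """Guess a posted date from text or URL patterns, falling back to
    the ingest date when nothing matches."""
    for source in (text, url):
        for pattern in DATE_PATTERNS:
            match = re.search(pattern, source)
            if match:
                year, month, day = (int(g) for g in match.groups())
                return date(year, month, day)
    return ingest_date or date.today()

print(derive_posted_date("Posted 2026-04-01", ""))
```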

Default Cache Behavior

By default, unique rows are posted to:

  • https://jobpool.live/api/scrape-cache

No extra flag is required; a standard run submits unique rows automatically:

job-scraper \
  --business-url https://mossyhonda.hireology.careers/ \
  --company-name "Mossy Honda" \
  --output output/mossy-scraped.csv \
  --unique-output output/mossy-unique.csv

The scraper infers user_name from local environment or git config and sends:

  • user_name
  • request_timestamp
  • source_business_urls
  • listings (standard listing fields plus any additional discovered fields)
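The submission body can be sketched from the field list above. The field names come from this documentation; the value formats (such as the timestamp encoding) are assumptions:

```python
from datetime import datetime, timezone

def build_cache_payload(user_name, business_urls, listings):
    """Assemble a submission body for POST /api/scrape-cache.

    Field names match the documented payload; ISO-8601 UTC for the
    timestamp is an assumption about the wire format."""
    return {
        "user_name": user_name,
        "request_timestamp": datetime.now(timezone.utc).isoformat(),
        "source_business_urls": business_urls,
        "listings": listings,
    }

payload = build_cache_payload(
    "example-user",
    ["https://mossyhonda.hireology.careers/"],
    [{"job_title": "Technician", "company_name": "Mossy Honda"}],
)
print(sorted(payload))
```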

Disable Cache Submission

Cache submission is controlled by a single flag; pass --disable-cache when you need to skip cache persistence:

job-scraper \
  --business-url https://mossyhonda.hireology.careers/ \
  --company-name "Mossy Honda" \
  --disable-cache \
  --output output/mossy-scraped.csv \
  --unique-output output/mossy-unique.csv

Cache API

  • POST /api/scrape-cache stores a scrape request payload.
  • GET /api/scrape-cache?limit=25&user_name=<name> returns recent cached submissions.
  • GET /api/scrape-cache?leaderboard=1&leaderboard_limit=20 returns GitHub user leaderboard data with preprod_records and prod_records.
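The documented GET query strings can be assembled like this; the endpoint and parameter names come from the list above, while anything else (defaults, extra parameters) is an assumption:

```python
from urllib.parse import urlencode

BASE = "https://jobpool.live/api/scrape-cache"

def recent_submissions_url(limit=25, user_name=None):
    """Build the documented query for recent cached submissions."""
    params = {"limit": limit}
    if user_name:
        params["user_name"] = user_name
    return f"{BASE}?{urlencode(params)}"

def leaderboard_url(leaderboard_limit=20):
    """Build the documented leaderboard query."""
    return f"{BASE}?{urlencode({'leaderboard': 1, 'leaderboard_limit': leaderboard_limit})}"

print(recent_submissions_url(user_name="example-user"))
```

Any HTTP client can then issue the GET; the POST side takes the payload described under Default Cache Behavior.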

Publishing Flow

Publishing is automated through:

  • .github/workflows/publish-pypi.yml

Behavior:

  • Triggers on push/merge to main.
  • Builds distributions from pyproject.toml.
  • Checks whether the current version already exists on PyPI.
  • Publishes only when the version is new.
  • Skips cleanly when that version already exists.

To release a new version:

  1. Bump project.version in pyproject.toml.
  2. Merge to main.
  3. Wait for the publish workflow to complete.
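The version gate in the workflow reduces to a membership check: publish only when the `pyproject.toml` version is absent from PyPI. A sketch under that assumption (the actual workflow presumably queries PyPI's JSON API for the released versions, which are passed in directly here):

```python
def should_publish(project_version, released_versions):
    """Decide whether to publish: True only when the current
    project.version has never been released.

    `released_versions` stands in for whatever the workflow fetches
    from PyPI."""
    return project_version not in set(released_versions)

print(should_publish("0.1.4", ["0.1.2", "0.1.3"]))  # → True: new version, publish
print(should_publish("0.1.3", ["0.1.2", "0.1.3"]))  # → False: already on PyPI, skip
```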

Download files


Source Distribution

  • pooled_job_scraper-0.1.3.tar.gz (22.0 kB, source)

Built Distribution

  • pooled_job_scraper-0.1.3-py3-none-any.whl (20.9 kB, Python 3)

File details

Details for the file pooled_job_scraper-0.1.3.tar.gz.

File metadata

  • Download URL: pooled_job_scraper-0.1.3.tar.gz
  • Size: 22.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pooled_job_scraper-0.1.3.tar.gz:

  • SHA256: a6e3a90893ae3af62945ea6b3548e74bd50bf33acb5928929fe6251eee96ddce
  • MD5: e2f83d8e8c43dedf10f21b97e469e44b
  • BLAKE2b-256: 00722af62eb451f36f983f896b60a92712b3b5e22589cc9b6ecd6e9bf869c8c7


Provenance

The following attestation bundles were made for pooled_job_scraper-0.1.3.tar.gz:

Publisher: publish-pypi.yml on lramos0/livejobpool

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file pooled_job_scraper-0.1.3-py3-none-any.whl.

File hashes

Hashes for pooled_job_scraper-0.1.3-py3-none-any.whl:

  • SHA256: 4df86025f59a95ac36b5eee8621ba1e1a14c3afb5e8ab91a6c67b71d659954ed
  • MD5: 3f8c14e1dc528f5b396ba1121449c3e7
  • BLAKE2b-256: 570032e306b2ef3a096bfd4744e9a2ed12330c8a61a4656a2d039909031c4200


Provenance

The following attestation bundles were made for pooled_job_scraper-0.1.3-py3-none-any.whl:

Publisher: publish-pypi.yml on lramos0/livejobpool

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
