Generic job listings scraper with baseline dedupe and integration with the open jobpool

Project description

Pooled Job Scraper

pooled-job-scraper is published on PyPI and provides a CLI for scraping business careers pages, extracting listings, and generating a unique delta against the April 2026 baseline dataset.

Platform Links

How They Fit Together

  • jobpool.live is the open data pool and hydration surface.
    Use this scraper to discover and normalize job listings, then review unique delta rows before promoting data into the pool workflow.
  • mewannajob.com is the consumer-facing experience for browsing and using listings data.
    Data prepared through the pool process ultimately supports downstream job discovery use cases there.

Documentation Surfaces

  • Public pool context and hydration navigation: jobpool.live
  • Hydration docs in this repository: pool/hydration/docs/
  • Scraper implementation in this repository: scripts/generic_job_listings_scraper.py

Install

Windows

py -m pip install --upgrade pooled-job-scraper

WSL / Linux / macOS

python3 -m pip install --upgrade pooled-job-scraper

Usage

job-scraper \
  --business-url https://mossyhonda.hireology.careers/ \
  --company-name "Mossy Honda" \
  --output output/mossy-scraped.csv \
  --unique-output output/mossy-unique.csv
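The `--unique-output` file holds the delta against the baseline dataset. A minimal sketch of that dedupe step, assuming listings are keyed by a `job_url` column (a hypothetical key; the real scraper may key on a different field or a combination of fields):

```python
def unique_delta(scraped_rows, baseline_rows, key="job_url"):
    """Return scraped rows whose key does not appear in the baseline.

    `key` is an illustrative column name, not necessarily the
    scraper's actual dedupe key.
    """
    seen = {row[key] for row in baseline_rows}
    return [row for row in scraped_rows if row[key] not in seen]

baseline = [{"job_url": "https://example.com/jobs/1", "title": "Technician"}]
scraped = [
    {"job_url": "https://example.com/jobs/1", "title": "Technician"},
    {"job_url": "https://example.com/jobs/2", "title": "Service Advisor"},
]
print(unique_delta(scraped, baseline))
```

Only the second scraped row survives the delta, since the first already exists in the baseline.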

Live Progress + Rate Controls

The CLI now shows a live status bar with:

  • percent complete
  • ETA
  • observed request rate (req/s)
  • derived safe request rate from observed site limits (429, Retry-After, and X-RateLimit-* headers)

During runs, a persistent prompt stays active until completion:

rate-control>

Supported prompt commands:

  • rate <rps> (example: rate 1.2)
  • delay <seconds> (example: delay 0.8)
  • auto (return to adaptive pacing)
  • status (show current derived limits/rate)
  • help
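The derived safe rate shown in the status bar comes from observed limit signals. A hedged sketch of how `Retry-After` and `X-RateLimit-*` headers might be turned into a requests-per-second cap; header semantics vary by site (e.g. `X-RateLimit-Reset` may be an epoch timestamp rather than a delta), and this is not the CLI's exact algorithm:

```python
def derive_safe_rps(headers, observed_rps, backoff=0.5):
    """Derive a conservative requests-per-second cap from rate-limit
    response headers. Header names follow common conventions; the
    backoff factor is an illustrative safety margin."""
    retry_after = headers.get("Retry-After")
    if retry_after is not None:
        # The site asked us to wait: cap at one request per Retry-After window.
        return min(observed_rps, 1.0 / max(float(retry_after), 1.0))
    limit = headers.get("X-RateLimit-Limit")
    reset = headers.get("X-RateLimit-Reset")  # assumed: seconds until window reset
    if limit and reset:
        window_rps = float(limit) / max(float(reset), 1.0)
        return min(observed_rps, window_rps * backoff)
    # No limit signals observed: keep the current pace.
    return observed_rps

print(derive_safe_rps({"Retry-After": "2"}, observed_rps=5.0))  # → 0.5
```

A 429 response would typically carry one of these headers, which is why the status bar lists all three signals together.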

Disable the interactive prompt when needed:

job-scraper \
  --business-url https://mossyhonda.hireology.careers/ \
  --company-name "Mossy Honda" \
  --no-control-prompt \
  --output output/mossy-scraped.csv \
  --unique-output output/mossy-unique.csv

Field Enrichment + Limits

The scraper now enriches records when source pages are sparse:

  • Derives job_summary when absent.
  • Derives job_posted_date from available text/URL patterns, with ingest date fallback.
  • Derives job_industries from curated company+industry hints, baseline company patterns, and keyword hints.
  • Applies sensible per-column word caps for listing-style data quality.
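The posted-date derivation above can be sketched as pattern matching over page text and URL, with the ingest date as a last resort. The patterns here are illustrative, not the scraper's real rule set:

```python
import re
from datetime import date

# Illustrative date patterns; the scraper's actual set may be broader.
DATE_PATTERNS = [
    r"(\d{4})-(\d{2})-(\d{2})",   # e.g. 2026-04-01 in page text
    r"(\d{4})/(\d{2})/(\d{2})",   # e.g. /2026/04/01/ in a URL path
]

def derive_posted_date(text, url="", ingest_date=None):
    """Guess a posted date from text or URL patterns, falling back to
    the ingest date when nothing matches."""
    for source in (text, url):
        for pattern in DATE_PATTERNS:
            match = re.search(pattern, source)
            if match:
                year, month, day = (int(g) for g in match.groups())
                return date(year, month, day)
    return ingest_date or date.today()

print(derive_posted_date("Posted 2026-04-01", ""))
```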

Default Cache Behavior

By default, unique rows are posted to:

  • https://jobpool.live/api/scrape-cache

No extra flag is required; a standard run submits unique rows automatically:

job-scraper \
  --business-url https://mossyhonda.hireology.careers/ \
  --company-name "Mossy Honda" \
  --output output/mossy-scraped.csv \
  --unique-output output/mossy-unique.csv

The scraper infers user_name from local environment or git config and sends:

  • user_name
  • request_timestamp
  • source_business_urls
  • listings (standard listing fields plus any additional discovered fields)
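The submission body can be sketched from the field list above. The field names come from this documentation; the value formats (such as the timestamp encoding) are assumptions:

```python
from datetime import datetime, timezone

def build_cache_payload(user_name, business_urls, listings):
    """Assemble a submission body for POST /api/scrape-cache.

    Field names match the documented payload; ISO-8601 UTC for the
    timestamp is an assumption about the wire format."""
    return {
        "user_name": user_name,
        "request_timestamp": datetime.now(timezone.utc).isoformat(),
        "source_business_urls": business_urls,
        "listings": listings,
    }

payload = build_cache_payload(
    "example-user",
    ["https://mossyhonda.hireology.careers/"],
    [{"job_title": "Technician", "company_name": "Mossy Honda"}],
)
print(sorted(payload))
```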

Disable Cache Submission

Cache submission is controlled by a single flag; pass --disable-cache when you need to skip cache persistence:

job-scraper \
  --business-url https://mossyhonda.hireology.careers/ \
  --company-name "Mossy Honda" \
  --disable-cache \
  --output output/mossy-scraped.csv \
  --unique-output output/mossy-unique.csv

Cache API

  • POST /api/scrape-cache stores a scrape request payload.
  • GET /api/scrape-cache?limit=25&user_name=<name> returns recent cached submissions.
  • GET /api/scrape-cache?leaderboard=1&leaderboard_limit=20 returns GitHub user leaderboard data with preprod_records and prod_records.
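The documented GET query strings can be assembled like this; the endpoint and parameter names come from the list above, while anything else (defaults, extra parameters) is an assumption:

```python
from urllib.parse import urlencode

BASE = "https://jobpool.live/api/scrape-cache"

def recent_submissions_url(limit=25, user_name=None):
    """Build the documented query for recent cached submissions."""
    params = {"limit": limit}
    if user_name:
        params["user_name"] = user_name
    return f"{BASE}?{urlencode(params)}"

def leaderboard_url(leaderboard_limit=20):
    """Build the documented leaderboard query."""
    return f"{BASE}?{urlencode({'leaderboard': 1, 'leaderboard_limit': leaderboard_limit})}"

print(recent_submissions_url(user_name="example-user"))
```

Any HTTP client can then issue the GET; the POST side takes the payload described under Default Cache Behavior.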

Publishing Flow

Publishing is automated through:

  • .github/workflows/publish-pypi.yml

Behavior:

  • Triggers on push/merge to main.
  • Builds distributions from pyproject.toml.
  • Checks whether the current version already exists on PyPI.
  • Publishes only when the version is new.
  • Skips cleanly when that version already exists.

To release a new version:

  1. Bump project.version in pyproject.toml.
  2. Merge to main.
  3. Wait for the publish workflow to complete.
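The version gate in the workflow reduces to a membership check: publish only when the `pyproject.toml` version is absent from PyPI. A sketch under that assumption (the actual workflow presumably queries PyPI's JSON API for the released versions, which are passed in directly here):

```python
def should_publish(project_version, released_versions):
    """Decide whether to publish: True only when the current
    project.version has never been released.

    `released_versions` stands in for whatever the workflow fetches
    from PyPI."""
    return project_version not in set(released_versions)

print(should_publish("0.1.4", ["0.1.2", "0.1.3"]))  # → True: new version, publish
print(should_publish("0.1.3", ["0.1.2", "0.1.3"]))  # → False: already on PyPI, skip
```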

Download files


Source Distribution

  • pooled_job_scraper-0.1.3.tar.gz (22.0 kB, source)

Built Distribution

  • pooled_job_scraper-0.1.3-py3-none-any.whl (20.9 kB, Python 3)

File details

Details for the file pooled_job_scraper-0.1.3.tar.gz.

File metadata

  • Download URL: pooled_job_scraper-0.1.3.tar.gz
  • Size: 22.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pooled_job_scraper-0.1.3.tar.gz:

  • SHA256: a6e3a90893ae3af62945ea6b3548e74bd50bf33acb5928929fe6251eee96ddce
  • MD5: e2f83d8e8c43dedf10f21b97e469e44b
  • BLAKE2b-256: 00722af62eb451f36f983f896b60a92712b3b5e22589cc9b6ecd6e9bf869c8c7


Provenance

The following attestation bundles were made for pooled_job_scraper-0.1.3.tar.gz:

Publisher: publish-pypi.yml on lramos0/livejobpool

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file pooled_job_scraper-0.1.3-py3-none-any.whl.

File hashes

Hashes for pooled_job_scraper-0.1.3-py3-none-any.whl:

  • SHA256: 4df86025f59a95ac36b5eee8621ba1e1a14c3afb5e8ab91a6c67b71d659954ed
  • MD5: 3f8c14e1dc528f5b396ba1121449c3e7
  • BLAKE2b-256: 570032e306b2ef3a096bfd4744e9a2ed12330c8a61a4656a2d039909031c4200


Provenance

The following attestation bundles were made for pooled_job_scraper-0.1.3-py3-none-any.whl:

Publisher: publish-pypi.yml on lramos0/livejobpool

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
