Generic job listings scraper with baseline dedupe and optional Netlify cache submission.

Project description

Pooled Job Scraper

pooled-job-scraper is published on PyPI and provides a CLI for scraping a business's careers page, extracting its job listings, and generating a unique delta (rows not already present) against the April 2026 baseline dataset.

Platform Links

How They Fit Together

  • jobpool.live is the open data pool and hydration surface.
    Use this scraper to discover and normalize job listings, then review unique delta rows before promoting data into the pool workflow.
  • mewannajob.com is the consumer-facing experience for browsing and using listings data.
    Data prepared through the pool process ultimately supports downstream job discovery use cases there.
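The baseline-dedupe idea described above can be sketched as follows; the key fields and the helper name are illustrative assumptions for this sketch, not the scraper's actual schema:

```python
# Minimal sketch of baseline dedupe: keep only scraped rows whose key
# is absent from the baseline dataset. Field names are illustrative.

def unique_delta(scraped, baseline, key_fields=("company", "title", "url")):
    seen = {tuple(row.get(f) for f in key_fields) for row in baseline}
    return [row for row in scraped
            if tuple(row.get(f) for f in key_fields) not in seen]

baseline = [{"company": "Mossy Honda", "title": "Technician", "url": "a"}]
scraped = [
    {"company": "Mossy Honda", "title": "Technician", "url": "a"},
    {"company": "Mossy Honda", "title": "Service Advisor", "url": "b"},
]
delta = unique_delta(scraped, baseline)  # keeps only the Service Advisor row
```

Rows that survive the delta are the candidates to review before promoting data into the pool workflow.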

Documentation Surfaces

  • Public pool context and hydration navigation: jobpool.live
  • Hydration docs in this repository: pool/hydration/docs/
  • Scraper implementation in this repository: scripts/generic_job_listings_scraper.py

Install

Windows

py -m pip install --upgrade pooled-job-scraper

WSL / Linux / macOS

python3 -m pip install --upgrade pooled-job-scraper

Run (Installed CLI)

job-scraper \
  --business-url https://mossyhonda.hireology.careers/ \
  --company-name "Mossy Honda" \
  --output output/mossy-scraped.csv \
  --unique-output output/mossy-unique.csv
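When driving the scrape from Python instead of a shell, the same invocation can be assembled with the standard library; the flags mirror the command above, and the subprocess call is left commented since it requires the package to be installed:

```python
import subprocess

# Assemble the CLI invocation shown above so it can be run from Python.
cmd = [
    "job-scraper",
    "--business-url", "https://mossyhonda.hireology.careers/",
    "--company-name", "Mossy Honda",
    "--output", "output/mossy-scraped.csv",
    "--unique-output", "output/mossy-unique.csv",
]
# subprocess.run(cmd, check=True)  # uncomment once the package is installed
```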

Run (From Repository Script)

py scripts/generic_job_listings_scraper.py `
  --business-url https://mossyhonda.hireology.careers/ `
  --company-name "Mossy Honda" `
  --output output/mossy-scraped.csv `
  --unique-output output/mossy-unique.csv

Send Unique Rows To Netlify Cache

job-scraper \
  --business-url https://mossyhonda.hireology.careers/ \
  --company-name "Mossy Honda" \
  --cache-endpoint https://<your-netlify-site>/api/scrape-cache \
  --output output/mossy-scraped.csv \
  --unique-output output/mossy-unique.csv

The scraper infers user_name from the local environment or git config and sends:

  • user_name
  • request_timestamp
  • source_business_urls
  • listings (standard listing fields plus any additional discovered fields)
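Under those assumptions, a submission payload might look roughly like this sketch; the field values and any listing fields beyond the names above are illustrative, since additional fields are discovered at scrape time:

```python
import json
from datetime import datetime, timezone

# Illustrative cache submission payload mirroring the fields listed above.
payload = {
    "user_name": "lramos0",  # inferred from env or git config
    "request_timestamp": datetime.now(timezone.utc).isoformat(),
    "source_business_urls": ["https://mossyhonda.hireology.careers/"],
    "listings": [
        {
            "company": "Mossy Honda",          # example listing fields only;
            "title": "Automotive Technician",  # real rows may carry extra
            "url": "https://example.invalid/job/1",  # discovered fields
        }
    ],
}
body = json.dumps(payload)
```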

Cache API

  • POST /api/scrape-cache stores a scrape request payload.
  • GET /api/scrape-cache?limit=25&user_name=<name> returns recent cached submissions.
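A minimal stdlib client for those two endpoints might look like the following; the base URL is a placeholder for your Netlify site, and this sketch has not been validated against a live deployment:

```python
import json
import urllib.parse
import urllib.request

BASE = "https://example.netlify.app/api/scrape-cache"  # substitute your site

def post_scrape(payload: dict) -> None:
    # POST /api/scrape-cache stores one scrape request payload.
    req = urllib.request.Request(
        BASE,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req)

def recent_submissions_url(limit: int = 25, user_name: str = "") -> str:
    # GET /api/scrape-cache?limit=...&user_name=... returns recent entries.
    query = urllib.parse.urlencode({"limit": limit, "user_name": user_name})
    return f"{BASE}?{query}"
```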

Publishing Flow

Publishing is automated through:

  • .github/workflows/publish-pypi.yml

Behavior:

  • Triggers on push/merge to main.
  • Builds distributions from pyproject.toml.
  • Checks whether the current version already exists on PyPI.
  • Publishes only when the version is new.
  • Skips cleanly when that version already exists.

To release a new version:

  1. Bump project.version in pyproject.toml.
  2. Merge to main.
  3. Wait for the publish workflow to complete.
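The workflow's "version already exists" check can be approximated with PyPI's public JSON API; this is an illustrative sketch, not the workflow's actual implementation:

```python
import json
import urllib.request

def released_versions(package: str) -> set:
    # PyPI's JSON API lists every released version under "releases".
    url = f"https://pypi.org/pypi/{package}/json"
    with urllib.request.urlopen(url) as resp:
        return set(json.load(resp)["releases"])

def should_publish(current_version: str, released: set) -> bool:
    # Publish only when the version from pyproject.toml is new;
    # skip cleanly when it already exists.
    return current_version not in released

# Offline illustration with a stubbed release set:
# should_publish("0.1.2", {"0.1.0", "0.1.1"}) is True
```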

Download files

Download the file for your platform.

Source Distribution

pooled_job_scraper-0.1.1.tar.gz (13.8 kB)


Built Distribution


pooled_job_scraper-0.1.1-py3-none-any.whl (13.2 kB)


File details

Details for the file pooled_job_scraper-0.1.1.tar.gz.

File metadata

  • Download URL: pooled_job_scraper-0.1.1.tar.gz
  • Size: 13.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pooled_job_scraper-0.1.1.tar.gz:

  • SHA256: 26d4ae1e384ff704023d4131a1b00d7721f2da7fbe9470d0fc04b99c561c1ed4
  • MD5: b6789799db71b1042f38fd3045cd5ea1
  • BLAKE2b-256: 5b8f5d7e6b0e0cee956bd4fb8a90cf5e127aa60cda97632aba7c80cbae6fbb6f

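A downloaded artifact can be checked against the SHA256 digest above with a short stdlib sketch:

```python
import hashlib

def sha256_of(path: str) -> str:
    # Stream the file in chunks so large artifacts need not fit in memory.
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

expected = "26d4ae1e384ff704023d4131a1b00d7721f2da7fbe9470d0fc04b99c561c1ed4"
# assert sha256_of("pooled_job_scraper-0.1.1.tar.gz") == expected
```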

Provenance

The following attestation bundles were made for pooled_job_scraper-0.1.1.tar.gz:

Publisher: publish-pypi.yml on lramos0/livejobpool

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file pooled_job_scraper-0.1.1-py3-none-any.whl.

File hashes

Hashes for pooled_job_scraper-0.1.1-py3-none-any.whl:

  • SHA256: 1ad35268da8866166d989d6b95d0b2f769c77f0cc62175ad03b2936210115c60
  • MD5: 2eb8bbb711eebfc8f6f16e6c7023b7fa
  • BLAKE2b-256: cfabb5b36129142f6b121556d4c2ed8eb870383de1104b99c737ba9582d64ace


Provenance

The following attestation bundles were made for pooled_job_scraper-0.1.1-py3-none-any.whl:

Publisher: publish-pypi.yml on lramos0/livejobpool

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
