Generic job listings scraper with baseline dedupe and integration with the open jobpool

Project description

Pooled Job Scraper

pooled-job-scraper is published on PyPI and provides a CLI for scraping business careers pages, extracting listings, and generating a unique delta against the April 2026 baseline dataset.

PyPI package: https://pypi.org/project/pooled-job-scraper/

Platform Links

How They Fit Together

  • jobpool.live is the open data pool and hydration surface.
    Use this scraper to discover and normalize job listings, then review unique delta rows before promoting data into the pool workflow.
  • mewannajob.com is the consumer-facing experience for browsing and using listings data.
    Data prepared through the pool process ultimately supports downstream job discovery use cases there.

Documentation Surfaces

  • Public pool context and hydration navigation: jobpool.live
  • Hydration docs in this repository: pool/hydration/docs/
  • Scraper implementation in this repository: scripts/generic_job_listings_scraper.py

Install

Windows

py -m pip install --upgrade pooled-job-scraper

WSL / Linux / macOS

python3 -m pip install --upgrade pooled-job-scraper
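
To confirm the install put the job-scraper entry point on your PATH, print the CLI help. (This assumes the tool exposes the conventional --help flag, as argparse-based CLIs do by default.)

job-scraper --help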

Usage

job-scraper \
  --business-url https://mossyhonda.hireology.careers/ \
  --company-name "Mossy Honda" \
  --output output/mossy-scraped.csv \
  --unique-output output/mossy-unique.csv
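
Judging by the flag names and the delta behavior described above, --output receives the full scrape while --unique-output should hold only the rows that are new relative to the baseline dataset. A quick way to compare the two with standard tools:

# Row counts for the full scrape vs. the unique delta (each file includes a header row).
wc -l output/mossy-scraped.csv output/mossy-unique.csv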

Default Cache Behavior

By default, unique rows are posted to:

  • https://jobpool.live/api/scrape-cache

The same invocation shown under Usage therefore also submits its unique rows to the cache; no extra flag is needed:

job-scraper \
  --business-url https://mossyhonda.hireology.careers/ \
  --company-name "Mossy Honda" \
  --output output/mossy-scraped.csv \
  --unique-output output/mossy-unique.csv

The scraper infers user_name from the local environment or git config and sends the following fields (a sketch of the resulting payload follows this list):

  • user_name
  • request_timestamp
  • source_business_urls
  • listings (standard listing fields plus any additional discovered fields)
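
As a rough sketch, a cache submission could look like the curl call below. The four top-level field names are the ones documented above; the JSON nesting, the example values, and the per-listing fields (title, location, url) are illustrative assumptions, not the confirmed wire format:

# Hypothetical payload: field names from the list above, values purely illustrative.
curl -X POST https://jobpool.live/api/scrape-cache \
  -H "Content-Type: application/json" \
  -d '{
    "user_name": "jane-doe",
    "request_timestamp": "2026-04-01T12:00:00Z",
    "source_business_urls": ["https://mossyhonda.hireology.careers/"],
    "listings": [
      {"title": "Service Technician", "location": "San Diego, CA", "url": "https://mossyhonda.hireology.careers/"}
    ]
  }'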

Disable Cache Submission

Pass --disable-cache, the only cache-related flag, when you need to skip cache persistence:

job-scraper \
  --business-url https://mossyhonda.hireology.careers/ \
  --company-name "Mossy Honda" \
  --disable-cache \
  --output output/mossy-scraped.csv \
  --unique-output output/mossy-unique.csv

Cache API

  • POST /api/scrape-cache stores a scrape request payload.
  • GET /api/scrape-cache?limit=25&user_name=<name> returns recent cached submissions.
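
For example, fetching the 25 most recent submissions for one user (the query parameters are exactly those shown above; jane-doe is a placeholder user_name):

curl -s "https://jobpool.live/api/scrape-cache?limit=25&user_name=jane-doe"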

Publishing Flow

Publishing is automated through:

  • .github/workflows/publish-pypi.yml

Behavior:

  • Triggers on push/merge to main.
  • Builds distributions from pyproject.toml.
  • Checks whether the current version already exists on PyPI (see the sketch after this list).
  • Publishes only when the version is new.
  • Skips cleanly when that version already exists.
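
The workflow's exact check isn't reproduced here, but one common way to test whether a version is already published is to query PyPI's public JSON API:

# Prints True if 0.1.2 already exists on PyPI (illustrative, not the workflow's own code).
curl -s https://pypi.org/pypi/pooled-job-scraper/json \
  | python3 -c 'import json, sys; print("0.1.2" in json.load(sys.stdin)["releases"])'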

To release a new version:

  1. Bump project.version in pyproject.toml.
  2. Merge to main.
  3. Wait for the publish workflow to complete.
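
A minimal command-line sketch of those steps, assuming GNU sed, a direct push to main, and 0.1.3 as a hypothetical next version (a merged pull request triggers the same workflow):

# 1. Bump project.version in pyproject.toml.
sed -i 's/^version = ".*"/version = "0.1.3"/' pyproject.toml

# 2-3. Land the change on main and let the publish workflow run.
git add pyproject.toml
git commit -m "Release 0.1.3"
git push origin main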

Download files

Download the file for your platform.

Source Distribution

pooled_job_scraper-0.1.2.tar.gz (14.0 kB)

Built Distribution

pooled_job_scraper-0.1.2-py3-none-any.whl (13.3 kB)

File details

Details for the file pooled_job_scraper-0.1.2.tar.gz.

File metadata

  • Download URL: pooled_job_scraper-0.1.2.tar.gz
  • Upload date:
  • Size: 14.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pooled_job_scraper-0.1.2.tar.gz:

  • SHA256: d92f88312934b24f1e6dcee1e083079a055c07135186f20ae92e0d6f1bbc5152
  • MD5: 6ba9a5368195699a73109a112dbb5c00
  • BLAKE2b-256: 2586f0d7a8f553649c3dc45de51eb4e96951562016cf858df550c6f5b1a31a67

Provenance

The following attestation bundles were made for pooled_job_scraper-0.1.2.tar.gz:

Publisher: publish-pypi.yml on lramos0/livejobpool

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file pooled_job_scraper-0.1.2-py3-none-any.whl.

File hashes

Hashes for pooled_job_scraper-0.1.2-py3-none-any.whl:

  • SHA256: 302dd0b744658e3c6b5a7f199a2e1b96aecda726b929464c0a9afd86be2dcf4a
  • MD5: 8d1db11d0b5466d51317f675700ee7c0
  • BLAKE2b-256: 1e51a9010f7f39359ca87d7e74510d5c38a2ed4fe7a6c4f84c5415116adfce71

Provenance

The following attestation bundles were made for pooled_job_scraper-0.1.2-py3-none-any.whl:

Publisher: publish-pypi.yml on lramos0/livejobpool

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
