Career page scraping and ATS parsing library for job search automation

Project description

strata-harvest

CI | Status: Pre-Alpha | PyPI version | Python versions | License: MIT

Career page scraping and ATS parsing library. Point it at a company careers page, get back structured job listings — regardless of which applicant tracking system they use.

Every company posts jobs differently. Greenhouse uses a REST API. Lever has a JSON feed. Ashby hides behind GraphQL. Workday is... Workday. strata-harvest handles the detection and parsing so you don't have to reverse-engineer each one.

Why This Exists

Job data is fragmented across dozens of ATS platforms, each with its own page structure, API format, and quirks. If you're building anything that needs to read job listings programmatically — a job board, a recruiting tool, a market research pipeline — you hit the same wall: every career page is a snowflake.

strata-harvest solves this with a three-step approach:

  1. Detect — Identify the ATS provider from a URL using pattern matching and DOM probing
  2. Parse — Use the provider-specific parser (REST, JSON, GraphQL) to extract structured data
  3. Fall back — For unknown providers, use an optional LLM-based extractor that reads the page and returns structured listings anyway

The result is a single harvest(url) call that returns clean, typed job data from any career page.

Quick Start

import asyncio
from strata_harvest import harvest, create_crawler

async def main():
    # One-shot: get job listings from any career page
    listings = await harvest("https://boards.greenhouse.io/example/jobs")
    for job in listings:
        print(f"{job.title} - {job.location}")

    # Reusable crawler with rate limiting and diagnostics
    crawler = create_crawler(rate_limit=2.0)
    result = await crawler.scrape("https://jobs.lever.co/example")
    print(f"Found {len(result.jobs)} jobs via {result.ats_info.provider}")
    if result.error:
        print(f"Warning: {result.error}")

asyncio.run(main())

Installation

pip install strata-harvest

For LLM-based fallback parsing (handles unknown ATS providers):

pip install strata-harvest[llm]

Requires Python 3.11+.

Features

  • ATS auto-detection — URL pattern matching and DOM probing identify the ATS provider with a confidence score, so you never need to specify it manually
  • Structured parsers — Dedicated parsers for Greenhouse (REST), Lever (JSON), and Ashby (GraphQL) that extract typed JobListing objects with normalized fields
  • LLM fallback — When no known ATS is detected, an optional LLM-based extractor reads the page and returns structured listings anyway (supports Gemini, OpenAI, Ollama, and any provider via LiteLLM)
  • Change detection — Content hashing lets you compare scrape results over time; pass a previous_hash to crawler.scrape() and check result.changed (see the polling sketch after this list)
  • Rate limiting — Built-in token-bucket rate limiter prevents overwhelming career page servers
  • Batch scraping — crawler.scrape_batch() runs multiple URLs concurrently with configurable parallelism
  • Resilient HTTP — safe_fetch() never raises; transport errors surface as structured results with retry logic
  • Typed models — Pydantic v2 models (JobListing, ScrapeResult, ATSInfo) with full type safety
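
For example, change detection can drive a simple polling loop. The sketch below sticks to the fields named above (previous_hash, content_hash, changed); whether previous_hash may be omitted on the first call, and how changed behaves then, are assumptions to verify against the API reference.

import asyncio
from strata_harvest import create_crawler

async def watch(url: str, interval_s: float = 3600.0) -> None:
    """Poll one career page and report only when its content changes."""
    crawler = create_crawler(rate_limit=1.0)
    previous_hash: str | None = None
    while True:
        if previous_hash is None:
            result = await crawler.scrape(url)
        else:
            # Passing the last content hash lets the crawler flag changes for us.
            result = await crawler.scrape(url, previous_hash=previous_hash)
        if result.error:
            print(f"Scrape failed: {result.error}")
        elif previous_hash is None or result.changed:
            print(f"{len(result.jobs)} listings (hash {result.content_hash})")
        previous_hash = result.content_hash
        await asyncio.sleep(interval_s)

asyncio.run(watch("https://boards.greenhouse.io/example/jobs"))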

How It Works

URL → ATS Detection → Provider-Specific Parser → Structured JobListings
         │                     │
         │                     ├── Greenhouse (REST API)
         │                     ├── Lever (JSON API)
         │                     ├── Ashby (GraphQL)
         │                     ├── Workday (planned)
         │                     ├── iCIMS (planned)
         │                     └── Unknown → LLM fallback
         │
         └── Pattern matching + DOM probing
             Returns ATSInfo with provider + confidence score

ATS Detection

The detector identifies providers using URL patterns and DOM signatures, returning a confidence score. This means you don't need to know which ATS a company uses — just pass the careers URL.

from strata_harvest.detector import detect_ats

info = await detect_ats("https://boards.greenhouse.io/stripe/jobs")
print(info.provider)    # ATSProvider.GREENHOUSE
print(info.confidence)  # 0.95

Provider Parsers

Each supported ATS has a dedicated parser that knows how to call its API and normalize the response into JobListing objects:

Provider     Detection      Parsing        API Type
Greenhouse   URL + DOM      Full           REST (/embed/api/v1/jobs)
Lever        URL + DOM      Full           JSON feed
Ashby        URL + DOM      Full           GraphQL
Workday      URL + DOM      Planned        -
iCIMS        URL + DOM      Planned        -
Unknown      -              LLM fallback   Page content → structured extraction

LLM Fallback

When the detector can't identify the ATS, the optional LLM fallback reads the page content and extracts job listings using structured prompts. This handles the long tail of custom career pages and lesser-known ATS platforms.

crawler = create_crawler(llm_provider="gemini/gemini-2.0-flash")
result = await crawler.scrape("https://custom-careers-page.com/jobs")

Data Models

All parsed data uses typed Pydantic models:

from strata_harvest.models import JobListing, ScrapeResult, ATSInfo

# JobListing: title, url, location, department, description, requirements, salary_range, ...
# ScrapeResult: jobs, ats_info, error, scrape_duration_ms, content_hash, changed
# ATSInfo: provider, confidence, detection_method
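
Because these are ordinary Pydantic v2 models, results serialize cleanly. Below is a minimal sketch of persisting one scrape to JSON; the output path and the shape of the payload are illustrative choices, not part of the library.

import asyncio
import json
from strata_harvest import create_crawler

async def dump_jobs(url: str, path: str = "jobs.json") -> None:
    crawler = create_crawler(rate_limit=1.0)
    result = await crawler.scrape(url)
    # Pydantic v2 models expose model_dump() for plain-dict serialization.
    payload = {
        "provider": str(result.ats_info.provider),
        "content_hash": result.content_hash,
        "jobs": [job.model_dump(mode="json") for job in result.jobs],
    }
    with open(path, "w") as fh:
        json.dump(payload, fh, indent=2)

asyncio.run(dump_jobs("https://jobs.lever.co/example"))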

Use Cases

  • Job search automation — Scrape target company career pages on a schedule, detect new postings, feed them into a matching pipeline
  • Recruiting intelligence — Monitor competitor hiring patterns, track which roles are open/closed over time, identify market signals
  • Job board aggregation — Build a focused job board for a niche (e.g., climate tech, AI/ML) by harvesting from curated company lists; see the batch sketch after this list
  • HR analytics — Track time-to-fill by monitoring when listings appear and disappear, analyze job requirement trends across an industry
  • Salary benchmarking — Collect job descriptions at scale for compensation analysis and market positioning
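
For the aggregation case, scrape_batch() handles the fan-out. The sketch below assumes it accepts a list of URLs and returns one ScrapeResult per URL; the name of the parallelism option isn't shown in this README, so it is left at its default here.

import asyncio
from strata_harvest import create_crawler

# Curated company list for a niche job board (example URLs).
COMPANY_PAGES = [
    "https://boards.greenhouse.io/example/jobs",
    "https://jobs.lever.co/example",
]

async def aggregate() -> None:
    crawler = create_crawler(rate_limit=1.0)
    # scrape_batch runs the URLs concurrently and returns structured results,
    # so one failing page does not take down the whole sweep.
    results = await crawler.scrape_batch(COMPANY_PAGES)
    for result in results:
        if result.error:
            print(f"Skipped: {result.error}")
            continue
        for job in result.jobs:
            print(f"{job.title} ({job.location}) -> {job.url}")

asyncio.run(aggregate())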

Guides

  • Adding a New ATS Parser — Step-by-step guide for contributors
  • LLM Configuration — How to configure Gemini, OpenAI, Ollama, or any LiteLLM provider for fallback extraction
  • Advanced Usage — Custom crawlers, rate limiting, batch scraping, change detection, and proxy setup

Part of the Strata Ecosystem

strata-harvest is the data collection layer for Strata — an autonomous AI job search platform where specialized agents collaborate to discover, evaluate, and match job opportunities. In that context, strata-harvest feeds the Scraper Agent, which runs daily sweeps across target company career pages and routes new listings through a deduplication and matching pipeline.

But strata-harvest is fully standalone. It has no dependency on the Strata platform and works anywhere you need structured job data from career pages.

Development

Requires Python 3.11+ and uv (or pip/venv).

git clone https://github.com/andrewcrenshaw/strata-harvest.git
cd strata-harvest

# Install with dev dependencies
uv sync --all-extras

# Run tests
uv run pytest

# Lint
uv run ruff check .

# Type check
uv run mypy src/strata_harvest

Adding a New Parser

Each ATS provider gets its own parser module in src/strata_harvest/parsers/. Parsers extend BaseParser and implement parse(content, *, url) -> list[JobListing]. See docs/adding-a-parser.md for the full walkthrough, or parsers/greenhouse.py for reference.
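
A skeleton of what a new parser can look like. The import path for BaseParser, the JobListing field requirements, and the _extract_raw_postings helper are assumptions made for illustration; follow docs/adding-a-parser.md for the real contract.

from strata_harvest.models import JobListing
from strata_harvest.parsers.base import BaseParser  # assumed module path

class ExampleATSParser(BaseParser):
    """Sketch of a parser for a hypothetical 'ExampleATS' provider."""

    def parse(self, content: str, *, url: str) -> list[JobListing]:
        listings: list[JobListing] = []
        # Walk whatever structure the provider exposes (JSON feed, HTML,
        # GraphQL response) and normalize each posting.
        for raw in self._extract_raw_postings(content):
            listings.append(
                JobListing(
                    title=raw["title"],
                    url=raw["absolute_url"],
                    location=raw.get("location"),
                )
            )
        return listings

    def _extract_raw_postings(self, content: str) -> list[dict]:
        # Provider-specific extraction goes here; stubbed out in this sketch.
        return []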

API Reference

API documentation is auto-generated from docstrings using mkdocs with the mkdocstrings plugin.

pip install -e ".[docs]"
mkdocs serve

Then open http://localhost:8000 to browse the full API reference.

License

MIT
