Career page scraping and ATS parsing library for job search automation

Project description

strata-harvest

CI | Status: Pre-Alpha | PyPI version | Python versions | License: MIT

Career page scraping and ATS parsing library. Point it at a company careers page, get back structured job listings — regardless of which applicant tracking system they use.

Every company posts jobs differently. Greenhouse uses a REST API. Lever has a JSON feed. Ashby hides behind GraphQL. Workday is... Workday. strata-harvest handles the detection and parsing so you don't have to reverse-engineer each one.

Why This Exists

Job data is fragmented across dozens of ATS platforms, each with its own page structure, API format, and quirks. If you're building anything that needs to read job listings programmatically — a job board, a recruiting tool, a market research pipeline — you hit the same wall: every career page is a snowflake.

strata-harvest solves this with a three-step approach:

  1. Detect — Identify the ATS provider from a URL using pattern matching and DOM probing
  2. Parse — Use the provider-specific parser (REST, JSON, GraphQL) to extract structured data
  3. Fall back — For unknown providers, use an optional LLM-based extractor that reads the page and returns structured listings anyway

The result is a single harvest(url) call that returns clean, typed job data from any career page.

Quick Start

import asyncio
from strata_harvest import harvest, create_crawler

async def main():
    # One-shot: get job listings from any career page
    listings = await harvest("https://boards.greenhouse.io/example/jobs")
    for job in listings:
        print(f"{job.title} - {job.location}")

    # Reusable crawler with rate limiting and diagnostics
    crawler = create_crawler(rate_limit=2.0)
    result = await crawler.scrape("https://jobs.lever.co/example")
    print(f"Found {len(result.jobs)} jobs via {result.ats_info.provider}")
    if result.error:
        print(f"Warning: {result.error}")

asyncio.run(main())

Installation

pip install strata-harvest

For LLM-based fallback parsing (handles unknown ATS providers):

pip install strata-harvest[llm]

Requires Python 3.11+.

Features

  • ATS auto-detection — URL pattern matching and DOM probing identify the ATS provider with a confidence score, so you never need to specify it manually
  • Structured parsers — Dedicated parsers for Greenhouse (REST), Lever (JSON), and Ashby (GraphQL) that extract typed JobListing objects with normalized fields
  • LLM fallback — When no known ATS is detected, an optional LLM-based extractor reads the page and returns structured listings anyway (supports Gemini, OpenAI, Ollama, and any provider via LiteLLM)
  • Change detection — Content hashing lets you compare scrape results over time; pass a previous_hash to crawler.scrape() and check result.changed (see the polling sketch after this list)
  • Rate limiting — Built-in token-bucket rate limiter prevents overwhelming career page servers
  • Batch scraping — crawler.scrape_batch() runs multiple URLs concurrently with configurable parallelism
  • Resilient HTTP — safe_fetch() never raises; transport errors surface as structured results with retry logic
  • Typed models — Pydantic v2 models (JobListing, ScrapeResult, ATSInfo) with full type safety
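
For example, change detection can drive a simple polling loop. The sketch below sticks to the fields named above (previous_hash, content_hash, changed); whether previous_hash may be omitted on the first call, and how changed behaves then, are assumptions to verify against the API reference.

import asyncio
from strata_harvest import create_crawler

async def watch(url: str, interval_s: float = 3600.0) -> None:
    """Poll one career page and report only when its content changes."""
    crawler = create_crawler(rate_limit=1.0)
    previous_hash: str | None = None
    while True:
        if previous_hash is None:
            result = await crawler.scrape(url)
        else:
            # Passing the last content hash lets the crawler flag changes for us.
            result = await crawler.scrape(url, previous_hash=previous_hash)
        if result.error:
            print(f"Scrape failed: {result.error}")
        elif previous_hash is None or result.changed:
            print(f"{len(result.jobs)} listings (hash {result.content_hash})")
        previous_hash = result.content_hash
        await asyncio.sleep(interval_s)

asyncio.run(watch("https://boards.greenhouse.io/example/jobs"))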

How It Works

URL → ATS Detection → Provider-Specific Parser → Structured JobListings
         │                     │
         │                     ├── Greenhouse (REST API)
         │                     ├── Lever (JSON API)
         │                     ├── Ashby (GraphQL)
         │                     ├── Workday (planned)
         │                     ├── iCIMS (planned)
         │                     └── Unknown → LLM fallback
         │
         └── Pattern matching + DOM probing
             Returns ATSInfo with provider + confidence score

ATS Detection

The detector identifies providers using URL patterns and DOM signatures, returning a confidence score. This means you don't need to know which ATS a company uses — just pass the careers URL.

from strata_harvest.detector import detect_ats

info = await detect_ats("https://boards.greenhouse.io/stripe/jobs")
print(info.provider)    # ATSProvider.GREENHOUSE
print(info.confidence)  # 0.95

Provider Parsers

Each supported ATS has a dedicated parser that knows how to call its API and normalize the response into JobListing objects:

Provider     Detection      Parsing        API Type
Greenhouse   URL + DOM      Full           REST (/embed/api/v1/jobs)
Lever        URL + DOM      Full           JSON feed
Ashby        URL + DOM      Full           GraphQL
Workday      URL + DOM      Planned        -
iCIMS        URL + DOM      Planned        -
Unknown      -              LLM fallback   Page content → structured extraction

LLM Fallback

When the detector can't identify the ATS, the optional LLM fallback reads the page content and extracts job listings using structured prompts. This handles the long tail of custom career pages and lesser-known ATS platforms.

crawler = create_crawler(llm_provider="gemini/gemini-2.0-flash")
result = await crawler.scrape("https://custom-careers-page.com/jobs")

Data Models

All parsed data uses typed Pydantic models:

from strata_harvest.models import JobListing, ScrapeResult, ATSInfo

# JobListing: title, url, location, department, description, requirements, salary_range, ...
# ScrapeResult: jobs, ats_info, error, scrape_duration_ms, content_hash, changed
# ATSInfo: provider, confidence, detection_method
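
Because these are ordinary Pydantic v2 models, results serialize cleanly. Below is a minimal sketch of persisting one scrape to JSON; the output path and the shape of the payload are illustrative choices, not part of the library.

import asyncio
import json
from strata_harvest import create_crawler

async def dump_jobs(url: str, path: str = "jobs.json") -> None:
    crawler = create_crawler(rate_limit=1.0)
    result = await crawler.scrape(url)
    # Pydantic v2 models expose model_dump() for plain-dict serialization.
    payload = {
        "provider": str(result.ats_info.provider),
        "content_hash": result.content_hash,
        "jobs": [job.model_dump(mode="json") for job in result.jobs],
    }
    with open(path, "w") as fh:
        json.dump(payload, fh, indent=2)

asyncio.run(dump_jobs("https://jobs.lever.co/example"))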

Use Cases

  • Job search automation — Scrape target company career pages on a schedule, detect new postings, feed them into a matching pipeline
  • Recruiting intelligence — Monitor competitor hiring patterns, track which roles are open/closed over time, identify market signals
  • Job board aggregation — Build a focused job board for a niche (e.g., climate tech, AI/ML) by harvesting from curated company lists; see the batch sketch after this list
  • HR analytics — Track time-to-fill by monitoring when listings appear and disappear, analyze job requirement trends across an industry
  • Salary benchmarking — Collect job descriptions at scale for compensation analysis and market positioning
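
For the aggregation case, scrape_batch() handles the fan-out. The sketch below assumes it accepts a list of URLs and returns one ScrapeResult per URL; the name of the parallelism option isn't shown in this README, so it is left at its default here.

import asyncio
from strata_harvest import create_crawler

# Curated company list for a niche job board (example URLs).
COMPANY_PAGES = [
    "https://boards.greenhouse.io/example/jobs",
    "https://jobs.lever.co/example",
]

async def aggregate() -> None:
    crawler = create_crawler(rate_limit=1.0)
    # scrape_batch runs the URLs concurrently and returns structured results,
    # so one failing page does not take down the whole sweep.
    results = await crawler.scrape_batch(COMPANY_PAGES)
    for result in results:
        if result.error:
            print(f"Skipped: {result.error}")
            continue
        for job in result.jobs:
            print(f"{job.title} ({job.location}) -> {job.url}")

asyncio.run(aggregate())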

Guides

  • Adding a New ATS Parser — Step-by-step guide for contributors
  • LLM Configuration — How to configure Gemini, OpenAI, Ollama, or any LiteLLM provider for fallback extraction
  • Advanced Usage — Custom crawlers, rate limiting, batch scraping, change detection, and proxy setup

Part of the Strata Ecosystem

strata-harvest is the data collection layer for Strata — an autonomous AI job search platform where specialized agents collaborate to discover, evaluate, and match job opportunities. In that context, strata-harvest feeds the Scraper Agent, which runs daily sweeps across target company career pages and routes new listings through a deduplication and matching pipeline.

But strata-harvest is fully standalone. It has no dependency on the Strata platform and works anywhere you need structured job data from career pages.

Development

Requires Python 3.11+ and uv (or pip/venv).

git clone https://github.com/andrewcrenshaw/strata-harvest.git
cd strata-harvest

# Install with dev dependencies
uv sync --all-extras

# Run tests
uv run pytest

# Lint
uv run ruff check .

# Type check
uv run mypy src/strata_harvest

Adding a New Parser

Each ATS provider gets its own parser module in src/strata_harvest/parsers/. Parsers extend BaseParser and implement parse(content, *, url) -> list[JobListing]. See docs/adding-a-parser.md for the full walkthrough, or parsers/greenhouse.py for reference.
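
A skeleton of what a new parser can look like. The import path for BaseParser, the JobListing field requirements, and the _extract_raw_postings helper are assumptions made for illustration; follow docs/adding-a-parser.md for the real contract.

from strata_harvest.models import JobListing
from strata_harvest.parsers.base import BaseParser  # assumed module path

class ExampleATSParser(BaseParser):
    """Sketch of a parser for a hypothetical 'ExampleATS' provider."""

    def parse(self, content: str, *, url: str) -> list[JobListing]:
        listings: list[JobListing] = []
        # Walk whatever structure the provider exposes (JSON feed, HTML,
        # GraphQL response) and normalize each posting.
        for raw in self._extract_raw_postings(content):
            listings.append(
                JobListing(
                    title=raw["title"],
                    url=raw["absolute_url"],
                    location=raw.get("location"),
                )
            )
        return listings

    def _extract_raw_postings(self, content: str) -> list[dict]:
        # Provider-specific extraction goes here; stubbed out in this sketch.
        return []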

API Reference

API documentation is auto-generated from docstrings using mkdocs with the mkdocstrings plugin.

pip install -e ".[docs]"
mkdocs serve

Then open http://localhost:8000 to browse the full API reference.

License

MIT
