# strata-harvest
Career page scraping and ATS parsing library. Point it at a company careers page, get back structured job listings — regardless of which applicant tracking system they use.
Every company posts jobs differently. Greenhouse uses a REST API. Lever has a JSON feed. Ashby hides behind GraphQL. Workday is... Workday. strata-harvest handles the detection and parsing so you don't have to reverse-engineer each one.
## Why This Exists
Job data is fragmented across dozens of ATS platforms, each with its own page structure, API format, and quirks. If you're building anything that needs to read job listings programmatically — a job board, a recruiting tool, a market research pipeline — you hit the same wall: every career page is a snowflake.
strata-harvest solves this with a three-step approach:
- Detect — Identify the ATS provider from a URL using pattern matching and DOM probing
- Parse — Use the provider-specific parser (REST, JSON, GraphQL) to extract structured data
- Fall back — For unknown providers, use an optional LLM-based extractor that reads the page and returns structured listings anyway
The result is a single `harvest(url)` call that returns clean, typed job data from any career page.
## Quick Start

```python
import asyncio

from strata_harvest import harvest, create_crawler


async def main():
    # One-shot: get job listings from any career page
    listings = await harvest("https://boards.greenhouse.io/example/jobs")
    for job in listings:
        print(f"{job.title} — {job.location}")

    # Reusable crawler with rate limiting and diagnostics
    crawler = create_crawler(rate_limit=2.0)
    result = await crawler.scrape("https://jobs.lever.co/example")
    print(f"Found {len(result.jobs)} jobs via {result.ats_info.provider}")
    if result.error:
        print(f"Warning: {result.error}")


asyncio.run(main())
```
## Installation

```bash
pip install strata-harvest
```

For LLM-based fallback parsing (handles unknown ATS providers):

```bash
pip install "strata-harvest[llm]"
```

Requires Python 3.11+.
## Features

- ATS auto-detection — URL pattern matching and DOM probing identify the ATS provider with a confidence score, so you never need to specify it manually
- Structured parsers — Dedicated parsers for Greenhouse (REST), Lever (JSON), and Ashby (GraphQL) that extract typed `JobListing` objects with normalized fields
- LLM fallback — When no known ATS is detected, an optional LLM-based extractor reads the page and returns structured listings anyway (supports Gemini, OpenAI, Ollama, and any provider via LiteLLM)
- Change detection — Content hashing lets you compare scrape results over time; pass a `previous_hash` to `crawler.scrape()` and check `result.changed`
- Rate limiting — Built-in token-bucket rate limiter prevents overwhelming career page servers
- Batch scraping — `crawler.scrape_batch()` runs multiple URLs concurrently with configurable parallelism
- Resilient HTTP — `safe_fetch()` never raises; transport errors surface as structured results with retry logic
- Typed models — Pydantic v2 models (`JobListing`, `ScrapeResult`, `ATSInfo`) with full type safety
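The change-detection idea can be pictured with a small, dependency-free sketch. The hashed fields below are arbitrary choices for illustration, not the library's actual hashing scheme:

```python
import hashlib


def content_hash(listings: list[dict]) -> str:
    """Order-independent hash over the fields that matter for change detection."""
    keys = sorted(f"{job['title']}|{job['url']}" for job in listings)
    return hashlib.sha256("\n".join(keys).encode()).hexdigest()


previous = content_hash([
    {"title": "Backend Engineer", "url": "https://example.com/1"},
])
current = content_hash([
    {"title": "Backend Engineer", "url": "https://example.com/1"},
    {"title": "Data Scientist", "url": "https://example.com/2"},
])
changed = previous != current  # True: a new listing appeared
```

Sorting before hashing means re-ordered but otherwise identical listings produce the same digest, so only genuine additions, removals, or edits flip the `changed` flag.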
## How It Works

```text
URL → ATS Detection → Provider-Specific Parser → Structured JobListings
            │                     │
            │                     ├── Greenhouse (REST API)
            │                     ├── Lever (JSON API)
            │                     ├── Ashby (GraphQL)
            │                     ├── Workday (planned)
            │                     ├── iCIMS (planned)
            │                     └── Unknown → LLM fallback
            │
            └── Pattern matching + DOM probing
                Returns ATSInfo with provider + confidence score
```
### ATS Detection

The detector identifies providers using URL patterns and DOM signatures, returning a confidence score. This means you don't need to know which ATS a company uses — just pass the careers URL.

```python
from strata_harvest.detector import detect_ats

info = await detect_ats("https://boards.greenhouse.io/stripe/jobs")
print(info.provider)    # ATSProvider.GREENHOUSE
print(info.confidence)  # 0.95
```
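The URL-pattern half of detection can be sketched in a few lines. The patterns and confidence values here are illustrative assumptions; the real detector also probes the DOM when URLs alone are ambiguous:

```python
import re

# Hosted-board URL patterns for the three fully supported providers
ATS_PATTERNS = {
    "greenhouse": re.compile(r"boards\.greenhouse\.io/[\w-]+"),
    "lever": re.compile(r"jobs\.lever\.co/[\w-]+"),
    "ashby": re.compile(r"jobs\.ashbyhq\.com/[\w-]+"),
}


def detect_from_url(url: str) -> tuple[str, float]:
    """Return (provider, confidence); a hosted-board URL match is high confidence."""
    for provider, pattern in ATS_PATTERNS.items():
        if pattern.search(url):
            return provider, 0.95
    return "unknown", 0.0


print(detect_from_url("https://boards.greenhouse.io/stripe/jobs"))
# ('greenhouse', 0.95)
```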
### Provider Parsers

Each supported ATS has a dedicated parser that knows how to call its API and normalize the response into `JobListing` objects:
| Provider | Detection | Parsing | API Type |
|---|---|---|---|
| Greenhouse | URL + DOM | Full | REST (`/embed/api/v1/jobs`) |
| Lever | URL + DOM | Full | JSON feed |
| Ashby | URL + DOM | Full | GraphQL |
| Workday | URL + DOM | Planned | — |
| iCIMS | URL + DOM | Planned | — |
| Unknown | — | LLM fallback | Page content → structured extraction |
### LLM Fallback

When the detector can't identify the ATS, the optional LLM fallback reads the page content and extracts job listings using structured prompts. This handles the long tail of custom career pages and lesser-known ATS platforms.

```python
crawler = create_crawler(llm_provider="gemini/gemini-2.0-flash")
result = await crawler.scrape("https://custom-careers-page.com/jobs")
```
## Data Models

All parsed data uses typed Pydantic models:

```python
from strata_harvest.models import JobListing, ScrapeResult, ATSInfo

# JobListing: title, url, location, department, description, requirements, salary_range, ...
# ScrapeResult: jobs, ats_info, error, scrape_duration_ms, content_hash, changed
# ATSInfo: provider, confidence, detection_method
```
## Use Cases
- Job search automation — Scrape target company career pages on a schedule, detect new postings, feed them into a matching pipeline
- Recruiting intelligence — Monitor competitor hiring patterns, track which roles are open/closed over time, identify market signals
- Job board aggregation — Build a focused job board for a niche (e.g., climate tech, AI/ML) by harvesting from curated company lists
- HR analytics — Track time-to-fill by monitoring when listings appear and disappear, analyze job requirement trends across an industry
- Salary benchmarking — Collect job descriptions at scale for compensation analysis and market positioning
## Guides
- Adding a New ATS Parser — Step-by-step guide for contributors
- LLM Configuration — How to configure Gemini, OpenAI, Ollama, or any LiteLLM provider for fallback extraction
- Advanced Usage — Custom crawlers, rate limiting, batch scraping, change detection, and proxy setup
## Part of the Strata Ecosystem
strata-harvest is the data collection layer for Strata — an autonomous AI job search platform where specialized agents collaborate to discover, evaluate, and match job opportunities. In that context, strata-harvest feeds the Scraper Agent, which runs daily sweeps across target company career pages and routes new listings through a deduplication and matching pipeline.
But strata-harvest is fully standalone. It has no dependency on the Strata platform and works anywhere you need structured job data from career pages.
## Development

Requires Python 3.11+ and uv (or pip/venv).

```bash
git clone https://github.com/andrewcrenshaw/strata-harvest.git
cd strata-harvest

# Install with dev dependencies
uv sync --all-extras

# Run tests
uv run pytest

# Lint
uv run ruff check .

# Type check
uv run mypy src/strata_harvest
```
### Adding a New Parser

Each ATS provider gets its own parser module in `src/strata_harvest/parsers/`. Parsers extend `BaseParser` and implement `parse(content, *, url) -> list[JobListing]`. See `docs/adding-a-parser.md` for the full walkthrough, or `parsers/greenhouse.py` for reference.
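The interface described above might look roughly like this. The `parse(content, *, url)` signature comes from the paragraph; everything else (the dict return type, the `JsonFeedParser` example provider) is a hypothetical, self-contained sketch:

```python
import json
from abc import ABC, abstractmethod


class BaseParser(ABC):
    """Sketch of the parser interface: turn raw page/API content into job records.
    The real BaseParser returns JobListing models; plain dicts keep this runnable."""

    @abstractmethod
    def parse(self, content: str, *, url: str) -> list[dict]:
        ...


class JsonFeedParser(BaseParser):
    """Hypothetical parser for a provider exposing a plain JSON list of jobs."""

    def parse(self, content: str, *, url: str) -> list[dict]:
        # Normalize each raw item into the shared field names
        return [
            {"title": item["title"], "url": item["url"], "source": url}
            for item in json.loads(content)
        ]


feed = '[{"title": "Data Engineer", "url": "https://example.com/jobs/1"}]'
jobs = JsonFeedParser().parse(feed, url="https://example.com/careers")
print(jobs[0]["title"])  # Data Engineer
```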
## API Reference

API documentation is auto-generated from docstrings using mkdocs with the mkdocstrings plugin.

```bash
pip install -e ".[docs]"
mkdocs serve
```

Then open http://localhost:8000 to browse the full API reference.
## License
MIT
## File: strata_harvest-0.1.5.tar.gz (source distribution)

- Size: 289.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3

| Algorithm | Hash digest |
|---|---|
| SHA256 | `c6599312f3f0a6a9ade78809a25b0d5604aac79ec2bbebf7a9ce61b60e716bf4` |
| MD5 | `13cdd3b2b7e14be71e66152051d49b62` |
| BLAKE2b-256 | `f4ee952a34f1ddfd6684cb8abb328f432425ff35cf42d9f83ccd5fbbcc012567` |
## File: strata_harvest-0.1.5-py3-none-any.whl (built distribution)

- Size: 56.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3

| Algorithm | Hash digest |
|---|---|
| SHA256 | `edd44774bae0dd8bece000c32c7c0f6610c34c653a3593957ec73c75f906a543` |
| MD5 | `24cc21720307a079bdf7926433b13b8c` |
| BLAKE2b-256 | `905ac5640b2c90884bb8bd6f79f23259e12ab41845ca4111f56f4977f974334e` |