# ijobs-scraper

Job scraping & AI enrichment engine for African job markets.
## Features

- 13 built-in adapters covering Kenyan job portals — API, HTML, and browser-based sources
- AI enrichment via OpenAI structured outputs — extracts titles, skills, salary, categories, and more from raw listings
- 3-layer deduplication — source URL uniqueness, SHA-256 content hashing, and optional fuzzy matching
- Async-first — all I/O uses `async`/`await`, with `httpx` for HTTP and `playwright` for browser automation
- Framework-agnostic — zero coupling to FastAPI, Django, or any web framework. Integrate anywhere via Protocol-based dependency injection
- Type-safe — Pydantic v2 models throughout, passes `mypy --strict`
- Cron scheduling — built-in schedule evaluation via `croniter`. Your app decides when and how to enqueue
- Open-source — MIT licensed, designed for contributors to add new portal adapters in ~50 lines of code
## Quick Start

```shell
pip install ijobs-scraper
```

```python
import asyncio

from ijobs_scraper import ScraperEngine, SourceConfig, SourceType


class StubAIProvider:
    """Minimal AI provider for testing — returns mock enriched data."""

    async def structured_extract(self, system_prompt, user_prompt, json_schema):
        return {
            "title": "Software Engineer",
            "description": "A great role.",
            "company_name": "One Acre Fund",
            "company_website": None,
            "location": "Nairobi, Kenya",
            "remote_type": "hybrid",
            "employment_type": "full_time",
            "experience_level": "Mid-level",
            "salary_min": None,
            "salary_max": None,
            "currency": "KES",
            "skills": ["Python", "SQL"],
            "benefits": [],
            "category": "technology-engineering",
            "requirements": None,
            "posted_at": None,
            "expires_at": None,
        }


async def main():
    engine = ScraperEngine(ai_provider=StubAIProvider())
    source = SourceConfig(
        name="One Acre Fund",
        slug="one-acre-fund",
        adapter="greenhouse",
        source_type=SourceType.API,
        base_url="https://boards-api.greenhouse.io",
        config={"board_token": "oneacrefund"},
    )
    result = await engine.scrape_source(source)
    print(f"Found: {result.jobs_found} | Created: {result.jobs_created} | Duplicates: {result.jobs_duplicated}")


asyncio.run(main())
```
## Supported Sources

| Source | Adapter | Type | Reusable | Status |
|---|---|---|---|---|
| Kenya Airways | `kenya_airways` | API | No | Stable |
| One Acre Fund | `greenhouse` | API | Yes | Stable |
| Amref Health Africa | `smartrecruiters` | API | Yes | Stable |
| Careerjet Kenya | `careerjet` | API | Yes | Stable |
| ReliefWeb | `reliefweb` | API | No | Stable |
| BrighterMonday | `brightermonday` | HTML | No | Stable |
| MyJobMag Kenya | `myjobmag` | HTML | No | Stable |
| MyGov Kenya | `mygov` | HTML | No | Stable |
| Fuzu Kenya | `fuzu` | HTML | No | Stable |
| KCB Bank | `kcb` | HTML | No | Stable |
| Absa Bank | `workday` | Browser | Yes | Stable |
| NCBA Bank | `workday` | Browser | Yes | Stable |
| Impactpool | `impactpool` | Browser | No | Stable |
| World Vision | `world_vision` | Browser | No | Stable |

14 sources, 13 adapters — Absa and NCBA share the reusable `workday` adapter with different config.
Reusable adapters work with any employer on the same platform. For example, the `greenhouse` adapter works for any company using Greenhouse by changing the `board_token` config.
## How It Works

The scraping pipeline follows a linear flow:

```
Source → Adapter → RawListing → AI Enrichment → Dedup → EnrichedJob
```

- The Adapter fetches raw job listings from a portal (via REST API, HTML parsing, or browser automation)
- The Enrichment Pipeline sends raw content to your AI provider with a strict JSON schema, producing structured job data
- The Dedup Engine checks for duplicates across three layers: source URL uniqueness, SHA-256 content hashing, and optional fuzzy matching
- The Engine emits each unique `EnrichedJob` through your callback
See the full architecture doc for detailed diagrams and specifications.
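To make the three dedup layers concrete, here is a self-contained sketch of the same idea. The class below is illustrative only and is not `ijobs-scraper`'s internal API (the real engine persists known URLs and hashes through your `StorageBackend`):

```python
import hashlib
from difflib import SequenceMatcher


class DedupSketch:
    """Illustrative three-layer dedup; not ijobs-scraper's internal API."""

    def __init__(self, fuzzy_threshold: float = 0.9) -> None:
        self.seen_urls: set[str] = set()
        self.seen_hashes: set[str] = set()
        self.seen_texts: list[str] = []
        self.fuzzy_threshold = fuzzy_threshold

    def is_duplicate(self, url: str, content: str) -> bool:
        # Layer 1: exact source-URL uniqueness
        if url in self.seen_urls:
            return True
        # Layer 2: SHA-256 hash of normalized content
        digest = hashlib.sha256(content.strip().lower().encode()).hexdigest()
        if digest in self.seen_hashes:
            return True
        # Layer 3 (optional in the engine): fuzzy similarity to seen listings
        if any(
            SequenceMatcher(None, content, seen).ratio() >= self.fuzzy_threshold
            for seen in self.seen_texts
        ):
            return True
        self.seen_urls.add(url)
        self.seen_hashes.add(digest)
        self.seen_texts.append(content)
        return False
```

The layering matters for cost: URL and hash checks are O(1) set lookups, so the comparatively expensive fuzzy pass only runs on listings that survive the first two layers.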
## Architecture
ijobs-scraper uses Protocol-based dependency injection to stay completely decoupled from any framework. Your application provides three interfaces:
- `AIProvider` — wraps your OpenAI (or any LLM) calls for structured extraction
- `StorageBackend` — provides persistence for deduplication (known URLs, content hashes)
- `JobCallback` — handles each enriched job (save to DB, send notification, index for search)
The Adapter Registry maps adapter names to classes and auto-detects adapters from URLs for manual scraping. Each adapter is a self-contained class registered via decorator.
For the complete architecture specification, see docs/scraper-engine.md.
## Usage Examples

### 1. Auto-scrape a source

```python
import asyncio

from ijobs_scraper import ScraperEngine, SourceConfig, SourceType

engine = ScraperEngine(ai_provider=my_ai_provider, storage=my_storage, on_job=my_callback)

source = SourceConfig(
    name="Amref Health Africa",
    slug="amref",
    adapter="smartrecruiters",
    source_type=SourceType.API,
    base_url="https://api.smartrecruiters.com/v1/companies/AmrefHealthAfrica4",
    config={"company_slug": "AmrefHealthAfrica4"},
)

result = asyncio.run(engine.scrape_source(source))
print(f"Status: {result.status} | Created: {result.jobs_created}")
```
### 2. Parse a single URL

```python
import asyncio

from ijobs_scraper import ScraperEngine

engine = ScraperEngine(ai_provider=my_ai_provider)

# Auto-detects the adapter from the URL domain
job = asyncio.run(engine.parse_url("https://boards.greenhouse.io/oneacrefund/jobs/12345"))
print(f"{job.title} at {job.company_name} — {job.location}")
```
### 3. Integrate with FastAPI

```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

from ijobs_scraper import ScraperEngine, EnrichedJob

app = FastAPI()
engine = ScraperEngine(ai_provider=my_ai_provider)


class ParseRequest(BaseModel):
    url: str
    hint: str | None = None


@app.post("/scraper/parse-url", response_model=EnrichedJob)
async def parse_url(request: ParseRequest):
    try:
        return await engine.parse_url(request.url, hint=request.hint)
    except Exception as e:
        raise HTTPException(status_code=422, detail=str(e))
```
## Adding an Adapter

Adding support for a new job portal takes ~50 lines of code:

1. Create the adapter file in the correct directory (`api/`, `html/`, or `browser/`)
2. Subclass the right base: `APIAdapter`, `HTMLAdapter`, or `BrowserAdapter`
3. Implement `fetch_listings()` and `can_handle_url()`
4. Register with `@AdapterRegistry.register("your_adapter")`
5. Add tests in `tests/adapters/`
```python
from ijobs_scraper.adapters.base import APIAdapter
from ijobs_scraper._registry import AdapterRegistry
from ijobs_scraper.models import RawListing, SourceConfig


@AdapterRegistry.register("my_portal")
class MyPortalAdapter(APIAdapter):
    async def fetch_listings(self, config: SourceConfig):
        data = await self._get(f"{config.base_url}/jobs")
        for item in data["results"]:
            yield RawListing(
                external_id=str(item["id"]),
                external_url=item["url"],
                title=item.get("title"),
                raw_json=item,
                company_name=config.name,
            )

    def can_handle_url(self, url: str) -> bool:
        return "myportal.com" in url
```
See the full adapter guide and docs/adding-an-adapter.md for detailed instructions.
## Configuration

### SourceConfig

| Field | Type | Description |
|---|---|---|
| `name` | `str` | Human-readable source name (e.g., "Kenya Airways") |
| `slug` | `str` | URL-safe identifier (e.g., "kenya-airways") |
| `adapter` | `str` | Registered adapter name (e.g., "greenhouse") |
| `source_type` | `SourceType` | One of: `api`, `html`, `browser`, `rss` |
| `base_url` | `str` | Base URL for the source |
| `cron_schedule` | `str \| None` | Cron expression for auto-scraping (e.g., `"0 */6 * * *"`) |
| `is_active` | `bool` | Whether this source is enabled (default: `True`) |
| `config` | `dict` | Adapter-specific configuration (see below) |
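The engine delegates schedule evaluation to `croniter` (surfaced as `get_due_sources`), so your app never parses cron strings itself. As a rough illustration of what a five-field expression such as `0 */6 * * *` (on the hour, every six hours) asks for, here is a simplified stdlib-only matcher; unlike real cron it handles only `*`, `*/n`, and comma lists, and it ignores cron's Sunday-based day-of-week numbering:

```python
from datetime import datetime


def _field_matches(field: str, value: int) -> bool:
    """Match one cron field: '*', '*/n', or a comma list of numbers."""
    if field == "*":
        return True
    if field.startswith("*/"):
        return value % int(field[2:]) == 0
    return value in {int(part) for part in field.split(",")}


def cron_due(expr: str, now: datetime) -> bool:
    """Simplified five-field cron check (minute hour dom month dow)."""
    minute, hour, dom, month, dow = expr.split()
    return (
        _field_matches(minute, now.minute)
        and _field_matches(hour, now.hour)
        and _field_matches(dom, now.day)
        and _field_matches(month, now.month)
        and _field_matches(dow, now.weekday())  # caveat: real cron uses 0=Sunday
    )
```

In practice a host app would run a periodic tick, ask the library which sources are due, and enqueue them with its own task queue.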
### Adapter-specific config examples

```python
import os

# Greenhouse — any employer using Greenhouse
SourceConfig(
    adapter="greenhouse",
    config={"board_token": "oneacrefund"},
    ...
)

# SmartRecruiters — any employer using SmartRecruiters
SourceConfig(
    adapter="smartrecruiters",
    config={"company_slug": "AmrefHealthAfrica4"},
    ...
)

# Workday — any employer using Workday (Absa, NCBA, etc.)
SourceConfig(
    adapter="workday",
    config={"tenant": "absa", "instance": "AbsaCareers"},
    ...
)

# Careerjet — meta-aggregator covering 60+ sites (v4 API, requires API key)
SourceConfig(
    adapter="careerjet",
    config={"api_key": os.environ["CAREERJET_API_KEY"], "location": "Kenya"},
    ...
)
```
### Required environment variables

Some adapters require API keys. Add these to your `.env.local` or `.env` file:

```shell
# Careerjet v4 API — register at https://www.careerjet.co.ke/partners/register/as-publisher
CAREERJET_API_KEY=your_publisher_api_key_here
```

The Careerjet adapter reads `api_key` from `SourceConfig.config`. In your integration, load it from the environment:

```python
import os

SourceConfig(
    name="Careerjet Kenya",
    slug="careerjet-kenya",
    adapter="careerjet",
    source_type=SourceType.API,
    base_url="https://www.careerjet.co.ke",
    config={"api_key": os.environ["CAREERJET_API_KEY"], "location": "Kenya"},
)
```
## API Reference

All public exports are available from the top-level package:

```python
from ijobs_scraper import (
    ScraperEngine,      # Main orchestrator — scrape_source(), parse_url(), scrape_all()
    SourceConfig,       # Source configuration model
    SourceType,         # Enum: api, html, browser, rss
    RawListing,         # Raw scraped listing before enrichment
    EnrichedJob,        # AI-enriched structured job data
    JobRequirements,    # Education, experience, certifications, languages
    ScrapeResult,       # Scrape run statistics and status
    AIProvider,         # Protocol: host app implements AI extraction
    StorageBackend,     # Protocol: host app implements persistence
    JobCallback,        # Protocol: host app handles enriched jobs
    AdapterRegistry,    # Adapter registration and lookup
    BaseAdapter,        # Abstract base for all adapters
    get_due_sources,    # Cron schedule evaluation
    ScraperError,       # Base exception
    AdapterError,       # Adapter-specific errors
    EnrichmentError,    # AI enrichment failures
    DuplicateJobError,  # Dedup detection
    RateLimitError,     # HTTP 429 / rate limiting
)
```
For full API documentation, see docs/api-reference.md.
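As one way a host app might react to a rate-limited source, here is a retry-with-backoff sketch. The exception class below is a local stand-in for the library's `RateLimitError` so the snippet runs without `ijobs-scraper` installed; swap in the real import in your integration:

```python
import asyncio


class RateLimitError(Exception):
    """Local stand-in for ijobs_scraper.RateLimitError (HTTP 429)."""


async def with_backoff(coro_factory, retries: int = 3, base_delay: float = 1.0):
    """Retry an async operation on rate limiting, doubling the delay each attempt."""
    for attempt in range(retries):
        try:
            return await coro_factory()
        except RateLimitError:
            if attempt == retries - 1:
                raise  # out of retries: surface the error to the caller
            await asyncio.sleep(base_delay * 2**attempt)
```

A caller would wrap the scrape in a factory, e.g. `await with_backoff(lambda: engine.scrape_source(source))`, keeping retry policy in the host app rather than in the library.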
## Contributing

Contributions are welcome! We especially encourage adapter PRs for new job portals.

- See CONTRIBUTING.md for development setup, code style, and PR guidelines
- See docs/adding-an-adapter.md for the step-by-step adapter guide
- Browse issues labeled `good first issue` for starter tasks
## License
MIT