
ijobs-scraper

Job scraping & AI enrichment engine for African job markets.


Features

  • 13 built-in adapters covering Kenyan job portals — API, HTML, and browser-based sources
  • AI enrichment via OpenAI structured outputs — extracts titles, skills, salary, categories, and more from raw listings
  • 3-layer deduplication — source URL uniqueness, SHA-256 content hashing, and optional fuzzy matching
  • Async-first — all I/O uses async/await with httpx for HTTP and playwright for browser automation
  • Framework-agnostic — zero coupling to FastAPI, Django, or any web framework. Integrate anywhere via Protocol-based dependency injection
  • Type-safe — Pydantic v2 models throughout, passes mypy --strict
  • Cron scheduling — built-in schedule evaluation via croniter. Your app decides when and how to enqueue
  • Open-source — MIT licensed, designed for contributors to add new portal adapters in ~50 lines of code

Quick Start

pip install ijobs-scraper

import asyncio
from ijobs_scraper import ScraperEngine, SourceConfig, SourceType


class StubAIProvider:
    """Minimal AI provider for testing — returns mock enriched data."""

    async def structured_extract(self, system_prompt, user_prompt, json_schema):
        return {
            "title": "Software Engineer",
            "description": "A great role.",
            "company_name": "One Acre Fund",
            "company_website": None,
            "location": "Nairobi, Kenya",
            "remote_type": "hybrid",
            "employment_type": "full_time",
            "experience_level": "Mid-level",
            "salary_min": None,
            "salary_max": None,
            "currency": "KES",
            "skills": ["Python", "SQL"],
            "benefits": [],
            "category": "technology-engineering",
            "requirements": None,
            "posted_at": None,
            "expires_at": None,
        }


async def main():
    engine = ScraperEngine(ai_provider=StubAIProvider())
    source = SourceConfig(
        name="One Acre Fund",
        slug="one-acre-fund",
        adapter="greenhouse",
        source_type=SourceType.API,
        base_url="https://boards-api.greenhouse.io",
        config={"board_token": "oneacrefund"},
    )
    result = await engine.scrape_source(source)
    print(f"Found: {result.jobs_found} | Created: {result.jobs_created} | Duplicates: {result.jobs_duplicated}")


asyncio.run(main())

Supported Sources

| Source | Adapter | Type | Reusable | Status |
|---|---|---|---|---|
| Kenya Airways | kenya_airways | API | No | Stable |
| One Acre Fund | greenhouse | API | Yes | Stable |
| Amref Health Africa | smartrecruiters | API | Yes | Stable |
| Careerjet Kenya | careerjet | API | Yes | Stable |
| ReliefWeb | reliefweb | API | No | Stable |
| BrighterMonday | brightermonday | HTML | No | Stable |
| MyJobMag Kenya | myjobmag | HTML | No | Stable |
| MyGov Kenya | mygov | HTML | No | Stable |
| Fuzu Kenya | fuzu | HTML | No | Stable |
| KCB Bank | kcb | HTML | No | Stable |
| Absa Bank | workday | Browser | Yes | Stable |
| NCBA Bank | workday | Browser | Yes | Stable |
| Impactpool | impactpool | Browser | No | Stable |
| World Vision | world_vision | Browser | No | Stable |

14 sources, 13 adapters — Absa and NCBA share the reusable workday adapter with different config.

Reusable adapters work with any employer on the same platform. For example, the greenhouse adapter works for any company using Greenhouse by changing the board_token config.

How It Works

The scraping pipeline follows a linear flow:

Source → Adapter → RawListing → AI Enrichment → Dedup → EnrichedJob
  1. The Adapter fetches raw job listings from a portal (via REST API, HTML parsing, or browser automation)
  2. The Enrichment Pipeline sends raw content to your AI provider with a strict JSON schema, producing structured job data
  3. The Dedup Engine checks for duplicates across three layers: source URL uniqueness, SHA-256 content hashing, and optional fuzzy matching
  4. The Engine emits each unique EnrichedJob through your callback
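
The three dedup layers can be sketched with standard-library tools. This is an illustration, not the library's implementation: `SequenceMatcher` stands in for whatever fuzzy matcher ijobs-scraper actually uses, and the normalization rule is an assumption.

```python
import hashlib
from difflib import SequenceMatcher

def content_hash(title: str, description: str) -> str:
    """Layer 2: SHA-256 over normalized title + description."""
    normalized = f"{title.strip().lower()}|{description.strip().lower()}"
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def is_fuzzy_duplicate(title_a: str, title_b: str, threshold: float = 0.9) -> bool:
    """Layer 3 (optional): fuzzy title similarity above a threshold."""
    return SequenceMatcher(None, title_a.lower(), title_b.lower()).ratio() >= threshold

seen_urls: set[str] = set()
seen_hashes: set[str] = set()

def is_duplicate(url: str, title: str, description: str) -> bool:
    if url in seen_urls:          # Layer 1: source URL uniqueness
        return True
    h = content_hash(title, description)
    if h in seen_hashes:          # Layer 2: content hash
        return True
    seen_urls.add(url)
    seen_hashes.add(h)
    return False
```

Layer 1 catches re-scrapes of the same posting, layer 2 catches the same posting republished at a new URL, and layer 3 catches near-identical listings that differ in minor wording.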

Data Flow

See the full architecture doc for detailed diagrams and specifications.

Architecture

ijobs-scraper uses Protocol-based dependency injection to stay completely decoupled from any framework. Your application provides three interfaces:

  • AIProvider — wraps your OpenAI (or any LLM) calls for structured extraction
  • StorageBackend — provides persistence for deduplication (known URLs, content hashes)
  • JobCallback — handles each enriched job (save to DB, send notification, index for search)
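
As a rough sketch of what the host app supplies, here are minimal in-memory implementations. The method names below are illustrative assumptions, not the library's actual Protocol signatures; consult the architecture doc for the real interfaces.

```python
import asyncio

class InMemoryStorage:
    """StorageBackend-style sketch: tracks known URLs for dedup."""

    def __init__(self) -> None:
        self.urls: set[str] = set()

    async def has_url(self, url: str) -> bool:
        return url in self.urls

    async def add_url(self, url: str) -> None:
        self.urls.add(url)

class CollectCallback:
    """JobCallback-style sketch: collects enriched jobs in a list
    (a real app would save to a DB, notify, or index for search)."""

    def __init__(self) -> None:
        self.jobs: list[dict] = []

    async def on_job(self, job: dict) -> None:
        self.jobs.append(job)
```

Because the engine depends only on these Protocols, the same scraping code runs unchanged whether persistence is Postgres, Redis, or a test double.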

The Adapter Registry maps adapter names to classes and auto-detects adapters from URLs for manual scraping. Each adapter is a self-contained class registered via decorator.

For the complete architecture specification, see docs/scraper-engine.md.

Usage Examples

1. Auto-scrape a source

import asyncio
from ijobs_scraper import ScraperEngine, SourceConfig, SourceType

engine = ScraperEngine(ai_provider=my_ai_provider, storage=my_storage, on_job=my_callback)

source = SourceConfig(
    name="Amref Health Africa",
    slug="amref",
    adapter="smartrecruiters",
    source_type=SourceType.API,
    base_url="https://api.smartrecruiters.com/v1/companies/AmrefHealthAfrica4",
    config={"company_slug": "AmrefHealthAfrica4"},
)

result = asyncio.run(engine.scrape_source(source))
print(f"Status: {result.status} | Created: {result.jobs_created}")

2. Parse a single URL

import asyncio
from ijobs_scraper import ScraperEngine

engine = ScraperEngine(ai_provider=my_ai_provider)

# Auto-detects the adapter from the URL domain
job = asyncio.run(engine.parse_url("https://boards.greenhouse.io/oneacrefund/jobs/12345"))
print(f"{job.title} at {job.company_name} ({job.location})")

3. Integrate with FastAPI

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from ijobs_scraper import ScraperEngine, EnrichedJob

app = FastAPI()
engine = ScraperEngine(ai_provider=my_ai_provider)


class ParseRequest(BaseModel):
    url: str
    hint: str | None = None


@app.post("/scraper/parse-url", response_model=EnrichedJob)
async def parse_url(request: ParseRequest):
    try:
        return await engine.parse_url(request.url, hint=request.hint)
    except Exception as e:
        raise HTTPException(status_code=422, detail=str(e))

Adding an Adapter

Adding support for a new job portal takes ~50 lines of code:

  1. Create the adapter file in the correct directory (api/, html/, or browser/)
  2. Subclass the right base: APIAdapter, HTMLAdapter, or BrowserAdapter
  3. Implement fetch_listings() and can_handle_url()
  4. Register with @AdapterRegistry.register("your_adapter")
  5. Add tests in tests/adapters/

A minimal API adapter looks like this:

from ijobs_scraper.adapters.base import APIAdapter
from ijobs_scraper._registry import AdapterRegistry
from ijobs_scraper.models import RawListing, SourceConfig


@AdapterRegistry.register("my_portal")
class MyPortalAdapter(APIAdapter):
    async def fetch_listings(self, config: SourceConfig):
        data = await self._get(f"{config.base_url}/jobs")
        for item in data["results"]:
            yield RawListing(
                external_id=str(item["id"]),
                external_url=item["url"],
                title=item.get("title"),
                raw_json=item,
                company_name=config.name,
            )

    def can_handle_url(self, url: str) -> bool:
        return "myportal.com" in url

See the full adapter guide and docs/adding-an-adapter.md for detailed instructions.

Configuration

SourceConfig

| Field | Type | Description |
|---|---|---|
| name | str | Human-readable source name (e.g., "Kenya Airways") |
| slug | str | URL-safe identifier (e.g., "kenya-airways") |
| adapter | str | Registered adapter name (e.g., "greenhouse") |
| source_type | SourceType | One of: api, html, browser, rss |
| base_url | str | Base URL for the source |
| cron_schedule | str \| None | Cron expression for auto-scraping (e.g., "0 */6 * * *") |
| is_active | bool | Whether this source is enabled (default: True) |
| config | dict | Adapter-specific configuration (see below) |
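
To illustrate what evaluating cron_schedule means: the library itself uses croniter, but the idea can be shown with a stdlib-only sketch that handles just the common field forms (`*`, a literal number, and `*/N`). This is a simplification for illustration, not the library's scheduler.

```python
from datetime import datetime

def cron_field_matches(field: str, value: int) -> bool:
    """Match one cron field: '*', a literal number, or '*/N' steps."""
    if field == "*":
        return True
    if field.startswith("*/"):
        return value % int(field[2:]) == 0
    return int(field) == value

def is_due(expr: str, now: datetime) -> bool:
    """Does `expr` fire at the minute/hour of `now`?
    Only checks the first two fields; real evaluation uses croniter."""
    minute, hour, *_ = expr.split()
    return cron_field_matches(minute, now.minute) and cron_field_matches(hour, now.hour)
```

So a source with cron_schedule "0 */6 * * *" is due at 00:00, 06:00, 12:00, and 18:00; your app calls get_due_sources on its own tick and decides how to enqueue the work.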

Adapter-specific config examples

# Greenhouse — any employer using Greenhouse
SourceConfig(
    adapter="greenhouse",
    config={"board_token": "oneacrefund"},
    ...
)

# SmartRecruiters — any employer using SmartRecruiters
SourceConfig(
    adapter="smartrecruiters",
    config={"company_slug": "AmrefHealthAfrica4"},
    ...
)

# Workday — any employer using Workday (Absa, NCBA, etc.)
SourceConfig(
    adapter="workday",
    config={"tenant": "absa", "instance": "AbsaCareers"},
    ...
)

# Careerjet — meta-aggregator covering 60+ sites (v4 API, requires API key)
SourceConfig(
    adapter="careerjet",
    config={"api_key": os.environ["CAREERJET_API_KEY"], "location": "Kenya"},
    ...
)

Required environment variables

Some adapters require API keys. Add these to your .env.local or .env file:

# Careerjet v4 API — register at https://www.careerjet.co.ke/partners/register/as-publisher
CAREERJET_API_KEY=your_publisher_api_key_here

The Careerjet adapter reads api_key from SourceConfig.config. In your integration, load it from the environment:

import os

SourceConfig(
    name="Careerjet Kenya",
    slug="careerjet-kenya",
    adapter="careerjet",
    source_type=SourceType.API,
    base_url="https://www.careerjet.co.ke",
    config={"api_key": os.environ["CAREERJET_API_KEY"], "location": "Kenya"},
)

API Reference

All public exports are available from the top-level package:

from ijobs_scraper import (
    ScraperEngine,       # Main orchestrator — scrape_source(), parse_url(), scrape_all()
    SourceConfig,        # Source configuration model
    SourceType,          # Enum: api, html, browser, rss
    RawListing,          # Raw scraped listing before enrichment
    EnrichedJob,         # AI-enriched structured job data
    JobRequirements,     # Education, experience, certifications, languages
    ScrapeResult,        # Scrape run statistics and status
    AIProvider,          # Protocol: host app implements AI extraction
    StorageBackend,      # Protocol: host app implements persistence
    JobCallback,         # Protocol: host app handles enriched jobs
    AdapterRegistry,     # Adapter registration and lookup
    BaseAdapter,         # Abstract base for all adapters
    get_due_sources,     # Cron schedule evaluation
    ScraperError,        # Base exception
    AdapterError,        # Adapter-specific errors
    EnrichmentError,     # AI enrichment failures
    DuplicateJobError,   # Dedup detection
    RateLimitError,      # HTTP 429 / rate limiting
)

For full API documentation, see docs/api-reference.md.

Contributing

Contributions are welcome! We especially encourage adapter PRs for new job portals.

License

MIT
