
ijobs-scraper

Job scraping & AI enrichment engine for African job markets.


Features

  • 13 built-in adapters covering Kenyan job portals — API, HTML, and browser-based sources
  • AI enrichment via OpenAI structured outputs — extracts titles, skills, salary, categories, and more from raw listings
  • 3-layer deduplication — source URL uniqueness, SHA-256 content hashing, and optional fuzzy matching
  • Async-first — all I/O uses async/await with httpx for HTTP and playwright for browser automation
  • Framework-agnostic — zero coupling to FastAPI, Django, or any web framework. Integrate anywhere via Protocol-based dependency injection
  • Type-safe — Pydantic v2 models throughout, passes mypy --strict
  • Cron scheduling — built-in schedule evaluation via croniter. Your app decides when and how to enqueue
  • Open-source — MIT licensed, designed for contributors to add new portal adapters in ~50 lines of code

Quick Start

pip install ijobs-scraper

import asyncio
from ijobs_scraper import ScraperEngine, SourceConfig, SourceType


class StubAIProvider:
    """Minimal AI provider for testing — returns mock enriched data."""

    async def structured_extract(self, system_prompt, user_prompt, json_schema):
        return {
            "title": "Software Engineer",
            "description": "A great role.",
            "company_name": "One Acre Fund",
            "company_website": None,
            "location": "Nairobi, Kenya",
            "remote_type": "hybrid",
            "employment_type": "full_time",
            "experience_level": "Mid-level",
            "salary_min": None,
            "salary_max": None,
            "currency": "KES",
            "skills": ["Python", "SQL"],
            "benefits": [],
            "category": "technology-engineering",
            "requirements": None,
            "posted_at": None,
            "expires_at": None,
        }


async def main():
    engine = ScraperEngine(ai_provider=StubAIProvider())
    source = SourceConfig(
        name="One Acre Fund",
        slug="one-acre-fund",
        adapter="greenhouse",
        source_type=SourceType.API,
        base_url="https://boards-api.greenhouse.io",
        config={"board_token": "oneacrefund"},
    )
    result = await engine.scrape_source(source)
    print(f"Found: {result.jobs_found} | Created: {result.jobs_created} | Duplicates: {result.jobs_duplicated}")


asyncio.run(main())

Supported Sources

| Source | Adapter | Type | Reusable | Status |
|---|---|---|---|---|
| Kenya Airways | kenya_airways | API | No | Stable |
| One Acre Fund | greenhouse | API | Yes | Stable |
| Amref Health Africa | smartrecruiters | API | Yes | Stable |
| Careerjet Kenya | careerjet | API | Yes | Stable |
| ReliefWeb | reliefweb | API | No | Stable |
| BrighterMonday | brightermonday | HTML | No | Stable |
| MyJobMag Kenya | myjobmag | HTML | No | Stable |
| MyGov Kenya | mygov | HTML | No | Stable |
| Fuzu Kenya | fuzu | HTML | No | Stable |
| KCB Bank | kcb | HTML | No | Stable |
| Absa Bank | workday | Browser | Yes | Stable |
| NCBA Bank | workday | Browser | Yes | Stable |
| Impactpool | impactpool | Browser | No | Stable |
| World Vision | world_vision | Browser | No | Stable |

14 sources, 13 adapters — Absa and NCBA share the reusable workday adapter with different config.

Reusable adapters work with any employer on the same platform. For example, the greenhouse adapter works for any company using Greenhouse by changing the board_token config.

How It Works

The scraping pipeline follows a linear flow:

Source → Adapter → RawListing → AI Enrichment → Dedup → EnrichedJob
  1. The Adapter fetches raw job listings from a portal (via REST API, HTML parsing, or browser automation)
  2. The Enrichment Pipeline sends raw content to your AI provider with a strict JSON schema, producing structured job data
  3. The Dedup Engine checks for duplicates across three layers: source URL uniqueness, SHA-256 content hashing, and optional fuzzy matching
  4. The Engine emits each unique EnrichedJob through your callback
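
The three dedup layers can be sketched with standard-library tools. This is an illustration, not the library's implementation: `SequenceMatcher` stands in for whatever fuzzy matcher ijobs-scraper actually uses, and the normalization rule is an assumption.

```python
import hashlib
from difflib import SequenceMatcher

def content_hash(title: str, description: str) -> str:
    """Layer 2: SHA-256 over normalized title + description."""
    normalized = f"{title.strip().lower()}|{description.strip().lower()}"
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def is_fuzzy_duplicate(title_a: str, title_b: str, threshold: float = 0.9) -> bool:
    """Layer 3 (optional): fuzzy title similarity above a threshold."""
    return SequenceMatcher(None, title_a.lower(), title_b.lower()).ratio() >= threshold

seen_urls: set[str] = set()
seen_hashes: set[str] = set()

def is_duplicate(url: str, title: str, description: str) -> bool:
    if url in seen_urls:          # Layer 1: source URL uniqueness
        return True
    h = content_hash(title, description)
    if h in seen_hashes:          # Layer 2: content hash
        return True
    seen_urls.add(url)
    seen_hashes.add(h)
    return False
```

Layer 1 catches re-scrapes of the same posting, layer 2 catches the same posting republished at a new URL, and layer 3 catches near-identical listings that differ in minor wording.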

Data Flow

See the full architecture doc for detailed diagrams and specifications.

Architecture

ijobs-scraper uses Protocol-based dependency injection to stay completely decoupled from any framework. Your application provides three interfaces:

  • AIProvider — wraps your OpenAI (or any LLM) calls for structured extraction
  • StorageBackend — provides persistence for deduplication (known URLs, content hashes)
  • JobCallback — handles each enriched job (save to DB, send notification, index for search)
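
As a rough sketch of what the host app supplies, here are minimal in-memory implementations. The method names below are illustrative assumptions, not the library's actual Protocol signatures; consult the architecture doc for the real interfaces.

```python
import asyncio

class InMemoryStorage:
    """StorageBackend-style sketch: tracks known URLs for dedup."""

    def __init__(self) -> None:
        self.urls: set[str] = set()

    async def has_url(self, url: str) -> bool:
        return url in self.urls

    async def add_url(self, url: str) -> None:
        self.urls.add(url)

class CollectCallback:
    """JobCallback-style sketch: collects enriched jobs in a list
    (a real app would save to a DB, notify, or index for search)."""

    def __init__(self) -> None:
        self.jobs: list[dict] = []

    async def on_job(self, job: dict) -> None:
        self.jobs.append(job)
```

Because the engine depends only on these Protocols, the same scraping code runs unchanged whether persistence is Postgres, Redis, or a test double.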

The Adapter Registry maps adapter names to classes and auto-detects adapters from URLs for manual scraping. Each adapter is a self-contained class registered via decorator.

For the complete architecture specification, see docs/scraper-engine.md.

Usage Examples

1. Auto-scrape a source

import asyncio
from ijobs_scraper import ScraperEngine, SourceConfig, SourceType

engine = ScraperEngine(ai_provider=my_ai_provider, storage=my_storage, on_job=my_callback)

source = SourceConfig(
    name="Amref Health Africa",
    slug="amref",
    adapter="smartrecruiters",
    source_type=SourceType.API,
    base_url="https://api.smartrecruiters.com/v1/companies/AmrefHealthAfrica4",
    config={"company_slug": "AmrefHealthAfrica4"},
)

result = asyncio.run(engine.scrape_source(source))
print(f"Status: {result.status} | Created: {result.jobs_created}")

2. Parse a single URL

import asyncio
from ijobs_scraper import ScraperEngine

engine = ScraperEngine(ai_provider=my_ai_provider)

# Auto-detects the adapter from the URL domain
job = asyncio.run(engine.parse_url("https://boards.greenhouse.io/oneacrefund/jobs/12345"))
print(f"{job.title} at {job.company_name} ({job.location})")

3. Integrate with FastAPI

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from ijobs_scraper import ScraperEngine, EnrichedJob

app = FastAPI()
engine = ScraperEngine(ai_provider=my_ai_provider)


class ParseRequest(BaseModel):
    url: str
    hint: str | None = None


@app.post("/scraper/parse-url", response_model=EnrichedJob)
async def parse_url(request: ParseRequest):
    try:
        return await engine.parse_url(request.url, hint=request.hint)
    except Exception as e:
        raise HTTPException(status_code=422, detail=str(e))

Adding an Adapter

Adding support for a new job portal takes ~50 lines of code:

  1. Create the adapter file in the correct directory (api/, html/, or browser/)
  2. Subclass the right base: APIAdapter, HTMLAdapter, or BrowserAdapter
  3. Implement fetch_listings() and can_handle_url()
  4. Register with @AdapterRegistry.register("your_adapter")
  5. Add tests in tests/adapters/

A minimal API adapter looks like this:

from ijobs_scraper.adapters.base import APIAdapter
from ijobs_scraper._registry import AdapterRegistry
from ijobs_scraper.models import RawListing, SourceConfig


@AdapterRegistry.register("my_portal")
class MyPortalAdapter(APIAdapter):
    async def fetch_listings(self, config: SourceConfig):
        data = await self._get(f"{config.base_url}/jobs")
        for item in data["results"]:
            yield RawListing(
                external_id=str(item["id"]),
                external_url=item["url"],
                title=item.get("title"),
                raw_json=item,
                company_name=config.name,
            )

    def can_handle_url(self, url: str) -> bool:
        return "myportal.com" in url

See the full adapter guide and docs/adding-an-adapter.md for detailed instructions.

Configuration

SourceConfig

| Field | Type | Description |
|---|---|---|
| name | str | Human-readable source name (e.g., "Kenya Airways") |
| slug | str | URL-safe identifier (e.g., "kenya-airways") |
| adapter | str | Registered adapter name (e.g., "greenhouse") |
| source_type | SourceType | One of: api, html, browser, rss |
| base_url | str | Base URL for the source |
| cron_schedule | str \| None | Cron expression for auto-scraping (e.g., "0 */6 * * *") |
| is_active | bool | Whether this source is enabled (default: True) |
| config | dict | Adapter-specific configuration (see below) |
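
To illustrate what evaluating cron_schedule means: the library itself uses croniter, but the idea can be shown with a stdlib-only sketch that handles just the common field forms (`*`, a literal number, and `*/N`). This is a simplification for illustration, not the library's scheduler.

```python
from datetime import datetime

def cron_field_matches(field: str, value: int) -> bool:
    """Match one cron field: '*', a literal number, or '*/N' steps."""
    if field == "*":
        return True
    if field.startswith("*/"):
        return value % int(field[2:]) == 0
    return int(field) == value

def is_due(expr: str, now: datetime) -> bool:
    """Does `expr` fire at the minute/hour of `now`?
    Only checks the first two fields; real evaluation uses croniter."""
    minute, hour, *_ = expr.split()
    return cron_field_matches(minute, now.minute) and cron_field_matches(hour, now.hour)
```

So a source with cron_schedule "0 */6 * * *" is due at 00:00, 06:00, 12:00, and 18:00; your app calls get_due_sources on its own tick and decides how to enqueue the work.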

Adapter-specific config examples

# Greenhouse — any employer using Greenhouse
SourceConfig(
    adapter="greenhouse",
    config={"board_token": "oneacrefund"},
    ...
)

# SmartRecruiters — any employer using SmartRecruiters
SourceConfig(
    adapter="smartrecruiters",
    config={"company_slug": "AmrefHealthAfrica4"},
    ...
)

# Workday — any employer using Workday (Absa, NCBA, etc.)
SourceConfig(
    adapter="workday",
    config={"tenant": "absa", "instance": "AbsaCareers"},
    ...
)

# Careerjet — meta-aggregator covering 60+ sites (v4 API, requires API key)
SourceConfig(
    adapter="careerjet",
    config={"api_key": os.environ["CAREERJET_API_KEY"], "location": "Kenya"},
    ...
)

Required environment variables

Some adapters require API keys. Add these to your .env.local or .env file:

# Careerjet v4 API — register at https://www.careerjet.co.ke/partners/register/as-publisher
CAREERJET_API_KEY=your_publisher_api_key_here

The Careerjet adapter reads api_key from SourceConfig.config. In your integration, load it from the environment:

import os

SourceConfig(
    name="Careerjet Kenya",
    slug="careerjet-kenya",
    adapter="careerjet",
    source_type=SourceType.API,
    base_url="https://www.careerjet.co.ke",
    config={"api_key": os.environ["CAREERJET_API_KEY"], "location": "Kenya"},
)

API Reference

All public exports are available from the top-level package:

from ijobs_scraper import (
    ScraperEngine,       # Main orchestrator — scrape_source(), parse_url(), scrape_all()
    SourceConfig,        # Source configuration model
    SourceType,          # Enum: api, html, browser, rss
    RawListing,          # Raw scraped listing before enrichment
    EnrichedJob,         # AI-enriched structured job data
    JobRequirements,     # Education, experience, certifications, languages
    ScrapeResult,        # Scrape run statistics and status
    AIProvider,          # Protocol: host app implements AI extraction
    StorageBackend,      # Protocol: host app implements persistence
    JobCallback,         # Protocol: host app handles enriched jobs
    AdapterRegistry,     # Adapter registration and lookup
    BaseAdapter,         # Abstract base for all adapters
    get_due_sources,     # Cron schedule evaluation
    ScraperError,        # Base exception
    AdapterError,        # Adapter-specific errors
    EnrichmentError,     # AI enrichment failures
    DuplicateJobError,   # Dedup detection
    RateLimitError,      # HTTP 429 / rate limiting
)

For full API documentation, see docs/api-reference.md.

Contributing

Contributions are welcome! We especially encourage adapter PRs for new job portals.

License

MIT
