A programmable job-scraping framework for India & global markets. Aggregates Naukri, Shine, Internshala, LinkedIn, Indeed, and FAANG companies into a unified dataset.

HireHunt

A programmable job-search aggregation framework for India and global markets.

HireHunt provides:

  • A stable normalized job schema.
  • Source-family registration with machine-readable capabilities and definitions.
  • Synchronous and asynchronous orchestration.
  • Configurable filtering, ranking, deduplication, retry, and caching policies.
  • Per-source completion and filtering diagnostics.
  • Graceful partial results when one source fails.
  • Fixture-based parser contract tests, optional live validation, and benchmark reporting.

Sources

Source           Region      Type                        Method
naukri           🇮🇳 India    Jobs                        REST API (15,000+ listings)
shine            🇮🇳 India    Jobs                        SSR JSON (17,000+ listings)
internshala      🇮🇳 India    Internships / Jobs          HTML scraping
unstop           🇮🇳 India    Hackathons / Competitions   REST API
linkedin         Global      Jobs                        Guest HTML API
indeed           Global      Jobs                        GraphQL API
google_careers   FAANG       Jobs                        LinkedIn (company-filtered)
amazon           FAANG       Jobs                        REST API
meta             FAANG       Jobs                        LinkedIn (company-filtered)
apple            FAANG       Jobs                        LinkedIn (keyword search)
netflix          FAANG       Jobs                        LinkedIn (company-filtered)
microsoft        FAANG       Jobs                        LinkedIn (company-filtered)

Installation

pip install hirehunt

The primary import is hirehunt. A top-level jobhunter compatibility shim is also packaged for existing users.

Requirements: Python 3.10+


Quick Start

Python API

from hirehunt import scrape_jobs

# Search across India's top job boards
result = scrape_jobs(
    search_term="python developer",
    sources=["naukri", "shine", "internshala"],
    city="Bengaluru",
    results_wanted=50,
)

for job in result.jobs:
    print(job)
# Python Developer @ TCS | Bengaluru | naukri
# Python Developer @ Infosys | Bengaluru | shine

CLI

# India job search
hirehunt search "data scientist" --city Mumbai --source naukri --source shine

# Company-aware search
hirehunt search "software engineer" --company DRDO --source linkedin --country India

# Hackathons & competitions
hirehunt search "hackathon" --source unstop

# FAANG company jobs
hirehunt search "software engineer" --source google_careers --source amazon

# Expand a whole source family
hirehunt search "backend developer" --source-family aggregator --country India

# Benchmark a family
hirehunt benchmark "python developer" --source-family regional --country India --limit 10

# Fail on health regressions
hirehunt validate "software engineer" --source-family aggregator --country India --strict --min-parsed 1

# Export to CSV
hirehunt search "backend developer" --source naukri --source linkedin --csv jobs.csv

# Top 20 ranked results
hirehunt search "machine learning" --source naukri --source shine --top 20

Result Limits And Completion

results_wanted is a per-source policy:

results_wanted=50    # At most 50 parsed records per source
results_wanted=0     # Exhaustive mode
results_wanted=None  # Exhaustive mode

Exhaustive mode continues until the source returns no further results. Some sources cannot guarantee exhaustive public search. Inspect the completion metadata rather than assuming every result is complete:

for source, stats in result.stats.items():
    print(source, stats.completion, stats.completion_reason)

Completion values are exhausted, capped, partial, failed, or unknown. Broad exhaustive searches on Naukri or Shine can require many requests.
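
As a minimal sketch of acting on that metadata, the helper below lists the sources that did not finish cleanly. It relies only on the completion and completion_reason attributes shown above. The helper name incomplete_sources, the choice to treat exhausted and capped as "complete", and the sample reason strings are assumptions made for illustration; the SimpleNamespace objects stand in for real SourceStats values.

```python
from types import SimpleNamespace

# Assumption: these two values mean the source delivered what was asked of it.
COMPLETE = {"exhausted", "capped"}

def incomplete_sources(stats):
    """Return {source: reason} for sources that ended partial, failed, or unknown."""
    return {
        source: s.completion_reason
        for source, s in stats.items()
        if s.completion not in COMPLETE
    }

# Stand-in for result.stats; the reason strings here are made up.
example_stats = {
    "naukri": SimpleNamespace(completion="capped", completion_reason="limit reached"),
    "shine": SimpleNamespace(completion="partial", completion_reason="rate limited"),
}
print(incomplete_sources(example_stats))  # {'shine': 'rate limited'}
```

With a real result, the same call would be incomplete_sources(result.stats).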

Python API Reference

scrape_jobs()

from hirehunt import scrape_jobs

result = scrape_jobs(
    search_term="python developer",   # What to search
    sources=["naukri", "shine"],      # Which sources (list or "auto")
    source_family="",                 # Optional family expansion, e.g. "aggregator"
    company="Acme",                   # Optional company intent
    city="Bengaluru",                 # City filter (optional)
    location="India",                 # Broader location (optional)
    country="India",                  # Country (optional)
    results_wanted=50,                # Max per source; None or 0 = exhaustive
    dedupe_mode="strict",             # "strict", "heuristic", or "none"
    job_kind="job",                   # "job", "internship", "hackathon"
    remote=None,                      # True = remote only
    work_mode=None,                   # "remote", "hybrid", "onsite", "unknown"
    salary_min=500000,                # Min salary in INR (optional)
    posted_within_days=30,            # Only jobs from last N days
    skills=["python", "django"],      # Skill filter (optional)
    experience_min=0,                 # Min years experience (optional)
    experience_max=5,                 # Max years experience (optional)
    request_policy={                  # Optional retry/rate policy
        "retries": 4,
        "timeout": 25,
        "backoff_base": 2,
        "min_delay": 0.2,
        "max_delay": 0.8,
    },
)

The return value is a ScrapeResult, not a bare list:

result.jobs
result.errors
result.warnings
result.partial
result.selected_sources
result.stats

Job Object

Every source returns the same normalized Job dataclass:

@dataclass
class Job:
    schema_version: ClassVar[str]  # currently "1.0"
    title: str
    company: str
    source: str
    job_url: str

    location: str
    city: str
    country: str
    work_mode: WorkMode         # "remote" | "hybrid" | "onsite" | "unknown"
    job_kind: JobKind           # "job" | "internship" | "hackathon" | "competition"

    salary: Money               # min_amount, max_amount, currency, period
    stipend: Money

    skills: list[str]
    experience_min: float | None
    experience_max: float | None
    description: str
    date_posted: str | None
    deadline: str | None        # for competitions/hackathons

    match_score: float          # 0.0–100.0 after ranking

Job.to_dict() includes schema_version. Additive fields may be introduced without changing the meaning of existing fields. Breaking schema changes require a new schema version.
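
A consumer of exported records can guard against breaking changes with a small version check. The sketch below is not part of HireHunt. It assumes schema_version follows a "major.minor" pattern in which only a change to the first component is breaking, which the text above does not state explicitly; is_compatible and SUPPORTED_MAJOR are hypothetical names.

```python
SUPPORTED_MAJOR = 1  # Assumption: this consumer was written against schema 1.x.

def is_compatible(record: dict) -> bool:
    """Accept records whose schema major version matches SUPPORTED_MAJOR."""
    version = str(record.get("schema_version", ""))
    major = version.split(".", 1)[0]
    return major.isdigit() and int(major) == SUPPORTED_MAJOR

print(is_compatible({"schema_version": "1.0", "title": "Python Developer"}))  # True
print(is_compatible({"schema_version": "2.0", "title": "Python Developer"}))  # False
```

Records without a schema_version are rejected, so older exports fail loudly instead of being read with the wrong assumptions.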

Source Diagnostics

Every SourceStats includes:

stats.fetched
stats.parsed
stats.found                 # Backward-compatible parsed count
stats.filtered_out
stats.kept
stats.duplicates
stats.errors
stats.requests
stats.completion
stats.completion_reason
stats.filter_reasons        # e.g. {"city_mismatch": 12}

If one source fails, successful source results are still returned and result.partial is set to True.
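
When several sources run together, it can help to see which filter removed the most records overall. The following sketch merges per-source filter_reasons dictionaries of the shape shown above. merge_filter_reasons is a hypothetical helper, and "salary_below_min" is an invented reason key used only for the example.

```python
from collections import Counter

def merge_filter_reasons(per_source_reasons):
    """Sum filter-reason counts across sources, most frequent first."""
    total = Counter()
    for reasons in per_source_reasons:
        total.update(reasons)
    return total.most_common()

# Stand-ins for [stats.filter_reasons for stats in result.stats.values()].
sample = [
    {"city_mismatch": 12},
    {"city_mismatch": 3, "salary_below_min": 5},
]
print(merge_filter_reasons(sample))  # [('city_mismatch', 15), ('salary_below_min', 5)]
```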

Source Capabilities

Sources declare supported countries, job kinds, native filters, pagination, and exhaustive-search support:

from hirehunt.registry import default_registry

registry = default_registry()
print(registry.capabilities("naukri"))
print(registry.capabilities())  # all sources

Custom scrapers declare the same contract:

from hirehunt.models import JobKind, SourceCapabilities
from hirehunt.scrapers.base import BaseScraper

class MyScraper(BaseScraper):
    source = "my_source"
    capabilities = SourceCapabilities(
        countries=("India",),
        job_kinds=(JobKind.JOB,),
        supported_filters=frozenset({"city"}),
        pagination=True,
        exhaustive_search=True,
        description="My source adapter",
    )

    def search(self, query):
        ...

Source Definitions

The registry now exposes source-family metadata for config-driven expansion:

from hirehunt.registry import default_registry

registry = default_registry()
print(registry.families())                  # ['aggregator', 'company', 'opportunity', 'regional']
print(registry.family_sources("regional"))  # ['internshala', 'naukri', 'shine']
print(registry.definition("linkedin"))      # SourceDefinition(...)

Families are reusable framework groupings, not one-off scraper classes. New adapters such as workday, greenhouse, or institutional can slot into the same contract without changing SearchEngine.

Config-driven expansion uses register_configured_source() so one adapter can back many portals:

from hirehunt.registry import ScraperRegistry
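from hirehunt.scrapers.base import BaseScraper

# MyWorkdayScraper is a placeholder for your own BaseScraper subclass,
# declared in the same way as MyScraper under Source Capabilities.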
registry = ScraperRegistry()
registry.register_configured_source(
    MyWorkdayScraper,
    source="acme_workday",
    family="workday",
    aliases=("acme",),
    config={"tenant": "acme", "site": "Careers"},
)

Pluggable Policies

SearchEngine accepts a SearchPolicies bundle:

from hirehunt import SearchEngine
from hirehunt.policies import SearchPolicies
from hirehunt.query import JobQuery

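# my_filter_policy, my_rank_policy, and my_dedupe_policy are placeholders
# for your own policy objects.
engine = SearchEngine(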
    policies=SearchPolicies(
        filtering=my_filter_policy,
        ranking=my_rank_policy,
        deduplication=my_dedupe_policy,
    )
)
query = JobQuery(search_term="backend developer", sources=["naukri", "shine"])
result = engine.search(query)

Policy contracts return FilterOutcome and DedupeOutcome, preserving diagnostics while allowing custom behavior.

Deduplication modes available through JobQuery:

  • strict: normalized URL, then source ID, then fallback identity.
  • heuristic: normalized title, company, location, and country across sources.
  • none: retain every parsed record.
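
To make the heuristic mode concrete, here is a standalone sketch of cross-source deduplication keyed on normalized title, company, location, and country. It illustrates the idea only and is not HireHunt's implementation: the normalization rules (lowercasing, dropping punctuation, collapsing whitespace) and the names heuristic_key and dedupe are assumptions.

```python
import re

def heuristic_key(job: dict) -> tuple:
    """Build a cross-source identity from title, company, location, and country."""
    def norm(value):
        # Lowercase, replace punctuation with spaces, collapse whitespace.
        text = re.sub(r"[^a-z0-9 ]+", " ", str(value or "").lower())
        return " ".join(text.split())
    return tuple(norm(job.get(field)) for field in ("title", "company", "location", "country"))

def dedupe(jobs):
    """Keep the first record seen for each heuristic key."""
    seen, kept = set(), []
    for job in jobs:
        key = heuristic_key(job)
        if key not in seen:
            seen.add(key)
            kept.append(job)
    return kept

jobs = [
    {"title": "Python Developer", "company": "TCS Ltd.", "location": "Bengaluru", "country": "India", "source": "naukri"},
    {"title": "python developer ", "company": "tcs ltd", "location": "Bengaluru", "country": "India", "source": "shine"},
]
print([job["source"] for job in dedupe(jobs)])  # ['naukri']
```

Because the key ignores the source and URL, the same posting listed on two boards collapses to one record, at the cost of occasionally merging distinct postings that share all four fields.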

Retry And Rate Policy

from hirehunt import JobQuery, RequestPolicy

query = JobQuery(
    search_term="backend developer",
    request_policy=RequestPolicy(
        retries=4,
        timeout=25,
        backoff_base=2,
        min_delay=0.2,
        max_delay=0.8,
    ),
)
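
The README does not spell out how these fields interact, so the sketch below shows one common interpretation rather than HireHunt's actual behavior: retries and backoff_base drive exponential backoff after a failed request, while min_delay and max_delay bound a random pause between consecutive requests. The function names are hypothetical.

```python
import random

def retry_delays(retries, backoff_base):
    """Assumed schedule: wait backoff_base ** attempt seconds before each retry."""
    return [backoff_base ** attempt for attempt in range(1, retries + 1)]

def polite_pause(min_delay, max_delay):
    """Assumed pacing: a random pause, in seconds, between consecutive requests."""
    return random.uniform(min_delay, max_delay)

print(retry_delays(4, 2))  # [2, 4, 8, 16]
print(0.2 <= polite_pause(0.2, 0.8) <= 0.8)  # True
```

Under this reading, the policy above would spend up to 30 seconds waiting across four retries of a single request, which is worth keeping in mind for exhaustive searches.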

Custom Cache Backend

Pass any object implementing get(source, key) and set(source, key, content, status_code=200):

from hirehunt import JobQuery

query = JobQuery(
    search_term="python",
    cache_enabled=True,
    cache_backend=my_cache,
)
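
A minimal in-memory backend matching those two method signatures might look like the sketch below. The signatures come from the text above; the return shape of get (a (content, status_code) tuple, or None on a miss) and the time-to-live behavior are assumptions, so check them against the fetcher you are plugging into.

```python
import time

class InMemoryCache:
    """Dictionary-backed cache keyed by (source, key), with a time-to-live."""

    def __init__(self, ttl_seconds=900.0):
        self.ttl_seconds = ttl_seconds
        self._entries = {}

    def get(self, source, key):
        entry = self._entries.get((source, key))
        if entry is None:
            return None
        content, status_code, stored_at = entry
        if time.monotonic() - stored_at > self.ttl_seconds:
            # Expired: drop the entry and report a miss.
            del self._entries[(source, key)]
            return None
        return content, status_code

    def set(self, source, key, content, status_code=200):
        self._entries[(source, key)] = (content, status_code, time.monotonic())

my_cache = InMemoryCache()
my_cache.set("naukri", "python|page=1", "<html>...</html>")
print(my_cache.get("naukri", "python|page=1"))  # ('<html>...</html>', 200)
```

An instance like my_cache can then be passed as cache_backend in the JobQuery shown above.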

Export

from hirehunt import scrape_jobs
from hirehunt.exporters.csv import to_csv
from hirehunt.exporters.dataframe import to_dataframe
from hirehunt.exporters.json import to_json

result = scrape_jobs(search_term="python developer", sources=["naukri", "shine"])

to_csv(result.jobs, "jobs.csv")
to_json(result.jobs, "jobs.json")
df = to_dataframe(result.jobs)

Project Structure

hirehunt/
├── __init__.py          # scrape_jobs() entry point
├── models.py            # Job, Money, WorkMode, JobKind dataclasses
├── query.py             # JobQuery: unified search parameters
├── engine.py            # Orchestrates parallel scraping + dedup
├── registry.py          # Scraper registry + auto-source selection
├── filtering.py         # Soft filtering (salary, city, skills, date)
├── ranking.py           # Relevance scoring / match_score
├── policies.py          # Injectable framework policy contracts
├── validation.py        # Live source validation
├── exceptions.py        # Custom exceptions
├── cli.py               # `jobhunter` CLI entry point
│
├── scrapers/
│   ├── base.py          # BaseScraper ABC
│   ├── naukri.py        # 🇮🇳 Naukri: /jobapi/v2/search REST API
│   ├── shine.py         # 🇮🇳 Shine: __NEXT_DATA__ SSR JSON
│   ├── internshala.py   # 🇮🇳 Internshala: HTML + pagination
│   ├── unstop.py        # 🇮🇳 Unstop: hackathons REST API
│   ├── linkedin.py      # LinkedIn: guest HTML API
│   ├── indeed.py        # Indeed: GraphQL API
│   └── faang.py         # Google, Amazon, Meta, Apple, Netflix, Microsoft
│
├── exporters/
│   ├── csv_exporter.py
│   ├── json_exporter.py
│   └── dataframe.py
│
└── utils/
    ├── fetchers.py      # CachedFetcher with proxy + backend support
    └── normalization.py # clean_text, parse_money, normalize_city, ...

tests/

Source Details

🇮🇳 Naukri

  • Endpoint: GET https://www.naukri.com/jobapi/v2/search
  • Auth: Session cookies from page warm-up (automatic)
  • Fields: Title, company, salary (LPA), location, skills, experience, date
  • Pagination: pageNo=N, 20 results/page, 3,000+ pages available

🇮🇳 Shine

  • Endpoint: __NEXT_DATA__ SSR JSON embedded in HTML
  • Fields: jJT (title), jCName (company), jSal (salary), jLoc (location), jKwd (skills), jPDate (date), jSlug (URL)
  • Pagination: path suffix -N, 20 results/page

🇮🇳 Internshala

  • Endpoint: HTML scraping (card selector: div[id^='individual_internship_'][internshipid])
  • Pagination: ?page=N, 40+ cards/page
  • City filter: current SEO routes, e.g. /internships/python-internship-in-bangalore/

🇮🇳 Unstop

  • Endpoint: GET https://unstop.com/api/public/opportunity/search-result
  • Note: Returns hackathons, coding competitions, and challenges only
  • Fields: Title, organisation, skills, location, deadline, prize

๐ŸŒ Indeed

  • Endpoint: POST https://apis.indeed.com/graphql
  • Auth: Public API key (included)
  • Pagination: Cursor-based

๐ŸŒ LinkedIn

  • Endpoint: GET https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search
  • Auth: None โ€” guest API
  • FAANG filter: f_C company ID parameter

๐ŸŒ Amazon

  • Endpoint: GET https://www.amazon.jobs/en/search.json
  • Auth: None โ€” public REST API

Filtering

Most structured-data filters are soft: missing salary, skills, experience, or location data does not automatically remove a job. Explicit remote and posting date filters are strict.

result = scrape_jobs(
    "python developer",
    sources=["naukri", "shine"],
    salary_min=600_000,        # Only applied if salary data exists
    city="Bengaluru",          # Only applied if location data exists
    skills=["python", "sql"],  # Only applied if skills data exists
    posted_within_days=14,     # Missing or invalid dates are removed
)
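
The difference between soft and strict filters can be sketched with two small predicates. These are illustrations only, not HireHunt's filtering code: the function names are hypothetical, and comparing salary_min against the job's maximum salary is an assumption.

```python
def passes_salary_min(job_salary_max, salary_min):
    """Soft filter: an unknown salary passes; a known one must reach the minimum."""
    if salary_min is None or job_salary_max is None:
        return True
    return job_salary_max >= salary_min

def passes_posted_within(age_days, posted_within_days):
    """Strict filter: a missing posting age counts as a failure."""
    if posted_within_days is None:
        return True
    return age_days is not None and age_days <= posted_within_days

print(passes_salary_min(None, 600_000))   # True: missing salary is kept
print(passes_salary_min(450_000, 600_000))  # False: known salary is too low
print(passes_posted_within(None, 14))     # False: missing date is removed
```

The practical effect is that soft filters never shrink a result set because a source omits a field, while strict filters favor precision over recall.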

Advanced Usage

FAANG-only search

from hirehunt import scrape_jobs
from hirehunt.registry import default_registry

registry = default_registry()
faang = registry.faang_sources()  # ['google_careers', 'amazon', 'meta', 'apple', 'netflix', 'microsoft']

result = scrape_jobs(
    search_term="software engineer",
    sources=faang,
    results_wanted=20,
)

Parallel scraping with custom config

result = scrape_jobs(
    search_term="backend developer",
    sources=["naukri", "shine", "linkedin"],
    city="Hyderabad",
    results_wanted=100,
    posted_within_days=7,
    cache_enabled=True,        # Cache responses locally
    proxies=["http://..."],    # Optional proxy list
)

Auto-source selection

# Automatically picks India job sources when country="India"
result = scrape_jobs(
    search_term="python developer",
    country="India",
    sources="auto",  # → [indeed, linkedin, internshala, naukri, shine]
)
)

Opportunity terms such as hackathon, competition, or challenge automatically add Unstop.
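
One way to picture that rule is a keyword check on the search term. The sketch below is an assumption about the behavior described above, not the registry's actual selection logic: it uses simple substring matching, and with_opportunity_source is a hypothetical name.

```python
OPPORTUNITY_TERMS = ("hackathon", "competition", "challenge")

def with_opportunity_source(search_term, sources):
    """Append "unstop" when the search term mentions an opportunity keyword."""
    term = search_term.lower()
    mentions_opportunity = any(keyword in term for keyword in OPPORTUNITY_TERMS)
    if mentions_opportunity and "unstop" not in sources:
        return [*sources, "unstop"]
    return list(sources)

print(with_opportunity_source("AI hackathon", ["linkedin", "naukri"]))
# ['linkedin', 'naukri', 'unstop']
print(with_opportunity_source("python developer", ["linkedin", "naukri"]))
# ['linkedin', 'naukri']
```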

Testing And Validation

pip install -e .
python -m unittest discover -s tests -v
hirehunt validate "software developer" --city Bengaluru --country India
hirehunt benchmark "software developer" --source-family aggregator --country India --limit 5

Parser contracts use sanitized fixtures under tests/fixtures. Live validation is separate because remote sites can block, rate-limit, or change independently of deterministic unit tests.

CI runs the deterministic suite across Python 3.10-3.13 in .github/workflows/ci.yml. Scheduled source monitoring and parser-drift alerts run through .github/workflows/source-health.yml, which executes validate and benchmark for each source family and uploads the resulting JSON reports. .github/workflows/publish.yml gates release publication on tests, successful builds, and a wheel-install smoke run.

Compatibility

Existing public fields and calls remain supported:

  • scrape_jobs(**kwargs) and search_jobs(**kwargs).
  • result.jobs, result.errors, result.stats, and result.warnings.
  • SourceStats.found, kept, duplicates, and errors.
  • filter_jobs, rank_jobs, and deduplicate_jobs.

New metadata and policy APIs are additive.

License

MIT
