A programmable job-scraping framework for India & global markets. Aggregates Naukri, Shine, Internshala, LinkedIn, Indeed, and FAANG companies into a unified dataset.
HireHunt
A programmable job-search aggregation framework for India and global markets.
HireHunt provides:
- A stable normalized job schema.
- Source-family registration with machine-readable capabilities and definitions.
- Synchronous and asynchronous orchestration.
- Configurable filtering, ranking, deduplication, retry, and caching policies.
- Per-source completion and filtering diagnostics.
- Graceful partial results when one source fails.
- Fixture-based parser contract tests, optional live validation, and benchmark reporting.
Sources
| Source | Region | Type | Method |
|---|---|---|---|
| naukri | 🇮🇳 India | Jobs | REST API, 15,000+ listings |
| shine | 🇮🇳 India | Jobs | SSR JSON, 17,000+ listings |
| internshala | 🇮🇳 India | Internships / Jobs | HTML scraping |
| unstop | 🇮🇳 India | Hackathons / Competitions | REST API |
| linkedin | Global | Jobs | Guest HTML API |
| indeed | Global | Jobs | GraphQL API |
| google_careers | FAANG | Jobs | LinkedIn (company-filtered) |
| amazon | FAANG | Jobs | REST API |
| meta | FAANG | Jobs | LinkedIn (company-filtered) |
| apple | FAANG | Jobs | LinkedIn (keyword search) |
| netflix | FAANG | Jobs | LinkedIn (company-filtered) |
| microsoft | FAANG | Jobs | LinkedIn (company-filtered) |
Installation
pip install hirehunt
The primary import is hirehunt. A top-level jobhunter compatibility shim
is also packaged for existing users.
Requirements: Python 3.10+
Quick Start
Python API
from hirehunt import scrape_jobs
# Search across India's top job boards
result = scrape_jobs(
search_term="python developer",
sources=["naukri", "shine", "internshala"],
city="Bengaluru",
results_wanted=50,
)
for job in result.jobs:
print(job)
# Python Developer @ TCS | Bengaluru | naukri
# Python Developer @ Infosys | Bengaluru | shine
CLI
# India job search
hirehunt search "data scientist" --city Mumbai --source naukri --source shine
# Company-aware search
hirehunt search "software engineer" --company DRDO --source linkedin --country India
# Hackathons & competitions
hirehunt search "hackathon" --source unstop
# FAANG company jobs
hirehunt search "software engineer" --source google_careers --source amazon
# Expand a whole source family
hirehunt search "backend developer" --source-family aggregator --country India
# Benchmark a family
hirehunt benchmark "python developer" --source-family regional --country India --limit 10
# Fail on health regressions
hirehunt validate "software engineer" --source-family aggregator --country India --strict --min-parsed 1
# Export to CSV
hirehunt search "backend developer" --source naukri --source linkedin --csv jobs.csv
# Top 20 ranked results
hirehunt search "machine learning" --source naukri --source shine --top 20
Result Limits And Completion
results_wanted is a per-source policy:
results_wanted=50 # At most 50 parsed records per source
results_wanted=0 # Exhaustive mode
results_wanted=None # Exhaustive mode
Exhaustive mode continues until the source returns no further results. Some sources cannot guarantee exhaustive public search. Inspect the completion metadata rather than assuming every result is complete:
for source, stats in result.stats.items():
print(source, stats.completion, stats.completion_reason)
Completion values are exhausted, capped, partial, failed, or unknown.
Broad exhaustive searches on Naukri or Shine can require many requests.
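One way to act on the completion metadata is to flag sources that did not finish cleanly before trusting their counts. The sketch below relies only on the documented stats.completion and stats.completion_reason attributes. The helper name incomplete_sources, the SimpleNamespace stand-ins, and the reason strings are illustrative assumptions, not part of HireHunt.

```python
from types import SimpleNamespace

# Completion values from the list above that signal an incomplete result set.
INCOMPLETE = {"partial", "failed", "unknown"}


def incomplete_sources(stats_by_source):
    """Return {source: completion_reason} for sources that did not finish cleanly."""
    return {
        source: stats.completion_reason
        for source, stats in stats_by_source.items()
        if stats.completion in INCOMPLETE
    }


# Stand-ins shaped like result.stats; the reason strings are made up.
example_stats = {
    "naukri": SimpleNamespace(completion="capped", completion_reason="limit reached"),
    "linkedin": SimpleNamespace(completion="partial", completion_reason="rate limited"),
}
print(incomplete_sources(example_stats))
```

With a real run, passing result.stats in place of example_stats should work as long as the attributes match the ones documented here.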
Python API Reference
scrape_jobs()
from hirehunt import scrape_jobs
result = scrape_jobs(
search_term="python developer", # What to search
sources=["naukri", "shine"], # Which sources (list or "auto")
source_family="", # Optional family expansion, e.g. "aggregator"
company="Acme", # Optional company intent
city="Bengaluru", # City filter (optional)
location="India", # Broader location (optional)
country="India", # Country (optional)
results_wanted=50, # Max per source; None or 0 = exhaustive
dedupe_mode="strict", # "strict", "heuristic", or "none"
job_kind="job", # "job", "internship", "hackathon"
remote=None, # True = remote only
work_mode=None, # "remote", "hybrid", "onsite", "unknown"
salary_min=500000, # Min salary in INR (optional)
posted_within_days=30, # Only jobs from last N days
skills=["python", "django"], # Skill filter (optional)
experience_min=0, # Min years experience (optional)
experience_max=5, # Max years experience (optional)
request_policy={ # Optional retry/rate policy
"retries": 4,
"timeout": 25,
"backoff_base": 2,
"min_delay": 0.2,
"max_delay": 0.8,
},
)
The return value is a ScrapeResult, not a bare list:
result.jobs
result.errors
result.warnings
result.partial
result.selected_sources
result.stats
Job Object
Every source returns the same normalized Job dataclass:
@dataclass
class Job:
schema_version: ClassVar[str] # currently "1.0"
title: str
company: str
source: str
job_url: str
location: str
city: str
country: str
work_mode: WorkMode # "remote" | "hybrid" | "onsite" | "unknown"
job_kind: JobKind # "job" | "internship" | "hackathon" | "competition"
salary: Money # min_amount, max_amount, currency, period
stipend: Money
skills: list[str]
experience_min: float | None
experience_max: float | None
description: str
date_posted: str | None
deadline: str | None # for competitions/hackathons
match_score: float # 0.0 to 100.0 after ranking
Job.to_dict() includes schema_version. Additive fields may be introduced
without changing the meaning of existing fields. Breaking schema changes
require a new schema version.
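Consumers that store Job.to_dict() output may want to guard against a future breaking schema. A minimal sketch, assuming schema_version keeps a "major.minor" shape and that a breaking change would bump the major part (neither is stated above); is_supported is a hypothetical helper, not a HireHunt function.

```python
def is_supported(record, supported_major=1):
    """True when the record's schema_version has the expected major number."""
    version = str(record.get("schema_version", ""))
    # "1.0" -> major "1"; an empty or malformed version is rejected.
    major, _, _ = version.partition(".")
    return major.isdigit() and int(major) == supported_major


print(is_supported({"schema_version": "1.0", "title": "Python Developer"}))
print(is_supported({"schema_version": "2.0", "title": "Python Developer"}))
```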
Source Diagnostics
Every SourceStats includes:
stats.fetched
stats.parsed
stats.found # Backward-compatible parsed count
stats.filtered_out
stats.kept
stats.duplicates
stats.errors
stats.requests
stats.completion
stats.completion_reason
stats.filter_reasons # e.g. {"city_mismatch": 12}
If one source fails, successful source results are still returned and
result.partial is set to True.
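Because filter_reasons is a plain mapping of reason to count, the per-source values can be summed to see which filter removes the most jobs overall. This is a sketch using stand-in objects; total_filter_reasons and the "salary_below_min" key are hypothetical, and only "city_mismatch" appears in the example above.

```python
from collections import Counter
from types import SimpleNamespace


def total_filter_reasons(stats_by_source):
    """Sum every source's filter_reasons mapping into one Counter."""
    totals = Counter()
    for stats in stats_by_source.values():
        totals.update(stats.filter_reasons)
    return totals


# Stand-ins shaped like result.stats values.
example = {
    "naukri": SimpleNamespace(filter_reasons={"city_mismatch": 12}),
    "shine": SimpleNamespace(filter_reasons={"city_mismatch": 3, "salary_below_min": 5}),
}
print(total_filter_reasons(example).most_common())
```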
Source Capabilities
Sources declare supported countries, job kinds, native filters, pagination, and exhaustive-search support:
from hirehunt.registry import default_registry
registry = default_registry()
print(registry.capabilities("naukri"))
print(registry.capabilities()) # all sources
Custom scrapers declare the same contract:
from hirehunt.models import JobKind, SourceCapabilities
from hirehunt.scrapers.base import BaseScraper
class MyScraper(BaseScraper):
source = "my_source"
capabilities = SourceCapabilities(
countries=("India",),
job_kinds=(JobKind.JOB,),
supported_filters=frozenset({"city"}),
pagination=True,
exhaustive_search=True,
description="My source adapter",
)
def search(self, query):
...
Source Definitions
The registry now exposes source-family metadata for config-driven expansion:
from hirehunt.registry import default_registry
registry = default_registry()
print(registry.families()) # ['aggregator', 'company', 'opportunity', 'regional']
print(registry.family_sources("regional")) # ['internshala', 'naukri', 'shine']
print(registry.definition("linkedin")) # SourceDefinition(...)
Families are reusable framework groupings, not one-off scraper classes. New
adapters such as workday, greenhouse, or institutional can slot into the
same contract without changing SearchEngine.
Config-driven expansion uses register_configured_source() so one adapter can
back many portals:
from hirehunt.registry import ScraperRegistry
from hirehunt.scrapers.base import BaseScraper
# MyWorkdayScraper stands for your own BaseScraper subclass, defined elsewhere.
registry = ScraperRegistry()
registry.register_configured_source(
MyWorkdayScraper,
source="acme_workday",
family="workday",
aliases=("acme",),
config={"tenant": "acme", "site": "Careers"},
)
Pluggable Policies
SearchEngine accepts a SearchPolicies bundle:
from hirehunt import SearchEngine
from hirehunt.policies import SearchPolicies
from hirehunt.query import JobQuery
engine = SearchEngine(
policies=SearchPolicies(
filtering=my_filter_policy,
ranking=my_rank_policy,
deduplication=my_dedupe_policy,
)
)
query = JobQuery(search_term="backend developer", sources=["naukri", "shine"])
result = engine.search(query)
Policy contracts return FilterOutcome and DedupeOutcome, preserving
diagnostics while allowing custom behavior.
Deduplication modes available through JobQuery:
- strict: normalized URL, then source ID, then fallback identity.
- heuristic: normalized title, company, location, and country across sources.
- none: retain every parsed record.
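To make the heuristic mode concrete, the sketch below builds a cross-source key from normalized title, company, location, and country, and keeps the first record for each key. It illustrates the idea only: the normalization rules and the names _norm, heuristic_key, and dedupe are assumptions, and HireHunt's own implementation may differ.

```python
import re


def _norm(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def heuristic_key(job):
    """Identity tuple that ignores which source produced the record."""
    fields = ("title", "company", "location", "country")
    return tuple(_norm(job.get(field, "")) for field in fields)


def dedupe(jobs):
    """Keep the first job seen for each heuristic key, preserving order."""
    seen = set()
    kept = []
    for job in jobs:
        key = heuristic_key(job)
        if key not in seen:
            seen.add(key)
            kept.append(job)
    return kept


jobs = [
    {"title": "Python Developer", "company": "Acme Ltd.", "location": "Bengaluru",
     "country": "India", "source": "naukri"},
    {"title": "python  developer", "company": "ACME Ltd", "location": "Bengaluru",
     "country": "India", "source": "shine"},
    {"title": "Data Engineer", "company": "Acme Ltd.", "location": "Bengaluru",
     "country": "India", "source": "shine"},
]
print([job["source"] for job in dedupe(jobs)])
```

The second record collapses into the first because the two differ only in case, spacing, and punctuation.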
Retry And Rate Policy
from hirehunt import JobQuery, RequestPolicy
query = JobQuery(
search_term="backend developer",
request_policy=RequestPolicy(
retries=4,
timeout=25,
backoff_base=2,
min_delay=0.2,
max_delay=0.8,
),
)
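The fields above suggest exponential backoff with a randomized delay. How HireHunt combines them is not documented here, so the sketch below shows one plausible reading: wait backoff_base ** attempt seconds plus a jitter drawn between min_delay and max_delay. The function retry_delays and its formula are assumptions.

```python
import random


def retry_delays(retries=4, backoff_base=2, min_delay=0.2, max_delay=0.8, rng=random):
    """Return one wait time per retry: exponential backoff plus uniform jitter."""
    delays = []
    for attempt in range(retries):
        jitter = rng.uniform(min_delay, max_delay)
        delays.append(backoff_base ** attempt + jitter)
    return delays


# A seeded generator keeps the printed schedule reproducible.
print([round(delay, 2) for delay in retry_delays(rng=random.Random(0))])
```

Under this reading, four retries with a base of 2 wait roughly 1, 2, 4, and 8 seconds, each padded by the jitter.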
Custom Cache Backend
Pass any object implementing get(source, key) and
set(source, key, content, status_code=200):
from hirehunt import JobQuery
# my_cache is any object with the get/set methods described above.
query = JobQuery(
search_term="python",
cache_enabled=True,
cache_backend=my_cache,
)
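A dictionary-backed object is enough to satisfy the two method signatures. The sketch below is a minimal in-memory backend; the return shape of get (a dict on a hit, None on a miss) is an assumption, since the expected return value is not specified above.

```python
class MemoryCache:
    """In-memory cache exposing get(source, key) and set(source, key, content, status_code)."""

    def __init__(self):
        self._store = {}

    def get(self, source, key):
        # Assumption: a miss is signalled with None.
        return self._store.get((source, key))

    def set(self, source, key, content, status_code=200):
        self._store[(source, key)] = {"content": content, "status_code": status_code}


my_cache = MemoryCache()
my_cache.set("naukri", "python:page1", "<html>...</html>")
print(my_cache.get("naukri", "python:page1"))
print(my_cache.get("shine", "python:page1"))
```

Keying on the (source, key) pair keeps entries from different sources apart even when their keys collide.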
Export
from hirehunt import scrape_jobs
from hirehunt.exporters.csv import to_csv
from hirehunt.exporters.dataframe import to_dataframe
from hirehunt.exporters.json import to_json
result = scrape_jobs(search_term="python developer", sources=["naukri", "shine"])
to_csv(result.jobs, "jobs.csv")
to_json(result.jobs, "jobs.json")
df = to_dataframe(result.jobs)
Project Structure
hirehunt/
├── __init__.py            # scrape_jobs() entry point
├── models.py              # Job, Money, WorkMode, JobKind dataclasses
├── query.py               # JobQuery: unified search parameters
├── engine.py              # Orchestrates parallel scraping + dedup
├── registry.py            # Scraper registry + auto-source selection
├── filtering.py           # Soft filtering (salary, city, skills, date)
├── ranking.py             # Relevance scoring / match_score
├── policies.py            # Injectable framework policy contracts
├── validation.py          # Live source validation
├── exceptions.py          # Custom exceptions
├── cli.py                 # `jobhunter` CLI entry point
│
├── scrapers/
│   ├── base.py            # BaseScraper ABC
│   ├── naukri.py          # 🇮🇳 Naukri: /jobapi/v2/search REST API
│   ├── shine.py           # 🇮🇳 Shine: __NEXT_DATA__ SSR JSON
│   ├── internshala.py     # 🇮🇳 Internshala: HTML + pagination
│   ├── unstop.py          # 🇮🇳 Unstop: hackathons REST API
│   ├── linkedin.py        # LinkedIn: guest HTML API
│   ├── indeed.py          # Indeed: GraphQL API
│   └── faang.py           # Google, Amazon, Meta, Apple, Netflix, Microsoft
│
├── exporters/
│   ├── csv_exporter.py
│   ├── json_exporter.py
│   └── dataframe.py
│
└── utils/
    ├── fetchers.py        # CachedFetcher with proxy + backend support
    └── normalization.py   # clean_text, parse_money, normalize_city, ...
tests/
Source Details
🇮🇳 Naukri
- Endpoint: GET https://www.naukri.com/jobapi/v2/search
- Auth: Session cookies from page warm-up (automatic)
- Fields: Title, company, salary (LPA), location, skills, experience, date
- Pagination: pageNo=N, 20 results/page, 3,000+ pages available
🇮🇳 Shine
- Endpoint: __NEXT_DATA__ SSR JSON embedded in HTML
- Fields: jJT (title), jCName (company), jSal (salary), jLoc (location), jKwd (skills), jPDate (date), jSlug (URL)
- Pagination: path suffix -N, 20 results/page
🇮🇳 Internshala
- Endpoint: HTML scraping, selector div[id^='individual_internship_'][internshipid]
- Pagination: ?page=N, 40+ cards/page
- City filter: current SEO routes, e.g. /internships/python-internship-in-bangalore/
🇮🇳 Unstop
- Endpoint: GET https://unstop.com/api/public/opportunity/search-result
- Note: Returns hackathons, coding competitions, and challenges only
- Fields: Title, organisation, skills, location, deadline, prize
Indeed
- Endpoint: POST https://apis.indeed.com/graphql
- Auth: Public API key (included)
- Pagination: Cursor-based
LinkedIn
- Endpoint: GET https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search
- Auth: None (guest API)
- FAANG filter: f_C company ID parameter
Amazon
- Endpoint: GET https://www.amazon.jobs/en/search.json
- Auth: None (public REST API)
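The page sizes listed for Naukri, Shine, and Internshala give a rough way to estimate how many page requests a search needs. The sketch below assumes one request per page and ignores warm-up requests and retries; PAGE_SIZE and pages_needed are illustrative names, and Internshala's 40 is the documented lower bound per page.

```python
import math

# Results per page as listed in the source details above.
PAGE_SIZE = {"naukri": 20, "shine": 20, "internshala": 40}


def pages_needed(source, results_wanted):
    """Estimate page requests needed to collect results_wanted records."""
    return math.ceil(results_wanted / PAGE_SIZE[source])


print(pages_needed("naukri", 50))      # 3
print(pages_needed("naukri", 15_000))  # 750
```

At 20 results per page, walking all 15,000+ Naukri listings takes at least 750 requests, which is why exhaustive mode on broad queries is expensive.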
Filtering
Most structured-data filters are soft: missing salary, skills, experience, or location data does not automatically remove a job. Explicit remote and posting date filters are strict.
result = scrape_jobs(
"python developer",
sources=["naukri", "shine"],
salary_min=600_000, # Only applied if salary data exists
city="Bengaluru", # Only applied if location data exists
skills=["python", "sql"], # Only applied if skills data exists
posted_within_days=14, # Missing or invalid dates are removed
)
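The difference between soft and strict filters comes down to how missing data is treated. The sketch below illustrates it with plain dictionaries: the salary check keeps records without salary data, while the date check rejects missing or unparseable dates. The flat salary_max field, ISO date strings, and both function names are simplifying assumptions; the real Job model uses a Money object.

```python
from datetime import date, timedelta


def passes_salary(job, salary_min):
    """Soft filter: a job with no salary data is kept."""
    salary = job.get("salary_max")
    return salary is None or salary >= salary_min


def passes_posted_within(job, days, today):
    """Strict filter: a missing or invalid date removes the job."""
    try:
        posted = date.fromisoformat(job.get("date_posted"))
    except (TypeError, ValueError):
        return False
    return today - posted <= timedelta(days=days)


today = date(2024, 6, 15)
jobs = [
    {"title": "A", "salary_max": 900_000, "date_posted": "2024-06-10"},
    {"title": "B", "salary_max": None, "date_posted": "2024-06-12"},     # kept: no salary data
    {"title": "C", "salary_max": 400_000, "date_posted": "2024-06-12"},  # removed: below minimum
    {"title": "D", "salary_max": 900_000, "date_posted": None},          # removed: no date
]
kept = [
    job["title"]
    for job in jobs
    if passes_salary(job, 600_000) and passes_posted_within(job, 14, today)
]
print(kept)  # ['A', 'B']
```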
Advanced Usage
FAANG-only search
from hirehunt import scrape_jobs
from hirehunt.registry import default_registry
registry = default_registry()
faang = registry.faang_sources() # ['google_careers', 'amazon', 'meta', 'apple', 'netflix', 'microsoft']
result = scrape_jobs(
search_term="software engineer",
sources=faang,
results_wanted=20,
)
Parallel scraping with custom config
result = scrape_jobs(
search_term="backend developer",
sources=["naukri", "shine", "linkedin"],
city="Hyderabad",
results_wanted=100,
posted_within_days=7,
cache_enabled=True, # Cache responses locally
proxies=["http://..."], # Optional proxy list
)
Auto-source selection
# Automatically picks India job sources when country="India"
result = scrape_jobs(
search_term="python developer",
country="India",
sources="auto", # -> [indeed, linkedin, internshala, naukri, shine]
)
Opportunity terms such as hackathon, competition, or challenge
automatically add Unstop.
Testing And Validation
pip install -e .
python -m unittest discover -s tests -v
hirehunt validate "software developer" --city Bengaluru --country India
hirehunt benchmark "software developer" --source-family aggregator --country India --limit 5
Parser contracts use sanitized fixtures under tests/fixtures. Live validation
is separate because remote sites can block, rate-limit, or change independently
of deterministic unit tests.
CI runs the deterministic suite across Python 3.10-3.13 in
.github/workflows/ci.yml. Scheduled source monitoring and parser-drift alerts
run through .github/workflows/source-health.yml, which executes validate and
benchmark for each source family and uploads the resulting JSON reports.
.github/workflows/publish.yml gates release publication on tests, successful
builds, and a wheel-install smoke run.
Compatibility
Existing public fields and calls remain supported:
- scrape_jobs(**kwargs) and search_jobs(**kwargs).
- result.jobs, result.errors, result.stats, and result.warnings.
- SourceStats.found, kept, duplicates, and errors.
- filter_jobs, rank_jobs, and deduplicate_jobs.
New metadata and policy APIs are additive.
License
MIT
File details
Details for the file hirehunt-0.4.0.tar.gz.
File metadata
- Download URL: hirehunt-0.4.0.tar.gz
- Upload date:
- Size: 56.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 358cca46dce04f5911a0de81675afb3601d8c6fc97e7656f698285be4aa341d7 |
| MD5 | 30d0aece48318243251636ad6f3db477 |
| BLAKE2b-256 | 560979b4c765f0f317be069c1ba7ef39f270b31368f741d8c87f579c0ab2e92c |
File details
Details for the file hirehunt-0.4.0-py3-none-any.whl.
File metadata
- Download URL: hirehunt-0.4.0-py3-none-any.whl
- Upload date:
- Size: 58.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9d886a21b085dcb361eb4c38e46e28a03799eadfbd3f1bc02cb81745650e77cf |
| MD5 | 355480cd820de4f354aa9928ae5e82bd |
| BLAKE2b-256 | 7f584d869d5438474536c15f47d2a8b7420377045c4c4561828d16c4a6b7704b |