A programmable job-scraping framework for India & global markets. Aggregates Naukri, Shine, Internshala, LinkedIn, Indeed, and FAANG companies into a unified dataset.

🎯 HireHunt

A programmable job-scraping framework for India & global markets.
Aggregate jobs from 12 sources, including Naukri, Internshala, Shine, LinkedIn, Indeed, and FAANG companies, into a unified, filterable, ranked dataset.


✨ Sources

| Source | Region | Type | Method |
|---|---|---|---|
| naukri | 🇮🇳 India | Jobs | REST API (15,000+ listings) |
| shine | 🇮🇳 India | Jobs | SSR JSON (17,000+ listings) |
| internshala | 🇮🇳 India | Internships / Jobs | HTML scraping |
| unstop | 🇮🇳 India | Hackathons / Competitions | REST API |
| linkedin | 🌍 Global | Jobs | Guest HTML API |
| indeed | 🌍 Global | Jobs | GraphQL API |
| google_careers | 🌍 FAANG | Jobs | LinkedIn (company-filtered) |
| amazon | 🌍 FAANG | Jobs | REST API |
| meta | 🌍 FAANG | Jobs | LinkedIn (company-filtered) |
| apple | 🌍 FAANG | Jobs | LinkedIn (keyword search) |
| netflix | 🌍 FAANG | Jobs | LinkedIn (company-filtered) |
| microsoft | 🌍 FAANG | Jobs | LinkedIn (company-filtered) |

📦 Installation

pip install hirehunt

Note: The PyPI package is hirehunt. The import name is jobhunter.

import jobhunter   # ← this is correct after pip install hirehunt

Requirements: Python 3.10+


⚡ Quick Start

Python API

from jobhunter import scrape_jobs

# Search across India's top job boards
jobs = scrape_jobs(
    search_term="python developer",
    sources=["naukri", "shine", "internshala"],
    city="Bengaluru",
    results_wanted=50,
)

for job in jobs:
    print(job)
# Python Developer @ TCS | Bengaluru | naukri
# Python Developer @ Infosys | Bengaluru | shine

CLI

# India job search
jobhunter search "data scientist" --city Mumbai --sources naukri,shine

# Hackathons & competitions
jobhunter search "hackathon" --sources unstop

# FAANG company jobs
jobhunter search "software engineer" --sources google_careers,amazon,netflix

# Export to CSV
jobhunter search "backend developer" --sources naukri,linkedin --output jobs.csv

# Top 20 ranked results
jobhunter search "machine learning" --sources naukri,shine,linkedin --top 20
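
The --top option keeps only the highest-ranked results. A minimal sketch of that selection step, assuming each job carries a match_score between 0.0 and 1.0; the top_n helper and the dict-based jobs are illustrative and are not the CLI's actual implementation:

```python
import heapq


def top_n(jobs: list[dict], n: int) -> list[dict]:
    """Return the n jobs with the highest match_score, best first."""
    return heapq.nlargest(n, jobs, key=lambda job: job["match_score"])


# Hypothetical sample data, not real scraper output.
sample = [
    {"title": "ML Engineer", "match_score": 0.91},
    {"title": "Data Analyst", "match_score": 0.42},
    {"title": "ML Researcher", "match_score": 0.77},
]

best = top_n(sample, 2)
print([job["title"] for job in best])  # ['ML Engineer', 'ML Researcher']
```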

🔧 Python API Reference

scrape_jobs()

from jobhunter import scrape_jobs

jobs = scrape_jobs(
    search_term="python developer",   # What to search
    sources=["naukri", "shine"],      # Which sources (list or "auto")
    city="Bengaluru",                 # City filter (optional)
    location="India",                 # Broader location (optional)
    country="India",                  # Country (optional)
    results_wanted=50,                # Max results per source
    job_kind="job",                   # "job", "internship", "hackathon"
    remote=None,                      # True = remote only
    salary_min=500000,                # Min salary in INR (optional)
    posted_within_days=30,            # Only jobs from last N days
    skills=["python", "django"],      # Skill filter (optional)
    experience_min=0,                 # Min years experience (optional)
    experience_max=5,                 # Max years experience (optional)
)
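
salary_min is given in INR, while Indian listings such as Naukri's quote salary in LPA (lakhs per annum). The arithmetic is simple: 1 lakh is 100,000, so salary_min=500000 corresponds to 5 LPA. The helper names below are hypothetical and only illustrate the conversion; they are not part of jobhunter:

```python
LAKH = 100_000  # 1 lakh = 100,000 INR


def lpa_to_inr(lpa: float) -> int:
    """Convert a salary in lakhs per annum to INR per year."""
    return round(lpa * LAKH)


def inr_to_lpa(amount_inr: float) -> float:
    """Convert a yearly INR amount to lakhs per annum."""
    return amount_inr / LAKH


print(lpa_to_inr(5))        # 500000
print(inr_to_lpa(600_000))  # 6.0
```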

Job Object

Every source returns the same normalized Job dataclass:

@dataclass
class Job:
    title: str
    company: str
    source: str
    job_url: str

    location: str
    city: str
    country: str
    work_mode: WorkMode         # "remote" | "hybrid" | "onsite" | "unknown"
    job_kind: JobKind           # "job" | "internship" | "hackathon" | "competition"

    salary: Money               # min_amount, max_amount, currency, period
    stipend: Money

    skills: list[str]
    experience_min: float | None
    experience_max: float | None
    description: str
    date_posted: str | None
    deadline: str | None        # for competitions/hackathons

    match_score: float          # 0.0–1.0 after ranking
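
To make the idea of a normalized record concrete, here is a reduced, self-contained sketch. MiniJob is a hypothetical stand-in with only a few of the fields above, and its __str__ format is inferred from the Quick Start output comments rather than taken from the library's source:

```python
from dataclasses import dataclass, field


@dataclass
class MiniJob:
    """Hypothetical, reduced stand-in for the Job dataclass."""

    title: str
    company: str
    source: str
    job_url: str
    city: str = ""
    skills: list[str] = field(default_factory=list)
    match_score: float = 0.0

    def __str__(self) -> str:
        # Assumed display format: "title @ company | city | source".
        return f"{self.title} @ {self.company} | {self.city} | {self.source}"


job = MiniJob(
    title="Python Developer",
    company="TCS",
    source="naukri",
    job_url="https://example.com/job/1",  # placeholder URL
    city="Bengaluru",
)
print(job)  # Python Developer @ TCS | Bengaluru | naukri
```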

Export

from jobhunter import scrape_jobs
from jobhunter.exporters import to_csv, to_json, to_dataframe

jobs = scrape_jobs("python developer", sources=["naukri", "shine"])

to_csv(jobs, "jobs.csv")
to_json(jobs, "jobs.json")
df = to_dataframe(jobs)   # pandas DataFrame
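
For readers curious how a dataclass list can be flattened to CSV, the following is a minimal standard-library sketch. Row and rows_to_csv_text are hypothetical names, and joining list fields with ";" is an assumption; the package's own to_csv may behave differently:

```python
import csv
import io
from dataclasses import asdict, dataclass, field, fields


@dataclass
class Row:
    """Hypothetical record with one list-valued field."""

    title: str
    company: str
    source: str
    skills: list[str] = field(default_factory=list)


def rows_to_csv_text(rows: list[Row]) -> str:
    """Serialize rows to CSV text, one column per dataclass field."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=[f.name for f in fields(Row)],
        lineterminator="\n",
    )
    writer.writeheader()
    for row in rows:
        record = asdict(row)
        # Lists do not map to a single CSV cell, so join them (assumed convention).
        record["skills"] = ";".join(record["skills"])
        writer.writerow(record)
    return buffer.getvalue()


text = rows_to_csv_text([Row("Python Developer", "TCS", "naukri", ["python", "django"])])
print(text)
```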

๐Ÿ—๏ธ Project Structure

jobhunter/
├── __init__.py          # scrape_jobs() entry point
├── models.py            # Job, Money, WorkMode, JobKind dataclasses
├── query.py             # JobQuery: unified search parameters
├── engine.py            # Orchestrates parallel scraping + dedup
├── registry.py          # Scraper registry + auto-source selection
├── filtering.py         # Soft filtering (salary, city, skills, date)
├── ranking.py           # Relevance scoring / match_score
├── validation.py        # Input validation
├── exceptions.py        # Custom exceptions
├── cli.py               # `jobhunter` CLI entry point
│
├── scrapers/
│   ├── base.py          # BaseScraper ABC
│   ├── naukri.py        # 🇮🇳 Naukri: /jobapi/v2/search REST API
│   ├── shine.py         # 🇮🇳 Shine: __NEXT_DATA__ SSR JSON
│   ├── internshala.py   # 🇮🇳 Internshala: HTML + pagination
│   ├── unstop.py        # 🇮🇳 Unstop: hackathons REST API
│   ├── linkedin.py      # 🌍 LinkedIn: guest HTML API
│   ├── indeed.py        # 🌍 Indeed: GraphQL API
│   └── faang.py         # 🌍 Google, Amazon, Meta, Apple, Netflix, Microsoft
│
├── exporters/
│   ├── csv_exporter.py
│   ├── json_exporter.py
│   └── dataframe.py
│
└── utils/
    ├── fetchers.py      # CachedFetcher with proxy + backend support
    └── normalization.py # clean_text, parse_money, normalize_city, ...

tests/
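
The engine is described as orchestrating parallel scraping plus deduplication. A minimal sketch of that pattern, using a thread pool and stand-in scraper functions; the dedup key (normalized title and company) and all function names here are assumptions, not the logic in engine.py:

```python
from concurrent.futures import ThreadPoolExecutor


def fake_naukri(term: str) -> list[dict]:
    """Stand-in scraper returning canned data."""
    return [{"title": "Python Developer", "company": "TCS", "source": "naukri"}]


def fake_shine(term: str) -> list[dict]:
    """Stand-in scraper whose first result duplicates the Naukri listing."""
    return [
        {"title": "python developer ", "company": "tcs", "source": "shine"},
        {"title": "Data Engineer", "company": "Infosys", "source": "shine"},
    ]


def scrape_all(term: str, scrapers: list) -> list[dict]:
    """Run every scraper concurrently, then drop duplicate listings."""
    with ThreadPoolExecutor(max_workers=max(1, len(scrapers))) as pool:
        # pool.map yields results in the order of the input scrapers.
        batches = list(pool.map(lambda scraper: scraper(term), scrapers))

    seen: set[tuple[str, str]] = set()
    unique: list[dict] = []
    for batch in batches:
        for job in batch:
            key = (job["title"].strip().lower(), job["company"].strip().lower())
            if key not in seen:
                seen.add(key)
                unique.append(job)
    return unique


results = scrape_all("python", [fake_naukri, fake_shine])
print([(job["title"], job["source"]) for job in results])
```

Because results are collected in scraper order, the first source listed wins when two sources report the same job.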

🔍 Source Details

🇮🇳 Naukri

  • Endpoint: GET https://www.naukri.com/jobapi/v2/search
  • Auth: Session cookies from page warm-up (automatic)
  • Fields: Title, company, salary (LPA), location, skills, experience, date
  • Pagination: pageNo=N, 20 results/page, 3,000+ pages available
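
With 20 results per page, the number of requests follows directly from results_wanted. A small sketch of that arithmetic; it assumes pageNo starts at 1, and it omits every other query parameter the real endpoint needs, so the URLs are illustrative only:

```python
import math
from urllib.parse import urlencode

ENDPOINT = "https://www.naukri.com/jobapi/v2/search"
PAGE_SIZE = 20  # results per page, as stated above


def pages_needed(results_wanted: int, page_size: int = PAGE_SIZE) -> int:
    """Number of pages required to collect results_wanted listings."""
    return math.ceil(results_wanted / page_size)


def page_urls(results_wanted: int) -> list[str]:
    """URLs for each page; assumes 1-based pageNo and no other parameters."""
    return [
        f"{ENDPOINT}?{urlencode({'pageNo': page})}"
        for page in range(1, pages_needed(results_wanted) + 1)
    ]


print(pages_needed(50))   # 3
print(page_urls(50)[-1])  # https://www.naukri.com/jobapi/v2/search?pageNo=3
```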

🇮🇳 Shine

  • Endpoint: __NEXT_DATA__ SSR JSON embedded in HTML
  • Fields: jJT (title), jCName (company), jSal (salary), jLoc (location), jKwd (skills), jPDate (date), jSlug (URL)
  • Pagination: ?page=N, 20 results/page, 900+ pages
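
Pages built with Next.js embed their server-side data in a script tag with id __NEXT_DATA__. The sketch below shows one way to pull that JSON out and map the field codes listed above to readable names. The sample HTML and the props.pageProps.jobs path are assumptions made for illustration; the real page layout may differ:

```python
import json
import re

NEXT_DATA_RE = re.compile(
    r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', re.DOTALL
)


def extract_next_data(html: str) -> dict:
    """Return the parsed JSON payload embedded in the page."""
    match = NEXT_DATA_RE.search(html)
    if match is None:
        raise ValueError("no __NEXT_DATA__ script tag found")
    return json.loads(match.group(1))


def normalize(raw: dict) -> dict:
    """Map Shine's short field codes to descriptive keys."""
    return {
        "title": raw.get("jJT", ""),
        "company": raw.get("jCName", ""),
        "salary": raw.get("jSal", ""),
        "location": raw.get("jLoc", ""),
        "skills": raw.get("jKwd", ""),
        "date_posted": raw.get("jPDate"),
        "slug": raw.get("jSlug", ""),
    }


# Hypothetical page content; the "jobs" key is an assumed location.
SAMPLE_HTML = (
    '<html><body><script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"pageProps": {"jobs": ['
    '{"jJT": "Python Developer", "jCName": "Infosys", "jLoc": "Bengaluru"}'
    "]}}}"
    "</script></body></html>"
)

data = extract_next_data(SAMPLE_HTML)
jobs = [normalize(raw) for raw in data["props"]["pageProps"]["jobs"]]
print(jobs[0]["title"], "@", jobs[0]["company"])  # Python Developer @ Infosys
```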

🇮🇳 Internshala

  • Endpoint: HTML scraping, selector div[id^='individual_internship_'][internshipid]
  • Pagination: ?page=N, 40+ cards/page
  • City filter: URL slug e.g. /internships/python-intern-in-bengaluru/
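
The city filter is encoded in the URL path rather than in a query parameter. A sketch of how such a slug could be assembled; how the search term maps to the "python-intern" portion is not documented here, so the helper takes the full phrase as input, and both function names are hypothetical:

```python
import re
from typing import Optional


def slugify(text: str) -> str:
    """Lowercase the text and join alphanumeric runs with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def internship_path(phrase: str, city: Optional[str] = None) -> str:
    """Build a path like /internships/python-intern-in-bengaluru/."""
    slug = slugify(phrase)
    if city:
        slug = f"{slug}-in-{slugify(city)}"
    return f"/internships/{slug}/"


print(internship_path("python intern", "Bengaluru"))
# /internships/python-intern-in-bengaluru/
```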

🇮🇳 Unstop

  • Endpoint: GET https://unstop.com/api/public/opportunity/search-result
  • Note: Returns hackathons, coding competitions, and challenges only
  • Fields: Title, organisation, skills, location, deadline, prize

🌍 Indeed

  • Endpoint: POST https://apis.indeed.com/graphql
  • Auth: Public API key (included)
  • Pagination: Cursor-based
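
Cursor-based pagination means each response carries a token for the next page instead of a page number. A generic sketch of the loop, using a fake in-memory fetcher; the page shape (items plus next cursor) is an assumption and does not reflect Indeed's actual GraphQL schema:

```python
from typing import Callable, Optional

# A page is (items, next_cursor); next_cursor is None on the last page.
Page = tuple[list[str], Optional[str]]


def fetch_all(fetch_page: Callable[[Optional[str]], Page], limit: int) -> list[str]:
    """Follow cursors until limit items are collected or pages run out."""
    results: list[str] = []
    cursor: Optional[str] = None
    while len(results) < limit:
        items, cursor = fetch_page(cursor)
        results.extend(items)
        if cursor is None or not items:
            break
    return results[:limit]


# Fake backend for demonstration only.
FAKE_PAGES: dict[Optional[str], Page] = {
    None: (["job-1", "job-2"], "c1"),
    "c1": (["job-3", "job-4"], "c2"),
    "c2": (["job-5"], None),
}


def fake_fetch(cursor: Optional[str]) -> Page:
    return FAKE_PAGES[cursor]


print(fetch_all(fake_fetch, 3))  # ['job-1', 'job-2', 'job-3']
```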

🌍 LinkedIn

  • Endpoint: GET https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search
  • Auth: None (guest API)
  • FAANG filter: f_C company ID parameter
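
Company filtering works by adding the f_C parameter to the guest search URL. A sketch of the URL construction; the keywords and start parameter names are assumptions about the guest endpoint, and "0000" is a placeholder rather than a real company ID:

```python
from typing import Optional
from urllib.parse import urlencode

GUEST_SEARCH_URL = (
    "https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search"
)


def build_search_url(
    keywords: str, company_id: Optional[str] = None, start: int = 0
) -> str:
    """Build a guest search URL, optionally restricted to one company."""
    params = {"keywords": keywords, "start": start}  # assumed parameter names
    if company_id is not None:
        params["f_C"] = company_id  # company filter named in the list above
    return f"{GUEST_SEARCH_URL}?{urlencode(params)}"


print(build_search_url("software engineer", company_id="0000"))
```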

🌍 Amazon

  • Endpoint: GET https://www.amazon.jobs/en/search.json
  • Auth: None (public REST API)

⚙️ Filtering

Filters are soft by default, so jobs missing a field pass through rather than being dropped:

jobs = scrape_jobs(
    "python developer",
    sources=["naukri", "shine"],
    salary_min=600_000,        # Only applied if salary data exists
    city="Bengaluru",          # Only applied if location data exists
    skills=["python", "sql"],  # Only applied if skills data exists
    posted_within_days=14,     # Only applied if date data exists
)
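
The soft-filter rule can be sketched as follows: a criterion rejects a job only when the job actually has data for that field and the data fails the test. The dict-based jobs and the single "salary" field are simplifications for illustration, not the logic in filtering.py:

```python
from typing import Optional


def soft_filter(
    jobs: list[dict],
    salary_min: Optional[int] = None,
    city: Optional[str] = None,
) -> list[dict]:
    """Keep jobs that match, plus jobs lacking the data needed to decide."""
    kept = []
    for job in jobs:
        salary = job.get("salary")
        job_city = job.get("city")
        # Reject on salary only when a salary is present and too low.
        if salary_min is not None and salary is not None and salary < salary_min:
            continue
        # Reject on city only when a city is present and different.
        if city is not None and job_city and job_city.lower() != city.lower():
            continue
        kept.append(job)
    return kept


sample_jobs = [
    {"title": "A", "salary": 700_000, "city": "Bengaluru"},
    {"title": "B", "salary": 400_000, "city": "Bengaluru"},
    {"title": "C", "salary": None, "city": None},  # no data, passes through
    {"title": "D", "salary": 900_000, "city": "Mumbai"},
]

kept = soft_filter(sample_jobs, salary_min=600_000, city="Bengaluru")
print([job["title"] for job in kept])  # ['A', 'C']
```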

🚀 Advanced Usage

FAANG-only search

from jobhunter import scrape_jobs
from jobhunter.registry import default_registry

registry = default_registry()
faang = registry.faang_sources()  # ['google_careers', 'amazon', 'meta', 'apple', 'netflix', 'microsoft']

jobs = scrape_jobs(
    search_term="software engineer",
    sources=faang,
    results_wanted=20,
)

Parallel scraping with custom config

jobs = scrape_jobs(
    search_term="backend developer",
    sources=["naukri", "shine", "linkedin"],
    city="Hyderabad",
    results_wanted=100,
    posted_within_days=7,
    cache_enabled=True,        # Cache responses locally
    proxies=["http://..."],    # Optional proxy list
)

Auto-source selection

# Automatically picks India sources when country="India"
jobs = scrape_jobs(
    search_term="python developer",
    country="India",
    sources="auto",  # → [indeed, linkedin, internshala, naukri, shine, unstop]
)
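
One plausible way to implement this kind of selection is a lookup that combines global sources with country-specific ones. The tables and the auto_sources function below are hypothetical, chosen so that the India case reproduces the list shown above; the behavior for other countries is an assumption:

```python
from typing import Optional

# Hypothetical lookup tables, not the contents of registry.py.
GLOBAL_SOURCES = ["indeed", "linkedin"]
COUNTRY_SOURCES = {
    "india": ["internshala", "naukri", "shine", "unstop"],
}


def auto_sources(country: Optional[str] = None) -> list[str]:
    """Return global sources plus any sources registered for the country."""
    key = (country or "").strip().lower()
    return GLOBAL_SOURCES + COUNTRY_SOURCES.get(key, [])


print(auto_sources("India"))
# ['indeed', 'linkedin', 'internshala', 'naukri', 'shine', 'unstop']
```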

🧪 Running Tests

pip install -e .
pytest tests/

📄 License

MIT
