
HonestRoles

Clean, filter, label, and rate job description data using heuristics and local LLMs.

HonestRoles is a Python package designed to transform raw job posting data into structured, scored, and searchable datasets. It provides a modular pipeline for normalization, high-performance filtering, and automated labeling using both traditional heuristics and local LLMs (Ollama).

Features

  • 🧹 Clean: HTML stripping, location normalization (city/region/country), salary parsing, and record deduplication.
  • 🔍 Filter: High-performance FilterChain with predicates for location, salary, skills, and keyword matching.
  • 🏷️ Label: Automated seniority detection, role categorization, and tech stack extraction.
  • ⭐️ Rate: Comprehensive job description scoring for completeness and quality.
  • 🤖 LLM Integration: plug local Ollama models (e.g., Llama 3) into labeling and rating for deep semantic analysis.

Installation

pip install honestroles

For development:

git clone https://github.com/hypertrial/honestroles.git
cd honestroles
pip install -e ".[dev]"

Quickstart

import honestroles as hr
from honestroles import schema

# Load raw job data (Parquet or DuckDB)
df = hr.read_parquet("jobs_current.parquet")

# 1. Clean and normalize data
df = hr.clean_jobs(df)

# 2. Apply complex filtering
chain = hr.FilterChain()
chain.add(hr.filter.by_location, regions=["California", "New York"])
chain.add(hr.filter.by_salary, min_salary=120_000, currency="USD")
chain.add(hr.filter.by_skills, required=["Python", "React"])
df = chain.apply(df)

# 3. Label roles (Heuristics + LLM)
df = hr.label_jobs(df, use_llm=True, model="llama3")

# 4. Rate job quality
df = hr.rate_jobs(df)

# Access data using schema constants
print(df[[schema.TITLE, schema.CITY, schema.COUNTRY]].head())

# Save structured results
hr.write_parquet(df, "jobs_scored.parquet")
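
The reader above targets Parquet. For DuckDB sources, one option is to pull a DataFrame out of DuckDB directly and feed it into the same pipeline; a minimal sketch using the duckdb Python package (the database path and table name here are assumptions):

import duckdb
import honestroles as hr

# Load from a DuckDB database instead of Parquet (hypothetical path/table);
# duckdb's .df() returns a pandas DataFrame the pipeline can consume.
con = duckdb.connect("jobs.duckdb")
df = con.execute("SELECT * FROM jobs_current").df()

df = hr.clean_jobs(df)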

Contract-First Flow

For raw source data, apply contract normalization and validation before any other processing:

import honestroles as hr

df = hr.read_parquet("jobs_current.parquet", validate=False)
df = hr.normalize_source_data_contract(df)
df = hr.validate_source_data_contract(df)

df = hr.clean_jobs(df)
df = hr.filter_jobs(df, remote_only=True)
df = hr.label_jobs(df, use_llm=False)
df = hr.rate_jobs(df, use_llm=False)
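
If the contract check fails, it is worth surfacing the error before the rest of the pipeline runs. A minimal guard, assuming validate_source_data_contract raises an exception on contract violations (the exact exception type is not specified here):

import honestroles as hr

df = hr.read_parquet("jobs_current.parquet", validate=False)
df = hr.normalize_source_data_contract(df)
try:
    df = hr.validate_source_data_contract(df)
except Exception as exc:  # assumed failure mode; see the contract docs below
    raise SystemExit(f"source data violates contract: {exc}")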

See /docs/quickstart_contract.md and /docs/source_data_contract_v1.md.

Documentation index: /docs/index.md.

Core Modules

Schema Constants

Always use honestroles.schema for consistent column referencing:

from honestroles import schema

# Available constants:
# schema.TITLE, schema.DESCRIPTION_TEXT, schema.COMPANY
# schema.CITY, schema.REGION, schema.COUNTRY
# schema.SALARY_MIN, schema.SALARY_MAX, etc.
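
For example, the constants drop straight into DataFrame indexing, so a renamed column only needs updating in one place (a brief sketch; df is assumed to be a pandas DataFrame, as in the Quickstart):

from honestroles import schema

# Build a compact summary view without hard-coding column strings
summary = df[[schema.TITLE, schema.COMPANY, schema.SALARY_MIN, schema.SALARY_MAX]]
top_paying = summary.sort_values(schema.SALARY_MAX, ascending=False).head(10)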

Filtering with FilterChain

The FilterChain allows you to compose multiple filtering rules efficiently:

import honestroles as hr
from honestroles import FilterChain, filter_jobs, schema

# Functional approach:
df = filter_jobs(df, remote_only=True, min_salary=100_000)

# Composable approach:
chain = FilterChain()
chain.add(hr.filter.by_keywords, include=["Engineer"], exclude=["Manager"])
chain.add(hr.filter.by_completeness, required_fields=[schema.DESCRIPTION_TEXT, schema.APPLY_URL])
filtered_df = chain.apply(df)
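
The built-in predicates are passed to chain.add as plain callables with bound keyword arguments. If custom predicates follow the same convention, writing one could look like this (the predicate signature, taking the DataFrame plus the bound kwargs and returning the filtered frame, is an assumption rather than documented API):

import honestroles as hr
from honestroles import schema

# Hypothetical predicate: keep rows whose title mentions any given technology
def by_title_tech(df, techs):
    pattern = "|".join(techs)
    return df[df[schema.TITLE].str.contains(pattern, case=False, na=False)]

chain = hr.FilterChain()
chain.add(by_title_tech, techs=["Rust", "Go"])
df = chain.apply(df)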

Local LLM Usage (Ollama)

Ensure Ollama is running locally:

ollama serve
ollama pull llama3

Then enable LLM-based labeling or quality rating:

df = hr.label_jobs(df, use_llm=True, model="llama3")
df = hr.rate_jobs(df, use_llm=True, model="llama3")
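
If the server is not reachable, LLM-backed calls will fail, so it can help to probe Ollama's default HTTP endpoint (localhost:11434, an Ollama default rather than an HonestRoles setting) and fall back to heuristics. A small sketch:

import urllib.request

import honestroles as hr

def ollama_available(url="http://localhost:11434"):
    # Ollama answers a plain GET on / with a 200 when the server is up
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except OSError:
        return False

use_llm = ollama_available()
df = hr.label_jobs(df, use_llm=use_llm, model="llama3")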

Package Layout

src/honestroles/
├── clean/        # HTML stripping, normalization, and dedup
├── filter/       # Composed FilterChain and predicates
├── io/           # Parquet and DuckDB I/O with validation
├── label/        # Seniority, Category, and Tech Stack labeling
├── llm/          # Ollama client and prompt templates
├── rate/         # Completeness, Quality, and Composite ratings
└── schema.py     # Centralized column name constants

Testing

Run the test suite with pytest:

pytest

Stability

  • Changelog: /CHANGELOG.md
  • Performance guardrails: /docs/performance.md
  • Docs index: /docs/index.md
