# HonestRoles

Clean, filter, label, and rate job description data using heuristics and local LLMs.

HonestRoles is a Python package designed to transform raw job posting data into structured, scored, and searchable datasets. It provides a modular pipeline for normalization, high-performance filtering, and automated labeling using both traditional heuristics and local LLMs (via Ollama).
## Features

- 🧹 Clean: HTML stripping, location normalization (city/region/country), salary parsing, and record deduplication.
- 🔍 Filter: High-performance `FilterChain` with predicates for location, salary, skills, and keyword matching.
- 🏷️ Label: Automated seniority detection, role categorization, and tech stack extraction.
- ⭐️ Rate: Comprehensive job description scoring for completeness and quality.
- 🤖 LLM Integration: Seamless integration with local Ollama models (e.g., Llama 3) for deep semantic analysis.
## Installation

```bash
pip install honestroles
```

For development:

```bash
git clone https://github.com/hypertrial/honestroles.git
cd honestroles
pip install -e ".[dev]"
```
## Quickstart

```python
import honestroles as hr
from honestroles import schema

# Load raw job data (Parquet or DuckDB)
df = hr.read_parquet("jobs_current.parquet")

# 1. Clean and normalize data
df = hr.clean_jobs(df)

# 2. Apply complex filtering
chain = hr.FilterChain()
chain.add(hr.filter.by_location, regions=["California", "New York"])
chain.add(hr.filter.by_salary, min_salary=120_000, currency="USD")
chain.add(hr.filter.by_skills, required=["Python", "React"])
df = chain.apply(df)

# 3. Label roles (heuristics + LLM)
df = hr.label_jobs(df, use_llm=True, model="llama3")

# 4. Rate job quality
df = hr.rate_jobs(df)

# Access data using schema constants
print(df[[schema.TITLE, schema.CITY, schema.COUNTRY]].head())

# Save structured results
hr.write_parquet(df, "jobs_scored.parquet")
```
## Contract-First Flow

For source data, apply contract normalization and validation before processing:

```python
import honestroles as hr

df = hr.read_parquet("jobs_current.parquet", validate=False)
df = hr.normalize_source_data_contract(df)
df = hr.validate_source_data_contract(df)

df = hr.clean_jobs(df)
df = hr.filter_jobs(df, remote_only=True)
df = hr.label_jobs(df, use_llm=False)
df = hr.rate_jobs(df, use_llm=False)
```

See `/docs/quickstart_contract.md` and `/docs/source_data_contract_v1.md`. The full documentation index is at `/docs/index.md`.
## Core Modules

### Schema Constants

Always use `honestroles.schema` for consistent column referencing:

```python
from honestroles import schema

# Available constants include:
# schema.TITLE, schema.DESCRIPTION_TEXT, schema.COMPANY
# schema.CITY, schema.REGION, schema.COUNTRY
# schema.SALARY_MIN, schema.SALARY_MAX, etc.
```
### Filtering with FilterChain

`FilterChain` lets you compose multiple filtering rules efficiently:

```python
import honestroles as hr
from honestroles import FilterChain, filter_jobs, schema

# Functional approach:
df = filter_jobs(df, remote_only=True, min_salary=100_000)

# Composable approach:
chain = FilterChain()
chain.add(hr.filter.by_keywords, include=["Engineer"], exclude=["Manager"])
chain.add(hr.filter.by_completeness, required_fields=[schema.DESCRIPTION_TEXT, schema.APPLY_URL])
filtered_df = chain.apply(df)
```
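The built-in predicates suggest a convention where each predicate receives the DataFrame plus keyword arguments and returns the filtered frame. Under that assumption (the predicate signature is inferred here, not taken from the honestroles docs), a custom predicate can be sketched in plain pandas:

```python
import pandas as pd


def by_title_length(df: pd.DataFrame, min_chars: int = 10) -> pd.DataFrame:
    """Hypothetical custom predicate: keep rows whose title has at least
    min_chars characters. Assumes a 'title' column; in real use, reference
    the column via schema.TITLE rather than a string literal."""
    return df[df["title"].str.len() >= min_chars]
```

If the assumed signature holds, it would be registered like a built-in predicate: `chain.add(by_title_length, min_chars=12)`.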
### Local LLM Usage (Ollama)

Ensure Ollama is running locally and the model is pulled:

```bash
ollama serve
ollama pull llama3
```

Then enable LLM-based labeling or quality rating:

```python
df = hr.label_jobs(df, use_llm=True, model="llama3")
df = hr.rate_jobs(df, use_llm=True, model="llama3")
```
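Before enabling `use_llm=True`, it can help to verify that the Ollama server is actually reachable. A stdlib-only check against Ollama's standard REST endpoint (`/api/tags` on the default port 11434); the helper name is our own, not part of honestroles:

```python
import json
import urllib.error
import urllib.request


def ollama_available(host: str = "http://localhost:11434") -> bool:
    """Return True if a local Ollama server answers its /api/tags endpoint."""
    try:
        with urllib.request.urlopen(f"{host}/api/tags", timeout=2) as resp:
            json.load(resp)  # /api/tags lists the locally pulled models
        return True
    except (urllib.error.URLError, OSError, ValueError):
        return False
```

One way to use it is as a graceful fallback, e.g. `hr.label_jobs(df, use_llm=ollama_available(), model="llama3")`, so the pipeline drops back to heuristics when the server is down.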
## Package Layout

```text
src/honestroles/
├── clean/      # HTML stripping, normalization, and dedup
├── filter/     # Composable FilterChain and predicates
├── io/         # Parquet and DuckDB I/O with validation
├── label/      # Seniority, category, and tech stack labeling
├── llm/        # Ollama client and prompt templates
├── rate/       # Completeness, quality, and composite ratings
└── schema.py   # Centralized column name constants
```
## Testing

Run the test suite with pytest:

```bash
pytest
```
## Stability

- Changelog: `/CHANGELOG.md`
- Performance guardrails: `/docs/performance.md`
- Docs index: `/docs/index.md`