Open-source Machine Learning Library for Job Data
Project description
ℹ️ Welcome to the jobcurator library
jobcurator is an open-source Machine Learning library to clean, normalize, structure, compress, and sample large datasets & feeds of job offers.
✨ Available features:
- Hash-based job deduplication and compression with quality and diversity preservation.
jobcurator takes a list of structured job objects and:
- Deduplicates using hashing (exact hash + SimHash + LSH)
- Scores jobs by length & completion (and optional freshness/source)
- Preserves variance by keeping jobs that are far apart in hash space
- Respects a global compression ratio (e.g., keep 40% of jobs)
No dense embeddings: purely hashing plus simple geometry (3D coordinates for cities).
📋 TODO
- Publish package to PyPI
- Add job parsing
- Add dynamic job tagging with a taxonomy
- Add job auto-formatting & normalization
📬 Contact
For questions, ideas, or coordination around larger changes:
Primary maintainer 📧 mouhidine.seiv@hrflow.ai
🗂️ Repository structure
jobcurator/
├─ pyproject.toml
├─ setup.py
├─ test.py
├─ logo.png
├─ README.md
└─ src/
└─ jobcurator/
├─ __init__.py
├─ models.py
├─ hash_utils.py
└─ curator.py
🚀 Installation
To install for local development:
git clone https://github.com/<your-username>/jobcurator.git
cd jobcurator
pip install -e .
To reinstall for local development:
pip uninstall -y jobcurator # ignore error if not installed
pip install -e .
(coming soon) To install the package once published to PyPI:
pip install jobcurator
🧪 Testing code
From the repository root, run test.py:
python3 test.py # n_jobs=10 (capped to len(jobs)), ratio=0.5
python3 test.py --n-jobs 5 # n_jobs=5, ratio=0.5
python3 test.py --n-jobs 5 --ratio 0.3
🧩 Public API
Import
from jobcurator import JobCurator, Job, Category, SalaryField, Location3DField
from datetime import datetime
Example usage
jobs = [
Job(
id="job-1",
title="Senior Backend Engineer",
text="Full description...",
categories={
"job_function": [
Category(
id="backend",
label="Backend",
level=1,
parent_id="eng",
level_path=["Engineering", "Software", "Backend"],
)
]
},
location=Location3DField(
lat=48.8566,
lon=2.3522,
alt_m=35,
city="Paris",
country_code="FR",
),
salary=SalaryField(
min_value=60000,
max_value=80000,
currency="EUR",
period="year",
),
company="HrFlow.ai",
contract_type="Full-time",
source="direct",
created_at=datetime.utcnow(),
),
]
curator = JobCurator(
ratio=0.4, # keep 40% of jobs
alpha=0.6, # quality vs diversity tradeoff
max_per_cluster_in_pool=3, # max jobs per cluster entering the global pool
)
compressed_jobs = curator.dedupe_and_compress(jobs)
print(len(jobs), "→", len(compressed_jobs))
JobCurator parameters
JobCurator(
ratio: float = 1.0, # default compression ratio
alpha: float = 0.6, # quality vs diversity weight
max_per_cluster_in_pool: int = 3,
d_sim_threshold: int = 20, # SimHash Hamming threshold for clustering
max_cluster_distance_km: float = 150.0, # max distance between cities in same cluster
)
- ratio = 1.0 → keep all jobs
- ratio = 0.5 → keep ~50% of jobs (highest quality + diversity)
- alpha closer to 1 → prioritize quality; closer to 0 → prioritize diversity
🧱 Core Concepts
Job schema
A Job is a structured object with:
- id: unique identifier
- title: job title (string)
- text: full job description (string)
- categories: hierarchical taxonomy per dimension (dict[str, list[Category]])
- location: Location3DField with lat/lon/alt (internally converted to 3D x, y, z)
- salary: optional SalaryField
- Optional: company, contract_type, source, created_at
- Internal fields: length_score, completion_score_val, quality, exact_hash, signature (computed by JobCurator)
Category schema
A Category is a hierarchical node:
- id: unique taxonomy ID
- label: human-readable label
- level: depth in hierarchy (0 = root)
- parent_id: optional parent category ID
- level_path: full path from root (e.g. ["Engineering", "Software", "Backend"])
Multiple dimensions (e.g. job_function, industry, seniority) can coexist in categories.
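To make the multi-dimension idea concrete, here is a minimal sketch using a stand-in dataclass that mirrors the documented Category fields (the real class ships with the library; only the field names are taken from the schema above):

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Stand-in mirroring the documented Category fields; illustrative only.
@dataclass
class Category:
    id: str
    label: str
    level: int
    parent_id: Optional[str] = None
    level_path: List[str] = field(default_factory=list)

# Multiple taxonomy dimensions coexist under different keys:
categories = {
    "job_function": [
        Category(id="backend", label="Backend", level=1, parent_id="eng",
                 level_path=["Engineering", "Software", "Backend"]),
    ],
    "seniority": [
        Category(id="senior", label="Senior", level=0, level_path=["Senior"]),
    ],
}

print(sorted(categories))  # → ['job_function', 'seniority']
```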
Location schema with 3D coordinates
Location3DField:
- lat, lon: in degrees
- alt_m: altitude in meters
- city, country_code: metadata
- x, y, z: computed Earth-centered coordinates for 3D distance (used to avoid merging jobs from very distant cities)
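Lat/lon/alt are converted internally to Earth-centered x, y, z. A minimal sketch of that conversion, assuming a simple spherical Earth model (the library's exact formula, constants, and function names may differ):

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius; the library's model may differ

def to_ecef(lat: float, lon: float, alt_m: float = 0.0) -> tuple:
    """Spherical lat/lon/alt → Earth-centered (x, y, z) in meters (sketch)."""
    r = EARTH_RADIUS_M + alt_m
    lat_r, lon_r = math.radians(lat), math.radians(lon)
    x = r * math.cos(lat_r) * math.cos(lon_r)
    y = r * math.cos(lat_r) * math.sin(lon_r)
    z = r * math.sin(lat_r)
    return (x, y, z)

def geo_distance_km(a: tuple, b: tuple) -> float:
    """Straight-line (chord) distance between two ECEF points, in km."""
    return math.dist(a, b) / 1000.0

paris = to_ecef(48.8566, 2.3522, 35)
london = to_ecef(51.5074, -0.1278, 11)
print(f"{geo_distance_km(paris, london):.0f} km")  # roughly 340–350 km
```

At city-to-city scales the chord distance is nearly identical to the great-circle distance, which is why plain 3D geometry suffices for the cluster-distance check.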
⚙️ How It Works (High Level)
1. Preprocessing & scoring
   - Compute token length → normalize to length_score ∈ [0,1] (using p10/p90 percentiles).
   - Compute completion_score based on the presence of key fields (title, text, location, salary, categories, company, contract_type).
   - Optionally compute freshness_score and source_quality.
   - Combine into:
     quality(j) = 0.3 * length_score + 0.4 * completion_score + 0.2 * freshness_score + 0.1 * source_quality
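The weighted combination can be sketched directly (the function name and default arguments are illustrative):

```python
def quality(length_score: float, completion_score: float,
            freshness_score: float = 0.0, source_quality: float = 0.0) -> float:
    """Weighted quality score with the documented weights (inputs in [0, 1])."""
    return (0.3 * length_score
            + 0.4 * completion_score
            + 0.2 * freshness_score
            + 0.1 * source_quality)

# A long, fully-filled, fresh job from a good source scores 1.0:
print(round(quality(1.0, 1.0, 1.0, 1.0), 6))  # → 1.0
```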
2. Exact hash
   - Build a canonical string from title + categories + coarse location + salary bucket + text.
   - Use blake2b to get a 64-bit exact_hash.
   - Remove strict duplicates.
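A sketch of the 64-bit blake2b step using Python's hashlib (the canonical-string layout below is illustrative, not the library's exact format):

```python
import hashlib

def exact_hash_64(canonical: str) -> int:
    """64-bit exact hash via blake2b with an 8-byte digest (sketch)."""
    digest = hashlib.blake2b(canonical.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

# Identical canonical strings hash identically; different ones diverge.
a = exact_hash_64("senior backend engineer|backend|paris-fr|60-80k|full description...")
b = exact_hash_64("senior backend engineer|backend|paris-fr|60-80k|full description...")
c = exact_hash_64("junior data analyst|data|lyon-fr|30-40k|other description...")
print(a == b, a == c)  # → True False
```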
3. Composite signature (no embeddings)
   - 64-bit SimHash on title + text.
   - 64-bit feature hash on categories, location, and salary.
   - Concatenate into a 128-bit signature = (simhash << 64) | meta_bits.
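A minimal SimHash sketch over whitespace tokens (the library's tokenization and token weighting may differ):

```python
import hashlib

def simhash64(text: str) -> int:
    """64-bit SimHash: per-bit majority vote over 64-bit token hashes (sketch)."""
    weights = [0] * 64
    for token in text.lower().split():
        h = int.from_bytes(
            hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest(), "big")
        for bit in range(64):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(64) if weights[bit] > 0)

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

near = hamming(simhash64("senior backend engineer in paris"),
               simhash64("senior backend engineer in paris france"))
far = hamming(simhash64("senior backend engineer in paris"),
              simhash64("part-time barista wanted in lyon"))
print(near, far)  # the near pair is typically much closer in Hamming space
```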
4. LSH clustering
   - Use LSH on the SimHash part to find candidate near-duplicates.
   - Accept a pair as belonging to the same cluster if:
     - the Hamming distance on SimHash is ≤ the threshold, and
     - the 3D geo distance between locations is ≤ max_cluster_distance_km.
   - Group jobs into clusters via union–find.
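The union–find grouping can be sketched as follows (the library's internal data structures may differ):

```python
# Minimal union-find used to merge accepted candidate pairs into clusters.
class UnionFind:
    def __init__(self, n: int):
        self.parent = list(range(n))

    def find(self, i: int) -> int:
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]  # path halving
            i = self.parent[i]
        return i

    def union(self, a: int, b: int) -> None:
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

# Pairs (0, 1) and (1, 2) were accepted as near-duplicates; job 3 stands alone.
uf = UnionFind(4)
uf.union(0, 1)
uf.union(1, 2)

clusters = {}
for job_idx in range(4):
    clusters.setdefault(uf.find(job_idx), []).append(job_idx)
print(sorted(clusters.values()))  # → [[0, 1, 2], [3]]
```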
5. Intra-cluster ranking
   - Within each cluster, sort jobs by quality, descending.
6. Global compression with diversity
   - Build a pool with the top N jobs per cluster.
   - Greedy selection:
     - Start from the highest-quality job.
     - Iteratively pick the job maximizing:
       diversified_score = alpha * quality + (1 - alpha) * normalized_min_hamming_distance_to_selected
     - Stop once ceil(ratio * N_original) jobs have been selected.
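The greedy loop can be sketched as follows, with each pooled job reduced to a (quality, signature) pair and Hamming distance normalized by the 128-bit signature width (all names are illustrative):

```python
import math

def select(pool, ratio: float, alpha: float, n_original: int):
    """Greedy diversified selection over (quality, signature) pairs (sketch)."""
    target = math.ceil(ratio * n_original)
    remaining = sorted(pool, key=lambda j: j[0], reverse=True)
    selected = [remaining.pop(0)]  # start from the highest-quality job
    while remaining and len(selected) < target:
        def score(job):
            # normalized min Hamming distance to anything already selected
            min_dist = min(bin(job[1] ^ s[1]).count("1") for s in selected)
            return alpha * job[0] + (1 - alpha) * (min_dist / 128)
        best = max(remaining, key=score)
        remaining.remove(best)
        selected.append(best)
    return selected

# Three pooled jobs; keep ceil(0.5 * 4) = 2 of an original 4.
pool = [(0.9, 0b1111), (0.85, 0b1110), (0.5, 0b1111 << 100)]
picked = select(pool, ratio=0.5, alpha=0.6, n_original=4)
print([q for q, _ in picked])  # → [0.9, 0.85]
```

With alpha = 0.6 quality dominates, so the second-best job wins even though the distant low-quality job is more diverse; lowering alpha shifts the pick toward diversity.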
Result: you keep fewer, higher-quality, and more diverse jobs.
🤝 Contributing
First off, thank you for taking the time to contribute! 🎉 This project aims to provide a robust, hash-based job deduplication & compression engine, and your help is highly appreciated.
🧭 Getting Started
1. Fork the repository on GitHub.
2. Clone your fork locally:
   git clone https://github.com/<your-username>/jobcurator.git
   cd jobcurator
3. Install in editable / dev mode:
   pip install -e .
4. Create a feature branch:
   git checkout -b feat/my-feature
🐛 Reporting Bugs
Please use GitHub Issues and include:
- jobcurator version
- Python version
- OS
- Minimal reproducible example (code + data schema, no sensitive data)
- Expected vs actual behavior
For security-related or sensitive issues, you can also contact the maintainer directly at mouhidine.seiv@hrflow.ai.
🌱 Suggesting Features
When opening a feature request:
1. Clearly describe the problem you want to solve.
2. Explain how it fits into jobcurator’s scope:
   - hash-based dedupe
   - compression ratio
   - quality scoring
   - diversity / variance preservation
3. Optionally include:
   - a proposed API shape (function/class signature)
   - an example usage snippet
   - notes on performance / complexity, if relevant
🧪 Tests & Quality
Before submitting a PR:
1. Add or update tests (e.g. under tests/):
   - Edge cases: empty input, single job, all duplicates, all unique.
   - Typical cases: mixed locations, mixed sources, various compression ratios.
2. Run the test suite:
   pytest
3. Ensure all tests pass.
If your change touches deduplication, scoring, clustering, or selection logic, please add specific tests to cover the change and avoid regressions.
🧹 Code Style & Guidelines
- Target Python 3.9+.
- Use type hints for functions, methods, and dataclasses.
- Keep modules focused:
  - models.py → schema & dataclasses
  - hash_utils.py → hashing, signatures, clustering, quality scores
  - curator.py → JobCurator orchestration / public API
- Prefer:
  - black for formatting
  - ruff or flake8 for linting
- Naming conventions:
  - Classes: PascalCase (JobCurator, Location3DField)
  - Functions: snake_case (build_exact_hash, geo_distance_km)
  - Constants: UPPER_SNAKE_CASE
Avoid introducing heavy dependencies—this library is intentionally lightweight and focused on hashing + simple math.
📦 Public API & Backward Compatibility
The main public API consists of:
- jobcurator.JobCurator
- jobcurator.Job
- jobcurator.Category
- jobcurator.SalaryField
- jobcurator.Location3DField
When changing their behavior or signatures:
1. Consider backward compatibility.
2. Document changes in:
   - the PR description
   - README.md (if user-visible behavior changes)
3. For breaking changes, propose a clear migration path and rationale.
📥 Pull Requests
1. Make sure your branch is up to date with main:
   git fetch origin
   git rebase origin/main
2. Push your branch to your fork:
   git push origin feat/my-feature
3. Open a PR that includes:
   - a clear title (e.g. "Add salary band weighting to quality scoring")
   - a description of what changed and why
   - any performance considerations
   - tests added or updated
PRs that are small, focused, and well-tested are more likely to be reviewed and merged quickly.
Download files
File details
Details for the file jobcurator-0.1.0.tar.gz.
File metadata
- Download URL: jobcurator-0.1.0.tar.gz
- Upload date:
- Size: 16.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1eb37b103010f83f55cc39c19a0a8c789312155b98450f31096b98c4fea22e05 |
| MD5 | b557db189e66c16653edb556232c53b8 |
| BLAKE2b-256 | 23fe5fd6503191fb994a79422c354c9e88cc9ee2d8f3af2c7624d53e54c31015 |
Provenance
The following attestation bundles were made for jobcurator-0.1.0.tar.gz:
Publisher: python-publish.yml on Riminder/jobcurator
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: jobcurator-0.1.0.tar.gz
- Subject digest: 1eb37b103010f83f55cc39c19a0a8c789312155b98450f31096b98c4fea22e05
- Sigstore transparency entry: 685900182
- Sigstore integration time:
- Permalink: Riminder/jobcurator@4e24e5eddc20dbc2f75da3db0f72dc412550fd64
- Branch / Tag: refs/tags/jobcurator-v0.1.0
- Owner: https://github.com/Riminder
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@4e24e5eddc20dbc2f75da3db0f72dc412550fd64
- Trigger Event: release
File details
Details for the file jobcurator-0.1.0-py3-none-any.whl.
File metadata
- Download URL: jobcurator-0.1.0-py3-none-any.whl
- Upload date:
- Size: 12.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 94a27a660ade2868f80e3274519eee10a3ff8303e88d3dc9c47413d8fae1647e |
| MD5 | a955e8809cb791606f3c3973d26622f6 |
| BLAKE2b-256 | d517a01f43238be9e1e51515a5d07ed09a2f958606a6129b1a4773b8d802e2f7 |
Provenance
The following attestation bundles were made for jobcurator-0.1.0-py3-none-any.whl:
Publisher: python-publish.yml on Riminder/jobcurator
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: jobcurator-0.1.0-py3-none-any.whl
- Subject digest: 94a27a660ade2868f80e3274519eee10a3ff8303e88d3dc9c47413d8fae1647e
- Sigstore transparency entry: 685900183
- Sigstore integration time:
- Permalink: Riminder/jobcurator@4e24e5eddc20dbc2f75da3db0f72dc412550fd64
- Branch / Tag: refs/tags/jobcurator-v0.1.0
- Owner: https://github.com/Riminder
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@4e24e5eddc20dbc2f75da3db0f72dc412550fd64
- Trigger Event: release