
Open-source Machine Learning Library for Job Data

Project description

ℹ️ Welcome to the jobcurator library

jobcurator is an open-source Machine Learning library to clean, normalize, structure, compress, and sample large datasets & feeds of job offers.

✨ Available features:

  • Hash-based job deduplication and compression with quality and diversity preservation. jobcurator takes a list of structured job objects and:
    • Deduplicates using hashing (exact hash + SimHash + LSH)
    • Scores jobs by length & completion (and optional freshness/source)
    • Preserves variance by keeping jobs that are far apart in hash space
    • Respects a global compression ratio (e.g., keep 40% of jobs)

No dense embeddings: only hashing plus simple geometry (3D coordinates for cities).

📋 TODO

  • Publish the package to PyPI
  • Add job parsing
  • Add dynamic job tagging with taxonomy
  • Add job auto-formatting & normalization

📬 Contact

For questions, ideas, or coordination around larger changes:

Primary maintainer 📧 mouhidine.seiv@hrflow.ai


🗂️ Repository structure

jobcurator/
├─ pyproject.toml
├─ setup.py
├─ test.py
├─ logo.png
├─ README.md
└─ src/
   └─ jobcurator/
      ├─ __init__.py
      ├─ models.py
      ├─ hash_utils.py
      └─ curator.py

🚀 Installation

To install for local development:

git clone https://github.com/<your-username>/jobcurator.git
cd jobcurator
pip install -e .

To reinstall for local development:

pip uninstall -y jobcurator  # ignore error if not installed
pip install -e .

(coming soon) To install the package once published to PyPI:

pip install jobcurator

🧪 Testing code

From the repository root, run test.py:

python3 test.py                   # n_jobs=10 (capped to len(jobs)), ratio=0.5
python3 test.py --n-jobs 5        # n_jobs=5, ratio=0.5
python3 test.py --n-jobs 5 --ratio 0.3

🧩 Public API

Import

from jobcurator import JobCurator, Job, Category, SalaryField, Location3DField
from datetime import datetime

Example usage

jobs = [
    Job(
        id="job-1",
        title="Senior Backend Engineer",
        text="Full description...",
        categories={
            "job_function": [
                Category(
                    id="backend",
                    label="Backend",
                    level=1,
                    parent_id="eng",
                    level_path=["Engineering", "Software", "Backend"],
                )
            ]
        },
        location=Location3DField(
            lat=48.8566,
            lon=2.3522,
            alt_m=35,
            city="Paris",
            country_code="FR",
        ),
        salary=SalaryField(
            min_value=60000,
            max_value=80000,
            currency="EUR",
            period="year",
        ),
        company="HrFlow.ai",
        contract_type="Full-time",
        source="direct",
        created_at=datetime.utcnow(),
    ),
]

curator = JobCurator(
    ratio=0.4,                 # keep 40% of jobs
    alpha=0.6,                 # quality vs diversity tradeoff
    max_per_cluster_in_pool=3, # max jobs per cluster entering the global pool
)

compressed_jobs = curator.dedupe_and_compress(jobs)
print(len(jobs), "→", len(compressed_jobs))

JobCurator parameters

JobCurator(
    ratio: float = 1.0,              # default compression ratio
    alpha: float = 0.6,              # quality vs diversity weight
    max_per_cluster_in_pool: int = 3,
    d_sim_threshold: int = 20,       # SimHash Hamming threshold for clustering
    max_cluster_distance_km: float = 150.0,  # max distance between cities in same cluster
)
  • ratio = 1.0 → keep all jobs
  • ratio = 0.5 → keep ~50% of jobs (highest quality + diversity)
  • alpha closer to 1 → prioritize quality; closer to 0 → prioritize diversity

🧱 Core Concepts

Job schema

A Job is a structured object with:

  • id: unique identifier
  • title: job title (string)
  • text: full job description (string)
  • categories: hierarchical taxonomy per dimension (dict[str, list[Category]])
  • location: Location3DField with lat/lon/alt (internally converted to 3D x,y,z)
  • salary: optional SalaryField
  • Optional: company, contract_type, source, created_at
  • Internal fields: length_score, completion_score_val, quality, exact_hash, signature (computed by JobCurator)

Category schema

A Category is a hierarchical node:

  • id: unique taxonomy ID
  • label: human-readable label
  • level: depth in hierarchy (0 = root)
  • parent_id: optional parent category id
  • level_path: full path from root (e.g. ["Engineering", "Software", "Backend"])

Multiple dimensions (e.g. job_function, industry, seniority) can coexist in categories.
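As a minimal sketch of how multiple dimensions can coexist in categories (using a hypothetical stand-in dataclass rather than the library's own Category class):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Category:  # stand-in mirroring the fields listed above
    id: str
    label: str
    level: int
    parent_id: Optional[str] = None
    level_path: Optional[list] = None

# Two independent taxonomy dimensions on the same job.
categories = {
    "job_function": [Category("backend", "Backend", 2, "software",
                              ["Engineering", "Software", "Backend"])],
    "seniority": [Category("senior", "Senior", 0, None, ["Senior"])],
}
print(sorted(categories))  # ['job_function', 'seniority']
```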

Location schema with 3D coordinates

Location3DField:

  • lat, lon: in degrees
  • alt_m: altitude in meters
  • city, country_code: metadata
  • x, y, z: computed Earth-centered coordinates for 3D distance (used to avoid merging jobs from very distant cities)
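A rough sketch of how such a conversion could work, assuming a spherical Earth of radius 6371 km (the library's actual math may differ); the function names here are illustrative:

```python
import math

def latlon_to_ecef(lat_deg: float, lon_deg: float, alt_m: float = 0.0):
    """Convert lat/lon/altitude to Earth-centered (x, y, z) in km,
    using a spherical Earth approximation."""
    r = 6371.0 + alt_m / 1000.0
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (r * math.cos(lat) * math.cos(lon),
            r * math.cos(lat) * math.sin(lon),
            r * math.sin(lat))

def geo_distance_km(a, b) -> float:
    """Straight-line (chord) distance between two (x, y, z) points in km."""
    return math.dist(a, b)

paris = latlon_to_ecef(48.8566, 2.3522, 35)
london = latlon_to_ecef(51.5074, -0.1278, 11)
d = geo_distance_km(paris, london)  # roughly 340 km for Paris–London
```

With a 3D chord distance, two jobs in distant cities exceed max_cluster_distance_km and are never merged into the same cluster.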

⚙️ How It Works (High Level)

  1. Preprocessing & scoring

    • Compute token length → normalize to length_score ∈ [0,1] (using p10/p90 percentiles).

    • Compute completion_score based on presence of key fields (title, text, location, salary, categories, company, contract_type).

    • Optional freshness_score and source_quality.

    • Combine into:

      quality(j) = 0.3 * length_score
                 + 0.4 * completion_score
                 + 0.2 * freshness_score
                 + 0.1 * source_quality
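The weighted combination above can be written directly; a minimal sketch (the library's actual scoring may differ in detail):

```python
def quality(length_score: float, completion_score: float,
            freshness_score: float = 0.0, source_quality: float = 0.0) -> float:
    """Weighted combination of per-job scores, each assumed in [0, 1]."""
    return (0.3 * length_score
            + 0.4 * completion_score
            + 0.2 * freshness_score
            + 0.1 * source_quality)

print(round(quality(0.8, 1.0, 0.5, 1.0), 2))  # 0.84
```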
      
  2. Exact hash

    • Build a canonical string from title + categories + coarse location + salary bucket + text.
    • Use blake2b to get a 64-bit exact_hash.
    • Remove strict duplicates.
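A minimal sketch of a 64-bit blake2b exact hash (the canonical-string format shown is hypothetical):

```python
import hashlib

def exact_hash(canonical: str) -> int:
    """64-bit exact hash of a canonical job string via blake2b."""
    digest = hashlib.blake2b(canonical.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

a = exact_hash("senior backend engineer|backend|paris|60-80k|full description...")
b = exact_hash("senior backend engineer|backend|paris|60-80k|full description...")
print(a == b)  # True: identical canonical strings always collide, so strict
               # duplicates can be dropped with a set of seen hashes
```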
  3. Composite signature (no embeddings)

    • 64-bit SimHash on title + text.
    • 64-bit feature-hash on categories, location, salary.
    • Concatenate into a 128-bit signature = (simhash << 64) | meta_bits.
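The signature layout in step 3 can be sketched as follows (simhash64 here is an illustrative token-level SimHash, not the library's implementation):

```python
import hashlib

def simhash64(text: str) -> int:
    """64-bit SimHash over whitespace tokens."""
    weights = [0] * 64
    for token in text.lower().split():
        h = int.from_bytes(
            hashlib.blake2b(token.encode(), digest_size=8).digest(), "big")
        for bit in range(64):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(64) if weights[bit] > 0)

def composite_signature(simhash: int, meta_bits: int) -> int:
    """128-bit signature: SimHash in the high 64 bits, metadata hash low."""
    return (simhash << 64) | (meta_bits & ((1 << 64) - 1))

s = simhash64("senior backend engineer")
sig = composite_signature(s, 0xDEADBEEF)
print(sig >> 64 == s)  # True: the SimHash part is recoverable for LSH
```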
  4. LSH clustering

    • Use LSH on the SimHash part to find candidate near-duplicates.

    • Accept a pair as same cluster if:

      • Hamming distance on SimHash ≤ threshold
      • 3D geo distance between locations ≤ max_cluster_distance_km
    • Group jobs into clusters via union–find.
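The union–find grouping can be sketched like this (illustrative, not the library's code): every accepted near-duplicate pair links two job indices, and connected components become clusters.

```python
def union_find_clusters(n: int, pairs: list) -> list:
    """Group n job indices into clusters from accepted near-duplicate pairs."""
    parent = list(range(n))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a, b in pairs:
        parent[find(a)] = find(b)  # union the two components

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

# Jobs 0-1 and 1-2 are near-duplicates; job 3 stands alone.
print(union_find_clusters(4, [(0, 1), (1, 2)]))  # [[0, 1, 2], [3]]
```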

  5. Intra-cluster ranking

    • Within each cluster, sort jobs by quality descending.
  6. Global compression with diversity

    • Build a pool with the top N jobs per cluster.

    • Greedy selection:

      • Start from the highest-quality job.

      • Iteratively pick the job maximizing:

        diversified_score = alpha * quality + (1 - alpha) * normalized_min_hamming_distance_to_selected
        
    • Stop when you’ve selected ceil(ratio * N_original) jobs.

Result: you keep fewer, higher-quality, and more diverse jobs.
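The greedy selection loop in step 6 can be sketched as follows (pool entries, constants, and function names are illustrative assumptions, not the library's API):

```python
import math

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

def greedy_select(pool: list, ratio: float, alpha: float,
                  n_original: int) -> list:
    """Pick ceil(ratio * n_original) pool entries, trading quality for
    diversity. Each pool entry is a (signature, quality) tuple."""
    target = min(math.ceil(ratio * n_original), len(pool))
    if target == 0:
        return []
    remaining = sorted(range(len(pool)), key=lambda i: pool[i][1], reverse=True)
    selected = [remaining.pop(0)]  # seed with the highest-quality job
    while len(selected) < target:
        def score(i: int) -> float:
            # min Hamming distance to anything already selected,
            # normalized by the 128-bit signature width
            d = min(hamming(pool[i][0], pool[j][0]) for j in selected)
            return alpha * pool[i][1] + (1 - alpha) * d / 128
        best = max(remaining, key=score)
        remaining.remove(best)
        selected.append(best)
    return selected

pool = [(0b1111, 0.9), (0b1110, 0.8), (0b0000, 0.5)]
print(greedy_select(pool, ratio=2/3, alpha=0.6, n_original=3))  # [0, 1]
```

With alpha=0.6 the quality term dominates here, so job 1 is picked over the more distant but lower-quality job 2; a smaller alpha would flip that choice.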


🤝 Contributing

First off, thank you for taking the time to contribute! 🎉 This project aims to provide a robust, hash-based job deduplication & compression engine, and your help is highly appreciated.

🧭 Getting Started

  1. Fork the repository on GitHub.

  2. Clone your fork locally:

    git clone https://github.com/<your-username>/jobcurator.git
    cd jobcurator
    
  3. Install in editable / dev mode:

    pip install -e .
    
  4. Create a feature branch:

    git checkout -b feat/my-feature
    

🐛 Reporting Bugs

Please use GitHub Issues and include:

  • jobcurator version
  • Python version
  • OS
  • Minimal reproducible example (code + data schema, no sensitive data)
  • Expected vs actual behavior

For security-related or sensitive issues, you can also contact the maintainer directly:

📧 mouhidine.seiv@hrflow.ai


🌱 Suggesting Features

When opening a feature request:

  • Clearly describe the problem you want to solve.

  • Explain how it fits into jobcurator’s scope:

    • hash-based dedupe
    • compression ratio
    • quality scoring
    • diversity / variance preservation
  • Optionally include:

    • Proposed API shape (function/class signature)
    • Example usage snippet
    • Notes on performance / complexity if relevant

🧪 Tests & Quality

Before submitting a PR:

  1. Add or update tests (e.g. under tests/):

    • Edge cases: empty input, single job, all duplicates, all unique.
    • Typical cases: mixed locations, mixed sources, various compression ratios.
  2. Run the test suite:

    pytest
    
  3. Ensure all tests pass.

If your change touches deduplication, scoring, clustering, or selection logic, please add specific tests to cover the change and avoid regressions.


🧹 Code Style & Guidelines

  • Target Python 3.9+.

  • Use type hints for functions, methods, and dataclasses.

  • Keep modules focused:

    • models.py → schema & dataclasses
    • hash_utils.py → hashing, signatures, clustering, quality scores
    • curator.py → JobCurator orchestration / public API
  • Prefer:

    • black for formatting
    • ruff or flake8 for linting

Naming conventions:

  • Classes: PascalCase (JobCurator, Location3DField)
  • Functions: snake_case (build_exact_hash, geo_distance_km)
  • Constants: UPPER_SNAKE_CASE

Avoid introducing heavy dependencies—this library is intentionally lightweight and focused on hashing + simple math.


📦 Public API & Backward Compatibility

The main public API consists of:

  • jobcurator.JobCurator
  • jobcurator.Job
  • jobcurator.Category
  • jobcurator.SalaryField
  • jobcurator.Location3DField

When changing their behavior or signatures:

  • Consider backward compatibility.

  • Document changes in:

    • PR description
    • README.md (if user-visible behavior changes)
  • For breaking changes, propose a clear migration path and rationale.


📥 Pull Requests

  1. Make sure your branch is up to date with main:

    git fetch origin
    git rebase origin/main
    
  2. Push your branch to your fork:

    git push origin feat/my-feature
    
  3. Open a PR and include:

    • A clear title (e.g. Add salary band weighting to quality scoring)
    • Description of what changed and why
    • Any performance considerations
    • Tests added or updated

PRs that are small, focused, and well-tested are more likely to be reviewed and merged quickly.


Project details


Download files

Download the file for your platform.

Source Distribution

jobcurator-0.1.0.tar.gz (16.1 kB)

Uploaded Source

Built Distribution


jobcurator-0.1.0-py3-none-any.whl (12.8 kB)

Uploaded Python 3

File details

Details for the file jobcurator-0.1.0.tar.gz.

File metadata

  • Download URL: jobcurator-0.1.0.tar.gz
  • Upload date:
  • Size: 16.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for jobcurator-0.1.0.tar.gz:

  • SHA256: 1eb37b103010f83f55cc39c19a0a8c789312155b98450f31096b98c4fea22e05
  • MD5: b557db189e66c16653edb556232c53b8
  • BLAKE2b-256: 23fe5fd6503191fb994a79422c354c9e88cc9ee2d8f3af2c7624d53e54c31015


Provenance

The following attestation bundles were made for jobcurator-0.1.0.tar.gz:

Publisher: python-publish.yml on Riminder/jobcurator

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file jobcurator-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: jobcurator-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 12.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for jobcurator-0.1.0-py3-none-any.whl:

  • SHA256: 94a27a660ade2868f80e3274519eee10a3ff8303e88d3dc9c47413d8fae1647e
  • MD5: a955e8809cb791606f3c3973d26622f6
  • BLAKE2b-256: d517a01f43238be9e1e51515a5d07ed09a2f958606a6129b1a4773b8d802e2f7


Provenance

The following attestation bundles were made for jobcurator-0.1.0-py3-none-any.whl:

Publisher: python-publish.yml on Riminder/jobcurator

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
