A modular, hybrid, and customizable document similarity framework.

SimilarityTool

SimilarityTool is a high-performance Information Retrieval (IR) and re-ranking pipeline designed for accurate and efficient matching across large-scale, long-text corpora (e.g., curricula, job descriptions, CVs, and project portfolios). It follows an SSS approach, combining semantic, syntactic, and structured features to match documents based on core meaning, regardless of domain.

The framework implements a highly optimized Waterfall Architecture:

Abstractive Ingestion Pass: A local small language model processes long text chunks concurrently to strip fluff and isolate core meaning.
Semantic Encoding: Blends multilingual, structural, and domain-focused transformers into a highly descriptive, high-dimensional embedding.
Syntactic Encoding: Complements the semantic encoding with n-gram and keyword encoding, taking a more syntactic approach.
Structured Encoding: Incorporates domain- and use-case-specific structured features, adding a more structural perspective to document matching.
Stage-1 Recall: Lightning-fast retrieval of candidates using a vectorized FAISS index.
Stage-2 Re-ranking: Evaluates retrieved candidates via multi-channel linear fusion containing point-to-point token syntactic analysis, attribute-level Tversky set overlaps, and deep token-interaction cross-encoding.
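
The recall and re-ranking stages can be sketched in plain Python. This is a minimal illustration under stated assumptions, not SimilarityTool's implementation: brute-force cosine similarity stands in for the FAISS index, and `rerank_score`, `two_stage_search`, and the toy `overlap` field are hypothetical names used only here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def two_stage_search(query_vec, corpus, rerank_score, limit=50, top_k=3):
    # Stage 1 (recall): keep the `limit` documents closest to the query vector.
    recalled = sorted(
        corpus, key=lambda doc: cosine(query_vec, doc["vector"]), reverse=True
    )[:limit]
    # Stage 2 (re-ranking): re-score only the recalled candidates.
    reranked = sorted(recalled, key=rerank_score, reverse=True)
    return [doc["id"] for doc in reranked[:top_k]]

# Toy corpus: "overlap" is a stand-in for the more expensive Stage-2 score.
corpus = [
    {"id": "a", "vector": [1.0, 0.0], "overlap": 0.2},
    {"id": "b", "vector": [0.9, 0.1], "overlap": 0.9},
    {"id": "c", "vector": [0.0, 1.0], "overlap": 1.0},
]
top = two_stage_search(
    [1.0, 0.0], corpus, rerank_score=lambda doc: doc["overlap"], limit=2, top_k=2
)
print(top)  # ['b', 'a']
```

Document "c" has the best Stage-2 score but never reaches re-ranking because it falls outside the Stage-1 cut. This is the trade-off behind a small recall limit: faster, but a narrower search.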

Configuration Setup

The framework is configured through two YAML files. Set your parameters in the configuration files inside your project directory:

1. Main Pipeline Configuration (`configs/main_config.yaml`)

semantic_engine:
  models:
    - name: "sentence-transformers/all-mpnet-base-v2"
      weight: 1.0
      device: "cuda"
    - name: "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
      weight: 0.6
      device: "cuda"
    - name: "shawhin/distilroberta-ai-job-embeddings"
      weight: 1.5
      device: "cuda"

storage:
  db_path: "data/corpus.db"
  index_path: "data/corpus.index"
  vector_dimension: 1920  # Matched perfectly to Concatenated Model Vectors (768 + 384 + 768)

orchestrator:
  strategy: "concatenate"
  weights:
    semantic: 0.5   # Cross-Encoder definitive strength
    syntactic: 0.2  # Min-Max Normalized pool token matching
    structured: 0.3 # Tversky criteria matching
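
To see where `vector_dimension: 1920` comes from, the sketch below concatenates three per-model vectors of sizes 768, 384, and 768. It assumes each sub-vector is L2-normalized and then scaled by its configured weight; the README does not confirm how the weights are applied, and the short model labels and function names are hypothetical.

```python
import math

# Output sizes taken from the config comment (768 + 384 + 768).
# The short labels are hypothetical stand-ins for the three model names.
MODEL_DIMS = {"mpnet": 768, "minilm": 384, "distilroberta": 768}
MODEL_WEIGHTS = {"mpnet": 1.0, "minilm": 0.6, "distilroberta": 1.5}

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def concatenate_weighted(embeddings, weights):
    """Assumed behavior: normalize each model's vector, scale it, then concatenate."""
    fused = []
    for name, vec in embeddings.items():
        fused.extend(weights[name] * x for x in l2_normalize(vec))
    return fused

# Dummy embeddings with the right shapes; real vectors would come from the models.
embeddings = {name: [1.0] * dim for name, dim in MODEL_DIMS.items()}
fused = concatenate_weighted(embeddings, MODEL_WEIGHTS)
print(len(fused))  # 1920
```

If a model is added or removed, the sum of the model dimensions changes, so `vector_dimension` has to be updated to match.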

2. Domain-specific Schema Rules (`configs/schema_config.yaml`)

text_fields:
  - name: "text"
    semantic_weight: 0.7
    syntactic_weight: 0.3
  - name: "title"
    semantic_weight: 1.0
    syntactic_weight: 0.0

structured_collections:
  - name: "tasks"
    alpha: 1.0
    beta: 1.0
    weight: 0.5
  - name: "skills"
    alpha: 0.2
    beta: 2.5   # Heavy penalty for candidates missing requested skills
    weight: 0.3
  - name: "ai"
    alpha: 1.0
    beta: 1.0
    weight: 0.2
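
The `alpha` and `beta` values are Tversky index parameters. The sketch below shows the standard Tversky formula and assumes that `alpha` weights items the candidate has beyond the query while `beta` weights requested items the candidate lacks, which matches the comment on `skills`. The function name and argument order are hypothetical.

```python
def tversky(candidate, query, alpha, beta):
    """Tversky index: |C & Q| / (|C & Q| + alpha * |C - Q| + beta * |Q - C|)."""
    candidate, query = set(candidate), set(query)
    common = len(candidate & query)
    extra = len(candidate - query)    # candidate items the query did not ask for
    missing = len(query - candidate)  # requested items the candidate lacks
    denom = common + alpha * extra + beta * missing
    return common / denom if denom else 0.0

requested = ["Python", "PyTorch", "CUDA"]

# Covers every requested skill plus one extra: only the light alpha penalty applies.
full_match = tversky(["Python", "PyTorch", "CUDA", "Docker"], requested, alpha=0.2, beta=2.5)

# Misses two requested skills: the heavy beta penalty dominates.
partial = tversky(["Python", "Docker"], requested, alpha=0.2, beta=2.5)

# With alpha = beta = 1.0 (as for "tasks"), the index reduces to Jaccard similarity.
jaccard_like = tversky(["modeling"], ["modeling", "devops"], alpha=1.0, beta=1.0)

print(full_match, round(partial, 4), jaccard_like)  # 0.9375 0.1613 0.5
```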

Pipeline Usage Guide

Batch Ingestion

Ingest large datasets from a pandas DataFrame.

import pandas as pd
from similarity_tool import SimilarityTool
from similarity_tool.utils import DataMapper

# 1. Initialize the tool
tool = SimilarityTool(
    main_config="configs/main_config.yaml", 
    schema_config="configs/schema_config.yaml"
)

# 2. Ingestion example
raw_data = {
    "doc_id": ["id_843125", "id_941012"],
    "title": ["Senior Deep Learning Architect", "Full-Stack Dev"],
    "description": [
        "Massive long 3000-word corporate description containing boilerplate benefits...",
        "Looking for a web application developer specializing in React and Python..."
    ],
    "skills": ["Python,PyTorch,CUDA,Docker", "JavaScript,React,Postgres"],
    "tasks": ["architecture,deployment", "frontend,api"],
    "ai": ["LLMs", ""]  # one entry per row; the second document has no AI tags
}
df = pd.DataFrame(raw_data)

# 3. Trigger optimized transactional batch ingestion
DataMapper.batch_ingest_dataframe(
    tool=tool,
    df=df,
    # Both mappings go from DataFrame column name to schema field name
    text_columns={"description": "full_text", "title": "job_title"},
    collection_columns={"skills": "skills", "tasks": "tasks", "ai": "ai"},
    id_column="doc_id",
    delimiter=",",
    batch_size=16 
)
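
The `delimiter` and `batch_size` arguments suggest two steps: splitting delimited cells into lists and processing rows in fixed-size chunks. The sketch below illustrates both in plain Python. It is not the `DataMapper` implementation; `split_cell`, `iter_batches`, and the third row are hypothetical.

```python
def split_cell(cell, delimiter=","):
    """Split a delimited cell into a clean list, dropping empty entries."""
    return [item.strip() for item in cell.split(delimiter) if item.strip()]

def iter_batches(rows, batch_size):
    """Yield consecutive slices of at most `batch_size` rows."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

rows = [
    {"doc_id": "id_843125", "skills": "Python,PyTorch,CUDA,Docker"},
    {"doc_id": "id_941012", "skills": "JavaScript, React,Postgres"},
    {"doc_id": "id_000001", "skills": ""},  # hypothetical row with no skills
]

batches = [
    [{"id": row["doc_id"], "skills": split_cell(row["skills"])} for row in batch]
    for batch in iter_batches(rows, batch_size=2)
]
print([len(batch) for batch in batches])  # [2, 1]
```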

Query Search (1:N)

Rank the corpus against a single query document.

# Build a query document whose keys match the schema fields
query = {
    "text_fields": {
        "job_title": "AI Infrastructure Engineer",
        "full_text": "Deploying deep learning models at scale using PyTorch and tuning custom CUDA kernels."
    },
    "collections": {
        "skills": ["Python", "PyTorch", "CUDA"],
        "tasks": ["architecture", "deployment"]
    }
}

# Run the query
# limit: number of candidates retrieved from FAISS (lower is quicker, but narrows the search)
# top_k: number of re-ranked results returned
results = tool.search(query, limit=50, top_k=3)

# Display results
for rank, match in enumerate(results, 1):
    print(f"Rank {rank}: Doc ID = {match['id']} | Total Score = {match['total_score']}")
    print(f"  └─ Sem Cross: {match['breakdown']['semantic_cross']} | Syn: {match['breakdown']['syntactic']} | Str: {match['breakdown']['structured']}\n")

N:N Composite Document Search

Find documents that match the combined profile of multiple query documents simultaneously.

queries = [
    {
        "text_fields": {"job_title": "AI Architect", "full_text": "Expertise optimizing distributed CUDA clusters."},
        "collections": {"skills": ["CUDA", "C++"], "tasks": ["infrastructure"]}
    },
    {
        "text_fields": {"job_title": "ML DevOps Engineer", "full_text": "Building orchestration templates via Docker and PyTorch."},
        "collections": {"skills": ["PyTorch", "Docker"], "tasks": ["deployment"]}
    }
]

# Find the corpus documents that best fit the combined query documents
fused_results = tool.search_composite(queries, limit=50, top_k=5)

for rank, match in enumerate(fused_results, 1):
    print(f"Composite Rank {rank}: Doc ID = {match['id']} | Unified Score = {match['total_score']}")
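
The README does not say how the per-query results are combined into one unified score. One simple possibility, shown purely as an illustration, is to average each document's score across all queries and treat a missing score as 0. The function `fuse_scores`, the document IDs, and the scores are hypothetical.

```python
def fuse_scores(per_query_scores):
    """Average each document's score across all queries (missing score = 0.0)."""
    doc_ids = set().union(*(scores.keys() for scores in per_query_scores))
    n = len(per_query_scores)
    return {
        doc: sum(scores.get(doc, 0.0) for scores in per_query_scores) / n
        for doc in doc_ids
    }

# Hypothetical per-query scores for two query documents.
scores_q1 = {"d1": 0.9, "d2": 0.4}
scores_q2 = {"d1": 0.5, "d3": 0.8}

fused = fuse_scores([scores_q1, scores_q2])
ranking = sorted(fused, key=fused.get, reverse=True)
print(ranking)  # ['d1', 'd3', 'd2']
```

Averaging favors documents that score reasonably well for every query over documents that match only one of them strongly.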

1:1 Document Comparison

doc_a = {
    "text_fields": {"job_title": "Data Scientist", "full_text": "Focusing on pandas and scikit-learn models."},
    "collections": {"skills": ["Python", "Scikit-Learn"], "tasks": ["modeling"]}
}

doc_b = {
    "text_fields": {"job_title": "ML Engineer", "full_text": "Building predictive scikit-learn setups in python."},
    "collections": {"skills": ["Python", "Scikit-Learn", "Docker"], "tasks": ["modeling", "devops"]}
}

comparison = tool.compare(doc_a, doc_b)

Hyperparameter Tuning and Hot-Swapping Configuration (Advanced)

Fine-tune structural weights, Tversky penalties, and any other parameters on the fly without re-instantiating the tool.

tool.update_config('orchestrator', 'weights', {'semantic': 0.8, 'syntactic': 0.1, 'structured': 0.1})
run_a = tool.search(query, limit=50, top_k=1)

tool.update_config(
    category='schema', 
    key='structured_collections', 
    value={'alpha': 0.2, 'beta': 3.5, 'weight': 0.9}, 
    target_name='skills'
)

tool.update_config('orchestrator', 'weights', {'semantic': 0.2, 'syntactic': 0.1, 'structured': 0.7})

run_b = tool.search(query, limit=50, top_k=1)
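
The effect of swapping orchestrator weights can be illustrated with a weighted sum over the three channels. This is a sketch that assumes a plain linear fusion of per-channel scores already scaled to [0, 1]; the candidate names and scores are hypothetical, and only the two weight sets are taken from the calls above.

```python
def fuse(breakdown, weights):
    """Linear fusion: weighted sum of the per-channel scores."""
    return sum(weights[channel] * breakdown[channel] for channel in weights)

# Hypothetical per-channel scores for two candidates.
candidates = {
    "doc_semantic": {"semantic": 0.9, "syntactic": 0.5, "structured": 0.2},
    "doc_structured": {"semantic": 0.4, "syntactic": 0.5, "structured": 0.9},
}

weights_a = {"semantic": 0.8, "syntactic": 0.1, "structured": 0.1}
weights_b = {"semantic": 0.2, "syntactic": 0.1, "structured": 0.7}

def best(weights):
    """Return the candidate with the highest fused score under `weights`."""
    return max(candidates, key=lambda doc: fuse(candidates[doc], weights))

print(best(weights_a))  # doc_semantic
print(best(weights_b))  # doc_structured
```

The same candidates produce a different top result once the structured channel dominates, which is why comparing `run_a` and `run_b` is a quick way to judge a weight change.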

SLM Distillation to Domain-specific Pre-processing (Advanced)

To handle large or noisy input documents, you can "distill" them so that only the most important information remains. This can be done in two ways. The first is to enable SLM distillation:

tool = SimilarityTool(
    main_config="configs/main_config.yaml", 
    schema_config="configs/schema_config.yaml",
    use_llm_distillation=True
)

This automatically loads a small language model, which is then applied during the search process.

Alternatively, use the DSTPR library to preprocess input documents based on a defined profile:

tool = SimilarityTool(
    main_config="configs/main_config.yaml", 
    schema_config="configs/schema_config.yaml",
    use_dstpr=True
)

# During search: the higher preproc_threshold is, the stricter the filtering
results = tool.search(query, limit=50, top_k=3, preproc_threshold=0.25)

Project details

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Release history

- 0.4.1 (Jun 22, 2026)
- 0.4.0 (Jun 22, 2026)
- 0.3.23 (Jun 22, 2026)
- 0.3.22 (Jun 22, 2026), this version
- 0.3.5 (Jun 22, 2026)
- 0.3.4 (Jun 22, 2026)
- 0.3.1 (Jun 21, 2026)
- 0.3.0 (Jun 21, 2026)
- 0.2.16 (Jun 21, 2026)
- 0.2.0 (Jun 17, 2026)
- 0.1.5 (May 24, 2026)
- 0.1.4 (May 24, 2026)
- 0.1.3 (May 24, 2026)
- 0.1.2 (May 23, 2026)
- 0.1.1 (May 23, 2026)
- 0.1.0 (May 22, 2026)

Download files

Source Distribution

similarity_tool-0.3.22.tar.gz (24.6 kB)

Upload date: Jun 22, 2026
Size: 24.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.20

Hashes for similarity_tool-0.3.22.tar.gz
Algorithm	Hash digest
SHA256	`7f6497b0aa07253858e787c163fe16449f93bfdeb745154848fa4eb4641c772b`
MD5	`a594feede27da137c8d2d197f7de0263`
BLAKE2b-256	`73081ac6f3532699f5113f027ce675198c869f4edb0409c7ac9d2ed2189cfeb5`

Built Distribution

similarity_tool-0.3.22-py3-none-any.whl (25.0 kB)

Upload date: Jun 22, 2026
Size: 25.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.20

Hashes for similarity_tool-0.3.22-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b4a817119f2a355ca844127231e8659681e16ecd19a73bc5f465518f785e057d`
MD5	`160780d71407d7f1894487e52c940b74`
BLAKE2b-256	`1688de4caa308042e10dc9df246aaff7e983444db87edfe3f41206d8756dc1b4`
