filemetric-engine
A lightweight file similarity engine using vector indexing.
A Python library for comparing text files by similarity percentage — with SHA-256 content caching, persistent TF-IDF indexing, incremental updates, and named group management.
What is filemetric-engine?
filemetric-engine is a local, dependency-light Python library that measures how similar two or more text files are to each other, returning a percentage score you can use directly in your application.
It scales from a quick two-file comparison all the way up to querying a single document against a corpus of 10,000+ files in under 50ms.
Everything runs locally. No API keys, no external services, no data leaves your machine.
Common use cases:
- Plagiarism and duplicate detection
- Contract and legal document comparison
- Code similarity analysis
- Document deduplication pipelines
- Research paper matching
Table of Contents
- How It Works
- Installation
- Quick Start
- Usage Guide
- API Reference
- Performance Guide
- Project Structure
- Contributing
- Testing
- License
How It Works
Similarity algorithm
filemetric-engine uses TF-IDF (Term Frequency-Inverse Document Frequency) with unigram + bigram tokenisation, followed by cosine similarity. This is the standard approach for text overlap detection — fast, interpretable, and produces consistent results without requiring a GPU or cloud service.
A score of 0% means the two documents share no common terms. A score of 100% means their term distributions are identical — for plain text, effectively identical content.
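The approach can be reproduced in a few lines with scikit-learn directly. This is a sketch of the general technique described above, not the library's internal code:

```python
# Sketch of TF-IDF (unigrams + bigrams) followed by cosine similarity,
# using scikit-learn directly. Illustrative, not filemetric-engine internals.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox leaps over a sleepy dog"

# ngram_range=(1, 2) gives unigram + bigram tokenisation, as described above.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
matrix = vectorizer.fit_transform([doc_a, doc_b])

# Cosine similarity is in [0, 1]; multiply by 100 for a percentage score.
score = cosine_similarity(matrix[0], matrix[1])[0][0] * 100
print(f"{score:.2f}%")
```

Two documents with no shared terms score 0%; a document compared against itself scores 100%.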
Two-tier index design
For dynamic corpora where new files are uploaded frequently, the engine uses a two-tier architecture to avoid expensive full rebuilds on every write:
New file uploaded
|
v
pending buffer <-- O(1) write, instant
|
| (when buffer reaches merge_threshold)
v
main FileIndex <-- full TF-IDF matrix, rebuilt automatically
|
+-- queries check both layers and merge results
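The merge logic in the diagram above can be sketched in plain Python. The class below is hypothetical and only illustrates the architecture; the real API is DynamicFileIndex, documented later in this README:

```python
# Minimal sketch of the two-tier design: an O(1) pending buffer in front of
# a main index that is only rebuilt when the buffer reaches merge_threshold.
# Hypothetical class for illustration, not the real DynamicFileIndex.
class TwoTierIndex:
    def __init__(self, merge_threshold=50):
        self.merge_threshold = merge_threshold
        self.main = []      # stands in for the full TF-IDF index
        self.pending = []   # cheap append-only buffer

    def add_file(self, path):
        self.pending.append(path)            # O(1) write, no rebuild
        if len(self.pending) >= self.merge_threshold:
            self.merge()

    def merge(self):
        # In the real engine this step re-fits the TF-IDF matrix.
        self.main.extend(self.pending)
        self.pending.clear()

    def candidates(self):
        # Queries check both layers, so results are never stale.
        return self.main + self.pending

idx = TwoTierIndex(merge_threshold=3)
idx.add_file("a.txt")
idx.add_file("b.txt")
print(len(idx.pending), len(idx.main))  # 2 0  (buffered, no rebuild yet)
idx.add_file("c.txt")                   # hits the threshold, triggers merge
print(len(idx.pending), len(idx.main))  # 0 3
```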
Caching
Files are identified by SHA-256 hash of their raw bytes, not their filename. Processed text is stored in a local SQLite database. If a file has not changed since the last run, it is served from cache with no disk read or re-processing.
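The content-hash identity scheme can be sketched with the standard library. This is a simplified in-memory version (a dict standing in for the SQLite table); the engine's actual cache key and schema may differ:

```python
# Sketch of content-addressed caching: files are keyed by the SHA-256 of
# their raw bytes, so renaming a file keeps its cache entry and identical
# content at two paths shares a single entry.
import hashlib
import tempfile
from pathlib import Path

def content_key(path):
    """SHA-256 of the file's raw bytes -- the cache key."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

cache = {}  # stands in for the SQLite table

def get_text(path):
    key = content_key(path)
    if key not in cache:                 # miss: read and process once
        cache[key] = Path(path).read_text()
    return cache[key]                    # hit: no disk read, no re-processing

tmp = Path(tempfile.mkdtemp())
(tmp / "a.txt").write_text("same content")
(tmp / "b.txt").write_text("same content")   # identical bytes, different path
get_text(tmp / "a.txt")
get_text(tmp / "b.txt")                      # served from the same entry
print(len(cache))  # 1
```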
Installation
Requirements: Python 3.9 or higher
# 1. Clone the repository
git clone https://github.com/0x0pharaoh/filemetric-engine.git
cd filemetric-engine
# 2. Create a virtual environment
python -m venv .venv
# Mac / Linux
source .venv/bin/activate
# Windows
.venv\Scripts\activate
# 3. Install
pip install --upgrade pip setuptools wheel
pip install -e ".[dev]"
Core dependencies:
| Package | Purpose |
|---|---|
| scikit-learn | TF-IDF vectoriser and cosine similarity |
| numpy | Numerical operations |
| scipy | Sparse matrix storage (memory efficient at scale) |
Optional — semantic / meaning-based similarity:
pip install sentence-transformers
Semantic mode compares documents by meaning rather than word overlap — useful when documents paraphrase the same ideas in different words. Requires more memory and is slower than TF-IDF.
Quick Start
from filemetric_engine import compare_files
result = compare_files("document_a.txt", "document_b.txt")
print(result.to_dict())
# {
# "file_1": "document_a.txt",
# "file_2": "document_b.txt",
# "common_in_percentage": 67.34
# }
Usage Guide
1. Simple Pair Comparison
from filemetric_engine import compare_files
result = compare_files("essay.txt", "reference.txt")
print(result.file_1) # "essay.txt"
print(result.file_2) # "reference.txt"
print(result.common_in_percentage) # 67.34
print(result.to_dict()) # full dict
Pass a VectorCache to skip re-reading files you have already processed:
from filemetric_engine import compare_files, VectorCache
cache = VectorCache() # persists to ~/.filemetric_engine/cache.db
r1 = compare_files("a.txt", "b.txt", cache=cache) # reads from disk
r2 = compare_files("a.txt", "c.txt", cache=cache) # a.txt from cache
2. One vs Many
from filemetric_engine import compare_one_to_many
result = compare_one_to_many(
"submission.txt",
["ref1.txt", "ref2.txt", "ref3.txt", "ref4.txt"],
top_n=3, # return only top 3 (optional)
threshold=5.0, # exclude matches below 5% (optional)
sort=True, # sort highest to lowest (default)
)
for match in result.compare:
print(f"{match.percentage}% -> {match.file}")
# 72.4% -> ref1.txt
# 31.1% -> ref3.txt
# 12.8% -> ref2.txt
import json
print(json.dumps(result.to_dict(), indent=2))
For base file lists over ~500 files, use FileIndex instead for much faster repeated queries.
3. Large Scale with FileIndex
FileIndex vectorises all base files once, saves the TF-IDF matrix to disk, and lets you run queries in under 50ms regardless of corpus size.
import glob
from filemetric_engine import FileIndex, VectorCache
cache = VectorCache()
base_files = glob.glob("/data/corpus/**/*.txt", recursive=True)
# Build once
idx = FileIndex.build(base_files, cache=cache, verbose=True)
# [FileIndex] Reading 10000 files (8 workers)...
# [FileIndex] Fitting TF-IDF vectorizer...
# [FileIndex] Done. 10000 docs x 198432 features
idx.save("corpus.pkl")
# All subsequent runs: load instantly
idx = FileIndex.load("corpus.pkl")
result = idx.query("new_document.txt", top_n=20, threshold=5.0)
for match in result.compare:
print(f"{match.percentage}% {match.file}")
print(idx.info())
# {
# "files_indexed": 10000,
# "vocab_size": 198432,
# "matrix_shape": [10000, 198432],
# "built_at": "2024-10-18T09:30:00"
# }
4. Dynamic Index — Incremental Uploads
DynamicFileIndex wraps FileIndex with a pending buffer so new files can be added without rebuilding the entire index on every upload.
One-time setup:
from filemetric_engine import DynamicFileIndex, VectorCache
cache = VectorCache()
idx = DynamicFileIndex(
merge_threshold=50, # rebuild main index after 50 buffered files
cache=cache,
)
idx.build_initial(existing_files)
idx.save("my_index.dyn")
On every new file upload:
idx = DynamicFileIndex.load("my_index.dyn", cache=cache)
idx.add_file("/uploads/new_doc.txt") # instant O(1) write
idx.save("my_index.dyn") # auto-merges when threshold is hit
Query:
result = idx.query("compare_this.txt", top_n=10, threshold=5.0)
Add multiple files at once:
idx.add_files(["/uploads/doc_a.txt", "/uploads/doc_b.txt"])
idx.save("my_index.dyn")
Force an immediate rebuild (useful after a bulk import):
idx.force_merge()
idx.save("my_index.dyn")
Remove a file:
idx.remove_file("/uploads/old_doc.txt")
idx.save("my_index.dyn")
Inspect state:
print(idx.info())
# {
# "files_in_main_index": 950,
# "files_in_pending_buffer": 12,
# "total_files": 962,
# "merge_threshold": 50,
# "main_index_built_at": "2024-10-18T09:30:00",
# "pending_files": ["/uploads/doc_x.txt", ...]
# }
Choosing a merge threshold:
| Upload rate | Recommended merge_threshold |
|---|---|
| A few files/day | 10 - 25 |
| Dozens per day | 50 (default) |
| Hundreds per day | 100 - 200 |
| Bulk batch import | Call force_merge() manually after the batch |
5. IndexRegistry — Named Groups
IndexRegistry manages multiple independent DynamicFileIndex instances as named groups, all sharing a single VectorCache.
Use this when your application handles multiple document categories, projects, or users.
Directory layout (managed automatically):
/your/index/dir/
├── registry.json <- group manifest
├── groups/
│ ├── contracts.dyn
│ ├── invoices.dyn
│ └── legal.dyn
└── cache.db <- shared VectorCache
from filemetric_engine import IndexRegistry
with IndexRegistry("/data/my_indexes", merge_threshold=50) as reg:
# Create groups
reg.create_group("contracts", initial_files=glob.glob("/docs/contracts/*.txt"))
reg.create_group("invoices", initial_files=glob.glob("/docs/invoices/*.txt"))
print(reg.list_groups()) # ["contracts", "invoices"]
# Add a file on upload
reg.add_file("contracts", "/uploads/new_contract.txt")
# Add multiple files
reg.add_files("contracts", ["/uploads/a.txt", "/uploads/b.txt"])
# Query a single group
result = reg.query("contracts", "/uploads/mystery_doc.txt", top_n=10)
# Query ALL groups simultaneously
all_results = reg.query_all("/uploads/mystery_doc.txt", top_n=5)
for group_name, group_result in all_results.items():
if group_result.compare:
top = group_result.compare[0]
print(f"[{group_name}] {top.percentage}% {top.file}")
# Remove a file
reg.remove_file("contracts", "/docs/contracts/expired.txt")
# Force rebuild
reg.force_merge("contracts")
# Delete a group
reg.delete_group("invoices")
# Metadata
print(reg.group_info("contracts"))
print(reg.cache_stats())
Reload in a new process or after a server restart:
with IndexRegistry("/data/my_indexes") as reg:
result = reg.query("contracts", "new_doc.txt")
Windows users: Always use with IndexRegistry(...) as reg: or call reg.close() explicitly before deleting the index directory. This closes the SQLite connection cleanly and prevents PermissionError: [WinError 32].
6. VectorCache
VectorCache stores the processed text of each file in a local SQLite database, keyed by SHA-256 of the file's raw bytes.
from filemetric_engine import VectorCache
# Default location: ~/.filemetric_engine/cache.db
cache = VectorCache()
# Custom path
cache = VectorCache("./project/cache.db")
# Inspect
print(cache.stats())
# {
# "entries": 9832,
# "text_bytes": 48291200,
# "vector_bytes": 0,
# "db_path": "/home/user/.filemetric_engine/cache.db"
# }
# Invalidate one entry by content hash
file_hash = VectorCache.hash_file("document.txt")
cache.invalidate(file_hash)
# Wipe all entries
cache.clear()
cache.close()
Use as a context manager:
with VectorCache() as cache:
result = compare_files("a.txt", "b.txt", cache=cache)
Behaviour:
- Files are identified by content hash, not path — renaming a file does not invalidate its cache entry
- Changing a file's content changes its hash, triggering automatic re-processing on next use
- Identical content at two different paths is stored and processed only once
- Backed by SQLite with WAL mode enabled for safe concurrent reads
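The storage pattern behind these behaviours can be sketched with the sqlite3 module. The schema below is hypothetical, not the engine's actual one; it only illustrates a hash-keyed table with WAL mode enabled:

```python
# Sketch of a content-hash-keyed SQLite cache with WAL journaling.
# Hypothetical schema for illustration, not VectorCache's actual layout.
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "cache.db")
conn = sqlite3.connect(db_path)

# WAL mode lets concurrent readers proceed while a writer holds the log.
conn.execute("PRAGMA journal_mode=WAL")
conn.execute(
    "CREATE TABLE IF NOT EXISTS cache (hash TEXT PRIMARY KEY, text TEXT)"
)

# PRIMARY KEY on the content hash deduplicates identical content for free.
conn.execute(
    "INSERT OR IGNORE INTO cache VALUES (?, ?)", ("abc123", "processed text")
)
conn.commit()

row = conn.execute(
    "SELECT text FROM cache WHERE hash = ?", ("abc123",)
).fetchone()
print(row[0])  # processed text
conn.close()
```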
7. Query Raw Text
Every query method has a _text variant that accepts a raw string instead of a file path. Useful when content comes from a database, API response, or in-memory buffer.
# Against a FileIndex
idx = FileIndex.load("corpus.pkl")
result = idx.query_text("some raw document text here", top_n=5)
# Against a DynamicFileIndex
idx = DynamicFileIndex.load("my_index.dyn")
result = idx.query_text("raw text to compare", top_n=10, threshold=5.0)
# Against a registry group
with IndexRegistry("/data/my_indexes") as reg:
result = reg.query_text("contracts", "raw contract text here", top_n=5)
for match in result.compare:
print(f"{match.percentage}% {match.file}")
API Reference
compare_files
compare_files(
file_1: str | Path,
file_2: str | Path,
cache: VectorCache | None = None,
) -> PairResult
compare_one_to_many
compare_one_to_many(
main_file: str | Path,
base_files: list[str | Path],
cache: VectorCache | None = None,
top_n: int | None = None,
threshold: float = 0.0,
sort: bool = True,
) -> MultiResult
FileIndex
| Method | Description |
|---|---|
| FileIndex.build(base_files, cache=None, max_workers=8, verbose=False) | Build index from files |
| FileIndex.load(path) | Load a saved index from disk |
| .save(path) | Save index to disk |
| .query(main_file, cache=None, top_n=None, threshold=0.0, sort=True) | Query by file path |
| .query_text(text, top_n=None, threshold=0.0) | Query by raw string |
| .add_files(new_files, cache=None) | Return a new index with additional files |
| .info() | Return index metadata |
DynamicFileIndex
| Method | Description |
|---|---|
| DynamicFileIndex(merge_threshold=50, cache=None, max_workers=8) | Create instance |
| .build_initial(base_files, verbose=False) | Seed from bulk files |
| .add_file(path, force_merge=False) -> bool | Add one file (buffered) |
| .add_files(paths, force_merge=False) -> bool | Add many files (one merge check) |
| .remove_file(path) -> bool | Remove a file |
| .force_merge() | Trigger immediate rebuild |
| .save(path) | Persist to disk |
| DynamicFileIndex.load(path, cache=None) | Load from disk |
| .query(main_file, top_n=None, threshold=0.0, sort=True) | Query both index layers |
| .query_text(text, top_n=None, threshold=0.0) | Query by raw string |
| .info() -> dict | State metadata |
IndexRegistry
| Method | Description |
|---|---|
| IndexRegistry(base_dir, merge_threshold=50, shared_cache=True) | Open or create registry |
| .create_group(name, initial_files=None, merge_threshold=None, overwrite=False) | Create a group |
| .delete_group(name) | Delete a group |
| .list_groups() -> list[str] | All group names |
| .group_info(name) -> dict | Metadata for one group |
| .all_info() -> dict | Metadata for all groups |
| .add_file(group, path) -> bool | Add a file to a group |
| .add_files(group, paths) -> bool | Add multiple files to a group |
| .remove_file(group, path) -> bool | Remove a file from a group |
| .force_merge(group) | Force rebuild for a group |
| .query(group, main_file, top_n=None, threshold=0.0) | Query a single group |
| .query_all(main_file, top_n=None, threshold=0.0) -> dict | Query all groups |
| .query_text(group, text, top_n=None, threshold=0.0) | Query raw text |
| .cache_stats() -> dict | Cache statistics |
| .close() | Close SQLite connection |
Return types
@dataclass
class PairResult:
file_1: str
file_2: str
common_in_percentage: float # 0.0 to 100.0
def to_dict(self) -> dict: ...
@dataclass
class FileMatch:
file: str
percentage: float
def to_dict(self) -> dict: ...
@dataclass
class MultiResult:
file: str
compare: list[FileMatch]
def to_dict(self) -> dict: ...
def top(self, n: int) -> MultiResult: ...
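A minimal sketch of the to_dict() / top() contract these types declare. Field names follow the reference above; the implementations here are illustrative, not the library's own code:

```python
# Illustrative implementations of the result types' declared contract.
from dataclasses import dataclass

@dataclass
class FileMatch:
    file: str
    percentage: float

    def to_dict(self) -> dict:
        return {"file": self.file, "percentage": self.percentage}

@dataclass
class MultiResult:
    file: str
    compare: list

    def to_dict(self) -> dict:
        return {
            "file": self.file,
            "compare": [m.to_dict() for m in self.compare],
        }

    def top(self, n: int) -> "MultiResult":
        # Keep only the n highest-scoring matches.
        ranked = sorted(self.compare, key=lambda m: m.percentage, reverse=True)
        return MultiResult(self.file, ranked[:n])

r = MultiResult("q.txt", [FileMatch("a.txt", 12.8), FileMatch("b.txt", 72.4)])
print(r.top(1).to_dict())
# {'file': 'q.txt', 'compare': [{'file': 'b.txt', 'percentage': 72.4}]}
```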
Performance Guide
| Corpus size | Recommended approach | First build | Subsequent queries |
|---|---|---|---|
| < 100 files | compare_one_to_many | N/A | < 100ms |
| 100 - 1,000 | compare_one_to_many + cache | N/A | < 1s |
| 1,000 - 50,000 | FileIndex or DynamicFileIndex | 30s - 5min* | < 50ms |
| 50,000+ | FileIndex with max_features tuning | varies | < 200ms |
*On first build. With cache populated, subsequent rebuilds skip file I/O and only re-fit the TF-IDF matrix.
Reducing memory at scale:
The TF-IDF matrix is stored as a sparse matrix. For very large corpora you can reduce memory usage by capping the vocabulary size:
idx = FileIndex.build(
base_files,
vectorizer_kwargs={"max_features": 50_000}, # default is 200_000
)
Project Structure
filemetric-engine/
├── filemetric_engine/
│ ├── __init__.py <- public API exports
│ ├── types.py <- PairResult, MultiResult, FileMatch
│ ├── cache.py <- VectorCache (SQLite, thread-safe, SHA-256)
│ ├── compare.py <- compare_files(), compare_one_to_many()
│ ├── index.py <- FileIndex, DynamicFileIndex
│ └── registry.py <- IndexRegistry
├── tests/
│ ├── test_filemetric.py <- unit tests: core functions, cache, FileIndex
│ └── test_dynamic.py <- unit tests: DynamicFileIndex, IndexRegistry
├── smoke_test.py <- end-to-end manual test
├── pyproject.toml
├── LICENSE
└── README.md
Contributing
Contributions are welcome and appreciated.
# Fork the repo, then clone your fork
git clone https://github.com/your-username/filemetric-engine.git
cd filemetric-engine
# Create a feature branch
git checkout -b feature/your-feature-name
# Install in dev mode
pip install -e ".[dev]"
# Make your changes, then run the tests
pytest tests/ -v
# Push and open a pull request
git push origin feature/your-feature-name
Guidelines:
- New features should include tests in tests/
- Keep public API changes backward-compatible where possible
- Confirm pytest tests/ -v shows 63 passed before submitting
- Open an issue first if you are planning a large or breaking change
Testing
Run the full test suite:
pytest tests/ -v
Run the end-to-end smoke test:
python smoke_test.py
The smoke test exercises all layers: pair comparison, one-to-many, cache hit/miss, dynamic index with auto-merge, registry group lifecycle, and persistence across simulated restarts.
Expected output:
filemetric-engine -- smoke test
------------------------------------------------------------
Created 6 sample documents in sample_docs/
============================================================
TEST 1: Simple pair comparison
============================================================
...
============================================================
ALL TESTS PASSED
============================================================
Author
Pharaoh (GitHub: @0x0pharaoh)
For bugs and feature requests, please open an issue. For questions or collaboration, reach out via GitHub.
License
MIT License — see LICENSE for full terms.
If you find this project useful, consider leaving a star on GitHub.
File details
Details for the file filemetric_engine-1.1.0.tar.gz.
File metadata
- Size: 28.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c4463d5206dc81ed52addab513afa703684c9f2fa7e70bf87da3a38e73d28426 |
| MD5 | 8bb1173d7c2c4fee5b459e8ab7d6f089 |
| BLAKE2b-256 | 1ab3dca3135b780822b65f90137d177c60f1f0419d04d36a9ad4154d4efb503f |
Provenance
The following attestation bundles were made for filemetric_engine-1.1.0.tar.gz:
Publisher: pypi-publish.yml on 0x0pharaoh/filemetric-engine
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: filemetric_engine-1.1.0.tar.gz
- Subject digest: c4463d5206dc81ed52addab513afa703684c9f2fa7e70bf87da3a38e73d28426
- Sigstore transparency entry: 1338965284
- Permalink: 0x0pharaoh/filemetric-engine@9e015725a115a16cb0d7b5b22ae8bf785783157b
- Branch / Tag: refs/tags/v1.1.0
- Owner: https://github.com/0x0pharaoh
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi-publish.yml@9e015725a115a16cb0d7b5b22ae8bf785783157b
- Trigger Event: push
File details
Details for the file filemetric_engine-1.1.0-py3-none-any.whl.
File metadata
- Size: 21.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | aba83ec10fc09cdf97d5e1dddc33e0b1489f7467c4c41a1b503e907e198ffc24 |
| MD5 | 04053bc79e3bcb816f584a2919a3ee9d |
| BLAKE2b-256 | f8246454546288f4e06988bf440d7a91dff310f3a8e7b11c7ecc3cef2e848096 |
Provenance
The following attestation bundles were made for filemetric_engine-1.1.0-py3-none-any.whl:
Publisher: pypi-publish.yml on 0x0pharaoh/filemetric-engine
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: filemetric_engine-1.1.0-py3-none-any.whl
- Subject digest: aba83ec10fc09cdf97d5e1dddc33e0b1489f7467c4c41a1b503e907e198ffc24
- Sigstore transparency entry: 1338965290
- Permalink: 0x0pharaoh/filemetric-engine@9e015725a115a16cb0d7b5b22ae8bf785783157b
- Branch / Tag: refs/tags/v1.1.0
- Owner: https://github.com/0x0pharaoh
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi-publish.yml@9e015725a115a16cb0d7b5b22ae8bf785783157b
- Trigger Event: push