A lightweight file similarity engine using vector indexing

filemetric-engine

A Python library for comparing text files by similarity percentage — with SHA-256 content caching, persistent TF-IDF indexing, incremental updates, and named group management.

What is filemetric-engine?

filemetric-engine is a local, dependency-light Python library that measures how similar two or more text files are to each other, returning a percentage score you can use directly in your application.

It scales from a quick two-file comparison all the way up to querying a single document against a corpus of 10,000+ files in under 50ms.

Everything runs locally. No API keys, no external services, no data leaves your machine.

Common use cases:

Plagiarism and duplicate detection
Contract and legal document comparison
Code similarity analysis
Document deduplication pipelines
Research paper matching

How It Works

Similarity algorithm

filemetric-engine uses TF-IDF (Term Frequency-Inverse Document Frequency) with unigram + bigram tokenisation, followed by cosine similarity. This is the standard approach for text overlap detection — fast, interpretable, and produces consistent results without requiring a GPU or cloud service.

A score of 0% means the two documents share no terms. A score of 100% means the documents are identical after tokenisation.
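To make the scoring concrete, here is a minimal standard-library sketch of TF-IDF over unigrams and bigrams followed by cosine similarity. It is an illustration only: the function names, the tokenisation regex, and the smoothed IDF formula are assumptions, and the library itself relies on scikit-learn rather than this code.

```python
import math
import re
from collections import Counter


def tokens(text: str) -> list[str]:
    """Lowercased word unigrams plus adjacent-word bigrams."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    bigrams = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return words + bigrams


def tfidf_vectors(docs: list[str]) -> list[dict[str, float]]:
    """Weight each term by its count times a smoothed inverse document frequency."""
    counts = [Counter(tokens(d)) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for c in counts for term in c)
    idf = {t: math.log((1 + n) / (1 + k)) + 1.0 for t, k in df.items()}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]


def cosine_percent(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors, scaled to 0-100."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return round(100.0 * dot / norm, 2) if norm else 0.0


docs = [
    "the cat sat on the mat",
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "quantum flux capacitor",
]
vecs = tfidf_vectors(docs)
same = cosine_percent(vecs[0], vecs[1])      # identical text -> 100.0
partial = cosine_percent(vecs[0], vecs[2])   # some shared terms -> between 0 and 100
disjoint = cosine_percent(vecs[0], vecs[3])  # no shared terms -> 0.0
```

Bigrams are what make word order matter: two documents with the same words in a different order share unigrams but few bigrams, so they score below 100%.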

Two-tier index design

For dynamic corpora where new files are uploaded frequently, the engine uses a two-tier architecture to avoid expensive full rebuilds on every write:

New file uploaded
      |
      v
 pending buffer        <-- O(1) write, instant
      |
      | (when buffer reaches merge_threshold)
      v
 main FileIndex        <-- full TF-IDF matrix, rebuilt automatically
      |
      +-- queries check both layers and merge results
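The flow in the diagram can be sketched as a toy class. This is a simplified model under stated assumptions, not the library's implementation: `TwoTierIndex` and its attributes are hypothetical names, and plain lists stand in for the TF-IDF matrix.

```python
class TwoTierIndex:
    """Toy two-tier store: a rebuilt 'main' layer plus a cheap pending buffer."""

    def __init__(self, merge_threshold: int = 50) -> None:
        self.merge_threshold = merge_threshold
        self.main: list[str] = []     # stands in for the full TF-IDF matrix
        self.pending: list[str] = []  # recently added files, not yet merged
        self.rebuilds = 0             # how many expensive merges have run

    def add(self, path: str) -> bool:
        """Append to the buffer; return True if this add triggered a merge."""
        self.pending.append(path)
        if len(self.pending) >= self.merge_threshold:
            self.merge()
            return True
        return False

    def merge(self) -> None:
        """Fold the buffer into the main layer (the expensive step)."""
        self.main.extend(self.pending)
        self.pending.clear()
        self.rebuilds += 1

    def searchable(self) -> list[str]:
        """Queries see both layers, so buffered files are never invisible."""
        return self.main + self.pending


idx = TwoTierIndex(merge_threshold=3)
flags = [idx.add(f"doc_{i}.txt") for i in range(4)]
# The third add fills the buffer and triggers the only merge;
# the fourth file waits in the pending buffer.
```

The point of the design is that the cost of a rebuild is paid once per `merge_threshold` uploads rather than once per upload, while queries still cover every file.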

Caching

Files are identified by the SHA-256 hash of their raw bytes, not by filename. Processed text is stored in a local SQLite database. If a file's content has not changed since the last run, its processed text is served from the cache instead of being re-processed.
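A content-addressed cache of this kind can be sketched with `hashlib` and `sqlite3` from the standard library. The class name `TextCache`, the table layout, and the lowercase step standing in for text processing are all assumptions made for illustration; the real VectorCache schema is not documented here.

```python
import hashlib
import os
import sqlite3
import tempfile


def sha256_of(path: str) -> str:
    """Hash the raw bytes of a file."""
    with open(path, "rb") as fh:
        return hashlib.sha256(fh.read()).hexdigest()


class TextCache:
    """Hypothetical cache: processed text keyed by content hash."""

    def __init__(self, db_path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS texts (hash TEXT PRIMARY KEY, body TEXT)"
        )
        self.misses = 0

    def get_text(self, path: str) -> str:
        key = sha256_of(path)
        row = self.conn.execute(
            "SELECT body FROM texts WHERE hash = ?", (key,)
        ).fetchone()
        if row is not None:
            return row[0]  # cache hit: skip processing
        self.misses += 1
        with open(path, encoding="utf-8") as fh:
            body = fh.read().lower()  # placeholder for real text processing
        self.conn.execute("INSERT INTO texts VALUES (?, ?)", (key, body))
        self.conn.commit()
        return body


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "note.txt")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("Hello Cache")
    cache = TextCache()
    first = cache.get_text(path)   # miss: processed and stored
    second = cache.get_text(path)  # hit: same bytes, same hash
    misses_after_repeat = cache.misses
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("Hello again")
    third = cache.get_text(path)   # miss: new content means a new hash
    cache.conn.close()
```

Note that this sketch still reads the file's bytes to compute the hash; what a hit saves is the processing step.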

Installation

Requirements: Python 3.9 or higher

# 1. Clone the repository
git clone https://github.com/0x0pharaoh/filemetric-engine.git
cd filemetric-engine

# 2. Create a virtual environment
python -m venv .venv

# Mac / Linux
source .venv/bin/activate

# Windows
.venv\Scripts\activate

# 3. Install
pip install --upgrade pip setuptools wheel
pip install -e ".[dev]"

Core dependencies:

Package	Purpose
`scikit-learn`	TF-IDF vectoriser and cosine similarity
`numpy`	Numerical operations
`scipy`	Sparse matrix storage (memory efficient at scale)

Optional — semantic / meaning-based similarity:

pip install sentence-transformers

Semantic mode compares documents by meaning rather than word overlap — useful when documents paraphrase the same ideas in different words. Requires more memory and is slower than TF-IDF.

Quick Start

from filemetric_engine import compare_files

result = compare_files("document_a.txt", "document_b.txt")
print(result.to_dict())
# {
#   "file_1": "document_a.txt",
#   "file_2": "document_b.txt",
#   "common_in_percentage": 67.34
# }

Usage Guide

1. Simple Pair Comparison

from filemetric_engine import compare_files

result = compare_files("essay.txt", "reference.txt")

print(result.file_1)                # "essay.txt"
print(result.file_2)                # "reference.txt"
print(result.common_in_percentage)  # 67.34
print(result.to_dict())             # full dict

Pass a VectorCache to skip re-reading files you have already processed:

from filemetric_engine import compare_files, VectorCache

cache = VectorCache()  # persists to ~/.filemetric_engine/cache.db

r1 = compare_files("a.txt", "b.txt", cache=cache)  # reads from disk
r2 = compare_files("a.txt", "c.txt", cache=cache)  # a.txt from cache

2. One vs Many

from filemetric_engine import compare_one_to_many

result = compare_one_to_many(
    "submission.txt",
    ["ref1.txt", "ref2.txt", "ref3.txt", "ref4.txt"],
    top_n=3,        # return only top 3 (optional)
    threshold=5.0,  # exclude matches below 5% (optional)
    sort=True,      # sort highest to lowest (default)
)

for match in result.compare:
    print(f"{match.percentage}%  ->  {match.file}")

# 72.4%  ->  ref1.txt
# 31.1%  ->  ref3.txt
# 12.8%  ->  ref2.txt

import json
print(json.dumps(result.to_dict(), indent=2))

For base file lists over ~500 files, use FileIndex instead for much faster repeated queries.
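How `threshold`, `sort`, and `top_n` combine can be pictured with a small helper. This is a sketch under assumptions: `rank_matches` is a hypothetical name, the threshold is treated as inclusive, and `top_n` is applied after sorting. The score for ref4.txt is made up for the example; the other three come from the output above.

```python
from typing import Optional


def rank_matches(
    scores: dict[str, float],
    top_n: Optional[int] = None,
    threshold: float = 0.0,
    sort: bool = True,
) -> list[tuple[str, float]]:
    """Filter by threshold, optionally sort descending, then truncate to top_n."""
    matches = [(name, pct) for name, pct in scores.items() if pct >= threshold]
    if sort:
        matches.sort(key=lambda m: m[1], reverse=True)
    return matches if top_n is None else matches[:top_n]


scores = {
    "ref1.txt": 72.4,
    "ref2.txt": 12.8,
    "ref3.txt": 31.1,
    "ref4.txt": 3.2,  # hypothetical value, below the 5% threshold
}
ranked = rank_matches(scores, top_n=3, threshold=5.0)
# [("ref1.txt", 72.4), ("ref3.txt", 31.1), ("ref2.txt", 12.8)]
```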

3. Large Scale with FileIndex

FileIndex vectorises all base files once, saves the TF-IDF matrix to disk, and lets you query it without re-fitting, typically in under 50ms for corpora up to about 50,000 files (see the Performance Guide for larger corpora).

import glob
from filemetric_engine import FileIndex, VectorCache

cache = VectorCache()
base_files = glob.glob("/data/corpus/**/*.txt", recursive=True)

# Build once
idx = FileIndex.build(base_files, cache=cache, verbose=True)
# [FileIndex] Reading 10000 files (8 workers)...
# [FileIndex] Fitting TF-IDF vectorizer...
# [FileIndex] Done. 10000 docs x 198432 features

idx.save("corpus.pkl")

# All subsequent runs: load instantly
idx = FileIndex.load("corpus.pkl")

result = idx.query("new_document.txt", top_n=20, threshold=5.0)

for match in result.compare:
    print(f"{match.percentage}%  {match.file}")

print(idx.info())
# {
#   "files_indexed": 10000,
#   "vocab_size": 198432,
#   "matrix_shape": [10000, 198432],
#   "built_at": "2024-10-18T09:30:00"
# }
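The build-once, query-many pattern behind FileIndex can be approximated directly with scikit-learn. The sketch below is an assumption about the general approach, not the library's internals: the pickle layout, the file names, and the in-memory corpus are invented for the example.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = {
    "a.txt": "the quick brown fox jumps over the lazy dog",
    "b.txt": "a quick brown fox leaps over a sleepy dog",
    "c.txt": "stock markets fell sharply on monday morning",
}

# Build once: fit the vectoriser and keep the sparse document-term matrix.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
matrix = vectorizer.fit_transform(list(corpus.values()))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_index.pkl")
    # Persist everything a later query needs.
    with open(path, "wb") as fh:
        pickle.dump(
            {"names": list(corpus), "vectorizer": vectorizer, "matrix": matrix}, fh
        )
    # Later runs: load instead of re-fitting.
    with open(path, "rb") as fh:
        loaded = pickle.load(fh)

# Query: transform the new text with the fitted vocabulary, then compare
# it against every stored row in one call.
query_vec = loaded["vectorizer"].transform(["the quick brown fox"])
scores = cosine_similarity(query_vec, loaded["matrix"])[0] * 100
ranked = sorted(zip(loaded["names"], scores), key=lambda m: m[1], reverse=True)
```

A query only costs one `transform` and one sparse matrix product, which is why it stays fast as the corpus grows. As with any pickle file, only load indexes you created yourself.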

4. Dynamic Index — Incremental Uploads

DynamicFileIndex wraps FileIndex with a pending buffer so new files can be added without rebuilding the entire index on every upload.

One-time setup:

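import glob
from filemetric_engine import DynamicFileIndex, VectorCache

cache = VectorCache()
existing_files = glob.glob("/data/corpus/**/*.txt", recursive=True)  # your current corpus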

idx = DynamicFileIndex(
    merge_threshold=50,  # rebuild main index after 50 buffered files
    cache=cache,
)
idx.build_initial(existing_files)
idx.save("my_index.dyn")

On every new file upload:

idx = DynamicFileIndex.load("my_index.dyn", cache=cache)
idx.add_file("/uploads/new_doc.txt")  # instant O(1) write
idx.save("my_index.dyn")              # auto-merges when threshold is hit

Query:

result = idx.query("compare_this.txt", top_n=10, threshold=5.0)

Add multiple files at once:

idx.add_files(["/uploads/doc_a.txt", "/uploads/doc_b.txt"])
idx.save("my_index.dyn")

Force an immediate rebuild (useful after a bulk import):

idx.force_merge()
idx.save("my_index.dyn")

Remove a file:

idx.remove_file("/uploads/old_doc.txt")
idx.save("my_index.dyn")

Inspect state:

print(idx.info())
# {
#   "files_in_main_index": 950,
#   "files_in_pending_buffer": 12,
#   "total_files": 962,
#   "merge_threshold": 50,
#   "main_index_built_at": "2024-10-18T09:30:00",
#   "pending_files": ["/uploads/doc_x.txt", ...]
# }

Choosing a merge threshold:

Upload rate	Recommended `merge_threshold`
A few files/day	`10 - 25`
Dozens per day	`50` (default)
Hundreds per day	`100 - 200`
Bulk batch import	Call `force_merge()` manually after the batch
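The trade-off in the table is simple arithmetic: a larger threshold means fewer rebuilds but more files waiting in the buffer. A rough sketch follows, assuming uploads are spread evenly and a merge fires exactly when the buffer reaches the threshold; the helper names are hypothetical.

```python
def rebuilds_per_day(uploads_per_day: int, merge_threshold: int) -> float:
    """Average number of main-index rebuilds triggered per day."""
    if merge_threshold < 1:
        raise ValueError("merge_threshold must be at least 1")
    return uploads_per_day / merge_threshold


def max_buffered(merge_threshold: int) -> int:
    """Largest number of files that can sit in the buffer between merges."""
    return merge_threshold - 1


busy_small = rebuilds_per_day(300, 10)   # 30.0 rebuilds/day
busy_large = rebuilds_per_day(300, 100)  # 3.0 rebuilds/day
waiting = max_buffered(100)              # up to 99 files queried from the buffer
```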

5. IndexRegistry — Named Groups

IndexRegistry manages multiple independent DynamicFileIndex instances as named groups, all sharing a single VectorCache.

Use this when your application handles multiple document categories, projects, or users.

Directory layout (managed automatically):

/your/index/dir/
├── registry.json        <- group manifest
├── groups/
│   ├── contracts.dyn
│   ├── invoices.dyn
│   └── legal.dyn
└── cache.db             <- shared VectorCache

import glob
from filemetric_engine import IndexRegistry

with IndexRegistry("/data/my_indexes", merge_threshold=50) as reg:

    # Create groups
    reg.create_group("contracts", initial_files=glob.glob("/docs/contracts/*.txt"))
    reg.create_group("invoices",  initial_files=glob.glob("/docs/invoices/*.txt"))

    print(reg.list_groups())  # ["contracts", "invoices"]

    # Add a file on upload
    reg.add_file("contracts", "/uploads/new_contract.txt")

    # Add multiple files
    reg.add_files("contracts", ["/uploads/a.txt", "/uploads/b.txt"])

    # Query a single group
    result = reg.query("contracts", "/uploads/mystery_doc.txt", top_n=10)

    # Query ALL groups simultaneously
    all_results = reg.query_all("/uploads/mystery_doc.txt", top_n=5)
    for group_name, group_result in all_results.items():
        if group_result.compare:
            top = group_result.compare[0]
            print(f"[{group_name}] {top.percentage}%  {top.file}")

    # Remove a file
    reg.remove_file("contracts", "/docs/contracts/expired.txt")

    # Force rebuild
    reg.force_merge("contracts")

    # Delete a group
    reg.delete_group("invoices")

    # Metadata
    print(reg.group_info("contracts"))
    print(reg.cache_stats())

Reload in a new process or after a server restart:

with IndexRegistry("/data/my_indexes") as reg:
    result = reg.query("contracts", "new_doc.txt")

Windows users: Always use `with IndexRegistry(...) as reg:` or call `reg.close()` explicitly before deleting the index directory. This closes the SQLite connection cleanly and prevents `PermissionError: [WinError 32]`.

6. VectorCache

VectorCache stores the processed text of each file in a local SQLite database, keyed by SHA-256 of the file's raw bytes.

from filemetric_engine import VectorCache

# Default location: ~/.filemetric_engine/cache.db
cache = VectorCache()

# Custom path
cache = VectorCache("./project/cache.db")

# Inspect
print(cache.stats())
# {
#   "entries": 9832,
#   "text_bytes": 48291200,
#   "vector_bytes": 0,
#   "db_path": "/home/user/.filemetric_engine/cache.db"
# }

# Invalidate one entry by content hash
file_hash = VectorCache.hash_file("document.txt")
cache.invalidate(file_hash)

# Wipe all entries
cache.clear()

cache.close()

Use as a context manager:

from filemetric_engine import compare_files, VectorCache

with VectorCache() as cache:
    result = compare_files("a.txt", "b.txt", cache=cache)

Behaviour:

Files are identified by content hash, not path — renaming a file does not invalidate its cache entry
Changing a file's content changes its hash, triggering automatic re-processing on next use
Identical content at two different paths is stored and processed only once
Backed by SQLite with WAL mode enabled for safe concurrent reads
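The first three behaviours follow directly from hashing content rather than paths, which can be checked with the standard library alone. This sketch uses a hypothetical `content_hash` helper and does not call VectorCache itself.

```python
import hashlib
import os
import shutil
import tempfile


def content_hash(path: str) -> str:
    """SHA-256 of a file's raw bytes, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "report.txt")
    with open(original, "wb") as fh:
        fh.write(b"quarterly results\n")
    before = content_hash(original)

    # Renaming changes the path but not the bytes.
    renamed = os.path.join(tmp, "report_final.txt")
    os.rename(original, renamed)
    after_rename = content_hash(renamed)

    # A copy at a second path has the same hash, so one cache entry serves both.
    copy = os.path.join(tmp, "copy.txt")
    shutil.copyfile(renamed, copy)
    copy_hash = content_hash(copy)

    # Editing the content produces a different hash, forcing re-processing.
    with open(copy, "ab") as fh:
        fh.write(b"appendix\n")
    edited_hash = content_hash(copy)
```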

7. Query Raw Text

Every query method has a _text variant that accepts a raw string instead of a file path. Useful when content comes from a database, API response, or in-memory buffer.

from filemetric_engine import FileIndex, DynamicFileIndex, IndexRegistry

# Against a FileIndex
idx = FileIndex.load("corpus.pkl")
result = idx.query_text("some raw document text here", top_n=5)

# Against a DynamicFileIndex
idx = DynamicFileIndex.load("my_index.dyn")
result = idx.query_text("raw text to compare", top_n=10, threshold=5.0)

# Against a registry group
with IndexRegistry("/data/my_indexes") as reg:
    result = reg.query_text("contracts", "raw contract text here", top_n=5)

for match in result.compare:
    print(f"{match.percentage}%  {match.file}")

API Reference

`compare_files`

compare_files(
    file_1: str | Path,
    file_2: str | Path,
    cache: VectorCache | None = None,
) -> PairResult

`compare_one_to_many`

compare_one_to_many(
    main_file: str | Path,
    base_files: list[str | Path],
    cache: VectorCache | None = None,
    top_n: int | None = None,
    threshold: float = 0.0,
    sort: bool = True,
) -> MultiResult

`FileIndex`

Method	Description
`FileIndex.build(base_files, cache=None, max_workers=8, verbose=False)`	Build index from files
`FileIndex.load(path)`	Load a saved index from disk
`.save(path)`	Save index to disk
`.query(main_file, cache=None, top_n=None, threshold=0.0, sort=True)`	Query by file path
`.query_text(text, top_n=None, threshold=0.0)`	Query by raw string
`.add_files(new_files, cache=None)`	Return a new index with additional files
`.info()`	Return index metadata

`DynamicFileIndex`

Method	Description
`DynamicFileIndex(merge_threshold=50, cache=None, max_workers=8)`	Create instance
`.build_initial(base_files, verbose=False)`	Seed from bulk files
`.add_file(path, force_merge=False) -> bool`	Add one file (buffered)
`.add_files(paths, force_merge=False) -> bool`	Add many files (one merge check)
`.remove_file(path) -> bool`	Remove a file
`.force_merge()`	Trigger immediate rebuild
`.save(path)`	Persist to disk
`DynamicFileIndex.load(path, cache=None)`	Load from disk
`.query(main_file, top_n=None, threshold=0.0, sort=True)`	Query both index layers
`.query_text(text, top_n=None, threshold=0.0)`	Query by raw string
`.info() -> dict`	State metadata

`IndexRegistry`

Method	Description
`IndexRegistry(base_dir, merge_threshold=50, shared_cache=True)`	Open or create registry
`.create_group(name, initial_files=None, merge_threshold=None, overwrite=False)`	Create a group
`.delete_group(name)`	Delete a group
`.list_groups() -> list[str]`	All group names
`.group_info(name) -> dict`	Metadata for one group
`.all_info() -> dict`	Metadata for all groups
`.add_file(group, path) -> bool`	Add a file to a group
`.add_files(group, paths) -> bool`	Add multiple files to a group
`.remove_file(group, path) -> bool`	Remove a file from a group
`.force_merge(group)`	Force rebuild for a group
`.query(group, main_file, top_n=None, threshold=0.0)`	Query a single group
`.query_all(main_file, top_n=None, threshold=0.0) -> dict`	Query all groups
`.query_text(group, text, top_n=None, threshold=0.0)`	Query raw text
`.cache_stats() -> dict`	Cache statistics
`.close()`	Close SQLite connection

Return types

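from __future__ import annotations  # lets top() name MultiResult in its own body
from dataclasses import dataclass

@dataclass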
class PairResult:
    file_1: str
    file_2: str
    common_in_percentage: float  # 0.0 to 100.0
    def to_dict(self) -> dict: ...

@dataclass
class FileMatch:
    file: str
    percentage: float
    def to_dict(self) -> dict: ...

@dataclass
class MultiResult:
    file: str
    compare: list[FileMatch]
    def to_dict(self) -> dict: ...
    def top(self, n: int) -> MultiResult: ...
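For readers who want to see how such result types might behave, here is one plausible way to fill in the method bodies. The implementations below are assumptions (the source only shows the signatures): `to_dict` is sketched with `dataclasses.asdict`, and `top` is assumed to keep the first `n` matches in their existing order.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass, field


@dataclass
class FileMatch:
    file: str
    percentage: float

    def to_dict(self) -> dict:
        return asdict(self)


@dataclass
class MultiResult:
    file: str
    compare: list[FileMatch] = field(default_factory=list)

    def to_dict(self) -> dict:
        # asdict recurses into the nested FileMatch dataclasses.
        return asdict(self)

    def top(self, n: int) -> MultiResult:
        # Assumed behaviour: keep the first n matches as already ordered.
        return MultiResult(self.file, self.compare[:n])


result = MultiResult(
    "submission.txt",
    [FileMatch("ref1.txt", 72.4), FileMatch("ref3.txt", 31.1), FileMatch("ref2.txt", 12.8)],
)
best = result.top(1).to_dict()
# {"file": "submission.txt", "compare": [{"file": "ref1.txt", "percentage": 72.4}]}
```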

Performance Guide

Corpus size	Recommended approach	First build	Subsequent queries
< 100 files	`compare_one_to_many`	N/A	< 100ms
100 - 1,000	`compare_one_to_many` + cache	N/A	< 1s
1,000 - 50,000	`FileIndex` or `DynamicFileIndex`	30s - 5min*	< 50ms
50,000+	`FileIndex` with `max_features` tuning	varies	< 200ms

*First-build time with an empty cache. Once the cache is populated, subsequent rebuilds skip file I/O and only re-fit the TF-IDF matrix.

Reducing memory at scale:

The TF-IDF matrix is stored as a sparse matrix. For very large corpora you can reduce memory usage by capping the vocabulary size:

idx = FileIndex.build(
    base_files,
    vectorizer_kwargs={"max_features": 50_000},  # default is 200_000
)
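The effect of capping the vocabulary can be seen with scikit-learn's `TfidfVectorizer` on its own, where `max_features` is a documented parameter. Whether filemetric-engine forwards `vectorizer_kwargs` unchanged to that class is an assumption here, and the synthetic corpus is invented for the demonstration.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Synthetic corpus: each document contributes a few unique terms,
# so the uncapped vocabulary grows with the number of documents.
docs = [f"document {i} mentions topic{i} and shared words" for i in range(200)]

full = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(docs)
capped = TfidfVectorizer(ngram_range=(1, 2), max_features=50).fit_transform(docs)

full_columns = full.shape[1]      # one column per distinct unigram or bigram
capped_columns = capped.shape[1]  # limited to the 50 most frequent terms
```

Rare terms are the ones that get dropped, so a lower cap saves memory at the cost of some sensitivity to unusual phrases.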

Project Structure

filemetric-engine/
├── filemetric_engine/
│   ├── __init__.py        <- public API exports
│   ├── types.py           <- PairResult, MultiResult, FileMatch
│   ├── cache.py           <- VectorCache (SQLite, thread-safe, SHA-256)
│   ├── compare.py         <- compare_files(), compare_one_to_many()
│   ├── index.py           <- FileIndex, DynamicFileIndex
│   └── registry.py        <- IndexRegistry
├── tests/
│   ├── test_filemetric.py <- unit tests: core functions, cache, FileIndex
│   └── test_dynamic.py    <- unit tests: DynamicFileIndex, IndexRegistry
├── smoke_test.py          <- end-to-end manual test
├── pyproject.toml
├── LICENSE
└── README.md

Contributing

Contributions are welcome and appreciated.

# Fork the repo, then clone your fork
git clone https://github.com/your-username/filemetric-engine.git
cd filemetric-engine

# Create a feature branch
git checkout -b feature/your-feature-name

# Install in dev mode
pip install -e ".[dev]"

# Make your changes, then run the tests
pytest tests/ -v

# Push and open a pull request
git push origin feature/your-feature-name

Guidelines:

New features should include tests in tests/
Keep public API changes backward-compatible where possible
Confirm `pytest tests/ -v` shows 63 passed before submitting
Open an issue first if you are planning a large or breaking change

Testing

Run the full test suite:

pytest tests/ -v

Run the end-to-end smoke test:

python smoke_test.py

The smoke test exercises all layers: pair comparison, one-to-many, cache hit/miss, dynamic index with auto-merge, registry group lifecycle, and persistence across simulated restarts.

Expected output:

filemetric-engine -- smoke test
------------------------------------------------------------
  Created 6 sample documents in sample_docs/

============================================================
  TEST 1: Simple pair comparison
============================================================
...
============================================================
  ALL TESTS PASSED
============================================================

Author

Pharaoh (GitHub: @0x0pharaoh)

For bugs and feature requests, please open an issue. For questions or collaboration, reach out via GitHub.

License

MIT License — see LICENSE for full terms.

If you find this project useful, consider leaving a star on GitHub.

Project details

Release history

Version	Released
1.1.0 (this version)	Apr 18, 2026
1.0.3	Apr 18, 2026
1.0.2	Apr 18, 2026
1.0.1	Apr 18, 2026
1.0.0	Apr 18, 2026

Download files

Source Distribution

filemetric_engine-1.1.0.tar.gz (28.1 kB)

Uploaded Apr 18, 2026 Source

Built Distribution

filemetric_engine-1.1.0-py3-none-any.whl (21.6 kB)

Uploaded Apr 18, 2026 Python 3

File details

Details for the file filemetric_engine-1.1.0.tar.gz.

File metadata

Download URL: filemetric_engine-1.1.0.tar.gz
Upload date: Apr 18, 2026
Size: 28.1 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for filemetric_engine-1.1.0.tar.gz
Algorithm	Hash digest
SHA256	`c4463d5206dc81ed52addab513afa703684c9f2fa7e70bf87da3a38e73d28426`
MD5	`8bb1173d7c2c4fee5b459e8ab7d6f089`
BLAKE2b-256	`1ab3dca3135b780822b65f90137d177c60f1f0419d04d36a9ad4154d4efb503f`

Provenance

The following attestation bundles were made for filemetric_engine-1.1.0.tar.gz:

Publisher: pypi-publish.yml on 0x0pharaoh/filemetric-engine

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: filemetric_engine-1.1.0.tar.gz
- Subject digest: c4463d5206dc81ed52addab513afa703684c9f2fa7e70bf87da3a38e73d28426
- Sigstore transparency entry: 1338965284
- Sigstore integration time: Apr 18, 2026
Source repository:
- Permalink: 0x0pharaoh/filemetric-engine@9e015725a115a16cb0d7b5b22ae8bf785783157b
- Branch / Tag: refs/tags/v1.1.0
- Owner: https://github.com/0x0pharaoh
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi-publish.yml@9e015725a115a16cb0d7b5b22ae8bf785783157b
- Trigger Event: push

File details

Details for the file filemetric_engine-1.1.0-py3-none-any.whl.

File metadata

Download URL: filemetric_engine-1.1.0-py3-none-any.whl
Upload date: Apr 18, 2026
Size: 21.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for filemetric_engine-1.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`aba83ec10fc09cdf97d5e1dddc33e0b1489f7467c4c41a1b503e907e198ffc24`
MD5	`04053bc79e3bcb816f584a2919a3ee9d`
BLAKE2b-256	`f8246454546288f4e06988bf440d7a91dff310f3a8e7b11c7ecc3cef2e848096`

Provenance

The following attestation bundles were made for filemetric_engine-1.1.0-py3-none-any.whl:

Publisher: pypi-publish.yml on 0x0pharaoh/filemetric-engine

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: filemetric_engine-1.1.0-py3-none-any.whl
- Subject digest: aba83ec10fc09cdf97d5e1dddc33e0b1489f7467c4c41a1b503e907e198ffc24
- Sigstore transparency entry: 1338965290
- Sigstore integration time: Apr 18, 2026
Source repository:
- Permalink: 0x0pharaoh/filemetric-engine@9e015725a115a16cb0d7b5b22ae8bf785783157b
- Branch / Tag: refs/tags/v1.1.0
- Owner: https://github.com/0x0pharaoh
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi-publish.yml@9e015725a115a16cb0d7b5b22ae8bf785783157b
- Trigger Event: push
