findlike is a package to retrieve similar documents

findlike

findlike is a command-line tool that finds documents similar to a reference file or to an ad-hoc query. It is written in Python and built on well-known, performance-optimized libraries.

Features:

  • Choose between BM25 and TF-IDF with cosine similarity for ranking
  • Recursive search option
  • Control over output format, document size to consider, maximum results to show, etc.
  • Multilingual support
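
To give a feel for the TF-IDF option listed above, here is a minimal, standard-library sketch of TF-IDF weighting followed by cosine similarity. It is an illustration only, not findlike's actual code: the function names (`tokenize`, `tfidf_vectors`, `cosine_similarity`, `rank_by_similarity`) and the smoothed IDF formula are assumptions, and it skips the stemming and stopword removal that the language option suggests the real tool performs.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a term -> weight dict) per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF (an assumption) keeps every weight positive.
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(reference, candidates):
    """Return (candidate index, score) pairs, most similar first."""
    vectors = tfidf_vectors([reference] + candidates)
    ref_vec, cand_vecs = vectors[0], vectors[1:]
    scores = [(i, cosine_similarity(ref_vec, v)) for i, v in enumerate(cand_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

Calling `rank_by_similarity(reference_text, list_of_texts)` returns the candidates ordered from most to least similar. A candidate that shares no terms with the reference scores exactly 0.0.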

Getting Started

These instructions will guide you through the process of installing and using findlike on your local machine.

Prerequisites

  • Python 3.7 or higher
  • Additional dependencies as listed in the requirements.txt file

Installation

Install findlike from PyPI with pip:

pip install --user findlike

If you prefer to download the repository instead:

# Clone this repository
git clone https://github.com/brunoarine/findlike.git

# Navigate into the findlike directory
cd findlike

# Install the required dependencies
pip install -r requirements.txt

# Add an alias for the findlike command (Optional)
echo "alias findlike='python /path/to/findlike/main.py'" >> ~/.bashrc
source ~/.bashrc

Usage

Here is the basic usage of findlike:

findlike [OPTIONS] [REFERENCE_FILE]

findlike scans a given directory and returns the documents most similar to either a reference file or a query passed with the --query option.

Options

Here is a breakdown of the options available in findlike:

  --version                     Show the version and exit.
  -q, --query TEXT              query option if no reference file is provided
  -d, --directory PATH          directory to scan for similar files  [default:
                                (current directory)]
  -f, --filename-pattern TEXT   filename pattern matching  [default: *.*]
  -R, --recursive               recursive search
  -a, --algorithm [bm25|tfidf]  text similarity algorithm  [default: tfidf]
  -l, --language TEXT           stemmer and stopwords language  [default:
                                english]
  -c, --min-chars INTEGER       minimum document size (in number of
                                characters) to be considered  [default: 1]
  -A, --absolute-paths          show absolute rather than relative paths
  -m, --max-results INTEGER     maximum number of results  [default: 10]
  -p, --prefix TEXT             result lines prefix
  -s, --show-scores             show similarity scores
  -h, --hide-reference          remove REFERENCE_FILE from results
  -H, --heading TEXT            results list heading
  -F, --format [plain|json]     output format  [default: plain]
  -t, --threshold FLOAT         minimum score for a result to be shown
                                [default: 0.0]
  --help                        Show this message and exit.
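
The -a option above switches between tfidf and bm25. As a rough illustration of the second choice, the following standard-library sketch scores documents against a query with the common Okapi BM25 formula. It is not taken from findlike's source: the function names, the tokenizer, and the parameter defaults `k1=1.5` and `b=0.75` are assumptions chosen because they are conventional.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one Okapi BM25 score per document for the given query."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs if n_docs else 0.0
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = set(tokenize(query))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # The "+ 1" inside the log keeps the IDF non-negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Longer-than-average documents are penalized through b.
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores
```

Unlike cosine similarity, BM25 scores are not bounded by 1, so a --threshold value that suits one algorithm will probably need adjusting for the other.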

Examples

To find similar documents in a directory (recursively):

findlike -R -d /path/to/directory reference_file.md 
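
The scanning step behind a command like this can be pictured with the sketch below. It assumes (without having checked findlike's source) that -d, -f, -R and -c map onto a directory, a glob pattern, a recursion switch and a minimum character count. `collect_candidates` is a hypothetical helper, not part of findlike.

```python
from pathlib import Path


def collect_candidates(directory, pattern="*.*", recursive=False, min_chars=1):
    """Return files under `directory` that match `pattern` and are long enough."""
    root = Path(directory)
    # rglob descends into subdirectories; glob looks at the top level only.
    matches = root.rglob(pattern) if recursive else root.glob(pattern)
    candidates = []
    for path in sorted(matches):
        if not path.is_file():
            continue  # "*.*" can also match directory names containing a dot
        text = path.read_text(encoding="utf-8", errors="ignore")
        if len(text) >= min_chars:
            candidates.append(path)
    return candidates
```

With `recursive=False` only the top level of the directory is inspected, which mirrors running the command without -R.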

To search files using a query instead of a reference file while filtering by extension:

findlike -q "black holes" -d /path/to/ayreon/lyrics -f "*.txt"

To show similarity scores and filenames in JSON format:

findlike -s -F json reference_file.md
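
JSON output makes findlike easy to call from other programs. The sketch below builds a command line from the long options documented above and parses the result with `json.loads`. The helper names `build_command` and `run_findlike` are hypothetical. Because the structure of the JSON is not described here, the code makes no assumption about its keys and simply returns whatever was parsed.

```python
import json
import subprocess


def build_command(reference_file=None, query=None, directory=None,
                  recursive=False, show_scores=False, max_results=None):
    """Assemble a findlike argument list that requests JSON output."""
    cmd = ["findlike", "--format", "json"]
    if query is not None:
        cmd += ["--query", query]
    if directory is not None:
        cmd += ["--directory", str(directory)]
    if recursive:
        cmd.append("--recursive")
    if show_scores:
        cmd.append("--show-scores")
    if max_results is not None:
        cmd += ["--max-results", str(max_results)]
    if reference_file is not None:
        cmd.append(str(reference_file))
    return cmd


def run_findlike(**options):
    """Run findlike (it must be on PATH) and return its parsed JSON output."""
    completed = subprocess.run(
        build_command(**options), capture_output=True, text=True, check=True
    )
    return json.loads(completed.stdout)
```

For example, `run_findlike(reference_file="reference_file.md", show_scores=True)` corresponds to the command shown above. Passing the arguments as a list avoids shell quoting problems with queries that contain spaces.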

To print the results as a Markdown list under a heading:

findlike -H "# List of similar documents" -p "- " reference_file.txt

License

This project is licensed under the terms of the MIT license. See LICENSE for more details.
