findlike is a package to retrieve similar documents

findlike

findlike is a command-line tool written in Python that lists the files most similar to a reference file or an ad-hoc query. The tool is highly configurable and can be used as a backend for other programs (e.g. personal knowledge management systems, Emacs, etc.).

Features:

  • Choose between BM25 and TF-IDF + cosine similarity for similarity calculation
  • Recursive search option
  • Control over parameters such as the maximum number of results and whether to display similarity scores
  • Optionally return results in JSON format
  • Multilingual support
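To make the TF-IDF option in the list above more concrete, here is a minimal sketch of ranking documents by TF-IDF weights and cosine similarity, using only the standard library. It is an illustration of the general technique, not findlike's actual implementation: the function names (`tokenize`, `tfidf_vectors`, `cosine_similarity`, `rank`), the simple regex tokenizer, and the smoothed IDF formula are all assumptions made for this example, and the stemming and stopword removal implied by the --language option are left out.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (no stemming)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF, so a term found in every document keeps a positive weight.
        vectors.append(
            {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in tf.items()}
        )
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(reference, candidates):
    """Return (score, candidate) pairs, most similar to the reference first."""
    vectors = tfidf_vectors([reference] + candidates)
    ref_vec = vectors[0]
    scored = [
        (cosine_similarity(ref_vec, vec), cand)
        for vec, cand in zip(vectors[1:], candidates)
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Calling `rank("black holes and light", [...])` returns the candidates ordered by score; a candidate that shares no terms with the reference scores 0.0.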

Prerequisites

  • Python 3.8 or higher
  • Additional dependencies as listed in the requirements.txt file

Installation

Using pip (single user)

To install findlike for your user only, run the following command in your terminal:

pip install --user findlike

Using pip and virtual environments

To install findlike in a new virtual environment instead, first create and activate the environment:

python -m venv <virtual environment directory>
source <virtual environment directory>/bin/activate

Then run pip install findlike (without the --user flag).

Manual installation from source

Lastly, if you prefer to install findlike from this repository instead of fetching the package from PyPI, run:

# Clone this repository
git clone https://github.com/brunoarine/findlike.git

# Navigate into the findlike directory
cd findlike

# Install it as a Python package using pip
pip install -e .

Optionally, you can create an alias for the findlike command to be accessible without activating its virtual environment:

# Use ~/.zshrc instead of ~/.bashrc if your shell is zsh.
echo "alias findlike='/path/to/findlike/venv/bin/findlike'" >> ~/.bashrc
source ~/.bashrc

Usage

Here is the basic usage of findlike:

findlike [OPTIONS] [REFERENCE_FILE]

findlike works with either a reference file or a --query option. Once the reference text is set, findlike scans a given directory (by default, the current working directory) and returns the documents most similar to the reference.
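As an illustration of how an ad-hoc query can be scored against a set of documents, the sketch below implements the standard Okapi BM25 formula in plain Python. It is not taken from findlike's source: the function names, the regex tokenizer, and the parameter values `k1=1.5` and `b=0.75` are assumptions chosen for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (no stemming)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one Okapi BM25 score per document for the given query."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs if n_docs else 0.0
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = tokenize(query)

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            # The "+ 1" inside the log keeps the IDF non-negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Length normalization: longer-than-average documents are penalized.
            denom = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores
```

Sorting the documents by these scores in descending order gives a ranking comparable in spirit to what the bm25 algorithm choice produces, although the exact numbers depend on tokenization and parameter choices.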

Options

Here is a breakdown of the options available in findlike:

  --version                     Show the version and exit.
  -q, --query TEXT              query option if no reference file is provided
  -d, --directory PATH          directory to scan for similar files  [default:
                                (current directory)]
  -f, --filename-pattern TEXT   filename pattern matching  [default: *.*]
  -R, --recursive               recursive search
  -a, --algorithm [bm25|tfidf]  text similarity algorithm  [default: tfidf]
  -l, --language TEXT           stemmer and stopwords language  [default:
                                english]
  -c, --min-chars INTEGER       minimum document size (in number of
                                characters) to be considered  [default: 1]
  -A, --absolute-paths          show absolute rather than relative paths
  -m, --max-results INTEGER     maximum number of results  [default: 10]
  -p, --prefix TEXT             result lines prefix
  -s, --show-scores             show similarity scores
  -h, --hide-reference          remove REFERENCE_FILE from results
  -H, --heading TEXT            results list heading
  -F, --format [plain|json]     output format  [default: plain]
  -t, --threshold FLOAT         minimum score for a result to be shown
                                [default: 0.0]
  --help                        Show this message and exit.
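The way the --directory, --filename-pattern, --recursive and --min-chars options narrow down the candidate files can be sketched with `pathlib`. This is only a plausible reading of the option descriptions above, not findlike's code: the helper name `collect_candidates`, the UTF-8 decoding, and the choice to skip unreadable files are assumptions.

```python
from pathlib import Path


def collect_candidates(directory=".", pattern="*.*", recursive=False, min_chars=1):
    """Return files under `directory` matching `pattern` with enough characters."""
    root = Path(directory)
    # rglob descends into subdirectories; glob stays at the top level.
    paths = root.rglob(pattern) if recursive else root.glob(pattern)

    selected = []
    for path in sorted(paths):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            # Assumption: binary or unreadable files are silently skipped.
            continue
        if len(text) >= min_chars:
            selected.append(path)
    return selected
```

For example, `collect_candidates("/path/to/directory", "*.md", recursive=True)` would gather the same kind of file set that `-R -d /path/to/directory -f "*.md"` describes.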

Examples

To find similar documents in a directory (recursively):

findlike -R -d /path/to/directory reference_file.md

To search files using a query instead of a reference file while filtering by extension:

findlike -q "black holes" -d /path/to/ayreon/lyrics -f "*.txt"

To show similarity scores and filenames in JSON format:

findlike -s -F json reference_file.md
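Because the JSON output is meant for consumption by other programs, a caller written in Python might wrap the command roughly as follows. This sketch assumes findlike is installed and on the PATH; the helper names `build_command` and `find_similar` are hypothetical, and since the JSON schema is not described in this README, the parsed value is returned without any assumption about its fields.

```python
import json
import subprocess


def build_command(reference_file, directory=None, recursive=False, show_scores=True):
    """Assemble the findlike argument list using flags from the options table."""
    cmd = ["findlike", "-F", "json"]
    if show_scores:
        cmd.append("-s")
    if recursive:
        cmd.append("-R")
    if directory is not None:
        cmd += ["-d", str(directory)]
    cmd.append(str(reference_file))
    return cmd


def find_similar(reference_file, **options):
    """Run findlike and return its JSON output parsed into Python objects."""
    completed = subprocess.run(
        build_command(reference_file, **options),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    # The structure of the result is whatever findlike emits; inspect it first.
    return json.loads(completed.stdout)
```

A call such as `find_similar("reference_file.md", directory="/path/to/directory", recursive=True)` would then return the parsed results for further processing.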

To print the results table as a Markdown list:

findlike -H "# List of similar documents" -p "- " reference_file.txt

Development

To contribute to this tool, first check out the code. Then create a new virtual environment:

cd findlike
python -m venv venv
source venv/bin/activate

Now install the development dependencies:

pip install -e '.[dev]'

To run the tests:

pytest

License

This project is licensed under the terms of the MIT license. See LICENSE for more details.

