findlike is a package to retrieve similar documents

findlike

findlike is a command-line tool written in Python that retrieves a list of files similar to a reference file or an ad-hoc query. The tool is highly configurable and can be used as a backend for other programs (e.g. personal knowledge management systems, Emacs, etc.)

Features:

  • Choose between BM25 and TF-IDF + cosine distance for similarity calculation
  • Recursive search option
  • Control over parameters like maximum number of results, whether to display similarity scores etc.
  • Optionally return results in JSON format
  • Multilingual support
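To make the "TF-IDF + cosine" option more concrete, here is a minimal pure-Python sketch of the general technique. It is not findlike's actual implementation: the function names (`tfidf_vectors`, `cosine_similarity`, `rank`) are hypothetical, tokenization is a plain whitespace split, and stop-word filtering and stemming are left out.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF, so terms found in every document keep a positive weight.
        vectors.append(
            {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in tf.items()}
        )
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(reference, candidates):
    """Return (score, candidate_index) pairs, most similar first."""
    vectors = tfidf_vectors([reference] + candidates)
    ref_vec, cand_vecs = vectors[0], vectors[1:]
    scored = [(cosine_similarity(ref_vec, v), i) for i, v in enumerate(cand_vecs)]
    return sorted(scored, reverse=True)
```

Calling `rank("black holes in space", ["black holes and stars", "cooking pasta recipes"])` places the first candidate on top, because it shares terms with the reference, while the second candidate scores zero.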

Prerequisites

  • Python 3.8 or higher
  • Additional dependencies as listed in the requirements.txt file

Installation

Using pip (single user)

To install findlike for your user only, run the following command in your terminal:

pip install --user findlike

Using pip and virtual environments

Or, if you wish to install findlike in a new virtual environment, first create and activate the environment:

python -m venv <virtual environment directory>
source <virtual environment directory>/bin/activate

Then run pip install findlike (without the --user flag).

Manual installation from source

Lastly, if you prefer to install findlike from this repository instead of fetching the package from PyPI:

# Clone this repository
git clone https://github.com/brunoarine/findlike.git

# Navigate into the findlike directory
cd findlike

# Install it as a Python package using `pip`:

pip install -e .

Optionally, you can create an alias for the findlike command to be accessible without activating its virtual environment:

# Replace .bashrc with .zshrc depending on your shell environment.
echo "alias findlike='/path/to/findlike/venv/bin/findlike'" >> ~/.bashrc
source ~/.bashrc

Usage

Here is the basic usage of findlike:

findlike [OPTIONS] [REFERENCE_FILE]

findlike works with either a reference file or a --query option. Once the reference text is set, findlike scans a given directory (the current working directory by default) and returns the documents most similar to the reference.

Options

Here's the breakdown of the available options in findlike:

--help

Displays a short summary of the available options.

-d, --directory PATH

Specifies the directory to scan. Default is the current working directory. Example:

findlike -d /path/to/another/directory

-q, --query TEXT

Passes an ad-hoc query to the program, so that no reference file is required. Useful when you want to quickly find documents by an overall theme. Example:

findlike -q "earthquakes"

-f, --file-pattern

Specifies the file pattern to use when scanning the directories for similar files. The pattern uses glob convention, and should be passed with single or double quotes, otherwise your shell environment will likely try to expand it. Default is common plain-text file extensions (the full list can be seen here).

findlike -f "*.md" reference_file.txt

-R, --recursive

If used, this option makes findlike scan the directory and all of its sub-directories. Example:

findlike reference_file.txt -R
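The difference between a flat scan and a recursive one, combined with a glob file pattern, can be sketched with Python's standard `pathlib`. This is only an illustration of the behavior described above; `collect_files` is a hypothetical helper, and findlike's own file discovery may be implemented differently.

```python
from pathlib import Path


def collect_files(directory, pattern="*.md", recursive=False):
    """Return the files under `directory` whose names match the glob `pattern`."""
    root = Path(directory)
    # rglob() descends into sub-directories; glob() stays at the top level.
    paths = root.rglob(pattern) if recursive else root.glob(pattern)
    return sorted(p for p in paths if p.is_file())
```

With `recursive=False`, only files directly inside `directory` are returned; with `recursive=True`, matching files in nested directories are included as well.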

-l, --language TEXT

Sets the language used for stop-word filtering and word stemming. The following languages are supported: Arabic, Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish and Swedish. Default is English. Example:

findlike reference_file.txt -l "portuguese"

-c, --min-chars INTEGER

Minimum document size (in number of characters) to be included in the corpus. Default is 1. Example:

findlike reference_file.txt -c 50

-A, --absolute-paths

Show the absolute path of each result instead of relative paths. Example:

findlike reference_file.txt -A

-m, --max-results INTEGER

Number of items to show in the final results. Default is 10.

findlike reference_file.txt -m 5

-p, --prefix TEXT

String to prepend to each entry in the final results. You can set it to "* " or "- " to turn the results into a Markdown or Org-mode list. Default is "", so that no prefix is shown. Example:

findlike reference_file.txt -p "- "

-h, --hide-reference

Removes the first result from the scores list. Useful if the reference file is in the scanned directory and you don't want to see it at the top of the results. This option has no effect if the --query option is used. Example:

findlike reference_file.txt -h

-H, --heading TEXT

Text to show as the list heading. Default is "", so no heading title is shown. Example:

findlike reference_file.txt -H "## Similar files"

-F, --format [plain|json]

This option sets the output format. plain will print the results as a simple list, one entry per line. json will print the results as a valid JSON list with score and target as keys for each entry. Default is "plain". Example:

findlike reference_file.txt -F json
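When findlike serves as a backend for another program, the JSON output can be consumed along these lines. This is a sketch under assumptions: `run_findlike` and `parse_results` are hypothetical helper names, the sample string is made up for illustration, and only the `score` and `target` keys described above are relied upon. Scores are passed through `float()` in case they are not plain numbers.

```python
import json
import subprocess


def run_findlike(reference_file, directory="."):
    """Run findlike with JSON output and return its raw standard output."""
    completed = subprocess.run(
        ["findlike", "-F", "json", "-d", directory, reference_file],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if findlike exits with an error
    )
    return completed.stdout


def parse_results(raw_json, min_score=0.0):
    """Turn a JSON list of {"score", "target"} entries into (target, score) tuples."""
    entries = json.loads(raw_json)
    results = [(entry["target"], float(entry["score"])) for entry in entries]
    return [item for item in results if item[1] >= min_score]


# Made-up output for illustration only; real scores and paths will differ.
sample = '[{"score": 0.42, "target": "notes/a.md"}, {"score": 0.1, "target": "notes/b.md"}]'
print(parse_results(sample, min_score=0.2))
```

In a real integration, `parse_results(run_findlike("reference_file.txt"))` would replace the hard-coded sample.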

-t, --threshold FLOAT

Similarity score threshold. All results whose scores are below the given threshold are omitted. Default is 0.05. Set it to 0 if you wish to show all results. Example:

findlike reference_file.txt -t 0
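The combined effect of --threshold and --max-results can be pictured as a filter followed by a cut-off. The sketch below uses the documented defaults (0.05 and 10); `select_results` is a hypothetical name, and the order in which findlike applies the two steps internally is an assumption.

```python
def select_results(scored, threshold=0.05, max_results=10):
    """Keep (name, score) pairs at or above `threshold`, best first, capped at `max_results`."""
    # Assumption: results below the threshold are dropped before the cap is applied.
    kept = [(name, score) for name, score in scored if score >= threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:max_results]
```

For example, with scores of 0.9, 0.03, 0.4 and 0.05, the default threshold drops only the 0.03 entry, while `threshold=0` keeps all four.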

More Examples

To find similar documents in a directory (recursively):

findlike -R -d /path/to/directory reference_file.md 

To search files using a query instead of a reference file while filtering by extension:

findlike -q "black holes" -d /path/to/ayreon/lyrics -f "*.txt"

To show similarity scores and filenames in JSON format:

findlike -s -F json reference_file.md

To print the results table as a Markdown list:

findlike -H "# List of similar documents" -p "- " reference_file.txt
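How the heading and prefix options shape the plain output can be sketched as follows. `format_plain` is a hypothetical function, and details such as whether findlike prints a blank line after the heading are assumptions.

```python
def format_plain(targets, prefix="", heading=""):
    """Render result paths one per line, with an optional heading and per-line prefix."""
    lines = [heading] if heading else []
    lines.extend(f"{prefix}{target}" for target in targets)
    return "\n".join(lines)


print(format_plain(["a.txt", "b.txt"], prefix="- ", heading="# List of similar documents"))
```

With an empty prefix and heading, the function returns the bare list of paths, one per line.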

Development

To contribute to this tool, first check out the code. Then create a new virtual environment:

cd findlike
python -m venv venv
source venv/bin/activate

Now install the development dependencies:

pip install -e '.[dev]'

To run the tests:

pytest

License

This project is licensed under the terms of the MIT license. See LICENSE for more details.

Acknowledgements

  • Simon Willison for being an inspiration on releasing small but useful tools more often.
  • Sindre Sorhus for the comprehensive list of plain-text file extensions.
