This CLI tool compares files or directories with cosine similarity.

Project description

cosSim - See how similar your files are

It is hard to determine how similar two text files are. cosSim keeps things simple: it tokenizes and vectorizes the texts, then calculates a word-level similarity from the resulting vectors.

This is very useful in cases where the context is not important but the spelling has a big impact (as in OCR of PDF files).

This project has been brought to life with the help of the AfZ (Archive of Contemporary History) at ETH Zürich.

Overview

The tool is suited to compare texts that do not depend on context, but rather rely on correct spelling. The output is presented in percent. Some use cases could be:

  • comparing two different OCR outputs to a ground truth

  • comparing handwritten, digitized text with a ground truth

  • checking whether the output of your AI is spelled correctly with respect to your ground truth

So if you want to get a similarity in terms of semantics, this is not the right tool for you.

The CLI tool uses the NLTK library to tokenize the texts, NumPy to store the vector data, and cosine similarity to compare the vectors.
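
To make the idea concrete, here is a minimal sketch of word-count cosine similarity. It is not cosSim's actual code: it uses only the Python standard library, a regular expression stands in for NLTK tokenization, and the function names and the lowercasing step are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumption: a regex word tokenizer with lowercasing stands in for NLTK.
    return re.findall(r"\w+", text.lower())


def cosine_similarity_percent(text_a: str, text_b: str) -> float:
    # Hypothetical helper: build word-count vectors and compare them.
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))
    # Dot product over the words that occur in both texts.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[token] * counts_b[token] for token in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return 100.0 * dot / (norm_a * norm_b)


ground_truth = "the quick brown fox"
ocr_output = "the quick brovvn fox"  # one misread word
score = cosine_similarity_percent(ground_truth, ocr_output)
print(f"{score:.1f}%")  # prints "75.0%"
```

Three of the four words match, so the score is 75%. Two texts with the same meaning but no shared words, such as "car" and "automobile", score 0% in this sketch, which is why the approach says nothing about semantics.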

Guide

The following shows how to get and use cosSim.

Installation

$ pip install cosSim

If you would rather customize the code to your needs, grab a stable version under "Releases".

Usage

The CLI can be used in two ways: it can compare one file or directory to a ground truth, or it can compare two files or directories to a ground truth. The number of files or directories is determined by the positional arguments after the command:

$ cosSim path_to_dir_or_file

or

$ cosSim path_to_dir_or_file another_path

The --dir or --file flag tells the program which kind of parsing you would like to do. To compare two files to the integrated corpus, simply type:

$ cosSim path1 path2 --file

Because the integrated corpus mostly produces an output that represents language similarity (which is not useful in many cases), cosSim accepts your own ground truth via the --base flag:

$ cosSim path1 path2 --file --base path_to_ground_truth
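
As an illustration of what such a comparison does conceptually, the following sketch scores two candidate files against one ground-truth file. It is an assumption-laden stand-in, not the tool's implementation: the helper names (word_counts, similarity_percent, compare_to_base) and the file names are hypothetical, and a regular expression replaces NLTK tokenization.

```python
import math
import re
import tempfile
from collections import Counter
from pathlib import Path


def word_counts(path: Path) -> Counter:
    # Hypothetical helper: regex tokens instead of NLTK tokenization.
    text = path.read_text(encoding="utf-8").lower()
    return Counter(re.findall(r"\w+", text))


def similarity_percent(a: Counter, b: Counter) -> float:
    # Cosine similarity of two word-count vectors, scaled to percent.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 100.0 * dot / norm if norm else 0.0


def compare_to_base(candidates: list[Path], base: Path) -> dict[str, float]:
    # Score every candidate file against the same ground truth.
    base_counts = word_counts(base)
    return {p.name: similarity_percent(word_counts(p), base_counts) for p in candidates}


with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    base = tmp_dir / "ground_truth.txt"
    ocr_a = tmp_dir / "ocr_a.txt"
    ocr_b = tmp_dir / "ocr_b.txt"
    base.write_text("archive of contemporary history", encoding="utf-8")
    ocr_a.write_text("archive of contemporary history", encoding="utf-8")
    ocr_b.write_text("archlve of contemporary hlstory", encoding="utf-8")
    scores = compare_to_base([ocr_a, ocr_b], base)

for name, score in scores.items():
    print(f"{name}: {score:.1f}%")
# prints "ocr_a.txt: 100.0%" and "ocr_b.txt: 50.0%"
```

The second candidate misreads two of the four words, so it shares only half of its vocabulary with the ground truth and scores 50%.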

Regarding language support, cosSim currently provides tokenization and corpora for

  • German
  • English

If needed, more languages will be added in the future. You can specify the language by passing de or en to the --lang flag. If no language is explicitly stated, the program defaults to German.

You can access a help menu within the CLI by adding --help or -h to the end of the line.

Common error messages

Because the program uses the NLTK library, an error may occur that reports a missing installation. To prevent this from happening again, see the NLTK documentation, which covers these rather small problems.

Download files

Download the file for your platform.

Source Distribution

cosSim-0.0.3.tar.gz (6.8 kB)

Uploaded Source

Built Distribution

cosSim-0.0.3-py3-none-any.whl (8.1 kB)

Uploaded Python 3

File details

Details for the file cosSim-0.0.3.tar.gz.

File metadata

  • Download URL: cosSim-0.0.3.tar.gz
  • Upload date:
  • Size: 6.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/4.6.1 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.61.2 CPython/3.9.6

File hashes

Hashes for cosSim-0.0.3.tar.gz
Algorithm Hash digest
SHA256 86763c10677dfa8f10b950aa843b9bde64a74c7ec0eeb5e7cc08771b6f9d211d
MD5 4c3ae4e6b6cef83d25ae125fcaf78a8a
BLAKE2b-256 7cb499656715e7d1268b56d3f3bb4604b96cef3bdf5a35f407221f9edd75b2dd

File details

Details for the file cosSim-0.0.3-py3-none-any.whl.

File metadata

  • Download URL: cosSim-0.0.3-py3-none-any.whl
  • Upload date:
  • Size: 8.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/4.6.1 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.61.2 CPython/3.9.6

File hashes

Hashes for cosSim-0.0.3-py3-none-any.whl
Algorithm Hash digest
SHA256 532e27a727d9a0b518633faf8c45dfa243ae2affa05b62a4d1e947073ee5ec0e
MD5 6967a7ca78951af57754e84f78ebf67e
BLAKE2b-256 cfb6bb4cbb99df267c84fb44c4ac4cf173f89c83f5b579f72fbfefe0b23e7010
