
Python bindings for the TMK video similarity library


SWIG Python Bindings for TMK Video Hashing

TMK video hashing is designed to identify whether two video files are duplicates (possibly in different formats, etc.). It measures the similarity of videos in two stages. First, the cosine similarity of 256-dimensional vectors is used to find possibly close videos (the level-1 score). The documentation and whitepaper suggest a good threshold is 0.7. Then a temporal match kernel (TMK) is used to compute a more accurate similarity measure (the level-2 score). Please see the original documentation at: https://github.com/facebook/ThreatExchange/
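The level-1 comparison described above can be sketched in plain Python. This is a minimal illustration, not code from tmkpy: cosine_similarity, is_level1_candidate and LEVEL1_THRESHOLD are hypothetical names, and the treatment of all-zero vectors is an assumption. With tmkpy, the input vectors would presumably come from getPureAverageFeature(), converted to a list of floats if needed.

```python
import math

# Threshold suggested by the TMK documentation and whitepaper.
LEVEL1_THRESHOLD = 0.7


def cosine_similarity(a, b):
    """Return the cosine similarity of two equal-length numeric sequences."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: an all-zero vector is treated as matching nothing.
        return 0.0
    return dot / (norm_a * norm_b)


def is_level1_candidate(a, b, threshold=LEVEL1_THRESHOLD):
    """True if the two vectors are similar enough to justify a level-2 check."""
    return cosine_similarity(a, b) >= threshold
```

Pairs that pass this cheap check are the ones worth scoring with the more expensive temporal match kernel.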

This repository provides Python bindings to make it easier to use TMK within Python.

Installation

tl;dr

git clone https://github.com/meedan/tmkpy
cd tmkpy
python setup.py build
python setup.py install

Longer version

Ensure swig is installed (sudo apt install swig) and then build this extension.

setup.py will first build the TMK C++ code per the instructions at tmk/cpp/ (namely, by running make in that directory). This produces a static library called libtmk.a.

It will then run swig to generate tmkpy.py and tmkpy_wrap.cpp, and finally compile everything. To install system wide, run python setup.py install.
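Because the build depends on swig and make being available, a quick preflight check before running setup.py can save a failed build. The following is a small sketch using only the standard library; missing_tools is a hypothetical helper and is not part of tmkpy. A C++ compiler is also needed, but its executable name varies by platform, so it is not checked here.

```python
import shutil


def missing_tools(tools=("swig", "make")):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing build tools:", ", ".join(missing))
    else:
        print("Build tools found; try: python setup.py build")
```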

Usage

import tmkpy

#Hash a video
vid=tmkpy.hashVideo("test/chair-19-sd-bar.mp4","/usr/bin/ffmpeg")

#Write the hash to a file (second argument is a string used in error messages to identify the program and can be anything)
vid.writeToOutputFile("output.tmk","anything_here")

#Get the 256-dimensional vector that is used to compute level-1 scores. Level-1 scores are the cosine similarity of these vectors.
l1features=vid.getPureAverageFeature()

#Compute level-2 scores against other tmk files on disk
import glob
haystack=glob.glob("test/*.tmk")
scores=tmkpy.query(vid,haystack,1)
#Pair each filename with its level-2 score
scores=list(zip(haystack,scores))
print(scores)

tmkpy.query(needle,haystack,threads) expects needle to be an actual TMK object or the name of a TMK file. haystack is a list of TMK filenames. The function computes all level-2 scores and returns them in a list with the same length and order as haystack. It does not compute level-1 scores. If a file in haystack is invalid or missing, its score will be -1, so there is no need to filter out invalid or missing filenames beforehand.
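Since the returned scores line up with haystack and use -1 to mark unreadable files, post-processing is straightforward. Here is a minimal sketch of one way to do it; rank_matches is a hypothetical helper, not part of tmkpy, and it works on plain lists so it can be tried without any video files.

```python
def rank_matches(haystack, scores, invalid=-1):
    """Pair filenames with scores, drop unreadable files, best match first."""
    if len(haystack) != len(scores):
        raise ValueError("haystack and scores must have the same length")
    # A score equal to `invalid` marks a file that could not be read.
    pairs = [(name, score) for name, score in zip(haystack, scores)
             if score != invalid]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)
```

In practice this would be called as rank_matches(haystack, tmkpy.query(vid, haystack, 1)).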

Error handling

tmkpy.query generally swallows errors quietly if there is a missing or invalid filename in haystack. The one exception is when you pass a string for the needle and no file with that name exists. In this case, tmkpy.query will throw an invalid_argument exception. This is raised in Python as a RuntimeError and can be handled in the normal Python way. E.g.,

try:
	scores=tmkpy.query("invalid-missing-file.tmk",["hs1","hs2"],1)
except RuntimeError as e:
	print(e)

This produces the following output:

fopen: No such file or directory
tmkpy: could not open "invalid-missing-file.tmk" for read.
tmkpy: failed to read needle "invalid-missing-file.tmk".
Failed to read needle from supplied filename

The first three lines are written to standard error by the C++ code. The last line is the result of print(e) in Python.
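If the needle filename comes from user input, it can be convenient to wrap the call so that a missing needle yields None rather than an exception. The sketch below is one possible approach under stated assumptions: query_checked is a hypothetical helper, and the query function is passed in as a parameter (tmkpy.query in real use) so the wrapper can be exercised without the extension installed.

```python
import os


def query_checked(query_fn, needle, haystack, threads=1):
    """Run query_fn(needle, haystack, threads); return None if the needle
    cannot be read instead of raising RuntimeError."""
    # Checking the path first also avoids the messages the C++ code
    # writes to standard error for a missing needle file.
    if isinstance(needle, str) and not os.path.isfile(needle):
        return None
    try:
        return query_fn(needle, haystack, threads)
    except RuntimeError:
        # Raised, for example, when the needle file exists but is unreadable.
        return None
```

A call would then look like query_checked(tmkpy.query, "needle.tmk", haystack), with the caller testing the result against None before using the scores.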


