
TLSH (C++ Python extension)


TLSH - C++ extension for Python

TLSH (Trend Micro Locality Sensitive Hash) is a fuzzy matching library. Given a byte stream of at least 50 bytes, TLSH generates a hash value that can be used for similarity comparisons. Similar objects have similar hash values, so they can be detected by comparing their hashes. Note that the byte stream must have a sufficient amount of complexity: for example, a byte stream of identical bytes will not generate a hash value.

What's new in py-tlsh 4.12.1

This Python module supersedes the python-tlsh package on PyPI.

  • Compatibility with recent versions of Python 3.
  • Addresses Issue 125 ("There is a memory leak in py-tlsh") and Issue 150 at https://github.com/trendmicro/tlsh. Thanks to Susmit Yenkar for the memory leak fix.

The improvements in 4.7.2 were:

  • added lvalue / q1ratio / q2ratio / checksum / bucket_value / is_valid

The improvements in 4.5.0 were:

  • fixed this package so that it works on Windows
  • compatibility with VirusTotal adoption of TLSH: updated to the T1 hash format with backwards compatibility for old hashes
  • fixed the q3=0 divide-by-zero bug (issue 79)

Usage

import tlsh

tlsh.hash(data)

Note that data must be bytes, not a string. This is because TLSH is for binary data, and binary data can contain a NULL (zero) byte.
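
Text held in a Python str therefore has to be encoded before it is hashed. A minimal sketch, assuming UTF-8 is the right encoding for your data; the helper name `to_bytes` is hypothetical and not part of py-tlsh:

```python
try:
    import tlsh  # provided by the py-tlsh package
except ImportError:  # lets the sketch load even where py-tlsh is not installed
    tlsh = None


def to_bytes(value, encoding="utf-8"):
    """Return value as bytes; str input is encoded first (hypothetical helper)."""
    if isinstance(value, str):
        return value.encode(encoding)
    return bytes(value)


text = "The quick brown fox jumps over the lazy dog. " * 4
data = to_bytes(text)  # bytes, as required above

if tlsh is not None:
    print(tlsh.hash(data))
```

Input that is already bytes or bytearray passes through unchanged, so the helper can sit in front of every call.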

In default mode, the data must contain at least 50 bytes and have a certain amount of randomness to generate a hash value. To get the hash value of a file, try

tlsh.hash(open(file, 'rb').read())

Note: the open statement has opened the file in binary mode.
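
The requirements above can be roughly checked before hashing. The sketch below is only a pre-check: `MIN_LEN` restates the 50-byte minimum from this page, the "not all bytes identical" test is a simplistic stand-in for the library's own complexity check, and `looks_hashable` and `hash_file` are hypothetical helper names, not part of py-tlsh.

```python
try:
    import tlsh  # provided by the py-tlsh package
except ImportError:
    tlsh = None

MIN_LEN = 50  # default-mode minimum stated above


def looks_hashable(data):
    """Rough pre-check: long enough and not a run of one repeated byte."""
    return len(data) >= MIN_LEN and len(set(data)) > 1


def hash_file(path):
    """Read a file in binary mode and hash it; return None if the pre-check fails."""
    if tlsh is None:
        raise RuntimeError("py-tlsh is not installed")
    with open(path, "rb") as f:  # binary mode; the file is closed on exit
        data = f.read()
    if not looks_hashable(data):
        return None
    return tlsh.hash(data)
```

Passing the pre-check does not guarantee a hash value; the library makes the final decision on whether the data is complex enough.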

Example

import tlsh

# data and similar_data are bytes objects supplied by the caller
h1 = tlsh.hash(data)
h2 = tlsh.hash(similar_data)
score = tlsh.diff(h1, h2)

# hash a file incrementally, 512 bytes at a time
h3 = tlsh.Tlsh()
with open('file', 'rb') as f:
    for buf in iter(lambda: f.read(512), b''):
        h3.update(buf)
h3.final()
# this assertion states that the distance between a TLSH and itself must be zero
assert h3.diff(h3) == 0
score = h3.diff(h1)
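
Because the score is a distance (a hash compared with itself scores zero), finding the most similar known item amounts to taking the minimum score. A minimal sketch; `closest` is a hypothetical helper, not part of py-tlsh, and the distance function is passed in as an argument so that `tlsh.diff` can be supplied by the caller:

```python
def closest(target, candidates, distance):
    """Return (label, score) for the candidate hash nearest to target.

    candidates maps a label to a hash string; distance is a callable
    such as tlsh.diff that returns a non-negative score.
    """
    if not candidates:
        raise ValueError("no candidates to compare against")
    label = min(candidates, key=lambda k: distance(target, candidates[k]))
    return label, distance(target, candidates[label])
```

For instance, `closest(h1, {"similar": h2}, tlsh.diff)` would return the label "similar" together with the same score that `tlsh.diff(h1, h2)` gives.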

Extra Options

The diffxlen function removes the file length component of the tlsh header from the comparison.

tlsh.diffxlen(h1, h2)

If a file with a repeating pattern is compared to a file with only a single instance of the pattern, the difference will be larger if the file length is included. With the diffxlen function, the file length is removed from consideration.
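
The following sketch sets up that situation. The 256-byte pattern and the repeat count are arbitrary choices for illustration, and the pattern is pseudo-random only so that it has enough complexity to hash; no particular scores are claimed.

```python
import random

try:
    import tlsh  # provided by the py-tlsh package
except ImportError:
    tlsh = None

rng = random.Random(0)  # fixed seed so the inputs are reproducible
pattern = bytes(rng.getrandbits(8) for _ in range(256))

single = pattern        # one instance of the pattern
repeated = pattern * 8  # the same pattern, eight times over

if tlsh is not None:
    h_single = tlsh.hash(single)
    h_repeated = tlsh.hash(repeated)
    print("diff:    ", tlsh.diff(h_single, h_repeated))      # includes file length
    print("diffxlen:", tlsh.diffxlen(h_single, h_repeated))  # ignores file length
```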

Backwards Compatibility Options

If you use the "conservative" option, then the data must contain at least 256 bytes. For example,

import os
tlsh.conservativehash(os.urandom(256))

should generate a hash, but

tlsh.conservativehash(os.urandom(100))

will generate TNULL, as the input is shorter than 256 bytes.

If you need to generate old style hashes (without the "T1" prefix) then use

tlsh.oldhash(os.urandom(100))

The old and conservative options may be combined:

tlsh.oldconservativehash(os.urandom(500))
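
When old and new hashes are stored together, the prefix tells them apart. A minimal sketch based only on what is stated above: a failed hash is reported as TNULL, and new-style hashes start with "T1". `hash_style` is a hypothetical helper, not part of py-tlsh, and it does not validate the rest of the digest.

```python
def hash_style(h):
    """Classify a TLSH string as 'null', 'T1' (new style) or 'old'."""
    if not h or h == "TNULL":
        return "null"
    if h.startswith("T1"):
        return "T1"
    return "old"


# Placeholder strings for illustration only; they are not real digests.
print(hash_style("TNULL"))
print(hash_style("T1" + "0" * 70))
print(hash_style("0" * 70))
```

Filtering out the "null" results before calling tlsh.diff keeps failed hashes out of comparisons.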


