
Certstream + Analytics


Installation

The package can be installed from PyPI:

pip install certstream-analytics

Quick usage

bin/domain_matching.py --domains domains.txt --dump-location certstream.txt

# The file domains.txt contains the list of domains that we want to monitor
# for matches (domains with similar names). For example, a file with only
# two entries:
#
# gmail.com
# facebook.com
#
# will match any domain that contains the gmail or facebook keyword.
#
# All the records consumed from certstream will be kept in certstream.txt

API

import time

from certstream_analytics.analysers import WordSegmentation
from certstream_analytics.analysers import IDNADecoder
from certstream_analytics.analysers import HomoglyphsDecoder

from certstream_analytics.transformers import CertstreamTransformer
from certstream_analytics.storages import ElasticsearchStorage
from certstream_analytics.stream import CertstreamAnalytics

done = False

# These analysers will be run in the order listed
analyser = [
    IDNADecoder(),
    HomoglyphsDecoder(),
    WordSegmentation(),
]

# The following fields are filtered and indexed:
# - String: domain
# - List: SAN
# - List: Trust chain
# - Timestamp: Not before
# - Timestamp: Not after
# - Timestamp: Seen
transformer = CertstreamTransformer()

# Index the data in Elasticsearch
storage = ElasticsearchStorage(hosts=['localhost:9200'])

consumer = CertstreamAnalytics(transformer=transformer,
                               storage=storage,
                               analyser=analyser)
# The consumer is run in another thread so this function is non-blocking
consumer.start()

while not done:
    time.sleep(1)

consumer.stop()

IDNA decoder

This analyser decodes IDNA domain names into Unicode for further processing downstream. Normally, it is the very first analyser to run. If the analyser encounters a malformed IDNA domain string, it keeps the domain as it is.

from certstream_analytics.analysers import IDNADecoder

decoder = IDNADecoder()

# Just an example dummy record
record = {
    'all_domains': [
        'xn--f1ahbgpekke1h.xn--p1ai',
    ]
}

# The domain name will now become 'укрэмпужск.рф'
print(decoder.run(record))
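
For comparison, the decoding step on its own can be reproduced with the standalone idna package. This is a minimal sketch for illustration (the example domain below is ours, and the analyser's internal implementation may differ):

# Requires the 'idna' package (pip install idna); illustration only.
import idna

encoded = 'xn--80aswg.xn--p1ai'  # hypothetical example, decodes to 'сайт.рф'

try:
    print(idna.decode(encoded))
except idna.IDNAError:
    # Mirror the analyser's behaviour: keep a malformed domain as it is
    print(encoded)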

Homoglyphs decoder

There are lots of phishing websites that use homoglyphs to lure their victims. Common examples include 'l' and 'i', or 'p' and the Unicode character rho ('𝞀'). The homoglyphs decoder uses the excellent confusable_homoglyphs package to generate all potential alternative domain names in ASCII.

from certstream_analytics.analysers import HomoglyphsDecoder

# If the greedy flag is set, all alternative domains will be returned
decoder = HomoglyphsDecoder(greed=False)

# Just an example dummy record
record = {
    'all_domains': [
        # MATHEMATICAL MONOSPACE SMALL P
        '*.𝗉aypal.com',

        # MATHEMATICAL SANS-SERIF BOLD SMALL RHO
        '*.𝗉ay𝞀al.com',
    ]
}

# Both domain names will now be converted to '*.paypal.com' with the ASCII
# character 'p'
print(decoder.run(record))
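
The lookup that powers this can be explored directly with the confusable_homoglyphs package. A rough sketch, for illustration only (the decoder's own substitution logic is more involved):

# Requires confusable_homoglyphs (pip install confusable_homoglyphs).
from confusable_homoglyphs import confusables

suspicious = '𝗉aypal'  # leading character is MATHEMATICAL MONOSPACE SMALL P

# Returns False when nothing is confusable, otherwise a list describing
# each confusable character and its look-alikes (e.g. the ASCII 'p')
print(confusables.is_confusable(suspicious, greedy=True,
                                preferred_aliases=['latin']))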

Aho-Corasick

A domain and its SANs from Certstream will be compared against a list of the most popular domains (from OpenDNS) using the Aho-Corasick algorithm. This is a simple check to catch some of the most obvious phishing domains. For example, www.facebook.com.msg40.site will match facebook because facebook is in the above list of most popular domains (I wonder how long that is going to last).

from certstream_analytics.analysers import AhoCorasickDomainMatching
from certstream_analytics.reporter import FileReporter

# Print the list of matching domains
reporter = FileReporter('matching-results.txt')

with open('opendns-top-domains.txt') as fhandle:
    domains = [line.rstrip() for line in fhandle]

# The list of domains to match against
domain_matching_analyser = AhoCorasickDomainMatching(domains)

consumer = CertstreamAnalytics(transformer=transformer,
                               analyser=domain_matching_analyser,
                               reporter=reporter)

# Need to think about what to do with the matching result
consumer.start()

while not done:
    time.sleep(1)

consumer.stop()
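
Under the hood this is plain multi-pattern keyword search. Here is a minimal sketch of the same idea with the pyahocorasick package, for illustration only (not the analyser's actual implementation):

# Requires pyahocorasick (pip install pyahocorasick).
import ahocorasick

automaton = ahocorasick.Automaton()
for keyword in ['facebook', 'gmail']:
    automaton.add_word(keyword, keyword)
automaton.make_automaton()

domain = 'www.facebook.com.msg40.site'

# iter() yields (end_index, value) for every keyword found in the domain
print([keyword for _, keyword in automaton.iter(domain)])  # ['facebook']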

Word segmentation

In order to improve the accuracy of the matching algorithm, we segment the domains into English words using wordsegment.

from certstream_analytics.analysers import WordSegmentation

wordsegmentation = WordSegmentation()

# Just an example dummy record
record = {
    'all_domains': [
        'login-appleid.apple.com.managesupport.co',
    ]
}

# The returned output is as follows:
#
# {
#   'analyser': 'WordSegmentation',
#   'output': {
#     'login-appleid.apple.com.managesupport.co': [
#       'login',
#       'apple',
#       'id',
#       'apple',
#       'com',
#       'manage',
#       'support',
#       'co'
#     ],
#   },
# }
print(wordsegmentation.run(record))
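
The segmentation itself comes from the wordsegment package, which can be tried in isolation. A quick sketch (illustrative; the analyser adds its own handling of the dots and hyphens in a domain):

# Requires wordsegment (pip install wordsegment).
from wordsegment import load, segment

load()  # load the corpus data used for segmentation

print(segment('loginappleid'))   # ['login', 'apple', 'id']
print(segment('managesupport'))  # ['manage', 'support']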

Features generator

A list of features for each domain will also be generated so that they can be used for classification jobs further downstream (a short illustrative sketch follows the list). The list includes:

  • The number of dot-separated fields in the domain, for example, www.google.com has 3.
  • The overall length of the domain in characters.
  • The length of the longest dot-separated field.
  • The length of the TLD, e.g. .online (6) or .download (8) is longer than .com (3).
  • The randomness level of the domain. The Nostril package is used to check how many of the words returned by the WordSegmentation analyser are nonsense.
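
A minimal sketch of how these features could be computed for a single domain. The function name and the nonsense check below are hypothetical and only for illustration; the package's actual implementation may differ:

def extract_features(domain, words, is_nonsense=lambda word: False):
    """Hypothetical helper illustrating the features listed above."""
    fields = domain.split('.')
    return {
        'num_fields': len(fields),                     # www.google.com -> 3
        'domain_length': len(domain),                  # overall length in characters
        'longest_field': max(len(f) for f in fields),  # longest dot-separated field
        'tld_length': len(fields[-1]),                 # .com -> 3, .online -> 6
        # How many of the segmented words look random or nonsensical
        'num_nonsense_words': sum(1 for word in words if is_nonsense(word)),
    }

# 'words' would come from the WordSegmentation analyser
print(extract_features('login-appleid.apple.com.managesupport.co',
                       ['login', 'apple', 'id', 'apple', 'com',
                        'manage', 'support', 'co']))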
