certstream + analytics

These details have not been verified by PyPI

Project links

Homepage

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

Certstream + Analytics

Installation

The package can be installed from PyPI

pip install certstream-analytics

Quick usage

bin/domain_matching.py --domains domains.txt --dump-location certstream.txt

# The file domains.txt contains the list of domains that we want to monitor
# for matches (domains with similar names). For example, a file with only
# two entries:
#
# gmail.com
# facebook.com
#
# will match any domain that contains the keyword gmail or facebook.
#
# All the records consumed from certstream will be kept in certstream.txt

API

import time

from certstream_analytics.analysers import WordSegmentation
from certstream_analytics.analysers import IDNADecoder
from certstream_analytics.analysers import HomoglyphsDecoder

from certstream_analytics.transformers import CertstreamTransformer
from certstream_analytics.storages import ElasticsearchStorage
from certstream_analytics.stream import CertstreamAnalytics

done = False

# The analysers are run in the order in which they are listed
analyser = [
    IDNADecoder(),
    HomoglyphsDecoder(),
    WordSegmentation(),
]

# The following fields are filtered out and indexed:
# - String: domain
# - List: SAN
# - List: Trust chain
# - Timestamp: Not before
# - Timestamp: Not after
# - Timestamp: Seen
transformer = CertstreamTransformer()

# Index the data in Elasticsearch
storage = ElasticsearchStorage(hosts=['localhost:9200'])

consumer = CertstreamAnalytics(transformer=transformer,
                               storage=storage,
                               analyser=analyser)
# The consumer runs in another thread, so start() is non-blocking
consumer.start()

while not done:
    time.sleep(1)

consumer.stop()
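
In the snippet above, done is never set to True, so the loop runs until the process is killed. Below is a minimal sketch of one way to stop cleanly. It assumes only that the consumer exposes the start() and stop() methods shown above. The names run_until_stopped, stop_on_sigint and FakeConsumer are hypothetical helpers for illustration and are not part of certstream-analytics.

```python
import signal
import threading


def run_until_stopped(consumer, stop_event, poll_interval=1.0):
    """Start the consumer, wait until stop_event is set, then stop it."""
    consumer.start()
    try:
        # Event.wait() returns True as soon as the event is set
        while not stop_event.wait(poll_interval):
            pass
    finally:
        consumer.stop()


def stop_on_sigint(stop_event):
    """Make Ctrl-C set the event (must be called from the main thread)."""
    signal.signal(signal.SIGINT, lambda signum, frame: stop_event.set())


class FakeConsumer:
    """Hypothetical stand-in with the same start()/stop() interface."""

    def __init__(self):
        self.calls = []

    def start(self):
        self.calls.append('start')

    def stop(self):
        self.calls.append('stop')
```

In a real script, one would create a threading.Event, pass it to stop_on_sigint, and then call run_until_stopped with the CertstreamAnalytics consumer in place of FakeConsumer.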

IDNA decoder

This analyser decodes IDNA domain names into Unicode for further processing downstream. Normally, it is the very first analyser to be run. If the analyser encounters a malformed IDNA domain string, it keeps the domain as it is.

from certstream_analytics.analysers import IDNADecoder

decoder = IDNADecoder()

# Just an example dummy record
record = {
    'all_domains': [
        'xn--f1ahbgpekke1h.xn--p1ai',
    ]
}

# The domain name will now become 'укрэмпужск.рф'
print(decoder.run(record))
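
For readers who want to see the decoding step in isolation, here is a minimal sketch using Python's built-in idna codec (which implements IDNA 2003). It is not the library's implementation. The function name decode_idna is hypothetical, and the fallback mirrors the "keep the domain as it is" behaviour described above.

```python
def decode_idna(domain):
    """Return the Unicode form of an IDNA domain.

    The input is returned unchanged when it cannot be decoded.
    """
    try:
        return domain.encode('ascii').decode('idna')
    except UnicodeError:
        # Raised for non-ASCII input and for malformed Punycode labels
        return domain


# The Punycode label xn--p1ai is the Cyrillic TLD .рф
print(decode_idna('xn--p1ai'))
```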

Homoglyphs decoder

Many phishing websites use homoglyphs to lure victims. Common examples include 'l' and 'i', or the Unicode character RHO '𝞀' and 'p'. The homoglyphs decoder uses the excellent confusable_homoglyphs package to generate all potential alternative domain names in ASCII.

from certstream_analytics.analysers import HomoglyphsDecoder

# If the greedy flag is set, all alternative domains will be returned
decoder = HomoglyphsDecoder(greedy=False)

# Just an example dummy record
record = {
    'all_domains': [
        # MATHEMATICAL MONOSPACE SMALL P
        '*.𝗉aypal.com',

        # MATHEMATICAL SANS-SERIF BOLD SMALL RHO
        '*.𝗉ay𝞀al.com',
    ]
}

# The domain name will now be converted to '*.paypal.com' with the ASCII
# character p
print(decoder.run(record))
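
To illustrate the idea without the confusable_homoglyphs dependency, the sketch below swaps in a simpler technique: Unicode NFKC normalization followed by a small hand-written look-alike table. This is not how HomoglyphsDecoder works internally. The LOOKALIKES table and the to_ascii_lookalike name are assumptions made for this example, and a real confusables table is far larger.

```python
import unicodedata

# Hand-picked, illustrative mapping of look-alike characters to ASCII
LOOKALIKES = {
    '\u03c1': 'p',  # GREEK SMALL LETTER RHO
    '\u0440': 'p',  # CYRILLIC SMALL LETTER ER
    '\u0430': 'a',  # CYRILLIC SMALL LETTER A
    '\u043e': 'o',  # CYRILLIC SMALL LETTER O
}


def to_ascii_lookalike(domain):
    """Fold styled and look-alike characters in a domain to ASCII."""
    # NFKC maps mathematical alphanumerics to their plain letters:
    # MATHEMATICAL SANS-SERIF SMALL P becomes 'p', and
    # MATHEMATICAL SANS-SERIF BOLD SMALL RHO becomes Greek rho.
    folded = unicodedata.normalize('NFKC', domain)
    return ''.join(LOOKALIKES.get(char, char) for char in folded)


styled_p = unicodedata.lookup('MATHEMATICAL SANS-SERIF SMALL P')
styled_rho = unicodedata.lookup('MATHEMATICAL SANS-SERIF BOLD SMALL RHO')
print(to_ascii_lookalike('*.' + styled_p + 'ay' + styled_rho + 'al.com'))
```

Unlike the greedy mode described above, this sketch returns a single candidate per domain rather than every possible alternative.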

Aho-Corasick

A domain and its SANs from Certstream are compared against a list of the most popular domains (from OpenDNS) using the Aho-Corasick algorithm. This is a simple check to catch some of the most obvious phishing domains. For example, www.facebook.com.msg40.site will match facebook because facebook is in the above list of most popular domains (I wonder how long it is going to last).

import time

from certstream_analytics.analysers import AhoCorasickDomainMatching
from certstream_analytics.reporter import FileReporter
from certstream_analytics.transformers import CertstreamTransformer
from certstream_analytics.stream import CertstreamAnalytics

done = False

transformer = CertstreamTransformer()

# Report the list of matching domains to a file
reporter = FileReporter('matching-results.txt')

# The file contains one domain per line
with open('opendns-top-domains.txt') as fhandle:
    domains = [line.rstrip() for line in fhandle]

# The list of domains to match against
domain_matching_analyser = AhoCorasickDomainMatching(domains)

consumer = CertstreamAnalytics(transformer=transformer,
                               analyser=domain_matching_analyser,
                               reporter=reporter)

# Need to think about what to do with the matching result
consumer.start()

while not done:
    time.sleep(1)

consumer.stop()
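
The matching step itself can be sketched in pure Python. The code below is a compact, textbook-style Aho-Corasick automaton, not the implementation used by AhoCorasickDomainMatching. The function names are hypothetical, and the keywords (for example, facebook taken from facebook.com) are assumed to have been extracted already.

```python
from collections import deque


def build_automaton(keywords):
    """Build the goto, failure and output tables for the keywords."""
    goto = [{}]    # goto[state][char] -> next state
    fail = [0]     # fail[state] -> fallback state on a mismatch
    output = [[]]  # output[state] -> keywords that end at this state

    # Insert every keyword into the trie
    for keyword in keywords:
        state = 0
        for char in keyword:
            if char not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][char] = len(goto) - 1
            state = goto[state][char]
        output[state].append(keyword)

    # Breadth-first traversal to fill in the failure links
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for char, child in goto[state].items():
            queue.append(child)
            fallback = fail[state]
            while fallback and char not in goto[fallback]:
                fallback = fail[fallback]
            fail[child] = goto[fallback].get(char, 0)
            # Inherit the keywords that end at the fallback state
            output[child] = output[child] + output[fail[child]]
    return goto, fail, output


def find_keywords(text, automaton):
    """Return the set of keywords that occur anywhere in text."""
    goto, fail, output = automaton
    state = 0
    found = set()
    for char in text:
        while state and char not in goto[state]:
            state = fail[state]
        state = goto[state].get(char, 0)
        found.update(output[state])
    return found


automaton = build_automaton(['facebook', 'gmail'])
print(find_keywords('www.facebook.com.msg40.site', automaton))
```

The automaton is built once and then reused for every record, so each domain is scanned in a single pass regardless of how many keywords are in the list.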

Word segmentation

In order to improve the accuracy of the matching algorithm, we segment the domains into English words using wordsegment.

from certstream_analytics.analysers import WordSegmentation

wordsegmentation = WordSegmentation()

# Just an example dummy record
record = {
    'all_domains': [
        'login-appleid.apple.com.managesupport.co',
    ]
}

# The returned output is as follows:
#
# {
#   'analyser': 'WordSegmentation',
#   'output': {
#     'login-appleid.apple.com.managesupport.co': [
#       'login',
#       'apple',
#       'id',
#       'apple',
#       'com',
#       'manage',
#       'support',
#       'co'
#     ],
#   },
# }
#
print(wordsegmentation.run(record))
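
As a rough illustration of what segmentation does, the sketch below splits each run of letters into the fewest words from a tiny hand-written vocabulary using dynamic programming. This is a deliberately simplified stand-in: the wordsegment package scores candidates with word frequencies from a large corpus instead. VOCABULARY, segment_label and segment_domain are hypothetical names for this example only.

```python
import re

# Toy vocabulary, for illustration only
VOCABULARY = {'login', 'apple', 'id', 'com', 'manage', 'support', 'co'}


def segment_label(label, vocabulary=VOCABULARY):
    """Split a run of characters into the fewest vocabulary words.

    Returns [label] unchanged when no full segmentation exists.
    """
    # best[i] holds the shortest word list covering label[:i], or None
    best = [None] * (len(label) + 1)
    best[0] = []
    for end in range(1, len(label) + 1):
        for start in range(end):
            word = label[start:end]
            if best[start] is not None and word in vocabulary:
                candidate = best[start] + [word]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[-1] if best[-1] is not None else [label]


def segment_domain(domain, vocabulary=VOCABULARY):
    """Segment every dot- or hyphen-separated part of a domain."""
    words = []
    # Dots, hyphens and other non-alphanumeric characters are separators
    for label in re.split(r'[^a-z0-9]+', domain.lower()):
        if label:
            words.extend(segment_label(label, vocabulary))
    return words


print(segment_domain('login-appleid.apple.com.managesupport.co'))
```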

Features generator

A list of features for each domain will also be generated so that they can be used for classification jobs further downstream. The list includes:

- The number of dot-separated fields in the domain; for example, www.google.com has 3.
- The overall length of the domain in characters.
- The length of the longest dot-separated field.
- The length of the TLD; for example, .online (6) or .download (8) is longer than .com (3).
- The randomness level of the domain. The Nostril package is used to check how many of the words returned by the WordSegmentation analyser are nonsense.
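
The structural features in the list above are simple to compute. The following is a minimal sketch and not the library's feature generator: the function name and the dictionary keys are hypothetical, and the randomness feature is left out because it depends on the Nostril package.

```python
def domain_features(domain):
    """Compute the structural features listed above for one domain."""
    fields = domain.split('.')
    return {
        'field_count': len(fields),
        'length': len(domain),
        'longest_field_length': max(len(field) for field in fields),
        'tld_length': len(fields[-1]),
    }


print(domain_features('www.google.com'))
```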

Release history

- 0.1.7 (this version): Nov 15, 2019
- 0.1.6: Nov 15, 2019
- 0.1.5: Jan 3, 2019
- 0.1.4: Dec 17, 2018
- 0.1.3: Oct 15, 2018
- 0.1.2: Oct 12, 2018
- 0.1: Oct 11, 2018

Download files

Source Distributions

No source distribution files available for this release.

Built Distribution

certstream_analytics-0.1.7-py2.py3-none-any.whl (20.6 kB)

Uploaded Nov 15, 2019, Python 2, Python 3

File details

Details for the file certstream_analytics-0.1.7-py2.py3-none-any.whl.

File metadata

Download URL: certstream_analytics-0.1.7-py2.py3-none-any.whl
Upload date: Nov 15, 2019
Size: 20.6 kB
Tags: Python 2, Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/2.0.0 pkginfo/1.4.2 requests/2.22.0 setuptools/41.0.1 requests-toolbelt/0.8.0 tqdm/4.25.0 CPython/3.6.8

File hashes

Hashes for certstream_analytics-0.1.7-py2.py3-none-any.whl
Algorithm	Hash digest
SHA256	`9006ee15ee0d9a1edee6da1084ed61b5d7b831af24424a3e07756414b71a2364`
MD5	`1677863fd72b2f17686f777a4769856b`
BLAKE2b-256	`e3c43fff672b0eaeb93f88fb303ca71deafc84c779ad169ccd68caca61ee5a70`