
HuntLib

A Python library to help with some common threat hunting data analysis operations

Target’s CFC-Open-Source Slack

What's Here?

The huntlib module provides two major object classes as well as a few convenience functions.

  • ElasticDF: Search Elastic and return results as a Pandas DataFrame
  • SplunkDF: Search Splunk and return results as a Pandas DataFrame
  • data.read_json(): Read one or more JSON files and return a single Pandas DataFrame
  • data.read_csv(): Read one or more CSV files and return a single Pandas DataFrame
  • entropy() / entropy_per_byte(): Calculate Shannon entropy
  • promptCreds(): Prompt for login credentials in the terminal or from within a Jupyter notebook
  • edit_distance(): Calculate how "different" two strings are from each other

huntlib.elastic.ElasticDF

The ElasticDF() class searches Elastic and returns results as a Pandas DataFrame. This makes it easier to work with the search results using standard data analysis techniques.

Example usage:

Create a plaintext connection to the Elastic server, no authentication

e = ElasticDF(
                url="http://localhost:9200"
)

The same, but with SSL and authentication

e = ElasticDF(
                url="https://localhost:9200",
                ssl=True,
                username="myuser",
                password="mypass"
)

Fetch search results from an index or index pattern for the previous day

df = e.search_df(
                  lucene="item:5282 AND color:red",
                  index="myindex-*",
                  days=1
)

The same, but do not flatten structures into individual columns. This will result in each structure having a single column with a JSON string describing the structure.

df = e.search_df(
                  lucene="item:5282 AND color:red",
                  index="myindex-*",
                  days=1,
                  normalize=False
)

A more complex example, showing how to set the Elastic document type, use Python-style datetime objects to constrain the search to a certain time period, and a user-defined field against which to do the time comparisons. The result size will be limited to no more than 1500 entries.

df = e.search_df(
                  lucene="item:5285 AND color:red",
                  index="myindex-*",
                  doctype="doc", date_field="mydate",
                  start_time=datetime.now() - timedelta(days=8),
                  end_time=datetime.now() - timedelta(days=6),
                  limit=1500
)

The search and search_df methods will raise InvalidRequestSearchException when the search request is syntactically correct but otherwise invalid (for example, if you request more results than the server is able to provide). They will raise AuthenticationErrorSearchException if the server rejects the supplied credentials during login. They can also raise UnknownSearchException in other situations; in that case, the exception message will contain the original error message returned by Elastic so you can figure out what went wrong.

huntlib.splunk.SplunkDF

The SplunkDF class searches Splunk and returns the results as a Pandas DataFrame. This makes it easier to work with the search results using standard data analysis techniques.

Example Usage

Establish a connection to the Splunk server. Whether this uses SSL/TLS depends on the server, and you don't really get a say.

s = SplunkDF(
              host=splunk_server,
              username="myuser",
              password="mypass"
)

Fetch all search results across all time

df = s.search_df(
                  spl="search index=win_events EventCode=4688"
)

Fetch only specific fields, still across all time

df = s.search_df(
                  spl="search index=win_events EventCode=4688 | table ComputerName _time New_Process_Name Account_Name Creator_Process_ID New_Process_ID Process_Command_Line"
)

Time bounded search, 2 days prior to now

df = s.search_df(
                  spl="search index=win_events EventCode=4688",
                  days=2
)

Time bounded search using Python datetime() values

df = s.search_df(
                  spl="search index=win_events EventCode=4688",
                  start_time=datetime.now() - timedelta(days=2),
                  end_time=datetime.now()
)

Time bounded search using Splunk notation

df = s.search_df(
                  spl="search index=win_events EventCode=4688",
                  start_time="-2d@d",
                  end_time="@d"
)

Limit the number of results returned to no more than 1500

df = s.search_df(
                  spl="search index=win_events EventCode=4688",
                  limit=1500
)

NOTE: The value specified as the limit is also subject to a server-side maximum. By default, this is 50000 and can be changed by editing limits.conf on the Splunk server. If you use the limit parameter, the number of search results you receive will be the least of: 1) the actual number of results available, 2) the number you asked for with limit, 3) the server-side maximum result size. If you omit limit altogether, you will get the true number of search results available without being subject to additional limits, though your search may take much longer to complete.
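The interplay of those three bounds can be modeled in a quick sketch. The function name and numbers below are hypothetical, purely to illustrate the note above; they are not part of huntlib:

```python
SERVER_SIDE_MAX = 50000  # default server-side cap, configurable in limits.conf

def effective_result_count(available, limit=None, server_max=SERVER_SIDE_MAX):
    """Model of how many results a limited search actually returns.

    Hypothetical helper (not part of huntlib): with a limit, you get the
    least of (results available, requested limit, server-side max); with
    no limit, you get everything available.
    """
    if limit is None:
        return available
    return min(available, limit, server_max)
```

For example, asking for 100000 results when two million are available still yields only the 50000 the server allows, while omitting limit returns all two million.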

Return only specified fields NewProcessName and SubjectUserName

df = s.search_df(
                  spl="search index=win_events EventCode=4688",
                  fields="NewProcessName,SubjectUserName"
)

NOTE: By default, Splunk will only return the fields you reference in the search string (i.e., you must explicitly search on "NewProcessName" if you want that field in the results). Usually this is not what we want. When fields is not None, the query string will be rewritten with "| fields " appended (e.g., search index=win_events EventCode=4688 | fields NewProcessName,SubjectUserName). This works fine for most simple cases, but if you have a more complex SPL query and it breaks, simply set fields=None in your function call to avoid this behavior.
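The rewriting described in that note amounts to appending a | fields clause whenever fields is set. A minimal sketch (append_fields is a hypothetical helper, not huntlib's internal code):

```python
def append_fields(spl, fields=None):
    """Append a '| fields ...' clause to an SPL query when fields is given.

    Hypothetical illustration of the rewrite described above; when fields
    is None the query is returned unchanged.
    """
    if fields is None:
        return spl
    return f"{spl} | fields {fields}"
```

Calling it with "search index=win_events EventCode=4688" and "NewProcessName,SubjectUserName" produces exactly the rewritten query shown in the note.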

SplunkDF will raise AuthenticationErrorSearchException during initialization in the event the server denied the supplied credentials.

Data Module

The huntlib.data module contains functions that make it easier to deal with data files.

Reading Multiple Data Files

huntlib provides two convenience functions to replace the standard Pandas read_json() and read_csv() functions. These replacements work exactly the same as the originals and take all the same arguments. The only difference is that they can accept a filename wildcard in addition to the name of a single file. All files matching the wildcard expression will be read and returned as a single DataFrame.

Start by importing the functions from the module:

from huntlib.data import read_csv, read_json

Here's an example of reading a single JSON file, where each line is a separate JSON document:

df = read_json("data.json", lines=True)

Similarly, this will read all JSON files in the current directory:

df = read_json("*.json", lines=True)

The read_csv function works the same way:

df = read_csv("data.csv")

or

df = read_csv("*.csv")

Consult the Pandas documentation for information on supported options for read_csv() and read_json().
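Under the hood, wildcard support of this kind can be built from glob plus pandas.concat. The sketch below is a hypothetical re-implementation for the CSV case, assuming pandas is installed; it is not huntlib's actual code:

```python
import glob

import pandas as pd

def read_csv_glob(pattern, **kwargs):
    """Read every file matching `pattern` into one concatenated DataFrame.

    Hypothetical stand-in for huntlib's wildcard-aware read_csv(); extra
    keyword arguments are passed straight through to pandas.read_csv().
    """
    files = sorted(glob.glob(pattern))
    if not files:
        raise FileNotFoundError(f"no files match {pattern!r}")
    frames = (pd.read_csv(f, **kwargs) for f in files)
    return pd.concat(frames, ignore_index=True)
```

Sorting the matches first makes the row order deterministic, and ignore_index=True renumbers the combined index instead of repeating each file's own.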

Miscellaneous Functions

Entropy

We define two entropy functions, entropy() and entropy_per_byte(). Both accept a single string as a parameter. The entropy() function calculates the Shannon entropy of the given string, while entropy_per_byte() attempts to normalize across strings of various lengths by returning the Shannon entropy divided by the length of the string. Both return values are float.

>>> entropy("The quick brown fox jumped over the lazy dog.")
4.425186429663008
>>> entropy_per_byte("The quick brown fox jumped over the lazy dog.")
0.09833747621473352

The higher the value, the more random the string, and thus the more information potentially embedded in it.
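For reference, Shannon entropy over a string's character frequencies can be written in a few lines of standard-library Python. This is an illustrative sketch of the formula, not necessarily huntlib's exact implementation:

```python
import math
from collections import Counter

def shannon_entropy(s):
    """H = -sum(p * log2(p)) over the relative frequency p of each symbol."""
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())

def shannon_entropy_per_byte(s):
    """Entropy normalized by length, to compare strings of different sizes."""
    return shannon_entropy(s) / len(s)
```

A string of one repeated character scores 0.0, while a string whose symbols are all equally likely scores the maximum, log2 of the number of distinct symbols.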

Credential Handling

Sometimes you need to provide credentials for a service, but don't want to hard-code them into your scripts, especially if you're collaborating on a hunt. huntlib provides the promptCreds() function to help with this. This function works well both in the terminal and when called from within a Jupyter notebook.

Call it like so:

(username, password) = promptCreds()

You can change one or both of the username/password prompts by passing arguments:

(username, password) = promptCreds(uprompt="LAN ID: ",
                                   pprompt="LAN Pass: ")
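If you want the same behavior without huntlib, a minimal stand-in using only the standard library might look like the following. The names prompt_creds, input_fn, and getpass_fn are hypothetical, chosen to mirror the uprompt/pprompt arguments above:

```python
import getpass

def prompt_creds(uprompt="Username: ", pprompt="Password: ",
                 input_fn=input, getpass_fn=getpass.getpass):
    """Prompt for a username (echoed) and a password (hidden).

    Hypothetical sketch, not huntlib's implementation. The injectable
    input_fn/getpass_fn parameters exist so the function can be exercised
    without an interactive terminal.
    """
    return input_fn(uprompt), getpass_fn(pprompt)
```

getpass.getpass suppresses echo in a terminal; in environments where that is unavailable it falls back to a visible prompt with a warning.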

String Similarity

String similarity can be expressed in terms of "edit distance", the number of single-character edits necessary to turn the first string into the second. This is often useful when, for example, you want to find two strings that are very similar but not identical (such as when hunting for process impersonation).

There are a number of different ways to compute edit distance. huntlib provides the edit_distance() function, which supports several algorithms.

Here's an example:

>>> huntlib.edit_distance('svchost', 'scvhost')
1

You can specify a different algorithm using the method parameter. Valid methods are levenshtein, damerau-levenshtein, hamming, jaro and jaro-winkler. The default is damerau-levenshtein.

>>> huntlib.edit_distance('svchost', 'scvhost', method='levenshtein')
2
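As a point of comparison, the classic Levenshtein distance (insertions, deletions, and substitutions only) can be sketched as a small dynamic program. This is an illustrative re-implementation, not huntlib's code; note that it counts the svchost/scvhost transposition as two substitutions, which is why the damerau-levenshtein default (transposition is one edit) returns 1 instead:

```python
def levenshtein(a, b):
    """Dynamic-programming edit distance between strings a and b.

    Illustrative sketch of the 'levenshtein' method named above,
    not huntlib's implementation.
    """
    prev = list(range(len(b) + 1))  # distances from '' to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to ''
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                # delete ca
                curr[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),   # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]
```

Keeping only the previous row of the DP table reduces memory from O(len(a) * len(b)) to O(len(b)).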
