IOC extractor

threatrack_iocextract.py

Extracts IOCs (and other patterns) from text.

How do I use it?

Install

  • from https://pypi.org via pip:
pip install threatrack_iocextract
  • via setup.py:
sudo python setup.py install

Usage

import threatrack_iocextract

text = 'hxxp://bad[.]com/  https://evil.com/foobar?a=1 asfdas\nsadfasf bob at example dot com'

# extract IOCs
iocs1 = threatrack_iocextract.extract(text)
# iocs1 is dict like `{ioctype1 : [ioc1, ioc2, ...], ioctype2 : [ioc3, ...], ...}`
print(iocs1['url'][0]) # prints https://evil.com/foobar?a=1
print(iocs1['hostname'][0]) # prints evil.com

# extract defanged IOCs (the text is refanged first)
iocs2 = threatrack_iocextract.extract_all(text)
print(iocs2['url'][0]) # prints http://bad.com/
print(iocs2['hostname'][0]) # prints bad.com
print(iocs2['email'][0]) # prints bob@example.com

How does this work?

threatrack_iocextract.py

The main Python program. The workflow is:

  1. load_patterns(): Read, expand and configure search patterns (automatically called on import).
  2. refang(text): Turns defanged text into "refanged" text, e.g. 'hxxp://bad[.]com/' becomes 'http://bad.com/'.
  3. extract(text): Extract IOCs from text. Returns a dict like {ioctype1 : [ioc1, ioc2, ...], ioctype2 : [ioc3, ...], ...} (see "How do I use it?").
  4. extract_all(text): Like extract() but will also extract defanged IOCs.
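
To make the refang step concrete, here is a minimal sketch of what such a function might look like. The rule list and the name refang_sketch are assumptions for illustration only; the rules actually used by refang() may differ.

```python
import re

# Hypothetical refang rules for illustration; not taken from the library.
REFANG_RULES = [
    (re.compile(r'hxxp', re.I), 'http'),   # hxxp:// -> http://
    (re.compile(r'\[\.\]'), '.'),          # bad[.]com -> bad.com
    (re.compile(r'\s+at\s+'), '@'),        # bob at example -> bob@example
    (re.compile(r'\s+dot\s+'), '.'),       # example dot com -> example.com
]

def refang_sketch(text):
    """Apply every rule in order to the whole text."""
    for pattern, replacement in REFANG_RULES:
        text = pattern.sub(replacement, text)
    return text

print(refang_sketch('hxxp://bad[.]com/'))       # http://bad.com/
print(refang_sketch('bob at example dot com'))  # bob@example.com
```

Because every rule runs over the whole text, ordinary prose containing " at " or " dot " is rewritten too (see "Known issues").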

patterns/

Contains the search pattern configuration.

patterns/patterns.csv

A tab-separated list of search patterns, loaded by load_patterns(). Each line has the format:

ioctype	regexpattern	regexoptions

It must always have 3 columns.

For example:

sha1	\b[a-f0-9]{40}\b	i

would search for SHA1 hashes between word boundaries.

Possible options are:

  • i (= re.I = case-insensitive)
  • s (= re.S = dot all)

All other regexoptions are ignored. The regexoptions column must always be present, even if it carries no real option (the patterns.csv examples in this document use . as a placeholder).
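
As a rough sketch of how one line of this format could be turned into a compiled regex, assuming the option letters map onto Python re flags as listed above. The helper compile_pattern_line is hypothetical and is not the code used by load_patterns().

```python
import re

# Option letters documented above; any other character is ignored.
FLAG_MAP = {'i': re.I, 's': re.S}

def compile_pattern_line(line):
    """Parse one 'ioctype<TAB>regexpattern<TAB>regexoptions' line (hypothetical helper)."""
    fields = line.rstrip('\n').split('\t')
    if len(fields) != 3:
        raise ValueError('expected exactly 3 tab-separated columns')
    ioctype, pattern, options = fields
    flags = 0
    for letter in options:
        flags |= FLAG_MAP.get(letter, 0)  # unknown letters contribute nothing
    return ioctype, re.compile(pattern, flags)

name, regex = compile_pattern_line('sha1\t\\b[a-f0-9]{40}\\b\ti')
print(name, bool(regex.search('DA39A3EE5E6B4B0D3255BFEF95601890AFD80709')))
# sha1 True
```

The uppercase hash matches only because the i option enables case-insensitive matching.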

Expansions allow you to reuse other sources in patterns. There are two types of expansions:

  • %%file:name.csv%%
  • %%pattern:name%%

file: expansions allow you to include file contents in patterns. name.csv is a file containing a single-column list of patterns. In any pattern containing this expansion, %%file:name.csv%% is replaced with a regex that tries to match each line of name.csv.

For example, if name.csv contains:

foo
bar
shoot

The pattern 'test%%file:name.csv%%?blah' will be expanded to 'test(?:foo|bar|shoot)?blah'. See patterns/{schemes,tlds}.csv for how this can be useful.
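
The file: expansion described above can be sketched as follows. The function expand_file and the in-memory files dict are assumptions standing in for reading name.csv from the patterns/ directory; the real loader may work differently.

```python
import re

def expand_file(pattern, files):
    """Replace each %%file:NAME%% with a non-capturing alternation of NAME's lines.

    `files` maps a file name to its list of lines (hypothetical stand-in for disk I/O).
    """
    def substitute(match):
        lines = files[match.group(1)]
        return '(?:' + '|'.join(lines) + ')'
    return re.sub(r'%%file:(.+?)%%', substitute, pattern)

files = {'name.csv': ['foo', 'bar', 'shoot']}
print(expand_file('test%%file:name.csv%%?blah', files))
# test(?:foo|bar|shoot)?blah
```

The lines are joined in file order, which is why the order of entries in the file matters (see "Gotchas").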

pattern: expansions allow you to reuse previously defined patterns.

For example, if patterns.csv contains:

.port	6553[0-5]|655[0-2][0-9]|65[0-4][0-9]{2}|6[0-4][0-9]{3}|[1-5][0-9]{4}|[1-9][0-9]{3}|[1-9][0-9]{2}|[1-9][0-9]|[0-9]	.
ipv4	(?:(?:25[0-5]|2[0-4][0-9]|1?[0-9]{1,2})\.){3}(?:25[0-5]|2[0-4][0-9]|1?[0-9]{1,2})(?:\:%%pattern:.port%%)?	.

ipv4 would be expanded to:

(?:(?:25[0-5]|2[0-4][0-9]|1?[0-9]{1,2})\.){3}(?:25[0-5]|2[0-4][0-9]|1?[0-9]{1,2})(?:\:(?:6553[0-5]|655[0-2][0-9]|65[0-4][0-9]{2}|6[0-4][0-9]{3}|[1-5][0-9]{4}|[1-9][0-9]{3}|[1-9][0-9]{2}|[1-9][0-9]|[0-9]))?

Names starting with . (dot) are private. They are not searched directly but can be used in expansions.
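
A minimal sketch of pattern: expansion and private names, assuming patterns are processed in file order so that only earlier names can be referenced. The function expand_patterns and the shortened .octet/ipv4 rows are hypothetical and are not the contents of patterns.csv.

```python
import re

def expand_patterns(rows):
    """rows: ordered list of (name, pattern). Returns (all_expanded, searchable)."""
    expanded = {}
    for name, pattern in rows:
        def substitute(match):
            # Wrap the referenced pattern in a non-capturing group.
            return '(?:' + expanded[match.group(1)] + ')'
        expanded[name] = re.sub(r'%%pattern:(.+?)%%', substitute, pattern)
    # Names starting with '.' are private: usable in expansions, never searched.
    searchable = {n: p for n, p in expanded.items() if not n.startswith('.')}
    return expanded, searchable

rows = [
    ('.octet', r'25[0-5]|2[0-4][0-9]|1?[0-9]{1,2}'),
    ('ipv4', r'(?:%%pattern:.octet%%\.){3}%%pattern:.octet%%'),
]
expanded, searchable = expand_patterns(rows)
print(sorted(searchable))                                       # ['ipv4']
print(bool(re.fullmatch(searchable['ipv4'], '192.168.1.1')))    # True
```

In this sketch, referencing a name that has not been defined yet raises a KeyError.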

Gotchas

  • tlds.csv and schemes.csv must be sorted longest patterns first to ensure longest match. Otherwise .co would match before .com, etc.
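
The ordering requirement follows from how regex alternation works in Python: alternatives are tried left to right and the first one that matches wins. A small demonstration, using a made-up two-entry list instead of the real tlds.csv:

```python
import re

tlds = ['co', 'com']  # hypothetical, deliberately in the wrong order

unsorted_regex = re.compile(r'example\.(?:' + '|'.join(tlds) + ')')

# Sort longest entries first before building the alternation.
longest_first = sorted(tlds, key=len, reverse=True)
sorted_regex = re.compile(r'example\.(?:' + '|'.join(longest_first) + ')')

print(unsorted_regex.search('example.com').group())  # example.co
print(sorted_regex.search('example.com').group())    # example.com
```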

Are there any issues?

Yes. Unfortunately, I can only list the known issues.

Known issues

Unconditional refang

extract_all(text) refangs the whole text unconditionally. This can create IOCs that were never in the text, e.g. I am at home.To do so ... would be altered to I am@home.To do so ... and thus lead to the email IOC am@home.to.

Genuine IOCs can also be altered, e.g. http://example[.]com/foo[dot]bar/ would refang to http://example.com/foo.bar/.

Unfortunately, it is very hard to fix this. Suggestions are welcome.

Possible mitigation: Add IOC-specific refangs. So, e.g., the s/at/@/ refang gets applied only when extracting email IOCs.
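
One way the mitigation might look, purely as a sketch: keep a small set of common refangs and add type-specific ones only for the IOC type being extracted. All names and rules below are hypothetical; nothing like this exists in the library yet.

```python
import re

# Hypothetical rule sets; neither the names nor the rules come from the library.
COMMON_REFANGS = [(r'hxxp', 'http'), (r'\[\.\]', '.')]
TYPE_REFANGS = {
    'email': [(r'\s+at\s+', '@'), (r'\s+dot\s+', '.')],
}

def refang_for(ioctype, text):
    """Refang text using the common rules plus the rules for one IOC type."""
    for pattern, replacement in COMMON_REFANGS + TYPE_REFANGS.get(ioctype, []):
        text = re.sub(pattern, replacement, text)
    return text

text = 'I am at home. See hxxp://bad[.]com/'
print(refang_for('url', text))    # I am at home. See http://bad.com/
print(refang_for('email', text))  # I am@home. See http://bad.com/
```

This only keeps the at/dot rewrites away from other IOC types; the email pass itself can still produce false positives such as am@home.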

TODO

  • Fix IPv6 pattern, it overlaps with MAC addresses
  • Fix Hash and Bitcoin overlap
  • Increase whitelist
  • Fix MAC pattern to not match on fingerprints
  • Fix extract_all() refanging, which breaks YARA extraction. Need to find a smart solution to extracting defanged IOCs. :(
  • Add IOC-specific refangs
