Advanced Indicator of Compromise (IOC) extractor
Project description
Advanced Indicator of Compromise (IOC) extractor.
Overview
This library extracts URLs, IP addresses, MD5/SHA hashes, and YARA rules from text corpora. It includes obfuscated and “defanged” IOCs in the output, and optionally deobfuscates them.
The Problem
It is common practice for malware analysts or endpoint software to “defang” IOCs such as URLs and IP addresses, in order to prevent accidental exposure to live malicious content. Being able to extract and aggregate these IOCs is often valuable for analysts. Unfortunately, existing “IOC extraction” tools often pass right by them, as they are not caught by standard REGEX.
For example, the simple defanging technique of surrounding periods with brackets:
127[.]0[.]0[.]1
Existing tools that use a simple IP address REGEX will ignore this IOC entirely.
The Solution
By combining specially crafted REGEX with some custom postprocessing, we are able to both detect and deobfuscate “defanged” IOCs. This saves time and effort for the analyst, who might otherwise have to manually find and convert IOCs into machine-readable format.
A Simple Use Case
Many Twitter users post C2s or other valuable IOC information with defanged URLs. For example, this tweet from @InQuest:
Recommended reading and great work from @unit42_intel: https://researchcenter.paloaltonetworks.com/2018/02/unit42-sofacy-attacks-multiple-government-entities/ ... InQuest customers have had detection for threats delivered from hotfixmsupload[.]com since 6/3/2017 and cdnverify[.]net since 2/1/18.
If we run this through the extractor, we can easily pull out the URLs:
https://researchcenter.paloaltonetworks.com/2018/02/unit42-sofacy-attacks-multiple-government-entities/ hotfixmsupload[.]com cdnverify[.]net
Passing in refang=True at extraction time would remove the obfuscation, but since these are real IOCs, let’s leave them defanged in our documentation. :)
Installation
Just get it from pip:
pip install iocextract
Usage
Try extracting some defanged URLS:
>>> content = """ ... I really love example[.]com! ... All the bots are on hxxp://example.com/bad/url these days. ... C2: tcp://example[.]com:8989/bad ... """ >>> import iocextract >>> for url in iocextract.extract_urls(content): ... print url ... hxxp://example.com/bad/url tcp://example[.]com:8989/bad example[.]com tcp://example[.]com:8989/bad
Note that some URLs may show up twice if they are caught by multiple REGEXes.
If you want, you can also “refang”, or remove common obfuscation methods from IOCs:
>>> for url in iocextract.extract_urls(content, refang=True): ... print url ... http://example.com/bad/url http://example.com:8989/bad http://example.com http://example.com:8989/bad
You can even extract and decode hex-encoded URLs:
>>> content = '612062756e6368206f6620776f72647320687474703a2f2f6578616d706c652e636f6d2f70617468206d6f726520776f726473' >>> for url in iocextract.extract_urls(content): ... print url ... 687474703a2f2f6578616d706c652e636f6d2f70617468 >>> for url in iocextract.extract_urls(content, refang=True): ... print url ... http://example.com/path
All extract_* functions in this library return iterators, not lists. The benefit of this behavior is that iocextract can process extremely large inputs, with a very low overhead. However, if for some reason you need to iterate over the IOCs more than once, you will have to save the results as a list:
>>> list(iocextract.extract_urls(content)) ['hxxp://example.com/bad/url', 'tcp://example[.]com:8989/bad', 'example[.]com', 'tcp://example[.]com:8989/bad']
A command-line tool is also included:
$ iocextract -h usage: iocextract [-h] [--input INPUT] [--output OUTPUT] [--extract-ips] [--extract-urls] [--extract-yara-rules] [--extract-hashes] [--refang] [--strip-urls] Advanced Indicator of Compromise (IOC) extractor. If no arguments are specified, the default behavior is to extract all IOCs. optional arguments: -h, --help show this help message and exit --input INPUT default: stdin --output OUTPUT default: stdout --extract-ips --extract-urls --extract-yara-rules --extract-hashes --refang default: no --strip-urls remove possible garbage from the end of urls. default: no
Only URLs and IPv4 addresses can be “refanged”.
More Details
This library currently supports the following IOCs:
- IP Addresses
IPv4 fully supported
IPv6 partially supported
- URLs
With protocol specifier: http, https, tcp, udp, ftp, sftp, ftps
With [.] anchor, even with no protocol specifier
IPv4 and IPv6 (RFC2732) URLs are supported
Hex-encoded URLs with protocol specifier: http, https, ftp
URL-encoded URLs with protocol specifier: http, https, ftp, ftps, sftp
- Emails
Partially supported, anchoring on @
YARA rules
- Hashes
MD5
SHA1
SHA256
SHA512
For IPv4 addresses, the following defang techniques are supported:
Technique |
Defanged |
Refanged |
---|---|---|
. -> [.] |
1[.]1[.]1[.]1 |
1.1.1.1 |
. -> (.) |
1(.)1(.)1(.)1 |
1.1.1.1 |
Partial |
1[.1[.1.]1 |
1.1.1.1 |
Any combination |
1.)1[.1.)1 |
1.1.1.1 |
For URLs, the following defang techniques are supported:
Technique |
Defanged |
Refanged |
---|---|---|
. -> [.] |
example[.]com/path |
http://example.com/path |
. -> (.) |
example(.)com/path |
http://example.com/path |
Partial |
http://example[.com/path |
http://example.com/path |
/ -> [/] |
http://example.com[/]path |
http://example.com/path |
http:// example .com /path |
http://example.com/path |
|
:// -> __ |
http__example.com/path |
http://example.com/path |
hxxp |
hxxp://example.com/path |
http://example.com/path |
Any combination |
hxxp__ example( .com[/]path |
http://example.com/path |
Hex encoded |
687474703a2f2f6578616d706c652e636f6d2f70617468 |
http://example.com/path |
URL encoded |
http%3A%2F%2fexample%2Ecom%2Fpath |
http://example.com/path |
Note that the table above is not exhaustive, and other URL/defang patterns may also be extracted correctly. If you notice something missing or not working correctly, feel free to let us know via the GitHub Issues.
Changelog
New features, improvements, and bugfixes for each release can be found in the GitHub releases.
Contributing
If you have a defang technique that doesn’t make it through the extractor, or if you find any bugs, PRs and Issues are always welcome. The library is released under a “BSD-New” (aka “BSD 3-Clause”) license.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file iocextract-1.3.0.tar.gz
.
File metadata
- Download URL: iocextract-1.3.0.tar.gz
- Upload date:
- Size: 9.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 11638af2e55ad2a3fc23169fc88628a9078571a0ffd421ea761aaf07dc55489d |
|
MD5 | 5d586e0a28a2819c494979b5d15b1b2c |
|
BLAKE2b-256 | 778c0a6e2ce9a8076792a14b39adf4b33fd8727678554cf9d3e975a6bb9e02f7 |
Provenance
File details
Details for the file iocextract-1.3.0-py2.py3-none-any.whl
.
File metadata
- Download URL: iocextract-1.3.0-py2.py3-none-any.whl
- Upload date:
- Size: 12.4 kB
- Tags: Python 2, Python 3
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 992f4570c1364b7801126be0787267dd6137f9b1299e5b8335ac48207407a705 |
|
MD5 | af22fe8451e90d2edc6716942e584c24 |
|
BLAKE2b-256 | c1d64c7c54bfd941c7f94dc4629c69b95d9bec059752ce354719d6dc9711873c |