A library and command line tool for extracting indicators of compromise (IOCs) from security reports in PDF, HTML, or text formats.
iocsearcher
iocsearcher is a Python library and command-line tool to extract indicators of compromise (IOCs), also known as cyber observables, from HTML, PDF, and text files. It can identify both defanged (e.g., URL hxxp://example[DOT]com) and unmodified IOCs (e.g., URL http://example.com).
Installation
pip install iocsearcher
Supported IOCs
iocsearcher can extract the following IOC types:
- URLs (url)
- Domain names (fqdn)
- IP addresses (ip4, ip6)
- IP subnets (ip4Net)
- Hashes (md5, sha1, sha256)
- Email addresses (email)
- Phone numbers (phoneNumber)
- Copyright strings (copyright)
- CVE vulnerability identifiers (cve)
- Tor v3 addresses (onionAddress)
- Social network handles (facebookHandle, githubHandle, instagramHandle, linkedinHandle, pinterestHandle, telegramHandle, twitterHandle, whatsappHandle, youtubeHandle, youtubeChannel)
- Advertisement/analytics identifiers (googleAdsense, googleAnalytics, googleTagManager)
- Blockchain addresses (bitcoin, bitcoincash, cardano, dashcoin, dogecoin, ethereum, litecoin, monero, ripple, tezos, tronix, zcash)
- Payment addresses (webmoney)
- Chinese Internet Content Provider licenses (icp)
- Bank account numbers (iban)
- Trademarks (trademark)
- Universal unique identifiers (uuid)
- Android package names (packageName)
- Spanish NIF identifiers (nif)
Command Line Usage
To find IOCs in a given file just provide the -f (--file) option. By default, found IOCs are printed to stdout, defanged IOCs are rearmed, and IOCs are deduplicated so they only appear once.
iocsearcher -f file.pdf
iocsearcher -f page.html
iocsearcher -f input.txt
You can use the -o (--output) option to write IOCs to a file instead of stdout:
iocsearcher -f file.pdf -o iocs.txt
By default, all regexps are applied to the input. If you are only interested in specific IOC types, it is more efficient to specify them with the -t (--target) option, which can be given multiple times:
iocsearcher -f file.pdf -t url -t email
You can also search for IOCs in all files in a directory using the -d (--dir) option. IOCs extracted from each file will be placed in their own .iocs file. To place all IOCs found across the input files in a single output file, also add the -o (--output) option:
iocsearcher -d directoryWithFiles -o all.iocs
In HTML files, only the readable text is examined (similar to the text shown by Firefox's Reader View). If you want to scan the whole HTML content, you can use the -r (--raw) option:
iocsearcher -f page.html -r
If you have a file that you want to interpret as text avoiding filetype detection, you can use the -F (--forcetext) option:
iocsearcher -f input.txt -F
You can store the text extracted from a PDF/HTML file using the -T (--text) option, which will produce a .text file for each input file:
iocsearcher -f file.pdf -T
By default, IOCs are deduplicated. To output every match together with its offset, without deduplication, use the -v (--verbose) option:
iocsearcher -f file.pdf -v
You can also produce a ranking of IOCs by number of appearances (without deduplication) by using the -C (--count) option:
iocsearcher -f file.pdf -C -o rank.iocs
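The ranking idea can also be sketched in a few lines of Python. The block below is only an illustration, not the code behind the -C option: it assumes a list of (type, rearmed value, offset, original string) tuples shaped like the search_raw output shown under Library Usage, and the helper name rank_iocs is hypothetical.

```python
from collections import Counter

def rank_iocs(raw_matches):
    """Count how often each (type, value) pair appears, most frequent first.

    raw_matches is assumed to hold 4-tuples of
    (ioc_type, rearmed_value, offset, original_string).
    """
    counts = Counter((ioc_type, value) for ioc_type, value, _offset, _orig in raw_matches)
    return counts.most_common()

# Hand-written sample data; it was not produced by running iocsearcher.
sample = [
    ('fqdn', 'example.com', 10, 'example[dot]com'),
    ('fqdn', 'example.com', 52, 'example.com'),
    ('email', 'contact@example.com', 80, 'contact[AT]example[dot]com'),
]
ranking = rank_iocs(sample)
for (ioc_type, value), count in ranking:
    print(count, ioc_type, value)
```

Counting on the rearmed value means the same indicator is counted once per appearance even when the report defangs it in different ways.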
Library Usage
You can also use iocsearcher as a library by creating a Searcher object and then invoking the functions search_data to identify rearmed and deduplicated IOCs and search_raw to identify all matches, their offsets, and the defanged string. The Searcher object needs to be created only once to parse the regexps. Then, it can be reused to find IOCs in multiple input strings.
python3
>>> import iocsearcher
>>> from iocsearcher.searcher import Searcher
>>> test = 'Find this email contact[AT]example[dot]com'
>>> searcher = Searcher()
>>> searcher.search_data(test)
{('email', 'contact@example.com'), ('fqdn', 'example.com')}
>>> searcher.search_data(test, targets={'email'})
{('email', 'contact@example.com')}
>>> searcher.search_raw(test)
[('email', 'contact@example.com', 16, 'contact[AT]example[dot]com'), ('fqdn', 'example.com', 27, 'example[dot]com')]
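Since search_data returns a set of (type, value) pairs, it is often convenient to regroup the results by IOC type. The following is a minimal sketch of one way to do that; group_by_type is a hypothetical helper, not part of iocsearcher, and the input is copied from the output shown above so the block runs without the library installed.

```python
from collections import defaultdict

def group_by_type(iocs):
    """Turn an iterable of (ioc_type, value) pairs into {ioc_type: sorted values}."""
    grouped = defaultdict(set)
    for ioc_type, value in iocs:
        grouped[ioc_type].add(value)
    # Sort the values so the result is stable across runs.
    return {ioc_type: sorted(values) for ioc_type, values in grouped.items()}

# Same pairs as the search_data output shown above.
found = {('email', 'contact@example.com'), ('fqdn', 'example.com')}
by_type = group_by_type(found)
print(by_type)
```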
You can also open a document without needing to provide its type, get its text, and then use a Searcher object to search for IOCs in the text. For example, if you have a file called file.pdf you can do:
python3
>>> import iocsearcher
>>> from iocsearcher.document import open_document
>>> from iocsearcher.searcher import Searcher
>>> doc = open_document("file.pdf")
>>> text, _ = doc.get_text() if doc is not None else ("", None)
>>> searcher = Searcher()
>>> searcher.search_data(text)
If the file is not a PDF, HTML, or text document, open_document emits a warning and returns None.
Defang and Rearm
Many security reports defang (i.e., remove the teeth from) malicious indicators, especially network indicators such as URLs, domains, IP addresses, and email addresses. This practice helps prevent users from inadvertently clicking on a malicious indicator and starting a network connection to it. Defanged indicators do not follow the indicator specification and thus require relaxed regular expressions to detect them.
iocsearcher supports some popular defang operations and rearms the IOCs by default so that deduplication works even if the same IOC has been defanged in different ways. However, it is not possible to support all defang operations, as every analyst can come up with their own. If you think iocsearcher is missing support for some popular defang operation, let us know by providing pointers to reports that use them.
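To make the idea concrete, here is a minimal sketch of rearming with a few regular-expression substitutions. It is an illustration only and is not iocsearcher's implementation: the three rules and the rearm function are assumptions chosen for the example, and they cover far fewer defang styles than a real tool would need.

```python
import re

# Illustrative rules only; this is NOT iocsearcher's actual rule set.
_REARM_RULES = [
    # hxxp:// and hxxps:// back to http:// and https://
    (re.compile(r'hxxp', re.IGNORECASE), 'http'),
    # [.] (.) {.} [dot] (dot) {dot} back to "."
    (re.compile(r'[\[({]\s*(?:dot|\.)\s*[\])}]', re.IGNORECASE), '.'),
    # [@] (@) [at] (at) {at} back to "@"
    (re.compile(r'[\[({]\s*(?:at|@)\s*[\])}]', re.IGNORECASE), '@'),
]

def rearm(text):
    """Apply each substitution in turn and return the rearmed string."""
    for pattern, replacement in _REARM_RULES:
        text = pattern.sub(replacement, text)
    return text

print(rearm('hxxp://example[DOT]com'))
print(rearm('contact[AT]example[dot]com'))

# Two different defang styles collapse to one value, so deduplication works.
unique = {rearm('example[.]com'), rearm('example(dot)com')}
print(unique)
```

The last lines show why rearming before deduplication matters: without it, the two spellings would be reported as distinct indicators.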
Customizing the Regular Expressions
iocsearcher reads its regular expressions from an INI configuration file. If you want to modify a regexp, add a regexp, change the IOC type associated to a regexp, or disable validation for an existing regexp, you can create a copy of the patterns.ini file in the GitHub repo, edit your copy, and pass it as input to iocsearcher using the -P (--patterns) option:
iocsearcher -f file.pdf -P mypatterns.ini
Note that if you add a new regexp, the output will be the outermost group if a group exists, and the whole match if the regexp has no groups.
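The group rule can be illustrated with Python's re module. This is a sketch of the stated behavior, not code taken from iocsearcher; it assumes that "outermost group" corresponds to the first capturing group of the pattern, and the extract helper and the sample patterns are hypothetical.

```python
import re

def extract(pattern, text):
    """Return group 1 of each match if the pattern has groups, else the whole match."""
    regex = re.compile(pattern)
    results = []
    for match in regex.finditer(text):
        # regex.groups is the number of capturing groups in the pattern.
        results.append(match.group(1) if regex.groups else match.group(0))
    return results

# No groups: the whole match is returned.
no_group = extract(r'CVE-\d{4}-\d{4,}', 'see CVE-2021-44228 now')

# Nested groups: the outer group (group 1) is returned, not the "user:" prefix
# and not the inner group that only captures the local part.
nested = extract(r'user:((\w+)@example\.com)', 'user:bob@example.com')

print(no_group, nested)
```

In practice this means a new regexp can match surrounding context (such as a prefix) while still reporting only the indicator itself, by wrapping the indicator in the outer group.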
Related Tools
There exist multiple other open-source IOC extraction tools and we developed iocsearcher to improve on those. In our FGCS journal paper we propose a novel evaluation methodology for IOC extraction tools and apply it to compare iocsearcher with the following tools:
- Jager (Python)
- IOC-parser (Python)
- Cacador (Go)
- CyObstract (Python)
- IOC Finder (Python)
- IOC Extract (Python)
- IOC-Extractor (Python)
We believe the results show iocsearcher performs generally best, but that is up to you to judge. We encourage you to read our paper if you have questions about how iocsearcher compares with the above tools and to try the above tools if iocsearcher does not meet your goals.
Filtering
Technically speaking, iocsearcher is an indicator extraction tool, i.e., it extracts indicators regardless of whether they are benign or malicious. Currently, iocsearcher, like most of the other tools mentioned above, does not differentiate malicious indicators (i.e., IOCs) from benign indicators. For example, it will extract all URLs in the given input, whether they are malicious or benign.
Filtering of benign indicators is typically application-specific, so we prefer to keep it as a separate step. Such filtering is oftentimes performed with blocklists or through Natural Language Processing (NLP) techniques.
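As an example of such a separate step, the sketch below drops domain indicators that fall under a list of domains the application does not care about. Everything in it is an assumption made for illustration: the helper names, the drop list, and the sample indicators are hypothetical, and the input is simply a set of (type, value) pairs like the one search_data returns.

```python
def is_listed(domain, drop_domains):
    """True if the domain or any of its parent domains is in drop_domains."""
    labels = domain.lower().rstrip('.').split('.')
    # Check 'www.google.com', then 'google.com', then 'com'.
    return any('.'.join(labels[i:]) in drop_domains for i in range(len(labels)))

def filter_indicators(iocs, drop_domains):
    """Keep every (ioc_type, value) pair except fqdn entries covered by drop_domains."""
    return {
        (ioc_type, value)
        for ioc_type, value in iocs
        if not (ioc_type == 'fqdn' and is_listed(value, drop_domains))
    }

# Hand-written sample data; it was not produced by running iocsearcher.
extracted = {
    ('fqdn', 'www.google.com'),
    ('fqdn', 'evil.example'),
    ('url', 'http://evil.example/a'),
}
kept = filter_indicators(extracted, drop_domains={'google.com'})
print(sorted(kept))
```

Only the fqdn type is filtered here; a real application would decide per IOC type which lists or classifiers to apply.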
License
iocsearcher is released under the MIT license.
This repository includes Base58 decoding code from the monero-python project. That code is located in the iocsearcher/monero folder and it is licensed under BSD 3-Clause.
References
The design and evaluation of iocsearcher and the comparison with prior IOC extraction tools are detailed in our FGCS journal paper:
Juan Caballero, Gibran Gomez, Srdjan Matic, Gustavo Sánchez, Silvia Sebastián, and Arturo Villacañas.
GoodFATR: A Platform for Automated Threat Report Collection and IOC Extraction.
In Future Generation Computer Systems, 2023.
Contributors
The main developer and maintainer of iocsearcher is Juan Caballero. Other members of the MaliciaLab at the IMDEA Software Institute have contributed fixes and helped with testing: Gibran Gomez, Silvia Sebastian, and Srdjan Matic.