ELF Binary Analysis Tool - Label malware or benignware datasets
Project description
ELF Binary Labeler
A powerful Python tool for analyzing and labeling ELF binary datasets, designed for malware and benignware classification. This tool extracts comprehensive metadata from binary files including CPU architecture, endianness, packing information, and malware family classification.
Features
-
Dual Mode Operation
- Malware Mode: Analyze VirusTotal JSON reports combined with binary files
- Benignware Mode: Direct binary analysis without JSON reports
-
Comprehensive Binary Analysis
- ELF header information (CPU, architecture, endianness, file type)
- Binary metadata (bits, load segments, section headers)
- File hashing (MD5, SHA256)
- Packing detection using DiE (Detect It Easy)
- Malware family classification using AVClass
-
Performance Optimized
- Multi-process parallel processing
- Progress tracking with tqdm
- Efficient single-pass file reading
-
Modern Architecture
- Modular design with separation of concerns
- Factory pattern for extensibility
- Abstract base class for easy extension
- Managed by modern Python tooling (uv, pyproject.toml)
Prerequisites
Required Tools
-
Python 3.10+
-
DiE (Detect It Easy) - for packing detection
- Download from: https://github.com/horsicq/Detect-It-Easy
- Ensure
dieccommand is available in PATH
-
AVClass - for malware family classification (malware mode)
- Automatically installed via Python dependencies
- Or manually install:
pip install avclass-malicialab
Installation
Method 1: Install from PyPI (Recommended)
pip install pyelflabeler
After installation, you can run the tool using the pyelflabeler command:
pyelflabeler --help
Method 2: Install from source with uv
uv is a fast Python package installer and resolver.
-
Install uv:
curl -LsSf https://astral.sh/uv/install.sh | sh
-
Clone and install:
git clone https://github.com/louiskyee/pyelflabeler.git cd pyelflabeler uv sync
-
Run the tool:
uv run pyelflabeler --help # Or use Python module directly uv run python -m src.main --help
Method 3: Install from source with pip
-
Clone this repository:
git clone https://github.com/louiskyee/pyelflabeler.git cd pyelflabeler
-
Install in editable mode:
pip install -e .
-
Verify installation:
pyelflabeler --help diec --version
Usage
Malware Mode
Analyze VirusTotal JSON reports combined with binary files:
pyelflabeler --mode malware \
-i /path/to/json_reports \
-b /path/to/malware/binaries \
-o malware_output.csv
Expected Directory Structure:
Both JSON reports and binaries are organized by SHA256 hash prefix:
/path/to/json_reports/
├── 00/
│ ├── 0000002158d35c2bb5e7d96a39ff464ea4c83de8c5fd72094736f79125aaca11.json
│ ├── 0000002a10959ec38b808d8252eed2e814294fbb25d2cd016b24bf853a44857e.json
│ └── ...
├── 01/
│ └── ...
└── ...
/path/to/malware/binaries/
├── 00/
│ ├── 0000002158d35c2bb5e7d96a39ff464ea4c83de8c5fd72094736f79125aaca11
│ ├── 0000002a10959ec38b808d8252eed2e814294fbb25d2cd016b24bf853a44857e
│ └── ...
├── 01/
│ └── ...
└── ...
Files are organized in subdirectories named by the first two characters of their SHA256 hash.
Benignware Mode
Analyze binary files directly without JSON reports:
pyelflabeler --mode benignware \
-b /path/to/benignware/binaries \
-o benignware_output.csv
Command Line Options
| Option | Short | Description | Required |
|---|---|---|---|
--mode |
-m |
Analysis mode: malware or benignware |
No (default: malware) |
--input_folder |
-i |
Folder containing JSON reports | Yes (malware mode only) |
--binary_folder |
-b |
Folder containing binary files | Yes (both modes) |
--output |
-o |
Output CSV file path | No (auto-generated) |
Output Format
The tool generates a CSV file with the following columns:
| Column | Description |
|---|---|
file_name |
SHA256 hash of the binary |
md5 |
MD5 hash |
label |
Classification: Malware or Benignware |
file_type |
ELF file type (EXEC, DYN, REL, CORE) |
CPU |
CPU architecture (e.g., x86-64, ARM) |
bits |
Binary bits (32 or 64) |
endianness |
Byte order (little/big endian) |
load_segments |
Number of PT_LOAD segments |
is_stripped |
Whether symbol table is stripped (True/False) |
has_section_name |
Whether section headers exist |
family |
Malware family (malware mode only) |
first_seen |
First seen timestamp (malware mode) |
size |
File size in bytes |
diec_is_packed |
Whether binary is packed (True/False) |
diec_packer_info |
Packer name and version |
diec_packing_method |
Packing method details |
Example Output
file_name,md5,label,file_type,CPU,bits,endianness,load_segments,has_section_name,family,first_seen,size,diec_is_packed,diec_packer_info,diec_packing_method
01a2b3c4...,5e6f7g8h...,Malware,EXEC,Advanced Micro Devices X86-64,64,2's complement little endian,2,True,mirai,2024-01-15,45678,True,UPX(3.95),NRV
Error Handling
- Errors and warnings are logged to
{output_filename}_errors.log - Failed file analyses continue processing remaining files
- Detailed debug information available in log files
Performance
- High-speed parallel processing utilizing all available CPU cores
- Optimized single-pass file reading for ELF analysis
- Progress bars for real-time status updates
Project Structure
The project follows modern Python best practices with a modular architecture:
dataset_labeler/
├── main.py # CLI entry point
├── pyproject.toml # Project configuration (uv)
├── requirements.txt # Legacy pip support
├── src/
│ ├── main.py # Main CLI logic
│ ├── config.py # Configuration management
│ ├── constants.py # CSV field definitions
│ ├── factory.py # Factory pattern for analyzer creation
│ ├── analyzers/
│ │ ├── base_analyzer.py # Abstract base class
│ │ ├── malware_analyzer.py # Malware analysis
│ │ └── benignware_analyzer.py # Benignware analysis
│ └── utils/
│ ├── elf_utils.py # ELF binary utilities
│ ├── hash_utils.py # File hashing
│ └── packer_utils.py # Packer detection & AVClass
└── tests/ # Unit tests (coming soon)
Extensibility
Adding a new analyzer type is straightforward:
- Create a new analyzer class in
src/analyzers/inheriting fromBaseAnalyzer - Implement
collect_files()andprocess_single_file()methods - Register it in the factory (
src/factory.py)
Example:
from src.analyzers.base_analyzer import BaseAnalyzer
class CustomAnalyzer(BaseAnalyzer):
def collect_files(self):
# Your implementation
pass
def process_single_file(self, file_path):
# Your implementation
pass
Troubleshooting
Common Issues
-
"AVClass not found"
- Ensure AVClass is installed and in your PATH
- Malware mode requires AVClass for family classification
-
"readelf failed"
- Verify binutils is installed:
which readelf - Some non-ELF files will skip readelf analysis
- Verify binutils is installed:
-
"diec command failed"
- Ensure DiE is properly installed
- Check
diecis accessible:which diec
-
Permission Denied
- Ensure read permissions on input directories
- Ensure write permissions for output CSV location
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (
git checkout -b feature/AmazingFeature) - Commit your changes (
git commit -m 'Add some AmazingFeature') - Push to the branch (
git push origin feature/AmazingFeature) - Open a Pull Request
License
This project is open source and available under the MIT License.
Citation
If you use this tool in your research, please cite:
@software{pyelflabeler,
title={PyELFLabeler: A Tool for ELF Binary Dataset Analysis},
author={louiskyee},
year={2024},
url={https://github.com/louiskyee/pyelflabeler}
}
Acknowledgments
- AVClass - Malware family classification
- Detect It Easy - Packer detection
- tqdm - Progress bars
Contact
For questions, issues, or suggestions, please open an issue on GitHub.
Note: This tool is designed for security research and educational purposes. Use responsibly and ethically.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file pyelflabeler-0.3.1.tar.gz.
File metadata
- Download URL: pyelflabeler-0.3.1.tar.gz
- Upload date:
- Size: 17.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.8.5
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
f68cd95b7caeda3d73e9b42b6afabfd469170a04f146049e09bb83355732cf9f
|
|
| MD5 |
65a8160703683eca4e53131fec3874fd
|
|
| BLAKE2b-256 |
d82556ddb77a538814bc811ffc27e8c353daaba2ae72047c2822158cd47e6a80
|
File details
Details for the file pyelflabeler-0.3.1-py3-none-any.whl.
File metadata
- Download URL: pyelflabeler-0.3.1-py3-none-any.whl
- Upload date:
- Size: 17.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.8.5
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
8e81fa59e201784d4e1aa9700eae25e5c83c96dfcc55aca70573cd5c71239ce1
|
|
| MD5 |
4dc9a8d53267e4e23dc6857fece7d5c8
|
|
| BLAKE2b-256 |
5f0ff198b5a41b6da3153b78a401846fde30af8a4a178ef9c04fc94a6fa57f75
|