Skip to main content

ELF Binary Analysis Tool - Label malware or benignware datasets

Project description

ELF Binary Labeler

中文版本

A powerful Python tool for analyzing and labeling ELF binary datasets, designed for malware and benignware classification. This tool extracts comprehensive metadata from binary files including CPU architecture, endianness, packing information, and malware family classification.

Features

  • Dual Mode Operation

    • Malware Mode: Analyze VirusTotal JSON reports combined with binary files
    • Benignware Mode: Direct binary analysis without JSON reports
  • Comprehensive Binary Analysis

    • ELF header information (CPU, architecture, endianness, file type)
    • Binary metadata (bits, load segments, section headers)
    • File hashing (MD5, SHA256)
    • Packing detection using DiE (Detect It Easy)
    • Malware family classification using AVClass
  • Performance Optimized

    • Multi-process parallel processing
    • Progress tracking with tqdm
    • Efficient single-pass file reading
  • Modern Architecture

    • Modular design with separation of concerns
    • Factory pattern for extensibility
    • Abstract base class for easy extension
    • Managed by modern Python tooling (uv, pyproject.toml)

Prerequisites

Required Tools

  1. Python 3.10+

  2. DiE (Detect It Easy) - for packing detection

  3. AVClass - for malware family classification (malware mode)

    • Automatically installed via Python dependencies
    • Or manually install: pip install avclass-malicialab

Installation

Method 1: Install from PyPI (Recommended)

pip install pyelflabeler

After installation, you can run the tool using the pyelflabeler command:

pyelflabeler --help

Method 2: Install from source with uv

uv is a fast Python package installer and resolver.

  1. Install uv:

    curl -LsSf https://astral.sh/uv/install.sh | sh
    
  2. Clone and install:

    git clone https://github.com/louiskyee/pyelflabeler.git
    cd pyelflabeler
    uv sync
    
  3. Run the tool:

    uv run pyelflabeler --help
    # Or use Python module directly
    uv run python -m src.main --help
    

Method 3: Install from source with pip

  1. Clone this repository:

    git clone https://github.com/louiskyee/pyelflabeler.git
    cd pyelflabeler
    
  2. Install in editable mode:

    pip install -e .
    
  3. Verify installation:

    pyelflabeler --help
    diec --version
    

Usage

Malware Mode

Analyze VirusTotal JSON reports combined with binary files:

pyelflabeler --mode malware \
    -i /path/to/json_reports \
    -b /path/to/malware/binaries \
    -o malware_output.csv

Expected Directory Structure:

Both JSON reports and binaries are organized by SHA256 hash prefix:

/path/to/json_reports/
├── 00/
│   ├── 0000002158d35c2bb5e7d96a39ff464ea4c83de8c5fd72094736f79125aaca11.json
│   ├── 0000002a10959ec38b808d8252eed2e814294fbb25d2cd016b24bf853a44857e.json
│   └── ...
├── 01/
│   └── ...
└── ...

/path/to/malware/binaries/
├── 00/
│   ├── 0000002158d35c2bb5e7d96a39ff464ea4c83de8c5fd72094736f79125aaca11
│   ├── 0000002a10959ec38b808d8252eed2e814294fbb25d2cd016b24bf853a44857e
│   └── ...
├── 01/
│   └── ...
└── ...

Files are organized in subdirectories named by the first two characters of their SHA256 hash.

Benignware Mode

Analyze binary files directly without JSON reports:

pyelflabeler --mode benignware \
    -b /path/to/benignware/binaries \
    -o benignware_output.csv

Command Line Options

Option Short Description Required
--mode -m Analysis mode: malware or benignware No (default: malware)
--input_folder -i Folder containing JSON reports Yes (malware mode only)
--binary_folder -b Folder containing binary files Yes (both modes)
--output -o Output CSV file path No (auto-generated)

Output Format

The tool generates a CSV file with the following columns:

Column Description
file_name SHA256 hash of the binary
md5 MD5 hash
label Classification: Malware or Benignware
file_type ELF file type (EXEC, DYN, REL, CORE)
CPU CPU architecture (e.g., x86-64, ARM)
bits Binary bits (32 or 64)
endianness Byte order (little/big endian)
load_segments Number of PT_LOAD segments
is_stripped Whether symbol table is stripped (True/False)
has_section_name Whether section headers exist
family Malware family (malware mode only)
first_seen First seen timestamp (malware mode)
size File size in bytes
diec_is_packed Whether binary is packed (True/False)
diec_packer_info Packer name and version
diec_packing_method Packing method details

Example Output

file_name,md5,label,file_type,CPU,bits,endianness,load_segments,has_section_name,family,first_seen,size,diec_is_packed,diec_packer_info,diec_packing_method
01a2b3c4...,5e6f7g8h...,Malware,EXEC,Advanced Micro Devices X86-64,64,2's complement little endian,2,True,mirai,2024-01-15,45678,True,UPX(3.95),NRV

Error Handling

  • Errors and warnings are logged to {output_filename}_errors.log
  • Failed file analyses continue processing remaining files
  • Detailed debug information available in log files

Performance

  • High-speed parallel processing utilizing all available CPU cores
  • Optimized single-pass file reading for ELF analysis
  • Progress bars for real-time status updates

Project Structure

The project follows modern Python best practices with a modular architecture:

dataset_labeler/
├── main.py                    # CLI entry point
├── pyproject.toml             # Project configuration (uv)
├── requirements.txt           # Legacy pip support
├── src/
│   ├── main.py                # Main CLI logic
│   ├── config.py              # Configuration management
│   ├── constants.py           # CSV field definitions
│   ├── factory.py             # Factory pattern for analyzer creation
│   ├── analyzers/
│   │   ├── base_analyzer.py       # Abstract base class
│   │   ├── malware_analyzer.py    # Malware analysis
│   │   └── benignware_analyzer.py # Benignware analysis
│   └── utils/
│       ├── elf_utils.py       # ELF binary utilities
│       ├── hash_utils.py      # File hashing
│       └── packer_utils.py    # Packer detection & AVClass
└── tests/                     # Unit tests (coming soon)

Extensibility

Adding a new analyzer type is straightforward:

  1. Create a new analyzer class in src/analyzers/ inheriting from BaseAnalyzer
  2. Implement collect_files() and process_single_file() methods
  3. Register it in the factory (src/factory.py)

Example:

from src.analyzers.base_analyzer import BaseAnalyzer

class CustomAnalyzer(BaseAnalyzer):
    def collect_files(self):
        # Your implementation
        pass

    def process_single_file(self, file_path):
        # Your implementation
        pass

Troubleshooting

Common Issues

  1. "AVClass not found"

    • Ensure AVClass is installed and in your PATH
    • Malware mode requires AVClass for family classification
  2. "readelf failed"

    • Verify binutils is installed: which readelf
    • Some non-ELF files will skip readelf analysis
  3. "diec command failed"

    • Ensure DiE is properly installed
    • Check diec is accessible: which diec
  4. Permission Denied

    • Ensure read permissions on input directories
    • Ensure write permissions for output CSV location

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

License

This project is open source and available under the MIT License.

Citation

If you use this tool in your research, please cite:

@software{pyelflabeler,
  title={PyELFLabeler: A Tool for ELF Binary Dataset Analysis},
  author={louiskyee},
  year={2024},
  url={https://github.com/louiskyee/pyelflabeler}
}

Acknowledgments

Contact

For questions, issues, or suggestions, please open an issue on GitHub.


Note: This tool is designed for security research and educational purposes. Use responsibly and ethically.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pyelflabeler-0.3.1.tar.gz (17.0 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

pyelflabeler-0.3.1-py3-none-any.whl (17.8 kB view details)

Uploaded Python 3

File details

Details for the file pyelflabeler-0.3.1.tar.gz.

File metadata

  • Download URL: pyelflabeler-0.3.1.tar.gz
  • Upload date:
  • Size: 17.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.8.5

File hashes

Hashes for pyelflabeler-0.3.1.tar.gz
Algorithm Hash digest
SHA256 f68cd95b7caeda3d73e9b42b6afabfd469170a04f146049e09bb83355732cf9f
MD5 65a8160703683eca4e53131fec3874fd
BLAKE2b-256 d82556ddb77a538814bc811ffc27e8c353daaba2ae72047c2822158cd47e6a80

See more details on using hashes here.

File details

Details for the file pyelflabeler-0.3.1-py3-none-any.whl.

File metadata

File hashes

Hashes for pyelflabeler-0.3.1-py3-none-any.whl
Algorithm Hash digest
SHA256 8e81fa59e201784d4e1aa9700eae25e5c83c96dfcc55aca70573cd5c71239ce1
MD5 4dc9a8d53267e4e23dc6857fece7d5c8
BLAKE2b-256 5f0ff198b5a41b6da3153b78a401846fde30af8a4a178ef9c04fc94a6fa57f75

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page