
ELF Binary Analysis Tool - Label malware or benignware datasets


ELF Binary Labeler

Chinese version

A Python tool for analyzing and labeling ELF binary datasets for malware and benignware classification. It extracts metadata from each binary, including CPU architecture, endianness, packing information, and malware family.

Features

  • Dual Mode Operation

    • Malware Mode: Analyze VirusTotal JSON reports combined with binary files
    • Benignware Mode: Direct binary analysis without JSON reports
  • Comprehensive Binary Analysis

    • ELF header information (CPU, architecture, endianness, file type)
    • Binary metadata (bits, load segments, section headers)
    • File hashing (MD5, SHA256)
    • Packing detection using DiE (Detect It Easy)
    • Malware family classification using AVClass
  • Performance Optimized

    • Multi-process parallel processing
    • Progress tracking with tqdm
    • Efficient single-pass file reading
  • Modern Architecture

    • Modular design with separation of concerns
    • Factory pattern for extensibility
    • Abstract base class for easy extension
    • Managed by modern Python tooling (uv, pyproject.toml)
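As an illustration of the single-pass reading and hashing features listed above, the sketch below computes MD5 and SHA256 while reading a file only once. It is not taken from the project's hash_utils.py; the function name hash_file and the chunk size are assumptions made for this example.

```python
import hashlib
from pathlib import Path


def hash_file(path: Path, chunk_size: int = 1 << 20) -> dict[str, str]:
    """Compute MD5 and SHA256 of a file while reading it only once."""
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    with open(path, "rb") as handle:
        # Feed every chunk to both digests so the file is read a single time.
        while chunk := handle.read(chunk_size):
            md5.update(chunk)
            sha256.update(chunk)
    return {"md5": md5.hexdigest(), "sha256": sha256.hexdigest()}
```

Reading in fixed-size chunks keeps memory use bounded even for large binaries.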

Prerequisites

Required Tools

  1. Python 3.10+

  2. DiE (Detect It Easy) - for packing detection

  3. AVClass - for malware family classification (malware mode)

    • Automatically installed via Python dependencies
    • Or manually install: pip install avclass-malicialab

Installation

Method 1: Install from PyPI (Recommended)

pip install pyelflabeler

After installation, you can run the tool using the pyelflabeler command:

pyelflabeler --help

Method 2: Install from source with uv

uv is a fast Python package installer and resolver.

  1. Install uv:

    curl -LsSf https://astral.sh/uv/install.sh | sh
    
  2. Clone and install:

    git clone https://github.com/bolin8017/pyelflabeler.git
    cd pyelflabeler
    uv sync
    
  3. Run the tool:

    uv run pyelflabeler --help
    # Or use Python module directly
    uv run python -m src.main --help
    

Method 3: Install from source with pip

  1. Clone this repository:

    git clone https://github.com/bolin8017/pyelflabeler.git
    cd pyelflabeler
    
  2. Install in editable mode:

    pip install -e .
    
  3. Verify installation:

    pyelflabeler --help
    diec --version
    

Usage

Malware Mode

Analyze VirusTotal JSON reports combined with binary files:

pyelflabeler --mode malware \
    -i /path/to/json_reports \
    -b /path/to/malware/binaries \
    -o malware_output.csv

Expected Directory Structure:

Both JSON reports and binaries are organized by SHA256 hash prefix:

/path/to/json_reports/
├── 00/
│   ├── 0000002158d35c2bb5e7d96a39ff464ea4c83de8c5fd72094736f79125aaca11.json
│   ├── 0000002a10959ec38b808d8252eed2e814294fbb25d2cd016b24bf853a44857e.json
│   └── ...
├── 01/
│   └── ...
└── ...

/path/to/malware/binaries/
├── 00/
│   ├── 0000002158d35c2bb5e7d96a39ff464ea4c83de8c5fd72094736f79125aaca11
│   ├── 0000002a10959ec38b808d8252eed2e814294fbb25d2cd016b24bf853a44857e
│   └── ...
├── 01/
│   └── ...
└── ...

Files are organized in subdirectories named by the first two characters of their SHA256 hash.
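The layout above can be navigated with a few lines of Python. This is a minimal sketch, assuming each report is named <sha256>.json and its binary is named <sha256> with no extension, as in the tree shown; sharded_path and pair_reports_with_binaries are hypothetical helper names, not part of the tool.

```python
from pathlib import Path
from typing import Iterator


def sharded_path(root: Path, sha256: str, suffix: str = "") -> Path:
    """Return root/<first two hash characters>/<hash><suffix>."""
    return root / sha256[:2] / f"{sha256}{suffix}"


def pair_reports_with_binaries(
    json_root: Path, binary_root: Path
) -> Iterator[tuple[Path, Path]]:
    """Yield (report, binary) pairs for every report whose binary exists."""
    for report in sorted(json_root.glob("*/*.json")):
        # The report stem is the SHA256 hash, which also names the binary.
        binary = sharded_path(binary_root, report.stem)
        if binary.is_file():
            yield report, binary
```

Reports without a matching binary are skipped here; whether the tool skips or logs them is not stated in this document.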

Benignware Mode

Analyze binary files directly without JSON reports:

pyelflabeler --mode benignware \
    -b /path/to/benignware/binaries \
    -o benignware_output.csv

Command Line Options

Option           Short  Description                            Required
--mode           -m     Analysis mode: malware or benignware   No (default: malware)
--input_folder   -i     Folder containing JSON reports         Yes (malware mode only)
--binary_folder  -b     Folder containing binary files         Yes (both modes)
--output         -o     Output CSV file path                   No (auto-generated)
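For readers who want to mirror this interface in their own scripts, the options table maps onto argparse roughly as follows. This is a sketch under assumptions, not the tool's actual parser: the help strings and the mode-dependent check on --input_folder are reconstructed from the table.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the four options from the table above."""
    parser = argparse.ArgumentParser(prog="pyelflabeler")
    parser.add_argument("-m", "--mode", choices=["malware", "benignware"],
                        default="malware", help="Analysis mode")
    parser.add_argument("-i", "--input_folder",
                        help="Folder containing JSON reports (malware mode)")
    parser.add_argument("-b", "--binary_folder", required=True,
                        help="Folder containing binary files")
    parser.add_argument("-o", "--output",
                        help="Output CSV file path (auto-generated if omitted)")
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    # argparse cannot express "required only in one mode", so check by hand.
    if args.mode == "malware" and args.input_folder is None:
        parser.error("--input_folder is required in malware mode")
    return args
```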

Output Format

The tool generates a CSV file with the following columns:

Column               Description
file_name            SHA256 hash of the binary
md5                  MD5 hash
label                Classification: Malware or Benignware
file_type            ELF file type (EXEC, DYN, REL, CORE)
CPU                  CPU architecture as reported in the ELF header (e.g., Advanced Micro Devices X86-64, ARM)
bits                 Binary bits (32 or 64)
endianness           Byte order (e.g., 2's complement little endian)
load_segments        Number of PT_LOAD segments
is_stripped          Whether the symbol table is stripped (True/False)
has_section_name     Whether section headers exist (True/False)
family               Malware family (malware mode only)
first_seen           First-seen timestamp (malware mode only)
size                 File size in bytes
diec_is_packed       Whether the binary is packed (True/False)
diec_packer_info     Packer name and version
diec_packing_method  Packing method details
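Several of these columns (file_type, CPU, bits, endianness) come straight from the first 20 bytes of the ELF header. The sketch below decodes them by hand to show where the values originate. It is not the tool's implementation, which, judging by the Troubleshooting section, relies on readelf. The short labels and the small machine table are assumptions for illustration, and they differ from the longer readelf-style strings in the example output.

```python
import struct

ELF_MAGIC = b"\x7fELF"
FILE_TYPES = {1: "REL", 2: "EXEC", 3: "DYN", 4: "CORE"}
# A few common e_machine values; real tables are much longer.
MACHINES = {3: "x86", 8: "MIPS", 40: "ARM", 62: "x86-64", 183: "AArch64"}


def parse_elf_header(data: bytes) -> dict:
    """Decode class, byte order, type and machine from an ELF header."""
    if len(data) < 20 or data[:4] != ELF_MAGIC:
        raise ValueError("not an ELF file")
    ei_class, ei_data = data[4], data[5]  # EI_CLASS and EI_DATA
    if ei_class not in (1, 2) or ei_data not in (1, 2):
        raise ValueError("invalid ELF identification bytes")
    # e_type and e_machine are 16-bit fields stored in the file's byte order.
    byte_order = "<" if ei_data == 1 else ">"
    e_type, e_machine = struct.unpack_from(byte_order + "HH", data, 16)
    return {
        "bits": 32 if ei_class == 1 else 64,
        "endianness": "little" if ei_data == 1 else "big",
        "file_type": FILE_TYPES.get(e_type, f"unknown({e_type})"),
        "CPU": MACHINES.get(e_machine, f"unknown({e_machine})"),
    }
```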

Example Output

file_name,md5,label,file_type,CPU,bits,endianness,load_segments,is_stripped,has_section_name,family,first_seen,size,diec_is_packed,diec_packer_info,diec_packing_method
01a2b3c4...,5e6f7a8b...,Malware,EXEC,Advanced Micro Devices X86-64,64,2's complement little endian,2,True,True,mirai,2024-01-15,45678,True,UPX(3.95),NRV
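Because the output is plain CSV, it can be post-processed with the standard library alone. The sketch below counts packed samples per malware family; it assumes the column names listed above and that boolean columns are written as the strings True and False, as in the example row. The function name count_packed_by_family is made up for this example.

```python
import csv
from collections import Counter


def count_packed_by_family(handle) -> Counter:
    """Count rows flagged as packed, grouped by the family column."""
    counts: Counter = Counter()
    for row in csv.DictReader(handle):
        # Boolean columns arrive as text, so compare against "True".
        if row["diec_is_packed"] == "True":
            counts[row["family"] or "unknown"] += 1
    return counts
```

Typical usage would be to open the CSV with open(path, newline="") and pass the file object to the function.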

Error Handling

  • Errors and warnings are logged to {output_filename}_errors.log
  • A failure on one file does not stop the run; the remaining files are still processed
  • Detailed debug information is available in the log files
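The log-and-continue behavior described above can be sketched as follows. This is an illustration, not the tool's code: analyze_all and error_log_path are hypothetical names, and the assumption that the log name is built from the CSV name without its extension is a guess at what {output_filename} means.

```python
import logging
from pathlib import Path
from typing import Callable, Iterable


def error_log_path(output_csv: Path) -> Path:
    # Assumption: the log sits next to the CSV and reuses its stem.
    return output_csv.with_name(f"{output_csv.stem}_errors.log")


def analyze_all(
    paths: Iterable[Path],
    analyze: Callable[[Path], dict],
    logger: logging.Logger,
) -> list[dict]:
    """Run analyze on every path; log failures instead of aborting."""
    rows = []
    for path in paths:
        try:
            rows.append(analyze(path))
        except Exception:
            # Record the traceback and move on to the next file.
            logger.exception("analysis failed for %s", path)
    return rows
```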

Performance

  • Parallel processing across all available CPU cores
  • Optimized single-pass file reading for ELF analysis
  • Progress bars for real-time status updates
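A common way to get this kind of per-file parallelism in Python is a process pool. The sketch below uses concurrent.futures from the standard library; whether the tool uses this module or multiprocessing directly is not stated here, and the function name, parameters, and chunk size are assumptions. The executor class is a parameter so the same code can run on threads for quick experiments.

```python
import os
from concurrent.futures import ProcessPoolExecutor
from typing import Callable, Sequence


def analyze_in_parallel(
    paths: Sequence,
    worker: Callable,
    executor_cls=ProcessPoolExecutor,
    max_workers=None,
) -> list:
    """Apply worker to every path using a pool of workers."""
    # Default to one worker per CPU core when no limit is given.
    with executor_cls(max_workers=max_workers or os.cpu_count()) as pool:
        # map() returns results in input order; chunksize cuts IPC overhead.
        return list(pool.map(worker, paths, chunksize=16))
```

With a process pool, worker must be a module-level function so it can be pickled, and the call should sit under an if __name__ == "__main__": guard. A progress bar can be added by wrapping the pool.map(...) iterator in tqdm with total=len(paths).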

Project Structure

The project follows modern Python best practices with a modular architecture:

dataset_labeler/
├── main.py                    # CLI entry point
├── pyproject.toml             # Project configuration (uv)
├── requirements.txt           # Legacy pip support
├── src/
│   ├── main.py                # Main CLI logic
│   ├── config.py              # Configuration management
│   ├── constants.py           # CSV field definitions
│   ├── factory.py             # Factory pattern for analyzer creation
│   ├── analyzers/
│   │   ├── base_analyzer.py       # Abstract base class
│   │   ├── malware_analyzer.py    # Malware analysis
│   │   └── benignware_analyzer.py # Benignware analysis
│   └── utils/
│       ├── elf_utils.py       # ELF binary utilities
│       ├── hash_utils.py      # File hashing
│       └── packer_utils.py    # Packer detection & AVClass
└── tests/                     # Unit tests (coming soon)

Extensibility

Adding a new analyzer type is straightforward:

  1. Create a new analyzer class in src/analyzers/ inheriting from BaseAnalyzer
  2. Implement collect_files() and process_single_file() methods
  3. Register it in the factory (src/factory.py)

Example:

from src.analyzers.base_analyzer import BaseAnalyzer

class CustomAnalyzer(BaseAnalyzer):
    def collect_files(self):
        # Your implementation
        pass

    def process_single_file(self, file_path):
        # Your implementation
        pass
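Step 3, registering the analyzer in the factory, is not shown above. One possible shape for such a registry is sketched below. It is self-contained, so BaseAnalyzer here is a stand-in for the real class in src/analyzers/base_analyzer.py, and register and create_analyzer are hypothetical names. The actual src/factory.py and the real constructor arguments may look different.

```python
from abc import ABC, abstractmethod


class BaseAnalyzer(ABC):
    """Stand-in for the project's BaseAnalyzer (real signature may differ)."""

    @abstractmethod
    def collect_files(self): ...

    @abstractmethod
    def process_single_file(self, file_path): ...


_REGISTRY: dict[str, type[BaseAnalyzer]] = {}


def register(mode: str):
    """Class decorator that maps a mode name to an analyzer class."""
    def decorator(cls):
        _REGISTRY[mode] = cls
        return cls
    return decorator


def create_analyzer(mode: str, *args, **kwargs) -> BaseAnalyzer:
    """Instantiate the analyzer registered for mode."""
    try:
        cls = _REGISTRY[mode]
    except KeyError:
        raise ValueError(f"unknown mode: {mode!r}") from None
    return cls(*args, **kwargs)


@register("custom")
class CustomAnalyzer(BaseAnalyzer):
    def collect_files(self):
        return []  # placeholder: discover input files here

    def process_single_file(self, file_path):
        return {"file_name": str(file_path)}  # placeholder result row
```

A registry keeps the factory closed to modification: adding a mode means adding a class, not editing a chain of if statements.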

Troubleshooting

Common Issues

  1. "AVClass not found"

    • Ensure AVClass is installed and in your PATH
    • Malware mode requires AVClass for family classification
  2. "readelf failed"

    • Verify binutils is installed: which readelf
    • readelf analysis is skipped for files that are not valid ELF binaries
  3. "diec command failed"

    • Ensure DiE is properly installed
    • Check diec is accessible: which diec
  4. Permission Denied

    • Ensure read permissions on input directories
    • Ensure write permissions for output CSV location
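The PATH checks for the first three issues can be automated before a long run. The sketch below is a small pre-flight helper, not part of the tool. The executable names diec and readelf come from this section, while avclass as the AVClass command name is an assumption.

```python
import shutil


def missing_tools(names):
    """Return the subset of names that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    # "avclass" is assumed to be the AVClass executable name.
    missing = missing_tools(["diec", "readelf", "avclass"])
    if missing:
        print("Missing external tools:", ", ".join(missing))
```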

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

License

This project is open source and available under the MIT License.

Citation

If you use this tool in your research, please cite:

@software{pyelflabeler,
  title={PyELFLabeler: A Tool for ELF Binary Dataset Analysis},
  author={bolin8017},
  year={2025},
  url={https://github.com/bolin8017/pyelflabeler}
}

Contact

For questions, issues, or suggestions, please open an issue on GitHub.


Note: This tool is designed for security research and educational purposes. Use responsibly and ethically.
