Recursive extraction and parsing of firmware and partition images

REAP

Recursive Extraction And Parsing — a general-purpose CLI tool for identifying and recursively extracting firmware and partition images. Works with raw eMMC/flash dumps, individual partition images, full-disk images (GPT or Rockchip PARM), and forensic disk images from a wide range of embedded Linux and Android devices.

Pure Python. No root, no FUSE, no mounting, no Linux kernel modules. Runs on macOS, Linux, and Windows.

What it does

Point it at a directory of partition .bin files, a single image, or a set of 7z archives and it will:

  1. Identify each image's format via magic bytes (37 format signatures)
  2. Annotate each partition's role (boot, recovery, system, userdata, etc.) by reading ext4 superblock metadata and analyzing ramdisk contents
  3. Extract contents recursively -- e.g. a boot image yields a kernel + ramdisk; the ramdisk decompresses to a cpio archive; the cpio extracts to a filesystem tree
  4. Analyze kernels (version, config, build paths, kallsyms symbol table), bootloaders (U-Boot environment, embedded DTBs), and unknown partitions (forensic hex dump, strings, SHA256)
  5. Report everything found in human-readable text and machine-readable JSON, including FBE encryption detection
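The identification step can be pictured as a lookup over (offset, magic) pairs. The following is a minimal sketch, not the tool's actual identify.py: the `identify` function, the `SIGNATURES` list, and the format names are hypothetical, and only a handful of the signatures listed under "Supported formats" are included.

```python
import struct

# (format name, offset, magic bytes). Names and tuple layout are assumptions
# made for illustration; the magic values come from the tables in this README.
SIGNATURES = [
    ("android_boot", 0, b"ANDROID!"),
    ("gzip", 0, b"\x1f\x8b"),
    ("xz", 0, b"\xfd7zXZ\x00"),
    ("dtb", 0, struct.pack(">I", 0xD00DFEED)),    # DTB magic is stored big-endian
    ("ext4", 0x438, struct.pack("<H", 0xEF53)),   # superblock magic is little-endian
]


def identify(data: bytes) -> str:
    """Return the first format whose magic matches at its expected offset."""
    for name, offset, magic in SIGNATURES:
        if data[offset:offset + len(magic)] == magic:
            return name
    return "unknown"
```

A real detector would also need ordering rules and secondary checks for short magics such as gzip's two bytes, which this sketch leaves out.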

Supported formats

Partition tables and disk layouts

| Format | Detection | Extraction |
| --- | --- | --- |
| GPT partition table | EFI PART at offset 0x200 or 0x1000 (UFS 4K sectors) | Individual partition images |
| Rockchip PARM partition table | PARM at offset 0 | Individual partition images (RK29xx/RK3xxx flash dumps) |
| Android super.img (LP metadata) | 0x67446C70 at offset 0x1000 | Logical partition images (system, vendor, product, etc.) |
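The two GPT offsets correspond to LBA 1 with 512-byte and 4096-byte sectors. As a rough sketch of how such a probe might look (the `find_gpt_header` function and its return keys are hypothetical; the 92-byte header layout follows the UEFI specification):

```python
import struct

# GPT header: signature, revision, header size, header CRC, reserved,
# current LBA, backup LBA, first/last usable LBA, disk GUID,
# partition-entry start LBA, entry count, entry size, entries CRC.
GPT_HEADER = struct.Struct("<8sIIIIQQQQ16sQIII")  # 92 bytes


def find_gpt_header(image: bytes):
    """Probe LBA 1 for both sector sizes; return basic geometry or None."""
    for sector_size in (512, 4096):
        raw = image[sector_size:sector_size + GPT_HEADER.size]
        if len(raw) < GPT_HEADER.size or not raw.startswith(b"EFI PART"):
            continue
        fields = GPT_HEADER.unpack(raw)
        return {
            "sector_size": sector_size,
            "entries_lba": fields[10],
            "num_entries": fields[11],
            "entry_size": fields[12],
        }
    return None
```

The partition entry array then starts at `entries_lba * sector_size`. This sketch does not verify the header CRC.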

Android boot and kernel

| Format | Detection | Extraction |
| --- | --- | --- |
| Android Boot Image (v0--v4) | ANDROID! magic | Kernel, ramdisk, second-stage, recovery DTBO, DTB |
| ARM zImage | 0x016F2818 at offset 0x24 | Decompressed vmlinux, kernel config, version string, source paths, kallsyms, all strings |
| ARM64 Image | ARM\x64 at offset 0x38 | Kernel config, version string, source paths, kallsyms, all strings |
| Raw ARM kernel binary | MSR CPSR instruction + Linux version string | Kernel config, version string, source paths, kallsyms, all strings |
| Device Tree Blob (DTB) | 0xD00DFEED | Extracted DTB, optional dtc decompile to DTS |
| DTBO container | 0xD7B7AB1E | Individual DT overlay entries |
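To illustrate how a boot image yields a kernel and ramdisk, here is a hedged sketch of locating the sections in a v0--v2 image from its header. The `boot_image_layout` function is hypothetical and is not the tool's boot_img.py; header versions 3 and 4 use a different field layout and are rejected here rather than parsed.

```python
import struct

# v0--v2 header prefix: magic, kernel size/addr, ramdisk size/addr,
# second size/addr, tags addr, page size, header version, OS version.
BOOT_HEADER = struct.Struct("<8s10I")


def boot_image_layout(image: bytes) -> dict:
    """Return (offset, size) for each section of a v0--v2 boot image."""
    (magic, kernel_size, _kaddr, ramdisk_size, _raddr, second_size,
     _saddr, _tags, page_size, header_version, _os) = BOOT_HEADER.unpack_from(image, 0)
    if magic != b"ANDROID!":
        raise ValueError("not an Android boot image")
    if header_version > 2:
        raise ValueError("v3/v4 headers use a different layout")
    if page_size == 0:
        raise ValueError("invalid page size")

    def pages(n: int) -> int:
        # Each section is padded up to a whole number of pages.
        return (n + page_size - 1) // page_size * page_size

    kernel_off = page_size                      # the header occupies one page
    ramdisk_off = kernel_off + pages(kernel_size)
    second_off = ramdisk_off + pages(ramdisk_size)
    return {
        "header_version": header_version,
        "page_size": page_size,
        "kernel": (kernel_off, kernel_size),
        "ramdisk": (ramdisk_off, ramdisk_size),
        "second": (second_off, second_size),
    }
```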

Bootloaders and firmware

| Format | Detection | Extraction |
| --- | --- | --- |
| U-Boot uImage | 0x27051956 | Unwrapped payload (kernel, ramdisk, firmware, device tree, etc.) |
| U-Boot binary | U-Boot <version> string, 64 KB--4 MB | Default environment, embedded DTBs, strings |
| U-Boot environment | CRC32 + key=value pairs, power-of-2 size | Parsed environment variables |
| Samsung Exynos boot partition | BL1 header pointer + Exynos BL label | bl1.bin, u-boot.bin, tzsw.bin |
| Rockchip KRNL wrapper | KRNL at offset 0 | Unwrapped payload (re-identified as gzip, zImage, etc.) |
| ELF binary | \x7fELF magic | Metadata dump (class, machine, entry point), strings |
| AVB vbmeta | AVB0 / AVBf | Metadata dump (version, algorithm, rollback index, flags) |
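As an example of what unwrapping involves, the legacy uImage format is a 64-byte big-endian header followed by the payload. The sketch below is an illustration under that assumption; `parse_uimage` and the keys it returns are hypothetical names, not the tool's uimage.py.

```python
import struct
import zlib

# Legacy uImage header: magic, header CRC, timestamp, data size, load address,
# entry point, data CRC, then OS / arch / type / compression bytes and a name.
UIMAGE_HEADER = struct.Struct(">7I4B32s")  # 64 bytes, big-endian
UIMAGE_MAGIC = 0x27051956
IMAGE_TYPES = {
    1: "standalone", 2: "kernel", 3: "ramdisk", 4: "multi",
    5: "firmware", 6: "script", 7: "filesystem", 8: "flat device tree",
}


def parse_uimage(image: bytes) -> dict:
    """Split a legacy uImage into header metadata and payload."""
    if len(image) < UIMAGE_HEADER.size:
        raise ValueError("too short for a uImage header")
    (magic, _hcrc, _time, size, load, entry, dcrc,
     _os, _arch, img_type, _comp, name) = UIMAGE_HEADER.unpack_from(image)
    if magic != UIMAGE_MAGIC:
        raise ValueError("bad uImage magic")
    payload = image[UIMAGE_HEADER.size:UIMAGE_HEADER.size + size]
    return {
        "name": name.split(b"\0", 1)[0].decode("ascii", "replace"),
        "type": IMAGE_TYPES.get(img_type, f"unknown ({img_type})"),
        "load_address": load,
        "entry_point": entry,
        "payload": payload,
        "data_crc_ok": zlib.crc32(payload) == dcrc,
    }
```

The `type` value is the kind of field a role annotation could be derived from; the payload would then go back through identification.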

Encrypted firmware containers

| Format | Detection | Extraction |
| --- | --- | --- |
| IM*H firmware container | IM*H at offset 0 or 0x400 | Header parse (version, module name/type, chunk table, key family). Encrypted chunks (RTOS, kernel, TZOS, DTB, etc.) extracted as raw .bin files. Decryption is out of scope for this tool. |
| Ambarella environment (UNR0) | UNR0 + 0x5AA5 flags | Boot config, A/B slot status, firmware versions, bootloader logs |

Filesystems

| Format | Detection | Extraction |
| --- | --- | --- |
| ext4 | 0xEF53 at offset 0x438 | Full filesystem tree with FBE encryption detection |
| FAT12/16/32 | 0xEB/0xE9 + 0x55AA at 510 | Full filesystem tree (LFN support) |
| exFAT | EXFAT OEM ID at offset 3 | Identified only (no extraction yet) |
| EROFS | 0xE0F5E1E2 at offset 0x400 | Identified only (no extraction yet) |
| F2FS | 0xF2F52010 at offset 0x400 | Identified only (no extraction yet) |

Compression and archives

| Format | Detection | Extraction |
| --- | --- | --- |
| gzip | 1F 8B | Decompressed content |
| LZ4 frame | 04 22 4D 18 | Decompressed content |
| LZ4 legacy | 02 21 4C 18 | Decompressed content (Android ramdisk format) |
| LZMA | 5D 00 00 | Decompressed content |
| bzip2 | BZh | Decompressed content |
| XZ | FD 37 7A 58 5A 00 | Decompressed content |
| cpio newc | 070701 / 070702 | Files, directories, symlinks (as text files with -> target) |
| 7z archive | 37 7A BC AF 27 1C | Full decompression (supports split .7z.001 parts) |
| Android sparse image | 0xED26FF3A | Converted to raw image, then re-identified and extracted |
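For the cpio newc row, the archive is a sequence of 110-byte ASCII-hex headers, each followed by a NUL-terminated name and the file data, both padded to 4-byte boundaries. A minimal walker might look like the sketch below; `iter_cpio_newc` is a hypothetical name, and the sketch assumes the archive starts at offset 0 of `data`.

```python
def _align4(n: int) -> int:
    """Round n up to the next multiple of 4."""
    return (n + 3) & ~3


def iter_cpio_newc(data: bytes):
    """Yield (name, mode, content) for each entry until the trailer."""
    pos = 0
    while pos + 110 <= len(data):
        if data[pos:pos + 6] not in (b"070701", b"070702"):
            raise ValueError(f"bad cpio magic at offset {pos}")
        # 13 fields of 8 hex digits each follow the 6-byte magic.
        fields = [int(data[pos + 6 + i * 8:pos + 14 + i * 8], 16) for i in range(13)]
        mode, filesize, namesize = fields[1], fields[6], fields[11]
        name_start = pos + 110
        # namesize counts the terminating NUL byte.
        name = data[name_start:name_start + namesize - 1].decode("utf-8", "replace")
        data_start = _align4(name_start + namesize)
        if name == "TRAILER!!!":
            return
        yield name, mode, data[data_start:data_start + filesize]
        pos = _align4(data_start + filesize)
```

An entry is a symlink when `mode & 0o170000 == 0o120000`; its content is then the link target.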

Device-specific partitions

| Format | Detection | Extraction |
| --- | --- | --- |
| Android devinfo | ANDROID-BOOT! magic | Lock status, tamper flags |
| Qualcomm modemst (EFS) | IMGEFS marker in first 64 bytes | Forensic scan (SHA256, strings, hex dump) |
| BMP image | BM + valid DIB header | Trimmed BMP (strips partition padding) |
| Boot logo container | ASCII count/sizes header + BMP at 0x200 | Individual BMP images |
| Empty / zeroed | All-zero content | Verified-empty marker with likely purpose annotation |
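The forensic scan applied to unknown partitions (SHA256, strings, hex dump) and the all-zero check can be approximated with the standard library alone. This is a sketch only: `forensic_summary`, `hexdump`, their parameters, and the output layout are hypothetical and may differ from what the tool writes.

```python
import hashlib
import re


def hexdump(data: bytes, width: int = 16) -> str:
    """Format bytes as 'offset  hex bytes' lines."""
    lines = []
    for off in range(0, len(data), width):
        chunk = data[off:off + width]
        lines.append(f"{off:08x}  " + " ".join(f"{b:02x}" for b in chunk))
    return "\n".join(lines)


def forensic_summary(data: bytes, min_len: int = 4, dump_bytes: int = 32) -> dict:
    """Hash the blob, pull printable ASCII runs, and dump its first bytes."""
    printable = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return {
        "sha256": hashlib.sha256(data).hexdigest(),
        "all_zero": not any(data),
        "strings": [m.group().decode("ascii") for m in printable.finditer(data)],
        "hexdump": hexdump(data[:dump_bytes]),
    }
```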

Installation

Requires Python 3.10+ (tested with 3.11).

From PyPI:

pip install reap-cli

The PyPI distribution is reap-cli because the bare reap name on PyPI is held by an unrelated, long-abandoned 2012 package. We are pursuing a PEP 541 transfer. The installed CLI command is reap regardless.

From source (for development):

git clone https://gitlab.com/blackbox-research/reap
cd reap
python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

Dependencies (installed automatically):

  • ext4 -- pure-Python ext4 filesystem reader (no FUSE/mounting)
  • lz4 -- LZ4 decompression for Android ramdisks

Third-party plugins that add format handlers or detectors are discovered automatically via the reap.plugins entry point group. See the architecture section below.

Usage

reap <input_path> [options]

input_path can be a single image file, a directory containing partition images, or a set of 7z archives.

Options

Flag Description
-o DIR Output directory (default: <input>_unpacked/)
--identify-only Print format identification only, no extraction
--skip-ext4 Skip ext4 filesystem extraction (useful for huge partitions)
--skip-archives Skip 7z archive extraction
--force-archives Force archive extraction even when physicalImage/ already exists
--no-recursive Don't recurse into extracted children
--max-depth N Maximum recursion depth (default: 10)
-j, --jobs N Parallel extraction workers (0=auto, 1=sequential; default: auto)
-v Verbose output (INFO level)
-vv Debug output
--report text|json|both Report format (default: both)

Examples

Identify all partitions in a dump:

reap ./physicalImage --identify-only

Full extraction (skip large ext4 partitions):

reap ./physicalImage --skip-ext4 -v

Extract a single boot image:

reap boot.img -o ./boot_extracted -v

Extract a directory of 7z archives (split parts supported):

reap ./archives/ -v

Parallel extraction with 4 workers:

reap ./physicalImage -j 4 -v

Output structure

For a boot image, the recursive extraction produces:

boot_unpacked/
    kernel_info.txt          # Kernel analysis summary
    kernel_config.txt        # Build-time .config (if IKCONFIG enabled)
    kernel_source_paths.txt  # Build-time source paths
    kernel_strings.txt       # All embedded ASCII strings
    kallsyms.txt             # Kernel symbol table (if present)
    vmlinux                  # Decompressed kernel binary
    ramdisk_unpacked/
        ramdisk_unpacked/    # cpio filesystem tree
            init
            init.rc
            fstab.*
            sbin/
            ...

For a directory of partitions, you get a subdirectory per partition plus reports:

physicalImage_unpacked/
    report.txt               # Human-readable report
    report.json              # Machine-readable report
    mmcblk0p1/               # boot image contents
    mmcblk0p2/               # DTB contents
    mmcblk0p3/               # recovery image contents
    mmcblk0p4/               # system filesystem tree
    ...

Partition annotation

The tool automatically identifies partition roles by:

  • Reading the ext4 superblock s_last_mounted field (e.g. /system, /data, /cache)
  • Analyzing boot image ramdisks for /sbin/recovery to distinguish boot vs recovery
  • Parsing U-Boot uImage type fields (kernel, ramdisk, firmware, device tree)
  • Parsing IM*H firmware module names and types (bootloader, kernel, RTOS)
  • Recognizing format-specific roles (DTB, vbmeta, DTBO, sparse, super, modemst)
  • Inferring empty partition purpose from size (<=4 MB zeroed = likely misc or metadata)

Annotations appear in reports and verbose output as labels like (recovery), (system), (userdata), etc.
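The first heuristic is easy to reproduce by hand. In the ext4 on-disk layout the superblock starts at byte 1024, its magic sits at offset 0x38, and `s_last_mounted` is a 64-byte NUL-padded field at offset 0x88. A minimal sketch, with `ext4_last_mounted` as a hypothetical helper name:

```python
import struct

SUPERBLOCK_OFFSET = 1024
S_MAGIC_OFFSET = 0x38          # 0x438 from the start of the image
S_LAST_MOUNTED_OFFSET = 0x88   # 64-byte path, NUL padded


def ext4_last_mounted(image: bytes):
    """Return the last mount point recorded in the superblock, or None."""
    sb = image[SUPERBLOCK_OFFSET:SUPERBLOCK_OFFSET + 1024]
    if len(sb) < S_LAST_MOUNTED_OFFSET + 64:
        return None
    (magic,) = struct.unpack_from("<H", sb, S_MAGIC_OFFSET)
    if magic != 0xEF53:
        return None
    raw = sb[S_LAST_MOUNTED_OFFSET:S_LAST_MOUNTED_OFFSET + 64]
    return raw.split(b"\0", 1)[0].decode("utf-8", "replace")
```

An empty string means the field was never populated, in which case other hints are needed.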

FBE encryption detection

When extracting ext4 filesystems with File-Based Encryption (FBE), the tool:

  • Detects the encryption superblock flag and per-inode encryption flags
  • Hex-encodes encrypted filenames for safe extraction
  • Writes encrypted_paths.txt listing all encrypted files and directories
  • Reports encryption algorithms (AES-256-XTS, AES-256-GCM, etc.) in JSON output
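The first two points can be sketched as follows, assuming the standard ext4 constants: the `encrypt` feature is bit 0x10000 of `s_feature_incompat` (superblock offset 0x60), and encrypted inodes carry the 0x800 flag. Both function names are hypothetical, and the hex encoding shown is one plausible reading of "hex-encodes", not necessarily the tool's exact output format.

```python
import struct

EXT4_FEATURE_INCOMPAT_ENCRYPT = 0x10000
EXT4_ENCRYPT_FL = 0x800


def superblock_has_encryption(image: bytes) -> bool:
    """Check the encrypt feature bit in the ext4 superblock."""
    sb = image[1024:2048]
    if len(sb) < 0x64 or struct.unpack_from("<H", sb, 0x38)[0] != 0xEF53:
        return False
    (incompat,) = struct.unpack_from("<I", sb, 0x60)
    return bool(incompat & EXT4_FEATURE_INCOMPAT_ENCRYPT)


def output_name(raw_name: bytes, parent_dir_flags: int) -> str:
    """Pick a filesystem-safe name for a directory entry.

    Names inside an encrypted directory are ciphertext and may contain
    arbitrary bytes, so they are written out as hex.
    """
    if parent_dir_flags & EXT4_ENCRYPT_FL:
        return raw_name.hex()
    return raw_name.decode("utf-8", errors="replace")
```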

Architecture

reap/
    cli.py              # Argument parsing, entry point
    identify.py         # Magic-byte format detection (37 signatures)
    annotate.py         # Partition role inference
    pipeline.py         # Recursive extraction orchestrator (parallel workers)
    report.py           # Text + JSON report generation (FBE-aware)
    handlers/
        __init__.py     # BaseHandler ABC, handler registry
        ambarella_env.py # Ambarella UNR0 boot environment
        avb.py          # AVB vbmeta metadata
        bmp.py          # BMP image (partition padding trim)
        boot_img.py     # Android boot image (v0--v4)
        bootlogo.py     # Boot logo container (multiple BMPs)
        compression.py  # gzip, LZ4, LZMA, bzip2, XZ
        cpio_handler.py # cpio newc archives
        devinfo.py      # Android devinfo (lock status)
        dji_imah.py     # IM*H encrypted firmware container (header parse, encrypted chunks)
        dtb.py          # Device Tree Blob
        dtbo.py         # DTBO container
        elf.py          # ELF binary metadata + strings
        ext4_handler.py # ext4 filesystem (FBE detection, dir_index fallback)
        exynos_boot.py  # Samsung Exynos eMMC boot partition
        fat.py          # FAT12/16/32 filesystem
        gpt.py          # GPT partition table (512-byte + 4K UFS sectors)
        modemst.py      # Qualcomm modem EFS partition
        raw.py          # Empty + unknown fallback (forensic scan)
        raw_kernel.py   # Raw ARM kernel binary
        rk_krnl.py      # Rockchip KRNL wrapper
        rkparm.py       # Rockchip PARM partition table
        seven_zip.py    # 7z archive (split-part support)
        sparse_img.py   # Android sparse -> raw conversion
        super_img.py    # super.img LP metadata
        uboot_bin.py    # U-Boot binary (environment, embedded DTBs)
        uboot_env.py    # U-Boot environment block
        uimage.py       # U-Boot uImage wrapper
        zimage.py       # ARM zImage / ARM64 Image kernel extraction
        _kernel_utils.py # Shared kernel analysis (version, config, kallsyms)

Each handler implements BaseHandler.extract() and returns an ExtractionResult with optional children for recursive processing. Handlers register themselves at import time via register_handler().

The pipeline orchestrator (pipeline.py) drives the flow: identify -> annotate -> dispatch to handler -> recurse into children. Children can be processed in parallel via ThreadPoolExecutor. Adding new formats is straightforward -- write a handler, register it for a Format enum value, and the pipeline picks it up automatically.
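The identify -> dispatch -> recurse loop with a depth limit can be sketched on in-memory blobs. This is not pipeline.py: the toy `identify` and `extract` functions know only gzip, and the sketch processes children level by level, a choice made here so that worker threads never wait on work queued behind them in the same pool.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor


def identify(data: bytes) -> str:
    return "gzip" if data[:2] == b"\x1f\x8b" else "unknown"


def extract(data: bytes):
    """Toy handler dispatch: return (format, list of child blobs)."""
    fmt = identify(data)
    children = [gzip.decompress(data)] if fmt == "gzip" else []
    return fmt, children


def run_pipeline(roots, max_depth: int = 10, jobs: int = 4):
    """Process blobs breadth-first, one recursion level at a time."""
    found = []
    level = list(roots)
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        for depth in range(max_depth + 1):
            if not level:
                break
            next_level = []
            # pool.map runs one level in parallel and preserves input order.
            for fmt, children in pool.map(extract, level):
                found.append((depth, fmt))
                next_level.extend(children)
            level = next_level
    return found
```

For example, `run_pipeline([gzip.compress(gzip.compress(b"hello"))])` visits two gzip layers and then the unknown payload.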

Symlink handling

Symlinks found inside ext4 filesystems and cpio archives are not created as OS symlinks, because real symlinks can fail to be created on some platforms and a crafted link target can enable path traversal outside the output directory. Instead, each symlink is written as a small text file containing -> target and recorded in the extraction metadata / JSON report.
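A sketch of the placeholder approach follows. The `write_symlink_placeholder` helper is hypothetical, and the containment check is an added precaution in this example rather than something this README documents.

```python
from pathlib import Path


def write_symlink_placeholder(root: Path, rel_path: str, target: str) -> Path:
    """Write '-> target' as a regular file instead of creating a symlink."""
    root = root.resolve()
    dest = (root / rel_path).resolve()
    # Refuse entry names that would land outside the extraction root.
    if not dest.is_relative_to(root):
        raise ValueError(f"path escapes output directory: {rel_path}")
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_text(f"-> {target}")
    return dest
```

Because the target is stored only as text, nothing on the host is ever dereferenced through it.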

Running tests

pip install -e ".[dev]"
pytest tests/ -v

The suite contains 110 tests covering format detection, handler extraction, kernel analysis (kallsyms), pipeline orchestration, and forensic scanning. All tests use synthetic data, so no real image files are needed.

Known limitations

  • EROFS, F2FS, and exFAT: Identified but not yet extracted (no pure-Python reader available).
  • Encrypted partitions: FBE-encrypted ext4 partitions are detected and documented, but file contents remain encrypted. The tool does not perform Android FDE/FBE decryption.
  • IM*H decryption: Out of scope. The core tool parses IM*H headers and extracts encrypted chunks as raw .bin files. Producing plaintext requires AES keys that are not distributed with this tool.
  • Large partitions: Extracting a very large ext4 partition (54 GB, for example) is slow and needs substantial free disk space. Use --skip-ext4 to skip these, or extract individual partitions as needed.
  • 7z extraction: Split archives require a system 7z binary. Single-file archives can fall back to py7zr.
  • Symlinks: Recorded as text files, not created as actual OS symlinks.
  • Text files: Plain-text metadata files (.txt, .sha256, .xml, README) in the input directory are detected and skipped rather than subjected to forensic extraction.
