RecoverX

Professional file recovery and carving tool for disk images and block devices.

Python 3.10+ | pytest: 954 passing | Coverage: 85% | Code style: black | CI: passing | License: MIT | Status: Stable


RecoverX extracts deleted or lost files from raw disk images (.img, .dd, .raw) and block devices using signature-based file carving. Its modular architecture keeps new file formats cheap to add: register a signature, implement a single carve() method, and wire the carver into the scan pipeline.


Features

  • JPEG carving — extracts JPEG images via SOI (FFD8FF) / EOI (FFD9) marker detection with configurable lookback window
  • Raw image scanning — read-only sector-level and offset-level access to disk images and physical block devices
  • Disk detection — enumerate connected disks, partitions, and block devices with size, type, and mount point information
  • Read-only architecture — every disk operation is strictly read-only; no writes to the source image
  • Modular carving engine — BaseCarver ABC + FileSignature dataclass; a new format (for example ZIP) needs one carver file plus a signature entry
  • Rich CLI — coloured output, live progress bars, formatted tables via rich
  • Dual logging — console (INFO+) + structured file logs (DEBUG+)
  • Extensible — drop-in carvers, centralised signature registry, recovery manager with auto-naming
  • PNG carving — extracts PNG images via \x89PNG header / IEND footer signature matching
  • GIF carving — supports both GIF87a and GIF89a formats
  • BMP carving — uses file-size-from-header for accurate extraction
  • PDF carving — extracts PDFs via %PDF / %%EOF markers
  • SHA-256 forensic hashing — per-file SHA-256 hash displayed in CLI output; deduplication support
  • Hash database — persistent SHA-256 hash storage across runs for dedup and statistics
  • Chunked streaming scanner — memory-efficient, configurable chunk/overlap sizes (default 4 MB)
  • Memory-mapped scanner — zero-copy reads with automatic fallback to streaming
  • Multithreaded scanner — parallel region-based scanning with --threads CLI flag
  • Scan benchmarking — elapsed time, MB/s, CPU%, RAM, files/min; exports to JSON
  • Professional Progress Engine — real-time progress tracking with scanned/total, throughput MB/s, ETA, active threads, findings by type; thread-safe counters with Rich live display
  • Quick Scan Mode (--quick) — prioritise MFT region, boot sector, and tail regions for faster results on large images
  • Scan Limits (--max-size, --max-time) — limit scan duration or byte count with graceful stop and partial results
  • Graceful Interruptions — CTRL+C handling preserves recovered files and prints partial summary
  • Live Findings Preview (--live-findings) — real-time file discoveries during scan
  • Smart Type Filtering (--type) — activate only selected carvers (jpg,png,pdf) for targeted recovery
  • JSON forensic reports — structured output usable in forensic pipelines (--report report.json)
  • Filesystem detection — automatic identification of FAT12/16/32, exFAT, NTFS, ext2/3/4
  • Direct disk access — recoverx devices lists connected disks; recoverx scan /dev/sdX reads raw devices (read-only)
  • FAT32 filesystem analysis — boot sector parsing, directory traversal (SFN + LFN), cluster chain reading
  • FAT32 deleted file recovery — scan for 0xE5-marked entries, reconstruct cluster chains, recover with SHA-256
  • FAT32 CLI — recoverx fat32 info, list, deleted, recover with --json output
  • NTFS filesystem analysis — boot sector parser, MFT record walker, attribute system (STANDARD_INFORMATION, FILE_NAME, DATA), resident data extraction
  • NTFS deleted entry detection — scan MFT for FILE records with IN_USE=0 flag
  • NTFS non-resident DATA recovery — runlist execution engine with VCN→LCN translation, fragmented file reconstruction, sparse file support
  • NTFS runlist validation — overlap detection, OOB protection, circular run detection, data integrity checks
  • NTFS recovery CLI — recoverx ntfs recover with --deleted-only, --non-resident-only, --verify-hashes, --json, threaded support
  • NTFS analyse CLI — recoverx ntfs analyse --record N for detailed runlist analysis with validation issues
  • NTFS CLI — recoverx ntfs info, mft, deleted, resident with --json output
  • NTFS USN journal parser — parse $UsnJrnl records (V2/V3) with reason flag detection, rename pairing, timeline integration
  • NTFS $LogFile parser — restart page parsing, log record extraction, operation type detection
  • Forensic timeline engine — event sorting, deduplication, filtering, JSON/CSV/text export
  • Forensic event abstraction — unified ForensicEvent model with EventType, EventSource, Confidence scoring
  • Forensic correlation engine — MFT↔USN matching, rename chain reconstruction, file history tracking
  • Forensic indexing engine — SQLite persistence with schema management, WAL mode, transaction batching, LRU cache
  • Forensic query engine — simple forensic query language with AST parser and SQL translation
  • Investigation case management — create cases, bookmarks, saved queries, artifact tagging, notes
  • Artifact abstraction layer — Artifact, FileArtifact, TimelineArtifact, DeletedArtifact, HashArtifact
  • Forensic reporting — CSV, JSON, Markdown export, investigation summary reports
  • Advanced correlation — delete/recreate detection, timestamp anomaly, orphan reconstruction
  • Correlation Engine V2 — advanced multi-source correlation with graph-based relationship modeling, rename chains, anomaly detection, heuristic analysis, confidence scoring
  • Event Graph Engine — CorrelationGraph with nodes/edges, BFS traversal, path finding, anomaly clustering, evidence chain tracing
  • Distributed Indexing Foundation — Coordinator, Worker, TaskQueue, Scheduler, priority-based task scheduling, retry logic, heartbeat protocol
  • Remote Acquisition Foundation — AcquisitionSession, AcquisitionTarget, ImageStream, TransportInterface, read-only guarantees, chunked data transfer
  • Plugin SDK — Plugin base class, PluginRegistry, PluginLoader, typed interfaces (FilesystemParserPlugin, AnalyzerPlugin, ReportExporterPlugin, etc.), lifecycle management
  • Analyzer Framework — BaseAnalyzer ABC, specialized analyzers: MassDeleteAnalyzer, SuspiciousRenameAnalyzer, TimestampAnomalyAnalyzer, DuplicateActivityAnalyzer, OrphanArtifactAnalyzer
  • Forensic Findings Engine — FindingsEngine with Finding dataclass, severity scoring, evidence chains, category classification, confidence filtering
  • Query Optimization Layer — QueryPlanner with filter pushdown, index scan planning, cost estimation; QueryCache with TTL-based expiry, LRU eviction; MetricsCollector for query performance tracking
  • Forensic Export System — ForensicBundle with manifest, integrity hash; SQLitePackage with structured event/finding/artifact tables
  • Performance & Scalability — StreamingIndexer (bounded batches), IncrementalIndexer (resumable), ParallelAnalyzer (thread pool), MemoryPressureGuard (allocation tracking)
  • Forensic CLI — recoverx forensic timeline, search, query, export, summary, index, findings, graph
  • Case CLI — recoverx case create, open, list, close, delete
  • Plugin CLI — recoverx plugins list
  • Fuzz testing — 42 fuzz tests protecting binary parsers, query engine, distributed system, plugin loader, and query optimizer against corruption and malicious input
  • Recovery validation — precision, recovery rate, metadata integrity, and hash consistency measurements
  • CI/CD automation — GitHub Actions with matrix testing (3.10/3.11/3.12), linting, type checking, security scanning
  • Static analysis — mypy type checking + bandit security scanning
  • Performance profiling — Profiler context manager with CPU, RAM, throughput metrics, JSON export
  • Testing suite — 954 pytest tests across all core modules
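
The chunked streaming scanner listed above reads an image in fixed-size chunks and keeps an overlap so that a signature straddling a chunk boundary is still found. The following is a minimal sketch of that idea, not RecoverX's own scanner: the function name `find_signature_offsets` and its parameters are hypothetical, and the overlap is simply set to one byte less than the signature length.

```python
import io
from typing import BinaryIO, Iterator


def find_signature_offsets(
    stream: BinaryIO,
    signature: bytes,
    chunk_size: int = 4 * 1024 * 1024,
) -> Iterator[int]:
    """Yield absolute offsets of `signature`, reading `stream` chunk by chunk."""
    if not signature:
        raise ValueError("signature must not be empty")
    overlap = len(signature) - 1  # bytes carried over between chunks
    tail = b""                    # carried-over end of the previous window
    base = 0                      # absolute offset of window[0]
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        window = tail + chunk
        pos = window.find(signature)
        while pos != -1:
            yield base + pos
            pos = window.find(signature, pos + 1)
        # Keep only the bytes that could still begin a match in the next window.
        # The tail is shorter than the signature, so no match is reported twice.
        keep = min(overlap, len(window))
        tail = window[len(window) - keep:]
        base += len(window) - keep


# Usage: a JPEG SOI marker split across a tiny 4-byte chunk boundary.
data = bytes(10) + b"\xff\xd8\xff" + bytes(5) + b"\xff\xd8\xff"
offsets = list(find_signature_offsets(io.BytesIO(data), b"\xff\xd8\xff", chunk_size=4))
print(offsets)
```

Memory use stays bounded by the chunk size plus the overlap, regardless of the image size.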

Installation

# Clone the repository
git clone https://github.com/recoverx/recoverx.git
cd recoverx

# Create and activate a virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install the package
pip install -e .

# (Optional) Development dependencies for linting and testing
pip install -e ".[dev]"

Usage

# Show connected disks and partitions
recoverx info

# Scan a disk image for recoverable files
recoverx scan sample.img

# Show help
recoverx --help

Commands

Command         Description
recoverx info   List connected disks, partitions, and block devices
recoverx scan   Scan an image or device and carve recoverable files

Example output

RecoverX — Scanning sample.img
  Size:    10.0 MB
  Sectors: 20,480

Reading image...
Reading... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10.5/10.5 MB 0:00:00

Carving files...
  [+] JPEG found at offset 204,800
      SHA256: a1b2c3d4e5f6...
      Saved: recovered/jpeg_001.jpg
  [+] PNG found at offset 1,048,576
      SHA256: f6e5d4c3b2a1...
      Saved: recovered/png_001.png

                   Recovered Files
┏━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ # ┃ File         ┃               Offset ┃     Size ┃ SHA256                      ┃
┡━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 1 │ jpeg_001.jpg │    0x32000 (204,800) │ 1014.0 B │ a1b2c3d4e5f6...            │
│ 2 │ png_001.png  │ 0x100000 (1,048,576) │  2.5 KB  │ f6e5d4c3b2a1...            │
└───┴──────────────┴──────────────────────┴──────────┴────────────────────────────┘

Scan complete: 2 file(s) recovered in 0.32s (32.8 MB/s)
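
The SHA256 column in the output above can be reproduced with Python's standard `hashlib`. The sketch below shows one way to hash carved data in blocks and skip duplicates. The helper names `sha256_of_stream` and `register_hash` are hypothetical and do not describe the project's HashManager API.

```python
import hashlib
import io
from typing import BinaryIO


def sha256_of_stream(stream: BinaryIO, block_size: int = 1024 * 1024) -> str:
    """Return the hex SHA-256 digest of a binary stream, read in blocks."""
    digest = hashlib.sha256()
    for block in iter(lambda: stream.read(block_size), b""):
        digest.update(block)
    return digest.hexdigest()


def register_hash(seen: set[str], file_hash: str) -> bool:
    """Record a hash; return True if it is new, False if already seen."""
    if file_hash in seen:
        return False
    seen.add(file_hash)
    return True


# Usage: the second, identical blob is detected as a duplicate.
seen_hashes: set[str] = set()
first = register_hash(seen_hashes, sha256_of_stream(io.BytesIO(b"carved bytes")))
second = register_hash(seen_hashes, sha256_of_stream(io.BytesIO(b"carved bytes")))
print(first, second)
```

Reading in blocks keeps memory flat for large recovered files. Persisting the `seen` set between runs is one way a hash database could support deduplication across scans.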

Development

source .venv/bin/activate
pip install -e ".[dev]"

Testing

pytest -v

Linting and formatting

black src/ tests/
isort src/ tests/
flake8 src/ tests/

Generate test image

python tests/create_sample.py
recoverx scan sample.img
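
The contents of tests/create_sample.py are not shown here. As a rough idea of what such a generator might do, the sketch below builds a zero-filled image in memory and places marker-delimited blobs at chosen offsets. The function `build_sample_image` and the blob contents are assumptions for illustration. The blobs carry the right header and footer bytes but are not decodable images.

```python
def build_sample_image(size: int, payloads: dict[int, bytes]) -> bytes:
    """Return a zero-filled image of `size` bytes with payloads at given offsets."""
    image = bytearray(size)
    for offset, payload in payloads.items():
        if offset < 0 or offset + len(payload) > size:
            raise ValueError(f"payload at offset {offset} does not fit in {size} bytes")
        image[offset:offset + len(payload)] = payload
    return bytes(image)


# JPEG-like blob: SOI marker, filler, EOI marker (not a real picture).
jpeg_like = b"\xff\xd8\xff\xe0" + bytes(64) + b"\xff\xd9"
# PNG-like blob: signature, filler, IEND chunk (not a real picture).
png_like = b"\x89PNG\r\n\x1a\n" + bytes(64) + b"\x00\x00\x00\x00IEND\xae\x42\x60\x82"

# 10 MiB image (20,480 sectors of 512 bytes) with blobs at the offsets
# used in the example output earlier in this README.
sample = build_sample_image(20_480 * 512, {204_800: jpeg_like, 1_048_576: png_like})
print(len(sample))
# To scan it, the bytes could be written out with pathlib.Path("sample.img").write_bytes(sample).
```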

Architecture

recoverx/
├── src/
│   └── recoverx/
│       ├── __init__.py           # Package root
│       ├── cli/
│       │   ├── main.py           # Typer app, command registration
│       │   └── commands/
│       │       ├── info.py       # recoverx info — disk detection
│       │       ├── scan.py       # recoverx scan — carving pipeline
│       │       ├── forensic.py   # recoverx forensic — timeline, findings, graph
│       │       ├── plugins.py    # recoverx plugins — list plugins
│       │       ├── cases.py      # recoverx case — create, open, list, close
│       │       ├── sources.py    # Shared MFT/USN collection helpers
│       │       └── ntfs.py       # recoverx ntfs — USN, LogFile, recovery
│       └── core/
│           ├── disk/
│           │   └── detector.py   # psutil + /sys/block enumeration
│           ├── carving/
│           │   ├── base.py       # BaseCarver ABC + CarvedFile / FileSignature
│           │   ├── jpg.py        # JPEG carver (FFD8FF / FFD9)
│           │   ├── png.py        # PNG carver (\x89PNG / IEND)
│           │   ├── gif.py        # GIF carver (GIF87a / GIF89a)
│           │   ├── bmp.py        # BMP carver (BM + header size)
│           │   ├── pdf.py        # PDF carver (%PDF / %%EOF)
│           │   ├── streaming.py  # Chunked streaming scanner with overlap
│           │   └── signatures.py # Centralised signature registry
│           ├── scanner/
│           │   ├── mmap_scanner.py     # Memory-mapped scanner (zero-copy)
│           │   └── threaded_scanner.py # Parallel region-based scanner
│           ├── recovery/
│           │   └── manager.py    # Auto-named output, counter per extension
│           ├── reporting/
│           │   └── json_report.py # JSON forensic report generator
│           ├── benchmark/
│           │   ├── advanced_benchmark.py # CPU/RAM/throughput metrics
│           │   └── profiler.py           # Context manager profiler + decorator
│           ├── forensics/       # Forensic analysis framework
│           │   ├── models.py    # ForensicEvent, EventType, Confidence
│           │   ├── events.py    # Event factory functions
│           │   ├── timeline.py  # Timeline builder, sort, filter, export
│           │   ├── artifacts.py # Rename/deletion chains, activity summaries
│           │   ├── correlation.py # MFT↔USN matching, cross-source correlation
│           │   └── reporting/   # CSV/JSON/Markdown export, summaries
│           ├── artifacts/       # Artifact abstraction layer
│           │   └── models.py    # Artifact, FileArtifact, DeletedArtifact, etc.
│           ├── indexing/        # Forensic indexing engine
│           │   ├── engine.py    # IndexEngine orchestrator
│           │   ├── storage.py   # SQLite storage backend (WAL, search)
│           │   ├── schema.py    # Schema management, migrations, integrity
│           │   ├── cache.py     # Bounded LRU cache with hit tracking
│           │   ├── transactions.py # Bulk insert batching
│           │   └── models.py    # IndexConfig, IndexStats dataclasses
│           ├── query/           # Forensic query engine
│           │   ├── ast.py       # Query AST nodes
│           │   ├── parser.py    # Query tokenizer and parser
│           │   ├── operators.py # Operator enum (==, !=, >, <, ~, etc.)
│           │   ├── filters.py   # AST-to-SQL filter builder
│           │   └── engine.py    # Query execution engine
│           ├── cases/           # Investigation workflows
│           │   ├── models.py    # CaseMetadata, SavedQuery, Bookmark, TaggedArtifact
│           │   └── cases.py     # CaseManager, Case (CRUD, bookmarks, tags, notes)
│           ├── correlation/     # Advanced correlation engine v2
│           │   ├── engine.py    # CorrelationEngineV2 orchestrator
│           │   ├── chains.py    # RenameChain, DeleteRecreateChain, ChainBuilder
│           │   ├── anomalies.py # AnomalyDetector, timestamp/rapid/interleaved
│           │   ├── heuristics.py # HeuristicEngine, MassDeleteRule, SuspiciousRenameRule
│           │   ├── scoring.py   # CorrelationScorer, CorrelationScore
│           │   └── graph.py     # CorrelationGraph, nodes, edges, traversal
│           ├── distributed/     # Distributed indexing foundation
│           │   ├── coordinator.py # Coordinator, worker management
│           │   ├── worker.py    # Worker task execution
│           │   ├── models.py    # Task, TaskState, ChunkedTask, CompositeTask
│           │   ├── queue.py     # Priority TaskQueue with heap
│           │   ├── scheduler.py # Scheduler with concurrent execution
│           │   └── protocol.py  # TaskMessage, ResultMessage, HeartbeatMessage
│           ├── acquisition/     # Remote acquisition foundation
│           │   ├── sessions.py  # AcquisitionSession lifecycle
│           │   ├── targets.py   # AcquisitionTarget, TargetMetadata
│           │   ├── streams.py   # ImageStream chunked reading
│           │   └── transport.py # TransportInterface, LocalTransport
│           ├── analyzers/       # Specialized forensic analyzers
│           │   ├── base.py      # BaseAnalyzer, AnalysisResult, FindingSeverity
│           │   ├── mass_delete.py
│           │   ├── suspicious_rename.py
│           │   ├── timestamp_anomaly.py
│           │   ├── duplicate_activity.py
│           │   └── orphan_artifact.py
│           ├── findings/        # Forensic findings engine
│           │   ├── engine.py    # FindingsEngine, Finding, FindingCategory
│           │   └── evidence.py  # EvidenceChain, EvidenceLink
│           ├── optimizer/       # Query optimization layer
│           │   ├── planner.py   # QueryPlanner, filter pushdown, cost estimation
│           │   ├── cache.py     # QueryCache with TTL, LRU eviction
│           │   └── metrics.py   # MetricsCollector, QueryMetrics
│           ├── performance/     # Performance & scalability
│           │   ├── streaming.py # StreamingIndexer bounded batches
│           │   ├── incremental.py # IncrementalIndexer resumable
│           │   ├── parallel.py  # ParallelAnalyzer thread pool
│           │   └── memory.py    # MemoryPressureGuard allocation tracking
│           ├── export/          # Forensic export system
│           │   ├── bundle.py    # ForensicBundle with manifest
│           │   └── package.py   # SQLitePackage structured export
│           ├── filesystems/
│           │   ├── __init__.py   # Filesystem registry (future plugin loading)
│           │   ├── detector.py   # FAT/NTFS/ext4/exFAT detection
│           │   ├── fat32/        # FAT32 analysis and recovery
│           │   │   ├── boot_sector.py
│           │   │   ├── fat_table.py
│           │   │   ├── directory.py
│           │   │   └── recovery.py
│           │   └── ntfs/         # NTFS analysis and recovery
│           │       ├── boot_sector.py
│           │       ├── mft.py
│           │       ├── attributes.py
│           │       ├── recovery.py
│           │       ├── structures.py
│           │       ├── constants.py
│           │       ├── runlists/  # Runlist execution engine
│           │       │   ├── mapping.py
│           │       │   ├── executor.py
│           │       │   ├── sparse.py
│           │       │   └── validation.py
│           │       ├── usn/       # USN Journal parser
│           │       │   ├── parser.py
│           │       │   ├── records.py
│           │       │   ├── reasons.py
│           │       │   ├── mapping.py
│           │       │   └── structures.py
│           │       └── logfile/   # $LogFile parser
│           │           ├── parser.py
│           │           ├── records.py
│           │           ├── restart_area.py
│           │           └── structures.py
│           ├── utils/
│           │   ├── raw_reader.py     # Read-only binary reader (offset/sector)
│           │   ├── logger.py         # Rich console + file dual logging
│           │   ├── hashing.py        # SHA-256 hashing, HashManager
│           │   ├── hash_database.py  # Persistent hash storage / dedup
│           │   ├── benchmark.py      # ScanBenchmark (elapsed, MB/s)
│           │   └── file_utils.py     # format_size helper
│           └── plugins/         # Plugin SDK
│               ├── __init__.py  # Plugin, PluginType, PluginRegistry exports
│               ├── base.py      # Plugin, PluginType, PluginCapabilities
│               ├── interfaces.py # Typed plugin interfaces (FilesystemParserPlugin, etc.)
│               ├── registry.py  # PluginRegistry with type-based queries
│               ├── loader.py    # PluginLoader (module/file paths)
│               └── lifecycle.py # PluginLifecycle init/shutdown
├── tests/                        # pytest suite (954 tests)
│   └── fuzz/                     # Query and binary parser fuzz tests
├── recovered/                    # Carved file output (gitignored)
├── logs/                         # Log files (gitignored)
├── signatures/                   # Format signature definitions
├── pyproject.toml
├── requirements.txt
├── CHANGELOG.md
├── LICENSE
└── README.md
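
The ntfs/runlists/ package in the tree above handles VCN→LCN translation for non-resident data. For readers unfamiliar with the on-disk format, here is a minimal sketch of decoding an NTFS data-run list. It is not the project's executor: the `Run` dataclass and both function names are hypothetical. Each run starts with a header byte whose low nibble gives the size of the length field and whose high nibble gives the size of the signed, relative offset field. An offset size of zero marks a sparse run.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Run:
    vcn: int             # first virtual cluster number covered by this run
    length: int          # run length in clusters
    lcn: Optional[int]   # first logical cluster number, None for a sparse run


def decode_runlist(raw: bytes) -> list[Run]:
    """Decode NTFS data runs until the 0x00 terminator or the end of `raw`."""
    runs: list[Run] = []
    pos = 0
    vcn = 0
    prev_lcn = 0
    while pos < len(raw) and raw[pos] != 0x00:
        header = raw[pos]
        len_size = header & 0x0F   # low nibble: bytes in the length field
        off_size = header >> 4     # high nibble: bytes in the offset field
        pos += 1
        if len_size == 0 or pos + len_size + off_size > len(raw):
            raise ValueError("malformed or truncated runlist")
        length = int.from_bytes(raw[pos:pos + len_size], "little")
        pos += len_size
        if off_size == 0:
            lcn = None             # sparse run: no clusters allocated on disk
        else:
            # Offsets are signed and relative to the previous run's start LCN.
            delta = int.from_bytes(raw[pos:pos + off_size], "little", signed=True)
            pos += off_size
            prev_lcn += delta
            lcn = prev_lcn
        runs.append(Run(vcn, length, lcn))
        vcn += length
    return runs


def vcn_to_lcn(runs: list[Run], vcn: int) -> Optional[int]:
    """Translate a VCN to an LCN; returns None when the VCN is in a sparse run."""
    for run in runs:
        if run.vcn <= vcn < run.vcn + run.length:
            return None if run.lcn is None else run.lcn + (vcn - run.vcn)
    raise ValueError(f"VCN {vcn} is outside the runlist")


# Usage: 24 clusters at LCN 0x5634, 8 clusters 16 clusters earlier, 4 sparse.
runs = decode_runlist(bytes.fromhex("21183456" "1108f0" "0104" "00"))
for run in runs:
    print(run)
```

A production parser would add the checks the feature list mentions, such as overlap detection and out-of-bounds protection against the volume size.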

Key design decisions

  • BaseCarver — abstract class that enforces a single carve(data: bytes) -> list[CarvedFile] contract. Every format-specific carver (JPEG, PNG, …) is a self-contained subclass.
  • RawReader — context-managed, read-only binary reader. Works on both files and block devices. Provides read_at(offset, size) and read_sector(sector) for flexible access.
  • RecoveryManager — tracks a counter per file extension so output names are deterministic (jpeg_001.jpg, jpeg_002.jpg, …). Output directory is created automatically.
  • Signature registry — signatures.py is a single dict that maps format keys to FileSignature instances. Adding a format is a one-liner here plus a carver class.
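
To make the reader design concrete, the sketch below shows a context-managed, read-only reader exposing `read_at(offset, size)` and `read_sector(sector)` as described above. It is an illustration only: the class name `ReadOnlyReader` and the 512-byte default sector size are assumptions, and the project's RawReader may differ in its details.

```python
import os
import tempfile


class ReadOnlyReader:
    """Context-managed binary reader that never opens the source for writing."""

    def __init__(self, path: str, sector_size: int = 512) -> None:
        self._path = path
        self._sector_size = sector_size  # assumed default; real devices may differ
        self._handle = None

    def __enter__(self) -> "ReadOnlyReader":
        self._handle = open(self._path, "rb")  # "rb": read-only, binary
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        if self._handle is not None:
            self._handle.close()
            self._handle = None

    def read_at(self, offset: int, size: int) -> bytes:
        """Read up to `size` bytes starting at byte `offset`."""
        if self._handle is None:
            raise RuntimeError("reader is not open; use it in a 'with' block")
        if offset < 0 or size < 0:
            raise ValueError("offset and size must be non-negative")
        self._handle.seek(offset)
        return self._handle.read(size)

    def read_sector(self, sector: int) -> bytes:
        """Read one sector by index."""
        return self.read_at(sector * self._sector_size, self._sector_size)


# Usage: write a two-sector scratch file, then read it back read-only.
with tempfile.NamedTemporaryFile(delete=False) as scratch:
    scratch.write(bytes(512) + b"\xff\xd8\xff" + bytes(509))
    scratch_path = scratch.name

with ReadOnlyReader(scratch_path) as reader:
    marker = reader.read_at(512, 3)
    sector = reader.read_sector(1)

os.remove(scratch_path)
print(marker, len(sector))
```

Because the same `open(..., "rb")` call works on regular files and on block device nodes, one reader can serve both, subject to the operating system's permissions.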

Adding a new file format

  1. Add a FileSignature to src/recoverx/core/carving/signatures.py
  2. Create a carver in src/recoverx/core/carving/ that extends BaseCarver
  3. Wire it into the scan pipeline in cli/commands/scan.py
# signatures.py
SIGNATURES["png"] = FileSignature(
    name="PNG", extension="png",
    header=b"\x89PNG\r\n\x1a\n",
    footer=b"\x00\x00\x00\x00IEND\xae\x42\x60\x82",
    min_size=67,
)

# png.py
from .base import BaseCarver, CarvedFile
from .signatures import SIGNATURES

class PNGCarver(BaseCarver):
    def __init__(self):
        super().__init__(SIGNATURES["png"])

    def carve(self, data: bytes) -> list[CarvedFile]:
        # Implementation follows the same header/footer pattern as JPEGCarver
        ...
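
The `carve` body above is left elided. As a self-contained illustration of the header/footer pattern it refers to, the following sketch scans a byte buffer for PNG signature and IEND trailer pairs. It deliberately avoids BaseCarver and CarvedFile, whose fields are not shown in this README. The function `carve_header_footer` and its tuple return type are hypothetical.

```python
PNG_HEADER = b"\x89PNG\r\n\x1a\n"
PNG_FOOTER = b"\x00\x00\x00\x00IEND\xae\x42\x60\x82"


def carve_header_footer(
    data: bytes,
    header: bytes,
    footer: bytes,
    min_size: int = 0,
) -> list[tuple[int, bytes]]:
    """Return (offset, blob) pairs for each header followed by the next footer."""
    results: list[tuple[int, bytes]] = []
    start = data.find(header)
    while start != -1:
        end = data.find(footer, start + len(header))
        if end == -1:
            break  # no footer after this header, so no later header can complete
        stop = end + len(footer)
        if stop - start >= min_size:
            results.append((start, data[start:stop]))
            start = data.find(header, stop)       # continue after the carved blob
        else:
            start = data.find(header, start + 1)  # too small: try the next header
    return results


# Usage: a PNG-shaped blob (right markers, filler body) surrounded by noise.
blob = PNG_HEADER + b"x" * 60 + PNG_FOOTER
found = carve_header_footer(b"junk" + blob + b"tail", PNG_HEADER, PNG_FOOTER, min_size=67)
print([(offset, len(body)) for offset, body in found])
```

Pairing each header with the nearest following footer is the simplest policy. It can truncate files whose payload happens to contain the footer bytes, which is one reason a format with a length field in its header, such as BMP, is carved by declared size instead.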

Future roadmap

Feature Status
JPEG carving ✅ Done
PNG carving ✅ Done
GIF carving ✅ Done
BMP carving ✅ Done
PDF carving ✅ Done
SHA-256 hashing ✅ Done
Hash database ✅ Done
Scan benchmarking ✅ Done
Chunked streaming ✅ Done
Memory-mapped scanner ✅ Done
Multithreaded scanner ✅ Done
JSON forensic reports ✅ Done
Filesystem detection ✅ Done
Direct disk access ✅ Done
FAT32 parsing ✅ Done
FAT32 file recovery ✅ Done
CI/CD automation ✅ Done
Fuzz testing ✅ Done
Static analysis (mypy+bandit) ✅ Done
Performance profiling ✅ Done
Recovery validation ✅ Done
ZIP carving 🔜 Planned
NTFS parsing ✅ Done
NTFS non-resident recovery ✅ Done
NTFS runlist engine ✅ Done
NTFS sparse file support ✅ Done
NTFS deleted non-resident recovery ✅ Done
NTFS USN journal parser ✅ Done
NTFS $LogFile parser ✅ Done
Forensic timeline engine ✅ Done
Forensic event abstraction ✅ Done
Forensic correlation ✅ Done
Forensic indexing engine ✅ Done
Forensic query engine ✅ Done
Case management ✅ Done
Artifact abstraction ✅ Done
Forensic reporting ✅ Done
Correlation Engine V2 ✅ Done
Event Graph Engine ✅ Done
Distributed Foundation ✅ Done
Remote Acquisition ✅ Done
Plugin SDK ✅ Done
Analyzer Framework ✅ Done
Findings Engine ✅ Done
Query Optimization ✅ Done
Forensic Export Bundle ✅ Done
Performance & Scalability ✅ Done
Case CLI ✅ Done
SSD/TRIM awareness 🔜 Planned
ReFS / APFS support 🔜 Planned
GUI (optional) 🔜 Planned

License

Distributed under the MIT License. See LICENSE for more information.
