Advanced file metadata analysis and security tool
MetaScout
Project Overview
MetaScout is a comprehensive metadata security analyzer designed for detecting, analyzing, and securing sensitive information hidden in file metadata across multiple file formats. The tool extracts deep metadata from images, documents, PDFs, audio, video, and executable files, identifying privacy risks and security concerns while providing reporting and redaction capabilities.
Installation
MetaScout requires Python 3.8 or later and can be installed using several methods:
Option 1: Install from PyPI (Recommended)
# Install the base package
pip install metascout
# Install with all optional dependencies
pip install "metascout[full]"
# Install specific feature sets
pip install "metascout[document,executable]"
Option 2: Install from Source
# Clone the repository
git clone https://github.com/ParleSec/metascout.git
cd metascout
# Install the package
pip install .
# Or install in development mode
pip install -e .
Platform-Specific Considerations
Windows
On Windows, the package installs the metascout.exe script to your Python Scripts directory. If this directory is not in your PATH, you can:
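# Add to PATH for the current session (PowerShell); Python3x stands for your Python version folder
$env:PATH += ";$env:USERPROFILE\AppData\Roaming\Python\Python3x\Scripts"
# Or run with the full path (PowerShell needs the call operator & to run a quoted path)
& "$env:USERPROFILE\AppData\Roaming\Python\Python3x\Scripts\metascout.exe"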
Linux/macOS
On Unix-like systems, ensure you have the required build dependencies for certain optional packages:
# Ubuntu/Debian
sudo apt-get install python3-dev libmagic-dev libfuzzy-dev
# macOS (using Homebrew)
brew install libmagic ssdeep
Available Feature Sets
MetaScout uses optional dependencies for specialized features:
- document: Dependencies for enhanced document analysis
- executable: Dependencies for executable file analysis
- security: Enhanced security analysis features (YARA, ssdeep)
- full: All optional dependencies
Verifying Installation
After installation, verify that MetaScout is working correctly:
# Check version
metascout --version
# Run self-tests
metascout test
# Test analyze command with a sample file
metascout analyze path/to/any/file.jpg
Troubleshooting
If you encounter installation issues, try these steps:
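- Ensure Python 3.8+ is installed and in your PATH:
python --version
- If you get "command not found" errors:
# List the installed files to find where the metascout script was placed
pip show -f metascout
# Or run the package as a Python module directly
python -m metascout
- For dependency issues with optional packages:
# Install the base package without its dependencies
pip install metascout --no-deps
# Then install requirements one at a time so a single failing package does not abort the rest
# (pip itself has no option for skipping failed requirements)
xargs -n 1 pip install < requirements.txt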
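The first two checks above can also be run from Python. The following is a minimal sketch using only the standard library; the function name `check_environment` is made up for illustration and is not part of MetaScout.

```python
import shutil
import sys


def check_environment(min_version=(3, 8), command="metascout"):
    """Report whether the interpreter and the CLI script look usable."""
    return {
        "python_version": ".".join(str(part) for part in sys.version_info[:3]),
        # Compare (major, minor) against the minimum supported version
        "python_ok": sys.version_info[:2] >= tuple(min_version),
        # shutil.which returns None when the command is not on PATH
        "script_path": shutil.which(command),
    }


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

If `script_path` prints as `None`, the script directory is most likely missing from PATH.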
Purpose & Motivation
Why MetaScout Exists
Files contain more information than you can see. Behind the visible content lies metadata - information about who created the file, when, where, with what software, and sometimes even location data or personally identifiable information (PII). MetaScout exposes this hidden information to:
- Identify privacy leaks in files before sharing them
- Detect potential security risks in received documents
- Create clean copies of files with sensitive metadata removed
- Verify file authenticity and identify potential manipulation
- Support compliance requirements for handling personal data
MetaScout is particularly valuable for security professionals, privacy-conscious individuals, and organizations that need to ensure documents they create or share don't contain unintentional leaks of sensitive information.
Architecture
System Structure
- core/: Core data models and processing logic
  - models.py: Data models for file metadata and findings
  - processor.py: Core file processing logic
  - utils.py: Utility functions for hashing, file operations, etc.
- extractors/: File-specific metadata extractors
  - base.py: Base extractor interface
  - image.py, document.py, audio.py, etc.: Format-specific extractors
- analyzers/: Analysis modules for different security concerns
  - base.py: Base analyzer interface
  - pattern.py: Pattern matching for PII detection
  - image.py, document.py, etc.: Format-specific analyzers
- operations/: High-level operations (analyze, batch, compare, redact)
- reporters/: Report generators for different output formats
- config/: Configuration and constants
- cli.py: Command-line interface
The modular design ensures each component has a single responsibility, making the codebase maintainable and extensible. New file formats, analysis techniques, or output formats can be added with minimal changes to the core system.
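To illustrate how a new file format could plug in without touching the core, here is a minimal sketch of a base-extractor interface with a registry. All names (`BaseExtractor`, `register`, `find_extractor`, `TextExtractor`) are hypothetical and chosen for this example; the actual classes in extractors/base.py may look different.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any, Dict, List, Optional, Type


class BaseExtractor(ABC):
    """Hypothetical base interface: one subclass per file family."""

    extensions: tuple = ()

    @classmethod
    def can_handle(cls, path: Path) -> bool:
        # Match on the lower-cased file suffix
        return path.suffix.lower() in cls.extensions

    @abstractmethod
    def extract(self, path: Path) -> Dict[str, Any]:
        """Return raw metadata for the given file."""


_REGISTRY: List[Type[BaseExtractor]] = []


def register(cls: Type[BaseExtractor]) -> Type[BaseExtractor]:
    """Class decorator that makes an extractor discoverable."""
    _REGISTRY.append(cls)
    return cls


def find_extractor(path: Path) -> Optional[BaseExtractor]:
    """Return an instance of the first registered extractor that fits."""
    for extractor_cls in _REGISTRY:
        if extractor_cls.can_handle(path):
            return extractor_cls()
    return None


@register
class TextExtractor(BaseExtractor):
    """Toy extractor; it only reports facts derivable from the path."""

    extensions = (".txt", ".md")

    def extract(self, path: Path) -> Dict[str, Any]:
        return {"file_name": path.name, "kind": "text"}
```

Under this design, supporting another format means adding one decorated subclass; the lookup code stays unchanged.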
Metadata Analysis Components
- Extractors: Format-specific modules that extract raw metadata
- Analyzers: Modules that evaluate metadata for privacy/security issues
- Processors: Core logic for orchestrating the analysis pipeline
- Reporters: Formatters for different output requirements
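The four components above can be pictured as a small pipeline: extract, analyze, then report. The sketch below is illustrative only. The `Finding` fields mirror the `severity`, `description`, and `data` attributes used in the example code in this README, but `run_pipeline` and the toy extractor, analyzers, and reporter are assumptions, not MetaScout's real functions.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Finding:
    severity: str
    description: str
    data: Dict[str, Any] = field(default_factory=dict)


def run_pipeline(
    path: str,
    extractor: Callable[[str], Dict[str, Any]],
    analyzers: List[Callable[[Dict[str, Any]], List[Finding]]],
    reporter: Callable[[str, List[Finding]], str],
) -> str:
    """Processor role: orchestrate extract -> analyze -> report."""
    metadata = extractor(path)
    findings: List[Finding] = []
    for analyzer in analyzers:
        findings.extend(analyzer(metadata))
    return reporter(path, findings)


def fake_extractor(path: str) -> Dict[str, Any]:
    # Stand-in for a real format-specific extractor
    return {"Author": "Jane Doe", "GPSInfo": {"lat": 1.0}}


def gps_analyzer(metadata: Dict[str, Any]) -> List[Finding]:
    if "GPSInfo" in metadata:
        return [Finding("high", "GPS location data found", {"field": "GPSInfo"})]
    return []


def author_analyzer(metadata: Dict[str, Any]) -> List[Finding]:
    if metadata.get("Author"):
        return [Finding("medium", "Author name present", {"field": "Author"})]
    return []


def text_reporter(path: str, findings: List[Finding]) -> str:
    lines = [f"File: {path}"]
    lines += [f"[{f.severity.upper()}] {f.description}" for f in findings]
    return "\n".join(lines)


if __name__ == "__main__":
    print(run_pipeline("image.jpg", fake_extractor,
                       [gps_analyzer, author_analyzer], text_reporter))
```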
Key Features
🔍 Deep Metadata Extraction
- EXIF data from images including GPS coordinates and device info
- Document properties from PDFs, Office documents (author, software, etc.)
- ID3 tags and embedded data in audio and video files
- Headers, libraries, and signatures from executable files
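EXIF stores each GPS coordinate as degrees, minutes, and seconds plus a hemisphere reference (N/S or E/W), so a tool usually converts it to decimal degrees before reporting. The helper below is a minimal sketch of that arithmetic; `dms_to_decimal` is a hypothetical name and this is not necessarily how MetaScout performs the conversion.

```python
from fractions import Fraction


def _to_fraction(value):
    # EXIF rationals are often exposed as (numerator, denominator) pairs
    if isinstance(value, tuple):
        return Fraction(*value)
    return Fraction(value)


def dms_to_decimal(dms, ref):
    """Convert (degrees, minutes, seconds) and a hemisphere letter to decimal degrees."""
    degrees, minutes, seconds = (_to_fraction(part) for part in dms)
    decimal = degrees + minutes / 60 + seconds / 3600
    # Southern and western hemispheres are negative by convention
    if ref.upper() in ("S", "W"):
        decimal = -decimal
    return float(decimal)


if __name__ == "__main__":
    print(dms_to_decimal((40, 26, 46), "N"))
    print(dms_to_decimal((79, 58, 56), "W"))
```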
🚨 Privacy & Security Analysis
- PII detection through pattern matching (emails, phone numbers, SSNs)
- Location data detection in images and documents
- Suspicious patterns in executable files
- Document revision history and hidden content detection
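As a rough illustration of pattern-based PII detection, the sketch below scans metadata values with a few regular expressions. The patterns are deliberately simple, US-centric assumptions written for this example; they are not the rules shipped in MetaScout's pattern analyzer, and real-world detection needs stricter validation.

```python
import re

# Simplified, illustrative patterns (assumptions, not MetaScout's rules)
PII_PATTERNS = {
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "us_phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def find_pii(metadata):
    """Return (field, kind, match) tuples for every pattern hit."""
    hits = []
    for field_name, value in metadata.items():
        text = str(value)
        for kind, pattern in PII_PATTERNS.items():
            for match in pattern.findall(text):
                hits.append((field_name, kind, match))
    return hits


if __name__ == "__main__":
    sample = {
        "Author": "Jane Doe <jane@example.com>",
        "Comments": "Call 555-123-4567, SSN 123-45-6789",
        "Title": "Quarterly report",
    }
    for hit in find_pii(sample):
        print(hit)
```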
📊 Comprehensive Reporting
- Multiple output formats (text, JSON, CSV, HTML)
- Severity-based finding classification (high, medium, low)
- Detailed file information including hashes and timestamps
- Visual reports with expandable sections in HTML format
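To show how severity-based classification and multiple output formats can fit together, the sketch below sorts findings by severity and renders them as JSON or CSV using only the standard library. The dictionary layout and function names are assumptions for illustration; MetaScout's reporters may structure their output differently.

```python
import csv
import io
import json

# Lower number means more severe; unknown severities sort last
SEVERITY_ORDER = {"high": 0, "medium": 1, "low": 2}


def sort_by_severity(findings):
    return sorted(
        findings,
        key=lambda f: SEVERITY_ORDER.get(f["severity"], len(SEVERITY_ORDER)),
    )


def to_json(findings):
    return json.dumps(sort_by_severity(findings), indent=2)


def to_csv(findings):
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["severity", "description"],
        extrasaction="ignore",  # drop keys that have no CSV column
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(sort_by_severity(findings))
    return buffer.getvalue()


if __name__ == "__main__":
    demo = [
        {"severity": "low", "description": "Creation timestamp found"},
        {"severity": "high", "description": "GPS location data found"},
    ]
    print(to_csv(demo))
```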
🔄 Batch Processing & Comparison
- Recursive directory scanning with filtering
- Multi-threaded processing for large file collections
- Side-by-side metadata comparison between files
- Fuzzy hash comparison for similarity detection
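Recursive scanning with filtering and multi-threaded processing can be sketched with `pathlib`, `fnmatch`, and `concurrent.futures`. The functions `collect_files` and `process_all` below are hypothetical helpers written for this example, not MetaScout's batch implementation; the `worker` argument stands in for whatever per-file analysis is run.

```python
import fnmatch
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def collect_files(root, pattern="*", exclude=None, recursive=True):
    """Return sorted file paths under root whose names match the filters."""
    root = Path(root)
    candidates = root.rglob("*") if recursive else root.glob("*")
    selected = []
    for path in candidates:
        if not path.is_file():
            continue
        if not fnmatch.fnmatch(path.name, pattern):
            continue
        if exclude and fnmatch.fnmatch(path.name, exclude):
            continue
        selected.append(path)
    return sorted(selected)


def process_all(paths, worker, max_workers=4):
    """Run worker on every path in a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, paths))


if __name__ == "__main__":
    files = collect_files(".", pattern="*.jpg", exclude="*thumb*")
    for path, size in zip(files, process_all(files, lambda p: p.stat().st_size)):
        print(path, size)
```

Threads suit this workload when the per-file work is mostly I/O; CPU-heavy analysis might benefit from a process pool instead.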
✂️ Metadata Redaction
- Selective or complete metadata removal
- Creation of clean copies for sharing
- Preservation of essential metadata when needed
- Support for various file formats including images and PDFs
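The difference between complete and selective removal comes down to a keep-list. The sketch below applies that idea to a plain dictionary of metadata fields; `redact_metadata` is a hypothetical helper, and writing the result back into an image or PDF needs format-specific code that is not shown here.

```python
def redact_metadata(metadata, keep=()):
    """Split metadata into (kept, removed_field_names).

    Field names in keep are matched case-insensitively; an empty
    keep-list removes everything.
    """
    keep_lower = {name.lower() for name in keep}
    kept = {}
    removed = []
    for name, value in metadata.items():
        if name.lower() in keep_lower:
            kept[name] = value
        else:
            removed.append(name)
    return kept, removed


if __name__ == "__main__":
    original = {"Title": "Report", "Author": "Jane Doe", "GPSInfo": "..."}
    print(redact_metadata(original, keep=["title"]))
    print(redact_metadata(original))
```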
🔒 Advanced Analysis
- YARA integration for custom pattern detection
- Fuzzy hashing for file similarity analysis
- Detailed executable analysis for security risks
- Support for password-protected documents
Example Code
from metascout import process_file, process_files

# Analyze a single file
result = process_file("image.jpg")
print(f"Found {len(result.findings)} issues in {result.file_path}")

# Process findings
for finding in result.findings:
    if finding.severity == "high":
        print(f"[{finding.severity.upper()}] {finding.description}")
        for key, value in finding.data.items():
            print(f" {key}: {value}")

# Batch process multiple files
results = process_files(["file1.pdf", "file2.docx", "file3.jpg"])
Core Dependencies
- pillow: Image file handling and EXIF extraction
- PyPDF2: PDF metadata extraction and manipulation
- mutagen: Audio file metadata extraction
- python-magic: File type detection
- exifread: Enhanced EXIF data extraction
- colorama: Terminal color formatting
- tabulate: Table formatting for reports
- tqdm: Progress bars for batch processing
Optional Dependencies:
- yara-python: Pattern matching using YARA rules
- ssdeep: Fuzzy hash comparison
- python-docx/openpyxl: Office document analysis
- pefile/pyelftools/macholib: Executable analysis
Usage & CLI
MetaScout CLI provides multiple commands for different metadata operations:
Single File Analysis
# Basic file analysis with text output
metascout analyze image.jpg
# Generate an HTML report for a PDF
metascout analyze document.pdf --format html --output report.html
# Skip hash computation for faster analysis
metascout analyze large_file.mp4 --skip-hashes
Example output:
File: image.jpg
Path: /path/to/image.jpg
Type: image (image/jpeg)
Size: 2,345,678 bytes
Analysis Findings:
[PRIVACY] GPS location data found in EXIF metadata
source: EXIF
field: GPSInfo
[PRIVACY] Device information found
device_info: {'Make': 'Apple', 'Model': 'iPhone 12'}
[INFORMATION] Image creation timestamp found
dates: {'DateTimeOriginal': '2025:04:15 14:32:45'}
Batch Processing
# Process all files in a directory recursively
metascout batch /path/to/files --recursive
# Process only JPG files, excluding thumbnails
metascout batch /data/photos --filter "*.jpg" --exclude "*thumb*" --recursive
# Generate an HTML report for all documents
metascout batch /path/to/documents --format html --output report.html
File Comparison
# Compare metadata between two files
metascout compare original.pdf modified.pdf
# Compare with HTML output
metascout compare file1.docx file2.docx --format html --output comparison.html
# Use fuzzy hashing for similarity detection
metascout compare original.jpg similar.jpg --fuzzy-hash
Example comparison output:
File Metadata Comparison Report
==============================
File 1: original.pdf
Path: /path/to/original.pdf
Type: document
Size: 1,234,567 bytes
MD5: a1b2c3d4e5f6...
File 2: modified.pdf
Path: /path/to/modified.pdf
Type: document
Size: 1,245,678 bytes
MD5: f6e5d4c3b2a1...
TIMESTAMPS
----------
creation_time:
File 1: 2025-04-10T12:34:56
File 2: 2025-04-15T09:12:34
METADATA FIELDS
--------------
document_info.Author:
File 1: Original Author
File 2: Modified Author
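A report like the one above can be produced by flattening nested metadata into dotted field names (such as `document_info.Author`) and listing the fields whose values differ. The sketch below shows that idea with hypothetical helpers `flatten` and `diff_metadata`; it is not MetaScout's comparison code.

```python
def flatten(metadata, prefix=""):
    """Flatten nested dicts into {"outer.inner": value} form."""
    flat = {}
    for key, value in metadata.items():
        name = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def diff_metadata(first, second):
    """Return {field: (value_in_first, value_in_second)} for differing fields.

    A field missing from one side is reported with None for that side.
    """
    a, b = flatten(first), flatten(second)
    differences = {}
    for name in sorted(set(a) | set(b)):
        left, right = a.get(name), b.get(name)
        if left != right:
            differences[name] = (left, right)
    return differences


if __name__ == "__main__":
    file1 = {"creation_time": "2025-04-10T12:34:56",
             "document_info": {"Author": "Original Author", "Title": "Report"}}
    file2 = {"creation_time": "2025-04-15T09:12:34",
             "document_info": {"Author": "Modified Author", "Title": "Report"}}
    for name, (left, right) in diff_metadata(file1, file2).items():
        print(f"{name}:\n  File 1: {left}\n  File 2: {right}")
```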
Metadata Redaction
# Create a clean copy with all metadata removed
metascout redact confidential.jpg public.jpg
# Keep specific metadata fields while removing others
metascout redact document.pdf redacted.pdf --keep title author
Additional Commands
# Get detailed information about global options
metascout --help
# Get detailed information about a specific command
metascout analyze --help
# Enable verbose output for any command
metascout analyze image.jpg --verbose
Advanced Use Cases
Privacy Audit Workflow
For organizations looking to audit documents for privacy compliance:
# 1. Scan a directory of documents for PII and write a text report
metascout batch /path/to/documents --recursive --output audit.txt --format text
# 2. Create clean versions of documents with issues
# (assumes each HIGH finding line in audit.txt starts with "<file path>:")
mkdir -p clean-documents
grep "HIGH" audit.txt | cut -d ':' -f1 | sort -u | while IFS= read -r file; do
    metascout redact "$file" "clean-documents/$(basename "$file")" --keep title
done
Security Investigation
For examining suspicious files:
# 1. Extract and analyze metadata from suspicious files
metascout analyze suspicious.exe --yara-rules security-rules.yar
# 2. Compare with known samples
metascout compare suspicious.exe reference.exe --fuzzy-hash
Media File Management
For photographers or media professionals:
# 1. Check images for GPS data before posting online
for img in *.jpg; do
metascout analyze "$img" --filter "GPS"
done
# 2. Create versions safe for sharing
mkdir web-safe
for img in *.jpg; do
metascout redact "$img" "web-safe/$(basename "$img")" --keep copyright
done
License
This project is licensed under the MIT License. See LICENSE for more details.
File details
Details for the file metascout-1.0.0.tar.gz.
File metadata
- Download URL: metascout-1.0.0.tar.gz
- Upload date:
- Size: 64.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1cd6c76b5abd7887034c33f4b69b5f33b03e840f6ed965e54f356a13812116ed |
| MD5 | 27be92371cd87721c5f6901bc2cd55cf |
| BLAKE2b-256 | 8e20e078a9f48d80bfba0c8a79a432a458f7eec074aa16436daf10d3d24bf49a |
File details
Details for the file metascout-1.0.0-py3-none-any.whl.
File metadata
- Download URL: metascout-1.0.0-py3-none-any.whl
- Upload date:
- Size: 76.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c2ffdb7b6e93ad24a6899ddda798580a52d8fbe8fe6fbcaf9fbedd6602b68a10 |
| MD5 | c413f1d01eb9127a44cce63c7af4e3cd |
| BLAKE2b-256 | a59ce81a0688364fd48288c45dc41b12e307ca1035fb0be7c7330cbfad72e8ab |