A CLI tool to archive old Gmail messages to local mbox files with validation and safe deletion
Gmail Archiver
A comprehensive email archival and search solution for Gmail - Archive, compress, search, and manage your email history with confidence.
Why Gmail Archiver?
Gmail offers 15GB of free storage shared across Google services, but that space fills up quickly with years of emails, attachments, and files. While Gmail provides basic search and labels, it lacks:
- Local backup and control: Your emails are only in Google's cloud
- Long-term archival: No built-in way to archive and compress old emails while keeping them searchable
- Data portability: Difficult to export and search emails outside Gmail
- Storage optimization: No automatic compression or deduplication
- Search performance: Gmail search can be slow for large mailboxes
Gmail Archiver solves these problems by providing a professional-grade archival solution that:
- Archives old emails to portable mbox files (industry standard format)
- Searches archived emails with Gmail-style syntax (faster than Gmail itself!)
- Compresses archives with modern algorithms (zstd, lzma, gzip)
- Validates archives before deletion with multi-layer verification
- Manages your email history with deduplication and consolidation
- Protects your data with atomic transactions and safe deletion workflows
Key Benefits
- Reclaim Gmail storage: Archive old emails and safely delete them from Gmail
- Keep emails searchable: Lightning-fast full-text search (0.85ms for 1000 messages)
- Maintain data sovereignty: Your emails, your local storage, your control
- Future-proof format: mbox is a plain-text format more than 40 years old, readable by most email clients and tools
- Production-ready: 619 automated tests, 92% code coverage, strict type safety
🔔 Upgrading from v1.0.x?
See the Migration Guide for v1.0 → v1.1 upgrade instructions.
TL;DR: Run gmailarchiver migrate on first v1.1 run. Automatic backup included.
✨ New in v1.1.0
🔍 Full-Text Search (FTS5)
Search your archived messages with Gmail-style syntax:
- Search by sender: from:alice@example.com
- Search by subject: subject:meeting
- Search by date range: after:2024-01-01 before:2024-12-31
- Full-text search: invoice payment
- Performance: 0.85ms for 1000 messages (118x faster than target)
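To illustrate how a Gmail-style query can be split into structured filters and free-text terms, here is a minimal sketch. It is not Gmail Archiver's actual parser: `ParsedQuery` and `parse_query` are hypothetical names, and only the four operators shown in the examples above are recognized.

```python
import re
from dataclasses import dataclass, field

# Operators taken from the examples above; anything else counts as free text.
_OPERATORS = ("from", "subject", "after", "before")
_TOKEN_RE = re.compile(r"(\w+):(\S+)")


@dataclass
class ParsedQuery:
    """Hypothetical container for a parsed query (not the tool's own type)."""
    filters: dict[str, str] = field(default_factory=dict)
    terms: list[str] = field(default_factory=list)


def parse_query(query: str) -> ParsedQuery:
    """Split a query into operator filters and plain search terms."""
    parsed = ParsedQuery()
    for token in query.split():
        match = _TOKEN_RE.fullmatch(token)
        if match and match.group(1).lower() in _OPERATORS:
            parsed.filters[match.group(1).lower()] = match.group(2)
        else:
            parsed.terms.append(token)
    return parsed
```

For example, `parse_query("from:alice@example.com invoice payment")` yields one `from` filter and the two terms `invoice` and `payment`. A full-text index would then be queried with the terms while the filters narrow the candidate rows.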
📥 Import Existing Archives
Import mbox files from other tools or previous archives:
- Automatic metadata extraction
- Accurate offset calculation for fast access
- Support for compressed archives (gzip, lzma, zstd)
- Performance: 10,000+ messages per second
🔄 Deduplication
Remove duplicate messages across archives:
- 100% precision via RFC Message-ID
- Multiple strategies (newest, largest, first)
- Cross-archive detection
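As a rough illustration of Message-ID based deduplication, the sketch below groups records by Message-ID and keeps one copy per group. The record type `ArchivedMessage`, the function `pick_keepers`, and the exact meaning of each strategy (newest = most recently archived, largest = biggest size, first = first seen) are assumptions for illustration, not the tool's documented behavior.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ArchivedMessage:
    """Hypothetical index record for one archived copy of a message."""
    message_id: str
    archive: str
    archived_at: datetime
    size: int


def pick_keepers(messages, strategy="newest"):
    """Return {message_id: copy to keep}; all other copies are duplicates."""
    groups = defaultdict(list)
    for message in messages:
        groups[message.message_id].append(message)
    choose = {
        "newest": lambda group: max(group, key=lambda m: m.archived_at),
        "largest": lambda group: max(group, key=lambda m: m.size),
        "first": lambda group: group[0],
    }[strategy]
    return {message_id: choose(group) for message_id, group in groups.items()}
```

Because the grouping key is the Message-ID header, two copies in different archive files land in the same group, which is what makes cross-archive detection possible.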
📦 Archive Consolidation
Merge multiple archives into one:
- Chronological sorting
- Integrated deduplication
- Automatic offset recalculation
- Performance: 10k messages in 3.57 seconds (16x faster than target)
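Chronological sorting with integrated deduplication can be approximated with the standard library's `email` package. The function `consolidate` below is a hypothetical sketch that works on raw message strings in memory; it assumes every message has a Date header with an explicit timezone offset and a Message-ID header, and it does not write an mbox file or recalculate offsets.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime


def consolidate(raw_messages: list[str]) -> list[str]:
    """Sort messages by Date header and drop repeated Message-IDs."""
    parsed = [(message_from_string(raw), raw) for raw in raw_messages]
    # sorted order is stable, so equal dates keep their input order.
    parsed.sort(key=lambda pair: parsedate_to_datetime(pair[0]["Date"]))
    seen = set()
    merged = []
    for message, raw in parsed:
        message_id = message["Message-ID"]
        if message_id in seen:
            continue  # a copy of this message was already kept
        seen.add(message_id)
        merged.append(raw)
    return merged
```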
⚡ Performance Improvements
| Component | Target | Achieved | Improvement |
|---|---|---|---|
| Search (1000 msgs) | <100ms | 0.85ms | 118x faster |
| Import (10k msgs) | <60s | <1s | 60x faster |
| Consolidate (10k msgs) | <60s | 3.57s | 16x faster |
✨ Features
- 📅 Smart Archiving: Archive emails older than a specified threshold (e.g., "3y", "6m", "30d")
- ♻️ Incremental Mode: Skip already-archived messages for efficient recurring runs
- 🗜️ Compression: Support for gzip, lzma, and zstd (fastest, Python 3.14 native)
- ✅ Multi-Layer Validation: Validate archives before deletion with checksums and spot-checks
- 🛡️ Safe Deletion Workflow:
  - Archive-only mode (default, safe)
  - Trash mode (30-day recovery window)
  - Permanent deletion (with explicit confirmation)
- 📊 Progress Tracking: Real-time progress bars for long operations
- 💾 State Management: SQLite database tracks archived messages and run history
- ⚡ Batch Operations: Efficient API usage with automatic rate limiting
📦 Installation
Prerequisites
- Python 3.14+ (Download here)
- Gmail Account with email you want to archive
Note: OAuth2 credentials are bundled with the application. No manual Google Cloud setup required!
Install from PyPI (Recommended)
pip install gmail-archiver-cli
Or use pipx for isolated installation:
pipx install gmail-archiver-cli
Install from GitHub Release (Alternative)
- Go to the Releases page
- Download the latest .whl file
- Install with pip:
# Replace VERSION with the latest version (e.g., 1.1.3)
pip install https://github.com/tumma72/GMailArchiver/releases/download/vVERSION/gmail_archiver_cli-VERSION-py3-none-any.whl
Verify Installation
gmailarchiver --version
gmailarchiver --help
🔐 First Run - OAuth2 Authorization
On first run, Gmail Archiver will automatically:
- Open your browser to Google's authorization page
- Ask you to sign in with your Google Account
- Request permission to access Gmail (read-only for archiving, modify for deletion)
- Save an authorization token to:
  - Linux/macOS: ~/.config/gmailarchiver/token.json
  - Windows: %APPDATA%\gmailarchiver\token.json
Security Note: The bundled OAuth2 credentials follow Google's security model for "installed applications". The client secret is not confidential for desktop apps - security comes from user consent at authorization time.
Using Custom OAuth2 Credentials (Optional)
If you prefer to use your own OAuth2 credentials:
- Create credentials in Google Cloud Console
- Enable the Gmail API
- Create "Desktop app" OAuth 2.0 credentials
- Download the credentials JSON file
- Use with the --credentials flag:
gmailarchiver archive 3y --credentials /path/to/your/credentials.json
🚀 Quick Start
Basic Usage
# Preview what would be archived (dry run)
gmailarchiver archive 3y --dry-run
# Archive emails older than 3 years
gmailarchiver archive 3y
# Archive with zstd compression (recommended - fastest)
gmailarchiver archive 3y --compress zstd
# Archive with custom filename
gmailarchiver archive 6m --output my_archive.mbox.zst --compress zstd
Age Formats
| Format | Meaning |
|---|---|
| 3y | 3 years |
| 6m | 6 months |
| 2w | 2 weeks |
| 30d | 30 days |
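One way to turn an age string from the table above into a cutoff date is sketched here. This is an assumption about how such a threshold could be computed, not the tool's implementation: `cutoff_date` is a hypothetical helper, and it uses calendar months and years, clamping the day when the target month is shorter.

```python
import calendar
import re
from datetime import date, timedelta

_AGE_RE = re.compile(r"^(\d+)([ymwd])$")


def cutoff_date(age: str, today: date) -> date:
    """Return the date that lies `age` before `today` (e.g. '3y', '6m')."""
    match = _AGE_RE.match(age.strip().lower())
    if not match:
        raise ValueError(f"Unrecognized age format: {age!r}")
    amount, unit = int(match.group(1)), match.group(2)
    if unit == "d":
        return today - timedelta(days=amount)
    if unit == "w":
        return today - timedelta(weeks=amount)
    # Years and months: step back whole calendar months.
    months = amount * 12 if unit == "y" else amount
    total = today.year * 12 + (today.month - 1) - months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    # Clamp the day, e.g. Feb 29 minus one year becomes Feb 28.
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(today.day, last_day))
```

With `today = date(2025, 1, 13)`, the call `cutoff_date("3y", today)` gives 2022-01-13 and `cutoff_date("30d", today)` gives 2024-12-14.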
Complete Workflow (Recommended)
# 1. Preview what will be archived
gmailarchiver archive 3y --dry-run
# 2. Archive without deletion (using zstd compression)
gmailarchiver archive 3y --compress zstd
# → Creates: archive_20250113.mbox.zst
# 3. Validate the archive
gmailarchiver validate archive_20250113.mbox.zst
# 4. Move emails to trash (reversible for 30 days)
gmailarchiver archive 3y --trash
# 5. (Optional) Permanent deletion after verification
# ⚠️ Only after you've verified the archive!
gmailarchiver archive 3y --delete
📝 All Commands
Archive Command
# Archive with different time periods
gmailarchiver archive 1y # 1 year old
gmailarchiver archive 6m # 6 months old
gmailarchiver archive 30d # 30 days old
# Archive with compression options
gmailarchiver archive 3y --compress zstd # zstd (fastest, recommended)
gmailarchiver archive 3y --compress gzip # gzip (more compatible)
gmailarchiver archive 3y --compress lzma # lzma (smallest size)
# Archive and delete
gmailarchiver archive 3y --trash # Move to trash (reversible)
gmailarchiver archive 3y --delete # Permanent delete (requires confirmation)
# Custom output file
gmailarchiver archive 6m --output old_emails.mbox.gz --compress gzip
Validation Command
# Validate any archive (auto-detects compression)
gmailarchiver validate archive_20250113.mbox
gmailarchiver validate archive_20250113.mbox.gz
gmailarchiver validate archive_20250113.mbox.zst
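Auto-detection of compression is commonly done by inspecting the first bytes of a file rather than trusting its extension. The sketch below shows that technique with the well-known magic numbers for gzip, xz, and zstd. `detect_compression` is a hypothetical helper, and mapping the tool's "lzma" option to the xz container (the default of Python's `lzma` module) is an assumption.

```python
import gzip
import lzma

# Leading "magic" bytes of each container format.
MAGIC = {
    b"\x1f\x8b": "gzip",
    b"\xfd7zXZ\x00": "lzma",      # xz container (assumed for the lzma option)
    b"\x28\xb5\x2f\xfd": "zstd",  # zstd frame magic number
}


def detect_compression(header: bytes) -> str:
    """Guess the compression format from the first bytes of a file."""
    for magic, name in MAGIC.items():
        if header.startswith(magic):
            return name
    return "none"  # treat anything else as a plain mbox


# The standard library writers produce the expected headers.
gzip_header = gzip.compress(b"From alice@example.com\n")[:8]
xz_header = lzma.compress(b"From alice@example.com\n")[:8]
```

In practice the header would come from `open(path, "rb").read(8)`, so the same check works for files named without a telltale extension.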
Status Command
# Show archiving statistics
gmailarchiver status
Authentication Commands
# Reset authentication (revoke and delete token)
gmailarchiver auth-reset
# Use custom credentials file
gmailarchiver archive 3y --credentials my_credentials.json
Migration Commands (v1.1+)
# Migrate v1.0 database to v1.1 (automatic on first run)
gmailarchiver migrate
# Show database schema version and statistics
gmailarchiver db-info
# Rollback to backup (if migration fails)
gmailarchiver rollback --backup-file archive_state.db.backup.20250114_120000
Search Commands (v1.1+)
# Search with Gmail-style syntax
gmailarchiver search "from:alice meeting"
gmailarchiver search "subject:invoice after:2024-01-01"
gmailarchiver search "payment" --limit 50
# Search with filters
gmailarchiver search --from alice@example.com --subject report
gmailarchiver search --after 2024-01-01 --before 2024-12-31
# JSON output for scripting
gmailarchiver search "invoice" --json
Import Commands (v1.1+)
# Import existing mbox archive
gmailarchiver import old_archive.mbox
# Import multiple archives with glob pattern
gmailarchiver import "archive_*.mbox.gz"
# Import with custom account ID
gmailarchiver import external.mbox --account-id backup_2024
Deduplication Commands (v1.1+)
# Analyze duplicates (preview only)
gmailarchiver dedupe-report
# Remove duplicates (with confirmation)
gmailarchiver dedupe --strategy newest
# Dry run
gmailarchiver dedupe --dry-run
Consolidation Commands (v1.1+)
# Merge multiple archives
gmailarchiver consolidate archive_*.mbox -o merged.mbox
# Merge with options
gmailarchiver consolidate old1.mbox old2.mbox -o consolidated.mbox.gz
gmailarchiver consolidate "archives/*.mbox" --no-sort --no-dedupe -o unsorted.mbox
gmailarchiver consolidate archive*.mbox -o merged.mbox.zst --dedupe-strategy newest
Enhanced Validation Commands (v1.1+)
# Verify mbox offset accuracy (v1.1 databases only)
gmailarchiver verify-offsets archive_20250114.mbox.gz
# Deep database consistency check
gmailarchiver verify-consistency archive_20250114.mbox.gz
Retry Failed Operations (v1.1+)
# Retry deletion after OAuth scope fix
gmailarchiver retry-delete archive_20250114.mbox --permanent
# Preview what will be retried (dry run)
gmailarchiver retry-delete archive_20250114.mbox --dry-run
🔄 Incremental Archiving
Gmail Archiver automatically tracks archived messages, so you can run it repeatedly without re-archiving the same emails:
# First run - archives all emails older than 3 years
gmailarchiver archive 3y --compress zstd
# Future runs - only archives NEW emails older than 3 years
gmailarchiver archive 3y --compress zstd
The tool maintains a SQLite database (archive_state.db) that tracks which messages have been archived.
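The incremental check can be pictured as a set difference between the message IDs Gmail returns and the IDs already recorded locally. The sketch below uses an in-memory SQLite database; the table name `archived_messages`, the column `gmail_id`, and the helper `filter_unarchived` are hypothetical and do not describe the real schema of archive_state.db.

```python
import sqlite3


def filter_unarchived(conn: sqlite3.Connection, candidate_ids: list[str]) -> list[str]:
    """Return only the candidate IDs that are not yet recorded as archived."""
    archived = {row[0] for row in conn.execute("SELECT gmail_id FROM archived_messages")}
    return [gmail_id for gmail_id in candidate_ids if gmail_id not in archived]


# Demonstration with a throwaway in-memory database (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE archived_messages (gmail_id TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO archived_messages VALUES (?)", [("a1",), ("b2",)])
conn.commit()

pending = filter_unarchived(conn, ["a1", "c3", "b2", "d4"])
```

Here `pending` contains only `c3` and `d4`, so a second run downloads nothing that the first run already stored.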
🛡️ Safety Features
- Dry-run mode: Preview operations without making changes (--dry-run)
- Multi-layer validation: Before deletion, validate:
  - Message count matches
  - Database cross-check
  - Content integrity (checksums)
  - Spot-check sampling
- Trash-first workflow: Move to trash (reversible for 30 days) before permanent deletion
- Explicit confirmation: Must type exact phrase to confirm permanent deletion
- Incremental mode: Prevents duplicate archiving of messages
- Automatic rate limiting: Handles Gmail API limits with exponential backoff
- Atomic operations: Database transactions with auto-rollback on errors
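The atomic-operations point can be illustrated with Python's `sqlite3` module, where using a connection as a context manager commits on success and rolls back if an exception escapes the block. This is a generic sketch of that pattern; the table `archived_messages` and the function `record_batch` are hypothetical, not the tool's schema or code.

```python
import sqlite3


def record_batch(conn: sqlite3.Connection, gmail_ids: list[str]) -> None:
    """Insert a batch of IDs; either all rows are stored or none are."""
    # "with conn" commits on success and rolls back on any exception.
    with conn:
        for gmail_id in gmail_ids:
            conn.execute(
                "INSERT INTO archived_messages (gmail_id) VALUES (?)",
                (gmail_id,),
            )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE archived_messages (gmail_id TEXT PRIMARY KEY)")

record_batch(conn, ["a", "b"])          # succeeds, both rows committed
try:
    record_batch(conn, ["c", "a"])      # "a" violates the primary key
except sqlite3.IntegrityError:
    pass                                # "c" was rolled back with the batch

stored = [row[0] for row in conn.execute(
    "SELECT gmail_id FROM archived_messages ORDER BY gmail_id")]
```

After the failed second batch, `stored` still holds only `a` and `b`, so the database never reflects a half-written run.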
⚡ Performance
Typical performance with Gmail API rate limits:
| Emails | Time |
|---|---|
| 10,000 | ~25-30 minutes |
| 50,000 | ~2-2.5 hours |
| 100,000 | ~4-5 hours |
Tips for large mailboxes:
- Use --compress zstd for fastest compression
- Consider splitting into smaller date ranges
- Run during off-hours to avoid interruptions
🔧 Troubleshooting
Authentication Issues
Problem: "Credentials file not found" or authentication fails
Solution:
# Reset authentication
gmailarchiver auth-reset
# Then run any command to re-authenticate
gmailarchiver archive 3y --dry-run
Rate Limit Errors
Problem: "Rate limit exceeded" errors
Solution: The tool automatically retries with exponential backoff. For very large mailboxes, consider:
- Running during off-peak hours
- Splitting into smaller date ranges (e.g., 1y instead of 5y)
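For readers unfamiliar with exponential backoff, the retry pattern looks roughly like this. It is a generic sketch, not the tool's retry code: `RateLimitError` stands in for whatever rate-limit error the API client raises, and `call_with_backoff`, `max_attempts`, and `base_delay` are hypothetical names and defaults.

```python
import random
import time


class RateLimitError(Exception):
    """Stand-in for a rate-limit response from the API (hypothetical)."""


def call_with_backoff(operation, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call `operation`, waiting longer after each rate-limit failure."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error
            # Delay doubles each time (1s, 2s, 4s, ...) plus random jitter
            # so that parallel clients do not retry in lockstep.
            sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
```

The `sleep` parameter exists only so the delay can be replaced in tests; in normal use the default `time.sleep` applies.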
Validation Failures
Problem: Archive validation fails
Solution: DO NOT delete until validation passes. Check:
- Archive file exists and is readable
- Sufficient disk space available
- State database not corrupted
- All messages were successfully archived
If validation continues to fail, keep the archive and do not delete from Gmail.
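Two of the checks listed above, message count and content checksum, can be reproduced by hand with the standard library when diagnosing a failure. The sketch below assumes an uncompressed mbox file; `sha256_of`, `count_messages`, and `archive_matches` are hypothetical helpers and not the tool's validation code.

```python
import hashlib
import mailbox
import os
import tempfile

SAMPLE = (
    b"From alice@example.com Mon Jan 13 10:00:00 2025\n"
    b"Subject: first\n\nHello\n\n"
    b"From bob@example.com Tue Jan 14 11:00:00 2025\n"
    b"Subject: second\n\nWorld\n\n"
)


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large archives do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def count_messages(path: str) -> int:
    """Count messages using the standard library's mbox reader."""
    box = mailbox.mbox(path, create=False)
    try:
        return len(box)
    finally:
        box.close()


def archive_matches(path: str, expected_count: int, expected_sha256: str) -> bool:
    """True only if both the message count and the checksum agree."""
    return count_messages(path) == expected_count and sha256_of(path) == expected_sha256


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "sample.mbox")
    with open(path, "wb") as handle:
        handle.write(SAMPLE)
    recorded = sha256_of(path)
    ok = archive_matches(path, expected_count=2, expected_sha256=recorded)
    mismatch = archive_matches(path, expected_count=3, expected_sha256=recorded)
```

If either value disagrees with what was recorded at archive time, treat the archive as suspect and leave the Gmail copies untouched.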
Disk Space
Problem: Running out of disk space
Solution:
- Use compression: --compress zstd (typically 50-70% space savings)
- Archive smaller time ranges
- Check available space before archiving: df -h (Linux/macOS) or dir (Windows)
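The free-space check can also be scripted in a cross-platform way with `shutil.disk_usage` from the standard library. `has_room_for` and its `safety_margin` default are hypothetical choices for illustration; the tool itself may or may not perform such a check.

```python
import shutil


def has_room_for(directory: str, required_bytes: int, safety_margin: float = 1.2) -> bool:
    """Return True if `directory` has the required space plus a margin."""
    free = shutil.disk_usage(directory).free
    return free >= required_bytes * safety_margin


# Example: is there room for a 2 GiB archive in the current directory?
enough = has_room_for(".", 2 * 1024**3)
```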
🤝 Contributing
We welcome contributions! See CONTRIBUTING.md for:
- Development setup
- Testing guidelines
- Code quality standards
- Pull request process
📄 License
Apache-2.0 License. See LICENSE for details.
⚠️ Disclaimer
This tool permanently deletes emails when using --delete. Always:
- ✅ Test with --dry-run first
- ✅ Validate archives before deletion
- ✅ Use --trash for reversible deletion
- ✅ Keep backups of important emails
The authors are not responsible for data loss. Use at your own risk.