A Python tool to crawl an IMAP inbox and extract emails as structured records
imapcrawler
A Python tool to crawl an IMAP inbox and extract emails as structured records for other tools.
IMAP Email Inbox Crawler
A Python tool for crawling emails from an IMAP server, fetching raw email data, and cleaning/parsing the content for later analysis or use in applications like knowledge bases.
Features
- Fetch emails from IMAP servers (including Gmail, Outlook, etc.)
- Support for querying by date ranges, months, or specific days
- Raw email extraction with metadata
- Email content cleaning and parsing (HTML to text, quote removal, etc.)
- Configurable file handling modes (merge, overwrite, raise)
- Persistent configuration storage
- Command-line interface for easy automation
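The three file handling modes can be pictured with a small sketch. The function name `write_records` and the exact semantics (merge appends, overwrite replaces, raise refuses to touch an existing file) are assumptions for illustration, not taken from the imapcrawler source.

```python
import json
from pathlib import Path


def write_records(path, records, mode="merge"):
    """Write records as JSON Lines, honoring a file handling mode.

    Assumed semantics (hypothetical, for illustration only):
      merge     -> append to an existing file
      overwrite -> replace an existing file
      raise     -> refuse to touch an existing file
    """
    path = Path(path)
    if mode not in {"merge", "overwrite", "raise"}:
        raise ValueError(f"unknown mode: {mode!r}")
    if mode == "raise" and path.exists():
        raise FileExistsError(f"{path} already exists")
    # "a" keeps existing lines, "w" truncates the file first.
    open_mode = "a" if mode == "merge" else "w"
    with path.open(open_mode, encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
```

A real merge may also deduplicate records, which this sketch does not attempt.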
Installation
To install the package from PyPI, run:
pip install imapcrawler
The package has the following optional dependencies, which improve the quality of text extraction:
- tqdm
- beautifulsoup4
- python-dateutil
- mail-parser-reply
The package can be installed with all optional dependencies like this:
pip install imapcrawler[all]
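To see which optional dependencies are present in the current environment, a short check like the following may help. The import names (`bs4`, `dateutil`, `mailparser_reply`) are assumptions based on how those projects are usually packaged, and `missing_optional_dependencies` is a hypothetical helper, not part of imapcrawler.

```python
from importlib.util import find_spec

# Distribution name -> import name. The import names are assumptions
# based on each project's usual packaging, not taken from imapcrawler.
OPTIONAL_MODULES = {
    "tqdm": "tqdm",
    "beautifulsoup4": "bs4",
    "python-dateutil": "dateutil",
    "mail-parser-reply": "mailparser_reply",
}


def missing_optional_dependencies(modules=OPTIONAL_MODULES):
    """Return the distribution names whose module cannot be found."""
    return [dist for dist, module in modules.items() if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_optional_dependencies()
    print("missing optional dependencies:", ", ".join(missing) or "none")
```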
Usage
NOTE: If installed via pip, you can run the tool either as python imapcrawler.py or simply as imapcrawler.
Basic Commands
- Set up configuration:
  python imapcrawler.py config-set --server imap.example.com --email user@example.com
- Download raw emails:
  python imapcrawler.py download --month 2023-06 --limit 100
- Clean downloaded emails:
  python imapcrawler.py clean
Command Reference
config-set
Set persistent configuration values interactively or via arguments.
config-show
Display current configuration.
config-clear
Clear all configuration values.
config-default
Reset configuration to factory defaults.
download
Fetch raw emails from IMAP server.
Options:
- --date: Specific date (YYYY-MM-DD)
- --month: Month to query (YYYY-MM)
- --limit: Limit number of emails (-1 for all)
- --diff: Skip already known emails
- --filepath_raw: Output file for raw emails
- --filepath_clean: Output file for cleaned emails
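For readers curious how a --month or --date value could map onto an IMAP query, here is a minimal sketch that builds standard IMAP SEARCH criteria (SINCE, BEFORE, ON). The helper names are hypothetical, and how imapcrawler builds its own queries is not shown in this document.

```python
from datetime import date

# IMAP dates use English month abbreviations regardless of locale.
_MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
           "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def imap_date(d):
    """Format a date as DD-Mon-YYYY, the form IMAP SEARCH expects."""
    return f"{d.day:02d}-{_MONTHS[d.month - 1]}-{d.year}"


def month_criteria(month):
    """Turn 'YYYY-MM' into criteria covering that whole month."""
    year, mon = map(int, month.split("-"))
    start = date(year, mon, 1)
    # BEFORE is exclusive, so the end is the first day of the next month.
    end = date(year + 1, 1, 1) if mon == 12 else date(year, mon + 1, 1)
    return f"(SINCE {imap_date(start)} BEFORE {imap_date(end)})"


def day_criteria(day):
    """Turn 'YYYY-MM-DD' into criteria matching that single day."""
    return f"(ON {imap_date(date.fromisoformat(day))})"
```

A string such as the one returned by `month_criteria("2023-06")` can be passed to `imaplib.IMAP4.search(None, criteria)` after selecting a mailbox.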
clean
Process raw emails and save cleaned version.
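As a rough picture of what cleaning involves, the sketch below converts HTML to text and drops quoted reply lines using only the standard library. It is a simplified stand-in under stated assumptions: the actual tool may rely on beautifulsoup4 and mail-parser-reply when they are installed, and the names `html_to_text` and `strip_quoted_lines` are hypothetical.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag in ("br", "p", "div"):
            # Treat block-level tags as line breaks.
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML fragment."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def strip_quoted_lines(text):
    """Drop lines starting with '>' (quoted replies) and blank lines."""
    kept = [line.strip() for line in text.splitlines()
            if not line.lstrip().startswith(">")]
    return "\n".join(line for line in kept if line)
```

Signature detection is deliberately left out, since it needs heuristics that go beyond a short example.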
peek-raw / peek-clean
Show a random email from raw or cleaned files.
Configuration File
Configuration is stored in ~/.imapcrawler_config.json and includes:
- Server address
- Email address
- File paths for raw and cleaned emails
- File handling mode
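Because the configuration is plain JSON, other scripts can read it directly. This is a minimal sketch that assumes only the file location stated above; the key names inside the file are not documented here, so the code prints whatever it finds. `load_config` is a hypothetical helper.

```python
import json
from pathlib import Path

# Location as stated in this README.
CONFIG_PATH = Path.home() / ".imapcrawler_config.json"


def load_config(path=CONFIG_PATH):
    """Return the stored configuration, or {} if no file exists yet."""
    path = Path(path)
    if not path.exists():
        return {}
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


if __name__ == "__main__":
    for key, value in load_config().items():
        print(f"{key}: {value}")
```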
Output Files
- emails_raw.jsonl: Raw email data with full metadata
- emails.jsonl: Cleaned email data with processed content
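Both output files use the JSON Lines format, one JSON object per line, so they can be streamed without loading everything into memory. The sketch below makes no assumption about field names inside each record; `iter_jsonl` is a hypothetical helper.

```python
import json


def iter_jsonl(path):
    """Yield one parsed record per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


if __name__ == "__main__":
    # Count the cleaned emails produced by the clean command.
    print(sum(1 for _ in iter_jsonl("emails.jsonl")))
```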
Example Workflow
# Configure once
python imapcrawler.py config-set --server imap.gmail.com --email user@gmail.com
# Download emails from June 2023
python imapcrawler.py download --month 2023-06 --limit 500
# Clean the downloaded emails
python imapcrawler.py clean
# View a sample
python imapcrawler.py peek-clean
Notes
- Passwords are prompted securely when not provided via command line
- The tool handles large email volumes efficiently with progress bars
- Supports various IMAP servers including Gmail, Outlook, and custom servers
- Email cleaning removes HTML tags, quoted text, and signature blocks
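The password behaviour described in the notes can be sketched with the standard library: use the value given on the command line if there is one, otherwise prompt without echoing. The function names and the prompt text are hypothetical; only `getpass.getpass`, `imaplib.IMAP4_SSL`, and `login` are real standard-library calls.

```python
import getpass
import imaplib


def resolve_password(cli_password=None, prompt=getpass.getpass):
    """Return the CLI-supplied password, or prompt for one without echo."""
    if cli_password:
        return cli_password
    return prompt("IMAP password: ")


def connect(server, email, cli_password=None):
    """Open an SSL connection to the IMAP server and log in."""
    password = resolve_password(cli_password)
    connection = imaplib.IMAP4_SSL(server)
    connection.login(email, password)
    return connection
```

Prompting keeps the password out of shell history, which is why it is generally preferable to passing it as an argument.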
File details
Details for the file imapcrawler-0.1.2.tar.gz.
File metadata
- Download URL: imapcrawler-0.1.2.tar.gz
- Upload date:
- Size: 13.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0d7f906ea0faadbe45eb84242fcb258ab0e945500a59ce5120413017fb767f48 |
| MD5 | 54fd6f6d745ba706cee4b1ccd0d614f4 |
| BLAKE2b-256 | 82e22ad04cbb52bfb10d718c5090631c2cd3f5611242c946232123fb0719aaf8 |
File details
Details for the file imapcrawler-0.1.2-py3-none-any.whl.
File metadata
- Download URL: imapcrawler-0.1.2-py3-none-any.whl
- Upload date:
- Size: 13.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 77a19b57ff88c8efbeef63d2b9a3c7222fef159f9104d630c297289b8bff1d9f |
| MD5 | 7c6d1a6e9c725f564cf15cb9fb3b9320 |
| BLAKE2b-256 | 757ecc67a4f9ffbe9206f249cc95a24cf59044b0165bd1faf7cdcd18aedfc296 |