A Python tool to crawl an IMAP inbox and extract emails as structured records

imapcrawler

IMAP Email Inbox Crawler

A Python tool for crawling emails from an IMAP server, fetching raw email data, and cleaning/parsing the content for later analysis or use in applications like knowledge bases.

Features

  • Fetch emails from IMAP servers (including Gmail, Outlook, etc.)
  • Support for querying by date ranges, months, or specific days
  • Raw email extraction with metadata
  • Email content cleaning and parsing (HTML to text, quote removal, etc.)
  • Configurable file handling modes (merge, overwrite, raise)
  • Persistent configuration storage
  • Command-line interface for easy automation
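The three file handling modes are not described further here. As a rough illustration only, the sketch below shows one plausible reading of them for a JSONL output file: "merge" combines new records with those already on disk, "overwrite" replaces the file, and "raise" refuses to touch an existing file. The function name write_records, the id key, and the exact semantics of each mode are assumptions, not the package's documented behaviour.

```python
import json
import os


def write_records(path, records, mode="merge", key="id"):
    """Write dict records to a JSONL file (hypothetical helper, assumed semantics).

    Returns the number of records in the file after writing.
    """
    if mode not in ("merge", "overwrite", "raise"):
        raise ValueError(f"unknown mode: {mode!r}")

    exists = os.path.exists(path)
    if exists and mode == "raise":
        # Assumed meaning of "raise": never touch an existing file.
        raise FileExistsError(path)

    merged = {}
    if exists and mode == "merge":
        # Assumed meaning of "merge": keep old records, keyed by `key`.
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    record = json.loads(line)
                    merged[record[key]] = record

    # New records replace old ones that share the same key.
    for record in records:
        merged[record[key]] = record

    with open(path, "w", encoding="utf-8") as handle:
        for record in merged.values():
            handle.write(json.dumps(record) + "\n")
    return len(merged)
```

Keying records on a stable identifier is also one way a "skip already known emails" option could be built, although nothing in this description confirms that the tool works this way.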

Installation

To install the package from PyPI, run:

pip install imapcrawler

The package has the following optional dependencies, which add progress bars and improve the quality of text extraction:

  • tqdm
  • beautifulsoup4
  • python-dateutil
  • mail-parser-reply

To install the package together with all optional dependencies, run:

pip install "imapcrawler[all]"

Usage

NOTE: When the package is installed via pip, the imapcrawler command can be used in place of python imapcrawler.py in every example below.

Basic Commands

  1. Set up configuration:

    python imapcrawler.py config-set --server imap.example.com --email user@example.com
    
  2. Download raw emails:

    python imapcrawler.py download --month 2023-06 --limit 100
    
  3. Clean downloaded emails:

    python imapcrawler.py clean
    

Command Reference

config-set

Set persistent configuration values interactively or via arguments.

config-show

Display current configuration.

config-clear

Clear all configuration values.

config-default

Reset configuration to factory defaults.

download

Fetch raw emails from the IMAP server.

Options:

  • --date - Specific date (YYYY-MM-DD)
  • --month - Month to query (YYYY-MM)
  • --limit - Limit number of emails (-1 for all)
  • --diff - Skip already known emails
  • --filepath_raw - Output file for raw emails
  • --filepath_clean - Output file for cleaned emails
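To make the date options more concrete, here is a minimal sketch of how a --month or --date value could be translated into IMAP SEARCH criteria. It assumes the tool filters with the standard SINCE, BEFORE, and ON search keys, which this description does not confirm; the helper names imap_date, month_criteria, and day_criteria are hypothetical.

```python
from datetime import date

# IMAP date-text always uses English month abbreviations, so a fixed
# table is safer than the locale-dependent strftime("%b").
_MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
           "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def imap_date(d: date) -> str:
    """Format a date as IMAP date-text, e.g. 1-Jun-2023."""
    return f"{d.day}-{_MONTHS[d.month - 1]}-{d.year}"


def month_criteria(month: str) -> str:
    """Hypothetical helper: turn 'YYYY-MM' into a SINCE/BEFORE search string."""
    year, mon = (int(part) for part in month.split("-"))
    start = date(year, mon, 1)
    # First day of the following month; BEFORE excludes that day.
    end = date(year + 1, 1, 1) if mon == 12 else date(year, mon + 1, 1)
    return f"SINCE {imap_date(start)} BEFORE {imap_date(end)}"


def day_criteria(day: str) -> str:
    """Hypothetical helper: turn 'YYYY-MM-DD' into an ON search string."""
    return f"ON {imap_date(date.fromisoformat(day))}"


print(month_criteria("2023-06"))    # SINCE 1-Jun-2023 BEFORE 1-Jul-2023
print(day_criteria("2023-06-15"))   # ON 15-Jun-2023
```

A string built this way could be passed to the search method of a standard-library imaplib.IMAP4_SSL connection, for example conn.search(None, month_criteria("2023-06")).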

clean

Process the raw emails and save a cleaned version.
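The kind of processing involved can be sketched with the standard library alone. The example below converts an HTML body to text and drops quoted reply lines. It is an illustration under stated assumptions, not the package's cleaning code: the real tool can use beautifulsoup4 and mail-parser-reply when they are installed, and the names html_to_text and strip_quotes are hypothetical.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag in ("br", "p", "div"):
            # Block-level tags start a new line in the text output.
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self) -> str:
        return "".join(self._chunks)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML fragment."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def strip_quotes(text: str) -> str:
    """Drop '>'-quoted lines and 'On ... wrote:' attribution lines."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith(">"):
            continue
        if re.match(r"^On .+ wrote:$", stripped):
            continue
        kept.append(line.rstrip())
    return "\n".join(kept).strip()


body = "<p>Hello <b>world</b></p><style>p{}</style><p>&gt; old text</p>"
print(strip_quotes(html_to_text(body)))  # Hello world
```

The quote heuristics here only cover the most common English reply format; signature detection and other languages would need additional rules.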

peek-raw / peek-clean

Show a random email from raw or cleaned files.

Configuration File

Configuration is stored in ~/.imapcrawler_config.json and includes:

  • Server address
  • Email address
  • File paths for raw and cleaned emails
  • File handling mode
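Because the configuration is a plain JSON file, other scripts can read it directly. The sketch below loads it and lists the stored key names. The file path comes from the text above; the helper name load_config is hypothetical, and the key names are deliberately not assumed because they are not documented here.

```python
import json
from pathlib import Path

# Path as stated in this README.
CONFIG_PATH = Path.home() / ".imapcrawler_config.json"


def load_config(path: Path = CONFIG_PATH) -> dict:
    """Return the stored configuration, or an empty dict if the file is absent."""
    if not path.exists():
        return {}
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


if __name__ == "__main__":
    # Print only the key names, since the values may be sensitive.
    for name in load_config():
        print(name)
```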

Output Files

  • emails_raw.jsonl: Raw email data with full metadata
  • emails.jsonl: Cleaned email data with processed content
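Both files use the JSON Lines format, with one JSON object per line, so downstream tools can stream them without loading everything into memory. A minimal reader might look like the sketch below. The helper name iter_records is hypothetical, and since the record fields are not documented here, the usage example only inspects whatever keys the first record contains.

```python
import json
from typing import Iterator


def iter_records(path: str) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


if __name__ == "__main__":
    # Show the field names of the first cleaned email, if the file has one.
    first = next(iter_records("emails.jsonl"), None)
    if first is not None:
        print(sorted(first))
```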

Example Workflow

# Configure once
python imapcrawler.py config-set --server imap.gmail.com --email user@gmail.com

# Download emails from June 2023
python imapcrawler.py download --month 2023-06 --limit 500

# Clean the downloaded emails
python imapcrawler.py clean

# View a sample
python imapcrawler.py peek-clean

Notes

  • Passwords are prompted for securely when not provided on the command line
  • The tool handles large email volumes efficiently with progress bars
  • Supports various IMAP servers including Gmail, Outlook, and custom servers
  • Email cleaning removes HTML tags, quoted text, and signature blocks

Download files

Source Distribution

imapcrawler-0.1.2.tar.gz (13.2 kB)

Uploaded Source

Built Distribution

imapcrawler-0.1.2-py3-none-any.whl (13.5 kB)

Uploaded Python 3

File details

Details for the file imapcrawler-0.1.2.tar.gz.

File metadata

  • Download URL: imapcrawler-0.1.2.tar.gz
  • Upload date:
  • Size: 13.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.9

File hashes

Hashes for imapcrawler-0.1.2.tar.gz
Algorithm Hash digest
SHA256 0d7f906ea0faadbe45eb84242fcb258ab0e945500a59ce5120413017fb767f48
MD5 54fd6f6d745ba706cee4b1ccd0d614f4
BLAKE2b-256 82e22ad04cbb52bfb10d718c5090631c2cd3f5611242c946232123fb0719aaf8

File details

Details for the file imapcrawler-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: imapcrawler-0.1.2-py3-none-any.whl
  • Upload date:
  • Size: 13.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.9

File hashes

Hashes for imapcrawler-0.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 77a19b57ff88c8efbeef63d2b9a3c7222fef159f9104d630c297289b8bff1d9f
MD5 7c6d1a6e9c725f564cf15cb9fb3b9320
BLAKE2b-256 757ecc67a4f9ffbe9206f249cc95a24cf59044b0165bd1faf7cdcd18aedfc296
