A Python tool to crawl an IMAP inbox and extract emails as structured records

imapcrawler

A Python tool to crawl an IMAP inbox and extract emails as structured records for other tools.

IMAP Email Inbox Crawler

A Python tool for crawling emails from an IMAP server, fetching raw email data, and cleaning/parsing the content for later analysis or use in applications like knowledge bases.

Features

  • Fetch emails from IMAP servers (including Gmail, Outlook, etc.)
  • Support for querying by date ranges, months, or specific days
  • Raw email extraction with metadata
  • Email content cleaning and parsing (HTML to text, quote removal, etc.)
  • Configurable file handling modes (merge, overwrite, raise)
  • Persistent configuration storage
  • Command-line interface for easy automation

Installation

To install the package from PyPI, run:

pip install imapcrawler

The package has the following optional dependencies, which add progress bars and improve the quality of text extraction:

  • tqdm
  • beautifulsoup4
  • python-dateutil
  • mail-parser-reply

The package can be installed with all optional dependencies like this:

pip install "imapcrawler[all]"

Usage

NOTE: If installed via pip, you can use either python imapcrawler.py or just imapcrawler.

Basic Commands

  1. Set up configuration:

    python imapcrawler.py config-set --server imap.example.com --email user@example.com
    
  2. Download raw emails:

    python imapcrawler.py download --month 2023-06 --limit 100
    
  3. Clean downloaded emails:

    python imapcrawler.py clean
    

Command Reference

config-set

Set persistent configuration values interactively or via arguments.

config-show

Display current configuration.

config-clear

Clear all configuration values.

config-default

Reset configuration to factory defaults.

download

Fetch raw emails from IMAP server.

Options:

  • --date - Specific date (YYYY-MM-DD)
  • --month - Month to query (YYYY-MM)
  • --limit - Limit number of emails (-1 for all)
  • --diff - Skip already known emails
  • --filepath_raw - Output file for raw emails
  • --filepath_clean - Output file for cleaned emails
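
To make the --month option more concrete, the following is a minimal sketch of how a YYYY-MM value could be translated into an IMAP SEARCH date range. It is an illustration only, not the tool's actual implementation; the function names imap_date and month_to_search_criteria are hypothetical. The SINCE and BEFORE keys and the DD-Mon-YYYY date format come from the IMAP standard.

```python
from datetime import date

# IMAP date strings use fixed English month abbreviations,
# so avoid the locale-dependent strftime("%b").
_MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
           "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def imap_date(d: date) -> str:
    """Format a date the way IMAP SEARCH expects, e.g. 01-Jun-2023."""
    return f"{d.day:02d}-{_MONTHS[d.month - 1]}-{d.year}"


def month_to_search_criteria(month: str) -> str:
    """Turn 'YYYY-MM' into an IMAP SEARCH string covering that month.

    Hypothetical helper; imapcrawler may build its query differently.
    """
    year, mon = (int(part) for part in month.split("-"))
    start = date(year, mon, 1)
    # First day of the following month; BEFORE is exclusive.
    end = date(year + 1, 1, 1) if mon == 12 else date(year, mon + 1, 1)
    return f"(SINCE {imap_date(start)} BEFORE {imap_date(end)})"


print(month_to_search_criteria("2023-06"))
# (SINCE 01-Jun-2023 BEFORE 01-Jul-2023)
```

A string like this could be passed to imaplib's IMAP4.search() in a custom script. Because BEFORE excludes the given day, the range ends on the first day of the next month.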

clean

Process raw emails and save cleaned version.

peek-raw / peek-clean

Show a random email from raw or cleaned files.

Configuration File

Configuration is stored in ~/.imapcrawler_config.json and includes:

  • Server address
  • Email address
  • File paths for raw and cleaned emails
  • File handling mode

Output Files

  • emails_raw.jsonl: Raw email data with full metadata
  • emails.jsonl: Cleaned email data with processed content
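
Both files use the .jsonl extension, which conventionally means one JSON object per line. Assuming that layout, a downstream tool could stream the records as sketched below. The record fields are not specified here, so the example only prints the field names it finds; read_jsonl is a hypothetical helper.

```python
import json
from pathlib import Path
from typing import Iterator


def read_jsonl(path: str) -> Iterator[dict]:
    """Yield one record per non-empty line of a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


if Path("emails.jsonl").exists():
    for record in read_jsonl("emails.jsonl"):
        # Show which fields the first record carries, then stop.
        print(sorted(record))
        break
```

Reading line by line keeps memory use flat even when the file holds many emails.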

Example Workflow

# Configure once
python imapcrawler.py config-set --server imap.gmail.com --email user@gmail.com

# Download emails from June 2023
python imapcrawler.py download --month 2023-06 --limit 500

# Clean the downloaded emails
python imapcrawler.py clean

# View a sample
python imapcrawler.py peek-clean

Notes

  • The password is prompted for securely when it is not provided on the command line
  • Progress bars show status while large volumes of email are processed
  • Supports various IMAP servers including Gmail, Outlook, and custom servers
  • Email cleaning removes HTML tags, quoted text, and signature blocks
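
To illustrate what this kind of cleaning involves, here is a simplified sketch using only the standard library. It is not imapcrawler's implementation, which can rely on beautifulsoup4 and mail-parser-reply when they are installed. The function names are hypothetical, and the rules (drop lines starting with ">", cut at a "-- " signature delimiter) are common conventions assumed for this example.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag in ("br", "p", "div"):
            self.parts.append("\n")  # keep block boundaries as line breaks

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def html_to_text(html):
    """Strip tags from an HTML fragment and return its text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


def strip_quotes_and_signature(text):
    """Drop '>'-quoted lines and everything after a '-- ' signature line."""
    kept = []
    for line in text.splitlines():
        if line.rstrip() == "--":  # signature delimiter, trailing space optional
            break
        if line.lstrip().startswith(">"):
            continue
        kept.append(line)
    return "\n".join(kept).strip()


html = "<p>Thanks, see you Monday.</p><p>&gt; Are we still on?</p><p>-- <br>Alice</p>"
print(strip_quotes_and_signature(html_to_text(html)))
# Thanks, see you Monday.
```

Real-world mail needs more care than this (nested quotes, "On ... wrote:" headers, non-standard signatures), which is why dedicated libraries are preferable where available.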
