A Python tool to crawl an IMAP inbox and extract emails as structured records

imapcrawler

A Python tool to crawl an IMAP inbox and extract emails as structured records for other tools.

IMAP Email Inbox Crawler

A Python tool for crawling emails from an IMAP server, fetching raw email data, and cleaning/parsing the content for later analysis or use in applications like knowledge bases.

Features

  • Fetch emails from IMAP servers (including Gmail, Outlook, etc.)
  • Support for querying by date ranges, months, or specific days
  • Raw email extraction with metadata
  • Email content cleaning and parsing (HTML to text, quote removal, etc.)
  • Configurable file handling modes (merge, overwrite, raise)
  • Persistent configuration storage
  • Command-line interface for easy automation

Installation

To install the package from PyPI, run:

pip install imapcrawler

The package has the following optional dependencies, which improve the quality of text extraction and add conveniences such as progress bars:

  • tqdm
  • beautifulsoup4
  • python-dateutil
  • mail-parser-reply

The package can be installed with all optional dependencies like this (the quotes keep shells such as zsh from interpreting the brackets):

pip install "imapcrawler[all]"
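
How imapcrawler itself reacts to a missing optional package is not documented here. A common pattern for optional dependencies, sketched below under that assumption, is to try the import and fall back to a plain substitute. The `count_items` helper is hypothetical and exists only to exercise the fallback.

```python
# Sketch of an optional-dependency fallback; not taken from imapcrawler's source.
try:
    from tqdm import tqdm  # optional: shows a progress bar
except ImportError:
    def tqdm(iterable, **kwargs):
        """Fallback: return the iterable unchanged, without a progress bar."""
        return iterable


def count_items(items):
    """Hypothetical helper that iterates with a progress bar when available."""
    total = 0
    for _ in tqdm(items, desc="Processing"):
        total += 1
    return total
```

With this pattern the same code path runs whether or not tqdm is installed; only the progress display differs.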

Usage

Note: if installed via pip, you can run the tool either as python imapcrawler.py or simply as imapcrawler.

Basic Commands

  1. Set up configuration:

    python imapcrawler.py config-set --server imap.example.com --email user@example.com
    
  2. Download raw emails:

    python imapcrawler.py download --month 2023-06 --limit 100
    
  3. Clean downloaded emails:

    python imapcrawler.py clean
    

Command Reference

config-set

Set persistent configuration values interactively or via arguments.

config-show

Display current configuration.

config-clear

Clear all configuration values.

config-default

Reset configuration to factory defaults.

download

Fetch raw emails from the IMAP server.

Options:

  • --date - Specific date (YYYY-MM-DD)
  • --month - Month to query (YYYY-MM)
  • --limit - Maximum number of emails to fetch (-1 for all)
  • --diff - Skip already known emails
  • --filepath_raw - Output file for raw emails
  • --filepath_clean - Output file for cleaned emails
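
To illustrate how a --month or --date value could map onto an IMAP query, here is a minimal sketch that builds standard IMAP SEARCH criteria (SINCE is inclusive, BEFORE is exclusive). The function names are hypothetical, and the tool's real query construction may differ.

```python
from datetime import date

# IMAP dates use English month abbreviations regardless of locale.
_MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
           "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def imap_date(d: date) -> str:
    """Format a date as dd-Mon-yyyy, the form IMAP SEARCH expects."""
    return f"{d.day:02d}-{_MONTHS[d.month - 1]}-{d.year}"


def month_criteria(month: str) -> str:
    """Turn 'YYYY-MM' into criteria covering that whole month."""
    year, mon = (int(part) for part in month.split("-"))
    start = date(year, mon, 1)
    # First day of the following month, rolling over the year in December.
    end = date(year + 1, 1, 1) if mon == 12 else date(year, mon + 1, 1)
    return f"(SINCE {imap_date(start)} BEFORE {imap_date(end)})"


def day_criteria(day: str) -> str:
    """Turn 'YYYY-MM-DD' into criteria matching that single day."""
    return f"(ON {imap_date(date.fromisoformat(day))})"
```

The resulting string can be passed to the standard library's `imaplib`, for example `connection.search(None, month_criteria("2023-06"))` on a logged-in connection with a mailbox selected.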

clean

Process raw emails and save a cleaned version.
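
The cleaning step is described only at a high level. As a rough, standard-library-only sketch of two of the operations it mentions (HTML to text and quote removal), consider the following. The function names are hypothetical; the tool itself may rely on beautifulsoup4 and mail-parser-reply when they are installed.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag in ("br", "p", "div"):
            self.parts.append("\n")  # keep block boundaries as line breaks

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def html_to_text(html: str) -> str:
    """Strip tags from an HTML body and return the remaining text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)


def strip_quotes(text: str) -> str:
    """Drop lines that start with '>', the usual marker for quoted replies."""
    kept = [ln for ln in text.splitlines() if not ln.lstrip().startswith(">")]
    return "\n".join(kept).strip()
```

This handles only the simplest cases; signature detection and "On ... wrote:" headers need heuristics that dedicated libraries cover better.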

peek-raw / peek-clean

Show a random email from raw or cleaned files.
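
Since the output files are JSON Lines, the same kind of peek can be done from Python. This is a small sketch with a hypothetical `peek` helper; it assumes one JSON object per line and makes no assumption about the fields inside each record.

```python
import json
import random


def peek(path, rng=random):
    """Return one random record from a JSONL file, or None if it is empty."""
    with open(path, encoding="utf-8") as fh:
        records = [json.loads(line) for line in fh if line.strip()]
    if not records:
        return None
    return rng.choice(records)
```

For example, `peek("emails.jsonl")` returns a single cleaned record as a dictionary.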

Configuration File

Configuration is stored in ~/.imapcrawler_config.json and includes:

  • Server address
  • Email address
  • File paths for raw and cleaned emails
  • File handling mode
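
The key names inside the configuration file are not listed here. The sketch below shows one way such a JSON file could be read and updated; `load_config`, `save_config`, and the `server` and `email` keys in the usage note are illustrative assumptions, not the tool's actual schema.

```python
import json
from pathlib import Path

# Path taken from the description above.
CONFIG_PATH = Path.home() / ".imapcrawler_config.json"


def load_config(path=CONFIG_PATH):
    """Return the stored configuration, or an empty dict if none exists."""
    try:
        return json.loads(Path(path).read_text(encoding="utf-8"))
    except FileNotFoundError:
        return {}


def save_config(updates, path=CONFIG_PATH):
    """Merge non-None values into the stored configuration and write it back."""
    config = load_config(path)
    config.update({k: v for k, v in updates.items() if v is not None})
    Path(path).write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config
```

A call such as `save_config({"server": "imap.example.com", "email": None})` would store the server and leave any previously saved email untouched.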

Output Files

  • emails_raw.jsonl: Raw email data with full metadata
  • emails.jsonl: Cleaned email data with processed content
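
The three file handling modes (merge, overwrite, raise) are named but not specified in detail. One plausible reading, sketched here as an assumption, is that merge combines new records with existing ones, overwrite replaces the file, and raise refuses to touch an existing file. The `write_records` function and the `id` key are hypothetical.

```python
import json
import os


def write_records(path, records, mode="merge", key="id"):
    """Write records to a JSONL file according to the chosen mode.

    Returns the number of records in the file afterwards.
    """
    if mode not in ("merge", "overwrite", "raise"):
        raise ValueError(f"unknown mode: {mode}")
    exists = os.path.exists(path)
    if exists and mode == "raise":
        raise FileExistsError(path)

    merged = {}
    if exists and mode == "merge":
        # Load existing records first so new ones with the same key replace them.
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                if line.strip():
                    rec = json.loads(line)
                    merged[rec[key]] = rec
    for rec in records:
        merged[rec[key]] = rec

    with open(path, "w", encoding="utf-8") as fh:
        for rec in merged.values():
            fh.write(json.dumps(rec) + "\n")
    return len(merged)
```

Keying on a stable identifier keeps repeated downloads from producing duplicate lines, which is similar in spirit to the --diff option.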

Example Workflow

# Configure once
python imapcrawler.py config-set --server imap.gmail.com --email user@gmail.com

# Download emails from June 2023
python imapcrawler.py download --month 2023-06 --limit 500

# Clean the downloaded emails
python imapcrawler.py clean

# View a sample
python imapcrawler.py peek-clean
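
For scheduled or scripted runs, the same workflow can be driven from Python. This is a minimal sketch that uses only the subcommands and flags shown above; `build_workflow` and `run_workflow` are hypothetical helper names, and it assumes the configuration and password are already available to the tool.

```python
import subprocess


def build_workflow(month, limit=-1, executable="imapcrawler"):
    """Return the download and clean commands as argument lists."""
    return [
        [executable, "download", "--month", month, "--limit", str(limit)],
        [executable, "clean"],
    ]


def run_workflow(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Calling `run_workflow(build_workflow("2023-06", limit=500))` would mirror the download and clean steps of the example.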

Notes

  • The password is requested through a secure prompt when it is not provided on the command line
  • Progress bars report status while large volumes of email are processed
  • Supports various IMAP servers including Gmail, Outlook, and custom servers
  • Email cleaning removes HTML tags, quoted text, and signature blocks
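
A secure prompt of the kind mentioned in the first note is typically built on the standard library's `getpass`, which reads input without echoing it. The following sketch assumes that approach; `resolve_password` and its `cli_password` parameter are hypothetical names.

```python
import getpass


def resolve_password(cli_password=None, prompt=getpass.getpass):
    """Use the password given on the command line, else ask without echo."""
    if cli_password:
        return cli_password
    return prompt("IMAP password: ")
```

Prompting is generally preferable to passing the password as an argument, since command-line arguments can end up in shell history and process listings.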

Download files

Source Distribution

imapcrawler-0.1.0.tar.gz (12.2 kB)

Uploaded Source

Built Distribution

imapcrawler-0.1.0-py3-none-any.whl (12.4 kB)

Uploaded Python 3

File details

Details for the file imapcrawler-0.1.0.tar.gz.

File metadata

  • Download URL: imapcrawler-0.1.0.tar.gz
  • Upload date:
  • Size: 12.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.9

File hashes

Hashes for imapcrawler-0.1.0.tar.gz

  Algorithm    Hash digest
  SHA256       d293a6530581893eb74fd829feba53c755c60960c5b50029d68df7e8195b1178
  MD5          029ad047eac4b718f494f972d56524e6
  BLAKE2b-256  67672b9ba7c5bf43ebdc7eb2ab99c833fd6449c84baa2c996c66173de2f1f760

File details

Details for the file imapcrawler-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: imapcrawler-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 12.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.9

File hashes

Hashes for imapcrawler-0.1.0-py3-none-any.whl

  Algorithm    Hash digest
  SHA256       1ca0251ee26afec27005ce0eb6f91a369566624bc79955c9c430eedde06e3033
  MD5          0e2c95f2f35f2235e8dc20f9f252f1d8
  BLAKE2b-256  cdb2a842ded082145d0d064d9048c8ffe1fb66288e8297e8b2deb9747f732371
