SWECC Email Scraper

A Python CLI tool for analyzing email data in mbox format.

Features

  • 📧 Process mbox format email archives
  • 🔧 Unix-style pipeline architecture for flexible processing
  • 📊 Extensible framework for building analysis pipelines
  • Coming soon: More analysis processors...

Installation

From PyPI

pip install swecc-email-scraper

From Source

git clone https://github.com/swecc-uw/swecc-email-scraper.git
cd swecc-email-scraper
pip install -e ".[dev]"  # Install with development dependencies

# Run tests
pytest

Quick Start

The tool uses Unix pipes to compose commands. Each command does one thing and can be combined with others:

  1. Basic usage: get email stats with the example processor:
swecc-email-scraper read mailbox.mbox \
  | swecc-email-scraper stats \
  | swecc-email-scraper format -f json > results.json
  2. List available processors:
swecc-email-scraper list-processors
  3. List available output formats:
swecc-email-scraper list-formats

Command Reference

Read Command

Reads an mbox file and outputs email data as JSON:

swecc-email-scraper read input.mbox > emails.json
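
To make the mbox-to-JSON step concrete, here is a minimal sketch using Python's standard `mailbox` module. It is not the tool's actual implementation: the function name `mbox_to_records` and the output keys (`sender`, `subject`, `date`) are assumptions chosen for illustration, and the real `read` command may emit a different schema.

```python
import mailbox


def mbox_to_records(path):
    """Return one plain dict per message in the mbox file at `path`.

    The keys used here are a hypothetical schema, not the tool's documented output.
    """
    # create=False avoids silently creating an empty file when the path is wrong.
    box = mailbox.mbox(path, create=False)
    try:
        records = []
        for message in box:
            records.append({
                "sender": str(message.get("From", "")),
                "subject": str(message.get("Subject", "")),
                "date": str(message.get("Date", "")),
            })
        return records
    finally:
        box.close()
```

Passing the result to `json.dump(records, sys.stdout)` would produce a JSON array that downstream pipeline stages could consume.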

Stats Command

Processes email data from stdin and outputs statistics:

cat emails.json | swecc-email-scraper stats > stats.json
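
A processor in this style reads JSON from stdin and writes JSON to stdout. The sketch below illustrates that shape only; it assumes the input is a JSON array of objects with a `sender` field, and the statistic names (`total_emails`, `unique_senders`, `top_senders`) and the functions `summarize` and `run_stats` are hypothetical, not the output of the real `stats` command.

```python
import json
from collections import Counter


def summarize(records):
    """Compute a few simple counts over a list of email dicts (hypothetical fields)."""
    senders = Counter(r.get("sender", "") for r in records)
    return {
        "total_emails": len(records),
        "unique_senders": len(senders),
        "top_senders": senders.most_common(3),
    }


def run_stats(stdin, stdout):
    """Read a JSON array from `stdin`, write the summary as JSON to `stdout`."""
    records = json.load(stdin)
    json.dump(summarize(records), stdout, indent=2)
```

Calling `run_stats(sys.stdin, sys.stdout)` from a script would let it sit in the middle of a pipe, between a reader and a formatter.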

Format Command

Formats JSON data using the specified formatter:

cat stats.json \
  | swecc-email-scraper format -f json \
  > formatted.json

Pipeline Examples

  1. Basic email statistics to terminal:
swecc-email-scraper read inbox.mbox \
  | swecc-email-scraper stats \
  | swecc-email-scraper format
  2. Save analysis to a file:
swecc-email-scraper read inbox.mbox \
  | swecc-email-scraper stats \
  > analysis.json
  3. Process with custom formatting:
swecc-email-scraper read inbox.mbox \
  | swecc-email-scraper stats \
  | swecc-email-scraper format -f json \
  > analysis.json
  4. Use with Unix tools:
# Filter emails before analysis
swecc-email-scraper read inbox.mbox \
  | jq 'map(select(.sender | contains("important")))' \
  | swecc-email-scraper stats
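
Where `jq` is not available, the same filtering step could be written as a small Python script placed between `read` and `stats`. This is a sketch under the same assumption the `jq` example makes, namely that `read` emits a JSON array of objects with a string `sender` field; the function names are hypothetical. Unlike `jq`'s `contains`, it skips records with a missing `sender` instead of raising an error.

```python
import json


def filter_by_sender(records, needle):
    """Keep only records whose sender contains `needle` as a substring."""
    return [r for r in records if needle in (r.get("sender") or "")]


def run_filter(stdin, stdout, needle):
    """Read a JSON array from `stdin`, write the filtered array to `stdout`."""
    json.dump(filter_by_sender(json.load(stdin), needle), stdout)
```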

Extending the Tool

The tool is designed to be easily extensible. See CONTRIBUTING.md for detailed information on:

  • Creating custom processors
  • Adding new output formats
  • Contributing to the project
  • Development setup and guidelines

Architecture

The tool uses a Unix pipeline architecture where:

  1. The read command converts mbox files to JSON email data
  2. Processor commands (such as stats) transform or analyze the data
  3. The format command handles output formatting
  4. Standard Unix pipes (|) connect the components
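
The same composition can be driven from Python when a shell is not convenient. The helper below is a generic sketch, not part of the tool: `run_pipeline` is a hypothetical name, and it runs the stages one after another, buffering each stage's output in memory, whereas a real shell pipe runs them concurrently.

```python
import subprocess


def run_pipeline(commands, input_bytes=b""):
    """Run each argv list in order, feeding one stage's stdout to the next stage's stdin."""
    data = input_bytes
    for argv in commands:
        # check=True raises CalledProcessError if a stage exits with a non-zero status.
        completed = subprocess.run(argv, input=data, stdout=subprocess.PIPE, check=True)
        data = completed.stdout
    return data


# Example call, assuming swecc-email-scraper is installed and inbox.mbox exists:
# output = run_pipeline([
#     ["swecc-email-scraper", "read", "inbox.mbox"],
#     ["swecc-email-scraper", "stats"],
#     ["swecc-email-scraper", "format", "-f", "json"],
# ])
```

For very large mailboxes, buffering every stage in memory may be wasteful; chaining `subprocess.Popen` objects with connected pipes would stream the data instead.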

License

MIT License - See LICENSE file for details.

Acknowledgments

Developed as part of SWECC Labs at the University of Washington.
