SWECC Email Scraper
A Python CLI tool for analyzing email data in mbox format.
Features
- 📧 Process mbox format email archives
- 🔧 Unix-style pipeline architecture for flexible processing
- 📊 Extendable framework for building analysis pipelines
- Coming soon: More analysis processors...
Installation
From PyPI
pip install swecc-email-scraper
From Source
git clone https://github.com/swecc-uw/swecc-email-scraper.git
cd swecc-email-scraper
pip install -e ".[dev]" # Install with development dependencies
# Run tests
pytest
Quick Start
The tool uses Unix pipes to compose commands. Each command does one thing and can be combined with others:
- Basic usage: get email stats with the example processor:
swecc-email-scraper read mailbox.mbox \
| swecc-email-scraper stats \
| swecc-email-scraper format -f json > results.json
- List available processors:
swecc-email-scraper list-processors
- List available output formats:
swecc-email-scraper list-formats
Command Reference
Read Command
Reads an mbox file and outputs email data as JSON:
swecc-email-scraper read input.mbox > emails.json
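To make the mbox-to-JSON step concrete, here is a minimal sketch built on Python's standard `mailbox` module. It is not the tool's actual implementation: the function name `mbox_to_records` and the `subject` and `date` fields are assumptions for illustration. Only the `sender` field and the list-of-objects shape are suggested by the `jq` example in this README.

```python
import json
import mailbox
import os
import tempfile


def mbox_to_records(path):
    """Return one dict per message in the mbox file at `path`."""
    box = mailbox.mbox(path, create=False)
    try:
        return [
            {
                # Field names are illustrative, not the tool's documented schema.
                "sender": message.get("From", ""),
                "subject": message.get("Subject", ""),
                "date": message.get("Date", ""),
            }
            for message in box
        ]
    finally:
        box.close()


# Build a tiny throwaway mbox so the sketch runs without real data.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.mbox")
    sample_box = mailbox.mbox(sample_path)
    msg = mailbox.mboxMessage()
    msg["From"] = "alice@example.com"
    msg["Subject"] = "Hello"
    msg["Date"] = "Mon, 01 Jan 2024 10:00:00 +0000"
    msg.set_payload("Hi there\n")
    sample_box.add(msg)
    sample_box.close()  # close() flushes pending messages to disk

    records = mbox_to_records(sample_path)

print(json.dumps(records, indent=2))
```

Emitting a plain JSON list keeps the output easy to pipe into tools such as `jq` or into a later processing step.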
Stats Command
Processes email data from stdin and outputs statistics:
cat emails.json | swecc-email-scraper stats > stats.json
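As a rough illustration of what a statistics step might compute, the sketch below reads a JSON list of emails from a text stream and summarizes it. The function `summarize` and the output keys (`total_emails`, `unique_senders`, `top_senders`) are hypothetical; the real `stats` command may report different fields.

```python
import io
import json
from collections import Counter


def summarize(emails):
    """Compute simple counts over a list of email dicts (assumed schema)."""
    senders = Counter(email.get("sender", "") for email in emails)
    return {
        "total_emails": len(emails),
        "unique_senders": len(senders),
        # Up to three most frequent senders as (sender, count) pairs.
        "top_senders": senders.most_common(3),
    }


# io.StringIO stands in for stdin so the sketch runs on its own.
fake_stdin = io.StringIO(json.dumps([
    {"sender": "alice@example.com"},
    {"sender": "bob@example.com"},
    {"sender": "alice@example.com"},
]))

stats = summarize(json.load(fake_stdin))
print(json.dumps(stats, indent=2))
```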
Format Command
Formats JSON data using the specified formatter:
cat stats.json \
| swecc-email-scraper format -f json \
> formatted.json
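One plausible way to support a selectable formatter is a lookup table from format name to function. The sketch below assumes that design; the names `FORMATTERS` and `render` and the `text` format are invented for illustration, and only `json` appears as a format in this README.

```python
import json


def format_json(data):
    """Serialize with stable key order and indentation."""
    return json.dumps(data, indent=2, sort_keys=True)


def format_text(data):
    """Hypothetical plain-text formatter: one `key: value` line per entry."""
    return "\n".join(f"{key}: {value}" for key, value in sorted(data.items()))


# Maps the value passed to a `-f`-style option to a formatter function.
FORMATTERS = {"json": format_json, "text": format_text}


def render(data, name="json"):
    """Format `data` with the named formatter, rejecting unknown names."""
    if name not in FORMATTERS:
        raise ValueError(f"unknown format {name!r}; choose from {sorted(FORMATTERS)}")
    return FORMATTERS[name](data)


print(render({"total_emails": 3, "unique_senders": 2}, "text"))
```

With a table like this, listing the available formats amounts to printing the sorted keys.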
Pipeline Examples
- Basic email statistics to terminal:
swecc-email-scraper read inbox.mbox \
| swecc-email-scraper stats \
| swecc-email-scraper format
- Save analysis to a file:
swecc-email-scraper read inbox.mbox \
| swecc-email-scraper stats \
> analysis.json
- Process with custom formatting:
swecc-email-scraper read inbox.mbox \
| swecc-email-scraper stats \
| swecc-email-scraper format -f json \
> analysis.json
- Use with Unix tools:
# Filter emails before analysis
swecc-email-scraper read inbox.mbox \
| jq 'map(select(.sender | contains("important")))' \
| swecc-email-scraper stats
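For readers less familiar with `jq`, the filter above keeps only the emails whose `sender` string contains the substring `important`. A rough Python equivalent, assuming the same list-of-objects input, looks like this (the helper name `select_sender_contains` is made up for the example):

```python
import json


def select_sender_contains(emails, needle):
    """Keep emails whose sender contains `needle` as a substring."""
    # Messages without a sender are treated as non-matching in this sketch.
    return [email for email in emails if needle in (email.get("sender") or "")]


emails = [
    {"sender": "important-updates@example.com", "subject": "Release"},
    {"sender": "newsletter@example.com", "subject": "Weekly digest"},
    {"subject": "No sender recorded"},
]

filtered = select_sender_contains(emails, "important")
print(json.dumps(filtered, indent=2))
```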
Extending the Tool
The tool is designed to be easily extensible. See CONTRIBUTING.md for detailed information on:
- Creating custom processors
- Adding new output formats
- Contributing to the project
- Development setup and guidelines
Architecture
The tool uses a Unix pipeline architecture where:
- `read` command converts mbox files to JSON email data
- Processor commands (like `stats`) transform or analyze the data
- `format` command handles output formatting
- Standard Unix pipes (`|`) connect the components
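The pipeline idea can be mimicked in-process: if each stage takes text and returns text, stages compose in the same way as commands joined by `|`. This is a conceptual sketch only; `stats_stage`, `format_stage`, and `pipeline` are hypothetical names and do not reflect the tool's internals.

```python
import json
from functools import reduce


def stats_stage(text):
    """Processor stage: JSON email list in, JSON statistics out."""
    emails = json.loads(text)
    return json.dumps({"total_emails": len(emails)})


def format_stage(text):
    """Format stage: re-serialize JSON with a stable, readable layout."""
    return json.dumps(json.loads(text), indent=2, sort_keys=True)


def pipeline(text, *stages):
    """Feed `text` through each stage in order, like `a | b | c` in a shell."""
    return reduce(lambda data, stage: stage(data), stages, text)


# Stand-in for what a read step would emit; the real schema may differ.
emails_json = json.dumps([
    {"sender": "alice@example.com"},
    {"sender": "bob@example.com"},
])

result = pipeline(emails_json, stats_stage, format_stage)
print(result)
```

Because every stage shares the same text-in, text-out contract, a new processor can be slotted in without touching the others.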
License
MIT License - See LICENSE file for details.
Acknowledgments
Developed as part of SWECC Labs at the University of Washington.
Download files
File details
Details for the file swecc_email_scraper-0.1.2.tar.gz.
File metadata
- Download URL: swecc_email_scraper-0.1.2.tar.gz
- Upload date:
- Size: 9.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e299c61d9c681f8f4d65131659be60739bc9664df33a2dd5aad7e53cb213c897 |
| MD5 | 3a72b67c74db6b873c74a70602473615 |
| BLAKE2b-256 | 5320f44cffaeb2f7796b0ac8fae0fa8a37767c8998fa096cba484e906d943ad1 |
File details
Details for the file swecc_email_scraper-0.1.2-py3-none-any.whl.
File metadata
- Download URL: swecc_email_scraper-0.1.2-py3-none-any.whl
- Upload date:
- Size: 8.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 72c1b86abfedda1a5e85fc90f3c32c30dcb3eae873793d4100687e0bd46adba1 |
| MD5 | 8ac725c15b80a039137d1660d7b30e16 |
| BLAKE2b-256 | ef3612122affcece4df15d5031c70e67731ab91d21b45c7d2b342ed216f37181 |