# email-harvest

A fast, zero-config CLI tool to extract email addresses from text, files, and web pages.
## Features
- Extract from anywhere — text strings, local files, URLs, or entire websites
- Recursive crawling — follow links within a domain up to N levels deep
- Email deobfuscation — decodes `[at]`, `(at)`, HTML entities (`&#64;`), URL encoding (`%40`), and more
- Smart filtering — auto-removes false positives (image filenames, placeholders, test domains)
- Domain allow/block lists — include or exclude specific email domains
- Source tracking — see which URL or file each email came from
- Multiple output formats — plain text, CSV, JSON, JSONL
- Polite crawling — rate limiting and robots.txt compliance built-in
- Proxy support — route requests through HTTP/HTTPS proxies
- Pipe-friendly — works with stdin/stdout for shell pipelines
- Python API — use as a library in your own scripts
- Minimal dependencies — only `requests` + `beautifulsoup4`
## Installation

```bash
pip install email-harvest
```

Requires Python 3.8+.
## Quick Start

Extract emails from a web page:

```bash
emx https://example.com/contact
```

Extract from a local file:

```bash
emx contacts.html
```

Extract from multiple files using glob patterns:

```bash
emx "pages/*.html" --format json
```

Pipe text through stdin:

```bash
curl -s https://example.com | emx -
```

Crawl an entire website (depth 2, max 100 pages):

```bash
emx https://example.com --depth 2 --max-pages 100 -v
```
## CLI Reference

```bash
emx [OPTIONS] SOURCE...
```

Sources are auto-detected:

- `https://...` or `http://...` → URL mode
- `-` → read from stdin
- `*.html` → glob pattern
- anything else → file path
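Detection along these lines takes only a few branches. The sketch below is illustrative only; `detect_source_type` is a hypothetical helper, not part of the email-harvest API:

```python
# Hypothetical sketch of source auto-detection; not the actual
# email-harvest internals.
def detect_source_type(source: str) -> str:
    if source == "-":
        return "stdin"
    if source.startswith(("http://", "https://")):
        return "url"
    if any(ch in source for ch in "*?["):  # shell-style glob characters
        return "glob"
    return "file"

print(detect_source_type("https://example.com"))  # url
print(detect_source_type("pages/*.html"))         # glob
print(detect_source_type("contacts.html"))        # file
```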
### Crawling Options

| Flag | Default | Description |
|---|---|---|
| `--depth N` | `0` | Recursion depth for URLs (0 = single page) |
| `--max-pages N` | `50` | Maximum pages to crawl per URL |
| `--rate-limit SEC` | `1.0` | Seconds between requests |
| `--ignore-robots` | off | Ignore robots.txt restrictions |
| `--proxy URL` | — | HTTP/HTTPS proxy |
| `--timeout SEC` | `10` | Request timeout |
| `--user-agent STR` | `EmailHarvest/1.0` | Custom User-Agent |
### Filtering Options

| Flag | Description |
|---|---|
| `--include-domain DOMAIN` | Only keep emails from this domain (repeatable) |
| `--exclude-domain DOMAIN` | Exclude emails from this domain (repeatable) |
| `--no-deobfuscate` | Disable email deobfuscation |
### Output Options

| Flag | Default | Description |
|---|---|---|
| `--format, -f` | `plain` | Output format: `plain`, `csv`, `json`, `jsonl` |
| `--output, -o FILE` | stdout | Write to file |
| `--with-source` | off | Show source URL/file for each email |
| `--sort` | off | Sort alphabetically |
| `--count` | off | Print only the count |
| `-v, --verbose` | off | Show progress on stderr |
| `-q, --quiet` | off | Suppress all non-output messages |
## Output Examples

Plain (default):

```
alice@company.com
bob@company.com
```

JSON:

```bash
emx page.html --format json
```

```json
[
  "alice@company.com",
  "bob@company.com"
]
```

CSV with source tracking:

```bash
emx https://example.com --format csv --with-source
```

```
email,source,source_type
alice@company.com,https://example.com/contact,url
bob@company.com,https://example.com/about,url
```

JSONL (streaming):

```bash
emx page.html --format jsonl
```

```
{"email": "alice@company.com"}
{"email": "bob@company.com"}
```
## Python API

```python
from email_extractor import EmailExtractor

ex = EmailExtractor()

# From text
emails = ex.from_text("Contact us at hello@company.org")
# ['hello@company.org']

# From file
emails = ex.from_file("contacts.html")

# From HTML with mailto: detection
emails = ex.from_html("<a href='mailto:hi@co.org'>Email us</a>")

# From glob pattern
emails = ex.from_glob("pages/**/*.html")

# With source tracking
results = ex.extract("hi@co.org", source="input.txt", source_type="file")
for r in results:
    print(f"{r.email} from {r.source}")

# With domain filtering
ex = EmailExtractor(
    include_domains=["company.org"],
    exclude_domains=["spam.org"],
)

# Disable deobfuscation
ex = EmailExtractor(use_deobfuscate=False)
```
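Putting those calls together, a small script can scan a directory tree and print a sorted, deduplicated list. A sketch using only the methods shown above (the glob path and domains are examples):

```python
from email_extractor import EmailExtractor

# Keep only company.org addresses and drop a known junk domain.
ex = EmailExtractor(
    include_domains=["company.org"],
    exclude_domains=["spam.org"],
)

# Scan every HTML file under pages/, then deduplicate and sort.
for email in sorted(set(ex.from_glob("pages/**/*.html"))):
    print(email)
```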
## Web Crawling API

```python
from email_extractor import EmailExtractor
from email_extractor.crawler import WebCrawler

ex = EmailExtractor()
crawler = WebCrawler(
    extractor=ex,
    rate_limit=1.0,
    respect_robots=True,
)

# Single page
results = crawler.extract_url("https://example.com/contact")

# Recursive crawl
results = crawler.crawl("https://example.com", max_depth=2, max_pages=100)
for r in results:
    print(f"{r.email} (found on {r.source})")
```
## Email Deobfuscation

email-harvest automatically detects and decodes common email obfuscation patterns:

| Obfuscated | Decoded |
|---|---|
| `user [at] domain [dot] com` | `user@domain.com` |
| `user (at) domain (dot) com` | `user@domain.com` |
| `user {at} domain {dot} com` | `user@domain.com` |
| `user AT domain DOT com` | `user@domain.com` |
| `user @ domain . com` | `user@domain.com` |
| `user%40domain.com` | `user@domain.com` |

Disable with `--no-deobfuscate` (CLI) or `use_deobfuscate=False` (API).
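To illustrate the general technique (this is not the library's actual decoder), obfuscations like those above can be normalized with a pair of regex substitutions before a standard email regex runs:

```python
import re
from urllib.parse import unquote

def deobfuscate(text: str) -> str:
    """Normalize common email obfuscations; illustrative sketch only."""
    text = unquote(text)  # decode URL encoding, e.g. %40 -> @
    # [at], (at), {at}, a standalone AT, or a spaced-out @
    text = re.sub(r"\s*(?:[\[({]\s*at\s*[\])}]|\bat\b|@)\s*", "@",
                  text, flags=re.IGNORECASE)
    # [dot], (dot), {dot}, a standalone DOT, or a spaced-out .
    text = re.sub(r"\s*(?:[\[({]\s*dot\s*[\])}]|\bdot\b|\.)\s*", ".",
                  text, flags=re.IGNORECASE)
    return text

print(deobfuscate("user [at] domain [dot] com"))  # user@domain.com
print(deobfuscate("user%40domain.com"))           # user@domain.com
```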
## Comparison
| Feature | email-harvest | extract-emails | emailfinder |
|---|---|---|---|
| CLI tool | Yes | No | Yes |
| Python API | Yes | Yes | Yes |
| URL extraction | Yes | Yes | Yes |
| Recursive crawling | Yes | No | No |
| Email deobfuscation | Yes | No | No |
| False positive filtering | Yes | No | No |
| Source tracking | Yes | No | No |
| JSON/CSV/JSONL output | Yes | No | No |
| robots.txt compliance | Yes | No | No |
| Rate limiting | Yes | No | No |
| Proxy support | Yes | No | No |
| stdin pipe support | Yes | No | No |
| Zero config | Yes | No | No |
## Development

```bash
git clone https://github.com/thunderbit/email-harvest.git
cd email-harvest
pip install -e ".[dev]"
pytest
```
## Also by Thunderbit
- phone-harvest — Extract and identify phone numbers from text, files, and web pages
- image-harvest — Discover and batch download images from web pages
- product-harvest — Extract structured product data using Schema.org
## License
Built by Thunderbit — AI-powered web scraper and data extraction tools.