Project description

email-harvest

A fast, zero-config CLI tool to extract email addresses from text, files, and web pages.

Features

  • Extract from anywhere — text strings, local files, URLs, or entire websites
  • Recursive crawling — follow links within a domain up to N levels deep
  • Email deobfuscation — decodes [at], (at), HTML entities (&#64;), URL encoding (%40), and more
  • Smart filtering — auto-removes false positives (image filenames, placeholders, test domains)
  • Domain allow/block lists — include or exclude specific email domains
  • Source tracking — see which URL or file each email came from
  • Multiple output formats — plain text, CSV, JSON, JSONL
  • Polite crawling — rate limiting and robots.txt compliance built-in
  • Proxy support — route requests through HTTP/HTTPS proxies
  • Pipe-friendly — works with stdin/stdout for shell pipelines
  • Python API — use as a library in your own scripts
  • Minimal dependencies — only requests + beautifulsoup4

Installation

pip install email-harvest

Requires Python 3.8+.

Quick Start

Extract emails from a web page:

emx https://example.com/contact

Extract from a local file:

emx contacts.html

Extract from multiple files using glob patterns:

emx "pages/*.html" --format json

Pipe text through stdin:

curl -s https://example.com | emx -

Crawl an entire website (depth 2, max 100 pages):

emx https://example.com --depth 2 --max-pages 100 -v

CLI Reference

emx [OPTIONS] SOURCE...

Sources are auto-detected:

  • https://... or http://... → URL mode
  • - → read from stdin
  • *.html → glob pattern
  • anything else → file path
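
Because SOURCE... accepts any number of arguments, the different source types can be mixed in a single invocation. The URL and file names below are illustrative:

emx https://example.com/contact contacts.html "pages/*.html" --format json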

Crawling Options

Flag                Default            Description
--depth N           0                  Recursion depth for URLs (0 = single page)
--max-pages N       50                 Maximum pages to crawl per URL
--rate-limit SEC    1.0                Seconds between requests
--ignore-robots     off                Ignore robots.txt restrictions
--proxy URL                            HTTP/HTTPS proxy to route requests through
--timeout SEC       10                 Request timeout in seconds
--user-agent STR    EmailHarvest/1.0   Custom User-Agent string
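
These flags combine freely. The command below, which uses an arbitrary local proxy address for illustration, crawls two levels deep, limits the crawl to 200 pages, throttles requests to one every two seconds, and routes traffic through the proxy:

emx https://example.com --depth 2 --max-pages 200 --rate-limit 2.0 --proxy http://127.0.0.1:8080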

Filtering Options

Flag                       Description
--include-domain DOMAIN    Only keep emails from this domain (repeatable)
--exclude-domain DOMAIN    Exclude emails from this domain (repeatable)
--no-deobfuscate           Disable email deobfuscation
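
For example, to keep only addresses from two company domains while dropping a throwaway domain (the domain names are placeholders):

emx "pages/*.html" --include-domain company.com --include-domain company.org --exclude-domain mailinator.com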

Output Options

Flag                 Default   Description
--format, -f         plain     Output format: plain, csv, json, jsonl
--output, -o FILE    stdout    Write output to a file
--with-source        off       Show the source URL/file for each email
--sort               off       Sort emails alphabetically
--count              off       Print only the count
-v, --verbose        off       Show progress on stderr
-q, --quiet          off       Suppress all non-output messages
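
For instance, to write a sorted, source-annotated CSV to a file while printing progress to stderr (the output file name is illustrative):

emx https://example.com --format csv --with-source --sort -o emails.csv -v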

Output Examples

Plain (default):

alice@company.com
bob@company.com

JSON:

emx page.html --format json
[
  "alice@company.com",
  "bob@company.com"
]

CSV with source tracking:

emx https://example.com --format csv --with-source
email,source,source_type
alice@company.com,https://example.com/contact,url
bob@company.com,https://example.com/about,url

JSONL (streaming):

emx page.html --format jsonl
{"email": "alice@company.com"}
{"email": "bob@company.com"}

Python API

from email_extractor import EmailExtractor

ex = EmailExtractor()

# From text
emails = ex.from_text("Contact us at hello@company.org")
# ['hello@company.org']

# From file
emails = ex.from_file("contacts.html")

# From HTML with mailto: detection
emails = ex.from_html("<a href='mailto:hi@co.org'>Email us</a>")

# From glob pattern
emails = ex.from_glob("pages/**/*.html")

# With source tracking
results = ex.extract("hi@co.org", source="input.txt", source_type="file")
for r in results:
    print(f"{r.email} from {r.source}")

# With domain filtering
ex = EmailExtractor(
    include_domains=["company.org"],
    exclude_domains=["spam.org"],
)

# Disable deobfuscation
ex = EmailExtractor(use_deobfuscate=False)

Web Crawling API

from email_extractor import EmailExtractor
from email_extractor.crawler import WebCrawler

ex = EmailExtractor()
crawler = WebCrawler(
    extractor=ex,
    rate_limit=1.0,
    respect_robots=True,
)

# Single page
results = crawler.extract_url("https://example.com/contact")

# Recursive crawl
results = crawler.crawl("https://example.com", max_depth=2, max_pages=100)

for r in results:
    print(f"{r.email} (found on {r.source})")

Email Deobfuscation

email-harvest automatically detects and decodes common email obfuscation patterns:

Obfuscated                      Decoded
user [at] domain [dot] com      user@domain.com
user (at) domain (dot) com      user@domain.com
user {at} domain {dot} com      user@domain.com
user AT domain DOT com          user@domain.com
user &#64; domain &#46; com     user@domain.com
user%40domain.com               user@domain.com
user @ domain . com             user@domain.com

Disable with --no-deobfuscate (CLI) or use_deobfuscate=False (API).
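
As a quick illustration with the Python API from above (the input string is made up, and the ordering of the returned list is not guaranteed here):

from email_extractor import EmailExtractor

ex = EmailExtractor()
emails = ex.from_text("Reach us at alice [at] company [dot] com or bob&#64;company.com")
# Expected to contain both 'alice@company.com' and 'bob@company.com'
print(emails)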

Comparison

Feature                     email-harvest   extract-emails   emailfinder
CLI tool                    Yes             No               Yes
Python API                  Yes             Yes              Yes
URL extraction              Yes             Yes              Yes
Recursive crawling          Yes             No               No
Email deobfuscation         Yes             No               No
False positive filtering    Yes             No               No
Source tracking             Yes             No               No
JSON/CSV/JSONL output       Yes             No               No
robots.txt compliance       Yes             No               No
Rate limiting               Yes             No               No
Proxy support               Yes             No               No
stdin pipe support          Yes             No               No
Zero config                 Yes             No               No

Development

git clone https://github.com/thunderbit/email-harvest.git
cd email-harvest
pip install -e ".[dev]"
pytest

Also by Thunderbit

  • phone-harvest — Extract and identify phone numbers from text, files, and web pages
  • image-harvest — Discover and batch download images from web pages
  • product-harvest — Extract structured product data using Schema.org

License

MIT


Built by Thunderbit — AI-powered web scraper and data extraction tools.

Download files

Download the file for your platform.

Source Distribution

email_harvest-0.1.0.tar.gz (16.5 kB)

Built Distribution

email_harvest-0.1.0-py3-none-any.whl (16.6 kB)

File details

Details for the file email_harvest-0.1.0.tar.gz.

File metadata

  • Download URL: email_harvest-0.1.0.tar.gz
  • Upload date:
  • Size: 16.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for email_harvest-0.1.0.tar.gz
Algorithm Hash digest
SHA256 8d140d86bfa608f095b091973c222473a3e41911e39f91737853b6df01e7000e
MD5 bb02afed305eb2d6f939853d27f1e227
BLAKE2b-256 b70141288e9cee9aaf96141a520c73e2834afe8d4e8ad5918154fa2daf909147

File details

Details for the file email_harvest-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: email_harvest-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 16.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for email_harvest-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 0f95a6d5ef6ffbfcfb72d045b665d5dbb1d991784972686958d7dea4c121098d
MD5 47872ef3810fa317c0cf3f4a7e85bfc1
BLAKE2b-256 584a401094982ed2a45e93a5e5b0f9a790b1a8c4e72ea402ab89437356689e0e
