email-harvest

A fast, zero-config CLI tool to extract email addresses from text, files, and web pages.

Features

Extract from anywhere: text strings, local files, URLs, or entire websites
Recursive crawling: follow links within a domain up to N levels deep
Email deobfuscation: decodes [at], (at), HTML entities (&#64;), URL encoding (%40), and more
Smart filtering: auto-removes false positives (image filenames, placeholders, test domains)
Domain allow/block lists: include or exclude specific email domains
Source tracking: see which URL or file each email came from
Multiple output formats: plain text, CSV, JSON, JSONL
Polite crawling: rate limiting and robots.txt compliance built in
Proxy support: route requests through HTTP/HTTPS proxies
Pipe-friendly: works with stdin/stdout for shell pipelines
Python API: use as a library in your own scripts
Minimal dependencies: only requests + beautifulsoup4
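
To make the smart filtering idea concrete, here is a minimal standard-library sketch of how false positives such as image filenames and placeholder domains might be dropped. It is not email-harvest's actual implementation: the regex, the suffix list, the placeholder domains, and the names `find_emails` and `looks_like_false_positive` are all assumptions made for illustration.

```python
import re
from typing import List

# Deliberately simple pattern (an assumption); real address grammar is more permissive.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Illustrative lists only, not the tool's real ones.
IMAGE_SUFFIXES = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp")
PLACEHOLDER_DOMAINS = {"example.com", "example.org", "test.com", "domain.com"}


def looks_like_false_positive(candidate: str) -> bool:
    """Return True for matches such as 'logo@2x.png' or 'user@example.com'."""
    lowered = candidate.lower()
    if lowered.endswith(IMAGE_SUFFIXES):
        return True
    domain = lowered.rpartition("@")[2]
    return domain in PLACEHOLDER_DOMAINS


def find_emails(text: str) -> List[str]:
    """Extract unique, lowercased addresses in order of first appearance."""
    found: List[str] = []
    for match in EMAIL_RE.findall(text):
        email = match.lower()
        if email not in found and not looks_like_false_positive(email):
            found.append(email)
    return found


sample = "Mail alice@company.com, see logo@2x.png or user@example.com. ALICE@company.com"
print(find_emails(sample))
```

The image-suffix check matters because retina asset names like `logo@2x.png` match a naive email regex.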

Installation

pip install email-harvest

Requires Python 3.8+.

Quick Start

Extract emails from a web page:

emx https://example.com/contact

Extract from a local file:

emx contacts.html

Extract from multiple files using glob patterns:

emx "pages/*.html" --format json

Pipe text through stdin:

curl -s https://example.com | emx -

Crawl an entire website (depth 2, max 100 pages):

emx https://example.com --depth 2 --max-pages 100 -v

CLI Reference

emx [OPTIONS] SOURCE...

Sources are auto-detected:

https://... or http://... → URL mode
- → read from stdin
*.html → glob pattern
anything else → file path
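
The detection rules above can be sketched as a small classifier. This is an illustrative reading of the list, not the tool's code: the function name `classify_source`, the returned labels, and the choice of `*`, `?`, and `[` as glob markers are assumptions.

```python
def classify_source(source: str) -> str:
    """Map a SOURCE argument to one of: 'stdin', 'url', 'glob', 'file'."""
    if source == "-":
        return "stdin"
    if source.startswith(("http://", "https://")):
        return "url"
    # Assumed glob markers; the real tool may detect patterns differently.
    if any(ch in source for ch in "*?["):
        return "glob"
    return "file"


for arg in ("https://example.com/contact", "-", "pages/*.html", "contacts.html"):
    print(arg, "->", classify_source(arg))
```

Quoting glob patterns on the command line, as in the Quick Start, keeps the shell from expanding them first.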

Crawling Options

Flag	Default	Description
`--depth N`	`0`	Recursion depth for URLs (0 = single page)
`--max-pages N`	`50`	Maximum pages to crawl per URL
`--rate-limit SEC`	`1.0`	Seconds between requests
`--ignore-robots`	off	Ignore robots.txt restrictions
`--proxy URL`	—	HTTP/HTTPS proxy
`--timeout SEC`	`10`	Request timeout
`--user-agent STR`	`EmailHarvest/1.0`	Custom User-Agent
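
For readers curious how `--rate-limit` and robots.txt compliance fit together, the following sketch uses only the standard library. It is an assumption about the general technique, not the crawler's real code: `RateLimiter` and `build_robots` are hypothetical helpers, and the robots.txt content is supplied inline so nothing touches the network.

```python
import time
from typing import Optional
from urllib.robotparser import RobotFileParser


class RateLimiter:
    """Enforce a minimum delay between consecutive calls to wait()."""

    def __init__(self, min_interval: float) -> None:
        self.min_interval = min_interval
        self._last: Optional[float] = None  # monotonic time of the previous request

    def wait(self) -> float:
        """Sleep if needed; return the number of seconds slept."""
        delay = 0.0
        if self._last is not None:
            elapsed = time.monotonic() - self._last
            delay = max(0.0, self.min_interval - elapsed)
        if delay:
            time.sleep(delay)
        self._last = time.monotonic()
        return delay


def build_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


robots = build_robots("User-agent: *\nDisallow: /private/\n")
limiter = RateLimiter(min_interval=0.05)  # the CLI default shown above is 1.0 seconds

for url in ("https://example.com/contact", "https://example.com/private/staff"):
    if not robots.can_fetch("EmailHarvest/1.0", url):
        print("skipped by robots.txt:", url)
        continue
    limiter.wait()
    print("would fetch:", url)
```

Checking robots.txt before waiting means disallowed URLs do not consume a rate-limit slot.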

Filtering Options

Flag	Description
`--include-domain DOMAIN`	Only keep emails from this domain (repeatable)
`--exclude-domain DOMAIN`	Exclude emails from this domain (repeatable)
`--no-deobfuscate`	Disable email deobfuscation
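
The include/exclude semantics can be illustrated with a short sketch. It assumes exact, case-insensitive matching on the part after `@`; whether email-harvest also matches subdomains is not stated here, and `filter_by_domain` is a hypothetical name.

```python
from typing import Iterable, List, Optional


def filter_by_domain(
    emails: Iterable[str],
    include: Optional[Iterable[str]] = None,
    exclude: Optional[Iterable[str]] = None,
) -> List[str]:
    """Keep emails whose domain is allowed and not blocked."""
    include_set = {d.lower() for d in include or ()}
    exclude_set = {d.lower() for d in exclude or ()}
    kept = []
    for email in emails:
        domain = email.rpartition("@")[2].lower()
        # An empty include list means "allow every domain".
        if include_set and domain not in include_set:
            continue
        if domain in exclude_set:
            continue
        kept.append(email)
    return kept


emails = ["alice@company.org", "bob@spam.org", "carol@Other.net"]
print(filter_by_domain(emails, include=["company.org", "other.net"]))
print(filter_by_domain(emails, exclude=["spam.org"]))
```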

Output Options

Flag	Default	Description
`--format, -f`	`plain`	Output format: `plain`, `csv`, `json`, `jsonl`
`--output, -o FILE`	stdout	Write to file
`--with-source`	off	Show source URL/file for each email
`--sort`	off	Sort alphabetically
`--count`	off	Print only the count
`-v, --verbose`	off	Show progress on stderr
`-q, --quiet`	off	Suppress all non-output messages

Output Examples

Plain (default):

alice@company.com
bob@company.com

JSON:

emx page.html --format json

[
  "alice@company.com",
  "bob@company.com"
]

CSV with source tracking:

emx https://example.com --format csv --with-source

email,source,source_type
alice@company.com,https://example.com/contact,url
bob@company.com,https://example.com/about,url

JSONL (streaming):

emx page.html --format jsonl

{"email": "alice@company.com"}
{"email": "bob@company.com"}
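
The shapes shown above are easy to reproduce or consume with the standard library. The sketch below renders the same sample data in each format; `render` and `render_csv` are hypothetical helpers written for this illustration, not functions exported by the package.

```python
import csv
import io
import json
from typing import Iterable, List, Tuple


def render(emails: List[str], fmt: str = "plain") -> str:
    """Render a list of addresses as plain text, JSON, or JSONL."""
    if fmt == "plain":
        return "\n".join(emails)
    if fmt == "json":
        return json.dumps(emails, indent=2)
    if fmt == "jsonl":
        # One JSON object per line, so consumers can stream the output.
        return "\n".join(json.dumps({"email": e}) for e in emails)
    raise ValueError(f"unsupported format: {fmt}")


def render_csv(rows: Iterable[Tuple[str, str, str]]) -> str:
    """Render (email, source, source_type) rows with the header shown above."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["email", "source", "source_type"])
    writer.writerows(rows)
    return buf.getvalue()


emails = ["alice@company.com", "bob@company.com"]
print(render(emails, "json"))
print(render(emails, "jsonl"))
print(render_csv([
    ("alice@company.com", "https://example.com/contact", "url"),
    ("bob@company.com", "https://example.com/about", "url"),
]))
```

JSONL suits long crawls because each line is a complete record and can be processed before the run finishes.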

Python API

from email_extractor import EmailExtractor

ex = EmailExtractor()

# From text
emails = ex.from_text("Contact us at hello@company.org")
# ['hello@company.org']

# From file
emails = ex.from_file("contacts.html")

# From HTML with mailto: detection
emails = ex.from_html("<a href='mailto:hi@co.org'>Email us</a>")

# From glob pattern
emails = ex.from_glob("pages/**/*.html")

# With source tracking
results = ex.extract("hi@co.org", source="input.txt", source_type="file")
for r in results:
    print(f"{r.email} from {r.source}")

# With domain filtering
ex = EmailExtractor(
    include_domains=["company.org"],
    exclude_domains=["spam.org"],
)

# Disable deobfuscation
ex = EmailExtractor(use_deobfuscate=False)

Web Crawling API

from email_extractor import EmailExtractor
from email_extractor.crawler import WebCrawler

ex = EmailExtractor()
crawler = WebCrawler(
    extractor=ex,
    rate_limit=1.0,
    respect_robots=True,
)

# Single page
results = crawler.extract_url("https://example.com/contact")

# Recursive crawl
results = crawler.crawl("https://example.com", max_depth=2, max_pages=100)

for r in results:
    print(f"{r.email} (found on {r.source})")
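
To show what `max_depth` and `max_pages` mean in practice, here is a self-contained sketch of a breadth-first, same-host crawl. It is an assumption about the general approach rather than `WebCrawler`'s real internals: the `crawl` function, the `LinkParser` class, and the in-memory `SITE` dictionary (standing in for HTTP requests) are all hypothetical.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable, List
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_url: str, fetch: Callable[[str], str],
          max_depth: int = 0, max_pages: int = 50) -> List[str]:
    """Visit pages breadth-first, staying on start_url's host."""
    host = urlparse(start_url).netloc
    queue = deque([(start_url, 0)])
    seen = {start_url}
    visited: List[str] = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        page = fetch(url)
        visited.append(url)
        if depth >= max_depth:
            continue  # depth 0 means "single page": do not follow links
        parser = LinkParser()
        parser.feed(page)
        for href in parser.links:
            target = urljoin(url, href).split("#")[0]
            if urlparse(target).netloc == host and target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return visited


# In-memory stand-in for a website, so the sketch runs offline.
SITE = {
    "https://example.com/": '<a href="/about">About</a> <a href="https://other.test/">Ext</a>',
    "https://example.com/about": '<a href="/team">Team</a> <a href="/">Home</a>',
    "https://example.com/team": '<a href="/deep">Deep</a>',
    "https://example.com/deep": "",
}

print(crawl("https://example.com/", lambda u: SITE.get(u, ""), max_depth=2))
```

Breadth-first order means that when `max_pages` cuts the crawl short, pages closest to the start URL are the ones that get visited.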

Email Deobfuscation

email-harvest automatically detects and decodes common email obfuscation patterns:

Obfuscated	Decoded
`user [at] domain [dot] com`	`user@domain.com`
`user (at) domain (dot) com`	`user@domain.com`
`user {at} domain {dot} com`	`user@domain.com`
`user AT domain DOT com`	`user@domain.com`
`user &#64; domain &#46; com`	`user@domain.com`
`user%40domain.com`	`user@domain.com`
`user @ domain . com`	`user@domain.com`

Disable with --no-deobfuscate (CLI) or use_deobfuscate=False (API).
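
The patterns in the table can be approximated with a few standard-library calls. The sketch below is one possible approach, not the package's actual decoder: `deobfuscate` and the regexes are assumptions, bare `AT`/`DOT` are matched in uppercase only to avoid rewriting ordinary prose, and the spaced-dot rule can still misfire on text that merely resembles an address.

```python
import html
import re
from urllib.parse import unquote

# Bracketed forms are case-insensitive; bare AT / DOT must be uppercase words.
_AT = re.compile(r"\s*(?:(?i:\[at\]|\(at\)|\{at\})|\bAT\b)\s*")
_DOT = re.compile(r"\s*(?:(?i:\[dot\]|\(dot\)|\{dot\})|\bDOT\b)\s*")
_SPACED_AT = re.compile(r"\s*@\s*")
# Collapse "domain . com" only when it directly follows an @.
_SPACED_DOT = re.compile(r"(@[A-Za-z0-9.-]+)\s+\.\s+(?=[A-Za-z])")


def deobfuscate(text: str) -> str:
    """Rewrite common obfuscations so a plain email regex can match."""
    text = html.unescape(text)  # &#64; -> @, &#46; -> .
    text = unquote(text)        # %40 -> @
    text = _AT.sub("@", text)
    text = _DOT.sub(".", text)
    text = _SPACED_AT.sub("@", text)
    # Repeat so multi-label domains like "domain . co . uk" collapse fully.
    while True:
        text, count = _SPACED_DOT.subn(r"\1.", text)
        if count == 0:
            return text


for raw in (
    "user [at] domain [dot] com",
    "user AT domain DOT com",
    "user &#64; domain &#46; com",
    "user%40domain.com",
    "user @ domain . com",
):
    print(repr(raw), "->", deobfuscate(raw))
```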

Comparison

Feature	email-harvest	extract-emails	emailfinder
CLI tool	Yes	No	Yes
Python API	Yes	Yes	Yes
URL extraction	Yes	Yes	Yes
Recursive crawling	Yes	No	No
Email deobfuscation	Yes	No	No
False positive filtering	Yes	No	No
Source tracking	Yes	No	No
JSON/CSV/JSONL output	Yes	No	No
robots.txt compliance	Yes	No	No
Rate limiting	Yes	No	No
Proxy support	Yes	No	No
stdin pipe support	Yes	No	No
Zero config	Yes	No	No

Development

git clone https://github.com/thunderbit/email-harvest.git
cd email-harvest
pip install -e ".[dev]"
pytest

Also by Thunderbit

phone-harvest — Extract and identify phone numbers from text, files, and web pages
image-harvest — Discover and batch download images from web pages
product-harvest — Extract structured product data using Schema.org

License

MIT

Built by Thunderbit — AI-powered web scraper and data extraction tools.


This version: 0.1.0, released Mar 27, 2026

Download files

Source distribution: email_extraction-0.1.0.tar.gz (14.2 kB), uploaded Mar 27, 2026
Built distribution: email_extraction-0.1.0-py3-none-any.whl (16.6 kB), uploaded Mar 27, 2026, Python 3

