A Python tool for scraping email addresses from websites

Project description

PyMailScraper

PyMailScraper is a powerful and easy-to-use Python tool for scraping email addresses from websites. It can crawl multiple pages per site, throttle its requests, and save the results in a convenient CSV format.

Installation

You can install PyMailScraper using pip:

pip install pymailscraper

Usage

PyMailScraper can be used both as a command-line tool and as a Python library.

Command-line Usage

After installation, you can use PyMailScraper from the command line:

pymailscraper [OPTIONS]

Options:

  • -u, --urls: One or more URLs to scrape. You can provide multiple URLs separated by spaces. Example: pymailscraper -u https://example.com https://another-example.com

  • -f, --file: Path to a file containing URLs, one per line (see the combined example after this list). Example: pymailscraper -f urls.txt

  • -o, --output: Output CSV file path (default: "email_results.csv"). Example: pymailscraper -u https://example.com -o my_results.csv

  • -d, --depth: Maximum depth to crawl (default: 3). Example: pymailscraper -u https://example.com -d 5

  • -p, --pages: Maximum number of pages to crawl per website (default: 100). Example: pymailscraper -u https://example.com -p 50

  • --common-pages-only: Crawl only common pages (default: False). Example: pymailscraper -u https://example.com --common-pages-only

  • --use-common-pages: Use common pages in crawling (default: False). Example: pymailscraper -u https://example.com --use-common-pages

  • --throttle: Delay between requests in seconds (default: 0). Example: pymailscraper -u https://example.com --throttle 1.5

  • --auto-throttle: Automatically adjust the request delay when a server responds with HTTP 429 ('Too Many Requests'). Example: pymailscraper -u https://example.com --auto-throttle

  • --max-throttle: Maximum throttle delay in seconds (default: 5). Example: pymailscraper -u https://example.com --auto-throttle --max-throttle 10
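
For reference, a URL file is plain text with one address per line, for example:

https://example.com
https://another-example.com

The flags above can be combined in a single run; the output path and throttle values here are only illustrative:

pymailscraper -f urls.txt -o combined_results.csv -d 2 -p 50 --throttle 1.0 --auto-throttle --max-throttle 10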

Python Library Usage

You can also use PyMailScraper in your Python scripts:

from pymailscraper import EmailScraper

# Configure the scraper: write results to results.csv, crawl each site at most
# 3 links deep and 100 pages, and wait 1 second between requests, adjusting the
# delay automatically if a server returns 'Too Many Requests'.
scraper = EmailScraper(
    output_file="results.csv",
    max_depth=3,
    max_pages=100,
    throttle=1.0,
    auto_throttle=True
)

# Crawl both sites and write any addresses found to the CSV file.
urls = ["https://example.com", "https://another-example.com"]
scraper.run(urls)
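
If your URLs live in a file rather than a hard-coded list, you can read them yourself and pass the result to run(). This sketch uses only the constructor arguments and run() method shown above; the file and output names are illustrative:

from pymailscraper import EmailScraper

# Read one URL per line, skipping blank lines (the same format the CLI's -f option expects).
with open("urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

scraper = EmailScraper(output_file="file_results.csv", max_depth=3, max_pages=100)
scraper.run(urls)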

Examples

  1. Scrape a single website:

     pymailscraper -u https://example.com

  2. Scrape multiple websites:

     pymailscraper -u https://example.com https://another-example.com

  3. Scrape websites from a file with custom output and depth:

     pymailscraper -f urls.txt -o results.csv -d 5

  4. Use auto-throttling with a maximum of 50 pages per site:

     pymailscraper -u https://example.com --auto-throttle -p 50

Output

PyMailScraper saves the results in a CSV file with the following columns:

  • URL: The page where the email was found
  • Email: The email address
  • Name: Any associated name found (if available)
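
Because the output is plain CSV, it is easy to post-process with the standard library. This snippet assumes the default output path and the column names listed above:

import csv

# Print each scraped address together with the page it was found on.
with open("email_results.csv", newline="") as f:
    for row in csv.DictReader(f):
        print(f"{row['Email']} (found on {row['URL']})")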

Ethical Usage

Please use this tool responsibly. Always respect the website's terms of service, robots.txt files, and any legal restrictions on scraping. Be mindful of the load you're putting on websites and use throttling when appropriate.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License.

Support

If you encounter any problems or have any questions, please open an issue on the GitHub repository.

Happy scraping!

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pymailscraper-0.1.0.tar.gz (7.0 kB, Source)

Built Distribution

pymailscraper-0.1.0-py3-none-any.whl (7.5 kB, Python 3)

File details

Details for the file pymailscraper-0.1.0.tar.gz.

File metadata

  • Download URL: pymailscraper-0.1.0.tar.gz
  • Upload date:
  • Size: 7.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/5.1.0 CPython/3.12.4

File hashes

Hashes for pymailscraper-0.1.0.tar.gz
Algorithm Hash digest
SHA256 a162d7d65f7a096546f813b3c6b372f27eca06aa2291fd2ec29ef5821e26981d
MD5 8a3d364528c0b09b6303db003320e323
BLAKE2b-256 dc0d36a89b4635d1dc3aeca17a0db8b0b576eed793a74b9e964ae2108a765dfd

See more details on using hashes here.

File details

Details for the file pymailscraper-0.1.0-py3-none-any.whl.

File hashes

Hashes for pymailscraper-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 72f0160ea820b718d46f0cf551a024d2c0be50e26c957718423e68082d97934f
MD5 c75e81c5402e099860f748a4946efd99
BLAKE2b-256 1a93962a09e03a8ba0adda9380840c7740a0750e8a8e61157398f64c627b47e5

See more details on using hashes here.
