phone-harvest

A fast CLI tool to extract, validate, and identify phone numbers from text, files, and web pages.

Features

  • Extract from anywhere — text strings, local files, URLs, or entire websites
  • International support — recognizes phone numbers from 200+ countries using Google's libphonenumber
  • Auto-identify — detects country, city/region, carrier, and number type (mobile/landline/toll-free/VoIP)
  • E.164 normalization — outputs all numbers in standardized +15551234567 format
  • Number deobfuscation — handles dot-separated (555.123.4567), fullwidth digits, unicode dashes, HTML entities
  • Smart validation — rejects fake/invalid numbers using Google's phone number database
  • Recursive crawling — follow links within a domain up to N levels deep
  • Polite crawling — rate limiting and robots.txt compliance built-in
  • Country/type filtering — include or exclude by country code or number type
  • Multiple output formats — plain text, CSV, JSON, JSONL with full metadata
  • Proxy support — route requests through HTTP/HTTPS proxies
  • Pipe-friendly — works with stdin/stdout for shell pipelines
  • Python API — use as a library in your own scripts

Installation

pip install phone-harvest

Requires Python 3.8+. Only 3 dependencies: requests, beautifulsoup4, phonenumbers.

Quick Start

Extract phone numbers from a web page:

phonex https://example.com/contact

With full details (country, carrier, type):

phonex https://example.com/contact --detail

Output:

+12125551234  US  FIXED_LINE_OR_MOBILE  New York, NY  https://example.com/contact
+442079460958  GB  FIXED_LINE  London  https://example.com/contact

Extract from a local file:

phonex contacts.html --format json --detail

Output:

[
  {
    "number": "+12125551234",
    "national": "(212) 555-1234",
    "international": "+1 212-555-1234",
    "country_code": "US",
    "country_name": "New York, NY",
    "carrier": "",
    "type": "FIXED_LINE_OR_MOBILE",
    "source": "contacts.html"
  }
]

Pipe text through stdin:

echo "Call +44 20 7946 0958 or +1 212-555-1234" | phonex -
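
To give a rough idea of what extracting numbers from piped text involves, here is a minimal standard-library sketch. It is not phone-harvest's implementation: `find_e164` and its regex are hypothetical, it only picks up numbers already written with a leading `+`, and it checks digit count only (E.164 allows at most 15 digits) instead of validating against libphonenumber metadata.

```python
import re

# Candidate: "+", a digit, then digits or common separators, ending on a digit.
# This pattern is an illustrative assumption, not the tool's own.
_CANDIDATE = re.compile(r"\+\d[\d\s().\-]{5,20}\d")


def find_e164(text):
    """Return E.164-looking numbers found in text, in order, without duplicates."""
    found = []
    for match in _CANDIDATE.finditer(text):
        digits = re.sub(r"\D", "", match.group())
        # E.164 caps numbers at 15 digits; the floor of 7 is a heuristic.
        if 7 <= len(digits) <= 15:
            found.append("+" + digits)
    return list(dict.fromkeys(found))


print(find_e164("Call +44 20 7946 0958 or +1 212-555-1234"))
```

A length check like this accepts many strings that are not real phone numbers, which is why the tool itself relies on the phonenumbers library for validation.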

Crawl an entire website:

phonex https://example.com --depth 2 --max-pages 100 -v

CLI Reference

phonex [OPTIONS] SOURCE...

Region

Flag               Default  Description
--region, -r CODE  US       Default country for local numbers (ISO 3166-1 alpha-2 code)

Crawling Options

Flag              Default  Description
--depth N         0        Recursion depth for URLs (0 = single page)
--max-pages N     50       Maximum pages to crawl per URL
--rate-limit SEC  1.0      Seconds between requests
--ignore-robots   off      Ignore robots.txt restrictions
--proxy URL       (none)   HTTP/HTTPS proxy
--timeout SEC     10       Request timeout in seconds
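
The interaction between `--depth` and `--max-pages` can be pictured as a breadth-first walk that stays on the starting domain. The sketch below is an assumption about the general technique, not the tool's crawler: `plan_crawl`, `get_links`, and the in-memory `SITE` map are hypothetical stand-ins so that the example runs without network access.

```python
from collections import deque
from urllib.parse import urlparse


def plan_crawl(start, get_links, max_depth=0, max_pages=50):
    """Return the URLs a depth-limited, same-domain crawl would visit, in order."""
    domain = urlparse(start).netloc
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # depth 0 means: visit this page, follow nothing
        for link in get_links(url):
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# Hypothetical link graph standing in for real pages.
SITE = {
    "https://example.com/": [
        "https://example.com/contact",
        "https://example.com/about",
        "https://other.example.org/",
    ],
    "https://example.com/contact": ["https://example.com/contact/sales"],
    "https://example.com/about": ["https://example.com/"],
}

print(plan_crawl("https://example.com/", lambda u: SITE.get(u, []), max_depth=2))
```

With `max_depth=0` only the start page is visited, which matches the documented default of a single page.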

Filtering Options

Flag                    Description
--include-country CODE  Only keep numbers from this country (repeatable)
--exclude-country CODE  Exclude numbers from this country (repeatable)
--type TYPE             Only keep numbers of this type: MOBILE, FIXED_LINE, TOLL_FREE, VOIP (repeatable)
--no-deobfuscate        Disable number deobfuscation
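
One way to read the include/exclude semantics is as a simple predicate over each extracted record. The sketch below is illustrative only: `Hit` and `keep` are hypothetical names, and it assumes exact matching on type, so a `FIXED_LINE_OR_MOBILE` number does not pass a `MOBILE` filter. How the tool itself treats that case is not stated here.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical record with just the fields the filters look at."""
    number: str
    country_code: str
    number_type: str


def keep(hit, include_countries=(), exclude_countries=(), include_types=()):
    """Return True if the hit passes all configured filters."""
    if include_countries and hit.country_code not in include_countries:
        return False
    if hit.country_code in exclude_countries:
        return False
    if include_types and hit.number_type not in include_types:
        return False
    return True


hits = [
    Hit("+12125551234", "US", "FIXED_LINE_OR_MOBILE"),
    Hit("+442079460958", "GB", "FIXED_LINE"),
]
print([h.number for h in hits if keep(h, include_countries={"GB"})])
```

An empty include list means "no restriction", so passing no filters keeps every record.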

Output Options

Flag               Default  Description
--format, -f       plain    Output format: plain, csv, json, jsonl
--output, -o FILE  stdout   Write to file
--detail, -d       off      Include country, carrier, type, source
--national         off      Output in national format instead of E.164
--sort             off      Sort alphabetically
--count            off      Print only the count
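
The four output formats differ only in how the same records are serialized. The sketch below shows that idea with the standard library. It makes several assumptions: `render` and `FIELDS` are hypothetical, the field names are borrowed from the JSON example in Quick Start, the CSV column order is arbitrary, and plain output is taken to mean one number per line.

```python
import csv
import io
import json

# Assumed subset of fields, taken from the JSON example in Quick Start.
FIELDS = ["number", "country_code", "type", "source"]


def render(records, fmt="plain"):
    """Serialize a list of record dicts as plain, csv, json, or jsonl text."""
    if fmt == "plain":
        return "\n".join(r["number"] for r in records)
    if fmt == "json":
        return json.dumps(records, indent=2)
    if fmt == "jsonl":
        return "\n".join(json.dumps(r) for r in records)
    if fmt == "csv":
        buf = io.StringIO()
        writer = csv.DictWriter(
            buf, fieldnames=FIELDS, extrasaction="ignore", lineterminator="\n"
        )
        writer.writeheader()
        writer.writerows(records)
        return buf.getvalue().rstrip("\n")
    raise ValueError(f"unknown format: {fmt}")


records = [
    {"number": "+12125551234", "country_code": "US",
     "type": "FIXED_LINE_OR_MOBILE", "source": "contacts.html"},
    {"number": "+442079460958", "country_code": "GB",
     "type": "FIXED_LINE", "source": "contacts.html"},
]
print(render(records, "csv"))
```

JSONL writes one JSON object per line, which makes it convenient for streaming into other tools in a shell pipeline.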

Python API

from phone_extractor import PhoneExtractor

ex = PhoneExtractor(default_region="US")

# From text
results = ex.from_text("Call +1 (212) 555-1234 or +44 20 7946 0958")
for r in results:
    print(f"{r.number} | {r.country_code} | {r.country_name} | {r.number_type}")
    # +12125551234 | US | New York, NY | FIXED_LINE_OR_MOBILE
    # +442079460958 | GB | London | FIXED_LINE

# Just the numbers
numbers = ex.extract_simple("Call +1 212-555-1234")
# ['+12125551234']

# From file
results = ex.from_file("contacts.html")

# From HTML with tel: link detection
results = ex.from_html("<a href='tel:+12125551234'>Call</a>")

# With country filtering
ex = PhoneExtractor(
    default_region="US",
    include_countries=["US", "GB"],
    exclude_countries=["RU"],
    include_types=["MOBILE"],
)

Web Crawling API

from phone_extractor import PhoneExtractor
from phone_extractor.crawler import WebCrawler

ex = PhoneExtractor(default_region="US")
crawler = WebCrawler(ex, rate_limit=1.0, respect_robots=True)

# Single page
results = crawler.extract_url("https://example.com/contact")

# Recursive crawl
results = crawler.crawl("https://example.com", max_depth=2, max_pages=100)

for r in results:
    print(f"{r.number} ({r.country_name}) found on {r.source}")
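
The `rate_limit` and `respect_robots` arguments correspond to two standard politeness checks. The following is a minimal sketch of those checks using only the standard library. `PoliteGate` is a hypothetical helper, not part of phone-harvest, and the robots.txt rules are passed in as lines so the example needs no network access.

```python
import time
from urllib.robotparser import RobotFileParser


class PoliteGate:
    """Hypothetical helper: robots.txt check plus a minimum delay between requests."""

    def __init__(self, robots_lines, user_agent="*", rate_limit=1.0,
                 clock=time.monotonic, sleep=time.sleep):
        self._robots = RobotFileParser()
        self._robots.parse(robots_lines)  # parse rules already fetched elsewhere
        self._agent = user_agent
        self._rate_limit = rate_limit
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def allowed(self, url):
        """Return True if robots.txt permits fetching this URL."""
        return self._robots.can_fetch(self._agent, url)

    def wait(self):
        """Sleep just long enough to keep rate_limit seconds between requests."""
        now = self._clock()
        if self._last is not None:
            remaining = self._rate_limit - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()


gate = PoliteGate(["User-agent: *", "Disallow: /private/"], rate_limit=1.0)
print(gate.allowed("https://example.com/contact"))
print(gate.allowed("https://example.com/private/staff"))
```

Injecting `clock` and `sleep` keeps the delay logic testable without actually pausing.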

Phone Number Deobfuscation

phone-harvest automatically normalizes common obfuscation patterns:

Obfuscated                      Normalized
555.123.4567                    555-123-4567
555 dash 123 dash 4567          555-123-4567
&#43;1 212 555 1234             +1 212 555 1234
%2B1 212 555 1234               +1 212 555 1234
Fullwidth digits (５５５)        555
Unicode dashes (555–123–4567)   555-123-4567
Non-breaking spaces             Regular spaces
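
The patterns in this table can be approximated with a few standard-library calls. This is a sketch under stated assumptions, not the tool's actual deobfuscation code: `deobfuscate` is a hypothetical function, and the order of steps and the regexes are illustrative.

```python
import html
import re
import unicodedata
from urllib.parse import unquote

# Unicode dash variants mapped to the ASCII hyphen.
_DASHES = dict.fromkeys(
    map(ord, "\u2010\u2011\u2012\u2013\u2014\u2015\u2212"), "-"
)


def deobfuscate(text):
    """Normalize common phone number obfuscations before parsing."""
    text = html.unescape(text)                  # &#43; becomes +
    text = unquote(text)                        # %2B becomes +
    text = unicodedata.normalize("NFKC", text)  # fullwidth digits, non-breaking spaces
    text = text.translate(_DASHES)              # en dash and friends become -
    text = re.sub(r"\s+dash\s+", "-", text, flags=re.IGNORECASE)
    # Dots between digits become hyphens; note this also rewrites decimals like 3.14.
    text = re.sub(r"(?<=\d)\.(?=\d)", "-", text)
    return text


print(deobfuscate("555 dash 123 dash 4567"))
```

Running normalization before number parsing keeps the parser itself simple, at the cost of occasionally rewriting text that was never a phone number.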

Comparison

Feature                        phone-harvest   python-phonenumbers  PhoneInfoga  CommonRegex
CLI tool                       Yes             No                   Yes (Go)     No
Python API                     Yes             Yes                  No           Yes
Web page extraction            Yes             No                   No           No
Recursive crawling             Yes             No                   No           No
International support          200+ countries  200+ countries       Yes          US-only
Country detection              Yes             Yes                  Yes          No
Carrier lookup                 Yes             Yes                  Yes          No
Number type (mobile/landline)  Yes             Yes                  Yes          No
E.164 normalization            Yes             Yes                  N/A          No
Number validation              Yes             Yes                  Yes          No
Deobfuscation                  Yes             No                   No           No
robots.txt compliance          Yes             N/A                  N/A          N/A
JSON/CSV output                Yes             No                   JSON         No
stdin pipe support             Yes             No                   No           No
pip install                    Yes             Yes                  No           Yes

Development

git clone https://github.com/thunderbit/phone-harvest.git
cd phone-harvest
pip install -e ".[dev]"
pytest

Also by Thunderbit

  • email-harvest — Extract email addresses from text, files, and web pages
  • image-harvest — Discover and batch download images from web pages
  • product-harvest — Extract structured product data using Schema.org

License

MIT


Built by Thunderbit — AI-powered web scraper and data extraction tools.
