Extract all internal and external links from a URL.

Links-Extractor

License: GPL v3

Extract all internal and external links from a URL in Python.

Description

Links-Extractor fetches one or more web pages and lists the internal and external hyperlinks found on each page. A link is treated as internal when its host matches the host of the page being scanned, and external otherwise. Empty anchors and javascript:, mailto:, and tel: links are ignored.
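
The classification rule above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the function name `classify_links` and the constant `IGNORED_SCHEMES` are hypothetical, and the sketch assumes that relative links are resolved against the page URL and that hosts are compared exactly (so `www.example.com` and `example.com` would count as different hosts).

```python
from urllib.parse import urljoin, urlparse

# Schemes that the description says are ignored.
IGNORED_SCHEMES = ("javascript", "mailto", "tel")


def classify_links(page_url, hrefs):
    """Split raw href values into (internal, external) lists.

    Hypothetical helper: a link is internal when its host equals the
    host of page_url, and external otherwise.
    """
    page_host = urlparse(page_url).netloc.lower()
    internal, external = [], []
    for href in hrefs:
        href = (href or "").strip()
        if not href:
            continue  # skip empty anchors
        if urlparse(href).scheme.lower() in IGNORED_SCHEMES:
            continue  # skip javascript:, mailto:, and tel: links
        # Assumption: relative links are resolved against the page URL.
        absolute = urljoin(page_url, href)
        if urlparse(absolute).netloc.lower() == page_host:
            internal.append(absolute)
        else:
            external.append(absolute)
    return internal, external


internal, external = classify_links(
    "https://example.com/docs/",
    ["/about", "https://www.python.org/", "mailto:me@example.com", ""],
)
print(len(internal), internal)
print(len(external), external)
```

In a real run the href values would come from the parsed HTML of the fetched page (the listed dependencies suggest requests for fetching and beautifulsoup4 with lxml for parsing).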

Install

pip install links-extractor

This installs the links-extractor command. You can also run the script directly from a clone (python3 extractor.py ...).

Requirements

  • Python 3
  • Dependencies: requests, beautifulsoup4, lxml

When running from a clone, install the dependencies with:

pip install -r requirements.txt

Usage

Pass one or more URLs as arguments:

links-extractor https://example.com
python3 extractor.py https://example.com
python3 extractor.py https://example.com https://www.python.org
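
Accepting one or more URLs on the command line might look like the following argparse sketch. It is only an illustration of the "one or more positional arguments" pattern. The names `build_parser` and `parse_urls` are hypothetical, and the real script may handle its arguments differently.

```python
import argparse


def build_parser():
    """Build a parser that requires at least one URL (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        prog="links-extractor",
        description="Extract internal and external links from one or more URLs.",
    )
    # nargs="+" collects one or more positional values into a list.
    parser.add_argument("urls", nargs="+", metavar="URL", help="page URL to scan")
    return parser


def parse_urls(argv=None):
    """Return the list of URLs from argv (defaults to sys.argv[1:])."""
    return build_parser().parse_args(argv).urls


if __name__ == "__main__":
    for url in parse_urls(["https://example.com", "https://www.python.org"]):
        print(url)
```

With `nargs="+"`, running the command without any URL makes argparse print a usage message and exit with an error, which matches the requirement of at least one URL.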

Redirect the output to a file:

python3 extractor.py https://example.com > out.txt

For each URL, the script prints the number of internal links and the links themselves, followed by the number of external links and those links.

A full write-up is available at http://com.puter.tips/2016/12/extract-all-internal-and-external-links.html

You may also find the companion project useful: https://github.com/com-puter-tips/SEO-Analysis

Citation

If you use this software, please cite it using the metadata in CITATION.cff.

License

Distributed under the GNU General Public License v3.0. See LICENSE.

Download files

Source Distribution

links_extractor_cli-1.4.0.tar.gz (15.4 kB)

Uploaded Source

Built Distribution

links_extractor_cli-1.4.0-py3-none-any.whl (15.9 kB)

Uploaded Python 3
