extract and repair links from Requests objects, including redirects and final landing page

Project description

extractlinks

Installation

pip install extractlinks
python3 -m pip install extractlinks

Usage

import requests
from extractlinks import ExtractLinks
URL = "http://cnn.com/"
r = requests.get(URL, allow_redirects=True)
e = ExtractLinks(content=r)
print(e.json)
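
The "redirects and final landing page" in the description correspond to what Requests exposes as r.history (the intermediate responses) followed by r itself. The sketch below illustrates that ordering only; it is not extractlinks' own code. The redirect_chain helper and its "position" key are hypothetical names, and stand-in objects replace real responses so the example runs without network access.

```python
from types import SimpleNamespace


def redirect_chain(response):
    """Return one summary dict per hop: redirects first, final page last.

    Hypothetical helper; relies only on the .history, .status_code and
    .url attributes that requests.Response provides.
    """
    hops = [*response.history, response]
    return [
        {"position": i, "status_code": hop.status_code, "url": hop.url}
        for i, hop in enumerate(hops, start=1)
    ]


# Stand-in objects (assumed values) mimicking a single redirect.
first = SimpleNamespace(status_code=301, url="http://cnn.com/", history=[])
final = SimpleNamespace(status_code=200, url="https://www.cnn.com/", history=[first])

for hop in redirect_chain(final):
    print(hop)
```

With a real response, redirect_chain(r) could be called directly on the r returned by requests.get above.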

Example Output

[
	{
		"@timestamp": "2021-06-26T16:33:20.384Z",
		"url": {
			"full": "https://www.cnn.com/",
			"original": "https://www.cnn.com/",
			"scheme": "https",
			"domain": "www.cnn.com",
			"path": "/"
		},
		"http": {
			"response": {
				"status_code": 200,
				"status_code_reason": "OK",
				"body_bytes": 1110460
			}
		},
		"chainitem": 2,
		"pguid": "1ff26fce-21a0-401a-9d53-1f863c6e3e31",
		"guid": "59dcfa56-b6d2-4924-bae1-70dbcd9d8309",
		"count": 324,
		"types": [
			"a-href",
			"form-action",
			"link-href",
			"meta-content",
			"script-src"
		],
		"tags": [
			"script",
			"meta",
			"a",
			"form",
			"link"
		],
		"attributes": [
			"action",
			"content",
			"src",
			"href"
		],
		"links": [
			"https://www.cnn.com/specials/cnn-investigates",
			"https://www.cnn.com/specials/tech/innovate",
			"https://www.cnn.com/travel/news",
			"https://www.i.cdn.cnn.com/.a/fonts/cnn/3.9.0/cnnsans-italic.woff2"
		...

Objects

# primary list-of-dictionaries / JSON dump
# these contain the full link extractions, including items not recognized as URLs or mobile links
output # list of dictionaries
json # JSON string

# lists
links_all # contains only full links, plus any relative links "repaired" back to full-link format (ex. /images becomes https://www.cnn.com/images)
types_all # ex. "a-href", "img-src", etc
tags_all # ex. "a", "img"
attributes_all # ex. "href", "src"

# generators, if urlbreakdown module is installed; runs URLBreakdown on every link in links_all
urlbreakdown_generator_dict()
urlbreakdown_generator_json()
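
The "repair" of relative links mentioned for links_all can be pictured with the standard library's urljoin. This is a minimal sketch of the idea, assuming the page URL serves as the base; it does not claim to be how extractlinks implements the repair, and repair_link is a hypothetical name.

```python
from urllib.parse import urljoin


def repair_link(page_url, link):
    """Resolve a possibly relative link against the page it was found on."""
    return urljoin(page_url, link)


page = "https://www.cnn.com/"

# A relative link is expanded to a full URL.
print(repair_link(page, "/images"))  # https://www.cnn.com/images

# An already-absolute link passes through unchanged.
print(repair_link(page, "https://www.cnn.com/travel/news"))
```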

Notes

  • select URL and HTTP output fields align to the Elastic Common Schema
  • links_count is not a unique count; it includes every object identified, including non-URL values found in otherwise link-related tag attributes
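
Because the count is not unique and may include non-URL values, a caller who wants distinct links may need to filter and de-duplicate on their own. The sketch below shows one way to do that with the standard library; unique_http_links is a hypothetical helper, and the sample values (including the meta-content string) are made up for illustration.

```python
from urllib.parse import urlparse


def unique_http_links(values):
    """Keep the first occurrence of each http(s) URL, preserving order."""
    seen = {}
    for value in values:
        parsed = urlparse(value)
        # Skip anything that is not an absolute http(s) URL.
        if parsed.scheme in ("http", "https") and parsed.netloc:
            seen.setdefault(value, None)
    return list(seen)


# Assumed sample input: one duplicate and one non-URL attribute value.
sample = [
    "https://www.cnn.com/travel/news",
    "https://www.cnn.com/travel/news",
    "width=device-width, initial-scale=1",
    "https://www.cnn.com/specials/tech/innovate",
]

print(unique_http_links(sample))
```

A dict is used instead of a set so that the original order of the links is kept.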

Download files


Source Distribution

extractlinks-0.1.0.tar.gz (5.8 kB)

Built Distribution

extractlinks-0.1.0-py3-none-any.whl (6.4 kB)
