Extract and repair links from Requests response objects, including redirects and the final landing page
extractlinks
Installation
pip install extractlinks
python3 -m pip install extractlinks
Usage
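import requests
from extractlinks import ExtractLinks

URL = "http://cnn.com/"
# follow redirects so both the redirect chain and the final landing page are available
r = requests.get(URL, allow_redirects=True)
# pass the Response object itself as content
e = ExtractLinks(content=r)
# print the extraction results as a JSON string
print(e.json)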
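Because the request above uses allow_redirects=True, the Response handed to ExtractLinks carries any intermediate redirect responses in its history attribute, which is a standard Requests feature. As a rough sketch of what "redirects and final landing page" covers, the hypothetical helper below lists the URLs in that chain. It is not part of extractlinks, and the stand-in objects only mimic the two Response attributes it reads.

```python
from types import SimpleNamespace


def redirect_chain(response):
    """Return the URLs visited, oldest first, ending with the landing page.

    Reads only the `history` and `url` attributes that requests.Response
    provides. Hypothetical helper, not part of extractlinks.
    """
    return [hop.url for hop in response.history] + [response.url]


# Stand-in for a real requests.Response so the sketch runs without network access.
fake_response = SimpleNamespace(
    url="https://www.cnn.com/",
    history=[SimpleNamespace(url="http://cnn.com/")],
)

chain = redirect_chain(fake_response)
for position, url in enumerate(chain, start=1):
    print(position, url)
```

With a live request, redirect_chain(r) could be called on the r from the usage example in the same way.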
Example Output
[
  {
    "@timestamp": "2021-06-26T16:33:20.384Z",
    "url": {
      "full": "https://www.cnn.com/",
      "original": "https://www.cnn.com/",
      "scheme": "https",
      "domain": "www.cnn.com",
      "path": "/"
    },
    "http": {
      "response": {
        "status_code": 200,
        "status_code_reason": "OK",
        "body_bytes": 1110460
      },
      "chainitem": 2,
      "pguid": "1ff26fce-21a0-401a-9d53-1f863c6e3e31",
      "guid": "59dcfa56-b6d2-4924-bae1-70dbcd9d8309",
      "count": 324,
      "types": [
        "a-href",
        "form-action",
        "link-href",
        "meta-content",
        "script-src"
      ],
      "tags": [
        "script",
        "meta",
        "a",
        "form",
        "link"
      ],
      "attributes": [
        "action",
        "content",
        "src",
        "href"
      ],
      "links": [
        "https://www.cnn.com/specials/cnn-investigates",
        "https://www.cnn.com/specials/tech/innovate",
        "https://www.cnn.com/travel/news",
        "https://www.i.cdn.cnn.com/.a/fonts/cnn/3.9.0/cnnsans-italic.woff2"
        ...
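The url block in the output splits an address into its parts. As a sketch of how those parts relate to the full URL, the hypothetical url_fields helper below derives the same scheme, domain, and path values with the standard library. This is an illustration only, not extractlinks' own code, and it leaves out the "original" field because the example does not show how that field can differ from "full".

```python
from urllib.parse import urlsplit


def url_fields(full_url):
    """Split a URL into fields named like the "url" block in the example output.

    Hypothetical helper for illustration; not part of extractlinks.
    """
    parts = urlsplit(full_url)
    return {
        "full": full_url,
        "scheme": parts.scheme,
        "domain": parts.hostname,
        "path": parts.path,
    }


fields = url_fields("https://www.cnn.com/specials/tech/innovate")
print(fields)
```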
Objects
# primary list-of-dictionaries / JSON dump
# these contain the full link extractions, including items not recognized as URLs or mobile links
output # list of dictionaries
json # JSON string
# lists
links_all # only full links, plus relative links "repaired" into full-link format (e.g. /images becomes https://www.cnn.com/images)
types_all # e.g. "a-href", "img-src"
tags_all # e.g. "a", "img"
attributes_all # e.g. "href", "src"
# generators, available if the urlbreakdown module is installed; each runs URLBreakdown on every link in links_all
urlbreakdown_generator_dict()
urlbreakdown_generator_json()
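The "repair" of relative links described for links_all can be pictured with the standard library's urljoin. The sketch below is an assumption about the general technique, not the package's actual implementation; repair_link is a hypothetical name, and the rule of keeping only http(s) results is chosen here purely for illustration.

```python
from urllib.parse import urljoin, urlsplit


def repair_link(base_url, link):
    """Resolve `link` against `base_url`; return None if the result is not http(s).

    Hypothetical helper for illustration; not part of extractlinks.
    """
    absolute = urljoin(base_url, link)
    if urlsplit(absolute).scheme in ("http", "https"):
        return absolute
    return None


base = "https://www.cnn.com/"
print(repair_link(base, "/images"))                  # relative path
print(repair_link(base, "//cdn.cnn.com/app.js"))     # scheme-relative link
print(repair_link(base, "https://example.com/a"))    # already a full link
print(repair_link(base, "mailto:news@example.com"))  # not a web link
```

Resolving against the final landing page URL rather than the originally requested one matters when a redirect changes the host, since relative links are relative to the page that actually served them.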
Notes
- select URL and HTTP output fields align to the Elastic Common Schema
- links_count is not a unique count; it includes every object identified, including non-URL values found in otherwise link-related tag attributes
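Since the count is not unique, callers who need distinct links may want to de-duplicate on their side. A minimal sketch, assuming the links are available as a plain list of strings (for example from links_all); the sample list is made up for illustration and unique_in_order is a hypothetical helper:

```python
def unique_in_order(links):
    """Drop repeated links while keeping first-seen order."""
    # dict preserves insertion order, so keys come back in first-seen order
    return list(dict.fromkeys(links))


# Illustrative data only; a real page will typically repeat many links.
sample_links = [
    "https://www.cnn.com/travel/news",
    "https://www.cnn.com/specials/tech/innovate",
    "https://www.cnn.com/travel/news",
]

unique_links = unique_in_order(sample_links)
print(len(sample_links), "found,", len(unique_links), "unique")
```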