Extract and repair links from Requests response objects, including redirects and the final landing page
Project description
extractlinks
Installation
pip install extractlinks
python3 -m pip install extractlinks
Usage
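import requests
from extractlinks import ExtractLinks
URL = "http://cnn.com/"
# follow redirects so the final landing page is fetched as well
r = requests.get(URL, allow_redirects=True)
# pass the Response object itself, not r.text
e = ExtractLinks(content=r)
# JSON string of the extraction results
print(e.json)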
Example Output
[
  {
    "@timestamp": "2021-06-26T16:33:20.384Z",
    "url": {
      "full": "https://www.cnn.com/",
      "original": "https://www.cnn.com/",
      "scheme": "https",
      "domain": "www.cnn.com",
      "path": "/"
    },
    "http": {
      "response": {
        "status_code": 200,
        "status_code_reason": "OK",
        "body_bytes": 1110460
      },
      "chainitem": 2,
      "pguid": "1ff26fce-21a0-401a-9d53-1f863c6e3e31",
      "guid": "59dcfa56-b6d2-4924-bae1-70dbcd9d8309",
      "count": 324,
      "types": [
        "a-href",
        "form-action",
        "link-href",
        "meta-content",
        "script-src"
      ],
      "tags": [
        "script",
        "meta",
        "a",
        "form",
        "link"
      ],
      "attributes": [
        "action",
        "content",
        "src",
        "href"
      ],
      "links": [
        "https://www.cnn.com/specials/cnn-investigates",
        "https://www.cnn.com/specials/tech/innovate",
        "https://www.cnn.com/travel/news",
        "https://www.i.cdn.cnn.com/.a/fonts/cnn/3.9.0/cnnsans-italic.woff2"
        ...
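The "url" object in the output above can be approximated with the standard library. The following is a minimal sketch, not the package's own code: `url_fields` is a hypothetical helper name, and the "original" field is left out because the excerpt does not show how it is derived.

```python
from urllib.parse import urlsplit


def url_fields(url):
    """Split a URL into a few of the fields shown in the example output.

    Hypothetical helper for illustration; not part of extractlinks.
    """
    parts = urlsplit(url)
    return {
        "full": url,
        "scheme": parts.scheme,
        "domain": parts.hostname,  # host name without port or credentials
        "path": parts.path,
    }


print(url_fields("https://www.cnn.com/travel/news"))
```

`urlsplit` is used here instead of string slicing so that ports, query strings, and fragments do not leak into the domain or path.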
Objects
# primary list-of-dictionaries / JSON dump
# these contain the full link extractions, including items not recognized as URLs or mobile links
output # list of dictionaries
json # JSON string
# lists
links_all # only full links, plus relative links "repaired" back to full-link format (e.g. /images becomes https://www.cnn.com/images)
types_all # e.g. "a-href", "img-src"
tags_all # e.g. "a", "img"
attributes_all # e.g. "href", "src"
# generators, available if the urlbreakdown module is installed; they run URLBreakdown on every link in links_all
urlbreakdown_generator_dict()
urlbreakdown_generator_json()
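The "repair" of relative links mentioned for links_all can be illustrated with `urllib.parse.urljoin`. This is a sketch of the general technique under the assumption that the page URL serves as the base; it is not necessarily how extractlinks does it, and `repair_link` is a hypothetical name.

```python
from urllib.parse import urljoin


def repair_link(base_url, link):
    """Resolve a possibly relative link against the page it was found on.

    Hypothetical helper for illustration; not part of extractlinks.
    """
    return urljoin(base_url, link)


base = "https://www.cnn.com/"
for raw in ["/images", "travel/news", "//cdn.cnn.com/app.js", "https://example.com/a"]:
    print(raw, "->", repair_link(base, raw))
```

Links that are already absolute pass through unchanged, and scheme-relative links (starting with `//`) inherit the scheme of the base URL.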
Notes
- selected URL and HTTP output fields align with the Elastic Common Schema
- links_count is not a unique count; it includes every object identified, including non-URLs found in otherwise link-related tag attributes
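Because the reported count is not unique, a caller who wants a unique total has to deduplicate the links. A minimal sketch, assuming the links are available as a plain list of strings (for example from links_all); `unique_in_order` and the sample list are illustrative only.

```python
def unique_in_order(links):
    """Drop repeated links while keeping first-seen order.

    Hypothetical helper for illustration; not part of extractlinks.
    """
    # dict keys are unique and preserve insertion order
    return list(dict.fromkeys(links))


# sample data for illustration, not real extractlinks output
links = [
    "https://www.cnn.com/travel/news",
    "https://www.cnn.com/specials/tech/innovate",
    "https://www.cnn.com/travel/news",
]
unique = unique_in_order(links)
print(len(links), "links,", len(unique), "unique")
```

Using `dict.fromkeys` instead of `set` keeps the links in the order they appeared on the page.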
File details
Details for the file extractlinks-0.1.0.tar.gz.
File metadata
- Download URL: extractlinks-0.1.0.tar.gz
- Upload date:
- Size: 5.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.1 importlib_metadata/3.7.3 pkginfo/1.7.0 requests/2.22.0 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.6.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | cc59e04319c010fdcdfb7941518eff8a60e15b561e5b04417ce16d068fe3bdbd |
| MD5 | 3ad809a4bbd194566b20af7954ab8085 |
| BLAKE2b-256 | edea7bf330acc52016592615de3f1d5d898eb3a379c71b6c2075065a9a1f3e1d |
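A downloaded copy of the file can be checked against the published SHA256 digest. The following is a minimal sketch using only `hashlib`; the function names are illustrative and the file path is whatever location the archive was saved to.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()


# Example call, assuming the archive was saved in the current directory:
# matches_published(
#     "extractlinks-0.1.0.tar.gz",
#     "cc59e04319c010fdcdfb7941518eff8a60e15b561e5b04417ce16d068fe3bdbd",
# )
```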
File details
Details for the file extractlinks-0.1.0-py3-none-any.whl.
File metadata
- Download URL: extractlinks-0.1.0-py3-none-any.whl
- Upload date:
- Size: 6.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.1 importlib_metadata/3.7.3 pkginfo/1.7.0 requests/2.22.0 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.6.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ab748682211a32442d0acf0fe4b4e6007a5c2d6b41720bf2c1910fb7c5a3c55f |
| MD5 | a2b76c947384f57b4d45f4125fad7e4c |
| BLAKE2b-256 | 6aae19ecfdc93177ad323f0cf2631cd4744392c5803559519a91b886c9d632cb |