Kraken Extract From HTML
Extract from html
What it does
Extracts the following from an HTML page:
- urls
- emails
- images
- tables
- structured data (schema.org)
- text
- title
- feeds
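To make the list above concrete, here is a minimal sketch of this kind of extraction using only the Python standard library. It is not the package's implementation: the class `SimpleExtractor`, the function `extract`, and the shape of the returned dictionary are hypothetical names chosen for illustration, and only the title, URLs, images, emails and text are covered.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Deliberately simple email pattern; real-world address matching is looser.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class SimpleExtractor(HTMLParser):
    """Hypothetical helper: collects links, images, title and text."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls, self.images, self.text_parts = [], [], []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            # Resolve relative links against the page URL.
            self.urls.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "img" and attrs.get("src"):
            self.images.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data.strip()
        elif data.strip():
            # Note: this also keeps script/style contents; a fuller
            # extractor would skip those elements.
            self.text_parts.append(data.strip())


def extract(url, html):
    """Return a dict of extractions for one page (illustrative shape)."""
    parser = SimpleExtractor(url)
    parser.feed(html)
    parser.close()
    text = " ".join(parser.text_parts)
    return {
        "title": parser.title,
        "urls": parser.urls,
        "images": parser.images,
        "emails": sorted(set(EMAIL_RE.findall(text))),
        "text": text,
    }
```

Tables, schema.org structured data and feeds need more machinery (row and cell tracking, JSON-LD parsing, `<link rel="alternate">` detection) and are left out of the sketch.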
How to use
Using the api
Send a URL (GET)
Pass the page URL in the query parameter 'url'. The service retrieves the content at that URL and returns the extracted data. If 'contentUrl' is also provided, the content is retrieved from 'contentUrl' instead, while 'url' is still used as the page URL in the returned records.
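A GET request as described above could be assembled as follows. This is a sketch under assumptions: the README does not say where the API is hosted, so `API_BASE` is a placeholder, and `build_get_url` is a hypothetical helper. Only the parameter names 'url' and 'contentUrl' come from the text.

```python
from urllib.parse import urlencode

# Placeholder only: the README does not state the API's address.
API_BASE = "https://extract.example.invalid/"


def build_get_url(url, content_url=None, api_base=API_BASE):
    """Return the request URL carrying 'url' and, optionally, 'contentUrl'."""
    params = {"url": url}
    if content_url is not None:
        params["contentUrl"] = content_url
    # urlencode percent-escapes the values so they are safe in a query string.
    return api_base + "?" + urlencode(params)


request_url = build_get_url("https://www.petro-canada.ca/en/business/rack-prices")
```

The resulting string can then be passed to any HTTP client, for example `urllib.request.urlopen(request_url)`, once `API_BASE` points at a real deployment.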
Send a WebContent object (POST)
The content is taken from the 'text' field if it is present; otherwise it is retrieved from the URL in 'archivedAt'.
{
    "@type": "webContent",
    "url": [
        "https://storage.googleapis.com/kraken-cdn/641fcdaa9664421b3ac4db2b6b494397bf0dc8d65a559e9c2238de77d09e740e.html"
    ],
    "archivedAt": [
        "https://storage.googleapis.com/kraken-cdn/641fcdaa9664421b3ac4db2b6b494397bf0dc8d65a559e9c2238de77d09e740e.html"
    ],
    "about": {
        "@type": "webPage",
        "url": "https://www.petro-canada.ca/en/business/rack-prices"
    }
}
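A payload shaped like the example above can be built programmatically. The sketch below assumes that 'text' is a plain string and that at least one of 'text' or 'archivedAt' must be supplied; neither detail is confirmed by the README, and `build_web_content` is a hypothetical helper, not part of the package.

```python
import json


def build_web_content(page_url, archived_at=None, text=None):
    """Build a WebContent dict mirroring the README's example object."""
    if archived_at is None and text is None:
        raise ValueError("provide 'text' or 'archived_at' so there is content to extract")
    payload = {
        "@type": "webContent",
        "about": {"@type": "webPage", "url": page_url},
    }
    if archived_at is not None:
        # The README example repeats the archive URL in both list fields.
        payload["url"] = [archived_at]
        payload["archivedAt"] = [archived_at]
    if text is not None:
        # Assumption: 'text' carries the raw HTML as a string.
        payload["text"] = text
    return payload


# Serialize to bytes, ready to be sent as the body of a POST request.
body = json.dumps(
    build_web_content(
        "https://www.petro-canada.ca/en/business/rack-prices",
        text="<html><body>Example</body></html>",
    )
).encode("utf-8")
```

Keeping the payload construction separate from the HTTP call makes it easy to inspect or log the object before it is posted.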
Using the library
Given the URL of a page and its HTML content, returns a list of records containing the extractions.
from kraken_extract_from_html import kraken_extract_from_html as k
records = k.get(url, html)