Kraken Extract From HTML
Extract from html
What it does
Extracts the following from HTML:
- URLs
- emails
- images
- tables
- structured data (schema.org)
- text
- title
- feeds
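To make a few of the items above concrete, here is a minimal sketch of this kind of extraction using only Python's standard library. It is an illustration, not how this package is implemented: the class `SimpleExtractor` and the function `extract_basic` are hypothetical names, and only the title, link URLs, `mailto:` emails, and image sources are covered.

```python
from html.parser import HTMLParser


class SimpleExtractor(HTMLParser):
    """Illustrative only: collects title, link URLs, mailto emails, image sources."""

    def __init__(self):
        super().__init__()
        self.urls = []
        self.emails = []
        self.images = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("href"):
            href = attributes["href"]
            if href.startswith("mailto:"):
                # Treat mailto links as email addresses.
                self.emails.append(href[len("mailto:"):])
            else:
                self.urls.append(href)
        elif tag == "img" and attributes.get("src"):
            self.images.append(attributes["src"])
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Accumulate text only while inside <title>.
        if self._in_title:
            self.title += data


def extract_basic(html):
    """Return a dict with a small subset of the extractions listed above."""
    parser = SimpleExtractor()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "urls": parser.urls,
        "emails": parser.emails,
        "images": parser.images,
    }
```

Tables, schema.org structured data, body text, and feeds need more work than this sketch shows.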
How to use
Using the api
Send a URL (GET)
Send the URL as the query parameter 'url'. The service retrieves the content and returns the extracted data. If 'contentUrl' is also provided, the content is fetched from 'contentUrl', while 'url' is still used for the attributes of the returned data.
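As a rough sketch, the request address for a GET call could be assembled as follows. The base address `API_BASE` is a placeholder, since the API's actual endpoint is not stated here, and `build_get_url` is a hypothetical helper name; only the 'url' and 'contentUrl' parameter names come from the description above.

```python
from urllib.parse import urlencode

# Placeholder: the real endpoint of the API is not given in this description.
API_BASE = "https://example.invalid/extract"


def build_get_url(url, content_url=None, api_base=API_BASE):
    """Build the GET address with 'url' and, optionally, 'contentUrl'."""
    params = {"url": url}
    if content_url is not None:
        # Content is read from contentUrl, while url is kept for the attributes.
        params["contentUrl"] = content_url
    return f"{api_base}?{urlencode(params)}"
```

The resulting string can then be passed to any HTTP client, for example `urllib.request.urlopen`.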
Send a WebContent object (POST)
The content is taken either from the 'text' field or retrieved from the URL in 'archivedAt'.
{
  "@type": "webContent",
  "url": [
    "https://storage.googleapis.com/kraken-cdn/641fcdaa9664421b3ac4db2b6b494397bf0dc8d65a559e9c2238de77d09e740e.html"
  ],
  "archivedAt": [
    "https://storage.googleapis.com/kraken-cdn/641fcdaa9664421b3ac4db2b6b494397bf0dc8d65a559e9c2238de77d09e740e.html"
  ],
  "about": {
    "@type": "webPage",
    "url": "https://www.petro-canada.ca/en/business/rack-prices"
  }
}
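A WebContent object like the one above could be built and wrapped in a POST request roughly as follows. This is a sketch under assumptions: `API_BASE` is a placeholder because the endpoint is not stated here, the JSON `Content-Type` header is assumed rather than documented, and `build_web_content` and `build_post_request` are hypothetical helper names.

```python
import json
from urllib.request import Request

# Placeholder: the real endpoint of the API is not given in this description.
API_BASE = "https://example.invalid/extract"


def build_web_content(url, archived_at=None, text=None, about_url=None):
    """Build a WebContent record shaped like the example above."""
    record = {"@type": "webContent", "url": [url]}
    if archived_at is not None:
        record["archivedAt"] = [archived_at]
    if text is not None:
        record["text"] = text
    if about_url is not None:
        record["about"] = {"@type": "webPage", "url": about_url}
    return record


def build_post_request(record, api_base=API_BASE):
    """Wrap the record in a POST request (JSON body is an assumption)."""
    body = json.dumps(record).encode("utf-8")
    return Request(
        api_base,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

The request object is only constructed here; sending it with `urllib.request.urlopen(request)` requires the real endpoint.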
Using the library
Given the URL of the page and its HTML content, returns a list of records with the extractions.
from kraken_extract_from_html import kraken_extract_from_html as k
records = k.get(url, html)
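The shape of the returned records is not documented here. Assuming each record is a dict carrying an "@type" key, as the WebContent example does, a small helper such as the hypothetical `group_by_type` below could sort the results by kind.

```python
from collections import defaultdict


def group_by_type(records):
    """Group records by their "@type" value.

    Assumption: records are dicts with an "@type" key; anything else
    is filed under "unknown".
    """
    grouped = defaultdict(list)
    for record in records:
        if isinstance(record, dict):
            record_type = record.get("@type", "unknown")
        else:
            record_type = "unknown"
        grouped[record_type].append(record)
    return dict(grouped)
```

With the library call above, this would be used as `group_by_type(k.get(url, html))`.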
File details
Details for the file kraken-extract-from-html-0.0.21.tar.gz.
File metadata
- Download URL: kraken-extract-from-html-0.0.21.tar.gz
- Upload date:
- Size: 7.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.18
File hashes
Algorithm | Hash digest
---|---
SHA256 | a7385d9afcfed3343346a51634bf1a00d15bd9dc5d692fac04ab582261ebbf53
MD5 | 155d2f085b82925de07e292ce27f62a7
BLAKE2b-256 | 7857e5f4e5de4610879acf79f66e27b0599584b32f197517514c2fe6d70d5303
File details
Details for the file kraken_extract_from_html-0.0.21-py3-none-any.whl.
File metadata
- Download URL: kraken_extract_from_html-0.0.21-py3-none-any.whl
- Upload date:
- Size: 10.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.18
File hashes
Algorithm | Hash digest
---|---
SHA256 | fec384a162812b09a17c3451aee396124d80e43dee31d6c9c3548c540965b4dd
MD5 | 4cc3a8c01d41d701d299c6d8de8a5ee8
BLAKE2b-256 | 6ca49801c38761c97da184d3ff40e92873e4a6402d16d052d71f54d65bc21f2e