Kraken Extract From HTML

Extract from html

What it does

Extracts the following from HTML:

  • URLs
  • emails
  • images
  • tables
  • structured data (schema.org)
  • text
  • title
  • feeds
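
To make a few of the items above concrete, here is a minimal sketch using only the Python standard library. It is not how this package is implemented (the source does not describe its internals); the class name `SimpleExtractor` and the sample HTML are made up for illustration, and it covers only urls, emails (from `mailto:` links), images and the title.

```python
from html.parser import HTMLParser


class SimpleExtractor(HTMLParser):
    """Illustrative extractor; NOT the implementation used by this package."""

    def __init__(self):
        super().__init__()
        self.urls, self.emails, self.images = [], [], []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            href = attrs["href"]
            if href.startswith("mailto:"):
                # Treat mailto: links as email addresses
                self.emails.append(href[len("mailto:"):])
            else:
                self.urls.append(href)
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Collect text only while inside <title>
        if self._in_title:
            self.title += data


# Made-up sample page for illustration
sample_html = (
    "<html><head><title>Demo</title></head><body>"
    '<a href="https://example.com/a">A</a>'
    '<a href="mailto:info@example.com">Mail</a>'
    '<img src="/logo.png">'
    "</body></html>"
)

parser = SimpleExtractor()
parser.feed(sample_html)
print(parser.title, parser.urls, parser.emails, parser.images)
```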

How to use

Using the API

Send a URL (GET)

Send the page URL in the 'url' query parameter. The service retrieves the content at that URL and returns the extracted data. If 'contentUrl' is also provided, the content is read from 'contentUrl', while 'url' is still used for the URL attributes of the results.
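
A minimal sketch of building such a GET request URL. The base address `https://example.com/extract` is a placeholder, since the address of the API is not given here, and the helper name `build_extract_url` is hypothetical. Only the 'url' and 'contentUrl' parameter names come from the description above.

```python
from urllib.parse import urlencode

# Placeholder: the real API address is not stated in this document.
API_BASE = "https://example.com/extract"


def build_extract_url(url, content_url=None, api_base=API_BASE):
    """Return the GET URL with 'url' and, optionally, 'contentUrl' as query parameters."""
    params = {"url": url}
    if content_url is not None:
        params["contentUrl"] = content_url
    # urlencode percent-encodes the values so they are safe inside a query string
    return f"{api_base}?{urlencode(params)}"


request_url = build_extract_url("https://www.petro-canada.ca/en/business/rack-prices")
print(request_url)
```

The resulting string can be passed to any HTTP client, for example `urllib.request.urlopen(request_url)`.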

Send a WebContent object (POST)

The content to extract is taken either from the 'text' field or from the URL in 'archivedAt', which the service retrieves.

{
    "@type": "webContent",
    "url": [
        "https://storage.googleapis.com/kraken-cdn/641fcdaa9664421b3ac4db2b6b494397bf0dc8d65a559e9c2238de77d09e740e.html"
    ],
    "archivedAt": [
        "https://storage.googleapis.com/kraken-cdn/641fcdaa9664421b3ac4db2b6b494397bf0dc8d65a559e9c2238de77d09e740e.html"
    ],
    "about": {
        "@type": "webPage",
        "url": "https://www.petro-canada.ca/en/business/rack-prices"
    }
}
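
A minimal sketch of assembling a record shaped like the example above and serialising it for the POST body. The helper name `build_web_content` is hypothetical, and passing 'text' as a plain string is an assumption; the field names are taken from the example.

```python
import json


def build_web_content(page_url, archived_at=None, text=None):
    """Build a WebContent record; needs 'text' or 'archived_at' so there is content to extract."""
    if archived_at is None and text is None:
        raise ValueError("provide 'text' or 'archived_at'")
    record = {
        "@type": "webContent",
        "about": {"@type": "webPage", "url": page_url},
    }
    if archived_at is not None:
        # The example above uses lists for both 'url' and 'archivedAt'
        record["url"] = [archived_at]
        record["archivedAt"] = [archived_at]
    if text is not None:
        # Assumption: 'text' carries the HTML as a plain string
        record["text"] = text
    return record


record = build_web_content(
    "https://www.petro-canada.ca/en/business/rack-prices",
    text="<html><body>Hello</body></html>",
)
body = json.dumps(record)  # JSON string to send as the POST body
print(body)
```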

Using the library

Given the URL of a page and its HTML content, returns a list of records containing the extractions.

from kraken_extract_from_html import kraken_extract_from_html as k

# url: the address of the page; html: its HTML content
records = k.get(url, html)
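
The shape of the returned records is not documented here. Assuming each record is a dict carrying an '@type' key, as in the WebContent example, a sketch like the following could group the results by type. The helper `group_by_type` and the sample records are made up for illustration and are not output of the library.

```python
from collections import defaultdict


def group_by_type(records):
    """Group records by their '@type' value; anything without one goes under 'unknown'."""
    grouped = defaultdict(list)
    for record in records or []:
        if isinstance(record, dict):
            record_type = record.get("@type", "unknown")
        else:
            record_type = "unknown"
        grouped[record_type].append(record)
    return dict(grouped)


# Made-up records, only to show the grouping; real output may look different.
sample_records = [
    {"@type": "webPage", "url": "https://example.com/"},
    {"@type": "imageObject", "url": "https://example.com/logo.png"},
    {"@type": "webPage", "url": "https://example.com/about"},
]

grouped = group_by_type(sample_records)
print({record_type: len(items) for record_type, items in grouped.items()})
```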

Download files

Source Distribution

kraken-extract-from-html-0.0.19.tar.gz (7.1 kB)

Hashes for kraken-extract-from-html-0.0.19.tar.gz
Algorithm    Hash digest
SHA256       9187ae61de8dd055d8b894271917745b2aeb220c496a4738688c7f46d37fabd3
MD5          a7a442916f6cf2c5423e9e2fe95eba10
BLAKE2b-256  b596b749e2a77405c80c43d40809582b45d41c8521803aba34d65d6286b6ab11

Hashes for kraken_extract_from_html-0.0.19-py3-none-any.whl
Algorithm    Hash digest
SHA256       26639d2974d50c042db2ee1021462b4282434c37574b01019cda4b3c6c60b39e
MD5          b485ab22cc9b6507e86a6af18af07731
BLAKE2b-256  a1271a4c353f9a1f77c0e406c42d20de00b3b2fffcc550ed34f103b5e6bfdbe8
