
Kraken Extract From HTML


Extract from html

What it does

Extracts the following from html:

  • urls
  • emails
  • images
  • tables
  • structured data (schema.org)
  • text
  • title
  • feeds

How to use

Using the api

Send a url (get)

Send the page url as the query parameter 'url'. The service retrieves the content and returns the extracted data. If 'contentUrl' is also provided, the content is taken from 'contentUrl', and 'url' is used for the attributes.
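A minimal sketch of building such a GET request in Python. The endpoint address `https://example.com/extract` and the helper name `build_extract_url` are placeholders, since the address of the service is not stated here; only the 'url' and 'contentUrl' parameter names come from the description above.

```python
from urllib.parse import urlencode

# Placeholder endpoint: the real address of the service is not given in this document.
API_BASE = "https://example.com/extract"


def build_extract_url(url, content_url=None, api_base=API_BASE):
    """Return the GET address that asks the service to extract data for `url`."""
    params = {"url": url}
    if content_url is not None:
        # Optional: read the content from this address instead of `url`.
        params["contentUrl"] = content_url
    # urlencode percent-encodes the values so they are safe inside a query string.
    return f"{api_base}?{urlencode(params)}"


request_url = build_extract_url("https://www.petro-canada.ca/en/business/rack-prices")
print(request_url)
```

The resulting address can be fetched with any HTTP client, for example `urllib.request.urlopen(request_url)`.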

Send a WebContent object (post)

The content is extracted either from the 'text' field or from the page retrieved from the url in 'archivedAt'.

{
    "@type": "webContent",
    "url": [
        "https://storage.googleapis.com/kraken-cdn/641fcdaa9664421b3ac4db2b6b494397bf0dc8d65a559e9c2238de77d09e740e.html"
    ],
    "archivedAt": [
        "https://storage.googleapis.com/kraken-cdn/641fcdaa9664421b3ac4db2b6b494397bf0dc8d65a559e9c2238de77d09e740e.html"
    ],
    "about": {
        "@type": "webPage",
        "url": "https://www.petro-canada.ca/en/business/rack-prices"
    }
}
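A hedged sketch of preparing this POST request in Python. The endpoint address, the JSON content type, the helper names, and the archived address used in the example are assumptions for illustration; the field names mirror the object shown above, and 'text' is assumed here to be a plain string.

```python
import json
from urllib.request import Request

# Placeholder endpoint: the real address of the service is not given in this document.
API_BASE = "https://example.com/extract"


def build_web_content(page_url, archived_at=None, text=None):
    """Build a WebContent object shaped like the example above."""
    record = {
        "@type": "webContent",
        "about": {"@type": "webPage", "url": page_url},
    }
    if archived_at is not None:
        # The example above lists the archived copy under both keys.
        record["url"] = [archived_at]
        record["archivedAt"] = [archived_at]
    if text is not None:
        # Assumption: 'text' carries the raw html as a string.
        record["text"] = text
    return record


def build_post_request(record, api_base=API_BASE):
    """Wrap the record in a POST request without sending it."""
    body = json.dumps(record).encode("utf-8")
    headers = {"Content-Type": "application/json"}  # assumed content type
    return Request(api_base, data=body, headers=headers, method="POST")


record = build_web_content(
    "https://www.petro-canada.ca/en/business/rack-prices",
    archived_at="https://example.com/archived-page.html",  # hypothetical archived copy
)
request = build_post_request(record)
print(request.get_method(), request.full_url)
```

Passing `request` to `urllib.request.urlopen` would send it; the sketch stops short of that because the endpoint is a placeholder.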

Using the library

Given the url of a page and its html content, returns a list of records with the extractions.

from kraken_extract_from_html import kraken_extract_from_html as k

# url: address of the page; html: the page's html content
records = k.get(url, html)
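The shape of the returned records is not documented here. As a sketch only, assuming each record is a dict that carries an '@type' key like the WebContent object above, the results could be summarised as follows. The helper `count_by_type` and the sample records are illustrative, not real output from the library.

```python
from collections import Counter


def count_by_type(records):
    """Count records per '@type' value; anything without one counts as 'unknown'."""
    return Counter(
        record.get("@type", "unknown") if isinstance(record, dict) else "unknown"
        for record in records
    )


# Illustrative input only; in practice this would be the list returned by k.get(url, html).
sample_records = [
    {"@type": "webPage", "url": "https://example.com/"},
    {"@type": "imageObject", "url": "https://example.com/logo.png"},
    {"@type": "webPage", "url": "https://example.com/about"},
]

counts = count_by_type(sample_records)
for record_type, total in counts.most_common():
    print(record_type, total)
```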


