Kraken Extract From HTML

Extract from html

What it does

Extracts the following from HTML:

  • urls
  • emails
  • images
  • tables
  • structured data (schema.org)
  • text
  • title
  • feeds

How to use

Using the api

Send a URL (GET)

Send the URL as the query parameter 'url'. The service retrieves the content and returns the extracted data. If 'contentUrl' is also provided, the content is taken from 'contentUrl' instead, while 'url' is still used for the attributes.
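
As a rough illustration of the GET call described above, the request URL could be assembled as follows. The endpoint address is not given here, so API_BASE is a placeholder and the helper name build_extract_url is hypothetical; only the parameter names 'url' and 'contentUrl' come from the description.

```python
from urllib.parse import urlencode

# Placeholder endpoint: the real API address is not stated in this document.
API_BASE = "https://example.invalid/extract"


def build_extract_url(url, content_url=None, api_base=API_BASE):
    """Return a GET request URL carrying 'url' and, optionally, 'contentUrl'."""
    params = {"url": url}
    if content_url is not None:
        # Content is read from contentUrl; 'url' is kept for the attributes.
        params["contentUrl"] = content_url
    return api_base + "?" + urlencode(params)


request_url = build_extract_url("https://www.petro-canada.ca/en/business/rack-prices")
print(request_url)
```

The URL values are percent-encoded by urlencode, which matters because the page URL itself contains ':' and '/' characters.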

Send a WebContent object (POST)

The content is extracted either from the 'text' field or from the page retrieved from the URL in 'archivedAt'.

{
    "@type": "webContent",
    "url": [
        "https://storage.googleapis.com/kraken-cdn/641fcdaa9664421b3ac4db2b6b494397bf0dc8d65a559e9c2238de77d09e740e.html"
    ],
    "archivedAt": [
        "https://storage.googleapis.com/kraken-cdn/641fcdaa9664421b3ac4db2b6b494397bf0dc8d65a559e9c2238de77d09e740e.html"
    ],
    "about": {
        "@type": "webPage",
        "url": "https://www.petro-canada.ca/en/business/rack-prices"
    }
}
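
A minimal sketch of preparing such a POST request with the standard library is shown below. The endpoint address, the JSON content type, and the helper names build_webcontent and build_post_request are assumptions made for illustration; the field names mirror the example object above.

```python
import json
import urllib.request

# Placeholder endpoint: the real API address is not stated in this document.
API_URL = "https://example.invalid/extract"


def build_webcontent(page_url, archived_url=None, text=None):
    """Build a WebContent object shaped like the example above."""
    record = {
        "@type": "webContent",
        "about": {"@type": "webPage", "url": page_url},
    }
    if archived_url is not None:
        record["url"] = [archived_url]
        record["archivedAt"] = [archived_url]
    if text is not None:
        # Assumption: 'text' holds the HTML content as a plain string.
        record["text"] = text
    return record


def build_post_request(record, api_url=API_URL):
    """Wrap the record in a POST request; assumes the API accepts a JSON body."""
    body = json.dumps(record).encode("utf-8")
    return urllib.request.Request(
        api_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


record = build_webcontent(
    "https://www.petro-canada.ca/en/business/rack-prices",
    text="<html><title>Rack prices</title></html>",
)
request = build_post_request(record)
# The request is only prepared here; urllib.request.urlopen(request) would send it.
```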

Using the library

Given the URL of a page and its HTML content, returns a list of records containing the extractions.

from kraken_extract_from_html import kraken_extract_from_html as k

# url: address of the page (str); html: the page's HTML content (str)
records = k.get(url, html)
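
The shape of the returned records is not documented here. Assuming they are dictionaries carrying an '@type' key, like the WebContent object shown earlier, a small helper such as the hypothetical group_by_type below could sort them by kind. The sample records are made up for illustration and are not real library output.

```python
from collections import defaultdict


def group_by_type(records):
    """Group records by their '@type' value (assumed key; 'unknown' if absent)."""
    grouped = defaultdict(list)
    for record in records:
        if isinstance(record, dict):
            record_type = record.get("@type", "unknown")
        else:
            record_type = "unknown"
        grouped[record_type].append(record)
    return dict(grouped)


# Made-up sample standing in for the result of k.get(url, html).
sample_records = [
    {"@type": "webPage", "url": "https://example.com/"},
    {"@type": "imageObject", "contentUrl": "https://example.com/logo.png"},
    {"@type": "webPage", "url": "https://example.com/about"},
]

grouped = group_by_type(sample_records)
for record_type, items in grouped.items():
    print(record_type, len(items))
```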

Source Distribution

kraken-extract-from-html-0.0.16.tar.gz (7.0 kB)

Uploaded Source

Built Distribution

kraken_extract_from_html-0.0.16-py3-none-any.whl (10.7 kB)

Uploaded Python 3

File details

Details for the file kraken-extract-from-html-0.0.16.tar.gz.

File hashes

Hashes for kraken-extract-from-html-0.0.16.tar.gz

Algorithm    Hash digest
SHA256       88672faf0bff87895cb8c5f28dee98ed2880ea4ce29e0611b831b1c3622b19b3
MD5          5f3a3bda42cfe74ee3f6b28f61c6b7ec
BLAKE2b-256  1294a8306e4fba4608a28670568feb2e4c7a5c2b6235d465d8e993014b805fdc

File details

Details for the file kraken_extract_from_html-0.0.16-py3-none-any.whl.

File hashes

Hashes for kraken_extract_from_html-0.0.16-py3-none-any.whl

Algorithm    Hash digest
SHA256       37746950ab722fd78b35a8e2bee2cb42c1b83eef526b7f24bb2a7758fb553ba1
MD5          dea777c9596362512bbe1172e9206d46
BLAKE2b-256  8fd72d4867a8ef672134d6e3b14b6a3ad02f376d124e706fa485c8b4457c0bb2
