Kraken Extract From HTML

Extract from html

What it does

Extracts the following from an HTML document:

  • urls
  • emails
  • images
  • tables
  • structured data (schema.org)
  • text
  • title
  • feeds
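
To make a few of the items above concrete, here is a minimal standard-library sketch that pulls the title, urls, images, emails, and text out of an HTML string. It is not this package's implementation; `SimpleExtractor`, `extract`, and the email pattern are illustrative assumptions, and tables, feeds, and schema.org data are left out.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Deliberately simple email pattern (an assumption, not the package's rule).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class SimpleExtractor(HTMLParser):
    """Collects links, image sources, the title, and visible text."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls, self.images, self.text_parts = [], [], []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            # Resolve relative links against the page URL.
            self.urls.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "img" and attrs.get("src"):
            self.images.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Note: script and style contents also arrive here and are not filtered.
        if self._in_title:
            self.title += data
        elif data.strip():
            self.text_parts.append(data.strip())


def extract(base_url, html):
    """Return a dict with a subset of the extractions listed above."""
    parser = SimpleExtractor(base_url)
    parser.feed(html)
    parser.close()
    text = " ".join(parser.text_parts)
    return {
        "title": parser.title.strip(),
        "urls": parser.urls,
        "images": parser.images,
        "emails": sorted(set(EMAIL_RE.findall(text))),
        "text": text,
    }
```

Passing the page URL alongside the HTML lets relative links and image paths be resolved to absolute URLs, which is presumably why the library call further down takes both arguments.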

How to use

Using the API

Send a URL (GET)

Send the URL as the query parameter 'url'. The service retrieves the content at that URL and returns the extracted data. If 'contentUrl' is also provided, the content is retrieved from 'contentUrl' instead, and 'url' is still used for the attributes of the results.
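
A minimal sketch of how such a GET request URL could be assembled. The base address `API_BASE` and the helper `build_extract_url` are hypothetical, since this document does not state where the service is hosted; only the parameter names 'url' and 'contentUrl' come from the description above.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: the real service address is not stated in this document.
API_BASE = "https://example.invalid/extract"


def build_extract_url(url, content_url=None, api_base=API_BASE):
    """Return the GET URL carrying 'url' and, optionally, 'contentUrl'."""
    params = {"url": url}
    if content_url is not None:
        params["contentUrl"] = content_url
    # urlencode percent-encodes the values so they are safe in a query string.
    return f"{api_base}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_extract_url("https://www.petro-canada.ca/en/business/rack-prices"))
```

The resulting string can be passed to any HTTP client, for example `urllib.request.urlopen`.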

Send a WebContent object (POST)

The content is extracted either from the 'text' field or, when a URL is given in 'archivedAt', from the content retrieved at that URL.

{
    "@type": "webContent",
    "url": [
        "https://storage.googleapis.com/kraken-cdn/641fcdaa9664421b3ac4db2b6b494397bf0dc8d65a559e9c2238de77d09e740e.html"
    ],
    "archivedAt": [
        "https://storage.googleapis.com/kraken-cdn/641fcdaa9664421b3ac4db2b6b494397bf0dc8d65a559e9c2238de77d09e740e.html"
    ],
    "about": {
        "@type": "webPage",
        "url": "https://www.petro-canada.ca/en/business/rack-prices"
    }
}
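
A minimal sketch of building a POST request for an object shaped like the one above. The endpoint `API_URL`, the helper `build_webcontent_request`, and the JSON content type are assumptions, not details confirmed by this document; the request is built but not sent.

```python
import json
from urllib.request import Request

# Hypothetical endpoint: the real service address is not stated in this document.
API_URL = "https://example.invalid/extract"


def build_webcontent_request(page_url, archived_at=None, text=None, api_url=API_URL):
    """Build (but do not send) a POST request carrying a WebContent object."""
    if archived_at is None and text is None:
        raise ValueError("provide 'archived_at' or 'text' as the content source")
    payload = {
        "@type": "webContent",
        "about": {"@type": "webPage", "url": page_url},
    }
    if archived_at is not None:
        # Mirrors the example above, where 'url' and 'archivedAt' hold the same value.
        payload["url"] = [archived_at]
        payload["archivedAt"] = [archived_at]
    if text is not None:
        payload["text"] = text
    return Request(
        api_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumed content type
        method="POST",
    )
```

Sending it is then a matter of `urllib.request.urlopen(request)` once the real endpoint is known.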

Using the library

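Given the URL of a page and its HTML content, returns a list of records with the extractions.

from kraken_extract_from_html import kraken_extract_from_html as k

# Example inputs: the page URL and the HTML already retrieved for it
url = "https://www.petro-canada.ca/en/business/rack-prices"
html = "<html><head><title>Example</title></head><body></body></html>"

records = k.get(url, html)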
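
The shape of the returned records is not documented here. As a sketch, assuming each record is a dict that carries a schema.org-style '@type' key (an assumption, not something this document confirms), the list could be grouped for inspection with a hypothetical helper like this:

```python
from collections import defaultdict


def group_by_type(records):
    """Group extraction records by their '@type' value (assumed key)."""
    grouped = defaultdict(list)
    for record in records:
        # Assumption: each record is a dict; anything else goes under 'unknown'.
        if isinstance(record, dict):
            record_type = record.get("@type", "unknown")
        else:
            record_type = "unknown"
        grouped[record_type].append(record)
    return dict(grouped)
```

Calling `group_by_type(records)` on the list returned above would then give one bucket per record type, with records lacking the key collected under 'unknown'.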
