Kraken Extract From HTML

Project description

Extract from html

What it does

Extracts the following from html:

  • urls
  • emails
  • images
  • tables
  • structured data (schema.org)
  • text
  • title
  • feeds
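
To make the list above concrete, here is a minimal sketch of this kind of extraction using only the Python standard library. It is not the library's implementation: the names SimpleExtractor and simple_extract are hypothetical, and it covers only urls, images, title, emails and text (no tables, schema.org data or feeds).

```python
import re
from html.parser import HTMLParser

# Deliberately simple email pattern; an assumption, not the library's rule.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class SimpleExtractor(HTMLParser):
    """Collects link targets, image sources, the title and text nodes."""

    def __init__(self):
        super().__init__()
        self.urls = []
        self.images = []
        self.title = ""
        self.text_parts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.urls.append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Note: script and style contents are not filtered out in this sketch.
        if self._in_title:
            self.title += data
        elif data.strip():
            self.text_parts.append(data.strip())


def simple_extract(html):
    """Return a dict with a few of the items listed above."""
    parser = SimpleExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.text_parts)
    return {
        "title": parser.title.strip(),
        "urls": parser.urls,
        "images": parser.images,
        "emails": sorted(set(EMAIL_RE.findall(text))),
        "text": text,
    }
```

Calling simple_extract(html) on a page returns one dictionary; the real library returns a list of records whose exact shape is not documented here.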

How to use

Using the api

Send a url (get)

Send the page URL in the 'url' query parameter. The service retrieves the content and returns the extracted data. If 'contentUrl' is also provided, the content is taken from 'contentUrl', while 'url' is still used for the attributes of the returned data.
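
As an illustration, the request URL could be built as follows. This is a sketch under assumptions: the endpoint address is not given in this description, so API_ENDPOINT is a placeholder, and build_get_url is a hypothetical helper. Only the 'url' and 'contentUrl' parameter names come from the text above.

```python
from urllib.parse import urlencode

# Placeholder: the real API address is not stated in this description.
API_ENDPOINT = "https://api.example.com/extract"


def build_get_url(url, content_url=None, endpoint=API_ENDPOINT):
    """Build the GET request URL with 'url' and, optionally, 'contentUrl'."""
    params = {"url": url}
    if content_url is not None:
        # Content is read from content_url; url still describes the page.
        params["contentUrl"] = content_url
    return f"{endpoint}?{urlencode(params)}"
```

The resulting string can be passed to any HTTP client, for example urllib.request.urlopen.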

Send a WebContent object (post)

The content is extracted either from the 'text' field or from the content retrieved at the URL in 'archivedAt'.

{
    "@type": "webContent",
    "url": [
        "https://storage.googleapis.com/kraken-cdn/641fcdaa9664421b3ac4db2b6b494397bf0dc8d65a559e9c2238de77d09e740e.html"
    ],
    "archivedAt": [
        "https://storage.googleapis.com/kraken-cdn/641fcdaa9664421b3ac4db2b6b494397bf0dc8d65a559e9c2238de77d09e740e.html"
    ],
    "about": {
        "@type": "webPage",
        "url": "https://www.petro-canada.ca/en/business/rack-prices"
    }
}
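
A payload like the one above could be posted as sketched below. The endpoint address and the JSON content type are assumptions (neither is stated in this description), build_post_request is a hypothetical helper, and the example.com URLs are placeholders.

```python
import json
from urllib.request import Request

# Placeholder: the real API address is not stated in this description.
API_ENDPOINT = "https://api.example.com/extract"


def build_post_request(web_content, endpoint=API_ENDPOINT):
    """Wrap a WebContent dict in a POST request with a JSON body."""
    body = json.dumps(web_content).encode("utf-8")
    return Request(
        endpoint,
        data=body,
        # Assumption: the service accepts a JSON request body.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


web_content = {
    "@type": "webContent",
    "url": ["https://example.com/archived-copy.html"],
    "archivedAt": ["https://example.com/archived-copy.html"],
    "about": {"@type": "webPage", "url": "https://example.com/original-page"},
}

request = build_post_request(web_content)
# To send it: urllib.request.urlopen(request)
```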

Using the library

Given the URL of a page and its HTML content, returns a list of records containing the extractions.

from kraken_extract_from_html import kraken_extract_from_html as k

# url: the URL of the page; html: the HTML content of that page
records = k.get(url, html)
