Kraken Extract From HTML

Extract from HTML

What it does

Extracts the following from an HTML page:

  • urls
  • emails
  • images
  • tables
  • structured data (schema.org)
  • text
  • title
  • feeds
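
To make a few of these categories concrete, here is a minimal sketch built only on the Python standard library. It is not this package's implementation, and the class name SimpleExtractor and the sample page are made up for illustration. It only shows what collecting urls, emails, images and the title from a page can look like.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class SimpleExtractor(HTMLParser):
    """Illustrative only: collects links, mailto addresses, images and the title."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls, self.emails, self.images = [], [], []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            href = attrs["href"]
            if href.startswith("mailto:"):
                # mailto links carry an email address
                self.emails.append(href[len("mailto:"):])
            else:
                # resolve relative links against the page URL
                self.urls.append(urljoin(self.base_url, href))
        elif tag == "img" and attrs.get("src"):
            self.images.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


sample_html = (
    "<html><head><title>Rack prices</title></head><body>"
    '<a href="/en/business">Business</a>'
    '<a href="mailto:info@example.com">Contact</a>'
    '<img src="logo.png">'
    "</body></html>"
)

parser = SimpleExtractor("https://www.example.com/en/")
parser.feed(sample_html)
parser.close()
print(parser.title, parser.urls, parser.emails, parser.images)
```

Tables, schema.org structured data and feeds need more work than this sketch covers.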

How to use

Using the api

Send a URL (GET)

Send the page URL as the query parameter 'url'. The API retrieves the content at that URL and returns the extracted data. If 'contentUrl' is also provided, the content is retrieved from 'contentUrl' instead, while 'url' is still used for the attributes of the returned data.
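
As a rough sketch of how such a GET request could be put together: the parameter names 'url' and 'contentUrl' come from the description above, but the endpoint address is not given in this document, so API_BASE below is a placeholder, and build_get_url is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Placeholder: the real API address is not stated in this document.
API_BASE = "https://example.com/extract"


def build_get_url(url, content_url=None, api_base=API_BASE):
    """Build the request URL with 'url' and, optionally, 'contentUrl' as query parameters."""
    params = {"url": url}
    if content_url is not None:
        params["contentUrl"] = content_url
    # urlencode percent-encodes the values so they are safe inside a query string
    return f"{api_base}?{urlencode(params)}"


request_url = build_get_url("https://www.petro-canada.ca/en/business/rack-prices")
print(request_url)
```

The resulting URL can then be fetched with any HTTP client, for example urllib.request.urlopen(request_url).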

Send a WebContent object (POST)

The content is extracted either from the 'text' field or from the page retrieved at the URL in 'archivedAt'.

{
    "@type": "webContent",
    "url": [
        "https://storage.googleapis.com/kraken-cdn/641fcdaa9664421b3ac4db2b6b494397bf0dc8d65a559e9c2238de77d09e740e.html"
    ],
    "archivedAt": [
        "https://storage.googleapis.com/kraken-cdn/641fcdaa9664421b3ac4db2b6b494397bf0dc8d65a559e9c2238de77d09e740e.html"
    ],
    "about": {
        "@type": "webPage",
        "url": "https://www.petro-canada.ca/en/business/rack-prices"
    }
}
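
A minimal sketch of preparing that POST request with the standard library follows. It assumes the API accepts a JSON body with a Content-Type of application/json, which this document does not confirm. API_BASE is a placeholder and build_post_request is a hypothetical helper name.

```python
import json
from urllib.request import Request

# Placeholder: the real API address is not stated in this document.
API_BASE = "https://example.com/extract"

ARCHIVED = (
    "https://storage.googleapis.com/kraken-cdn/"
    "641fcdaa9664421b3ac4db2b6b494397bf0dc8d65a559e9c2238de77d09e740e.html"
)

web_content = {
    "@type": "webContent",
    "url": [ARCHIVED],
    "archivedAt": [ARCHIVED],
    "about": {
        "@type": "webPage",
        "url": "https://www.petro-canada.ca/en/business/rack-prices",
    },
}


def build_post_request(payload, api_base=API_BASE):
    """Wrap a WebContent dict in a POST request (JSON body is an assumption)."""
    body = json.dumps(payload).encode("utf-8")
    return Request(
        api_base,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_post_request(web_content)
# To send it: urllib.request.urlopen(request)
```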

Using the library

Given the URL of a page and its HTML content, returns a list of records containing the extractions.

from kraken_extract_from_html import kraken_extract_from_html as k

records = k.get(url, html)
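
The call above expects 'url' and 'html' to already be defined. One possible way to obtain them is sketched below. Only k.get(url, html) comes from this document; fetch_html and extract_records are hypothetical helpers, and the 'extractor' argument exists only so the wrapper can be tried without the package installed.

```python
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Download a page and decode it to text, falling back to UTF-8."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_records(url, html=None, extractor=None):
    """Return the extraction records for a page, fetching the HTML if it is not supplied."""
    if html is None:
        html = fetch_html(url)
    if extractor is None:
        # Imported lazily so this helper can be defined without the package installed
        from kraken_extract_from_html import kraken_extract_from_html as k
        extractor = k.get
    return extractor(url, html)


# Example (requires network access and the package):
# records = extract_records("https://www.petro-canada.ca/en/business/rack-prices")
```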

