Kraken Extract From HTML

Extract from html

What it does

Extracts the following from HTML:

  • urls
  • emails
  • images
  • tables
  • structured data (schema.org)
  • text
  • title
  • feeds

How to use

Using the API

Send a URL (GET)

Send the URL as the query parameter 'url'. The service retrieves the content and returns the extracted data. If 'contentUrl' is also provided, the content is taken from 'contentUrl', while 'url' is still used for the URL attributes of the extracted data.
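
As a rough illustration of the GET form, the sketch below only builds the request URL; it does not send anything. The endpoint in API_BASE and the helper name build_extract_url are assumptions made for this example, since the hosted address of the API is not given here. Only the 'url' and 'contentUrl' parameter names come from the description above.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: the address of the hosted API is not stated in this document.
API_BASE = "https://example.invalid/extract"


def build_extract_url(url, content_url=None, api_base=API_BASE):
    """Return a GET request URL carrying 'url' and, optionally, 'contentUrl'."""
    params = {"url": url}
    if content_url is not None:
        # Content is read from 'contentUrl'; 'url' still identifies the page.
        params["contentUrl"] = content_url
    # urlencode percent-encodes the values so they are safe inside a query string.
    return f"{api_base}?{urlencode(params)}"


request_url = build_extract_url("https://www.petro-canada.ca/en/business/rack-prices")
print(request_url)
```

The resulting string can be passed to any HTTP client once the real endpoint is known.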

Send a WebContent object (POST)

The content is taken either from the 'text' field or from the URL given in 'archivedAt', which the service retrieves.

{
    "@type": "webContent",
    "url": [
        "https://storage.googleapis.com/kraken-cdn/641fcdaa9664421b3ac4db2b6b494397bf0dc8d65a559e9c2238de77d09e740e.html"
    ],
    "archivedAt": [
        "https://storage.googleapis.com/kraken-cdn/641fcdaa9664421b3ac4db2b6b494397bf0dc8d65a559e9c2238de77d09e740e.html"
    ],
    "about": {
        "@type": "webPage",
        "url": "https://www.petro-canada.ca/en/business/rack-prices"
    }
}
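
A minimal sketch of how such an object might be assembled and wrapped in a POST request, using only the standard library. The endpoint API_URL and the helper names are hypothetical. Treating 'text' as a plain string of HTML and sending the body as JSON are also assumptions, since the example above shows only the 'archivedAt' variant. The request is built but not sent.

```python
import json
from urllib.request import Request

# Hypothetical endpoint: the address of the hosted API is not stated in this document.
API_URL = "https://example.invalid/extract"


def make_web_content(page_url, archived_at=None, text=None):
    """Build a WebContent object shaped like the example above."""
    if archived_at is None and text is None:
        raise ValueError("provide 'archived_at' or 'text'")
    content = {
        "@type": "webContent",
        "about": {"@type": "webPage", "url": page_url},
    }
    if archived_at is not None:
        # The example above repeats the archived location in both fields, as lists.
        content["url"] = [archived_at]
        content["archivedAt"] = [archived_at]
    if text is not None:
        # Assumption: 'text' carries the raw HTML as a plain string.
        content["text"] = text
    return content


def build_post_request(content, api_url=API_URL):
    """Wrap the object in a JSON POST request without sending it."""
    body = json.dumps(content).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return Request(api_url, data=body, headers=headers, method="POST")


web_content = make_web_content(
    "https://www.petro-canada.ca/en/business/rack-prices",
    text="<html><body><p>Rack prices</p></body></html>",
)
request = build_post_request(web_content)
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would send it; that step is left out because the endpoint is a placeholder.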

Using the library

Given the URL of a page and its HTML content, returns a list of records with the extractions.

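from kraken_extract_from_html import kraken_extract_from_html as k

# Example inputs: the page URL and its HTML content as a string
url = "https://www.petro-canada.ca/en/business/rack-prices"
html = "<html><head><title>Rack prices</title></head><body></body></html>"
records = k.get(url, html)  # list of records with the extractions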
