
Krawl: A collection of crawlers

Project description

README

API

  • Given a URL or a list of URLs:

get_main_text([url], size=200, min_paragraph_len=10) -> CrawlResponse
get_main_text(urls, size=200, min_paragraph_len=10) -> CrawlResponse

- Links contain only their text
- Images contain only their title, if one exists

get_main_text_as_markdown([url], size=200, min_paragraph_len=10)
get_main_text_as_markdown(urls, size=200, min_paragraph_len=10)

- Each link is rendered as (text, href) in Markdown format
- Each image is rendered as (text, href) in Markdown format
  • CrawlResponse
first: the first result
items: iterate over the result for each URL (see the usage sketch below)
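
A minimal usage sketch, assuming the functions are importable from the krawl package and that first/items behave as described above (the import path and field access are assumptions, not a verified API):

    # Hypothetical usage; import path and response fields are assumptions.
    from krawl import get_main_text, get_main_text_as_markdown

    # Single URL: wrap it in a list, then take the first result.
    response = get_main_text(["https://example.com"], size=200, min_paragraph_len=10)
    print(response.first)

    # Multiple URLs: iterate over the per-URL results.
    urls = ["https://example.com", "https://example.org"]
    response = get_main_text_as_markdown(urls, size=200, min_paragraph_len=10)
    for item in response.items:
        print(item)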

Terminology

LAYOUT OF A PAGE

Landing page main content

[title]
[heroline]
[header primary]
[icon url]

Landing page navigation

<SUBITEM> {
  item_label
  item_url
  item_description
}

{
  item_label
  item_url
  subitems: [
    <SUBITEM>
  ]
}
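
As an illustrative Python sketch of those records (the dataclasses mirror the field names above but are not part of the package; the types are assumptions):

    from dataclasses import dataclass, field

    # Illustrative only: field names come from the layout above.
    @dataclass
    class SubItem:
        item_label: str
        item_url: str
        item_description: str

    @dataclass
    class NavItem:
        item_label: str
        item_url: str
        subitems: list[SubItem] = field(default_factory=list)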

HTML Node with Coordinate Info

{
  tag
  classtext
  text
  bbpos: { x y w h }
  nodenr  // number of nodes in the html page
}
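
In the same illustrative vein (the README only names the fields; the types here are assumptions):

    from dataclasses import dataclass

    @dataclass
    class BBox:
        x: float
        y: float
        w: float
        h: float

    @dataclass
    class HtmlNode:
        tag: str
        classtext: str
        text: str
        bbpos: BBox
        nodenr: int  # number of nodes in the html page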

bs4 tree navigation functions

Glossary

  • range: the set of nodes reachable from a given node

Functions

  • find_closest_hypersibling(backtrack_depth:int, sibling_type:str)

    Walk upward in the tree to an ancestor (at most backtrack_depth levels) and search that ancestor's entire range for a node of the given sibling type, stopping at the first match. A sketch follows this list.

  • ignore_node(n_text:int=1)

  • find_header_section

  • find_immediate_sibling(sibling_type:str)
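
A minimal sketch of find_closest_hypersibling under the glossary's definition of range (the real implementation may differ; only the bs4 calls are standard):

    from bs4 import Tag

    def find_closest_hypersibling(node: Tag, backtrack_depth: int, sibling_type: str):
        """Climb at most backtrack_depth ancestors; return the first node of
        sibling_type found in an ancestor's range, skipping the start node."""
        current = node
        for _ in range(backtrack_depth):
            parent = current.parent
            if parent is None:
                return None
            # The ancestor's range is everything reachable from it.
            for candidate in parent.find_all(sibling_type):
                if candidate is not node:
                    return candidate
            current = parent
        return None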

Workflow

find_navigation: its outcome is useful for validating the logo/heroline (see the flowchart and the pipeline sketch below)

flowchart TD
    start{{Start}}
    soup((soup))
    url((url))
    save>SAVE RECORD]
    node_features[make_features] 
    features[HTML Context] 
    prediction[outcome]

    subgraph Main
    start --> url
    url --> soup --> make_features --> find_navigation --> find_logo --> find_tagline --> save_html
    find_navigation -.->  find_items --> save
    find_logo -.->  save
    find_tagline -.->  save
    save --> |requires| save_key --> |can use| hash_url
    end


    subgraph Prediction
    node_features --> predictor
    predictor --> predictor_nav --> bool
    predictor --> predictor_tagline --> bool
    predictor --> predictor_logo --> bool
    end

    subgraph Soup utils
    utils --> get_context --> features
    utils --> first_h1
    utils --> first_nav
    utils --> subtree_after
    utils --> backtrack_until_text_sibling
    utils --> backtrack_until_img_sibling
    utils --> is_img
    end

    make_features --> get_context
    find_navigation --> |uses| features
    find_navigation --> |uses| predictor_nav
    find_logo --> |uses| features
    find_logo --> |uses| predictor_logo
    find_tagline --> |uses| features
    find_tagline --> |uses| predictor_tagline
    features --> node_features
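
The Main subgraph amounts to a pipeline roughly like the sketch below. Every name is taken from the flowchart, but the signatures, return values, and glue code are assumptions; only the bs4 and hashlib calls are standard:

    import hashlib
    from bs4 import BeautifulSoup

    # Placeholder stubs standing in for the package's real helpers;
    # their signatures are guesses based on the flowchart.
    def make_features(soup): ...              # soup -> HTML Context
    def find_navigation(soup, features): ...  # uses features + predictor_nav
    def find_logo(soup, features): ...        # uses features + predictor_logo
    def find_tagline(soup, features): ...     # uses features + predictor_tagline

    def hash_url(url: str) -> str:
        # SAVE RECORD requires a save_key, which "can use" hash_url.
        return hashlib.sha256(url.encode()).hexdigest()

    def crawl_landing_page(url: str, html: str, store: dict) -> None:
        soup = BeautifulSoup(html, "html.parser")
        features = make_features(soup)
        nav = find_navigation(soup, features)
        logo = find_logo(soup, features)
        tagline = find_tagline(soup, features)
        store[hash_url(url)] = {"nav": nav, "logo": logo, "tagline": tagline}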

Download files


Source Distribution

krawl-0.0.6.tar.gz (25.0 kB)

Uploaded Source

Built Distribution

krawl-0.0.6-py3-none-any.whl (36.3 kB)

Uploaded Python 3

File details

Details for the file krawl-0.0.6.tar.gz.

File metadata

  • Download URL: krawl-0.0.6.tar.gz
  • Upload date:
  • Size: 25.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.3 CPython/3.12.4 Darwin/21.4.0

File hashes

Hashes for krawl-0.0.6.tar.gz

  • SHA256: c1222c3f9c4b451ad4403be215ea5c4156d5c6583bb49cb258c5385771b20050
  • MD5: 153f89970a30c9eda0c42510e0e2579e
  • BLAKE2b-256: ac151ee6c0c4df60f358c2010a5ce8a829eac0c0387b875330a1fd56ec10e9ea


File details

Details for the file krawl-0.0.6-py3-none-any.whl.

File metadata

  • Download URL: krawl-0.0.6-py3-none-any.whl
  • Upload date:
  • Size: 36.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.3 CPython/3.12.4 Darwin/21.4.0

File hashes

Hashes for krawl-0.0.6-py3-none-any.whl

  • SHA256: a6fe5eca812a3f0919ce2c3d24185e7b77df2445ef3dfee68bb639fb5ad09ed8
  • MD5: b22a0ee9f1ad3a8d884e61c8ed10005a
  • BLAKE2b-256: fc94bda7f330481367b9c237b0d584a6a3768727e7bc838be02b15deaeab32be

