# Krawl: A collection of crawlers
## API

Given a `url` or a list of `urls`:

- `get_main_text([url], size=200, min_paragraph_len=10) -> CrawlResponse`
- `get_main_text(urls, size=200, min_paragraph_len=10) -> CrawlResponse`
  - Links contain only their text.
  - Images contain only their title, if one exists.
- `get_main_text_as_markdown([url], size=200, min_paragraph_len=10)`
- `get_main_text_as_markdown(urls, size=200, min_paragraph_len=10)`
  - Each link is rendered as `[text](href)` in markdown format.
  - Each image is rendered as `![text](href)` in markdown format.
- `CrawlResponse`
  - `first`: the result for the first url
  - `items`: iterate over the result for each url
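The response object can be pictured as a small container over per-url results. A minimal sketch, assuming `items` holds one extracted text per url and `first` is the result for the first url (the field types here are assumptions, not krawl's actual class definition):

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative sketch only: krawl's real CrawlResponse is not shown in this
# README, so everything beyond the `first`/`items` names is an assumption.
@dataclass
class CrawlResponse:
    items: List[str] = field(default_factory=list)  # one extracted text per url

    @property
    def first(self) -> str:
        # Result for the first url in the request
        return self.items[0]

resp = CrawlResponse(items=["main text of page one", "main text of page two"])
print(resp.first)
for text in resp.items:  # iterate over each url's result
    print(len(text))
```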
## Terminology

### Layout of a page

Landing page main content:

- `[title]`
- `[heroline]`
- `[header primary]`
- `[icon url]`

Landing page navigation:

```
<SUBITEM>:
  item_label
  item_url
  item_description

{
  item_label
  item_url
  subitems: [
    <SUBITEM>
  ]
}
```
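The navigation layout above translates naturally into nested records. A hypothetical Python rendering (these dataclasses are illustrative; krawl's actual types are not shown in the README):

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical types mirroring the navigation layout sketch above.
@dataclass
class SubItem:
    item_label: str
    item_url: str
    item_description: Optional[str] = None

@dataclass
class NavItem:
    item_label: str
    item_url: str
    subitems: List[SubItem] = field(default_factory=list)

nav = NavItem(
    item_label="Products",
    item_url="/products",
    subitems=[SubItem("Crawler", "/products/crawler", "Extracts main text")],
)
```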
HTML node with coordinate info:

```
{
  tag
  class
  text
  bbpos: { x y w h }
  nodenr  // number of nodes in the html page
}
```
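A sketch of that node record in Python; the field names follow the README, the types are assumptions:

```python
from dataclasses import dataclass

@dataclass
class BBox:
    x: float
    y: float
    w: float
    h: float

# Sketch of the "HTML node with coordinate info" record described above.
@dataclass
class HTMLNode:
    tag: str
    cls: str      # "class" is a Python keyword, so renamed here
    text: str
    bbpos: BBox
    nodenr: int   # number of nodes in the html page

node = HTMLNode(tag="h1", cls="hero", text="Welcome",
                bbpos=BBox(0, 0, 640, 48), nodenr=312)
```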
## bs4 tree navigation functions

### Glossary

- range: the nodes reachable starting from a given node

### Functions

- `find_closest_hypersibling(backtrack_depth: int, sibling_type: str)`: go upwards in the tree to an ancestor and search the ancestor's whole range for a sibling of the given type; stop at the first encounter.
- `ignore_node(n_text: int = 1)`
- `find_header_section`
- `find_immediate_sibling(sibling_type: str)`
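The "hypersibling" search pattern can be sketched with BeautifulSoup directly. This is an illustrative reimplementation, not krawl's actual code: climb `backtrack_depth` ancestors, then scan that ancestor's range for the first tag of the requested type.

```python
from typing import Optional

from bs4 import BeautifulSoup
from bs4.element import Tag

# Illustrative sketch of find_closest_hypersibling, not krawl's implementation.
def find_closest_hypersibling(node: Tag, backtrack_depth: int,
                              sibling_type: str) -> Optional[Tag]:
    # Go upwards in the tree to an ancestor.
    ancestor = node
    for _ in range(backtrack_depth):
        if ancestor.parent is None:
            break
        ancestor = ancestor.parent
    # Search the ancestor's range; stop at the first encounter.
    for candidate in ancestor.find_all(sibling_type):
        if candidate is not node:
            return candidate
    return None

html = "<div><section><p id='a'>hi</p></section><img src='logo.png'/></div>"
soup = BeautifulSoup(html, "html.parser")
p = soup.find("p")
hit = find_closest_hypersibling(p, backtrack_depth=2, sibling_type="img")
```

With `backtrack_depth=2` the search climbs from `<p>` to `<div>`, whose range contains the `<img>`; with `backtrack_depth=0` it searches only inside `<p>` and finds nothing.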
## Workflow

`find_navigation`: its outcome is useful to validate the logo/heroline.
```mermaid
flowchart TD
  start{{Start}}
  soup((soup))
  url((url))
  save>SAVE RECORD]
  node_features[make_features]
  features[HTML Context]
  prediction[outcome]
  subgraph Main
    start --> url
    url --> soup --> make_features --> find_navigation --> find_logo --> find_tagline --> save_html
    find_navigation -.-> find_items --> save
    find_logo -.-> save
    find_tagline -.-> save
    save --> |requires| save_key --> |can use| hash_url
  end
  subgraph Prediction
    node_features --> predictor
    predictor --> predictor_nav --> bool
    predictor --> predictor_tagline --> bool
    predictor --> predictor_logo --> bool
  end
  subgraph Soup utils
    utils --> get_context --> features
    utils --> first_h1
    utils --> first_nav
    utils --> subtree_after
    utils --> backtrack_until_text_sibling
    utils --> backtrack_until_img_sibling
    utils --> is_img
  end
  make_features --> get_context
  find_navigation --> |uses| features
  find_navigation --> |uses| predictor_nav
  find_logo --> |uses| features
  find_logo --> |uses| predictor_logo
  find_tagline --> |uses| features
  find_tagline --> |uses| predictor_tagline
  features --> node_features
```
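The flowchart's `save --> save_key --> hash_url` step suggests deriving a stable record key from the url. A hypothetical helper (the name comes from the flowchart; the hashing scheme here is an assumption):

```python
import hashlib

# Hypothetical helper for the `save_key --> hash_url` step in the flowchart:
# derive a stable, deterministic record key from a url.
def hash_url(url: str) -> str:
    return hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]

key = hash_url("https://example.com")
```

Because the digest is deterministic, re-crawling the same url maps to the same record key.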
## File details

Details for the file `krawl-0.0.6.tar.gz` (source distribution).

### File metadata

- Download URL: krawl-0.0.6.tar.gz
- Upload date:
- Size: 25.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.3 CPython/3.12.4 Darwin/21.4.0

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `c1222c3f9c4b451ad4403be215ea5c4156d5c6583bb49cb258c5385771b20050` |
| MD5 | `153f89970a30c9eda0c42510e0e2579e` |
| BLAKE2b-256 | `ac151ee6c0c4df60f358c2010a5ce8a829eac0c0387b875330a1fd56ec10e9ea` |
## File details

Details for the file `krawl-0.0.6-py3-none-any.whl` (built distribution).

### File metadata

- Download URL: krawl-0.0.6-py3-none-any.whl
- Upload date:
- Size: 36.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.3 CPython/3.12.4 Darwin/21.4.0

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `a6fe5eca812a3f0919ce2c3d24185e7b77df2445ef3dfee68bb639fb5ad09ed8` |
| MD5 | `b22a0ee9f1ad3a8d884e61c8ed10005a` |
| BLAKE2b-256 | `fc94bda7f330481367b9c237b0d584a6a3768727e7bc838be02b15deaeab32be` |