# Krawl: A collection of crawlers
## API

Given a `url` or a list of `urls`:

- `get_main_text([url], size=200, min_paragraph_len=10) -> CrawlResponse`
- `get_main_text(urls, size=200, min_paragraph_len=10) -> CrawlResponse`
  - Links contain only their text.
  - Images contain only their title, if one exists.
- `get_main_text_as_markdown([url], size=200, min_paragraph_len=10)`
- `get_main_text_as_markdown(urls, size=200, min_paragraph_len=10)`
  - Each link is rendered as `[text](href)` in markdown format.
  - Each image is rendered as `![text](href)` in markdown format.
- `CrawlResponse`
  - `first`: the result for the first url
  - `items`: iterate over the result for each url
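The response object can be pictured as a small container over per-url results. A minimal sketch, assuming `items` holds one extracted text per url and `first` is the result for the first url (the field types here are assumptions, not krawl's actual class definition):

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative sketch only: krawl's real CrawlResponse is not shown in this
# README, so everything beyond the `first`/`items` names is an assumption.
@dataclass
class CrawlResponse:
    items: List[str] = field(default_factory=list)  # one extracted text per url

    @property
    def first(self) -> str:
        # Result for the first url in the request
        return self.items[0]

resp = CrawlResponse(items=["main text of page one", "main text of page two"])
print(resp.first)
for text in resp.items:  # iterate over each url's result
    print(len(text))
```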
## Terminology

### Layout of a page

Landing page main content:

- `[title]`
- `[heroline]`
- `[header primary]`
- `[icon url]`

Landing page navigation:

```
<SUBITEM>:
  item_label
  item_url
  item_description

{
  item_label
  item_url
  subitems: [
    <SUBITEM>
  ]
}
```
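The navigation layout above translates naturally into nested records. A hypothetical Python rendering (these dataclasses are illustrative; krawl's actual types are not shown in the README):

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical types mirroring the navigation layout sketch above.
@dataclass
class SubItem:
    item_label: str
    item_url: str
    item_description: Optional[str] = None

@dataclass
class NavItem:
    item_label: str
    item_url: str
    subitems: List[SubItem] = field(default_factory=list)

nav = NavItem(
    item_label="Products",
    item_url="/products",
    subitems=[SubItem("Crawler", "/products/crawler", "Extracts main text")],
)
```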
HTML node with coordinate info:

```
{
  tag
  class
  text
  bbpos: { x y w h }
  nodenr  // number of nodes in the html page
}
```
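A sketch of that node record in Python; the field names follow the README, the types are assumptions:

```python
from dataclasses import dataclass

@dataclass
class BBox:
    x: float
    y: float
    w: float
    h: float

# Sketch of the "HTML node with coordinate info" record described above.
@dataclass
class HTMLNode:
    tag: str
    cls: str      # "class" is a Python keyword, so renamed here
    text: str
    bbpos: BBox
    nodenr: int   # number of nodes in the html page

node = HTMLNode(tag="h1", cls="hero", text="Welcome",
                bbpos=BBox(0, 0, 640, 48), nodenr=312)
```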
## bs4 tree navigation functions

### Glossary

- range: the nodes reachable starting from a given node

### Functions

- `find_closest_hypersibling(backtrack_depth: int, sibling_type: str)`: go upwards in the tree to an ancestor and search the ancestor's whole range for a sibling of the given type; stop at the first encounter.
- `ignore_node(n_text: int = 1)`
- `find_header_section`
- `find_immediate_sibling(sibling_type: str)`
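The "hypersibling" search pattern can be sketched with BeautifulSoup directly. This is an illustrative reimplementation, not krawl's actual code: climb `backtrack_depth` ancestors, then scan that ancestor's range for the first tag of the requested type.

```python
from typing import Optional

from bs4 import BeautifulSoup
from bs4.element import Tag

# Illustrative sketch of find_closest_hypersibling, not krawl's implementation.
def find_closest_hypersibling(node: Tag, backtrack_depth: int,
                              sibling_type: str) -> Optional[Tag]:
    # Go upwards in the tree to an ancestor.
    ancestor = node
    for _ in range(backtrack_depth):
        if ancestor.parent is None:
            break
        ancestor = ancestor.parent
    # Search the ancestor's range; stop at the first encounter.
    for candidate in ancestor.find_all(sibling_type):
        if candidate is not node:
            return candidate
    return None

html = "<div><section><p id='a'>hi</p></section><img src='logo.png'/></div>"
soup = BeautifulSoup(html, "html.parser")
p = soup.find("p")
hit = find_closest_hypersibling(p, backtrack_depth=2, sibling_type="img")
```

With `backtrack_depth=2` the search climbs from `<p>` to `<div>`, whose range contains the `<img>`; with `backtrack_depth=0` it searches only inside `<p>` and finds nothing.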
## Workflow

`find_navigation`: its outcome is useful to validate the logo/heroline.
```mermaid
flowchart TD
  start{{Start}}
  soup((soup))
  url((url))
  save>SAVE RECORD]
  node_features[make_features]
  features[HTML Context]
  prediction[outcome]
  subgraph Main
    start --> url
    url --> soup --> make_features --> find_navigation --> find_logo --> find_tagline --> save_html
    find_navigation -.-> find_items --> save
    find_logo -.-> save
    find_tagline -.-> save
    save --> |requires| save_key --> |can use| hash_url
  end
  subgraph Prediction
    node_features --> predictor
    predictor --> predictor_nav --> bool
    predictor --> predictor_tagline --> bool
    predictor --> predictor_logo --> bool
  end
  subgraph Soup utils
    utils --> get_context --> features
    utils --> first_h1
    utils --> first_nav
    utils --> subtree_after
    utils --> backtrack_until_text_sibling
    utils --> backtrack_until_img_sibling
    utils --> is_img
  end
  make_features --> get_context
  find_navigation --> |uses| features
  find_navigation --> |uses| predictor_nav
  find_logo --> |uses| features
  find_logo --> |uses| predictor_logo
  find_tagline --> |uses| features
  find_tagline --> |uses| predictor_tagline
  features --> node_features
```
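The flowchart's `save --> save_key --> hash_url` step suggests deriving a stable record key from the url. A hypothetical helper (the name comes from the flowchart; the hashing scheme here is an assumption):

```python
import hashlib

# Hypothetical helper for the `save_key --> hash_url` step in the flowchart:
# derive a stable, deterministic record key from a url.
def hash_url(url: str) -> str:
    return hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]

key = hash_url("https://example.com")
```

Because the digest is deterministic, re-crawling the same url maps to the same record key.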
## File details

Details for the file `krawl-0.0.6.tar.gz` (source distribution).

### File metadata

- Download URL: krawl-0.0.6.tar.gz
- Upload date:
- Size: 25.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.3 CPython/3.12.4 Darwin/21.4.0

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `c1222c3f9c4b451ad4403be215ea5c4156d5c6583bb49cb258c5385771b20050` |
| MD5 | `153f89970a30c9eda0c42510e0e2579e` |
| BLAKE2b-256 | `ac151ee6c0c4df60f358c2010a5ce8a829eac0c0387b875330a1fd56ec10e9ea` |
## File details

Details for the file `krawl-0.0.6-py3-none-any.whl` (built distribution).

### File metadata

- Download URL: krawl-0.0.6-py3-none-any.whl
- Upload date:
- Size: 36.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.3 CPython/3.12.4 Darwin/21.4.0

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `a6fe5eca812a3f0919ce2c3d24185e7b77df2445ef3dfee68bb639fb5ad09ed8` |
| MD5 | `b22a0ee9f1ad3a8d884e61c8ed10005a` |
| BLAKE2b-256 | `fc94bda7f330481367b9c237b0d584a6a3768727e7bc838be02b15deaeab32be` |