A Python package that simplifies web scraping. Built using regex and curl.
Project description
A simple and lightweight library for scraping the web.
Built on curl and regex in Python, SpiderNet offers functionality similar to the BeautifulSoup and requests combination. For the package to work, curl must be installed on your system.
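Because the package depends on the curl command-line tool, it can help to confirm that curl is reachable before installing. The following is a minimal sketch using only the standard library. The helper names `curl_available` and `fetch_markup` are hypothetical and are not part of SpiderNet; the curl call only illustrates the general approach of fetching a page through the CLI, not how SpiderNet does it internally.

```python
import shutil
import subprocess


def curl_available() -> bool:
    """Return True if a curl executable can be found on PATH."""
    return shutil.which("curl") is not None


def fetch_markup(url: str, timeout: int = 30) -> str:
    """Fetch a page with the curl CLI and return its body as text.

    Hypothetical helper for illustration; not SpiderNet's implementation.
    """
    if not curl_available():
        raise RuntimeError("curl was not found on PATH; install it first")
    result = subprocess.run(
        # -s silences the progress meter, -L follows redirects,
        # --max-time caps the whole transfer in seconds.
        ["curl", "-sL", "--max-time", str(timeout), url],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

Calling `curl_available()` returns `False` when curl is missing, which is the situation in which the package is expected not to work.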
Install the latest version from PyPI or the releases page:
pip install SpiderNet
Features
- Scrape tags from websites
- Scrape the text within the tags
- Obtain href attributes from anchor (a) tags
- Obtain src attributes from image (img) tags
- The package includes new data types, made for an easier workflow, which integrate with the parameters and values of the package.
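To make the features above concrete, here is a rough sketch of how tags, their text, and their attributes can be pulled out of markup with regular expressions. This is not SpiderNet's source code; the function names `find_tags`, `inner_text`, and `attribute_values` are assumptions made for illustration, and the patterns are deliberately simple (for example, they do not handle a tag nested inside another tag of the same name).

```python
import re
from html import unescape


def find_tags(markup: str, tag: str) -> list[str]:
    """Return every <tag>...</tag> block, including the tags themselves."""
    name = re.escape(tag)
    pattern = re.compile(
        rf"<{name}\b[^>]*>.*?</{name}\s*>",
        re.IGNORECASE | re.DOTALL,
    )
    return pattern.findall(markup)


def inner_text(fragment: str) -> str:
    """Drop all tags from a fragment, decode entities, collapse whitespace."""
    without_tags = re.sub(r"<[^>]+>", " ", fragment)
    return " ".join(unescape(without_tags).split())


def attribute_values(markup: str, tag: str, attr: str) -> list[str]:
    """Return the quoted values of `attr` on every `tag` element."""
    pattern = re.compile(
        rf"<{re.escape(tag)}\b[^>]*?\b{re.escape(attr)}\s*=\s*([\"'])(.*?)\1",
        re.IGNORECASE | re.DOTALL,
    )
    return [match.group(2) for match in pattern.finditer(markup)]


SAMPLE = (
    '<ul class="list"><li><a href="/one">First &amp; best</a></li>'
    "<li><a href='/two'>Second</a></li></ul>"
    '<img src="logo.png" alt="x">'
)

anchors = find_tags(SAMPLE, "a")
print([inner_text(a) for a in anchors])
print(attribute_values(SAMPLE, "a", "href"))
print(attribute_values(SAMPLE, "img", "src"))
```

Regex-based extraction like this is compact and has no parser dependency, but it is fragile on irregular markup, which is the usual trade-off against a full HTML parser.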
The main class is GenSpider.
from SpiderNet import GenSpider
web=GenSpider(<website>)
The methods are:
website_text
This method returns the markup text of the website.
find_all_html_tags
This method finds all HTML tags of the type passed as a parameter. If the tags are nested, then while looping over the outer tags you can pass the 'text' keyword to the function to restrict the search to the current looped text.
extract_text_from_html
This method extracts the text from a looped instance of a tag.
find_all_tags_by_classname
This method finds all HTML tags of the type passed as a parameter that also have the given class name, which is passed as a second parameter. If the tags are nested, then while looping over the outer tags you can pass the 'text' keyword to the function to restrict the search to the current looped text.
get_href_from_a_tags
Returns a list of all href attributes of anchor tags. Takes an optional text parameter if you want to target a particular piece of text. By default, href values are extracted from the entire page.
get_src_from_img_tags
Returns a list of all src attributes of img tags. Takes an optional text parameter if you want to target a particular piece of text. By default, src values are extracted from the entire page.
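The class filter and the optional text parameter described above can be illustrated with a small standalone sketch. It mimics the described behaviour (search the whole page by default, or only a given fragment when one is supplied) with plain regular expressions. The functions `find_tags_by_class` and `hrefs` are hypothetical stand-ins, not SpiderNet's methods, and the sample markup is invented.

```python
import re
from typing import Optional

PAGE = """
<a href="/home">Home</a>
<ul class="list-story featured">
  <li><a href="/issue-2">Issue 2</a></li>
  <li><a href="/issue-1">Issue 1</a></li>
</ul>
<ul class="sidebar"><li><a href="/about">About</a></li></ul>
"""


def find_tags_by_class(markup: str, tag: str, classname: str) -> list[str]:
    """Return <tag> blocks whose class attribute contains `classname`."""
    name = re.escape(tag)
    pattern = re.compile(
        rf"<{name}\b[^>]*\bclass\s*=\s*([\"'])([^\"']*)\1[^>]*>.*?</{name}\s*>",
        re.IGNORECASE | re.DOTALL,
    )
    return [
        match.group(0)
        for match in pattern.finditer(markup)
        if classname in match.group(2).split()  # class lists are space separated
    ]


def hrefs(page: str, text: Optional[str] = None) -> list[str]:
    """Return href values from `text` if given, otherwise from the whole page."""
    scope = page if text is None else text
    return re.findall(
        r"<a\b[^>]*?\bhref\s*=\s*[\"']([^\"']*)[\"']", scope, re.IGNORECASE
    )


story_lists = find_tags_by_class(PAGE, "ul", "list-story")
print(hrefs(PAGE))                       # every link on the page
print(hrefs(PAGE, text=story_lists[0]))  # only links inside the matched <ul>
```

Scoping the second call to one matched block is what keeps navigation and sidebar links out of the result.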
Example code that extracts comic book chapters and their respective href attributes from readallcomics, using the new data types:
from SpiderNet import HashMap, ForEach, GenSpider, Str

# Wrap the category URL in the package's Str type and create the spider
string = Str("https://readallcomics.com/category/chakra-the-invincible/")
web = GenSpider(string)

# Every <ul> tag with the class "list-story"
x = web.find_all_tags_by_classname('ul', 'list-story')
arr = HashMap()
for d in x:
    # Anchor tags and their href values, limited to the current <ul>
    w = web.find_all_html_tags('a', text=d)
    link_content = web.get_href_from_a_tags(text=d)
    for y in range(len(w)):
        text_content = web.extract_text_from_html(w[y])
        arr.add(text_content, link_content[y])

# Print every entry collected in the HashMap
ForEach(arr).unit()
The output of the code will be as follows:
Chakra The Invincible 010 (2016) => https://readallcomics.com/chakra-the-invincible-010-2016/
Chakra The Invincible 009 (2016) => https://readallcomics.com/chakra-the-invincible-009-2016/
Chakra The Invincible 008 (2016) => https://readallcomics.com/chakra-the-invincible-008-2016/
Chakra The Invincible 007 (2016) => https://readallcomics.com/chakra-the-invincible-007-2016/
Chakra The Invincible 006 (2015) => https://readallcomics.com/chakra-the-invincible-006-2015/
Chakra The Invincible 005 (2015) => https://readallcomics.com/chakra-the-invincible-005-2015/
Chakra The Invincible 004 (2015) => https://readallcomics.com/chakra-the-invincible-004-2015/
Chakra The Invincible 003 (2015) => https://readallcomics.com/chakra-the-invincible-003-2015/
Chakra The Invincible 002 (2015) => https://readallcomics.com/chakra-the-invincible-002-2015/
Chakra The Invincible 001 (2015) => https://readallcomics.com/chakra-the-invincible-001-2015/
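For readers unfamiliar with the package's data types, the pairing step in the example can be approximated with built-in Python types. This sketch assumes that HashMap behaves like an insertion-ordered mapping and that ForEach(...).unit() prints each entry as `key => value`; both assumptions are inferred only from the output shown above. The helper names `pair_titles_with_links` and `format_pairs` are hypothetical.

```python
def pair_titles_with_links(titles: list[str], links: list[str]) -> dict[str, str]:
    """Map each title to the link at the same position (dicts keep insertion order)."""
    return dict(zip(titles, links))


def format_pairs(mapping: dict[str, str]) -> list[str]:
    """Render each entry in the 'key => value' style of the output above."""
    return [f"{title} => {link}" for title, link in mapping.items()]


titles = [
    "Chakra The Invincible 010 (2016)",
    "Chakra The Invincible 009 (2016)",
]
links = [
    "https://readallcomics.com/chakra-the-invincible-010-2016/",
    "https://readallcomics.com/chakra-the-invincible-009-2016/",
]

for line in format_pairs(pair_titles_with_links(titles, links)):
    print(line)
```

Note that `zip` stops at the shorter list, so a title without a matching link is silently dropped rather than raising an error.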
For more examples look at :