
A Python package that simplifies web scraping, built on regex and curl

Project description


A simple and lightweight library for scraping the web


Built on curl and regex in Python, SpiderNet offers functionality similar to the BeautifulSoup-and-requests stack. For the package to work, you need curl installed on your system.
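To illustrate the regex-based approach (a minimal sketch of the general technique, not SpiderNet's actual implementation), here is how a tag's contents can be pulled out of markup with Python's re module:

```python
import re

html = '<div class="story"><a href="/ch-1">Chapter 1</a></div>'

# A naive pattern for anchor tags: fine for simple, regular pages,
# though regexes cannot parse arbitrary HTML in general.
anchors = re.findall(r'<a\b[^>]*>(.*?)</a>', html)
print(anchors)  # ['Chapter 1']
```

The non-greedy `(.*?)` stops at the first closing tag, which is what makes this work on a flat list of anchors.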

Install the latest version from PyPI or the releases page:

pip install SpiderNet

The main class is GenSpider.

from SpiderNet import GenSpider
web = GenSpider("<website>")

The methods are

    • website_text: returns the markup text of the website.
    • find_all_html_tags: finds all HTML tags whose name is passed as the parameter. If the tags are nested, pass the looped element through the 'text' keyword argument to restrict the search to that fragment.
    • extract_text_from_html: extracts the text from a looped instance of a tag.
    • find_all_tags_by_classname: finds all HTML tags with the given name and class, both passed as parameters. The 'text' keyword works the same way as in find_all_html_tags.
    • get_href_from_a_tags: returns a list of all href attributes of anchor tags. The optional 'text' parameter targets a particular piece of markup; by default the href attributes are extracted from the entire page.
    • get_src_from_img_tags: returns a list of all src attributes of img tags. The optional 'text' parameter targets a particular piece of markup; by default the src attributes are extracted from the entire page.

Example code extracting comic book chapters and their respective href attributes from readallcomics, using the new DataTypes

from SpiderNet import HashMap, ForEach, GenSpider, Str

string = Str("https://readallcomics.com/category/chakra-the-invincible/")
web = GenSpider(string)

# Every <ul> with the class "list-story" holds one chapter list
chapter_lists = web.find_all_tags_by_classname('ul', 'list-story')

arr = HashMap()
for d in chapter_lists:
    # Restrict both searches to the current <ul> via the 'text' keyword
    anchors = web.find_all_html_tags('a', text=d)
    links = web.get_href_from_a_tags(text=d)
    for anchor, link in zip(anchors, links):
        arr.add(web.extract_text_from_html(anchor), link)

ForEach(arr).unit()

The output of the code will be as follows

Chakra The Invincible 010 (2016) => https://readallcomics.com/chakra-the-invincible-010-2016/
Chakra The Invincible 009 (2016) => https://readallcomics.com/chakra-the-invincible-009-2016/
Chakra The Invincible 008 (2016) => https://readallcomics.com/chakra-the-invincible-008-2016/
Chakra The Invincible 007 (2016) => https://readallcomics.com/chakra-the-invincible-007-2016/
Chakra The Invincible 006 (2015) => https://readallcomics.com/chakra-the-invincible-006-2015/
Chakra The Invincible 005 (2015) => https://readallcomics.com/chakra-the-invincible-005-2015/
Chakra The Invincible 004 (2015) => https://readallcomics.com/chakra-the-invincible-004-2015/
Chakra The Invincible 003 (2015) => https://readallcomics.com/chakra-the-invincible-003-2015/
Chakra The Invincible 002 (2015) => https://readallcomics.com/chakra-the-invincible-002-2015/
Chakra The Invincible 001 (2015) => https://readallcomics.com/chakra-the-invincible-001-2015/
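The HashMap built above is a title-to-link mapping, and ForEach(...).unit() prints it pairwise. Assuming the scraping loop has already produced those pairs, the same output format reduces to a plain Python dict (a stdlib-only sketch, not SpiderNet's types):

```python
# Two of the (title, href) pairs the scraping loop produces
chapters = {
    "Chakra The Invincible 010 (2016)":
        "https://readallcomics.com/chakra-the-invincible-010-2016/",
    "Chakra The Invincible 009 (2016)":
        "https://readallcomics.com/chakra-the-invincible-009-2016/",
}

# Reproduce the "key => value" lines that ForEach(arr).unit() prints
for title, link in chapters.items():
    print(f"{title} => {link}")
```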

For more examples, look at:

Project details

This version: 1.3

Source distribution: spidernet-1.3.tar.gz (5.1 kB)

Built distribution: SpiderNet-1.3-py3-none-any.whl (5.8 kB)
