
A Python package that simplifies web scraping, built on regex and curl.

Project description


A simple and lightweight library for scraping the web


Built on curl and regex in Python, SpiderNet offers functionality similar to the BeautifulSoup and requests combination. The package requires curl to be installed on your system.
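As a rough illustration of the curl-plus-regex approach, the sketch below fetches a page by running curl as a subprocess and then searches the markup with the re module. It uses only the standard library and is not SpiderNet's actual source; the function names curl_available, fetch_with_curl and extract_title are hypothetical.

```python
import re
import shutil
import subprocess


def curl_available() -> bool:
    """Return True if a curl executable can be found on the PATH."""
    return shutil.which("curl") is not None


def fetch_with_curl(url: str, timeout: int = 30) -> str:
    """Fetch a URL by running curl and return the response body as text."""
    if not curl_available():
        raise RuntimeError("curl is not installed or not on the PATH")
    result = subprocess.run(
        # -s: silent, -L: follow redirects, --max-time: overall timeout in seconds
        ["curl", "-sL", "--max-time", str(timeout), url],
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=True,
    )
    return result.stdout


def extract_title(markup: str):
    """Return the contents of the first <title> element, or None if absent."""
    match = re.search(r"<title[^>]*>(.*?)</title>", markup, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None
```

Calling curl_available() before any scraping gives a clearer error than a failed subprocess call when curl is missing.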

Install the latest version from PyPI or the releases page:

pip install SpiderNet

The main class is GenSpider. Create an instance by passing it the URL of the website you want to scrape:

from SpiderNet import GenSpider
web=GenSpider(<website>)

The methods are:

    • website_text
      Returns the markup text of the website.
    • find_all_html_tags
      Finds all HTML tags of the type passed as the parameter. If the tags are nested, pass the 'text' keyword while looping over the outer results to restrict the search to the current looped text.
    • extract_text_from_html
      Extracts the text from a looped instance of a tag.
    • find_all_tags_by_classname
      Finds all HTML tags of the type passed as the parameter that also have the given class, which is passed as a second parameter. If the tags are nested, pass the 'text' keyword while looping over the outer results to restrict the search to the current looped text.
    • get_href_from_a_tags
      Returns a list of the href attributes of all anchor tags. Takes an optional text parameter if you want to target a particular piece of text; by default, hrefs are extracted from the entire page.
    • get_src_from_img_tags
      Returns a list of the src attributes of all img tags. Takes an optional text parameter if you want to target a particular piece of text; by default, srcs are extracted from the entire page.
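To make the href and src extraction described above more concrete, here is a minimal standard-library sketch of how such lookups could be done with regular expressions on a markup string. It is an illustration under stated assumptions, not SpiderNet's implementation: the helper names hrefs, srcs and strip_tags are hypothetical, and the patterns only handle quoted attribute values.

```python
import re

# Simplified patterns: they only handle attribute values wrapped in quotes.
HREF_RE = re.compile(r'<a\b[^>]*?\bhref\s*=\s*["\']([^"\']*)["\']', re.IGNORECASE)
SRC_RE = re.compile(r'<img\b[^>]*?\bsrc\s*=\s*["\']([^"\']*)["\']', re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")


def hrefs(markup):
    """Return the href value of every anchor tag found in the markup."""
    return HREF_RE.findall(markup)


def srcs(markup):
    """Return the src value of every img tag found in the markup."""
    return SRC_RE.findall(markup)


def strip_tags(fragment):
    """Remove all tags from a fragment and return the remaining text."""
    return TAG_RE.sub("", fragment).strip()


# Hypothetical sample markup standing in for a fetched page.
sample = (
    '<p><a href="/one">First</a> <img src="a.png" alt="A"> '
    '<a class="x" href="/two">Second</a></p>'
)

print(hrefs(sample))
print(srcs(sample))
print(strip_tags('<a href="/one">First</a>'))
```

Passing a smaller fragment instead of the whole page plays the same role as the optional text parameter: the search is limited to that piece of markup.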

Example: extracting comic book chapters and their respective href attributes from readallcomics, using the new data types.

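from SpiderNet import HashMap, ForEach, GenSpider, Str

# Wrap the category URL in SpiderNet's Str type and load the page
url = Str("https://readallcomics.com/category/chakra-the-invincible/")
web = GenSpider(url)

# Find every <ul> with the class "list-story"; these hold the chapter links
story_lists = web.find_all_tags_by_classname('ul', 'list-story')
chapters = HashMap()
for block in story_lists:
    # Restrict both searches to the current block via the text keyword
    anchors = web.find_all_html_tags('a', text=block)
    links = web.get_href_from_a_tags(text=block)
    for anchor, link in zip(anchors, links):
        # Map the chapter title to its link
        chapters.add(web.extract_text_from_html(anchor), link)

# Print each "title => link" pair
ForEach(chapters).unit()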

The output of the code will be as follows:

Chakra The Invincible 010 (2016) => https://readallcomics.com/chakra-the-invincible-010-2016/
Chakra The Invincible 009 (2016) => https://readallcomics.com/chakra-the-invincible-009-2016/
Chakra The Invincible 008 (2016) => https://readallcomics.com/chakra-the-invincible-008-2016/
Chakra The Invincible 007 (2016) => https://readallcomics.com/chakra-the-invincible-007-2016/
Chakra The Invincible 006 (2015) => https://readallcomics.com/chakra-the-invincible-006-2015/
Chakra The Invincible 005 (2015) => https://readallcomics.com/chakra-the-invincible-005-2015/
Chakra The Invincible 004 (2015) => https://readallcomics.com/chakra-the-invincible-004-2015/
Chakra The Invincible 003 (2015) => https://readallcomics.com/chakra-the-invincible-003-2015/
Chakra The Invincible 002 (2015) => https://readallcomics.com/chakra-the-invincible-002-2015/
Chakra The Invincible 001 (2015) => https://readallcomics.com/chakra-the-invincible-001-2015/
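The same title-to-link pairing can be tried offline with nothing but the standard library. The sketch below is an assumption-laden stand-in, not SpiderNet code: it uses a plain dict in place of HashMap, a hypothetical helper blocks_by_class, made-up example.com markup, and a simple pattern that does not handle nested tags of the same type.

```python
import re

# Made-up markup shaped like a chapter listing page.
sample = """
<ul class="list-story">
  <li><a href="https://example.com/issue-002/">Issue 002</a></li>
  <li><a href="https://example.com/issue-001/">Issue 001</a></li>
</ul>
<ul class="other"><li><a href="https://example.com/about/">About</a></li></ul>
"""

# Captures the href and the inner text of each anchor tag.
ANCHOR_RE = re.compile(
    r'<a\b[^>]*?\bhref\s*=\s*["\']([^"\']*)["\'][^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)


def blocks_by_class(markup, tag, classname):
    """Return every <tag>...</tag> block whose class attribute contains classname."""
    pattern = re.compile(
        r'<{0}\b[^>]*\bclass\s*=\s*["\'][^"\']*\b{1}\b[^"\']*["\'][^>]*>.*?</{0}>'.format(
            re.escape(tag), re.escape(classname)
        ),
        re.IGNORECASE | re.DOTALL,
    )
    return pattern.findall(markup)


chapters = {}
for block in blocks_by_class(sample, "ul", "list-story"):
    # Search only inside the current block, like passing text=block
    for href, title in ANCHOR_RE.findall(block):
        chapters[title.strip()] = href

for title, href in chapters.items():
    print(f"{title} => {href}")
```

Only the links inside the list-story block are collected; the link in the other list is ignored.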

For more examples look at :

Project details



This version

1.3

Download files


Source Distribution

spidernet-1.3.tar.gz (5.1 kB view details)

Uploaded Source

Built Distribution

SpiderNet-1.3-py3-none-any.whl (5.8 kB view details)

Uploaded Python 3

File details

Details for the file spidernet-1.3.tar.gz.

File metadata

  • Download URL: spidernet-1.3.tar.gz
  • Upload date:
  • Size: 5.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.11.9

File hashes

Hashes for spidernet-1.3.tar.gz
Algorithm Hash digest
SHA256 de9a9b420bfefecdc40b4ac724069d7b6ac58785b82d3d98611dbef295175fec
MD5 95a07ca580337f0a9042f97c86eeed15
BLAKE2b-256 dd424a35467e7fe30ccb7822cd5ede7826e8447c662ce849b47ff8928ea3c3e1
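A downloaded file can be checked against the SHA256 digest listed above using Python's hashlib module. This is a minimal sketch: the helper names sha256_of_file and matches_published_hash are hypothetical, and the local path in the commented usage line is an assumption about where the file was saved.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Return True if the file's SHA256 digest equals the published one."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Usage (assumes the sdist was saved in the current directory):
# matches_published_hash(
#     "spidernet-1.3.tar.gz",
#     "de9a9b420bfefecdc40b4ac724069d7b6ac58785b82d3d98611dbef295175fec",
# )
```

pip can perform the same check automatically when a requirements file pins the digest with --hash and installation is run with --require-hashes.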


File details

Details for the file SpiderNet-1.3-py3-none-any.whl.

File metadata

  • Download URL: SpiderNet-1.3-py3-none-any.whl
  • Upload date:
  • Size: 5.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.11.9

File hashes

Hashes for SpiderNet-1.3-py3-none-any.whl
Algorithm Hash digest
SHA256 f02ff83df917e6d25432d3b8de2e5bffbd6161a78fdb90cb9fd3acded34f06b3
MD5 5eb951a1a67166fc1ddec371c6a5d514
BLAKE2b-256 285b41cb729254a7cf905d4995af3e0fa462222e7eafa3fe8d1a14bc5c8925f0
