
An assistant for web crawling

Project description

CrawlFlow

This module is a tool for handling web scraping.

How to use: Importing packages

from CrawlFlow.handler import Handler
from CrawlFlow import types as CrawlTypes

Crawl types: if you are getting an API response, use

self.type = CrawlTypes.API()

and if you are getting an HTML response and parsing it with BS4 (BeautifulSoup), use

self.type = CrawlTypes.BS4()

Defining the crawler class

class TestCrawler(Handler):
    def __init__(self, name):
        super().__init__(name)
        self.type = CrawlTypes.API()

    def data2rows(self, request_data):
        '''Your scraping code goes here: build videoList from request_data.'''

        tables_rows = {'videoList': videoList}

        return tables_rows

You need to define a method called "data2rows" in the class. In this method you receive a variable called "request_data" that contains your desired data (for the API type, JSON data; for the BS4 type, a BS4 object).

The output of this method must be a Python dictionary in this format:

tables_rows = {'table1': table1,
               'table2': table2, ...}

Each table is a list of dicts like the one below:

[{'col1': 'col1_data', 'col2': 'col2_data'},
 {'col1': 'col1_data', 'col2': 'col2_data'},
 ...]
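
As a concrete example, a minimal sketch of a BS4-based data2rows might look like the following (the page structure, the 'h2.title' selector, and the 'articles' table name are assumptions for illustration):

class ArticleCrawler(Handler):
    def __init__(self, name):
        super().__init__(name)
        self.type = CrawlTypes.BS4()      # request_data will be a BS4 object

    def data2rows(self, request_data):
        # request_data is a BeautifulSoup object; the selector below is an assumption
        rows = []
        for item in request_data.select('h2.title'):
            rows.append({'title': item.get_text(strip=True)})
        # one table named 'articles', each row a dict of column -> value
        return {'articles': rows}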

Now we use our crawler class:

scraper = TestCrawler('name of your scrape query')
scraper.request.sleep_time = 1.5     # sleep between requests (seconds)
scraper.chk_point_interval = 20      # save a checkpoint every 20 requests
scraper.init()
url_list = ['list of URLs that you want to scrape']
scraper.run(url_list, resume=False, dynamic=False)
scraper.export_tables()

Setting Headers, Cookies and Params

# setting manually:
Handler.request.headers = Headers
Handler.request.cookies = Cookies
Handler.request.params = Params

# or you can read them from a JSON file
Handler.request.read_headers(headers_path)
Handler.request.read_cookies(cookies_path)
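
For example, a minimal sketch of setting these on a crawler instance (the header, cookie and param values, and the file paths, are placeholders):

scraper = TestCrawler('my_query')

# set headers/cookies/params manually as plain dicts
scraper.request.headers = {'User-Agent': 'Mozilla/5.0', 'Accept': 'application/json'}
scraper.request.cookies = {'session_id': 'abc123'}   # hypothetical cookie
scraper.request.params = {'type_search': 'search'}   # query-string parameters

# or load them from JSON files on disk (paths are placeholders)
scraper.request.read_headers('headers.json')
scraper.request.read_cookies('cookies.json')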

Handler.chk_point_interval

Number of requests between checkpoints

Handler.directory

Directory for saving the data

Handler.run(url_list, resume=False, dynamic=False)

url_list: List ==> list of URLs

resume: False ==> new start / True ==> continue from the last checkpoint

dynamic: False ==> standard scraping over the given url_list / True ==> for cases where each response provides the next URL (see the sketch below)
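
For example, a minimal sketch of resuming an interrupted run (the query name, directory and interval values are placeholders; the directory layout is an assumption):

scraper = TestCrawler('my_query')       # use the same query name as the interrupted run
scraper.chk_point_interval = 50         # checkpoint every 50 requests
scraper.directory = 'data/my_query'     # where checkpoints and tables are saved (assumed layout)
scraper.init()

url_list = ['https://example.com/page/1', 'https://example.com/page/2']
scraper.run(url_list, resume=True, dynamic=False)   # resume=True continues from the last checkpoint
scraper.export_tables()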

Dynamic Scraping:

By setting dynamic=True in the Handler.run method, you can use dynamic scraping.

"url_list" should contian first url that starts scrping

class TestCrawler(Handler):
    def __init__(self, name):
        super().__init__(name)
        self.type = CrawlTypes.API()

    def data2rows(self, request_data):
        self.data = request_data
        if request_data['data'][0]['link'] is None:
            self.vars.done = True        # no next page: stop the scraping loop
            return {}
        # build the next URL from the response's 'link' -> 'next' field
        next_link = self.get(request_data['data'][0], ['link', 'next'])
        self.vars.next_url = ('https://www.aparat.com/api/fa/v1/video/video/search/text/'
                              + next_link.split('text/')[1])

        videoList = self.get(request_data['data'][0], ['video', 'data'])
        tables_rows = {'videoList': videoList}
        return tables_rows


crawler = TestCrawler('aparat')
crawler.request.params = {'type_search': 'search'}
crawler.init()
url_list = ['https://www.aparat.com/api/fa/v1/video/video/search/text/game']
crawler.run(url_list, resume=False, dynamic=True)
crawler.export_tables()

Note: Handler.vars is an object that is saved at every checkpoint, so set your state variables on it.

If you set Handler.vars.done=True, the scraping process will finish after that request.
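
For example, a minimal sketch of keeping a custom counter on Handler.vars so it survives checkpoints and resume (the counter name, the page limit, and the 'videoList' key are hypothetical):

# inside your crawler class:
def data2rows(self, request_data):
    # page_count is a hypothetical counter; because it lives on self.vars,
    # it is saved with every checkpoint and restored when resume=True
    if not hasattr(self.vars, 'page_count'):
        self.vars.page_count = 0
    self.vars.page_count += 1

    if self.vars.page_count >= 100:   # stop the run after 100 pages
        self.vars.done = True

    return {'videoList': request_data.get('videoList', [])}   # placeholder table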

Defining response status codes that bypass normal handling (e.g., to interrupt the process)

Handler.request.bypass_status_codes = [204]
# if the status code is 204, request_data in data2rows will look like this:
{'status_code': 204}
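
For example, a minimal sketch of handling a bypassed status code inside data2rows (the query name and the 'videoList' key are placeholders):

class TestCrawler(Handler):
    def __init__(self, name):
        super().__init__(name)
        self.type = CrawlTypes.API()

    def data2rows(self, request_data):
        # bypassed responses arrive as {'status_code': 204} instead of JSON data
        if isinstance(request_data, dict) and request_data.get('status_code') == 204:
            return {}                                       # no rows for this URL, keep crawling
        return {'videoList': request_data['videoList']}     # placeholder table

crawler = TestCrawler('my_query')
crawler.request.bypass_status_codes = [204]   # let 204 responses reach data2rows unchanged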

Sending to databases

First, set Handler.DB_type to sqllite/mongo/local before calling Handler.init().

sqllite: the default value; data is sent to a local SQLite database.

mongo: data is sent to MongoDB; you also need to set Handler.DB_info:

Handler.DB_info = {'database': 'myDataBase',
                   'connetion_string': 'MongoDB connection string'}

local: data is saved as a pickle object; note that this option is only for small datasets.
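
For example, a minimal sketch of sending results to MongoDB (the database name and connection string are placeholders; the 'connetion_string' key is spelled as in the example above):

scraper = TestCrawler('my_query')
scraper.DB_type = 'mongo'             # 'sqllite' (default), 'mongo', or 'local'
scraper.DB_info = {'database': 'myDataBase',                         # placeholder database name
                   'connetion_string': 'mongodb://localhost:27017'}  # placeholder connection string
scraper.init()                        # set DB_type and DB_info before calling init()

scraper.run(['https://example.com/api/items'], resume=False, dynamic=False)
scraper.export_tables()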


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

CrawlFlow-0.2.2.tar.gz (8.5 kB)

Uploaded Source

Built Distribution

CrawlFlow-0.2.2-py3-none-any.whl (7.7 kB)

Uploaded Python 3

File details

Details for the file CrawlFlow-0.2.2.tar.gz.

File metadata

  • Download URL: CrawlFlow-0.2.2.tar.gz
  • Upload date:
  • Size: 8.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.5

File hashes

Hashes for CrawlFlow-0.2.2.tar.gz
Algorithm Hash digest
SHA256 72726fda9485199f376109a24761c62d9678e7cd9369c90f018d7d614f330ce8
MD5 f5a027e545d0c5a3fe9027cfed2c0015
BLAKE2b-256 33803e3ecba78ac5f57a14b30a09ac6b21df1e5d3abe41feb6abe204d2df1b45


File details

Details for the file CrawlFlow-0.2.2-py3-none-any.whl.

File metadata

  • Download URL: CrawlFlow-0.2.2-py3-none-any.whl
  • Upload date:
  • Size: 7.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.5

File hashes

Hashes for CrawlFlow-0.2.2-py3-none-any.whl
Algorithm Hash digest
SHA256 051badbf286a9284c7b94cc1b62fd4d0a094c4bf50822dab4155e31d4531692a
MD5 bc439ff9010c589c19a432c1d2b2408d
BLAKE2b-256 afe59ce817bb71825adff8bbf82a23e53ac652ce943ff3e9134fc4dacb2007a9

