An assistant for web crawling
Project description
CrawlFlow
This module is a tool for handling web scraping.
How to use: Importing packages
from CrawlFlow.handler import Handler
from CrawlFlow import types as CrawlTypes
import json
import pandas as pd
Crawl types: if you are handling an API (JSON) response, use
self.type=CrawlTypes.API()
and if you are handling an HTML response and parsing it with BS4 (BeautifulSoup), use
self.type=CrawlTypes.BS4()
Defining the crawler class
class TestCrawler(Handler):
    def __init__(self, name):
        super().__init__(name)
        self.type=CrawlTypes.API()

    def data2rows(self, request_data):
        '''your scraping code here'''
        tables_rows={'videoList': videoList}
        return tables_rows
You need to define a method in the class called "data2rows". In this method you receive a variable called "request_data" that contains your desired data (for API, the JSON data; for BS4, a BeautifulSoup object).
The output of this method must be a Python dictionary in this format:
tables_rows={'table1': table1,
'table2': table2, ...}
Each table is a list of dicts like below:
[ {'col1': 'col1_data', 'col2': 'col2_data'},
{'col1': 'col1_data', 'col2': 'col2_data'},
...]
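For example, a hypothetical return value with a single table and two columns could look like this (the table name and column values below are made up for illustration):

tables_rows = {
    'videoList': [
        {'title': 'first video', 'url': 'https://example.com/v/1'},
        {'title': 'second video', 'url': 'https://example.com/v/2'},
    ]
}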
Now we use our crawler class
scraper=TestCrawler('name of your scrape query')
scraper.request.sleep_time=1.5
scraper.chk_point_interval=20
scraper.init()
url_list=['list of URLs that you want to scrape']
scraper.run(url_list, resume=False, dynamic=False)
scraper.export_tables()
Setting Headers, Cookies and Params
# setting manually:
Handler.request.headers=Headers
Handler.request.cookies=Cookies
Handler.request.params=Params
# or you can read them from a JSON file
Handler.request.read_headers(headers_path)
Handler.request.read_cookies(cookies_path)
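For example, assuming headers, cookies and params are plain dictionaries (as with the requests library) and the JSON files contain flat name/value mappings, this might look like the sketch below (all values and file paths are placeholders):

scraper.request.headers = {'User-Agent': 'Mozilla/5.0', 'Accept': 'application/json'}
scraper.request.cookies = {'session_id': 'abc123'}
scraper.request.params = {'page': '1'}
# or from JSON files
scraper.request.read_headers('headers.json')
scraper.request.read_cookies('cookies.json')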
Handler.chk_point_interval
Number of requests between checkpoint saves
Handler.directory
Directory for saving the data
Handler.run(url_list, resume=False, dynamic=False)
url_list: List ==> list of URLs
resume: False ==> new start / True ==> continue from the last checkpoint
dynamic: False ==> standard scraping over url_list / True ==> for cases where each response provides the next URL (see the sketch below)
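For instance, a run that checkpoints every 10 requests and is later resumed could look like this sketch, which only reuses the calls shown above (the URLs and interval are placeholders):

scraper = TestCrawler('my_query')
scraper.chk_point_interval = 10   # save a checkpoint every 10 requests
scraper.init()
url_list = ['https://example.com/page/1', 'https://example.com/page/2']
scraper.run(url_list, resume=False, dynamic=False)   # fresh start
# after an interruption, continue from the last checkpoint:
scraper.run(url_list, resume=True, dynamic=False)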
Dynamic Scraping:
By setting dynamic=True in the Handler.run method, you can use dynamic scraping.
"url_list" should contain the first URL that starts the scraping.
class TestCrawler(Handler):
    def __init__(self, name):
        super().__init__(name)
        self.type=CrawlTypes.API()

    def data2rows(self, request_data):
        self.data=request_data
        if request_data['data'][0]['link'] is None:
            self.vars.done=True
            return {}
        self.vars.next_url = 'https://www.aparat.com/api/fa/v1/video/video/search/text/' + self.get(request_data['data'][0], ['link', 'next']).split('text/')[1]
        videoList=self.get(request_data['data'][0], ['video', 'data'])
        tables_rows={'videoList': videoList}
        return tables_rows
crawler=TestCrawler('aparat')
crawler.request.params={'type_search': 'search'}
crawler.init()
url_list=['https://www.aparat.com/api/fa/v1/video/video/search/text/game']
crawler.run(url_list, resume=False, dynamic=True)
crawler.export_tables()
Note: Handler.vars is an object that is saved at every checkpoint, so store your state variables there.
If you set Handler.vars.done=True, the scraping process will finish after that.
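As an illustration, a counter stored in vars survives checkpoint/resume cycles, and setting done stops the run (the attribute name page_count and the limit of 100 are invented for this sketch):

def data2rows(self, request_data):
    # self.vars is persisted at every checkpoint
    if not hasattr(self.vars, 'page_count'):
        self.vars.page_count = 0
    self.vars.page_count += 1
    if self.vars.page_count >= 100:   # arbitrary stop condition
        self.vars.done = True         # finishes the scraping process
        return {}
    rows = []   # build your table rows here
    return {'table1': rows}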
Defining response status codes that bypass normal response handling
Handler.request.bypass_status_codes = [204]
# if the status code is 204, request_data in data2rows will look like this:
{'status_code': 204}
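A data2rows method can then check for this case before doing its normal parsing, for example (a sketch based only on the shape shown above; the 'data' key is illustrative):

def data2rows(self, request_data):
    # bypassed responses arrive as {'status_code': 204} instead of the response body
    if isinstance(request_data, dict) and request_data.get('status_code') == 204:
        return {}   # no rows for an empty response
    videoList = request_data['data']
    return {'videoList': videoList}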