An assistant for web crawling
Project description
CrawlFlow
This module is a tool for handling web scraping.
How to use: Importing packages
from CrawlFlow.handler import Handler
from CrawlFlow import types as CrawlTypes
Crawl types: if you are getting an API (JSON) response, use
self.type=CrawlTypes.API()
and if you are getting an HTML response and parsing it with BS4 (BeautifulSoup), use
self.type=CrawlTypes.BS4()
Defining the crawler class
class TestCrawler(Handler):
    def __init__(self, name):
        super().__init__(name)
        self.type=CrawlTypes.API()

    def data2rows(self, request_data):
        '''your scraping code here'''
        # build videoList (a list of row dicts) from request_data
        tables_rows={'videoList': videoList}
        return tables_rows
You need to define a method in the class called "data2rows". This method receives a variable called "request_data" that contains your desired data (for the API type, parsed JSON; for the BS4 type, a BeautifulSoup object).
The output of this method must be a Python dictionary in this format:
tables_rows={'table1': table1,
             'table2': table2, ...}
Each table is a list of dicts like the following:
[ {'col1': 'col1_data', 'col2': 'col2_data'},
{'col1': 'col1_data', 'col2': 'col2_data'},
...]
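For example, a minimal data2rows that builds such a dictionary from a hypothetical JSON payload could look like the sketch below (the field names 'items', 'id' and 'title' are illustrative, not part of CrawlFlow):

def data2rows(self, request_data):
    # request_data is the parsed JSON response (API type); field names below are hypothetical
    videoList = [{'id': item['id'], 'title': item['title']}
                 for item in request_data.get('items', [])]
    return {'videoList': videoList}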
Now we use our scraper class:
scraper=TestCrawler('name of your scraping query')
scraper.request.sleep_time=1.5
scraper.chk_point_interval=20
scraper.init()
url_list=['list of URLs that you want to scrape']
scraper.run(url_list, resume=False, dynamic=False)
scraper.export_tables()
Setting Headers, Cookies and Params
#setting manually:
Handler.request.headers=Headers
Handler.request.cookies=Cookies
Handler.request.params=Params
#Or you can read them from a JSON file
Handler.request.read_headers(headers_path)
Handler.request.read_cookies(cookies_path)
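As a sketch, assuming read_headers and read_cookies take the path of a JSON file containing a plain object of names to values (the file names, contents and query name below are illustrative):

# headers.json (hypothetical contents):
# {"User-Agent": "Mozilla/5.0", "Accept": "application/json"}

scraper = TestCrawler('my_query')
scraper.request.read_headers('headers.json')
scraper.request.read_cookies('cookies.json')
scraper.init()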
Handler.chk_point_interval
The interval, in requests, at which checkpoint data is saved
Handler.directory
Directory for saving the data
Handler.run(url_list, resume=False, dynamic=False)
url_list: List ==> list of URLs
resume: False==> new start / True==> continue from last checkpoint
dynamic: False==> standard scraping over url_list / True==> for cases where each response provides the next URL
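Putting these options together in a sketch (the query name, directory path, interval and URL are arbitrary example values, and it is assumed they should be set before Handler.init()):

scraper = TestCrawler('my_query')
scraper.chk_point_interval = 50       # save a checkpoint every 50 requests
scraper.directory = './crawl_output'  # where checkpoint and output data are stored
scraper.init()
url_list = ['https://example.com/page1']  # illustrative URL
scraper.run(url_list, resume=True, dynamic=False)  # continue from the last checkpoint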
Dynamic Scraping:
By setting dynamic=True in the Handler.run method, you can use dynamic scraping.
"url_list" should contain the first URL that starts the scraping.
class TestCrawler(Handler):
    def __init__(self, name):
        super().__init__(name)
        self.type=CrawlTypes.API()

    def data2rows(self, request_data):
        self.data=request_data
        if request_data['data'][0]['link'] is None:
            # no next link available: signal that scraping is finished
            self.vars.done=True
            return {}
        # build the next URL from the response's 'link.next' field
        self.vars.next_url='https://www.aparat.com/api/fa/v1/video/video/search/text/'+self.get(request_data['data'][0], ['link', 'next']).split('text/')[1]
        videoList=self.get(request_data['data'][0], ['video', 'data'])
        tables_rows={'videoList': videoList}
        return tables_rows
crawler=TestCrawler('aparat')
crawler.request.params={'type_search': 'search'}
crawler.init()
url_list=['https://www.aparat.com/api/fa/v1/video/video/search/text/game']
crawler.run(url_list, resume=False, dynamic=True)
crawler.export_tables()
Note: Handler.vars is an object that is saved at every checkpoint, so store your state variables there.
If you set Handler.vars.done=True, the scraping process will finish after that.
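For instance, a sketch that keeps a page counter in Handler.vars and stops after a fixed number of pages (page_count and the 'items' field are illustrative names, not CrawlFlow attributes, and it is assumed vars accepts arbitrary attributes):

def data2rows(self, request_data):
    # vars is persisted with every checkpoint, so the counter survives a resume
    self.vars.page_count = getattr(self.vars, 'page_count', 0) + 1
    if self.vars.page_count >= 100:  # arbitrary page limit
        self.vars.done = True        # tells the handler to stop after this request
    return {'videoList': request_data.get('items', [])}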
Defining response status codes that bypass normal handling instead of interrupting the process
Handler.request.bypass_status_codes = [204]
#If the status code is 204, request_data in data2rows will look like this:
{'status_code': 204}
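A minimal sketch of handling such a bypassed response inside data2rows, assuming the payload is exactly the dict shown above (the 'items' field is illustrative):

def data2rows(self, request_data):
    # a bypassed status code arrives as {'status_code': <code>} instead of parsed data
    if isinstance(request_data, dict) and request_data.get('status_code') == 204:
        return {}  # nothing to store for an empty response
    return {'videoList': request_data.get('items', [])}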
Sending to databases
First, set Handler.DB_type to sqllite/mongo/local before Handler.init().
sqllite: the default value; data is sent to a local SQLite database.
mongo: data is sent to MongoDB; you also need to set Handler.DB_info:
Handler.DB_info = {'database': 'myDataBase',
                   'connetion_string': 'MongoDB connetion_string'}
local: data is saved as a pickle object; note that this option is intended for small datasets.
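Putting it together for MongoDB, as a sketch based on the dynamic example above (the database name and connection string are placeholders, the 'connetion_string' key is spelled as in the documentation above, and the DB settings are assumed to take effect at init()):

crawler = TestCrawler('aparat')
crawler.DB_type = 'mongo'                 # set before init()
crawler.DB_info = {'database': 'myDataBase',
                   'connetion_string': 'mongodb://localhost:27017'}
crawler.request.params = {'type_search': 'search'}
crawler.init()
url_list = ['https://www.aparat.com/api/fa/v1/video/video/search/text/game']
crawler.run(url_list, resume=False, dynamic=True)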