An assistant for web crawling
Project description
CrawlFlow
This module is a tool for handling web scraping.
How to use: Importing packages
from CrawlFlow.handler import Handler
from CrawlFlow import types as CrawlTypes
import json
import pandas as pd
Crawl types: if you are handling an API (JSON) response, use
self.type=CrawlTypes.API()
and if you are handling an HTML response and parsing it with BS4 (BeautifulSoup), use
self.type=CrawlTypes.BS4()
Defining the crawler class
class TestCrawler(Handler):
    def __init__(self, name):
        super().__init__(name)
        self.type=CrawlTypes.API()

    def data2rows(self, request_data):
        '''your scraping code here'''
        tables_rows={'videoList': videoList}
        return tables_rows
You need to define a method in the class called "data2rows". In this method you receive a variable called "request_data" that contains your desired data (for API, the JSON data; for BS4, a BeautifulSoup object).
The output of this method must be a Python dictionary in this format:
tables_rows={'table1': table1,
'table2': table2, ...}
Each table is a list of dicts like below:
[ {'col1': 'col1_data', 'col2': 'col2_data'},
{'col1': 'col1_data', 'col2': 'col2_data'},
...]
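For example, a hypothetical return value with a single table and two columns could look like this (the table name and column values below are made up for illustration):

tables_rows = {
    'videoList': [
        {'title': 'first video', 'url': 'https://example.com/v/1'},
        {'title': 'second video', 'url': 'https://example.com/v/2'},
    ]
}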
Now we use our crawler class
scraper=TestCrawler('name of your scrape query')
scraper.request.sleep_time=1.5
scraper.chk_point_interval=20
scraper.init()
url_list=['list of URLs that you want to scrape']
scraper.run(url_list, resume=False, dynamic=False)
scraper.export_tables()
Setting Headers, Cookies and Params
# setting manually:
Handler.request.headers=Headers
Handler.request.cookies=Cookies
Handler.request.params=Params
# or you can read them from a JSON file
Handler.request.read_headers(headers_path)
Handler.request.read_cookies(cookies_path)
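For example, assuming headers, cookies and params are plain dictionaries (as with the requests library) and the JSON files contain flat name/value mappings, this might look like the sketch below (all values and file paths are placeholders):

scraper.request.headers = {'User-Agent': 'Mozilla/5.0', 'Accept': 'application/json'}
scraper.request.cookies = {'session_id': 'abc123'}
scraper.request.params = {'page': '1'}
# or from JSON files
scraper.request.read_headers('headers.json')
scraper.request.read_cookies('cookies.json')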
Handler.chk_point_interval
Number of requests between checkpoint saves
Handler.directory
Directory for saving the data
Handler.run(url_list, resume=False, dynamic=False)
url_list: List ==> list of URLs
resume: False ==> new start / True ==> continue from the last checkpoint
dynamic: False ==> standard scraping over url_list / True ==> for cases where each response provides the next URL (see the sketch below)
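For instance, a run that checkpoints every 10 requests and is later resumed could look like this sketch, which only reuses the calls shown above (the URLs and interval are placeholders):

scraper = TestCrawler('my_query')
scraper.chk_point_interval = 10   # save a checkpoint every 10 requests
scraper.init()
url_list = ['https://example.com/page/1', 'https://example.com/page/2']
scraper.run(url_list, resume=False, dynamic=False)   # fresh start
# after an interruption, continue from the last checkpoint:
scraper.run(url_list, resume=True, dynamic=False)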
Dynamic Scraping:
By setting dynamic=True in the Handler.run method, you can use dynamic scraping.
"url_list" should contain the first URL that starts the scraping.
class TestCrawler(Handler):
    def __init__(self, name):
        super().__init__(name)
        self.type=CrawlTypes.API()

    def data2rows(self, request_data):
        self.data=request_data
        if request_data['data'][0]['link'] is None:
            self.vars.done=True
            return {}
        self.vars.next_url = 'https://www.aparat.com/api/fa/v1/video/video/search/text/' + self.get(request_data['data'][0], ['link', 'next']).split('text/')[1]
        videoList=self.get(request_data['data'][0], ['video', 'data'])
        tables_rows={'videoList': videoList}
        return tables_rows
crawler=TestCrawler('aparat')
crawler.request.params={'type_search': 'search'}
crawler.init()
url_list=['https://www.aparat.com/api/fa/v1/video/video/search/text/game']
crawler.run(url_list, resume=False, dynamic=True)
crawler.export_tables()
Note: Handler.vars is an object that is saved at every checkpoint, so store your state variables there.
If you set Handler.vars.done=True, the scraping process will finish after that.
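As an illustration, a counter stored in vars survives checkpoint/resume cycles, and setting done stops the run (the attribute name page_count and the limit of 100 are invented for this sketch):

def data2rows(self, request_data):
    # self.vars is persisted at every checkpoint
    if not hasattr(self.vars, 'page_count'):
        self.vars.page_count = 0
    self.vars.page_count += 1
    if self.vars.page_count >= 100:   # arbitrary stop condition
        self.vars.done = True         # finishes the scraping process
        return {}
    rows = []   # build your table rows here
    return {'table1': rows}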
Defining response status codes that bypass normal response handling
Handler.request.bypass_status_codes = [204]
# if the status code is 204, request_data in data2rows will look like this:
{'status_code': 204}
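A data2rows method can then check for this case before doing its normal parsing, for example (a sketch based only on the shape shown above; the 'data' key is illustrative):

def data2rows(self, request_data):
    # bypassed responses arrive as {'status_code': 204} instead of the response body
    if isinstance(request_data, dict) and request_data.get('status_code') == 204:
        return {}   # no rows for an empty response
    videoList = request_data['data']
    return {'videoList': videoList}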