A minimalist crawling framework.
Project description
A crawling framework for storing collected text data via sqlite3.
Supports XPath and JSONPath syntax.
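For readers unfamiliar with the XPath forms used in the examples below, here is a minimal sketch of how the node expression and the per-node `string(...)` expressions behave, evaluated with lxml (which is not part of vspider); the HTML snippet is invented for illustration:

```python
from lxml import html

# Invented fragment shaped like a Baidu result container.
doc = html.fromstring("""
<div class="c-container">
  <h3><a href="https://example.com/1">First result</a></h3>
</div>
""")

# Node selection, as used with vspider's * operator:
nodes = doc.xpath('//*[contains(@class,"c-container")]')

# Per-node extraction, as used with vspider's ** operator:
for node in nodes:
    title = node.xpath('string(./h3/a)')
    url = node.xpath('string(./h3/a/@href)')
    print(title, url)
```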
sqlite3: table_name: some; table_col: col_0, col_1
```python
import vspider

def some(url):
    print(url)
    x @ url
    x * '//*[contains(@class,"c-container")]'
    x ** 'string(./h3/a)'
    x ** 'string(./h3/a/@href)'

for i in range(10):
    url = f"https://www.baidu.com/s?wd=你好&pn={i*10}"
    some(url)
```
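The rows collected this way can be inspected with Python's built-in sqlite3 module. A minimal sketch, assuming only the table shape described above (the sample row and the in-memory database are invented for illustration; vspider writes its own file, so check your working directory for the database it actually created):

```python
import sqlite3

# Recreate the described layout by hand: a table named "some"
# with columns col_0 and col_1, holding one invented row.
conn = sqlite3.connect(":memory:")  # stand-in for vspider's on-disk file
conn.execute("CREATE TABLE some (col_0 TEXT, col_1 TEXT)")
conn.execute("INSERT INTO some VALUES (?, ?)",
             ("some search result title", "https://example.com/result"))

# Querying works the same way against the real database file.
for col_0, col_1 in conn.execute("SELECT col_0, col_1 FROM some"):
    print(col_0, col_1)
conn.close()
```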
sqlite3: table_name: some, some2; table1_col: title, url; table2_col: test
```python
import vspider, vthread

@vthread.pool(10)  # Using the vthread library can greatly improve efficiency.
def some(url):
    print(url)
    x @ url

    # First way of collecting: * sets the node, and ** configures the
    # content addresses collected under that node.
    # Suitable for html-table-like data.
    x * '//*[contains(@class,"c-container")]'
    x ** ('title', 'string(./h3/a)')
    x ** ('url', 'string(./h3/a/@href)')

    # Second way of collecting: "direct collection" with <<.
    # Suitable for pages from which only one set of data is collected.
    x("some2") @ url
    x << ("test_int_", 'string(//*[@id="page"]/strong/span[2])', lambda i: i[:20])
    # The storage type is set with a suffix on the column name.
    # Supported suffixes:
    #   _double_
    #   _int_
    #   _integer_
    #   _str_
    #   _string_
    #   _date_

    # Both ** and << accept either a tuple or a list of parameters.
    # If a third parameter is present, it is used as a post-processing
    # function for the data collected by xpath, and the processed data
    # is inserted into the database.
    # Default function: lambda i: i.strip(); if set to None, nothing is done.

for i in range(10):
    url = f"https://www.baidu.com/s?wd=你好&pn={i*10}"
    some(url)
```
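The third-parameter post-processing described in the comments above can be sketched in plain Python; the raw string below is invented for illustration:

```python
# Sketch of the post-processing step: the third parameter of ** / <<
# is applied to each value collected by xpath before it is stored.
default = lambda i: i.strip()   # the default: trim surrounding whitespace
truncate = lambda i: i[:20]     # a custom function, as in the << example

raw = "   Baidu search results page 2   "
print(default(raw))   # trimmed copy
print(truncate(raw))  # first 20 characters, untrimmed
```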
Built Distribution
vspider-0.0.4-py3-none-any.whl (12.4 kB)