Skip to main content

minimalist crawling framework.

Project description

Crawling framework for storing text data via sqlite3.

Support for xpath and jsonpath syntax

sqlite3: table_name: some, table_col: col_0, col_1

import vspider

def some(url):
    x @ url
    x * '//*[contains(@class,"c-container")]'
    x ** 'string(./h3/a)'
    x ** 'string(./h3/a/@href)'

for i in range(10):
    url = f"你好&pn={i*10}"

sqlite3: table_name: some,some2; table1_col: title,url; table2_col: test

import vspider,vthread

@vhread.pool(10) # By using the Vthread function library, the efficiency can be greatly improved.
def some(url):
    x @ url
    # The first way of collecting is to use * as the node, ** as the
    # configuration of the content address collected under the node.
    # applicable to data of type html_table.
    x * '//*[contains(@class,"c-container")]'
    x ** ('title','string(./h3/a)')
    x ** ('url',  'string(./h3/a/@href)')

    # The second way of collecting is "directly collecting" by <<.
    # It is suitable for a single page to collect only one set of data
    x("some2") @ url
    x << ("test_int_",'string(//*[@id="page"]/strong/span[2])',lambda i:i[:20])
    # setting the storage type with a suffix
    # Support:
    # _double_
    # _int_
    # _integer_
    # _str_
    # _string_
    # _date_

    # Both ** and << both configuration functions can use tuple and list to pass parameters.
    # If the third parameter exists, it will be used as the subsequent processing function of
    # the data collected by xpath, and the processed data will be inserted into the database.
    # defualt function: lambda i:i.strip(),if set it None, do nothing.

for i in range(10):
    url = f"你好&pn={i*10}"

Project details

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Files for vspider, version 0.0.9
Filename, size File type Python version Upload date Hashes
Filename, size vspider-0.0.9-py3-none-any.whl (15.4 kB) File type Wheel Python version 3.6 Upload date Hashes View

Supported by

Pingdom Pingdom Monitoring Google Google Object Storage and Download Analytics Sentry Sentry Error logging AWS AWS Cloud computing DataDog DataDog Monitoring Fastly Fastly CDN DigiCert DigiCert EV certificate StatusPage StatusPage Status page