A minimalist crawling framework.

Project description

A crawling framework that stores collected text data in sqlite3.

Supports XPath and JSONPath syntax.

Resulting sqlite3 table: table_name: some; table_col: col_0, col_1

import vspider

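def some(url):
    # Fetch the page at url.
    x @ url
    # * selects the nodes; each ** collects one value under every node.
    x * '//*[contains(@class,"c-container")]'
    x ** 'string(./h3/a)'
    x ** 'string(./h3/a/@href)'

for i in range(10):
    url = f"你好&pn={i*10}"
    some(url)  # call the collector for each page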
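
The node-then-columns pattern above can be sketched with the standard library alone. This is not vspider's implementation, only an illustration of the idea under stated assumptions: the sample page, the helper names collect_rows and store_rows, and the example.com URLs are all hypothetical, and ElementTree's limited XPath subset (exact attribute match instead of contains()) stands in for full XPath.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (real-world HTML would need an HTML parser).
PAGE = """
<html><body>
  <div class="c-container"><h3><a href="http://example.com/a"> First </a></h3></div>
  <div class="c-container"><h3><a href="http://example.com/b">Second</a></h3></div>
  <div class="other"><h3><a href="http://example.com/c">Ignored</a></h3></div>
</body></html>
"""

def collect_rows(page):
    """Select each node, then read one value per column relative to that node."""
    root = ET.fromstring(page)
    rows = []
    # ElementTree supports exact attribute matches only, not contains().
    for node in root.findall(".//div[@class='c-container']"):
        link = node.find("./h3/a")
        if link is None:
            continue
        title = (link.text or "").strip()
        rows.append((title, link.get("href", "")))
    return rows

def store_rows(conn, rows):
    """Store unnamed columns as col_0, col_1 in a table named 'some'."""
    conn.execute("CREATE TABLE IF NOT EXISTS some (col_0 TEXT, col_1 TEXT)")
    conn.executemany("INSERT INTO some VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
store_rows(conn, collect_rows(PAGE))
stored = conn.execute("SELECT col_0, col_1 FROM some ORDER BY rowid").fetchall()
```

Each matching node yields one row, and each relative path yields one column of that row, which is why this style fits pages with many repeated entries.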

Resulting sqlite3 tables: table_name: some, some2; table1_col: title, url; table2_col: test

import vspider, vthread

@vthread.pool(10)  # Running the function through a vthread thread pool can greatly improve efficiency.
def some(url):
    x @ url
    # First way to collect: use * to select the nodes, then ** to configure
    # the XPath of each value collected under every node.
    # Suited to table-like HTML data (many rows per page).
    x * '//*[contains(@class,"c-container")]'
    x ** ('title','string(./h3/a)')
    x ** ('url',  'string(./h3/a/@href)')

    # Second way to collect: "direct collection" with <<.
    # Suited to pages that yield only a single set of data.
    x("some2") @ url
    x << ("test_int_",'string(//*[@id="page"]/strong/span[2])',lambda i:i[:20])
    # Set the storage type by adding a suffix to the column name.
    # Supported suffixes:
    # _double_
    # _int_
    # _integer_
    # _str_
    # _string_
    # _date_

    # Both configuration operators, ** and <<, accept a tuple or a list as parameters.
    # If a third parameter is given, it is used as a post-processing function for
    # the data collected by XPath, and the processed data is inserted into the database.
    # Default function: lambda i: i.strip(). If it is set to None, no processing is done.

for i in range(10):
    url = f"你好&pn={i*10}"
    some(url)  # call the collector for each page
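
The suffix and post-processing rules described in the comments above can be sketched as follows. This is an illustration, not vspider's code: the helper names split_column and post_process are hypothetical, and the mapping from suffix to sqlite3 column type is an assumption (the source lists the suffixes but not the types they produce). The only grounded detail is that "test_int_" ends up as the column "test".

```python
# Assumed mapping from the documented suffixes to sqlite3 column types.
SUFFIX_TYPES = {
    "_double_": "REAL",
    "_int_": "INTEGER",
    "_integer_": "INTEGER",
    "_str_": "TEXT",
    "_string_": "TEXT",
    "_date_": "DATE",
}

def split_column(name, default_type="TEXT"):
    """Return (column_name, sqlite_type) for a name such as 'test_int_'."""
    for suffix, sql_type in SUFFIX_TYPES.items():
        # Require a non-empty column name in front of the suffix.
        if name.endswith(suffix) and len(name) > len(suffix):
            return name[: -len(suffix)], sql_type
    return name, default_type

# Sentinel that distinguishes "no function given" from an explicit None.
_UNSET = object()

def post_process(value, func=_UNSET):
    """Apply the default strip, a custom function, or nothing when func is None."""
    if func is _UNSET:
        return value.strip()   # default: lambda i: i.strip()
    if func is None:
        return value           # None: leave the collected text untouched
    return func(value)

column, sql_type = split_column("test_int_")
trimmed = post_process("  12345  ")
truncated = post_process("abcdefghijklmnopqrstuvwxyz", lambda i: i[:20])
```

The sentinel is one way to keep "omitted" and "None" distinct, since the comments give them different meanings.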

Built Distribution

vspider-0.0.9-py3-none-any.whl (15.4 kB)