crawlist

A universal solution for web crawling lists

Introduction

You can use crawlist to crawl websites that contain lists; with a few simple configurations you can collect all of the list data.

Installing

You can use pip (or pip3) to install crawlist:

pip install crawlist

If crawlist is already installed, you may need to upgrade to the latest version:

pip install --upgrade crawlist

Quick start

This is a static website demo: the page does not use JavaScript to load its data.

import crawlist as cl

if __name__ == '__main__':
    # Initialize a pager to implement page flipping; %v in uri_split is the
    # placeholder filled with start, start+offset, start+2*offset, ... on each page turn
    pager = cl.StaticRedirectPager(uri="https://www.douban.com/doulist/893264/?start=0&sort=seq&playable=0&sub_type=",
                                   uri_split="https://www.douban.com/doulist/893264/?start=%v&sort=seq&playable=0&sub_type=",
                                   start=0,
                                   offset=25)
    
    # Initialize a selector to select the list element
    selector = cl.CssSelector(pattern=".doulist-item")
    
    # Initialize an analyzer to achieve linkage between pagers and selectors
    analyzer = cl.AnalyzerPrettify(pager, selector)
    res = []
    limit = 100
    # Iterate at most `limit` results from the analyzer
    for tr in analyzer(limit): 
        print(tr)
        res.append(tr)
    # If the list is exhausted before reaching the limit, fewer results are returned
    print(len(res))
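
Each result the analyzer yields is printed above as raw HTML. If you need structured fields, you can post-process the fragments with any HTML parser. The sketch below uses BeautifulSoup and assumes each yielded item is the prettified HTML string of one .doulist-item element; the .title a selector is a guess at the Douban markup, not part of the crawlist API.

from bs4 import BeautifulSoup

def parse_item(html: str) -> dict:
    # Parse one list-item HTML fragment into a small record
    soup = BeautifulSoup(html, "html.parser")
    # Hypothetical selector for the item's title link; adjust to the real markup
    title_tag = soup.select_one(".title a")
    return {
        "title": title_tag.get_text(strip=True) if title_tag else None,
        "link": title_tag.get("href") if title_tag else None,
    }

# `res` is the list collected in the demo above
rows = [parse_item(tr) for tr in res]
print(rows[:3])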

This is a dynamic website demo: the page uses JavaScript to load its data, so we need a Selenium WebDriver to execute the JavaScript.

import crawlist as cl

if __name__ == '__main__':
    # Initialize a pager to implement page flipping 
    pager = cl.DynamicScrollPager(uri="https://ec.ltn.com.tw/list/international")
    
    # Initialize a selector to select the list element
    selector = cl.CssSelector(pattern="#ec > div.content > section > div.whitecon.boxTitle.boxText > ul > li")
    
    # Initialize an analyzer to achieve linkage between pagers and selectors
    analyzer = cl.AnalyzerPrettify(pager=pager, selector=selector)
    res = []
    
    # Iterate at most 100 results from the analyzer
    for tr in analyzer(100):
        print(tr)
        res.append(tr)
    print(len(res))
    # After completion, close the webdriver; otherwise it will keep occupying memory
    pager.webdriver.quit()
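
Because the WebDriver holds on to memory until it is closed, it is safer to guarantee the cleanup even when crawling fails partway through. A minimal defensive variant of the loop above, using the same pager and analyzer:

res = []
try:
    # Same iteration as above, but protected so cleanup always runs
    for tr in analyzer(100):
        res.append(tr)
finally:
    # Always release the webdriver, even if the analyzer raised mid-iteration
    pager.webdriver.quit()
print(len(res))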

Documentation

If you would like more detailed documentation, follow the links below.

Chinese | English

Contributing

Please submit pull requests to the develop branch.

