icePick is an all-in-one library for easy scraping.


Concept

  • Lightweight scraping library

  • All-in-one package for easy scraping

Requirements

  • Python 3.4 or later (2.x is not supported)

  • MongoDB

Dependencies

  • aiohttp

  • beautifulsoup4

  • pymongo >= 3.0

  • nose

Usage

Scraping flow:

Define what to scrape (Order) -> Fetch the page (Picker) -> Parse the HTML (Parser) -> Save to the database (Recorder)
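
The flow above can be sketched in plain Python. This is only an illustration using hypothetical stand-in classes (SketchOrder, SketchParser, SketchRecorder, run_pipeline), not icePick's actual implementation: the HTTP request is replaced by a function that returns canned HTML, and the database by an in-memory list.

```python
# Hypothetical stand-ins that mirror the Order -> Picker -> Parser -> Recorder
# flow. None of these names are part of icePick's API.


class SketchParser:
    """Turns raw HTML into a dict (the Parser step)."""

    def __init__(self, html):
        self.html = html

    def serialize(self):
        # Toy "parse": just report the size of the document.
        return {"length": len(self.html)}


class SketchRecorder:
    """Stores parsed results (the Recorder step); a list replaces MongoDB."""

    saved = []

    @classmethod
    def save(cls, record):
        cls.saved.append(record)


class SketchOrder:
    """Describes one scraping job: which URL, which parser, which recorder."""

    parser = SketchParser
    recorder = SketchRecorder

    def __init__(self, url, ua=None):
        self.url = url
        self.ua = ua


def fake_fetch(url, ua=None):
    # Assumption: stands in for the real HTTP request so the sketch runs offline.
    return "<html><body>{}</body></html>".format(url)


def run_pipeline(orders, fetch=fake_fetch):
    """Plays the Picker role: fetch each order, parse it, record the result."""
    for order in orders:
        html = fetch(order.url, order.ua)
        record = order.parser(html).serialize()
        record["url"] = order.url
        order.recorder.save(record)


run_pipeline([SketchOrder("https://example.com/")])
print(SketchRecorder.saved)
```

Each order carries its own parser and recorder classes, so the loop that plays the Picker role does not need to know what is being scraped or where the result ends up.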

Example

Get the file names in my repository:

import icePick

db = icePick.get_database('icePick_example', 'localhost')


class GithubRepoParser(icePick.Parser):
    def serialize(self):
        result = {
            "files": [],
        }

        # Collect the text of every element with the "js-directory-link" class.
        for v in self.bs.find_all(class_="js-directory-link"):
            result['files'].append(v.text)
        return result


class GithubRepoRecorder(icePick.Recorder):
    struct = icePick.Structure(files=list())

    class Meta:
        database = db


class GithubRepoOrder(icePick.Order):
    recorder = GithubRepoRecorder
    parser = GithubRepoParser


def main():
    document = {
        'url': 'https://github.com/teitei-tk/ice-pick/tree/master',
        'ua': 'Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko',
    }

    print('---download start---')
    order = GithubRepoOrder(document.get('url'), document.get('ua'))
    picker = icePick.Picker([order])
    picker.run()
    print("---finish---")

if __name__ == "__main__":
    main()

Read the saved records back in an interactive session:

>>> import icePick
>>> db = icePick.get_database('icePick_example', 'localhost')
>>> class GithubRepoRecorder(icePick.Recorder):
...     struct = icePick.Structure(files=list())
...     class Meta:
...         database = db
...
>>> records = GithubRepoRecorder.find()
>>> records[0].files
['example', 'icePick', 'tests', 'LICENSE', 'README.md', 'circle.yml', 'requirements.txt']
>>>
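
To try the parsing step from the example above without a network connection or MongoDB, the link-text extraction can be approximated with the standard library. This sketch swaps BeautifulSoup for html.parser.HTMLParser; the ClassTextCollector class and the sample HTML are illustrative assumptions, not part of icePick and not GitHub's real markup.

```python
from html.parser import HTMLParser


class ClassTextCollector(HTMLParser):
    """Collects the text of <a> elements that carry a given CSS class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.texts = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        # The class attribute may hold several space-separated names.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "a" and self.class_name in classes:
            self._inside = True
            self.texts.append("")

    def handle_data(self, data):
        # Only keep text that appears inside a matching link.
        if self._inside:
            self.texts[-1] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self._inside = False


# Made-up markup that imitates a repository file listing.
SAMPLE_HTML = """
<table>
  <tr><td><a class="js-directory-link" href="/tree/icePick">icePick</a></td></tr>
  <tr><td><a class="js-directory-link" href="/blob/README.md">README.md</a></td></tr>
  <tr><td><a class="other" href="/about">About</a></td></tr>
</table>
"""

collector = ClassTextCollector("js-directory-link")
collector.feed(SAMPLE_HTML)
files = collector.texts
print(files)  # ['icePick', 'README.md']
```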

TODO

  • Crawling

  • Documentation

LICENSE

  • MIT
