icePick is an all-in-one library for easy scraping


Concept

  • Lightweight Scraping Library

  • All-in-one package for easy scraping

Requirements

  • Python 3.4 or later (2.x is not supported)

  • MongoDB

Dependencies

  • aiohttp

  • beautifulsoup4

  • pymongo >= 3.0

  • nose
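
The requirements above can be checked before running anything. The following is a minimal sketch, not part of icePick: the helper names `missing_modules` and `python_is_supported` are hypothetical, and the check only tests that each module can be found, so it does not verify the `pymongo >= 3.0` version constraint or that a MongoDB server is running.

```python
import importlib.util
import sys

# Import names for the packages listed above. Note that the
# beautifulsoup4 distribution is imported as "bs4".
REQUIRED_MODULES = ["aiohttp", "bs4", "pymongo", "nose"]


def missing_modules(names):
    """Return the module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_is_supported(version_info=sys.version_info):
    """Return True when the interpreter is Python 3.4 or later."""
    return tuple(version_info[:2]) >= (3, 4)


if __name__ == "__main__":
    print("Python OK:", python_is_supported())
    print("Missing modules:", missing_modules(REQUIRED_MODULES))
```

An empty "Missing modules" list means every dependency is importable in the current environment.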

Usage

Scraping flow:

Define the scraping order (Order) -> Run the scraping (Picker) -> Parse the HTML (Parser) -> Save to the database (Recorder)
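
To make the flow concrete, here is a simplified, self-contained sketch of the same four stages. It does not use icePick and is not how icePick is implemented: the classes below are hypothetical stand-ins that only borrow the stage names, the fetch step is faked so the code runs offline, and a plain list takes the place of MongoDB.

```python
class Order:
    """Describes one scraping job: what to fetch and how to handle it."""

    def __init__(self, url, parse, record):
        self.url = url
        self.parse = parse      # callable: HTML text -> dict
        self.record = record    # callable: dict -> None


class Picker:
    """Runs each order through fetch -> parse -> record."""

    def __init__(self, orders, fetch):
        self.orders = orders
        self.fetch = fetch      # callable: URL -> HTML text

    def run(self):
        for order in self.orders:
            html = self.fetch(order.url)
            data = order.parse(html)
            order.record(data)


def fake_fetch(url):
    # Stand-in for a real HTTP request so the sketch runs offline.
    return "<title>{}</title>".format(url)


def title_parser(html):
    # Toy parser: return the text between the <title> tags.
    start = html.index("<title>") + len("<title>")
    end = html.index("</title>")
    return {"title": html[start:end]}


def run_demo():
    saved = []  # stand-in for a database collection
    order = Order("http://example.invalid/", title_parser, saved.append)
    Picker([order], fake_fetch).run()
    return saved


if __name__ == "__main__":
    print(run_demo())
```

Keeping the parse and record steps on the order means one picker can process orders for different sites, each with its own parser and storage.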

Example

Get the file names in my repository:

import icePick

db = icePick.get_database('icePick_example', 'localhost')


class GithubRepoParser(icePick.Parser):
    def serialize(self):
        result = {
            "files": [],
        }

        # Collect the text of every file and directory link on the page.
        for v in self.bs.find_all(class_="js-directory-link"):
            result['files'].append(v.text)
        return result


class GithubRepoRecorder(icePick.Recorder):
    struct = icePick.Structure(files=list())

    class Meta:
        database = db


class GithubRepoOrder(icePick.Order):
    recorder = GithubRepoRecorder
    parser = GithubRepoParser


def main():
    document = {
        'url': 'https://github.com/teitei-tk/ice-pick/tree/master',
        'ua': 'Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko',
    }

    print('---download start---')
    order = GithubRepoOrder(document.get('url'), document.get('ua'))
    picker = icePick.Picker([order])
    picker.run()
    print("---finish---")

if __name__ == "__main__":
    main()

After the run, the saved record can be read back from an interactive session:

>>> import icePick
>>> db = icePick.get_database('icePick_example', 'localhost')
>>> class GithubRepoRecorder(icePick.Recorder):
...     struct = icePick.Structure(files=list())
...     class Meta:
...         database = db
...
>>> records = GithubRepoRecorder.find()
>>> records[0].files
['example', 'icePick', 'tests', 'LICENSE', 'README.md', 'circle.yml', 'requirements.txt']
>>>
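
The parsing step in `GithubRepoParser` can be tried offline, without a network connection or MongoDB. The sketch below is an approximation under stated assumptions: it uses the standard library's `html.parser` in place of BeautifulSoup's `find_all(class_=...)`, the `ClassTextCollector` class and `extract_files` function are hypothetical names, and the sample HTML is made up for illustration rather than copied from GitHub.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; skipped when tracking depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ClassTextCollector(HTMLParser):
    """Collect the text of every element carrying a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0      # > 0 while inside a matching element
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1             # nested tag inside a match
        elif self.css_class in classes:
            self.depth = 1              # entering a matching element
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.texts[-1] += data


# Made-up sample markup; only the class name comes from the example above.
SAMPLE_HTML = """
<table>
  <tr><td><a class="js-directory-link" href="#">icePick</a></td></tr>
  <tr><td><a class="js-directory-link js-navigation-open" href="#">README.md</a></td></tr>
  <tr><td><a class="other" href="#">ignored</a></td></tr>
</table>
"""


def extract_files(html):
    """Return the same shape of dict that serialize() builds."""
    collector = ClassTextCollector("js-directory-link")
    collector.feed(html)
    collector.close()
    return {"files": collector.texts}


if __name__ == "__main__":
    print(extract_files(SAMPLE_HTML))
```

The result has the same `{"files": [...]}` shape that the recorder's `Structure(files=list())` expects.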

TODO

  • Crawling

  • Documentation

LICENSE

  • MIT
