icePick is an all-in-one package library for easy scraping
Concept
Lightweight scraping library
All-in-one package for easy scraping
Requirements
Python 3.4 or later (Python 2.x is not supported)
MongoDB
Dependencies
aiohttp
beautifulsoup4
pymongo >= 3.0
nose
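Before running the example, it can help to confirm that the interpreter and the libraries above are available. The following is a minimal sketch using only the standard library; the helper names `missing_modules` and `python_is_supported` are hypothetical and not part of icePick. Note that the beautifulsoup4 distribution is imported under the name `bs4`.

```python
import importlib.util
import sys

# Import names for the libraries listed above. The beautifulsoup4
# distribution is imported as "bs4".
REQUIRED_MODULES = ["aiohttp", "bs4", "pymongo", "nose"]


def missing_modules(names):
    """Return the module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_is_supported(version_info=sys.version_info):
    """Return True when the interpreter is Python 3.4 or later."""
    return tuple(version_info[:2]) >= (3, 4)


if __name__ == "__main__":
    print("Python OK:", python_is_supported())
    print("Missing modules:", missing_modules(REQUIRED_MODULES))
```

This only checks that the modules can be located; it does not verify the pymongo version or that a MongoDB server is reachable.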
Usage
Scraping flow:
Your scraping order (Order) -> Scrape (Picker) -> Parse HTML (Parser) -> Save to database (Recorder)
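The flow above can be pictured with a small standalone sketch. Everything here is a hypothetical stand-in written for illustration (`Order` as a namedtuple, `run_orders`, `fake_fetch`, `parse_title`); it is not icePick's implementation, it performs no network access, and it stores results in a plain list instead of MongoDB.

```python
from collections import namedtuple

# Hypothetical stand-in for illustration only; this is not icePick's Order class.
Order = namedtuple("Order", ["url", "ua", "parse", "record"])


def run_orders(orders, fetch):
    """Run each order through fetch -> parse -> record, in sequence."""
    for order in orders:
        html = fetch(order.url, order.ua)  # Picker step: do the scraping
        data = order.parse(html)           # Parser step: turn HTML into a dict
        order.record(data)                 # Recorder step: persist the result


def fake_fetch(url, ua):
    # Stub that avoids network access; a real picker would issue an HTTP request.
    return "<title>{}</title>".format(url)


def parse_title(html):
    # Extract the text between <title> and </title>.
    start = html.index("<title>") + len("<title>")
    end = html.index("</title>")
    return {"title": html[start:end]}


saved = []  # in-memory stand-in for a database collection
order = Order("https://example.com/", "example-agent", parse_title, saved.append)
run_orders([order], fake_fetch)
print(saved)
```

Keeping the parse and record steps attached to the order mirrors how the example below binds a parser and a recorder to each order class.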
Example
Get the file names in my repository:
import icePick

db = icePick.get_database('icePick_example', 'localhost')


class GithubRepoParser(icePick.Parser):
    def serialize(self):
        result = {
            "files": [],
        }
        for v in self.bs.find_all(class_="js-directory-link"):
            result['files'] += [v.text]
        return result


class GithubRepoRecorder(icePick.Recorder):
    struct = icePick.Structure(files=list())

    class Meta:
        database = db


class GithubRepoOrder(icePick.Order):
    recorder = GithubRepoRecorder
    parser = GithubRepoParser


def main():
    document = {
        'url': 'https://github.com/teitei-tk/ice-pick/tree/master',
        'ua': 'Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko',
    }

    print('---download start---')
    order = GithubRepoOrder(document.get('url'), document.get('ua'))
    picker = icePick.Picker([order])
    picker.run()
    print("---finish---")


if __name__ == "__main__":
    main()
>>> import icePick
>>> db = icePick.get_database('icePick_example', 'localhost')
>>> class GithubRepoRecorder(icePick.Recorder):
...     struct = icePick.Structure(files=list())
...     class Meta:
...         database = db
...
>>> records = GithubRepoRecorder.find()
>>> records[0].files
['example', 'icePick', 'tests', 'LICENSE', 'README.md', 'circle.yml', 'requirements.txt']
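To see what the `serialize` step in the example collects without running a scrape, here is a standard-library sketch that gathers the text of every element carrying a given class. It uses `html.parser` rather than BeautifulSoup, so it only approximates `find_all(class_=...)`; the `ClassTextCollector` class and the sample HTML are invented for illustration.

```python
from html.parser import HTMLParser


class ClassTextCollector(HTMLParser):
    """Collect the text of every element carrying a given CSS class.

    Simplification: assumes elements inside a match have explicit
    closing tags (no bare void tags such as <br>).
    """

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.texts = []
        self._depth = 0  # greater than 0 while inside a matching element

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self._depth or self.class_name in classes:
            self._depth += 1
            if self._depth == 1:
                self.texts.append("")  # start a new text entry

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.texts[-1] += data


# Invented sample markup; a real page's structure may differ.
SAMPLE_HTML = """
<table>
  <tr><td><a class="js-directory-link" href="#">icePick</a></td></tr>
  <tr><td><a class="js-directory-link" href="#">README.md</a></td></tr>
  <tr><td><a class="other" href="#">ignored</a></td></tr>
</table>
"""

collector = ClassTextCollector("js-directory-link")
collector.feed(SAMPLE_HTML)
result = {"files": collector.texts}
print(result)
```

The resulting dictionary has the same shape as the one returned by `serialize` in the example: a single `files` key holding a list of names.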
TODO
Crawling
Documentation
LICENSE
MIT
Download files
Source Distribution
icePick-0.0.3.tar.gz (6.3 kB)
File metadata
- Download URL: icePick-0.0.3.tar.gz
- Upload date:
- Size: 6.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest
---|---
SHA256 | 1995e0437b730972e66775dafeaccfe433045b0164f1e8bf7a5be2ef3c43d0d7
MD5 | 609951c0d60ac17381ec1c57692cd11c
BLAKE2b-256 | 1aec168a9bc7d32960b4918f7fe6cf36ba40d26dfe532926b5ee096482f83367