Simple, powerful and pythonic web page search results crawler.
PageFlow
PageFlow is a Python (2 and 3) library for crawling web search results. It provides a simple API and supports the Google, Baidu, and Bing search engines. [https://pypi.org/project/pageflow/]
Features
- Supports a pages argument, so you can fetch several result pages instead of only the first.
- Supports extracting information from the pages that search results redirect to.
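To illustrate what a pages argument involves, here is a minimal sketch of how a page count could be turned into one request URL per result page. It is not PageFlow's implementation: the helper `page_urls`, the `ENGINES` table, and the page size of 10 are all assumptions, and the offset parameters (`start`, `pn`, `first`) are taken from the engines' public URL formats.

```python
from urllib.parse import urlencode

# Assumed URL layout per engine: (base URL, query key, offset key, first offset).
# These reflect the engines' public URL formats, not PageFlow's internals.
ENGINES = {
    "google": ("https://www.google.com/search", "q", "start", 0),
    "baidu": ("https://www.baidu.com/s", "wd", "pn", 0),
    "bing": ("https://www.bing.com/search", "q", "first", 1),
}
RESULTS_PER_PAGE = 10  # assumed default page size


def page_urls(engine, query, pages):
    """Return one search URL per result page (hypothetical helper)."""
    base, query_key, offset_key, first_offset = ENGINES[engine]
    urls = []
    for page in range(pages):
        offset = first_offset + page * RESULTS_PER_PAGE
        params = {query_key: query, offset_key: offset}
        urls.append(base + "?" + urlencode(params))
    return urls


for url in page_urls("baidu", "python", pages=2):
    print(url)
```

Each URL would then be fetched and parsed separately, which is why the number of requests grows with the pages value.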
Installation
1. using pip
pip install pageflow
2. using setup.py
git clone https://github.com/Lapis-Hong/PageFlow.git
cd PageFlow
python setup.py install
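After either installation method, a quick way to confirm that the package is importable is to look it up with the standard library. This is a small sketch; the helper name `is_installed` is hypothetical and not part of PageFlow.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None


# Prints True once the pageflow package is installed in this environment.
print(is_installed("pageflow"))
```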
Usage
from pageflow import PageFlow
query = "python"
pages = 1  # number of result pages to fetch
pf = PageFlow("baidu", proxies=None)
# Get search page html.
html = pf.get_html(query=query, pages=pages)
# Each of the following calls returns a generator of SearchResult objects.
# Get search result urls.
url = pf.get_url(query=query, pages=pages)
# Get search result titles.
title = pf.get_title(query=query, pages=pages)
# Get search result abstracts.
abstract = pf.get_abstract(query=query, pages=pages)
# Get search result redirect html.
redirect_html = pf.get_redirect_html(query=query, pages=pages)
# Get search result redirect content.
redirect_content = pf.get_redirect_content(query=query, pages=pages)
# Get search result title, abstract and url.
result = pf.get(query=query, pages=pages)
# Get search result title, abstract, url, redirect html and redirect content.
result_all = pf.get_all(query=query, pages=pages)
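Because the calls above return generators, results are produced lazily and can be iterated only once. The sketch below shows that behaviour with a stand-in generator, `fake_results`, which is hypothetical and yields plain strings; real PageFlow generators yield SearchResult objects whose attributes are not described here.

```python
from itertools import islice


def fake_results(pages):
    """Stand-in for a PageFlow result generator (hypothetical)."""
    for page in range(1, pages + 1):
        for rank in range(1, 4):  # pretend each page has three results
            yield "result {} on page {}".format(rank, page)


results = fake_results(pages=2)

# Take only the first two results without consuming the rest.
first_two = list(islice(results, 2))

# Continuing to iterate picks up where islice stopped.
remaining = list(results)

print(first_two)
print(len(remaining))
```

If the results are needed more than once, wrapping the generator in `list(...)` stores them, at the cost of fetching everything up front.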
References
- https://github.com/howie6879/magic_google
- https://github.com/meibenjin/GoogleSearchCrawler
- https://github.com/chrislinan/cx-extractor-python