Py3 version

# gopage

## Overview

Gopage is a Google search page crawler. It provides a concise API for downloading Google search results and parsing them into readable data structures.



## APIs

***Crawler*** fetches the Google search results page for a given query. ***Parser*** parses the HTML content of that page into a Python list in which each element is a dict holding the 'title' and 'content' of one Google snippet.

Quick start:

```python
from gopage import crawler
from gopage import parser

gpage = crawler.search('Jie Tang') # My undergraduate advisor
snippets = parser.parse(gpage)
```

The parsed result should look like this:

```python
>>> from pprint import pprint
>>> pprint(snippets)
[{'content': 'Jie Tang (Tang, Jie). Associate Professor, IEEE Senior Member, '
             'ACM Professional Member, CCF Distinguished Member. Knowledge '
             'Engineering Lab (Group)',
  'title': "Jie Tang (Tang, Jie) 's Homepage"},
 {'content': 'Arnetminer: extraction and mining of academic social networks. J '
             'Tang, J Zhang, L Yao, J Li, L Zhang, Z Su. Proceedings of the '
             '14th ACM SIGKDD international\xa0...',
  'title': 'Tang Jie - Google Scholar Citations'},
 {'content': 'Jie Tang is an associate professor at Department of Computer '
             'Science of Tsinghua University. He is known for the academic '
             'social network search system\xa0...',
  'title': 'Jie Tang - Wikipedia'},
 {'content': 'Jie Tang, Yongqiang Sun, Shishu Yang, Yiyue Sun: Revisit the '
             'Information Adoption Model by Exploring the Moderating Role of '
             'Tie strength: a Perspective from\xa0...',
  'title': 'dblp: Jie Tang'},
 {'content': 'Jan 21, 2011 - Research. I am currently a third year computer '
             'science Ph.D. student at UC Berkeley. My advisor is Pieter '
             'Abbeel. I am interested in machine\xa0...',
  'title': 'Jie Tang - University of California, Berkeley'},
 {'content': 'TANG, Jie. Group Leader, Advanced Low-Dimensional Nanomaterials '
             'Group, C4GR, National Institute for Materials Science. Email: '
             'TANG.Jie nims.go.jp.',
  'title': 'TANG, Jie | NIMS'},
 {'content': 'Online shopping from a great selection at Books Store.',
  'title': 'Amazon.com: LIU JIA JIE TANG REN WANG YI MING: Books'},
 {'content': 'I obtained my Ph.D. degree from Tsinghua University in 2016, '
             'advised by Jie Tang and Juanzi Li. During my Ph.D. career, I '
             'have been visiting Cornell University\xa0...',
  'title': 'Yang Yang - Zhejiang University'},
 {'content': 'email email icon. Jie Tang Associate Professor of Medicine '
             '(Clinical). Brown Affiliations. Medicine. Background. scroll to '
             'property group menus. Background\xa0...',
  'title': 'Tang, Jie - Researchers @ Brown - Brown University'},
 {'content': 'Jie Tang. Tsinghua University. Beijing 100084, China '
             'jietang@tsinghua.edu.cn. 1. Please share with us your view on '
             'the history and important milestones of the\xa0...',
  'title': 'A conversation with Professors Deyi Li and Jie Tang'}]
```
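Because each snippet is a plain dict, ordinary Python is enough for post-processing. The sketch below is not part of gopage: `clean_snippet` is a hypothetical helper, and `sample_snippets` copies two entries from the output above. It assumes that a trailing `\xa0...` means Google truncated the snippet text.

```python
# Two entries copied from the sample output above.
sample_snippets = [
    {'title': 'Jie Tang - Wikipedia',
     'content': 'Jie Tang is an associate professor at Department of Computer '
                'Science of Tsinghua University. He is known for the academic '
                'social network search system\xa0...'},
    {'title': 'dblp: Jie Tang',
     'content': 'Jie Tang, Yongqiang Sun, Shishu Yang, Yiyue Sun: Revisit the '
                'Information Adoption Model by Exploring the Moderating Role of '
                'Tie strength: a Perspective from\xa0...'},
]


def clean_snippet(snippet):
    """Hypothetical helper: normalise whitespace and flag truncated snippets."""
    # Replace non-breaking spaces with regular spaces.
    content = snippet['content'].replace('\xa0', ' ')
    # Assumption: a trailing ellipsis marks text cut off by Google.
    truncated = content.endswith('...')
    if truncated:
        content = content[:-3].rstrip()
    return {'title': snippet['title'], 'content': content, 'truncated': truncated}


cleaned = [clean_snippet(s) for s in sample_snippets]
for item in cleaned:
    print(item['title'], '| truncated:', item['truncated'])
```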

The parser also includes a simple email address filter. It returns all snippets that contain email addresses.

```python
>>> esnippets = parser.filt_email(snippets)
>>> pprint(esnippets)
[{'content': 'Jie Tang. Tsinghua University. Beijing 100084, China '
             'jietang@tsinghua.edu.cn. 1. Please share with us your view on '
             'the history and important milestones of the\xa0...',
  'emails': ['jietang@tsinghua.edu.cn'],
  'title': 'A conversation with Professors Deyi Li and Jie Tang'}]
```
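For readers curious how such a filter might work, here is a minimal sketch built on a regular expression. It is not gopage's actual implementation: `EMAIL_RE` and `filter_email_snippets` are hypothetical names, and the pattern is deliberately simple, so it will miss obfuscated addresses such as `TANG.Jie nims.go.jp` in the output above.

```python
import re

# Assumption: a simple pattern is good enough for snippet text.
# It requires an "@" and at least one dot in the domain part.
EMAIL_RE = re.compile(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+')


def filter_email_snippets(snippets):
    """Hypothetical stand-in for parser.filt_email, for illustration only."""
    matched = []
    for snippet in snippets:
        emails = EMAIL_RE.findall(snippet['content'])
        if emails:
            # Copy the snippet and attach the addresses under 'emails'.
            matched.append({**snippet, 'emails': emails})
    return matched


sample = [
    {'title': 'TANG, Jie | NIMS',
     'content': 'Group Leader, National Institute for Materials Science. '
                'Email: TANG.Jie nims.go.jp.'},
    {'title': 'A conversation with Professors Deyi Li and Jie Tang',
     'content': 'Jie Tang. Tsinghua University. Beijing 100084, China '
                'jietang@tsinghua.edu.cn. 1. Please share with us your view'},
]

with_emails = filter_email_snippets(sample)
print(with_emails[0]['emails'])
```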



## Signatures

***crawler.search(query, useproxy=True, verbose=True, maxtry=5, timeout=5)***

* query [str]: The query keywords. Only English queries have been tested so far.
* useproxy [bool]: Whether to use a proxy pool to avoid being blocked.
* verbose [bool]: Whether to print progress information: the proxy IP, the target URL, whether the request succeeded, and the retry count.
* maxtry [int]: Maximum number of retries.
* timeout [int]: Maximum time to wait, in seconds.
* @return gpage [str]: The HTML content of the Google search page.

***parser.parse(gpage)***

* gpage [str]: The html content of a Google search page.
* @return snippets [list]: A list of dicts, each with 'title' and 'content' keys.

***parser.filt_email(snippets)***

* snippets [list]: Snippets extracted by parser.parse.
* @return snippets [list]: The snippets that contain email addresses, each with an added 'emails' list.
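The three calls above chain naturally into a small pipeline. The sketch below shows one way to combine them. `emails_for_query` is a hypothetical helper, not part of gopage. It takes the `crawler` and `parser` modules as arguments so that it can be tried with stand-ins, and it uses only the keyword arguments listed in the signatures above.

```python
def emails_for_query(query, crawler, parser, useproxy=True, maxtry=5, timeout=5):
    """Hypothetical helper: search, parse, filter, and collect unique emails.

    `crawler` and `parser` are expected to be the gopage modules (or objects
    exposing the same documented functions).
    """
    # Download the search page, with progress output turned off.
    gpage = crawler.search(query, useproxy=useproxy, verbose=False,
                           maxtry=maxtry, timeout=timeout)
    # Turn the HTML into a list of {'title', 'content'} dicts.
    snippets = parser.parse(gpage)
    # Keep only snippets with email addresses and gather the addresses.
    found = set()
    for snippet in parser.filt_email(snippets):
        found.update(snippet['emails'])
    return sorted(found)


# Intended usage, assuming gopage is installed and Google is reachable:
#   from gopage import crawler, parser
#   print(emails_for_query('Jie Tang', crawler, parser))
```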



## Contact

Please feel free to let me know if you have any questions or suggestions. Have fun!

Author: Xiaotao Gu

Email: guxt1994@gmail.com
