Skip to main content

Py3 version

Project description

# gopage

## Overview

Gopage is a Google search page crawler. It provides concise APIs to download Google search results, and parse them into readable data structures.



## APIs

***Crawler*** is responsible for crawling Google search page, given a query sentence. ***Parser*** aims at parsing the html content of a Google search page into a Python list, in which each element is a dict consisting of 'title' and 'content' of the corresponding Google snippet.

Quick start:

```python
from gopage import crawler
from gopage import parser

gpage = crawler.search('Jie Tang') # My undergraduate advisor
snippets = parser.parse(gpage)
```

It should look like this:

```python
>>> from pprint import pprint
>>> pprint(snippets)
[{'content': 'Jie Tang (Tang, Jie). Associate Professor, IEEE Senior Member, '
'ACM Professional Member, CCF Distinguished Member. Knowledge '
'Engineering Lab (Group)',
'title': "Jie Tang (Tang, Jie) 's Homepage"},
{'content': 'Arnetminer: extraction and mining of academic social networks. J '
'Tang, J Zhang, L Yao, J Li, L Zhang, Z Su. Proceedings of the '
'14th ACM SIGKDD international\xa0...',
'title': 'Tang Jie - Google Scholar Citations'},
{'content': 'Jie Tang is an associate professor at Department of Computer '
'Science of Tsinghua University. He is known for the academic '
'social network search system\xa0...',
'title': 'Jie Tang - Wikipedia'},
{'content': 'Jie Tang, Yongqiang Sun, Shishu Yang, Yiyue Sun: Revisit the '
'Information Adoption Model by Exploring the Moderating Role of '
'Tie strength: a Perspective from\xa0...',
'title': 'dblp: Jie Tang'},
{'content': 'Jan 21, 2011 - Research. I am currently a third year computer '
'science Ph.D. student at UC Berkeley. My advisor is Pieter '
'Abbeel. I am interested in machine\xa0...',
'title': 'Jie Tang - University of California, Berkeley'},
{'content': 'TANG, Jie. Group Leader, Advanced Low-Dimensional Nanomaterials '
'Group, C4GR, National Institute for Materials Science. Email: '
'TANG.Jie nims.go.jp.',
'title': 'TANG, Jie | NIMS'},
{'content': 'Online shopping from a great selection at Books Store.',
'title': 'Amazon.com: LIU JIA JIE TANG REN WANG YI MING: Books'},
{'content': 'I obtained my Ph.D. degree from Tsinghua University in 2016, '
'advised by Jie Tang and Juanzi Li. During my Ph.D. career, I '
'have been visiting Cornell University\xa0...',
'title': 'Yang Yang - Zhejiang University'},
{'content': 'email email icon. Jie Tang Associate Professor of Medicine '
'(Clinical). Brown Affiliations. Medicine. Background. scroll to '
'property group menus. Background\xa0...',
'title': 'Tang, Jie - Researchers @ Brown - Brown University'},
{'content': 'Jie Tang. Tsinghua University. Beijing 100084, China '
'jietang@tsinghua.edu.cn. 1. Please share with us your view on '
'the history and important milestones of the\xa0...',
'title': 'A conversation with Professors Deyi Li and Jie Tang'}]
```

I also added a simple email address filter to parser. It helps you to findout all snippets containing email addresses.

```python
esnippets = parser.filt_email(snippets)

>>> pprint(esnippets)
[{'content': 'Jie Tang. Tsinghua University. Beijing 100084, China '
'jietang@tsinghua.edu.cn. 1. Please share with us your view on '
'the history and important milestones of the\xa0...',
'emails': ['jietang@tsinghua.edu.cn'],
'title': 'A conversation with Professors Deyi Li and Jie Tang'}]
```



## Signatures

***crawler.search(query, useproxy=True, verbose=True, maxtry=5, timeout=5)***

* query [str]: The query keywords. I'm only testing on English queries for now.
* useproxy [bool]: Whether to use a proxy pool to prevent being blocked.
* verbose [bool]: Whether to show current information, including proxy ip, target url, success or not and retry times.
* maxtry [int]: Max retry times.
* timeout [int]: Max waiting time, in seconds.
* @return gpage [str]

***parser.parse(gpage)***

* gpage [str]: The html content of a Google search page.
* @return snippets [list]

***parser.filt_email(snippets)***

* snippets [list]: Snippets extracted by parser.parse.
* @return snippets [list]



## Contact

Please feel free to let me know if you have any questions or suggestions. Have fun!

Author: Xiaotao Gu

Email: guxt1994@gmail.com

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

gopage-1.5.tar.gz (4.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

gopage-1.5-py3.6.egg (7.7 kB view details)

Uploaded Egg

File details

Details for the file gopage-1.5.tar.gz.

File metadata

  • Download URL: gopage-1.5.tar.gz
  • Upload date:
  • Size: 4.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for gopage-1.5.tar.gz
Algorithm Hash digest
SHA256 ecfafc2199a1faecaff12f4a951904a544fd7a565f9eeb3eb136c4c95ffccd25
MD5 feaa590bb8bfd1d254d9ddd27a7a0c94
BLAKE2b-256 abe07e023c8524bfd96a1c9789214bd1631ae801a3c39095e85afad61121a1cd

See more details on using hashes here.

File details

Details for the file gopage-1.5-py3.6.egg.

File metadata

  • Download URL: gopage-1.5-py3.6.egg
  • Upload date:
  • Size: 7.7 kB
  • Tags: Egg
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for gopage-1.5-py3.6.egg
Algorithm Hash digest
SHA256 969eb850ebbfbb5a1d61c8b8868f8b1183e66cf7ef40f573fa34a10e452d890d
MD5 e66635d91f559da402e6fcbd3d14ac58
BLAKE2b-256 89ee4ce177414e30b380f653c1660e4f44fccfaf70bd6a175a21ecd52f11188e

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page