gopage

Py3 version

# gopage

## Overview

Gopage is a crawler for Google search result pages. It provides a concise API for downloading Google search results and parsing them into readable data structures.

## APIs

***Crawler*** fetches the Google search page for a given query. ***Parser*** parses the HTML content of a Google search page into a Python list in which each element is a dict holding the 'title' and 'content' of the corresponding Google snippet.

Quick start:

```python
from gopage import crawler
from gopage import parser

gpage = crawler.search('Jie Tang') # My undergraduate advisor
snippets = parser.parse(gpage)
```

The parsed snippets should look like this:

```python
>>> from pprint import pprint
>>> pprint(snippets)
[{'content': 'Jie Tang (Tang, Jie). Associate Professor, IEEE Senior Member, '
'ACM Professional Member, CCF Distinguished Member. Knowledge '
'Engineering Lab (Group)',
'title': "Jie Tang (Tang, Jie) 's Homepage"},
{'content': 'Arnetminer: extraction and mining of academic social networks. J '
'Tang, J Zhang, L Yao, J Li, L Zhang, Z Su. Proceedings of the '
'14th ACM SIGKDD international\xa0...',
'title': 'Tang Jie - Google Scholar Citations'},
{'content': 'Jie Tang is an associate professor at Department of Computer '
'Science of Tsinghua University. He is known for the academic '
'social network search system\xa0...',
'title': 'Jie Tang - Wikipedia'},
{'content': 'Jie Tang, Yongqiang Sun, Shishu Yang, Yiyue Sun: Revisit the '
'Information Adoption Model by Exploring the Moderating Role of '
'Tie strength: a Perspective from\xa0...',
'title': 'dblp: Jie Tang'},
{'content': 'Jan 21, 2011 - Research. I am currently a third year computer '
'science Ph.D. student at UC Berkeley. My advisor is Pieter '
'Abbeel. I am interested in machine\xa0...',
'title': 'Jie Tang - University of California, Berkeley'},
{'content': 'TANG, Jie. Group Leader, Advanced Low-Dimensional Nanomaterials '
'Group, C4GR, National Institute for Materials Science. Email: '
'TANG.Jie nims.go.jp.',
'title': 'TANG, Jie | NIMS'},
{'content': 'Online shopping from a great selection at Books Store.',
'title': 'Amazon.com: LIU JIA JIE TANG REN WANG YI MING: Books'},
{'content': 'I obtained my Ph.D. degree from Tsinghua University in 2016, '
'advised by Jie Tang and Juanzi Li. During my Ph.D. career, I '
'have been visiting Cornell University\xa0...',
'title': 'Yang Yang - Zhejiang University'},
{'content': 'email email icon. Jie Tang Associate Professor of Medicine '
'(Clinical). Brown Affiliations. Medicine. Background. scroll to '
'property group menus. Background\xa0...',
'title': 'Tang, Jie - Researchers @ Brown - Brown University'},
{'content': 'Jie Tang. Tsinghua University. Beijing 100084, China '
'jietang@tsinghua.edu.cn. 1. Please share with us your view on '
'the history and important milestones of the\xa0...',
'title': 'A conversation with Professors Deyi Li and Jie Tang'}]
```
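
Because the parsed result is a plain list of dicts, ordinary list comprehensions are enough to work with it. The sketch below assumes only the 'title' and 'content' keys shown above; the helper names `titles` and `containing` are hypothetical and not part of gopage, and the sample entries are shortened copies of the output above.

```python
# Sample data in the shape shown above (entries shortened for brevity).
snippets = [
    {'title': 'Jie Tang - Wikipedia',
     'content': 'Jie Tang is an associate professor at Department of '
                'Computer Science of Tsinghua University.'},
    {'title': 'dblp: Jie Tang',
     'content': 'Jie Tang, Yongqiang Sun, Shishu Yang, Yiyue Sun: Revisit '
                'the Information Adoption Model'},
    {'title': 'TANG, Jie | NIMS',
     'content': 'TANG, Jie. Group Leader, Advanced Low-Dimensional '
                'Nanomaterials Group'},
]


def titles(snippets):
    """Return the title of every snippet, in order."""
    return [s['title'] for s in snippets]


def containing(snippets, keyword):
    """Return snippets whose title or content mentions keyword (case-insensitive)."""
    needle = keyword.lower()
    return [s for s in snippets
            if needle in s['title'].lower() or needle in s['content'].lower()]


tsinghua = containing(snippets, 'tsinghua')
```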

I also added a simple email address filter to the parser. It helps you find all snippets that contain email addresses.

```python
>>> esnippets = parser.filt_email(snippets)
>>> pprint(esnippets)
[{'content': 'Jie Tang. Tsinghua University. Beijing 100084, China '
'jietang@tsinghua.edu.cn. 1. Please share with us your view on '
'the history and important milestones of the\xa0...',
'emails': ['jietang@tsinghua.edu.cn'],
'title': 'A conversation with Professors Deyi Li and Jie Tang'}]
```
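
For readers curious how such a filter might work, here is a minimal sketch that produces output of the same shape. It is not gopage's implementation: the function name `filter_email_snippets` and the regular expression are assumptions made for illustration, and the pattern is deliberately simple rather than a full address validator.

```python
import re

# A deliberately simple pattern; an assumption for illustration, not gopage's own.
EMAIL_RE = re.compile(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+')


def filter_email_snippets(snippets):
    """Keep snippets whose content has at least one email address.

    Each kept snippet is copied and given an extra 'emails' list,
    so the input list and its dicts are left unchanged.
    """
    result = []
    for snippet in snippets:
        emails = EMAIL_RE.findall(snippet['content'])
        if emails:
            result.append(dict(snippet, emails=emails))
    return result
```

Copying each dict instead of mutating it keeps the original snippets reusable; whether gopage itself copies or mutates is not stated in this README.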

## Signatures

***crawler.search(query, useproxy=True, verbose=True, maxtry=5, timeout=5)***

* query [str]: The query keywords. Only English queries have been tested so far.
* useproxy [bool]: Whether to use a proxy pool to avoid being blocked.
* verbose [bool]: Whether to print progress information, including the proxy IP, the target URL, whether the request succeeded, and the number of retries.
* maxtry [int]: Maximum number of retries.
* timeout [int]: Maximum waiting time, in seconds.
* @return gpage [str]: The html content of the Google search page.
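
To make the interplay of `maxtry`, `timeout` and `verbose` concrete, here is a small retry loop written under stated assumptions. It is not gopage's code: `fetch_with_retries` and the `fetch` callable are hypothetical, and treating `maxtry` as the total number of attempts is one plausible reading of the parameter, not a documented fact.

```python
def fetch_with_retries(fetch, url, maxtry=5, timeout=5, verbose=True):
    """Call fetch(url, timeout) until it succeeds or maxtry attempts are used up.

    `fetch` is a hypothetical callable that returns the page as a str
    and raises an exception on failure.
    """
    last_error = None
    for attempt in range(1, maxtry + 1):
        try:
            page = fetch(url, timeout)
        except Exception as error:
            # Remember the failure and try again on the next iteration.
            last_error = error
            if verbose:
                print('attempt %d/%d failed: %s' % (attempt, maxtry, error))
        else:
            if verbose:
                print('attempt %d/%d succeeded' % (attempt, maxtry))
            return page
    raise RuntimeError('giving up after %d attempts' % maxtry) from last_error
```

Passing the fetcher in as an argument keeps the loop independent of any particular HTTP library or proxy pool.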

***parser.parse(gpage)***

* gpage [str]: The html content of a Google search page.
* @return snippets [list]: A list of dicts, each with the 'title' and 'content' of one snippet.

***parser.filt_email(snippets)***

* snippets [list]: Snippets extracted by parser.parse.
* @return snippets [list]: The snippets that contain email addresses, each with an added 'emails' list.

## Contact

Please feel free to let me know if you have any questions or suggestions. Have fun!

Author: Xiaotao Gu

Email: guxt1994@gmail.com

Project details

Release history

* 2.6.3: Mar 23, 2017
* 2.6.2: Mar 22, 2017
* 2.6.1: Mar 22, 2017
* 2.6: Mar 18, 2017
* 2.2: Mar 15, 2017
* 2.1: Mar 15, 2017
* 1.5 (this version): Mar 15, 2017
* 1.3: Mar 15, 2017
* 1.2: Mar 15, 2017
* 1.1: Apr 20, 2016
* 1.0: Apr 12, 2016

Download files

Source Distribution

gopage-1.5.tar.gz (4.4 kB), uploaded Mar 15, 2017

Built Distribution

gopage-1.5-py3.6.egg (7.7 kB, Egg), uploaded Mar 15, 2017

File details

Details for the file gopage-1.5.tar.gz.

File metadata

* Download URL: gopage-1.5.tar.gz
* Upload date: Mar 15, 2017
* Size: 4.4 kB
* Tags: Source
* Uploaded using Trusted Publishing? No

File hashes

Hashes for gopage-1.5.tar.gz

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `ecfafc2199a1faecaff12f4a951904a544fd7a565f9eeb3eb136c4c95ffccd25` |
| MD5 | `feaa590bb8bfd1d254d9ddd27a7a0c94` |
| BLAKE2b-256 | `abe07e023c8524bfd96a1c9789214bd1631ae801a3c39095e85afad61121a1cd` |

File details

Details for the file gopage-1.5-py3.6.egg.

File metadata

* Download URL: gopage-1.5-py3.6.egg
* Upload date: Mar 15, 2017
* Size: 7.7 kB
* Tags: Egg
* Uploaded using Trusted Publishing? No

File hashes

Hashes for gopage-1.5-py3.6.egg

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `969eb850ebbfbb5a1d61c8b8868f8b1183e66cf7ef40f573fa34a10e452d890d` |
| MD5 | `e66635d91f559da402e6fcbd3d14ac58` |
| BLAKE2b-256 | `89ee4ce177414e30b380f653c1660e4f44fccfaf70bd6a175a21ecd52f11188e` |
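
A downloaded file can be checked against the SHA256 digests listed above with Python's standard `hashlib` module. The sketch below is one way to do that; the helper names `sha256_of` and `matches_sha256` are hypothetical, and the local file path is whatever you saved the download as.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of the file at path, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha256(path, expected_hex):
    """Return True if the file's SHA256 digest equals expected_hex (case-insensitive)."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches_sha256('gopage-1.5.tar.gz', 'ecfafc2199a1faecaff12f4a951904a544fd7a565f9eeb3eb136c4c95ffccd25')` should return `True` for an intact copy of the source distribution.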
