Python wrapper for Prompt API's Scraper API

Project description

Prompt API - Scraper API - Python Package

`pa-scraper` is a Python wrapper for the Scraper API, with a little extra cream and sugar.
Requirements

- You need to sign up for Prompt API
- You need to subscribe to the Scraper API; the test drive is free!
- You need to set the `PROMPTAPI_TOKEN` environment variable after subscription.

Then:

```bash
$ pip install pa-scraper
```
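Exporting the token in your shell could look like the following (the token value below is a placeholder, not a real credential):

```shell
# Placeholder token, replace with the token from your Prompt API dashboard
export PROMPTAPI_TOKEN="your-promptapi-token"
```

Add the line to your shell profile (e.g. `~/.bashrc`) to make it persistent.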
Example Usage

Examples can be found in the repository's `examples/` directory.
```python
# examples/fetch.py

from scraper import Scraper

url = 'https://pypi.org/classifiers/'
scraper = Scraper(url)

response = scraper.get()

if response.get('error', None):
    # response['error'] returns the error message
    # response['status'] returns the http status code
    # Example: {'error': 'Not Found', 'status': 404}
    print(response)  # noqa: T001
else:
    data = response['result']['data']
    headers = response['result']['headers']
    url = response['result']['url']
    status = response['status']

    # print(data)  # print fetched html, will be long :)

    print(headers)  # noqa: T001
    # {'Content-Length': '321322', 'Content-Type': 'text/html; charset=UTF-8', ... }

    print(status)  # noqa: T001
    # 200

    save_result = scraper.save('/tmp/my-data.html')  # noqa: S108
    if save_result.get('error', None):
        # a save error occurred...
        # add your code here...
        pass

    print(save_result)  # noqa: T001
    # {'file': '/tmp/my-data.html', 'size': 321322}
```
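The branching above can be wrapped in a small helper. This is only a sketch based on the response shape shown in the example; `unpack` is a hypothetical name, not part of `pa-scraper`:

```python
def unpack(response):
    """Split a scraper response dict into (data, status, error).

    Hypothetical helper: assumes the dict shapes shown in the
    example above (error responses carry 'error' + 'status',
    successful ones carry 'result' + 'status').
    """
    if response.get('error'):
        return None, response.get('status'), response['error']
    result = response['result']
    return result['data'], response['status'], None


# Both shapes from the example above:
ok = {'result': {'data': '<html>...</html>', 'headers': {}, 'url': 'https://pypi.org/classifiers/'}, 'status': 200}
err = {'error': 'Not Found', 'status': 404}
print(unpack(ok)[1])   # 200
print(unpack(err)[2])  # Not Found
```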
You can add URL parameters for extra operations. Valid parameters are:

- `auth_password`: HTTP Realm auth password
- `auth_username`: HTTP Realm auth username
- `cookie`: URL-encoded cookie header
- `country`: 2-character country code, if you wish to scrape from an IP address of a specific country
- `referer`: HTTP referer header
- `selector`: CSS-style selector path such as `a.btn div li`. If `selector` is used, the result will be a collection of matched data and the saved file will be in `.json` format.
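These options are plain query-string parameters. For illustration, Python's standard `urllib.parse.urlencode` shows how such a dict serializes (this is not necessarily what `pa-scraper` does internally):

```python
from urllib.parse import urlencode

# Example parameter dict using two of the options listed above;
# spaces in the selector are encoded as '+'.
params = {'country': 'EE', 'selector': 'a.btn div li'}
print(urlencode(params))  # country=EE&selector=a.btn+div+li
```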
Here is an example using URL parameters and `selector`:
```python
# examples/fetch_with_params.py

from scraper import Scraper

url = 'https://pypi.org/classifiers/'
scraper = Scraper(url)

fetch_params = dict(country='EE', selector='ul li button[data-clipboard-text]')
response = scraper.get(params=fetch_params)

if response.get('error', None):
    # response['error'] returns the error message
    # response['status'] returns the http status code
    # Example: {'error': 'Not Found', 'status': 404}
    print(response)  # noqa: T001
else:
    data = response['result']['data']
    headers = response['result']['headers']
    url = response['result']['url']
    status = response['status']

    # print(data)  # noqa: T001
    # ['<button class="button button--small margin-top margin-bottom copy-tooltip copy-tooltip-w" ...\n', ]

    print(len(data))  # noqa: T001
    # 734
    # we have an array...

    print(headers)  # noqa: T001
    # {'Content-Length': '321322', 'Content-Type': 'text/html; charset=UTF-8', ... }

    print(status)  # noqa: T001
    # 200

    save_result = scraper.save('/tmp/my-data.json')  # noqa: S108
    if save_result.get('error', None):
        # a save error occurred...
        # add your code here...
        pass

    print(save_result)  # noqa: T001
    # {'file': '/tmp/my-data.json', 'size': 174449}
```
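Since the saved file is JSON when a selector is used, it can be read back with the standard `json` module. The sketch below simulates such a file instead of calling the API; the sample content is an assumption based on the example output above:

```python
import json
import os
import tempfile

# Simulate the list-of-matches content that a selector-based save
# would produce (assumed shape, based on the example output above).
sample = ['<button data-clipboard-text="Development Status :: 1 - Planning">\n']
path = os.path.join(tempfile.gettempdir(), 'my-data.json')
with open(path, 'w') as f:
    json.dump(sample, f)

# Read the matches back for further processing.
with open(path) as f:
    matches = json.load(f)
print(len(matches))  # 1
```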
The default timeout is `10` seconds. You can change this while
initializing the instance:

```python
scraper = Scraper(url, timeout=50)  # 50 seconds timeout...
```
You can also add custom request headers, prefixed with `X-`. The example below shows
how to add extra request headers and set the timeout:
```python
# pylint: disable=C0103

from scraper import Scraper

if __name__ == '__main__':
    url = 'https://pypi.org/classifiers/'
    scraper = Scraper(url)

    fetch_params = dict(country='EE', selector='ul li button[data-clipboard-text]')
    custom_headers = {
        'X-Referer': 'https://www.google.com',
        'X-User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0',
    }
    timeout = 30

    response = scraper.get(params=fetch_params, headers=custom_headers, timeout=timeout)

    if response.get('error', None):
        # response['error'] returns the error message
        # response['status'] returns the http status code
        # Example: {'error': 'Not Found', 'status': 404}
        print(response)  # noqa: T001
    else:
        data = response['result']['data']
        headers = response['result']['headers']
        url = response['result']['url']
        status = response['status']

        # print(data)  # noqa: T001
        # ['<button class="button button--small margin-top margin-bottom copy-tooltip copy-tooltip-w" ...\n', ]

        print(len(data))  # noqa: T001
        # 734

        print(headers)  # noqa: T001
        # {'Content-Length': '321322', 'Content-Type': 'text/html; charset=UTF-8', ... }

        print(status)  # noqa: T001
        # 200

        save_result = scraper.save('/tmp/my-data.json')  # noqa: S108
        if save_result.get('error', None):
            # a save error occurred...
            # add your code here...
            pass

        print(save_result)  # noqa: T001
        # {'file': '/tmp/my-data.json', 'size': 174449}
```
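The `X-` prefix suggests the service strips it before forwarding the header to the target site. As an illustration only (this transformation is an assumption, not documented behavior), such stripping could look like:

```python
def forwarded_headers(custom_headers):
    # Assumption: keep only 'X-'-prefixed keys and strip the prefix
    # before they are forwarded to the scraped site.
    return {k[len('X-'):]: v for k, v in custom_headers.items() if k.startswith('X-')}


print(forwarded_headers({'X-Referer': 'https://www.google.com', 'Accept': 'text/html'}))
# {'Referer': 'https://www.google.com'}
```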
License
This project is licensed under the MIT License.
Contributor(s)
- Prompt API - Creator, maintainer
Contribute

All PRs are welcome!

- [Fork](https://github.com/promptapi/scraper-py/fork) this repository
- Create your feature branch (`git checkout -b my-feature`)
- Commit your changes (`git commit -am 'Add awesome features...'`)
- Push the branch (`git push origin my-feature`)
- Then create a new Pull Request!
This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the code of conduct.
Project details
Source Distribution
Built Distribution
File details
Details for the file `pa-scraper-0.2.4.tar.gz`.
File metadata
- Download URL: pa-scraper-0.2.4.tar.gz
- Upload date:
- Size: 5.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/50.3.0 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.7.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a17576baab63dab65f5d360342f7abb49f8d0f83cc081ca31ccf6dd91cff8d08 |
| MD5 | f194a5e0da757c2a4f47e93058e8a907 |
| BLAKE2b-256 | f38e6f8ed17374e959d7d16a75575ca647d8cb57179d8e96841a14bcc89754c3 |
File details
Details for the file `pa_scraper-0.2.4-py3-none-any.whl`.
File metadata
- Download URL: pa_scraper-0.2.4-py3-none-any.whl
- Upload date:
- Size: 6.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/50.3.0 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.7.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 29e6f324b547d042c3664d883af9ef7f991d915b2dcc1a98898875ea4bb96490 |
| MD5 | a2830378e5d9aa771e0797d1000a602d |
| BLAKE2b-256 | 6c5b4e932b9798c688051cb2326f3771367078ec5afabda5fad0972e7e4a9f4f |