Simple webscraper built on top of requests and beautifulsoup
Project description
Some basic webscraper I use in many projects.
webscraper
Module to ease web efforts
Supports
Cached web requests (Wrapper around requests)
Bultin parsing/scraping (Wrapper around beautifulsoup)
Constructor parameters
url: Default url, used if nothing else specified
scheme: Default scheme for scrapping
timeout
cache_directory: Where to save cache files
cache_time: How long is a cached resource vaild - in seconds (default: 7 minutes)
cache_use_advanced
auth_method: Authentication method (default: HTTPBasicAuth)
auth_username: Authentication username. If set, enables authentication
auth_password: Authentication password
handle_redirect: Allow redirects (default: True)
user_agent: User agent to use
default_user_agents_browser: Browser to set in user agent (from default_user_agents dict)
default_user_agents_os: Operating system to set in user agent (from default_user_agents dict)
user_agents_browser: Browser to set in user agent (Overwrites default_user_agents_browser)
user_agents_os: Operating system to set in user agent (Overwrites default_user_agents_os)
html2text: HTML2text settings
html_parser: What html parser to use (default: html.parser - built in)
Example
# Setup WebScraper with caching
web = WebScraper({
'cache_directory': "cache",
'cache_time': 5*60
})
# First call to git -> hit internet
web.get("https://github.com/")
# Second call to git (within 5 minutes of first) -> hit cache
web.get("https://github.com/")
Whitch results in the following output:
2016-01-07 19:22:00 DEBUG [WebScraper._getCached] From inet https://github.com 2016-01-07 19:22:00 INFO [requests.packages.urllib3.connectionpool] Starting new HTTPS connection (1): github.com 2016-01-07 19:22:01 DEBUG [requests.packages.urllib3.connectionpool] "GET / HTTP/1.1" 200 None 2016-01-07 19:22:01 DEBUG [WebScraper._getCached] From cache https://github.com
History
0.2.0 (2017-10-12)
Rework api names
Redesign caching
0.1.15a0 (2016-03-08)
First release on PyPI.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distributions
Hashes for floscraper-0.2.0-py2.py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 6eeb432a5c516777227ff97045d897cc4ad5bb9a4fb2b8258892f2ca3a84af1d |
|
MD5 | d8f422899a26cd0d2fe980a1b8471f00 |
|
BLAKE2b-256 | b3ab94b1f709e01ce735f81377c4491002ae47ae7f0cb4f93f16340e6801d543 |