Simple webscraper built on top of requests and beautifulsoup
Project description
A basic web scraper I use in many projects.
webscraper
Module to ease web scraping efforts
Supports
Cached web requests (Wrapper around requests)
Built-in parsing/scraping (Wrapper around beautifulsoup)
Constructor parameters
url: Default url, used if nothing else specified
scheme: Default scheme for scraping
timeout: Request timeout in seconds
cache_directory: Where to save cache files
cache_time: How long a cached resource is valid, in seconds (default: 7 minutes)
cache_use_advanced
auth_method: Authentication method (default: HTTPBasicAuth)
auth_username: Authentication username. If set, enables authentication
auth_password: Authentication password
handle_redirect: Allow redirects (default: True)
user_agent: User agent to use
default_user_agents_browser: Browser to set in user agent (from default_user_agents dict)
default_user_agents_os: Operating system to set in user agent (from default_user_agents dict)
user_agents_browser: Browser to set in user agent (Overwrites default_user_agents_browser)
user_agents_os: Operating system to set in user agent (Overwrites default_user_agents_os)
html2text: HTML2text settings
html_parser: What html parser to use (default: html.parser - built in)
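Drawing on the parameter list above, a full settings dict might look like the following sketch. The values are purely illustrative, and the construction call is commented out because it assumes WebScraper accepts such a dict, as in the Example:

```python
# Illustrative settings for WebScraper, assembled from the parameter
# list above. Keys mirror the documented constructor parameters;
# the values here are made-up placeholders.
settings = {
    'url': 'https://example.com/',      # default URL if none is given
    'timeout': 10,                      # request timeout in seconds
    'cache_directory': 'cache',         # where cache files are saved
    'cache_time': 7 * 60,               # cached resource valid for 7 minutes
    'auth_username': 'user',            # setting a username enables auth
    'auth_password': 'secret',
    'handle_redirect': True,            # allow redirects (the default)
    'html_parser': 'html.parser',       # built-in parser (the default)
}
# web = WebScraper(settings)  # assumed construction, as in the Example
```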
Example
# Setup WebScraper with caching
web = WebScraper({
'cache_directory': "cache",
'cache_time': 5*60
})
# First call to github.com -> hits the internet
web.get("https://github.com/")
# Second call to github.com (within 5 minutes of the first) -> hits the cache
web.get("https://github.com/")
Which results in the following output:
2016-01-07 19:22:00 DEBUG [WebScraper._getCached] From inet https://github.com
2016-01-07 19:22:00 INFO [requests.packages.urllib3.connectionpool] Starting new HTTPS connection (1): github.com
2016-01-07 19:22:01 DEBUG [requests.packages.urllib3.connectionpool] "GET / HTTP/1.1" 200 None
2016-01-07 19:22:01 DEBUG [WebScraper._getCached] From cache https://github.com
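The hit/miss behaviour shown in that log can be sketched in plain Python. This is a simplified illustration, not floscraper's actual implementation: the cache file name is derived by hashing the URL (note that hashlib requires byte input), and a file counts as a hit while it is younger than cache_time:

```python
import hashlib
import json
import time
from pathlib import Path


def cache_path(cache_directory, url):
    """Map a URL to a cache file; hashlib needs byte input, so encode first."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return Path(cache_directory) / f"{digest}.json"


def get_cached(url, fetch, cache_directory="cache", cache_time=7 * 60):
    """Return cached content if younger than cache_time, else re-fetch.

    `fetch` is any callable taking a URL and returning its content,
    standing in for the real HTTP request.
    """
    path = cache_path(cache_directory, url)
    path.parent.mkdir(parents=True, exist_ok=True)
    if path.exists() and time.time() - path.stat().st_mtime < cache_time:
        # Cache hit: resource is still valid, serve it from disk
        return json.loads(path.read_text())["content"]
    # Cache miss: fetch from the internet and store for next time
    content = fetch(url)
    path.write_text(json.dumps({"content": content}))
    return content
```

With a 5-minute cache_time this reproduces the pattern above: the first call goes to the internet, the second call within the window is served from disk.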
History
0.3.1 (2019-08-04)
Fix __init__
Hashlib needs byte input
Cache hit/miss info
0.3.0 (2019-08-03)
Upgrade flotils
Remove tzinfo (default is utc)
0.2.3 (2018-07-02)
Upgrade flotils
0.2.2 (2017-12-07)
Fix cache duration bug
0.2.1 (2017-11-03)
Add raw response to uncached response
0.2.0 (2017-10-12)
Rework api names
Redesign caching
0.1.15a0 (2016-03-08)
First release on PyPI.
Hashes for floscraper-0.3.1-py2.py3-none-any.whl

Algorithm | Hash digest
---|---
SHA256 | 921024c74707a08c879ede66141e72b11f1359304c1772058b47e326dd69e12f
MD5 | f23f8be2038f0550d4ed9f09ae9f2eff
BLAKE2b-256 | a8389d8cbd6c58595fa7d72e57fcdaad5d335e67460bdb1cf5de6fc3c44f2a94