Simple webscraper built on top of requests and beautifulsoup
Project description
A basic web scraper I use in many projects.
webscraper
Module to ease web efforts
Supports
Cached web requests (Wrapper around requests)
Built-in parsing/scraping (Wrapper around beautifulsoup)
Constructor parameters
url: Default url, used if nothing else specified
scheme: Default scheme for scraping
timeout
cache_directory: Where to save cache files
cache_time: How long a cached resource stays valid, in seconds (default: 7 minutes)
cache_use_advanced
auth_method: Authentication method (default: HTTPBasicAuth)
auth_username: Authentication username. If set, enables authentication
auth_password: Authentication password
handle_redirect: Allow redirects (default: True)
user_agent: User agent to use
default_user_agents_browser: Browser to set in user agent (from default_user_agents dict)
default_user_agents_os: Operating system to set in user agent (from default_user_agents dict)
user_agents_browser: Browser to set in user agent (Overrides default_user_agents_browser)
user_agents_os: Operating system to set in user agent (Overrides default_user_agents_os)
html2text: HTML2text settings
html_parser: What html parser to use (default: html.parser - built in)
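As a rough illustration of how several of these settings could map onto a requests call, here is a minimal sketch. The helper name build_request_kwargs is hypothetical and not part of floscraper, and the way the library handles these settings internally may differ. It relies on requests accepting a (username, password) tuple as shorthand for HTTP basic authentication.

```python
def build_request_kwargs(settings):
    """Translate scraper-style settings into keyword arguments for requests.get().

    Hypothetical helper for illustration; it is not part of floscraper.
    """
    kwargs = {
        # handle_redirect defaults to True, as in the parameter list
        "allow_redirects": settings.get("handle_redirect", True),
    }
    if settings.get("timeout") is not None:
        kwargs["timeout"] = settings["timeout"]
    if settings.get("user_agent"):
        kwargs["headers"] = {"User-Agent": settings["user_agent"]}
    # Authentication is only enabled once a username is set
    if settings.get("auth_username"):
        # requests treats a (user, password) tuple as HTTP basic auth
        kwargs["auth"] = (
            settings["auth_username"],
            settings.get("auth_password", ""),
        )
    return kwargs


kwargs = build_request_kwargs({
    "timeout": 10,
    "auth_username": "alice",
    "auth_password": "secret",
    "user_agent": "example-agent/1.0",
})
# The result could then be passed on, e.g. requests.get(url, **kwargs)
```

Leaving auth_username unset yields no "auth" entry, which mirrors the "If set, enables authentication" rule.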
Example
# Set up WebScraper with caching
web = WebScraper({
    'cache_directory': "cache",
    'cache_time': 5*60
})
# First call to GitHub -> hits the internet
web.get("https://github.com/")
# Second call to GitHub (within 5 minutes of the first) -> hits the cache
web.get("https://github.com/")
Which results in the following output:
2016-01-07 19:22:00 DEBUG [WebScraper._getCached] From inet https://github.com
2016-01-07 19:22:00 INFO [requests.packages.urllib3.connectionpool] Starting new HTTPS connection (1): github.com
2016-01-07 19:22:01 DEBUG [requests.packages.urllib3.connectionpool] "GET / HTTP/1.1" 200 None
2016-01-07 19:22:01 DEBUG [WebScraper._getCached] From cache https://github.com
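The miss-then-hit pattern in this output can be sketched with a small time-based file cache. This is a minimal illustration under assumptions, not floscraper's actual implementation: the cached_get and fake_fetch names are hypothetical, the cache key is assumed to be a SHA-256 hash of the URL, and freshness is judged by file modification time.

```python
import hashlib
import os
import tempfile
import time


def cached_get(url, fetch, cache_directory, cache_time=7 * 60):
    """Return (body, from_cache) for url, calling fetch(url) only on a miss.

    Hypothetical helper for illustration; fetch is any callable returning text.
    """
    # hashlib needs bytes, so the URL is encoded before hashing
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = os.path.join(cache_directory, key)

    # Hit: the file exists and is younger than cache_time seconds
    if os.path.exists(path) and time.time() - os.path.getmtime(path) < cache_time:
        with open(path, encoding="utf-8") as handle:
            return handle.read(), True

    # Miss: fetch, then store the body for later calls
    body = fetch(url)
    os.makedirs(cache_directory, exist_ok=True)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(body)
    return body, False


calls = []


def fake_fetch(url):
    # Stand-in for a real HTTP request so the sketch runs offline
    calls.append(url)
    return "<html>demo</html>"


cache_dir = tempfile.mkdtemp()
first = cached_get("https://github.com/", fake_fetch, cache_dir, cache_time=5 * 60)
second = cached_get("https://github.com/", fake_fetch, cache_dir, cache_time=5 * 60)
```

Here the first call invokes fake_fetch and writes the cache file, while the second call, made within cache_time, is served from disk.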
History
0.3.1 (2019-08-04)
Fix __init__
Hashlib needs byte input
Cache hit/miss info
0.3.0 (2019-08-03)
Upgrade flotils
Remove tzinfo (default is utc)
0.2.3 (2018-07-02)
Upgrade flotils
0.2.2 (2017-12-07)
Fix cache duration bug
0.2.1 (2017-11-03)
Add raw response to uncached response
0.2.0 (2017-10-12)
Rework api names
Redesign caching
0.1.15a0 (2016-03-08)
First release on PyPI.
Project details
Download files
File details
Details for the file floscraper-0.3.1.tar.gz.
File metadata
- Download URL: floscraper-0.3.1.tar.gz
- Upload date:
- Size: 11.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.32.2 CPython/3.7.3
File hashes
Algorithm | Hash digest
---|---
SHA256 | 3adbdfe7069305874d6ea9e36171cae9680a13a9d691ed2d837b964fe6cd65d0
MD5 | 66089c8e57077e516afa65fc9620259d
BLAKE2b-256 | 81f5bbeb81b0bed8aa965802a6da744ae27c2d375d1f6d0f3b3b70f54c0601e6
File details
Details for the file floscraper-0.3.1-py2.py3-none-any.whl.
File metadata
- Download URL: floscraper-0.3.1-py2.py3-none-any.whl
- Upload date:
- Size: 13.5 kB
- Tags: Python 2, Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.32.2 CPython/3.7.3
File hashes
Algorithm | Hash digest
---|---
SHA256 | 921024c74707a08c879ede66141e72b11f1359304c1772058b47e326dd69e12f
MD5 | f23f8be2038f0550d4ed9f09ae9f2eff
BLAKE2b-256 | a8389d8cbd6c58595fa7d72e57fcdaad5d335e67460bdb1cf5de6fc3c44f2a94
File details
Details for the file floscraper-0.3.1-py2.7.egg.
File metadata
- Download URL: floscraper-0.3.1-py2.7.egg
- Upload date:
- Size: 26.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.32.2 CPython/3.7.3
File hashes
Algorithm | Hash digest
---|---
SHA256 | dbd037e0c277007e6992129a1cdfa9cbc503160071729930d88df3dbe3b8eba6
MD5 | 1ee4f96d5d2d996233e4338572e023e3
BLAKE2b-256 | 85147d7c5919a4d94935c9fafca2b4342abb0e1441afccc9a778636e83334a81