
Simple webscraper built on top of requests and beautifulsoup

Project description

A basic web scraper I use in many projects.


webscraper

Module to simplify fetching and scraping web pages

Supports

  • Cached web requests (Wrapper around requests)

  • Built-in parsing/scraping (Wrapper around beautifulsoup)

Constructor parameters

  • url: Default url, used if nothing else specified

  • scheme: Default scheme for scraping

  • timeout

  • cache_directory: Where to save cache files

  • cache_time: How long a cached resource stays valid, in seconds (default: 7 minutes)

  • cache_use_advanced

  • auth_method: Authentication method (default: HTTPBasicAuth)

  • auth_username: Authentication username. If set, enables authentication

  • auth_password: Authentication password

  • handle_redirect: Allow redirects (default: True)

  • user_agent: User agent to use

  • default_user_agents_browser: Browser to set in user agent (from default_user_agents dict)

  • default_user_agents_os: Operating system to set in user agent (from default_user_agents dict)

  • user_agents_browser: Browser to set in user agent (Overwrites default_user_agents_browser)

  • user_agents_os: Operating system to set in user agent (Overwrites default_user_agents_os)

  • html2text: HTML2text settings

  • html_parser: What html parser to use (default: html.parser - built in)
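The user agent parameters above describe a precedence: user_agents_browser and user_agents_os overwrite their default_ counterparts. The following is a minimal sketch of how such a lookup could be resolved. It is not floscraper's actual code: the function name resolve_user_agent, the contents and layout of the DEFAULT_USER_AGENTS table, and the rule that an explicit user_agent wins over everything else are all assumptions made for illustration.

```python
# Hypothetical lookup table; the real default_user_agents dict in
# floscraper may have a different structure and different values.
DEFAULT_USER_AGENTS = {
    "browser": {
        "firefox": "Firefox/68.0",
        "chrome": "Chrome/76.0",
    },
    "os": {
        "linux": "X11; Linux x86_64",
        "windows": "Windows NT 10.0; Win64; x64",
    },
}


def resolve_user_agent(settings):
    """Build a user agent string from a settings dict (illustrative only)."""
    # Assumption: an explicit user_agent takes priority over the lookups.
    if settings.get("user_agent"):
        return settings["user_agent"]

    # The non-default keys overwrite the default_ keys, as the list states.
    browser = (settings.get("user_agents_browser")
               or settings.get("default_user_agents_browser"))
    os_name = (settings.get("user_agents_os")
               or settings.get("default_user_agents_os"))
    if browser is None or os_name is None:
        return None  # Nothing configured; let the HTTP library decide.

    return "Mozilla/5.0 ({}) {}".format(
        DEFAULT_USER_AGENTS["os"][os_name],
        DEFAULT_USER_AGENTS["browser"][browser],
    )


settings = {
    "default_user_agents_browser": "firefox",
    "default_user_agents_os": "linux",
    "user_agents_browser": "chrome",  # overwrites the default browser
}
print(resolve_user_agent(settings))
```

Only the browser is overridden here, so the operating system part still comes from default_user_agents_os.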

Example

# Setup WebScraper with caching
web = WebScraper({
    'cache_directory': "cache",
    'cache_time': 5*60
})

# First call to GitHub -> hits the internet
web.get("https://github.com/")

# Second call to GitHub (within 5 minutes of the first) -> served from cache
web.get("https://github.com/")

Which results in the following output:

2016-01-07 19:22:00 DEBUG   [WebScraper._getCached] From inet https://github.com
2016-01-07 19:22:00 INFO    [requests.packages.urllib3.connectionpool] Starting new HTTPS connection (1): github.com
2016-01-07 19:22:01 DEBUG   [requests.packages.urllib3.connectionpool] "GET / HTTP/1.1" 200 None
2016-01-07 19:22:01 DEBUG   [WebScraper._getCached] From cache https://github.com
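The "From inet" / "From cache" pattern in the log can be reproduced with a small time-based file cache. This is a sketch using only the standard library, not floscraper's implementation: the helper names cache_path, get_cached, and fake_fetch, the JSON file layout, and the SHA-256 file naming are assumptions. Only the cache_directory and cache_time names come from the parameter list above.

```python
import hashlib
import json
import os
import tempfile
import time


def cache_path(cache_directory, url):
    # hashlib needs bytes, so the URL is encoded before hashing.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return os.path.join(cache_directory, key + ".json")


def get_cached(url, fetch, cache_directory, cache_time, now=None):
    """Return (body, source), where source is "cache" or "inet"."""
    now = time.time() if now is None else now
    path = cache_path(cache_directory, url)

    # Cache hit: the file exists and is younger than cache_time seconds.
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as handle:
            entry = json.load(handle)
        if now - entry["saved_at"] < cache_time:
            return entry["body"], "cache"

    # Cache miss or stale entry: fetch again and overwrite the file.
    body = fetch(url)
    os.makedirs(cache_directory, exist_ok=True)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump({"saved_at": now, "body": body}, handle)
    return body, "inet"


def fake_fetch(url):
    # Stand-in for a real HTTP request so the sketch runs offline.
    return "<html>" + url + "</html>"


cache_dir = tempfile.mkdtemp()
first = get_cached("https://github.com/", fake_fetch, cache_dir, 5 * 60, now=0)
second = get_cached("https://github.com/", fake_fetch, cache_dir, 5 * 60, now=60)
third = get_cached("https://github.com/", fake_fetch, cache_dir, 5 * 60, now=400)
print(first[1], second[1], third[1])
```

The now argument exists only so the expiry can be demonstrated without waiting; the third call falls outside the 5 minute window and goes back to the fetch function.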

History

0.3.1 (2019-08-04)

  • Fix __init__

  • Hashlib needs byte input

  • Cache hit/miss info

0.3.0 (2019-08-03)

  • Upgrade flotils

  • Remove tzinfo (default is utc)

0.2.3 (2018-07-02)

  • Upgrade flotils

0.2.2 (2017-12-07)

  • Fix cache duration bug

0.2.1 (2017-11-03)

  • Add raw response to uncached response

0.2.0 (2017-10-12)

  • Rework api names

  • Redesign caching

0.1.15a0 (2016-03-08)

  • First release on PyPI.

