Simple webscraper built on top of requests and beautifulsoup

Project description

A basic web scraper I use in many projects.

webscraper

Module to ease web scraping efforts

Supports

  • Cached web requests (Wrapper around requests)

  • Built-in parsing/scraping (Wrapper around beautifulsoup)

Constructor parameters

  • url: Default url, used if nothing else specified

  • scheme: Default scheme for scraping

  • timeout

  • cache_directory: Where to save cache files

  • cache_time: How long a cached resource is valid, in seconds (default: 7 minutes, i.e. 420)

  • cache_use_advanced

  • auth_method: Authentication method (default: HTTPBasicAuth)

  • auth_username: Authentication username. If set, enables authentication

  • auth_password: Authentication password

  • handle_redirect: Allow redirects (default: True)

  • user_agent: User agent to use

  • default_user_agents_browser: Browser to set in user agent (from default_user_agents dict)

  • default_user_agents_os: Operating system to set in user agent (from default_user_agents dict)

  • user_agents_browser: Browser to set in user agent (Overwrites default_user_agents_browser)

  • user_agents_os: Operating system to set in user agent (Overwrites default_user_agents_os)

  • html2text: HTML2text settings

  • html_parser: What html parser to use (default: html.parser - built in)
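
As a rough sketch of how several of these parameters might be combined, the settings below are collected in a plain dict, which is the form the constructor takes in the example on this page. Only the key names come from the list above. Every value is an illustrative assumption, including the unit of timeout, which the list does not state.

```python
# Illustrative settings only: key names are taken from the parameter list,
# all values are assumptions chosen for this sketch.
settings = {
    "url": "https://example.com/",   # hypothetical default url
    "timeout": 10,                   # unit not documented; seconds assumed
    "cache_directory": "cache",
    "cache_time": 7 * 60,            # 420 seconds, the documented default
    "auth_username": "user",         # setting a username enables authentication
    "auth_password": "secret",
    "handle_redirect": True,
    "html_parser": "html.parser",    # the documented built-in default
}

# The dict would then be handed to the constructor, as in the example section:
# web = WebScraper(settings)
```

Parameters left out of the dict would presumably fall back to their defaults.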

Example

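# Assumed import path; adjust if the package exposes WebScraper elsewhere
from floscraper import WebScraper

# Set up WebScraper with caching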
web = WebScraper({
    'cache_directory': "cache",
    'cache_time': 5*60
})

# First call to GitHub -> fetched from the internet
web.get("https://github.com/")

# Second call to GitHub (within 5 minutes of the first) -> served from the cache
web.get("https://github.com/")

Which results in the following output:

2016-01-07 19:22:00 DEBUG   [WebScraper._getCached] From inet https://github.com
2016-01-07 19:22:00 INFO    [requests.packages.urllib3.connectionpool] Starting new HTTPS connection (1): github.com
2016-01-07 19:22:01 DEBUG   [requests.packages.urllib3.connectionpool] "GET / HTTP/1.1" 200 None
2016-01-07 19:22:01 DEBUG   [WebScraper._getCached] From cache https://github.com
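
The "From inet" and "From cache" lines above reflect a time-based cache. The sketch below shows one way such a check could work using only the standard library. It is not the module's actual implementation: the names get_cached, cache_path and fake_fetch, the SHA-256 file naming, and the use of file modification time are all assumptions made for illustration.

```python
import hashlib
import os
import tempfile
import time


def cache_path(cache_directory, url):
    """Map a URL to a file in the cache directory (hypothetical naming scheme)."""
    name = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return os.path.join(cache_directory, name)


def get_cached(url, fetch, cache_directory, cache_time=7 * 60, now=None):
    """Return (body, source), where source is "cache" or "inet"."""
    now = time.time() if now is None else now
    path = cache_path(cache_directory, url)
    # A cached file counts as valid while it is younger than cache_time seconds.
    if os.path.exists(path) and now - os.path.getmtime(path) < cache_time:
        with open(path, "r", encoding="utf-8") as handle:
            return handle.read(), "cache"
    # Otherwise fetch a fresh copy and store it for later calls.
    body = fetch(url)
    os.makedirs(cache_directory, exist_ok=True)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(body)
    return body, "inet"


def fake_fetch(url):
    """Stand-in for a real HTTP request so the sketch runs offline."""
    return "<html>" + url + "</html>"


demo_dir = tempfile.mkdtemp()
first = get_cached("https://github.com/", fake_fetch, demo_dir, cache_time=5 * 60)
second = get_cached("https://github.com/", fake_fetch, demo_dir, cache_time=5 * 60)
print(first[1], second[1])
```

Passing the fetch function in as an argument keeps the freshness check separate from the HTTP layer, so the same logic could wrap a real request call.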

History

0.1.15a0 (2016-03-08)

  • First release on PyPI.
