Simple webscraper built on top of requests and beautifulsoup

Project description

A basic web scraper I use in many projects.

webscraper

Module to ease web scraping efforts

Supports

  • Cached web requests (Wrapper around requests)

  • Built-in parsing/scraping (Wrapper around beautifulsoup)

Constructor parameters

  • url: Default url, used if nothing else specified

  • scheme: Default scheme for scraping

  • timeout

  • cache_directory: Where to save cache files

  • cache_time: How long a cached resource stays valid, in seconds (default: 7 minutes)

  • cache_use_advanced

  • auth_method: Authentication method (default: HTTPBasicAuth)

  • auth_username: Authentication username. If set, enables authentication

  • auth_password: Authentication password

  • handle_redirect: Allow redirects (default: True)

  • user_agent: User agent to use

  • default_user_agents_browser: Browser to set in user agent (from default_user_agents dict)

  • default_user_agents_os: Operating system to set in user agent (from default_user_agents dict)

  • user_agents_browser: Browser to set in user agent (Overwrites default_user_agents_browser)

  • user_agents_os: Operating system to set in user agent (Overwrites default_user_agents_os)

  • html2text: HTML2text settings

  • html_parser: Which HTML parser to use (default: html.parser, Python's built-in parser)
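
As a rough sketch, several of the parameters above could be collected in one settings dictionary and handed to the constructor in the same way as the example that follows. The keys come from the list above; every value is an illustrative assumption, including the unit of timeout and the hypothetical credentials.

```python
# Illustrative settings only: the keys are taken from the parameter list,
# the values are assumptions chosen for demonstration.
settings = {
    "url": "https://example.com/",   # default URL used when none is given
    "timeout": 10,                   # assumed to be seconds
    "cache_directory": "cache",      # where cache files are stored
    "cache_time": 7 * 60,            # validity in seconds (7 minutes)
    "auth_username": "alice",        # hypothetical; setting it enables authentication
    "auth_password": "secret",       # hypothetical
    "handle_redirect": True,         # allow redirects
    "html_parser": "html.parser",    # Python's built-in parser
}

# With floscraper installed and WebScraper imported, this would be passed as:
# web = WebScraper(settings)
```

Keeping the settings in a plain dictionary makes it easy to share one configuration across several projects.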

Example

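# Assumes WebScraper has already been imported from the floscraper package
# (the import path is not shown here).

# Set up WebScraper with caching
web = WebScraper({
    'cache_directory': "cache",
    'cache_time': 5*60  # cache lifetime in seconds (5 minutes)
})

# First call to GitHub -> fetched from the internet
web.get("https://github.com/")

# Second call to GitHub (within 5 minutes of the first) -> served from the cache
web.get("https://github.com/")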

Which results in the following output:

2016-01-07 19:22:00 DEBUG   [WebScraper._getCached] From inet https://github.com
2016-01-07 19:22:00 INFO    [requests.packages.urllib3.connectionpool] Starting new HTTPS connection (1): github.com
2016-01-07 19:22:01 DEBUG   [requests.packages.urllib3.connectionpool] "GET / HTTP/1.1" 200 None
2016-01-07 19:22:01 DEBUG   [WebScraper._getCached] From cache https://github.com
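
The "From inet" and "From cache" lines above reflect time-based caching. The following is a minimal sketch of that idea, not floscraper's actual implementation: the function name get_cached, the hashed file naming, and the injected fetch callable are all assumptions made so the example runs offline.

```python
import hashlib
import os
import tempfile
import time


def get_cached(url, fetch, cache_directory, cache_time):
    """Return (body, source) for url, reusing a cache file younger than cache_time seconds."""
    os.makedirs(cache_directory, exist_ok=True)
    # Derive a filesystem-safe file name from the URL (hypothetical naming scheme).
    name = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = os.path.join(cache_directory, name)

    # Serve from the cache while the file is still fresh.
    if os.path.exists(path) and time.time() - os.path.getmtime(path) < cache_time:
        with open(path, "r", encoding="utf-8") as handle:
            return handle.read(), "cache"

    # Cache miss or expired entry: fetch and store the result.
    body = fetch(url)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(body)
    return body, "inet"


calls = []


def fake_fetch(url):
    # Stand-in for a real HTTP request so the sketch needs no network access.
    calls.append(url)
    return "<html>hello</html>"


with tempfile.TemporaryDirectory() as cache_dir:
    first = get_cached("https://github.com/", fake_fetch, cache_dir, 5 * 60)
    second = get_cached("https://github.com/", fake_fetch, cache_dir, 5 * 60)

print(first[1], second[1])  # inet cache
```

Passing the fetch function in as an argument keeps the caching logic separate from the HTTP layer, which also makes it simple to test.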

History

0.2.0 (2017-10-12)

  • Rework api names

  • Redesign caching

0.1.15a0 (2016-03-08)

  • First release on PyPI.


