A simple web scraper

Project description

PyExtract

How to use

_request

  • Call the _request() function; it first attempts the request with the requests library and then falls back to selenium (a usage sketch follows this list).
  • Fill out these keywords: url: str, keyword: str, headers: dict = None, soup: bool = False, max_retry: int = 2, wait: int = 0
  • Explanation:
    • url: the request URL
    • keyword: a keyword that must appear in the page to confirm the right website was fetched; use '' to ignore the check
    • headers: request headers in dict form; use {} for no headers, or leave it unset for a basic default request header
    • soup: whether the result is returned as a soup object
    • max_retry: how many times the request is retried (both the normal and the selenium attempt) to get a response containing the keyword
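
A minimal usage sketch. The import path is an assumption: the page only names the package (pagecrawler) and the project (PyExtract), so adjust the import to match the actual module layout. The behavior of wait is not documented on this page.

    from pagecrawler import _request  # assumed import path

    # First tries the requests library, then selenium, retrying until
    # the response contains the keyword (up to max_retry times).
    page = _request(
        url="https://example.com",
        keyword="Example Domain",  # use '' to skip the keyword check
        headers={},                # {} = no headers; omit for the basic default header
        soup=True,                 # return the page as a soup object instead of raw text
        max_retry=2,
        wait=0,                    # undocumented here; presumably a delay between attempts
    )
    if page is not None:
        print(page.title.text)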

multi_request

  • Calls _request with multiprocessing.
  • The first argument is a list of lists of these 3 arguments: [url, keyword, headers] (the length of the list determines how many requests are made).
  • New argument: process: int = 1, which determines how many processes run at the same time.
  • The rest are the same as for _request, but apply to every request (see the sketch after this list).
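
A sketch of a parallel fetch, with the same caveat about the assumed import path; the exact shape of the return value is not documented here, so treat it as illustrative.

    from pagecrawler import multi_request  # assumed import path

    # One [url, keyword, headers] triple per request; the length of the
    # list determines how many requests are made.
    jobs = [
        ["https://example.com", "Example Domain", {}],
        ["https://example.org", "Example Domain", {}],
    ]

    # process sets how many processes run at the same time; the remaining
    # keywords match _request and apply to every request.
    results = multi_request(jobs, process=2, soup=True, max_retry=2, wait=0)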

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
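
If you only want to use the package, installing it from PyPI by name is presumably simpler than downloading these files directly:

    pip install pagecrawler==1.0.1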

Source Distribution

pagecrawler-1.0.1.tar.gz (16.6 kB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

pagecrawler-1.0.1-py3-none-any.whl (17.8 kB)

Uploaded Python 3

File details

Details for the file pagecrawler-1.0.1.tar.gz.

File metadata

  • Download URL: pagecrawler-1.0.1.tar.gz
  • Upload date:
  • Size: 16.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.3 CPython/3.11.1 Windows/10

File hashes

Hashes for pagecrawler-1.0.1.tar.gz
Algorithm    Hash digest
SHA256       c68acf0e74bb0660fb1dc0188c2a517a09428b91b58215ba880bde727814453d
MD5          7963297f72ab1084f3216e314145669e
BLAKE2b-256  be5e67173f3bc8099c616028839ff460354cd9d45f9db33c3524095d22e1dad5

See more details on using hashes here.
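
As an illustration, a downloaded sdist can be checked against the SHA256 digest listed above using only the standard library:

    import hashlib

    # Compare the local file's SHA256 against the digest from this page.
    with open("pagecrawler-1.0.1.tar.gz", "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()

    expected = "c68acf0e74bb0660fb1dc0188c2a517a09428b91b58215ba880bde727814453d"
    print("OK" if digest == expected else "hash mismatch")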

File details

Details for the file pagecrawler-1.0.1-py3-none-any.whl.

File metadata

  • Download URL: pagecrawler-1.0.1-py3-none-any.whl
  • Upload date:
  • Size: 17.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.3 CPython/3.11.1 Windows/10

File hashes

Hashes for pagecrawler-1.0.1-py3-none-any.whl
Algorithm    Hash digest
SHA256       9ce195537c473d1c85dbd08745092b7708bb43997ce8714daf5d2834293453e2
MD5          4bc26a35d14720aa0d3761683098f1de
BLAKE2b-256  cd0c8f8d406cda7f53ac591d0e94c54462ad7b98240944d0d99abfeb4c45041c

See more details on using hashes here.
