
A simple web scraper

Project description

PyExtract

How to use

_request

  • call the _request() function; it will first try the request with the requests library and then with Selenium (see the sketch after this list)
  • fill out these keywords: url: str, keyword: str, headers: dict = None, soup: bool = False, max_retry: int = 2, wait: int = 0
  • Explanation:
    • url: the request URL
    • keyword: a keyword that must appear in the fetched page to confirm the right website was returned; use '' to skip the check
    • headers: request headers as a dict; use {} for no headers, or leave it empty to use a basic default header
    • soup: whether the response is returned as a BeautifulSoup object
    • max_retry: how often the request is retried (both the normal and the Selenium attempt) to get a response containing the keyword
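
A minimal usage sketch for a single request. The import path (pagecrawler) and the assumption that a BeautifulSoup object comes back when soup=True are taken from the description above, not confirmed elsewhere.

    # Sketch: fetch a page with _request(); the import path below is an
    # assumption based on the package name.
    from pagecrawler import _request

    # Require the word "Example" in the response before accepting it, keep
    # the default headers, and ask for a BeautifulSoup object back.
    page = _request(
        url="https://example.com",
        keyword="Example",   # use '' to skip the keyword check
        soup=True,           # return the page as a BeautifulSoup object
        max_retry=2,         # retry both the requests and the Selenium attempt
    )

    if page is not None:     # assumed failure signal; the package may differ
        print(page.title)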

multi_request

  • calls _request using multiprocessing
  • the first argument takes a list of lists with these 3 arguments: [url, keyword, headers] (the length of the list determines how many requests are made)
  • new argument: process: int = 1, which determines how many processes run at the same time
  • the remaining keywords are the same as for _request, but apply to every request (see the sketch after this list)
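
A sketch of a batch run with multi_request. The import path and the shape of the return value (assumed here to be a list of results in input order) are assumptions based on the description above.

    # Sketch: run several requests in parallel with multi_request();
    # the import path is an assumption based on the package name.
    from pagecrawler import multi_request

    # Each inner list is [url, keyword, headers]; three entries mean
    # three requests are made.
    targets = [
        ["https://example.com", "Example", {}],
        ["https://example.org", "Example", {}],
        ["https://example.net", "Example", {}],
    ]

    # Run up to two requests at the same time; the remaining keywords
    # (soup, max_retry, wait) apply to every request, as with _request().
    results = multi_request(targets, process=2, soup=True, max_retry=2)

    for page in results:     # assumed: a list of results in input order
        if page is not None:
            print(page.title)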

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pagecrawler-1.1.0.tar.gz (16.9 kB)


Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

pagecrawler-1.1.0-py3-none-any.whl (18.1 kB)


File details

Details for the file pagecrawler-1.1.0.tar.gz.

File metadata

  • Download URL: pagecrawler-1.1.0.tar.gz
  • Upload date:
  • Size: 16.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.3 CPython/3.11.1 Windows/10

File hashes

Hashes for pagecrawler-1.1.0.tar.gz
  • SHA256: 0fb95bfeb04a25f631c4015094e0dd8769abcb715c5ed5e5e61a9ab295255136
  • MD5: 00110f23109e878cdf11c108be77bc16
  • BLAKE2b-256: 916c74f016ef15c3e86c11e63785420c5ad0fde8ebf56c50f13c10501cd32a4e

See more details on using hashes here.
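
For example, the SHA256 digest of the downloaded sdist can be checked against the value listed above; a small sketch using only the standard library, assuming the file sits in the current directory:

    # Verify the downloaded sdist against the SHA256 hash published on this page.
    import hashlib

    expected = "0fb95bfeb04a25f631c4015094e0dd8769abcb715c5ed5e5e61a9ab295255136"

    with open("pagecrawler-1.1.0.tar.gz", "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()

    print("OK" if digest == expected else "hash mismatch")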

File details

Details for the file pagecrawler-1.1.0-py3-none-any.whl.

File metadata

  • Download URL: pagecrawler-1.1.0-py3-none-any.whl
  • Upload date:
  • Size: 18.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.3 CPython/3.11.1 Windows/10

File hashes

Hashes for pagecrawler-1.1.0-py3-none-any.whl
  • SHA256: 5792800857dfcb45a38a4d0826767957b330d233409b475ec0bd63f09becc62d
  • MD5: 9f203b1cf04a0bc62ba1c857e9f359f0
  • BLAKE2b-256: 6a5182405c8e44e3893f3a05c649fe1da002ea59e126d8df5155d3fd2883e4d8

See more details on using hashes here.
