
A package for getting data from the internet

Project description

This package includes modules for finding links in a webpage and its child pages.

In the main module find_links_by_extension, links are found using two sub-modules and the results are then combined:

  1. Using Google Search Results (get_links_using_Google_search)

Since we can specify which types of files we are looking for when searching on Google, this method scrapes those search results. But this method is incomplete:

  1. Google search works based on crawlers, and sometimes they don’t index properly. For example [this][1] webpage has three pdf files at the moment (Aug 7 2018), but when we [use Google search][2] to find them it finds only two, even though the files were uploaded four years ago.
  2. It doesn’t work with some websites. For example [this][3] webpage has three pdf files but Google [cannot find any][4] of them.
  3. If many requests are sent in a short period of time, Google blocks access and asks for CAPTCHA solving.
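The core of get_links_using_Google_search presumably builds a query using Google’s site: and filetype: operators, as in the example searches linked above. The following is only an illustrative sketch of that query construction (the function name build_google_query_url is mine, not the package’s), since actually scraping the result page is fragile for the reasons listed:

```python
from urllib.parse import quote_plus

def build_google_query_url(page_url, extension):
    """Build a Google search URL restricted to one site and one file type.

    Uses Google's `site:` and `filetype:` search operators.
    `page_url` and `extension` are combined into a single query string,
    e.g. "site:http://example.com/ filetype:pdf".
    """
    query = f"site:{page_url} filetype:{extension}"
    return "https://www.google.com/search?q=" + quote_plus(query)
```

For example, build_google_query_url("http://www.sfu.ca/~vvaezian/Summary/", "pdf") produces a URL of the same form as reference [4] above.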
  2. Using a direct method that finds all URLs in the given page, follows links that refer to child pages, and searches recursively (get_links_directly)

While this method does not miss any files on pages it reaches (in contrast to method 1, which sometimes does), it may not find all the files because:

  1. Some webpages in the domain may be isolated, i.e. there is no link to them in the parent pages. For these cases, method 1 above works.
  2. In rare cases the link to a file of type xyz may not contain .xyz ([example][5]). In these cases method 2 cannot detect the file (because it relies only on the extension appearing in the link), but method 1 handles these cases correctly.
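The extension-matching step at the heart of get_links_directly can be sketched with the standard library’s HTML parser (the class and function names below are mine, and the recursive descent into child pages is omitted). Note how the sketch exhibits the limitation in point 2 above: a file served from a URL without the extension in it is missed.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag in an HTML page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def links_with_extension(html, extension):
    """Return the links whose URL ends with the given extension.

    Matching is case-insensitive (".pdf" also matches ".PDF"), but a
    link whose URL lacks the extension entirely is not detected.
    """
    parser = LinkCollector()
    parser.feed(html)
    return [link for link in parser.links
            if link.lower().endswith("." + extension.lower())]
```

For instance, on a page containing links to a.pdf, b.html, and c.PDF, calling links_with_extension(page, "pdf") returns the two pdf links only.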

So the two methods cover each other’s gaps.
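Combining the two result lists, as find_links_by_extension does, amounts to a duplicate-free union. A minimal sketch (the function name combine_results is mine, not the package’s):

```python
def combine_results(google_links, direct_links):
    """Merge the two link lists, dropping duplicates but keeping order.

    Links found by either sub-module appear exactly once in the result.
    """
    seen = set()
    merged = []
    for link in google_links + direct_links:
        if link not in seen:
            seen.add(link)
            merged.append(link)
    return merged
```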

[1]: http://www.midi.gouv.qc.ca/publications/en/planification/
[2]: https://www.google.com/search?q=site%3Ahttp%3A%2F%2Fwww.midi.gouv.qc.ca%2Fpublications%2Fen%2Fplanification%2F+filetype%3Apdf
[3]: http://www.sfu.ca/~vvaezian/Summary/
[4]: https://www.google.com/search?q=site%3Ahttp%3A%2F%2Fwww.sfu.ca%2F~vvaezian%2FSummary%2F+filetype%3Apdf
[5]: http://www.sfu.ca/~robson/Random

Release history

1.0

Download files

Download the file for your platform.

Filename                              Size     File type  Python version  Upload date
web_scraper-1.0-py2-none-any.whl      10.8 kB  Wheel      py2             Aug 10, 2018
web_scraper-1.0.tar.gz                5.7 kB   Source     None            Aug 10, 2018
