
A package for getting data from the internet

Project description

This package includes modules for finding links in a webpage and its children.

In the main module, find_links_by_extension finds links using two sub-modules and combines the results:

  1. Using Google Search Results (get_links_using_Google_search)

Since we can specify which types of files we are looking for when searching in Google, this method scrapes those search results. But this method is not complete:

  1. Google search relies on crawlers, and sometimes they don't index pages properly. For example, [this][1] webpage has three pdf files at the moment (Aug 7 2018), but when we [use Google search][2] to find them, it finds only two, even though the files were uploaded 4 years ago.

  2. It doesn't work with some websites. For example, [this][3] webpage has three pdf files, but Google [cannot find any][4].

  3. If many requests are sent in a short period of time, Google blocks access and asks for CAPTCHA solving.
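The Google-based method boils down to building a search query with the site: and filetype: operators (as in links [2] and [4] above) and scraping the result page. The helper below is only an illustrative sketch of the query-building step; the function name and details are assumptions, not the package's actual code:

```python
from urllib.parse import quote

def build_filetype_query(page_url, extension):
    """Build a Google search URL that restricts results to one site
    and one file type, e.g. 'site:example.com/docs/ filetype:pdf'.
    Illustrative only -- not the package's real implementation."""
    # Drop the scheme so the site: operator matches the host and path.
    bare = page_url.split("://", 1)[-1]
    query = "site:{} filetype:{}".format(bare, extension)
    # Percent-encode the query for use in a search URL.
    return "https://www.google.com/search?q=" + quote(query)

url = build_filetype_query("http://www.sfu.ca/~vvaezian/Summary/", "pdf")
```

Sending many such requests quickly triggers the CAPTCHA blocking described in point 3, so a real scraper would need to rate-limit itself.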

  2. Using a direct method of finding all URLs in the given page, following links that refer to child pages, and searching recursively (get_links_directly)

While this method does not miss any files in pages that it reaches (in contrast to method 1, which sometimes does), it may not find all the files because:

  1. Some webpages in the domain may be isolated, i.e. there is no link to them from the parent pages. For these cases method 1 above works.

  2. In rare cases the link to a file of type xyz may not contain .xyz ([example][5]). In these cases method 2 cannot detect the file (because it relies solely on the extension appearing in the link), but method 1 handles these cases correctly.

So the two methods fill each other's gaps.
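The core of the direct method can be sketched as follows. This is a simplified illustration of the technique, not the package's actual get_links_directly code: it collects href values from a page and keeps those ending in the target extension (the real module would also fetch child pages and recurse). Note that it exhibits exactly the limitation in point 2 above, missing files whose links lack the extension:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect all href attributes from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def links_with_extension(html, extension):
    """Return hrefs that end with the given extension (case-insensitive).
    A link to a pdf that has no '.pdf' in it will be missed."""
    parser = LinkCollector()
    parser.feed(html)
    suffix = "." + extension.lower()
    return [link for link in parser.links if link.lower().endswith(suffix)]

page = '<a href="a.pdf">A</a> <a href="b.html">B</a> <a href="c.PDF">C</a>'
pdfs = links_with_extension(page, "pdf")  # ['a.pdf', 'c.PDF']
```

Recursion would then repeat this on every collected link that points to a child page of the same domain, tracking visited URLs to avoid loops.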

[1]: http://www.midi.gouv.qc.ca/publications/en/planification/
[2]: https://www.google.com/search?q=site%3Ahttp%3A%2F%2Fwww.midi.gouv.qc.ca%2Fpublications%2Fen%2Fplanification%2F+filetype%3Apdf
[3]: http://www.sfu.ca/~vvaezian/Summary/
[4]: https://www.google.com/search?q=site%3Ahttp%3A%2F%2Fwww.sfu.ca%2F~vvaezian%2FSummary%2F+filetype%3Apdf
[5]: http://www.sfu.ca/~robson/Random

Project details


Release history

This version

1.0

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

web_scraper-1.0.tar.gz (5.7 kB)

Uploaded Source

Built Distribution

web_scraper-1.0-py2-none-any.whl (10.8 kB)

Uploaded Python 2

File details

Details for the file web_scraper-1.0.tar.gz.

File metadata

  • Download URL: web_scraper-1.0.tar.gz
  • Upload date:
  • Size: 5.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.11.0 pkginfo/1.4.2 requests/2.19.1 setuptools/40.0.0 requests-toolbelt/0.8.0 tqdm/4.24.0 CPython/2.7.13

File hashes

Hashes for web_scraper-1.0.tar.gz
Algorithm Hash digest
SHA256 ddb620311ebd618b3cee8ed6b08bf30f3813d710f9fef333852637152c00f702
MD5 bce6fd352d18e6eff36f5d5bbad38b1e
BLAKE2b-256 b445116acaa0e9242103e5c23cea4f368a5516d96386795994f9187b92015727

See more details on using hashes here.

File details

Details for the file web_scraper-1.0-py2-none-any.whl.

File metadata

  • Download URL: web_scraper-1.0-py2-none-any.whl
  • Upload date:
  • Size: 10.8 kB
  • Tags: Python 2
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.11.0 pkginfo/1.4.2 requests/2.19.1 setuptools/40.0.0 requests-toolbelt/0.8.0 tqdm/4.24.0 CPython/2.7.13

File hashes

Hashes for web_scraper-1.0-py2-none-any.whl
Algorithm Hash digest
SHA256 35f6600243771447ee726165cb8fd832ac4436b57ce7027fcf25cbb43da96686
MD5 58a1fdf6ce23d61e31242ced9d55c62d
BLAKE2b-256 2601e3d461199c9341b7d39061c14b1af914654d00769241503a87f77505f95f

See more details on using hashes here.
