A package for getting data from the internet
Project description
This package includes modules for finding links in a webpage and its child pages.
In the main module, `find_links_by_extension`, links are found using two sub-modules and the results are combined:
1. Using Google search results (`get_links_using_Google_search`)

   Since we can specify which types of files we are looking for when searching in Google, this method scrapes the search results (a minimal sketch is given after this list). But this method is not complete:
   - Google search works based on crawlers, and sometimes they don’t index pages properly. For example, [this][1] webpage has three PDF files at the moment (Aug 7, 2018), but when we [use Google search][2] to find them, it returns only two, even though the files were uploaded four years ago.
   - It doesn’t work with some websites. For example, [this][3] webpage has three PDF files, but Google [cannot find any][4].
   - If many requests are sent in a short period of time, Google blocks access and asks for CAPTCHA solving.
2. Finding all URLs in the given page directly, following links that refer to child pages, and searching those recursively (`get_links_directly`)

   While this method does not miss any files in the pages it reaches (in contrast to method 1, which sometimes does), it may not find all the files (see the second sketch after this list), because:
   - Some webpages in the domain may be isolated, i.e. no link points to them from the parent pages. In these cases method 1 above works.
   - In rare cases, the link to a file of type xyz may not contain .xyz ([example][5]). In these cases method 2 cannot detect the file (because it relies only on the extension appearing in the link), but method 1 handles them correctly.
So the two methods fill each other’s gaps.
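For illustration, here is a minimal sketch of how the Google-search sub-module's query could be built, using the `site:` and `filetype:` operators (the same form as the search in [2] above). This is only an assumption about the approach, not the package's actual code; the parsing of the result page and the CAPTCHA limitation noted above are left out.

```python
import urllib.parse

import requests


def build_google_query(page_url, extension):
    """Return a Google search URL restricted to `page_url` and files of type `extension`."""
    query = "site:{} filetype:{}".format(page_url, extension)
    return "https://www.google.com/search?q=" + urllib.parse.quote_plus(query)


# Builds the same kind of query as reference [2] above.
search_url = build_google_query("http://www.midi.gouv.qc.ca/publications/en/planification/", "pdf")
response = requests.get(search_url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
# get_links_using_Google_search would then extract result links ending in ".pdf"
# from response.text; that parsing step is omitted in this sketch.
```

The main module can then combine these results with those of the direct method (e.g. as a set union), so that each method covers the other's blind spots.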
[1]: http://www.midi.gouv.qc.ca/publications/en/planification/
[2]: https://www.google.com/search?q=site%3Ahttp%3A%2F%2Fwww.midi.gouv.qc.ca%2Fpublications%2Fen%2Fplanification%2F+filetype%3Apdf
[3]: http://www.sfu.ca/~vvaezian/Summary/
[4]: https://www.google.com/search?q=site%3Ahttp%3A%2F%2Fwww.sfu.ca%2F~vvaezian%2FSummary%2F+filetype%3Apdf
[5]: http://www.sfu.ca/~robson/Random
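And here is a rough, self-contained sketch of the direct method in the spirit of `get_links_directly`: collect every link on the page that ends in the requested extension, and recurse into links that point to child pages of the starting URL. The parameter names and helper logic are assumptions for illustration, and the sketch relies on `requests` and `BeautifulSoup`, which may not be what the package itself uses.

```python
import urllib.parse

import requests
from bs4 import BeautifulSoup


def get_links_directly(page_url, extension, visited=None):
    """Recursively collect links under `page_url` that end in `extension`."""
    if visited is None:
        visited = set()
    if page_url in visited:
        return set()
    visited.add(page_url)

    found = set()
    try:
        html = requests.get(page_url, timeout=10).text
    except requests.RequestException:
        return found  # unreachable page: nothing to collect

    for tag in BeautifulSoup(html, "html.parser").find_all("a", href=True):
        link = urllib.parse.urljoin(page_url, tag["href"])
        if link.lower().endswith("." + extension.lower()):
            found.add(link)  # a file of the requested type
        elif link.startswith(page_url):
            # Looks like a child page of the starting URL: search it recursively.
            found |= get_links_directly(link, extension, visited)
    return found


print(get_links_directly("http://www.sfu.ca/~vvaezian/Summary/", "pdf"))
```

As noted above, this only finds a file whose URL actually ends in the extension; combining it with the Google-search results is what covers cases like [5].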
Hashes for web_scraper-1.0-py2-none-any.whl

Algorithm | Hash digest
---|---
SHA256 | 35f6600243771447ee726165cb8fd832ac4436b57ce7027fcf25cbb43da96686
MD5 | 58a1fdf6ce23d61e31242ced9d55c62d
BLAKE2b-256 | 2601e3d461199c9341b7d39061c14b1af914654d00769241503a87f77505f95f