A package for getting data from the internet
Project description
This package includes modules for finding links in a webpage and its child pages.
In the main module, find_links_by_extension, links are found using two sub-modules and the results are then combined:
1. Using Google Search Results (get_links_using_Google_search)

   Since we can specify which types of files we are looking for when searching Google, this method scrapes those search results. But this method is not complete:

   - Google Search works based on crawlers, and sometimes they don't index properly. For example, [this][1] webpage has three PDF files at the moment (Aug 7, 2018), but when we [use Google Search][2] to find them, it finds only two, although the files were uploaded 4 years ago.
   - It doesn't work with some websites. For example, [this][3] webpage has three PDF files, but Google [cannot find any][4].
   - If many requests are sent in a short period of time, Google blocks access and asks for CAPTCHA solving.
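As a rough sketch, the query that such a module might issue can be built like this (the helper name `build_google_query` is hypothetical, not part of this package's API; scraping of the results page is omitted here, since Google's markup changes frequently and, as noted above, heavy use triggers CAPTCHAs):

```python
from urllib.parse import quote_plus

def build_google_query(page_url, extension):
    # Restrict results to one site and one file type, e.g.
    # "site:http://example.com filetype:pdf", then URL-encode it.
    query = "site:{} filetype:{}".format(page_url, extension)
    return "https://www.google.com/search?q=" + quote_plus(query)
```

This produces the same kind of URL as reference [2] above.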
2. Using a direct method (get_links_directly): finding all URLs in the given page and, for links that refer to child pages, following them and searching recursively.

   While this method does not miss any files in the pages it reaches (in contrast to method 1, which sometimes does), it may not find all the files because:

   - Some webpages in the domain may be isolated, i.e. there is no link to them in the parent pages. In these cases method 1 above works.
   - In rare cases the link to a file of type xyz may not contain .xyz in the link ([example][5]). In these cases method 2 cannot detect the file (because it relies only on the extension appearing in the link), but method 1 detects it correctly.
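The extension-matching step of the direct method can be sketched with the standard library alone (the names `LinkCollector` and `find_links_by_ext` are illustrative, not this package's API; the recursive descent into child pages and the HTTP fetching are left out):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect the href targets of anchor tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def find_links_by_ext(html, base_url, extension):
    """Return absolute links in `html` whose path ends with the extension."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return [link for link in parser.links
            if link.lower().endswith("." + extension.lower())]
```

As the second caveat above notes, this approach misses any file whose URL lacks the extension.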
So the two methods fill each other's gaps.
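Combining the two result sets, as find_links_by_extension is described as doing, amounts to a duplicate-free union (the helper name `merge_results` is an assumption for illustration):

```python
def merge_results(google_links, direct_links):
    # Union the two lists, preserving order and dropping duplicates,
    # so a file found by both methods appears only once.
    seen = set()
    merged = []
    for link in google_links + direct_links:
        if link not in seen:
            seen.add(link)
            merged.append(link)
    return merged
```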
[1]: http://www.midi.gouv.qc.ca/publications/en/planification/
[2]: https://www.google.com/search?q=site%3Ahttp%3A%2F%2Fwww.midi.gouv.qc.ca%2Fpublications%2Fen%2Fplanification%2F+filetype%3Apdf
[3]: http://www.sfu.ca/~vvaezian/Summary/
[4]: https://www.google.com/search?q=site%3Ahttp%3A%2F%2Fwww.sfu.ca%2F~vvaezian%2FSummary%2F+filetype%3Apdf
[5]: http://www.sfu.ca/~robson/Random
File details
Details for the file web_scraper-1.0.tar.gz
File metadata
- Download URL: web_scraper-1.0.tar.gz
- Upload date:
- Size: 5.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.11.0 pkginfo/1.4.2 requests/2.19.1 setuptools/40.0.0 requests-toolbelt/0.8.0 tqdm/4.24.0 CPython/2.7.13
File hashes
Algorithm | Hash digest
---|---
SHA256 | ddb620311ebd618b3cee8ed6b08bf30f3813d710f9fef333852637152c00f702
MD5 | bce6fd352d18e6eff36f5d5bbad38b1e
BLAKE2b-256 | b445116acaa0e9242103e5c23cea4f368a5516d96386795994f9187b92015727
File details
Details for the file web_scraper-1.0-py2-none-any.whl
File metadata
- Download URL: web_scraper-1.0-py2-none-any.whl
- Upload date:
- Size: 10.8 kB
- Tags: Python 2
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.11.0 pkginfo/1.4.2 requests/2.19.1 setuptools/40.0.0 requests-toolbelt/0.8.0 tqdm/4.24.0 CPython/2.7.13
File hashes
Algorithm | Hash digest
---|---
SHA256 | 35f6600243771447ee726165cb8fd832ac4436b57ce7027fcf25cbb43da96686
MD5 | 58a1fdf6ce23d61e31242ced9d55c62d
BLAKE2b-256 | 2601e3d461199c9341b7d39061c14b1af914654d00769241503a87f77505f95f